http://anadoxin.org/blog

The Linux binfmt subsystem

Sat, 16 June 2012 :: #linux :: #kernel

The binfmt subsystem is a special mechanism which lets the kernel extend the set of executable binary formats it can recognize. It's invoked by the kernel when the user wants to execute a file with the executable (+x) flag set, and its purpose is to help the kernel understand the binary structure of the chosen file. This structure is crucial in determining which segments hold the program code and which hold the program data. Only with this knowledge can the file be interpreted in a proper (and deterministic) way.

Some history

On UNIX systems, the executable file format went through quite a big evolution, similar to what happened with the Windows file formats. One of the oldest executable formats on UNIXes was the a.out file, which was replaced by the more functional COFF format, which was, in turn, replaced by the ELF format. ELF got its name from Executable and Linkable Format (or, in a different time span, Extensible Linking Format), and it rules the executable format kingdom to this very day.

Additionally, the variety of executable formats becomes even more visible when you consider that the ELF structure differs slightly between the x86 and amd64 architectures. The system also needs to stay backward compatible when running old software; to be regarded as usable, it has to support both the old formats and the new ones. This is where the binfmt subsystem comes in: it's an easy way to let the kernel choose the proper binary interpreter for the chosen executable file.

Overview

The file execution process begins in the exec function, implemented in your local libc library. Every function from the exec family at some point enters the execve() function. The switch to the kernel is realized using the 0x80 interrupt (or the sysenter instruction, or sometimes syscall), which is the standard way of implementing a system service. The 0x80 interrupt, along with the number 11 in the command register (more on this later), switches the user-mode execution context in favor of the kernel-mode execution context (ring 0). When the command argument is 11, the interrupt is serviced by the sys_execve() function in the kernel, which resides inside the arch/x86/kernel/process.c file. So, to sum it up: after calling execve() inside libc, the code flow, during the invocation of interrupt 0x80, goes through a "gate" living inside the IDT, which tells the processor to start allowing more privileged instructions than the standard ring 3 ones (along with a bunch of other information), and jumps to the sys_execve() function in the kernel. (For completeness: if the int 80h instruction is replaced by the sysenter or syscall instruction, the IDT part of this transition is replaced by a read of special MSR registers, which supply the address the code flow should be relocated to.) The sys_execve() function does the boring job of copying the arguments from the ring 3 address space into the current ring 0 space, and invokes another function: do_execve(). This one is just a stopover for us, not doing anything of importance related to our subject. The crucial part is the invocation of do_execve_common(), which actually does something important, and interesting ;).
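
For illustration, here's a minimal sketch of the user-mode side of this path (the program name and arguments are of course just an example); both calls below end up in sys_execve():

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    char *argv[] = { "/bin/ls", NULL };
    char *envp[] = { NULL };

    /* The libc wrapper; on 32-bit x86 this boils down to int 0x80
     * (or sysenter) with 11 in the command register. */
    execve("/bin/ls", argv, envp);

    /* The raw syscall, skipping the exec family entirely. */
    syscall(SYS_execve, "/bin/ls", argv, envp);

    return 1; /* reached only if both calls failed */
}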

More or less, this function is designed to verify the security context of the process, and to make sure that only a few selected handles will be inherited by the new process, which at this point is yet to be created. It also initializes the memory management structures, which is of most importance to us. To perform this task, it invokes the search_binary_handler() function, which tries to find a proper binary file interpreter: one which knows how to walk through the structure of our executable file.

In the kernel, every binary interpreter is implemented in its own source file, like this:

fs/binfmt_aout.c,
fs/binfmt_elf.c,
fs/binfmt_script.c,
etc.

Every interpreter sticks to a few important rules which allow it to live inside the kernel ecosystem. One of the most important ones is to invoke the register_binfmt() function, which notifies the kernel about the existence of the interpreter. The function is passed a linux_binfmt structure, whose load_binary field contains a pointer to a callback which the kernel invokes when it wants to match the interpreter with a specified binary file. If the file structure is not what the interpreter understands (like trying to use the ELF interpreter on a PE executable), it returns the -ENOEXEC error to notify the binfmt subsystem about a bad match. The kernel will then try another interpreter from its pool of interpreters, until a final failure (no interpreter matches), or a success.
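
To make this concrete, here's a minimal sketch of what such a registration could look like. The names are mine, and the load_binary signature shown matches kernels from around this era (newer kernels dropped the pt_regs argument):

#include <linux/binfmts.h>
#include <linux/module.h>
#include <linux/errno.h>

/* Hypothetical interpreter that recognizes nothing; every file is
 * answered with -ENOEXEC, so the kernel moves on to the next handler. */
static int noop_load_binary(struct linux_binprm *bprm, struct pt_regs *regs)
{
    /* A real interpreter would inspect the first bytes of the file,
     * available in bprm->buf, looking for its magic number here. */
    return -ENOEXEC;
}

static struct linux_binfmt noop_format = {
    .module      = THIS_MODULE,
    .load_binary = noop_load_binary,
};

/* register_binfmt(&noop_format);   -- in the module's init function
 * unregister_binfmt(&noop_format); -- in the module's exit function */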

In other words: each built-in interpreter registers its load_binary callback with the kernel, adding itself to the kernel's pool of interpreters. Interpreters which live inside Loadable Kernel Modules do the same; it means that the kernel constantly maintains a special table of interpreters (the pool), and uses it to find an interpreter suitable for the current task. I thought it would be a nice idea to be able to peek through this table, to see which executable interpreters are allowed by my current operating system ;)

The Loadable Kernel Module (LKM)

Before we dive into the creation of our LKM, let's think about what the module should do. The draft plan has already been created, but we need to dig into some details to know which structure fields are important for us and need reading. For this purpose it would be a good idea to quickly go through search_binary_handler(), which could be called the core of the binfmt subsystem.

The function starts with a for loop after a block of sanity checks. The loop uses the list_for_each_entry macro, which is the standard (for the kernel) method of iterating over a doubly-linked list. The first argument (fmt) is the iterator; this is a variable which points to the current row of data. It's valid in the whole scope of the for expression, but not afterwards. The second argument, &formats, is a pointer to the list we want to iterate over. The third argument is a result of some peculiarities of the C language and the container_of macro: it's a way of telling the compiler how to properly cast, in a type-safe fashion, the value the iterator points to. More or less, it's not important at the moment; right now it's just "glue code" for us. The important thing is that this for loop iterates over the formats list, and allows the code inside its scope to access the member objects, which are of the linux_binfmt type. Every iteration reads the load_binary field, which is a pointer to a callback, invokes it, and checks the return value. So, in other words, the formats table holds information about all the currently registered binary file interpreters. By dumping the contents of this table we can get the list of binary formats that can be invoked on our system.
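
Heavily simplified, the core of search_binary_handler() boils down to something like this (the real function also juggles locking, module reference counts and retries):

static int search_binary_handler_simplified(struct linux_binprm *bprm,
                                            struct pt_regs *regs)
{
    struct linux_binfmt *fmt;

    list_for_each_entry(fmt, &formats, lh) {
        int retval = fmt->load_binary(bprm, regs);
        if (retval != -ENOEXEC)
            return retval;  /* a match (or a hard error): stop searching */
        /* -ENOEXEC means "not my format": try the next interpreter */
    }
    return -ENOEXEC;        /* no interpreter recognized the file */
}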

Let's write an LKM which, after insmod'ing, will find its way into the formats array and dump its contents to the user, but in a form which would be understandable without any additional postprocessing, so that the cat utility alone is enough to read the contents of the file. Unfortunately, there is a slight problem with that; the module can't simply display something on the screen. Such user interaction is beyond the abstraction level imposed by the module subsystem. The exchange of information between the LKM and the user is done in a different way; the best approach is to use a combination of a ring 3 application and a ring 0 module. A normal ring 3 application sends a question to the module; the module processes it, rendering the answer into a simple flat byte buffer, which is in turn sent back to the application as the response. The application can then display the buffer on the screen using printf(), a message box, or whatever method the user needs. The question-sending part can be resolved by using the procfs subsystem, which is the standard Linux way of passing information between ring 3 and ring 0 in a full-duplex fashion. We will get into that shortly, but for now it suffices to say that it's the basis of all of those /proc files, which can be read by simple tools like cat. The strings you observe while cat'ing one of the files from /proc are generated inside the kernel, sent through the standard I/O mechanisms to the ring 3 application, returned by standard I/O functions like read(), and finally displayed on the screen.

Nota bene, there is a shortcut we can use when we want to quickly send some information from the LKM to the user. Its name is printk and it's very similar to the standard printf. Two important details you should note: it normally only works on the first physical console (ctrl+alt+f1; though this can be configured later), and for it to work, you must first activate it by echo'ing 9 9 9 9 to /proc/sys/kernel/printk. These numbers control the printk log levels which the kernel passes to the console output. Messages with log levels higher than the ones specified in this control file are ignored and sent to /dev/null instead of the console (well, not exactly to /dev/null, but I think you get the idea). Sometimes, thanks to daemons like klogd, the contents of printk's buffer are stored inside system log files, like /var/log/kern.log, so you can also check if that's the case for your system. Working or not, this method is valid only for communication between ring 0 and the developer, and you shouldn't use it in production code. You can peek into my other post, Device Enumeration on PCI bus, to see how printk can be used. It's small and uncomplicated, so it shouldn't generate any further questions on this subject.
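
Usage is as simple as it gets; for example:

/* Whether this line reaches the console depends on the log levels
 * set in /proc/sys/kernel/printk. */
printk(KERN_INFO "my_module: hello from ring 0\n");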

OK, so here's the plan.

  1. Create the module skeleton, which compiles into a binary module file accepted by the insmod utility,
  2. Find our way into the formats array, and interpret its contents,
  3. Create our special /proc entry, which will be used to exchange information between ring 0 and ring 3.

Here's a short demonstration, which is friendlier to the eye than naked text, and captures the gist of what we're about to do:

Before I venture into the second point, I suggest taking a closer look at the third one. The characteristic traits of /proc filesystem handling can influence the method we will use to read the formats array. Not enough to completely change the approach, but different enough to be worth a dedicated description ;).

The /proc filesystem

Some time ago I already wrote about this subject, along with some information about process dumping on Linux systems from ring 3 (unfortunately, that blog post is available only in Polish, at least at this moment). We're going to take a look at a similar picture of the /proc filesystem interface, but this time we'll do it from the ring 0 perspective.

A driver that wishes to create a new file in the /proc directory has to call the create_proc_entry function. It takes three arguments: the name of the file we wish to create, the access rights to the file (encoded in the standard octal notation, e.g. 0666 for rw-rw-rw-, 0755 for rwxr-xr-x), and a pointer to the parent directory in the /proc filesystem. The last argument is only used when we want to create a file in a subdirectory of /proc; if we don't plan on having any subdirectories, it's safe to just pass NULL as the third argument. Our example will not use any directories, so we will use the function like this:

proc_kformats = create_proc_entry("kformats", 0755, NULL);

The global variable named proc_kformats, which is of type struct proc_dir_entry *, will contain a pointer to the new file descriptor, or NULL if there was some problem with the file creation (make sure you properly handle the faulty path, e.g. by returning -EFAULT). Before the system actually "shows" the file to the user, you still need to provide some more information. The kernel needs a list of callback functions, which are executed when a specific event is triggered in the system. These callbacks should handle the open(), read(), lseek(), and close() operations.

Here's an example. A standard use case for the cat command can look like the following: first, after executing a cat /proc/kformats command, the /proc/kformats file is opened by the open() function, which is implemented in your system libc. Then the current read pointer is set, just as if the lseek() function were invoked. Then cat reads the file using the read() function until it hits the end of file. After breaking out of the read loop, it closes the file with the close() function. This is of course a very simplified description of what the cat program does, but it is accurate enough for our example. So, by matching the callback list inside the LKM with the list of actions which are going to be executed by cat, you can instruct the kernel that when it gets an open() request on our file, it should invoke the my_driver_proc_file_open() function from our LKM's address space. This function should, of course, implement the "opening" logic for our file. When cat invokes lseek(), the kernel will use the my_driver_proc_file_lseek() function from our LKM, which should implement the logic of setting the offset in the input data source to a specified value (only if our data source supports such an operation).
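
Stripped of all error handling, the syscall sequence described above can be reproduced in a few lines (a sketch, not the real cat):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    ssize_t n;
    int fd = open("/proc/kformats", O_RDONLY);    /* 1. open */

    if (fd < 0)
        return 1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)  /* 2. read until EOF */
        write(STDOUT_FILENO, buf, n);
    close(fd);                                    /* 3. close */
    return 0;
}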

This "list of actions", or "list of callbacks", is a variable of the file_operations type, and its definition can look like this:

static struct file_operations kformats_file_ops = {
    .owner = THIS_MODULE,
    .open = kformats_open,
    .read = kformats_read,
    .llseek = kformats_lseek,
    .release = kformats_release
};

Make sure to remember that if you don't "bind" the file descriptor (which is a structure, and is what the create_proc_entry() function returned) to kformats_file_ops, no callback from this structure will be invoked. The "binding" part is very easy: you take the address of kformats_file_ops and put it into the proc_fops field of the structure returned by create_proc_entry(). After this operation the kernel will know how to route an I/O request on the /proc/kformats file into your module, then into your structure, and finally into your callback handler. You can resort to the source code to see how it's done in practice.

This approach could be described as a poor man's object-oriented programming, and it's very popular across the whole kernel, mostly because it suits the purpose very well. The file_operations structure, like many others, follows a consistent convention of assigning functions to structure fields. This means that the fields become callbacks, ready to be invoked in different places in the code. Object-oriented purists will probably scream and attack us by saying that of course this is not real object-oriented programming, but we can simply ignore them, because everyone who expects object orientation from the C language should be treated like that :).

The main problem everyone faces when implementing the read() function is the need to manually manage the data buffers which hold our data. In other words, when we have a structure we'd like to copy into memory accessible by a ring 3 application, we can't simply take the pointer to the beginning of the structure and copy sizeof(structure) bytes from it. That's because when implementing the read() function we are told (by the kernel) which part of the structure needs to be copied, and how many bytes we're allowed to copy. Sometimes the kernel tells us to copy the whole structure in one run, sometimes we have to split it into two parts, sometimes we copy it in chunks of 1 byte; it depends on the arguments we get from the kernel. Every call to read() should increment an internal counter (also implemented by our code) which points to the current position (the "cursor") in our data stream. When you consider functions like lseek() or ftell(), you'll probably figure out that they operate on this cursor position: the first one sets the cursor to a specified value, and the second one returns its current value. All of this might still be an easy thing to do, but sometimes interaction with the hardware (which is what LKMs were designed for, after all) can be tricky, and data caching can increase the complexity of the problem even further.
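
To see the scale of the problem, here's a sketch of what a hand-rolled read() callback has to do; g_text and g_text_len are hypothetical module-level variables holding an already rendered output buffer:

#include <linux/uaccess.h>   /* copy_to_user() */

static ssize_t kformats_read_manual(struct file *file, char __user *buf,
                                    size_t count, loff_t *ppos)
{
    size_t left;

    if (*ppos >= g_text_len)
        return 0;                   /* the cursor is at the end: EOF */

    left = g_text_len - *ppos;
    if (count > left)
        count = left;               /* never copy past the end */

    if (copy_to_user(buf, g_text + *ppos, count))
        return -EFAULT;             /* the ring 3 buffer was bad */

    *ppos += count;                 /* advance the "cursor" */
    return count;
}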

This is why the kernel lends us a helping hand by providing an abstraction level over the basic I/O functions. This level is implemented slightly above the file_operations structure, and it uses special callbacks which are to be used instead of the basic read(), write(), llseek() and close(). The mechanism uses a different set of functions which are somewhat similar in execution, but conceptually they operate on lines, instead of bytes like the basic mechanism. To stimulate your imagination, picture the putc() function, which writes 1 character to the screen, and the printf() function, which is of course more advanced. printf() will eventually call putc() at some point, but the truth is, we don't need to know how putc() works; we can print stuff on the screen by just knowing how printf() works. Same thing here: the seq subsystem I'm writing about operates on lines, so we only need to know how to produce the next line of the stream. When we feed this line into seq's "flow", it will implement a proper read() callback, so the kernel will get the whole line without us needing to take care of buffering issues, cursors, pointers, etc. That's plain convenience! Of course, at some point it'll be very useful to know the internals of the standard I/O mechanism, but if you want, you can postpone learning it (though the shorter the postponement, the better).

So, our declaration will change a little:

static struct file_operations kformats_file_ops = {
    .owner = THIS_MODULE,
    .open = kformats_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = seq_release
};

You can see that the read() function is implemented by seq_read() (which already exists and you don't need to change it), llseek() is replaced by seq_lseek(), and the same goes for release() and seq_release(). So, the only function which needs to be implemented in this structure is the body of the open() function. And even then, it's very likely that most of your open() functions will look like this:

static int kformats_open(struct inode *inode, struct file *file) {
    return seq_open(file, & kformats_seq_ops);
}

It's not complicated, but it's important. Its only role is the invocation of the seq_open() function, which registers this file reading session with the seq subsystem. The subsystem needs to be initialized with the seq_operations structure, which is passed as the second argument of seq_open(). Here's what this structure looks like:

static struct seq_operations kformats_seq_ops = {
    .start = seq_file_start,
    .next = seq_file_next,
    .stop = seq_file_stop,
    .show = seq_file_show
};

So what is the purpose of those fields? The .start callback gets called when seq wants to start the streaming process. .next will be used to notify your code that it needs to prepare itself to provide another line of data. .show will be called when the line is to be rendered, and finally .stop is the notification that the streaming process is finishing. So, in practice, in .start you initialize your structures, in .next you position the "cursor" on the next item to read, in .show you read the data from the "cursor's" position, and in .stop you deinitialize everything that you set up in the initialization step. Each of these callbacks receives, in one of its arguments, a pointer to the file descriptor being manipulated, so their implementations should use that information instead of any global variables you might have. Every one of these functions has to be reentrant, or at least thread-aware, so the best practice is to not use any shared data at all. Unfortunately, in our case, we have to use a reference to shared data (the formats array), so we will have to implement proper locking to make our implementation thread-aware, and the invocation of our module safe both for the system and for our module.

Now it's finally time to think about the process of getting into the formats array, from the point of view of seq's interface. The .start callback will initialize access to the structure by taking the lock guarding the array (so the structure won't change while we're walking over it). The .next callback will advance the "cursor" (the "iterator") forward by 1 position. The .show callback will read the element pointed to by the cursor (iterator), and will convert the binary data into a human-readable form (using printf-style formatting, for example). Finally, the .stop callback will have to release the guarding lock, because if we left it locked, it would hang the whole system (the formats array is used every time a process is created, after all). Before we enter the realm of implementation, I'd like to write about two more things which are important in our case. The first one is the method we'll use to find the formats table in the memory labyrinth, and the second one is how we actually handle that locking part, that is: how to fit into the multithreaded system, and find out the exact moment when we're able to safely read the structure, without risking invalid memory access exceptions.

Dynamic symbol localization — kallsyms

The kernel must have this subsystem compiled in. Luckily, it seems that most systems fulfill this prerequisite. It's very easy to check whether that's the case for your system: just check if the /proc/kallsyms file exists.

(2:1003)$ ls -la /proc/kallsyms
-r--r--r-- 1 root root 0 09-13 21:58 /proc/kallsyms

The contents of this file are the same (or should be the same) as the contents of the /boot/System.map file: basically a list of variable names, function names and their current addresses in memory. These addresses are generated automatically when the kernel is compiled, and their values depend on many things; I think it's safe to assume mostly on the optimization level used by the compiler, and on the number of things compiled into the kernel. These settings are so peculiar that it's safe to assume the kernel on every installation is different. To be more specific, most differences occur between distributions, not between versions of one distro, but even within the boundary of one distribution many versions of the kernel can show up over a span of time (thanks to automatic updates). And when you take into account distributions such as Gentoo or Sabayon, or distros on which the user compiles her own kernel, the "different kernel on every installation" assumption really makes sense. This means we simply can't hardcode the memory location of our variable as found on our system. If we did that, the module would probably work only on our system, and only until an automatic update replaces the kernel with a newer version. We need a more robust way of looking up variables in memory.
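
For example, you can check whether the formats symbol is visible at all (the address below is just an example from one machine and will be different on yours, which is exactly the point; also note that variables like this one are listed only when the kernel was built with CONFIG_KALLSYMS_ALL):

(2:1004)$ sudo grep ' formats$' /proc/kallsyms
ffffffff81a2b0c0 d formats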

The subsystem which is perfect for our case is called kallsyms. The main function we're going to use is named kallsyms_lookup_name(), and it's designed to look up the symbol name supplied in its first argument. The return value is the current address of this symbol in memory, so it's possible to dereference it and just use it (most of the time, not counting dynamically allocated objects). So, using kallsyms, looking up the address of the formats array is trivial:

	g_formats_list = (struct list_head *) kallsyms_lookup_name("formats");

The code above saves the address of the formats table in the g_formats_list variable, which is declared in our module. The type of g_formats_list is the standard doubly-linked list, used in various places across the whole kernel. You can look that information up on Google, because it goes slightly out of the scope of this post.

Synchronization

Writing a kernel module is similar to writing a plugin for a bigger system (not necessarily an operating system). Many aspects of the runtime environment in which our module works are based on an event-driven architecture, and there can be many reasons why our code gets executed. Sometimes it will be called from one thread, sometimes from another, and sometimes it'll be called from two (or more) threads at the same time. On a single-CPU machine, the first instance will be interrupted and put to sleep, so another instance can jump into its place and execute for a specified amount of time. Then control is reclaimed by the first instance, while the second one sleeps. On a multi-CPU machine, many copies of one function can run at the same time, and those copies will operate on the same data sets. It's best to assume that an unlimited number of threads, laid out on an unlimited set of CPUs, is running at the same time. So, where does that bring us?

To the same point that any other multithreaded application brings us. So, we simply use mutexes and spinlocks (along with some variations, like a lock that blocks the code only on write requests), atomic counters (atomic_t; atomic means "indivisible", or "uniform", not explosive!), but there are also more sophisticated synchronization methods, like RCU, read-copy update, where a write request copies the data into another memory location, updating the reference pointers where needed. This way the reading thread is not locked and can operate on the old data set, while every thread which accesses the data set after this point will operate on the new one.
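
As a taste of the flavor we'll need shortly, here's how a read-write spinlock is used (my_lock is a hypothetical lock created just for this illustration):

#include <linux/spinlock.h>

static DEFINE_RWLOCK(my_lock);

static void reader(void)
{
    read_lock(&my_lock);
    /* ... walk the shared structure; writers are kept out ... */
    read_unlock(&my_lock);
}

static void writer(void)
{
    write_lock(&my_lock);
    /* ... modify the structure; readers and other writers wait ... */
    write_unlock(&my_lock);
}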

Check this article if you want to learn something about the kernel's synchronization mechanisms. The article is slightly out of date, because the Big Kernel Lock it describes is no longer in the kernel, but it's still a valuable piece of information for everyone who wants to learn about the subject.

So where's the connection between all this and our project? Well, the point is: it's very important. Consider this example: you take the address of the formats table. Then you start to iterate over every element of this table, printing the address of each element on the screen. Very little can go wrong with this easy example, right? Nope, wrong, once you include another thread in the scenario. If, during the first thread's iteration phase, a second thread tries to remove an item, we can end up in a situation where the first thread accesses memory that has just been freed by the second thread. Of course, the first thread, when dereferencing the iterator, won't detect any errors, because the memory still exists and looks valid; but even one instruction later, the first thread's code can be preempted, or interrupted, and the second thread jumps in and performs its removal logic. Then the first thread resumes its execution, and this is where we are: the first thread is in a block of code that assumes the memory is valid (because it already did the checks), but the memory is not valid. Kernel panic ahead! So where's the problem and how do we fix it? The problem is that the second thread did the removal, but it didn't know that somebody else was reading the same data set at the same time. So, it would be good to have some kind of mechanism by which threads can broadcast information about the data set they're reading or writing, so other threads can wait until this access is finished. After waiting, they broadcast the read, or modification, signal themselves, so other threads won't start to read what they intend to modify. After removing, or writing, their part, they broadcast the signal that the data set is free to process again. This is called locking, or blocking, a data set (a structure, or anything that's a variable).

So, if you'd like to acquire a data set with exclusive access, you can use, for example, a mutex. The broadcasting part is called locking a structure, or entering a critical section. Many structures and variables in the kernel have their respective locking objects, which are used to lock the structure, so it's accessed by only one thread at a time (or, in the read-write case, written by only one thread at a time). It's no different in the case of our formats table; the locking object for this table is a read-write lock in the form of the binfmt_lock variable. I suggest you look it up in the kernel sources to see where it's located and learn the pattern, because, as I already wrote, it's popular among other kernel structures. To sum up: before you do anything with the formats table, you need to lock the binfmt_lock object according to your action. When you're finished with whatever you're doing with the formats table, just unlock it, so other threads can process the table, and you're done. I've also told you that many kernel structures have their own respective locking objects; but it's up to you to find them. There's no standard "slot" in a structure where synchronization objects are kept; you have to read the source code and figure out which ones to use by yourself.

The address of the binfmt_lock lock can be acquired in the same way as the formats table's address:

g_binfmt_lock = (rwlock_t *) kallsyms_lookup_name("binfmt_lock");

You can read-lock it by using the read_lock() function, and read-unlock it by using read_unlock().

Implementation

I think I've managed to describe most of the theoretical aspects of the approach I'll be using in the code: you should know what the binfmt subsystem is, and what we're about to do. You should also know the place that holds the data which is of most importance to us, and how to access that data. The last section described the method of reading the data, and the first one described how to present the data to the user.

So, let's go into some details.

First, we get the address of the formats table and of its synchronization lock, binfmt_lock. If the kallsyms subsystem is not compiled into the kernel, the module won't allow itself to be loaded, and the insmod program will return an error. But in the optimistic case (which is the majority), here's the code.

static int __init kefdump_init(void) {
    g_formats_list = (struct list_head *) kallsyms_lookup_name("formats");
    g_binfmt_lock = (rwlock_t *) kallsyms_lookup_name("binfmt_lock");

    if(! g_formats_list || ! g_binfmt_lock)
            return -EFAULT;

The module then creates a virtual file inside the /proc virtual filesystem:

proc_kformats = create_proc_entry("kformats", 0755, NULL);
if(! proc_kformats) {
    printk(KERN_ERR "can't create_proc_entry\n");
    return -ENOMEM;
}

The action list of the kformats file is defined by the kformats_file_ops structure. This structure uses the convention imposed by the seq subsystem, so the read, llseek and release fields point into the seq domain.

proc_kformats->proc_fops = & kformats_file_ops;

Initialization is complete, the module is loaded.

	TRACE0("kformats installed");
	return 0;
}

Now the user needs to read the /proc/kformats file, so we can analyze further what the module does. The cat /proc/kformats command should do the trick. The cat command first opens the file (using the fopen() function, for example, which in turn uses the open() syscall), then reads the data from it, and closes the file in the epilogue of its execution (the close() syscall). Let's compare that with our action structure:

static struct file_operations kformats_file_ops = {
    .owner = THIS_MODULE,
    .open = kformats_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = seq_release
};

So, first the kformats_open will be invoked, because cat opens the file.

static int kformats_open(struct inode *inode, struct file *file) {
    return seq_open(file, & kformats_seq_ops);
}

We initialize the seq subsystem here. This is the control structure for seq:

static struct seq_operations kformats_seq_ops = {
    .start = seq_file_start,
    .next = seq_file_next,
    .stop = seq_file_stop,
    .show = seq_file_show
};

What this means is that control of serving the /proc/kformats file is handed over to the seq subsystem, so we stick to its rules; then we only need to implement a simple loop-like construct to fully implement serving the file's contents. .start is the beginning of the loop, and it's where we initialize the data structure we want to print. In our case we will take the binfmt_lock here, so no other thread will be able to write-access the formats table, not even the operating system itself (remember, by writing a kernel module, you're writing the operating system itself). Read access will still be allowed for anyone. That's the advantage of using an rwlock instead of a conventional lock.

static void *seq_file_start(struct seq_file *s, loff_t *pos) {
    TRACE0("lock");
    read_lock(g_binfmt_lock);

    if(* pos >= 1)
        return 0;
    else
        return g_formats_list->next;
}

The function returns a pointer to the first item in the table. By the way: if the user requests a read from, say, the middle of the file, the function will return zero. This is because our module doesn't support starting the read from any element other than the first one.

The cat program, after opening the file, will start reading from it. This results in triggering the next and show actions, all the way down until the next function returns NULL, signalling the end of the data. This is our implementation of the next function. Its job is to increment the data iterator and fetch the pointer to the next element, and here's how it does this:

static void *seq_file_next(struct seq_file *s, void *iterator, loff_t *pos) {
	struct list_head *next = NULL;

	(* pos)++;
	next = ((struct list_head *) iterator)->next;

	if(next != g_formats_list) {
		return next;
	} else
		return NULL;
}

The formats table uses the standard doubly-linked list structure. The characteristic trait of this list is that it is circular: the last element is linked with the first one, so by incrementing the iterator while standing on the last element, we end up on the first element again. This is the reason for the expression above: if the next var is the same as g_formats_list, it means we're at the beginning again, so we have just walked past the end of the table and can break the loop. We return NULL to signal that we've reached the end of the data. When the kernel sees the NULL value, it won't call the next function anymore.

static int seq_file_show(struct seq_file *s, void *iterator) {
    struct linux_binfmt *fmt = NULL;
    char namebuf[512] = { 0 };

    fmt = container_of(iterator, struct linux_binfmt, lh);
    sprint_symbol(namebuf, (unsigned long) fmt->load_binary);
    seq_printf(s, "%s\n", namebuf);

    return 0;
}

So, nothing complicated here. sprint_symbol() is a function that is the logical mirror of kallsyms_lookup_name(); that is, it performs the opposite conversion: it converts the address of a symbol into the symbol's name. Of course, the address must be valid for this to work. The container_of macro is the method of dereferencing the iterator, and I encourage you to read about it.
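
One callback was not shown above: .stop. In our scheme its only duty is to release the read lock taken in seq_file_start; a sketch of it could look like this:

static void seq_file_stop(struct seq_file *s, void *iterator)
{
    TRACE0("unlock");
    read_unlock(g_binfmt_lock);
}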

That's about it. In case you'd like to share your thoughts or point out any errors, feel free to use the “comment” link below.

Source Codes (5kb)