The Linux binfmt subsystem
The binfmt subsystem is the mechanism that lets the kernel extend its ability to recognize different executable binary formats. It is invoked when the user executes a file carrying the executable (+x) flag, and its job is to help the kernel understand the binary structure of the chosen file. That structure determines which segments hold the program code and which hold the program data; only with this knowledge can the file be interpreted in a proper (and deterministic) way.
Some history
On UNIX systems, the executable file format went through quite an evolution, similar to what the Windows file formats did. One of the oldest executable formats on UNIXes was a.out, which was replaced by the more capable COFF format, which was in turn replaced by the ELF format. ELF got its name from Executable and Linkable Format (or, in a different time span, Extensible Linking Format), and it rules the executable-format kingdom to this very day.
The variety of executable formats becomes even more visible when you consider that the ELF structure differs slightly between the x86 and amd64 architectures. A system also has to deal with backward compatibility: it needs to support both old formats and new ones to be regarded as usable. This is where the binfmt subsystem comes in; it gives the kernel an easy way to choose the proper binary interpreter for a given executable file.
Overview
The file execution process begins in one of the exec functions implemented in your local libc. Every function from the exec family will, at some point, enter execve(). The switch to the kernel is realized by the 0x80 interrupt (or by the sysenter instruction, or sometimes syscall), which is the standard way of invoking a system service on x86. The 0x80 interrupt, together with the number 11 in the EAX register (the syscall number; more on this later), switches the user-mode execution context for the kernel-mode execution context (ring 0). When the syscall number is 11, the interrupt is serviced by the sys_execve() function in the kernel, which resides in the arch/x86/kernel/process.c file. To sum it up: after execve() is called inside libc, the code flow during the invocation of interrupt 0x80 goes through a "gate" living inside the IDT, which tells the processor to start allowing more privileged instructions than the standard ring 3 ones (along with a bunch of other information), and jumps to the sys_execve() function in the kernel. (For completeness: if the int 80h instruction is replaced by the sysenter or syscall instructions, the IDT part of this transition is replaced by a read of special MSR registers, which supply the address to which the code flow is relocated.) The sys_execve() function does the boring job of copying the arguments from the ring 3 address space into the current ring 0 space, and invokes another function: do_execve(). That one is just a waypoint for us, not doing anything of importance for our subject. The crucial part is the invocation of do_execve_common(), which actually does something important and interesting ;).
More or less, this function is designed to verify the security context of the process and make sure that only a few selected handles will be inherited by the new process, which at this point is yet to be created. It also initializes the memory management structures, which is of most importance to us. To perform this task, it invokes the search_binary_handler() function, which tries to find a proper binary file interpreter: one that knows how to walk through the structure of our executable file.
In the kernel, every binary interpreter is implemented in its own source file, like this:
fs/binfmt_aout.c,
fs/binfmt_elf.c,
fs/binfmt_script.c,
etc.
Every interpreter sticks to a few important rules that allow it to live inside the kernel ecosystem. One of the most important ones is to invoke the register_binfmt() function, which notifies the kernel about the existence of the interpreter. The function is passed a linux_binfmt structure, whose load_binary field contains a pointer to a callback that the kernel invokes when it wants to match the interpreter against a specified binary file. If the file structure is not what the interpreter understands (like trying to use the ELF interpreter on a PE executable), it returns the -ENOEXEC error, to notify the binfmt subsystem about a bad match. The kernel will then try another interpreter from its pool of interpreters, until a final failure (no interpreter matches) or a success.
In other words: each built-in interpreter registers a linux_binfmt structure with the kernel, to add itself to the kernel's pool of interpreters. Interpreters living inside Loadable Kernel Modules do the same thing; it means that the kernel constantly maintains a special table of interpreters (the pool), and uses it to find an interpreter suitable for the current task. I thought it would be a nice idea to be able to peek into this table, to see which executable interpreters are available on my current operating system ;)
The Loadable Kernel Module (LKM)
Before we dive into the creation of our LKM, let's think about what the module should do. The draft plan has already been created, but we need to dig up some details to know which structure fields are important to us and need reading, to get the information we're after. For this purpose it is a good idea to quickly go through search_binary_handler(), which could be called the core of the binfmt subsystem.
The function starts with a for loop after a block of sanity checks. The loop uses the list_for_each_entry macro, which is the standard (for the kernel) method of iterating over a doubly-linked list. The first argument (fmt) is the iterator; this is a variable which points to the current entry, and it's valid in the whole scope inside the for expression, but not afterwards. The second argument, &formats, is a pointer to the list we want to iterate over. The third argument is a result of some peculiarities of the C language and the container_of macro, and it's a way of telling the compiler how to cast, in a type-safe fashion, the value the iterator points to. More or less, it's not important at the moment; right now it's just "glue code" for us. The important thing is that this for loop iterates over the formats list, and allows the code inside its scope to access the member objects, which are of the linux_binfmt type. Every iteration reads the load_binary field, which is a pointer to a callback, invokes it, and checks the return value. So, in other words, formats is a list which holds information about all the currently registered binary file interpreters. By dumping its contents we can get the list of binary formats that can be invoked on our system.
Let's write an LKM which, after insmod'ing, will find its way into the formats list and dump its contents to the user, in a form understandable without any additional postprocessing, so that the cat utility alone is enough to read the contents of the file.
Unfortunately, there is a slight problem with that; the module can't simply display something on the screen. That level of user interaction is beyond the abstraction level imposed by the module subsystem. The exchange of information between the LKM and the user is done in a different way; the best option is to use a combination of a ring 3 application and a ring 0 module. A normal ring 3 application sends a question to the module, which the module processes, rendering the answer into a simple flat byte buffer, which is in turn sent back to the application as a response. The application can then display the buffer on the screen using printf(), a message box, or whatever method the user needs. The question-sending part can be handled by the procfs subsystem, which is the standard Linux way of exchanging information between ring 3 and ring 0 in a full-duplex fashion. We will get into that shortly, but for now it suffices to say that it's the basis of all those /proc files, which can be read by simple tools like cat. The strings you observe while cat'ing one of the files from /proc are generated inside the kernel, sent through the standard I/O mechanisms to the ring 3 application, returned by standard I/O functions like read(), and finally displayed on the screen.
Nota bene, there is a shortcut we can use when we want to quickly send some information from the LKM to the user. Its name is printk and it's very similar to the standard printf. Two important details you should note are that it normally only works on the first physical console (ctrl+alt+f1; though this can be configured later), and that for it to work, you must first activate it by echo'ing 9 9 9 9 into /proc/sys/kernel/printk. These numbers control the printk logs; they specify the log levels which the kernel should pass to the console output. Log levels higher than the ones specified in this control file are ignored and passed to /dev/null instead of the console (well, not exactly to /dev/null, but I think you get the idea). Sometimes, thanks to daemons like klogd, the contents of printk's buffer are stored inside system log files, like /var/log/kern.log, so you can also check if that's the case on your system. Working or not, this method is valid only for communication between ring 0 and the developer, and you shouldn't use it in production code. You can peek into my other post, Device Enumeration on the PCI bus, to see how printk can be used. It's small and uncomplicated, so it shouldn't raise any further questions on this subject.
OK, so here's the plan.
- Create the module skeleton, which is compiled into the binary module file, accepted by the insmod utility,
- Find our way into the formats array, and interpret its contents,
- Create our special /proc entry, which will be used to exchange information between ring 0 and ring 3.
Here's a short demonstration, which is more friendly to the eye than naked text, and catches the point of what we are to do:
Before I venture into the second point, I suggest taking a closer look at the third one. The characteristic traits of /proc filesystem handling influence the method we will use to read the formats list. Not enough to completely change the approach, but different enough to be worthy of a dedicated description ;).
The /proc filesystem
Some time ago I already wrote some stuff about this subject, along with some information about process dumping on Linux systems from ring 3 (unfortunately, that blog post is available only in Polish, at least at this moment). We're going to take a look at a similar picture of the /proc filesystem interface, but this time from the ring 0 perspective.
A driver that wishes to create a new file in the /proc directory has to call the create_proc_entry function. It takes three arguments: the name of the file we wish to create, the access rights to the file (encoded in the standard octal notation, 0666 for rw-rw-rw-, 0755 for rwxr-xr-x), and the pointer to the parent directory in the /proc filesystem. The last argument is only used when we want to create the file in a subdirectory of /proc; if we don't plan on having any subdirectories, it's safe to just pass NULL as the third argument. Our example will not use any directories, so we will use the function like this:
proc_kformats = create_proc_entry("kformats", 0755, NULL);
The global variable named proc_kformats, which is of type struct proc_dir_entry *, will contain a pointer to the new file descriptor, or NULL if there was some problem with file creation (make sure you properly handle the failure path, e.g. by returning -ENOMEM). Before the system actually "shows" the file to the user, you still need to provide some more information. The kernel needs a list of callback functions, which are executed when a specific event is triggered in the system. These callbacks should handle the open(), read(), lseek(), and close() operations.
Here's an example. A standard use case for the cat command can look like the following: first, after executing the cat /proc/kformats command, the /proc/kformats file is opened with the open() function, which is implemented in your system libc. Then the current read pointer is set, just as if by invoking the lseek() function. Then the cat command reads the file using the read() function until it hits the end of file. After breaking the read loop, it closes the file with the close() function. This is of course a very simplified description of what the cat program does, but it is accurate enough for our example. So, by matching the callback list inside the LKM to the list of actions which are going to be executed by cat, you can instruct the kernel that when it gets an open() request on our file, it should invoke the my_driver_proc_file_open() function from our LKM's address space. This function should, of course, implement the "opening" logic for our file. When cat invokes lseek(), the kernel will use the my_driver_proc_file_lseek() function from our LKM, which should implement the logic of setting the offset into the input data source to a specified value (only if our data source supports such an operation).
This "list of actions", or "list of callbacks", is a variable of the file_operations type, and its definition can look like this:
static struct file_operations kformats_file_ops = {
.owner = THIS_MODULE,
.open = kformats_open,
.read = kformats_read,
.llseek = kformats_lseek,
.release = kformats_release
};
Remember that if you don't "bind" the file descriptor (which is a structure, the one the create_proc_entry() function returned) to kformats_file_ops, no callback from this structure will be invoked. The "binding" part is very easy: take the address of kformats_file_ops and put it into the proc_fops field of the structure returned by create_proc_entry(). After this operation the kernel will know how to route an I/O request on the /proc/kformats file into your module, then into your structure, and finally into your callback handler. You can resort to the source code to see how it's done in practice.
This approach could be described as a poor man's object-oriented programming, and it's very popular across the whole kernel, mostly because it suits the purpose very well. The file_operations structure, like many others, uses a consistent convention of assigning functions to structure fields. This means that the fields become callbacks, ready to be invoked in different places in the code. Object-oriented purists will probably scream and attack us by saying that of course this isn't real object-oriented programming, but we can simply ignore them, because that's how everyone who expects object orientation from the C language should be treated :).
The main problem everyone faces when implementing the read() function is the need to manually manage the data buffers which hold our data. In other words, when we have a structure which we'd like to copy to memory accessible by a ring 3 application, we can't simply take the pointer to the beginning of the structure and copy sizeof(structure) bytes from it. That's because when implementing the read() function we are told (by the kernel) which part of the structure needs to be copied, and how many bytes we're allowed to copy. Sometimes the kernel tells us to copy the whole structure in one run, sometimes we have to split it into two parts, sometimes we copy it in chunks of 1 byte; it depends on the arguments we get from the kernel. Every call to read() should increment an internal counter (also implemented by our code) which points to the actual position (the "cursor") in our data stream. When you consider functions like lseek() or ftell(), you'll probably figure out that they operate on this cursor position: the first one sets the cursor to a specified value, and the second one returns its current value. All of this might be an easy thing to do, but sometimes interaction with the hardware (which is what LKMs were designed for, after all) can be tricky, and data caching can increase the complexity of the problem.
This is why the kernel lends us a helping hand by providing an abstraction level over the basic I/O functions. This level is implemented slightly above the file_operations structure and uses special callbacks in place of the basic read(), write(), llseek() and close(). The mechanism uses a different set of functions which are somewhat similar in execution, but conceptually they operate on lines instead of bytes like the basic mechanism does. To stimulate your imagination, picture the putc() function, which writes 1 character to the screen, and the printf() function, which is of course more advanced. printf() will eventually call putc() at some point, but the truth is, we don't need to know how putc() works; we can print stuff on the screen just by knowing how printf() works. Same thing here: the seq subsystem I'm writing about operates on lines, so we only need to know how to produce the next line of the stream. When we feed this line into seq's "flow", it implements a proper read() callback so that the kernel gets the whole line, without us needing to take care of buffering issues, cursors, pointers, etc. That's plain convenience! Of course, at some point it'll be very useful to know the internals of the standard I/O mechanism, but if you want, you can postpone learning it (though the shorter the postponement, the better).
So, our declaration will change a little:
static struct file_operations kformats_file_ops = {
.owner = THIS_MODULE,
.open = kformats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release
};
You can probably see that the read() function is implemented by seq_read() (which is already done, and you don't need to change it), llseek() is replaced by seq_lseek(), and the same goes for release() and seq_release(). So the only function which needs to be implemented in this structure is the body of open(). And even then, it's very likely that most of your open() functions will look like this:
static int kformats_open(struct inode *inode, struct file *file) {
return seq_open(file, & kformats_seq_ops);
}
It's not complicated, but it's important. Its only role is to invoke the seq_open() function, which registers this file-reading session with the seq subsystem. The subsystem needs to be initialized with a seq_operations structure, which is passed as the second argument of seq_open(). Here's what this structure looks like:
static struct seq_operations kformats_seq_ops = {
.start = seq_file_start,
.next = seq_file_next,
.stop = seq_file_stop,
.show = seq_file_show
};
So what is the purpose of those fields? The .start callback gets called when seq wants to start the streaming process. .next will be used to notify your code that it needs to prepare to provide another line of data. .show will be called when the line is to be rendered, and finally .stop is the notification of the end of the streaming process. So, in practice: in .start you initialize your structures, in .next you position the "cursor" on the next item to read, in .show you read the data from the cursor's position, and in .stop you deinitialize everything you set up in the initialization step. Each of these callbacks receives, in one of its arguments, a pointer to the file descriptor being manipulated, so their implementations should use that information instead of any global variables you might have. Every one of these functions has to be reentrant, or at least thread-aware, so the best practice is to not use any shared data at all. Unfortunately, in our case we have to use a reference to shared data (the formats list), so we will have to implement proper locking code to make our implementation thread-aware, and the invocation of our module safe both for the system and for our module.
Now it's finally time to think about the process of getting into the formats list from the seq interface's point of view. The .start callback will initialize access to the structure by taking the lock guarding the list (so the structure won't change while we're walking over it). The .next callback will advance the "cursor" (the iterator) by one position. The .show callback will read the element pointed to by the cursor and convert the binary data into a human-readable form (by using seq_printf(), for example). Finally, the .stop callback will have to release the guarding lock, because if we left it held, it would hang the whole system (the formats list is used every time a process is created, after all). Before we enter the realm of implementation, I'd like to write about two more things which are important in our case. The first one is the method we'll use to find the formats list in the memory labyrinth, and the second one is how we actually handle that locking part, that is: how to fit into the multithreaded system and find the exact moment when we're able to safely read the structure, without risking invalid memory accesses.
Dynamic symbol localization — kallsyms
The kernel must have this subsystem compiled in. Luckily, it seems that most systems fulfill this prerequisite. It's very easy to check whether that's the case on your system; just check if the /proc/kallsyms file exists.
(2:1003)$ ls -la /proc/kallsyms
-r--r--r-- 1 root root 0 09-13 21:58 /proc/kallsyms
The contents of this file are the same (or should be) as the contents of the /boot/System.map file: basically a list of variable names, function names, and their current addresses in memory. These addresses are generated automatically when compiling the kernel, and their values depend on many things; mostly, I think it's safe to assume, the optimization level the compiler uses and the set of features compiled into the kernel. These settings are so particular that it's safe to assume the kernel is different on every installation. To be more specific, most differences occur between distributions rather than between installations of one distro, but even within one distribution many kernel versions will exist over time (thanks to automatic updates). And when you take into account distributions such as Gentoo or Sabayon, or any setup where the user compiles her own kernel, the "different kernel on every installation" assumption really holds. This means we simply can't hard-code the memory location of the variable we find on our own system. If we did, the module would probably work only on our system, and only until an automatic update replaced the kernel with a newer version. We need a more robust way of looking up variables in memory.
The subsystem perfect for our case is called kallsyms. The main function we're going to use is named kallsyms_lookup_name(), and it looks up the symbol name supplied in its first argument. The return value is the current address of this symbol in memory, so it's possible to dereference it and just use it (most of the time, dynamic allocations aside). So, using kallsyms, looking up the address of the formats list is trivial:
g_formats_list = (struct list_head *) kallsyms_lookup_name("formats");
The code above saves the address of the formats list in the g_formats_list variable, which is declared in our module. The type of g_formats_list is the standard doubly-linked list head used in various places across the whole kernel. You can look that up on Google, because it goes slightly out of scope of the subject of this post.
Synchronization
Writing a kernel module is similar to writing a plugin for a bigger system (not necessarily an operating system). Many aspects of the runtime environment in which our module works are based on an event-driven architecture, and there can be many reasons why our code gets executed. Sometimes it will be called from one thread, sometimes from another, and sometimes it'll be called from two (or more) threads at the same time. On a single-CPU machine, the first instance will be interrupted and put to sleep, so another instance can take its place and execute for a specified amount of time; then control is reclaimed by the first instance while the second one sleeps. On a multi-CPU machine, many copies of one function can run at the same time, and those copies will operate on the same data sets. It's best to assume that an unlimited number of threads, laid out on an unlimited set of CPUs, is running at the same time. So, where does that bring us?
To the same point that any other multithreaded application brings us. So, we simply use mutexes and spinlocks (along with some variations, like locks that block only on write requests), atomic counters (atomic_t; atomic means "indivisible", or "uniform", not explosive!), but there are also more sophisticated synchronization methods, like RCU, read-copy update, where a write request copies the data into another memory location and updates the reference pointers where needed; this way the reading thread is not locked and can keep operating on the old data set, while every thread that accesses the data set after this point will operate on the new one.
Check this article if you want to learn something about kernel synchronization mechanisms. It is slightly out of date, because the Big Kernel Lock it describes is no longer in the kernel, but it's still a valuable piece of information for everyone who wants to learn about the subject.
So where's the common point between this and our project? Well, the point is: it's very important. Consider this example: you take the address of the formats list. Then you start to iterate over every element of it, printing each element's address on the screen. Very little can go wrong with this easy example, right? Nope, wrong, once you include another thread in this scenario. If, during the first thread's iteration, a second thread removes an item, we can end up in a situation where the first thread accesses memory that has just been freed by the second one. Of course the first thread, when dereferencing the iterator, won't detect any errors, because the memory still exists and looks valid; but even one instruction later the first thread's code can be preempted, or interrupted, and the second thread jumps in and does its removal logic. Then the first thread resumes its execution, and this is where we are: the first thread sits in a block of code that assumes the memory is valid (because it already did the checks), but the memory is not valid. Kernel panic ahead! So where's the problem and how do we fix it? The problem is that the second thread did the removal, but it didn't know that somebody else was reading the same data set at the same time. It would be good to have some mechanism by which threads can broadcast information about the data set they're reading or writing, so other threads can wait until this access is finished. After waiting, they broadcast the read or modification signal themselves, so other threads won't start reading what they intend to modify. After removing or writing their part, they broadcast that the data set is free to process again. This is called locking, or blocking, a data set (a structure, or anything that's a variable).
So, if you'd like to acquire a data set with exclusive access, you can use, for example, a mutex. The broadcasting part is called locking a structure, or entering a critical section. Many structures and variables in the kernel have their respective lock objects, which are used to lock the structure so it's modified by only one thread at a time. It's no different in the case of the formats list; the locking object for it is a read-write lock, in the form of the binfmt_lock variable. I suggest you look it up in the kernel sources to see where it's located and learn the pattern, because, as I already wrote, it's popular among other kernel structures. To sum things up: before you do anything with the formats list, you need to lock the binfmt_lock object in the mode appropriate to your action. When you're finished with whatever you're doing with the formats list, just unlock it so other threads can process the list, and you're done. I've also told you that many kernel structures have their own respective locking objects, but it's up to you to find them. There's no standard "slot" in a structure where synchronization objects go; you have to read the source code and figure out which ones to use by yourself.
The address of the binfmt_lock lock can be acquired the same way as the formats list's address:
g_binfmt_lock = (rwlock_t *) kallsyms_lookup_name("binfmt_lock");
You can read-lock it using the read_lock() function, and read-unlock it using read_unlock().
Implementation
I think I've managed to describe most of the theoretical aspects of the approach I'll be using in the code: you should know what the binfmt subsystem is, and what we're about to do. You should also know the place that holds the data of most importance to us, and how to access that data. The last section described the method of reading the data, and the first one described how to present it to the user.
So, let's go into some details.
Getting the address of the formats list and its synchronization lock, binfmt_lock: if the kallsyms subsystem is not compiled into the kernel, the module won't allow itself to be loaded, and the insmod program will return an error. In the optimistic case (which is the majority), here's the code.
static int __init kefdump_init(void) {
g_formats_list = (struct list_head *) kallsyms_lookup_name("formats");
g_binfmt_lock = (rwlock_t *) kallsyms_lookup_name("binfmt_lock");
if(! g_formats_list || ! g_binfmt_lock)
return -EFAULT;
The module creates a virtual file inside the /proc virtual filesystem:
proc_kformats = create_proc_entry("kformats", 0755, NULL);
if(! proc_kformats) {
printk(KERN_ERR "can't create_proc_entry\n");
return -ENOMEM;
}
The action list of the kformats file is defined by the kformats_file_ops structure. This structure uses the convention imposed by the seq subsystem, so the read, llseek and release fields point into the seq domain.
proc_kformats->proc_fops = & kformats_file_ops;
Initialization is complete, the module is loaded.
TRACE0("kformats installed");
return 0;
}
Now the user needs to read the /proc/kformats file, so we can analyze further what the module does. The cat /proc/kformats command should do the trick. The cat command first opens the file (by using the fopen() function, for example, which in turn uses the open() syscall), then it reads the data from it, and closes the file in the epilogue of its execution (the close() syscall). Let's match that against our action structure:
static struct file_operations kformats_file_ops = {
.owner = THIS_MODULE,
.open = kformats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release
};
So, first kformats_open will be invoked, because cat opens the file.
static int kformats_open(struct inode *inode, struct file *file) {
return seq_open(file, & kformats_seq_ops);
}
We initialize the seq subsystem here. This is the control structure for seq:
static struct seq_operations kformats_seq_ops = {
.start = seq_file_start,
.next = seq_file_next,
.stop = seq_file_stop,
.show = seq_file_show
};
What this means is that control of serving the /proc/kformats file is given to the seq subsystem, so we stick to its rules; then we only need to implement a simple loop-like construct to fully implement serving the file contents. .start is the beginning of the loop, and it initializes the data structure we want to print. In our case we will lock binfmt_lock here, so no other thread will be able to write-access the formats list, not even the operating system itself (remember, by writing a kernel module, you're writing the operating system itself). Read access will still be allowed for anyone; that's the advantage of using an rwlock instead of a conventional lock.
static void *seq_file_start(struct seq_file *s, loff_t *pos) {
TRACE0("lock");
read_lock(g_binfmt_lock);
if(* pos >= 1)
return 0;
else
return g_formats_list->next;
}
The function returns a pointer to the first item in the list. By the way: if the user requests a read from, e.g., the middle of the file, the function returns zero. This is because our module doesn't support starting the read from any element other than the first.
The cat program, after opening the file, starts reading from it. This results in triggering the next and show actions, all the way until the next function returns NULL, signalling the end of the data. Here is our implementation of the next function. Its job is to increment the data iterator and fetch the pointer to the next element, and here's how it does this:
static void *seq_file_next(struct seq_file *s, void *iterator, loff_t *pos) {
struct list_head *next = NULL;
(* pos)++;
next = ((struct list_head *) iterator)->next;
if(next != g_formats_list) {
return next;
} else
return NULL;
}
The formats list uses the standard doubly-linked list structure. Its characteristic part is that the last element is linked to the first one; so by incrementing the iterator while standing on the last element, we end up at the beginning again. This is the reason for the expression above: if the next variable is the same as g_formats_list, it means we're back at the head, so we've just walked past the end of the list and can break the loop. We return NULL to signal that we've reached the end of the data. When the kernel sees the NULL value, it won't call the next function anymore.
static int seq_file_show(struct seq_file *s, void *iterator) {
struct linux_binfmt *fmt = NULL;
char namebuf[512] = { 0 };
fmt = container_of(iterator, struct linux_binfmt, lh);
sprint_symbol(namebuf, (unsigned long) fmt->load_binary);
seq_printf(s, "%s\n", namebuf);
return 0;
}
So, nothing complicated here. sprint_symbol() is a function that is the logical mirror of kallsyms_lookup_name(); that is, it performs the opposite conversion, turning the address of a symbol into the symbol's name. Of course, the address must be valid for this to work. The container_of macro is the method of dereferencing the iterator, and I encourage you to read about it.
That's about it. In case you'd like to share your thoughts or point out any errors, feel free to use the “comment” link below.
Source Codes (5kb)