Walking around a buggy LLDB plugin API in C++

https://anadoxin.org/blog/buggy-lldb-api-cpp

Sun, 10 July 2022 :: #lldb :: #cpp

LLDB allows the user to write plugins in Python and C++. Because I'm a fan of static typing, I often find myself avoiding Python at all costs, and writing a plugin for LLDB was no exception. I'm not a big fan of C++ either, but the language has numerous advantages, so the choice was a no-brainer for me.

The problem is that, apparently, I'm in the minority. The LLDB C++ API seems to have a bug, and nobody has created a real fix for it. I reached out to the LLDB developers about the issue and they acknowledged it. I then created an LLDB fork with a proposed fix, but nobody followed up, so I guess fixing the C++ API is not on anyone's priority list.

Another problem with the C++ API is that the Python API lets you do more. For example, if you try to write custom renderers for some of your app's types, it quickly becomes apparent that this is possible only through the Python API. Fortunately I haven't needed to venture into this territory (yet), so it wasn't a big issue for me.

Nota bene: the LLVM codebase, while well-written, is a terrible project to work on in your free time. The sheer amount of code can slow down even the strongest machines, and recompiling after a trivial fix can take several minutes, even with 32 GB of RAM (I actually ended up buying more RAM to be able to compile this stuff). The linking phase for debug binaries could easily serve as a separate stress test in the Phoronix Test Suite, and the resulting binaries are multiple gigabytes in size. One trick to make this lighter is to build LLVM in release mode: the compiler spends more CPU time on optimizations, but the binaries take a lot less space than debug builds. As if that weren't enough, a few years ago the maintainers merged the LLDB repository into the big LLVM repo, so building LLDB yourself means dealing with the full LLVM codebase. Oh well. Fortunately, writing LLDB plugins doesn't require compiling LLDB itself.

Writing the plugin

To write a plugin in C++, we need two things:

LLDB's headers: on Linux they're often found in /usr/include/lldb (the SB*.h headers used below live in its API subdirectory); on macOS I'm not sure of the full path, so search for the LLDB.h and SBBreakpoint.h files and use that directory,
LLDB's library file: on Linux it's /usr/lib/liblldb.so, and on macOS it's /Library/Developer/CommandLineTools/Library/PrivateFrameworks/LLDB.framework/Versions/A/LLDB.

This is a template of the plugin you can use:

#include <SBCommandInterpreter.h>
#include <SBCommandReturnObject.h>
#include <SBDebugger.h>
#include <cstdio> // for printf()

#define API __attribute__((used))

namespace lldb {
    API bool PluginInitialize(lldb::SBDebugger debugger) {
        printf("hello from a plugin\n");
        return true;
    }
}

You can compile the plugin from the command line (although I recommend letting CMake or Meson drive the compilation; that way you can autodiscover the paths to LLDB's headers and pick the proper library name for linking, depending on the OS):

$ c++ -shared -fPIC main.cpp -o libmain.so -I /usr/include/lldb/API -llldb

And finally this is how you run it:

$ lldb -o 'plugin load libmain.so' -o 'q'
(lldb) plugin load libmain.so
hello from a plugin
(lldb) q

An empty plugin that does nothing isn't very useful, so let's register a new command with LLDB that we can use during debugging sessions.

#include <SBCommandInterpreter.h>
#include <SBCommandReturnObject.h>
#include <SBDebugger.h>

#define API __attribute__((used))

// The command implementation: LLDB calls DoExecute() every time the command is run.
class NyanCatCommand : public lldb::SBCommandPluginInterface {
public:
    bool DoExecute(lldb::SBDebugger debugger, char **commands, lldb::SBCommandReturnObject& result) override {
        result.Printf("hello from nyan cat\n");
        return true;
    }
};

namespace lldb {
    API bool PluginInitialize(lldb::SBDebugger debugger) {
        lldb::SBCommandInterpreter interpreter = debugger.GetCommandInterpreter();
        // Create a "custom" command group (a multiword command)...
        lldb::SBCommand group = interpreter.AddMultiwordCommand("custom", "my custom commands");
        // ...and register "nyancat" inside it, so "custom nyancat" runs NyanCatCommand::DoExecute().
        group.AddCommand("nyancat", new NyanCatCommand(), "a nyancat command");
        return true;
    }
}

Building and running it:

$ c++ -shared -fPIC main.cpp -o libmain.so -I /usr/include/lldb/API -llldb && lldb -o 'plugin load libmain.so' -o 'custom nyancat' -o 'q'
(lldb) plugin load libmain.so
(lldb) custom nyancat
hello from nyan cat
(lldb) q

So, we're adding a custom command group and putting a nyancat command into that group. As a result, after loading the plugin you can type custom nyancat, and LLDB will invoke the DoExecute() method of the NyanCatCommand class.
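So far DoExecute() has ignored its char **commands parameter. My understanding, which you should treat as an assumption, is that it's an argv-style array: null-terminated, holding the words typed after the command name, so custom nyancat foo bar would deliver foo and bar. Here's a small sketch of a helper (CollectArgs is a hypothetical name of my own, not part of LLDB) that copies those arguments into a std::vector<std::string>. It doesn't depend on LLDB, so it can be tested on its own:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of LLDB: copies an argv-style,
// null-terminated array (assumed to be what DoExecute() receives
// in `commands`) into a vector of strings.
std::vector<std::string> CollectArgs(char **commands) {
    std::vector<std::string> args;
    if (commands == nullptr) {
        return args; // defensive: treat a null array as "no arguments"
    }
    for (char **it = commands; *it != nullptr; ++it) {
        args.emplace_back(*it);
    }
    return args;
}
```

Inside DoExecute() you would call auto args = CollectArgs(commands); and check args.size() before using any of them.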

Now let's do something useful and evaluate an expression from the plugin: let's change our command so that, instead of printing "hello from nyan cat", it displays the current value of the RIP register.

class NyanCatCommand : public lldb::SBCommandPluginInterface {
public:
    bool DoExecute(lldb::SBDebugger debugger, char **commands, lldb::SBCommandReturnObject& result) override {
        // Walk from the selected target down to the selected frame of the selected thread.
        auto frame = debugger.GetSelectedTarget().GetProcess().GetSelectedThread().GetSelectedFrame();
        // Evaluate "$rip" in that frame and read the result as an integer.
        auto rip = frame.EvaluateExpression("$rip").GetValueAsUnsigned();
        result.Printf("RIP is: 0x%016lx\n", rip);
        result.SetStatus(lldb::ReturnStatus::eReturnStatusSuccessFinishNoResult);
        return true;
    }
};
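A portability note on the Printf() call: GetValueAsUnsigned() returns a uint64_t, which is unsigned long on x86_64 Linux (so %lx matches), but unsigned long long on macOS, where %lx is technically the wrong specifier and the compiler may warn about it. The PRIx64 macro from <cinttypes> expands to the right conversion on each platform. The sketch below uses a hypothetical FormatRip() function built on snprintf, so it can be checked without LLDB; inside the plugin, the same format string can be passed straight to result.Printf():

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative helper: formats a 64-bit register value like the plugin does
// ("RIP is: 0x" followed by 16 zero-padded hex digits), but uses PRIx64 so
// the conversion specifier matches uint64_t on every platform.
std::string FormatRip(uint64_t rip) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "RIP is: 0x%016" PRIx64 "\n", rip);
    return std::string(buf);
}
```

In the handler itself this becomes result.Printf("RIP is: 0x%016" PRIx64 "\n", rip); after adding #include <cinttypes>.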

In order to better present how the expression evaluation works, let's create a small test app:

#include <iostream>

int main() {
    printf("hello world\n");
    return 0;
}

Compile it with:

$ c++ -g testapp.cpp -o testapp

Let's load this testapp in lldb, load our plugin and check how our command will work:

$ lldb -o 'plugin load libmain.so' testapp
(lldb) target create "testapp"
Current executable set to '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64).
(lldb) plugin load libmain.so
(lldb) b main
Breakpoint 1: where = testapp`main + 4 at testapp.cpp:4:11, address = 0x000000000000115d
(lldb) r
Process 1429591 launched: '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64)
Process 1429591 stopped
* thread #1, name = 'testapp', stop reason = breakpoint 1.1
frame #0: 0x000055555555515d testapp`main at testapp.cpp:4:11
1    #include <iostream>
2
3    int main() {
-> 4     printf("hello world\n");
5        return 0;
6    }
(lldb) custom nyancat
RIP is: 0x000000005555515d
(lldb) n
hello world
Process 1429591 stopped
* thread #1, name = 'testapp', stop reason = step over
frame #0: 0x000055555555516c testapp`main at testapp.cpp:5:12
2
3    int main() {
4        printf("hello world\n");
-> 5     return 0;
6    }
(lldb) custom nyancat
RIP is: 0x000000005555516c
(lldb) q

Works OK! So where's the bug?

The bug

As always, the devil is in the edge cases. Let's modify our testapp a bit and add some threads.

#include <iostream>
#include <thread>

int main() {
    auto t1 = std::thread([] () {
        printf("thread 1\n");
    });

    auto t2 = std::thread([] () {
        printf("thread 2\n"); // line 10
    });

    t1.join();
    t2.join();
    return 0;
}

Let's set a breakpoint on line 10 and poke around the session a bit:

$ lldb -o 'plugin load libmain.so' testapp
(lldb) target create "testapp"
Current executable set to '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64).
(lldb) plugin load libmain.so
(lldb) b testapp.cpp:10
Breakpoint 1: where = testapp`operator() + 12 at testapp.cpp:10:15, address = 0x0000000000001204
(lldb) r
Process 1438230 launched: '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64)
thread 1
Process 1438230 stopped
* thread #2, name = 'testapp', stop reason = breakpoint 1.1
frame #0: 0x0000555555555204 testapp`operator(__closure=0x000055555556b008) at testapp.cpp:10:15
7        });
8
9        auto t2 = std::thread([] () {
-> 10           printf("thread 2\n");
11       });
12
13       t1.join();
(lldb) custom nyancat
RIP is: 0x0000555555555204
(lldb) p/x $rip
(unsigned long) $1 = 0x0000555555555204
(lldb)

So far our command works exactly as expected: both custom nyancat and p/x $rip display the same value of the RIP register.

However, LLDB also lets you attach commands that run automatically when a breakpoint is hit. Let's try this now:

$ lldb -o 'plugin load libmain.so' testapp
(lldb) target create "testapp"
Current executable set to '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64).
(lldb) plugin load libmain.so
(lldb) br set -y testapp.cpp:10 -C "p/x $rip" -C "custom nyancat"
Breakpoint 1: where = testapp`operator() + 12 at testapp.cpp:10:15, address = 0x0000000000001204
(lldb) r
Process 1438808 launched: '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64)
thread 1
(lldb)  p/x $rip
(unsigned long) $0 = 0x0000555555555204
(lldb)  custom nyancat
RIP is: 0x00007ffff7889119
Process 1438808 stopped
* thread #2, name = 'testapp', stop reason = breakpoint 1.1
frame #0: 0x0000555555555204 testapp`operator(__closure=0x000055555556b008) at testapp.cpp:10:15
7        });
8
9        auto t2 = std::thread([] () {
-> 10           printf("thread 2\n");
11       });
12
13       t1.join();
(lldb) p/x $rip
(unsigned long) $2 = 0x0000555555555204
(lldb)

As you can see, our custom nyancat command has printed a completely different RIP value than p/x $rip: it has printed 0x00007ffff7889119, while the actual value of the rip register is 0x0000555555555204. What's happening?

I'll simply give you the answer. Our testapp has three threads (main, thread 1 and thread 2). We've stopped in thread 2, but LLDB hands our plugin's DoExecute() function the context of the first available thread instead. So the fault doesn't seem to lie with our plugin, but with LLDB itself. This is easy to verify:

(lldb) thread list
Process 1438808 stopped
thread #1: tid = 1438808, 0x00007ffff7889119 libc.so.6`___lldb_unnamed_symbol3451 + 201, name = 'testapp'
* thread #2: tid = 1438828, 0x0000555555555204 testapp`operator(__closure=0x000055555556b008) at testapp.cpp:10:15, name = 'testapp', stop reason = breakpoint 1.1
(lldb) thread select 1
* thread #1, name = 'testapp'
frame #0: 0x00007ffff7889119 libc.so.6`___lldb_unnamed_symbol3451 + 201
libc.so.6`___lldb_unnamed_symbol3451:
->  0x7ffff7889119 <+201>: mov    edi, ebx
0x7ffff788911b <+203>: mov    qword ptr [rsp + 0x8], rax
0x7ffff7889120 <+208>: call   0x7ffff7888a40            ; ___lldb_unnamed_symbol3440
0x7ffff7889125 <+213>: mov    rax, qword ptr [rsp + 0x8]
(lldb) p/x $rip
(unsigned long) $3 = 0x00007ffff7889119

After switching to the first available thread, we can verify that its rip value is indeed 0x00007ffff7889119, the same value our custom nyancat command displayed, even though the command was supposedly working in the context of thread 2. This means the culprit is the GetSelectedThread() function, which returns the wrong thread.

Workaround

So, does the Python API have this problem? Yes and no. If we call EvaluateExpression() from Python the same way we did in our C++ plugin (shown here in its C++ form):

auto tgt = debugger.GetSelectedTarget().GetProcess().GetSelectedThread().GetSelectedFrame();
auto rip = tgt.EvaluateExpression("$rip").GetValueAsUnsigned();

then it will have the same problem. But in the Python API, the command function receives an additional argument, exe_ctx, of type SBExecutionContext. Getting the current frame from exe_ctx and calling EvaluateExpression() on that frame avoids the bug.

But the C++ API doesn't give us the luxury of this additional parameter. As I mentioned earlier, my fork adds this argument, and it seems to be a valid solution to the problem (or at least it's similar to how the problem is handled in the Python API).

But since it's pretty unlikely that you'll be using my fork (compiling your own LLDB is not trivial, and the result is pretty clunky to use), here's another workaround. It's not very pretty, but it seems to work.

The workaround relies on the fact that LLDB's built-in commands internally use different structures than the ones passed to the plugin, and built-in commands (like print or thread) work fine even when run from breakpoint callbacks. So we can use a built-in command to get the proper thread ID instead of relying on GetSelectedThread(). One command that helps here is thread info, which prints information about the current thread. We can capture its output in an SBCommandReturnObject, parse it to extract the thread ID, and then call GetThreadByID() instead of GetSelectedThread(). That's it.

Here is the "working" handler:

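#include <cstdlib>
#include <regex>
#include <string>

class NyanCatCommand : public lldb::SBCommandPluginInterface {
public:
    bool DoExecute(lldb::SBDebugger debugger, char **commands, lldb::SBCommandReturnObject& result) override {
        auto interp = debugger.GetCommandInterpreter();

        // Run the built-in "thread info" command and capture its output;
        // unlike GetSelectedThread(), it reports the thread that actually stopped.
        lldb::SBCommandReturnObject commandResult;
        interp.HandleCommand("thread info", commandResult, false);

        // Expected output: "thread #2: tid = 1438828, 0x0000555555555204 ..."
        static const std::regex tid("^thread .*?: tid = (.*?), .*");
        std::smatch m;
        const char *output = commandResult.GetOutput();
        std::string s = output ? output : "";

        if (!commandResult.Succeeded() || !std::regex_search(s, m, tid)) {
            result.SetError("could not determine the current thread ID");
            return false;
        }

        // lldb::tid_t is 64-bit, so parse with strtoull; base 0 also accepts
        // 0x-prefixed hex IDs in case the platform prints them that way.
        const auto threadIdStr = m[1].str();
        lldb::tid_t threadId = std::strtoull(threadIdStr.c_str(), nullptr, 0);
        auto frame = debugger.GetSelectedTarget().GetProcess().GetThreadByID(threadId).GetSelectedFrame();
        auto rip = frame.EvaluateExpression("$rip").GetValueAsUnsigned();

        result.Printf("RIP is: 0x%016lx\n", rip);
        result.SetStatus(lldb::ReturnStatus::eReturnStatusSuccessFinishNoResult);
        return true;
    }
};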
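The fragile part of this handler is the text parsing, since it depends on the exact wording of the thread info output. One way to make that part testable is to pull it into a free function that doesn't touch LLDB at all. The sketch below is a hypothetical helper (ParseThreadId is my own name, not an LLDB API). It narrows the capture to a decimal or 0x-prefixed hex number, parses it with base 0, and returns std::nullopt instead of a bogus ID when the text doesn't match. Accepting hex rests on an assumption: I believe LLDB prints thread IDs in hex on some platforms (e.g. macOS), but I haven't verified it. The return type is uint64_t, which is what lldb::tid_t is defined as:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (not an LLDB API): extracts the thread ID from the
// first line of `thread info` output, e.g.
//   "thread #2: tid = 1438828, 0x0000555555555204 testapp`operator() ..."
// Accepts both decimal and 0x-prefixed hex IDs. Requires C++17.
std::optional<uint64_t> ParseThreadId(const std::string &threadInfo) {
    static const std::regex tidRe("^thread #\\d+: tid = (0x[0-9a-fA-F]+|\\d+),");
    std::smatch m;
    if (!std::regex_search(threadInfo, m, tidRe)) {
        return std::nullopt; // unexpected format: let the caller report an error
    }
    // Base 0 lets strtoull() detect the "0x" prefix by itself.
    return static_cast<uint64_t>(std::strtoull(m[1].str().c_str(), nullptr, 0));
}
```

In the handler, the regex block would then shrink to something like if (auto id = ParseThreadId(s)) { ... GetThreadByID(*id) ... }, with the error path handling the empty case.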

Let's verify it:

$ c++ -shared -fPIC main.cpp -o libmain.so -I /usr/include/lldb/API -llldb && lldb -o 'plugin load libmain.so' testapp
(lldb) target create "testapp"
Current executable set to '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64).
(lldb) plugin load libmain.so
(lldb) br set -y testapp.cpp:10 -C "p/x $rip" -C "custom nyancat"
Breakpoint 1: where = testapp`operator() + 12 at testapp.cpp:10:15, address = 0x0000000000001204
(lldb) r
Process 1449487 launched: '/home/antek/dev/cpp/lldb_plugin_example/testapp' (x86_64)
thread 1
(lldb)  p/x $rip
(unsigned long) $0 = 0x0000555555555204
(lldb)  custom nyancat
RIP is: 0x0000555555555204
Process 1449487 stopped
* thread #2, name = 'testapp', stop reason = breakpoint 1.1
frame #0: 0x0000555555555204 testapp`operator(__closure=0x000055555556b008) at testapp.cpp:10:15
7        });
8
9        auto t2 = std::thread([] () {
-> 10           printf("thread 2\n");
11       });
12
13       t1.join();
(lldb) q

Seems to work.

It's not as pretty as a proper fix, but since a proper fix doesn't exist, it will have to do.