Mike's Weblog

dlclose(): Not Even Once

In another life[1] I had the (mis)fortune of working on a project where I needed to implement a PKCS-11[2] library in C++[3] that acted as a shim, translating signature requests from PKCS-11's interface to that of another service on the box available over D-Bus.

If that last paragraph made sense to you then I'm so sorry.

In either case, while going about making this horrible thing become a reality I ran into a fun bug that seemed to happen every so often. After the application loading the library had finished reading the certificates and keys provided by the library, and performed a signing operation, it would crash with a SIGBUS or a SIGSEGV. This was unusual, because it really wasn't meant to do that.

After an hour or so of hopeless printf debugging I finally Did The Right Thing and reached for good old gdb.

Sure enough, gdb caught the address fault, but (being somewhat unfamiliar with gdb at the time[4]) it took me a while to figure out what exactly was causing it. Eventually I presented gdb with a query whose response I hope to never see again.


(gdb) print *$pc
Cannot access memory at address 0xe8e7a948
    

To translate: my program counter (the number the CPU uses to keep track of where it's heading) was pointing to memory that isn't accessible. Understandably this makes it difficult for the CPU to figure out what the next instruction it should run is.

After some more debugging (honestly I don't remember this part, but I'm sure it involved hitting my desk at least a few times an hour) I came to the realization that the thread mysteriously trying to execute memory that seemed to have nothing behind it was a worker thread from the GDBus library, part of GLib's GIO set of libraries, which made using D-Bus easier.

Did GLib betray me? Are the GNOME devs actively out to eat my lunch by having GIO jump to wild far-flung addresses in an attempt to ruin my day? These questions can't possibly be answered, but what I do know is that we eventually have to get back to the title of this post, and we've reached that point now.

dlclose() is the deceptively pleasant and even banal sounding counterpart to dlopen(). But don't be fooled, dlclose() hates you. It hates you, it hates your family, and it will eat your lunch if given the chance.

You see, PKCS-11 was implemented as a C library meant to be dynamically opened with dlopen(), which was the style at the time[5]. dlopen() helpfully loads an arbitrary .so file and its runtime dependencies into the current address space, and provides a handle for looking up the functions within it. dlclose(), on the other hand, mercilessly rips the aforementioned .so file and its dependencies right out of mapped memory no matter how you or your dependencies feel about it[6].
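To make that lifecycle concrete, here is a minimal sketch of what a loading application does. The helper name call_from_library() is made up for illustration, and the example assumes a glibc-based Linux where the math library is available as libm.so.6 (on older glibc you may also need to link with -ldl).

```c
#include <dlfcn.h>
#include <stdio.h>

/* Shape of the function we expect to find in the library. */
typedef double (*unary_fn)(double);

/*
 * Hypothetical helper: load lib_path, look up symbol, call it with x.
 * Returns 0 and stores the result in *out on success, -1 on failure.
 */
int call_from_library(const char *lib_path, const char *symbol,
                      double x, double *out)
{
    /* Maps the .so (and its dependencies) and takes a reference on it. */
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    /* The handle is only good for looking symbols up by name. */
    unary_fn fn;
    *(void **)(&fn) = dlsym(handle, symbol);
    if (fn == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    *out = fn(x);

    /*
     * Drops our reference. If that was the last one, the implementation is
     * free to unmap the library, and fn now points at nothing. Any thread
     * still executing code from the library at this point is in trouble.
     */
    dlclose(handle);
    return 0;
}
```

Calling call_from_library("libm.so.6", "cos", 0.0, &result) should leave 1.0 in result. The interesting part is the last few lines: nothing about dlclose() asks the library whether it is done.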

Which brings us back to GLib. As it turns out, GLib's worker thread was still running when dlclose() was called on my library. So GLib messed up, right? Well, I never really told it that its instructions might be unmapped right out from under it. And GLib doesn't have a way for me to tell it "hey, you got like 1 second, be ready to be completely unmapped from memory, sg ok?".

And really, it's fair it doesn't do that. It's hard enough to write a C library that can handle all the weird edge cases around what process its workers are in, or how threads may or may not be working in any particular state. Adding complex code on top of that to clean up all your state and prepare to be unmapped at any moment in time just isn't worth it.

Heck, even widely used libraries that are, in theory, meant to be able to be properly dlclose()'d don't always get this right. Take this example I found in openCryptoki, in which every time one dlclose()es the library a file descriptor is leaked[7].

So how do we fix this? It's simple: never dlclose(). Just don't do it. I know you might think you're some Real Smart Programmer That Really Actually Knows What They're Doing, but even if you're right the benefits simply aren't there. Once you've dlopen()'d you've opened Pandora's box, and there's no way you're going to cleanly stuff all those bits back inside. And what do you gain when you run dlclose() anyway? You maybe get back a few megabytes of address space. Not memory, address space. If you're on a 16-bit machine, sure, save that address space (how the heck are you using a POSIX-compliant dlclose() to begin with?!?). But if you're in the modern 64-bit world you should never use dlclose().

But what if you're in my unfortunate position, where you're writing the poor little library that just wanted to live free with its instructions mapped until the day of exit()ing? Well, it's pretty simple, actually: from inside your library, you just dlopen() the library that's being dlopen()'d.

By doing this you bump the reference counter for your library and its dependencies and put an end to any possibility of your instruction space being unceremoniously unmapped out from under you again. Unless whatever application is using you is poorly written and is double (or triple!) closing your library for some reason. In that case, maybe just dlopen() yourself a few times. 10 should do.
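A rough sketch of that trick follows. pin_library() is a made-up helper, not part of any real API, and it assumes the library knows a path or name under which dlopen() will find the same file the application loaded. It also relies on the reference-counting behaviour described above, which is what glibc does but is not something every libc promises.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: take `refs` extra references on the shared object at
 * `path` and never give them back. Returns how many references were taken.
 */
int pin_library(const char *path, int refs)
{
    int taken = 0;

    for (int i = 0; i < refs; i++) {
        /*
         * An object that is already loaded is not mapped a second time;
         * dlopen() just bumps its reference count.
         */
        void *handle = dlopen(path, RTLD_NOW);
        if (handle == NULL) {
            fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
            break;
        }

        /* Deliberately drop the handle: no matching dlclose() ever happens. */
        taken++;
    }

    return taken;
}
```

A library would call this once on itself from its own initialization path (for a PKCS-11 module, C_Initialize is a plausible spot), for example pin_library("libmyshim.so", 10), where libmyshim.so is a placeholder name. If you would rather not hard-code the name, dladdr() on glibc and the BSDs can report the path of the shared object containing a given address.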

[1] about five years ago

[2] yuck

[3] yuck++

[4] not to imply present familiarity

[5] 1995

[6] interestingly, the POSIX spec for dlclose() mentions this approach of ripping out the library and its dependencies is completely optional. musl, for example, opts to take the "do nothing because we don't have to" approach. what some may consider an inefficiency in musl's implementation, I consider a strategy deep in wisdom (or laziness, but in this case it's serendipitous laziness)

[7] i have no idea if this was ever patched (i nih'd my way out of needing to use openCryptoki as you might guess from this post) but if you have a long-running process that happens to dlopen() and dlclose() openCryptoki over its lifetime and you're running into file descriptor exhaustion, you might want to fix that (or better yet, as mentioned above, just stop using dlclose())