• Using a C library from Java

    Recently I’ve been considering making Java bindings to an open-source C library.

    It’s such a pain though.

    Native binding to C library

    Traditionally you’d do this with JNI:

    1. Compile the C library,
    2. Write some JNI glue in C and Java,
    3. Package it all up into a JAR

    Writing JNI isn’t trivial. I experimented with it, and there are a lot of gotchas around memory management and string handling. I’m confident I could manage it, but it’s a lot of work just to move a UTF-8 string from C into Java without leaking memory or mishandling exceptions.

    Java 22 reached General Availability recently (March 2024), and it includes the first non-preview release of the Java Foreign Function and Memory (FFM) API. It’s a libffi- or Python-ctypes-style mechanism for Java, similar to what Java Native Access (JNA) already provided.

    With that approach, you don’t write any glue code in C: Instead, you describe the C library’s exports in Java and use FFM/JNA to access them.

    So then, your process looks like:

    1. Compile the C library,
    2. Write FFM/JNA glue in Java,
    3. Package it all up into a JAR

    It’s still not perfect, though. C allows platform- and implementation-dependent definitions of primitive types, standard library types, typedefs, and so on. These are resolved at C compile time, so JNI gives you the opportunity to adapt to the platform’s specifications in a general way.

    I wrote about this problem before: Using setjmp/longjmp from Java.

    setjmp is a pretty obscure example, though. Here’s an easier example: long int is 64 bits on Linux x86_64, but 32 bits on Windows x86_64, and also on both Linux & Windows x86_32. So if you want to call unsigned long strtoul(...), you need to know how big unsigned long is at runtime when you’re describing strtoul to FFM/JNA.
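    The same runtime-discovery problem is easy to demonstrate with Python’s ctypes, which the FFM API resembles. Here’s a minimal sketch, assuming a POSIX system where libc can be located (the library name and lookup differ on Windows):

    ```python
    import ctypes
    import ctypes.util

    # sizeof(long) is a property of the platform, discovered at runtime:
    # typically 8 on x86_64 Linux/macOS, 4 on Windows and 32-bit platforms.
    print(ctypes.sizeof(ctypes.c_long))

    # Describing strtoul means committing to the platform's idea of
    # "unsigned long"; c_ulong does that discovery for you here.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.strtoul.restype = ctypes.c_ulong
    libc.strtoul.argtypes = [
        ctypes.c_char_p,                  # const char *nptr
        ctypes.POINTER(ctypes.c_char_p),  # char **endptr (None means NULL)
        ctypes.c_int,                     # int base
    ]
    print(libc.strtoul(b"123", None, 10))  # 123
    ```

    With FFM/JNA you’re making the same commitment in Java, except there’s no compiler pass to resolve the platform’s types for you.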

    In theory, types and sizes will vary depending on:

    • Operating system (Linux, Windows, macOS, …)
    • C library (glibc, musl, MSVC, …)
    • Compiler (gcc, clang, Visual C++, …)

    Each of the above typically chooses different behaviour depending on CPU architecture (x86_32, x86_64, 64-bit ARM, …).

    Compile the C library

    The C library you’re wrapping also needs to be compiled to match all of the above.

    Most Linux distributions standardize on glibc, but musl is also widespread (for example on Alpine Linux, which is extremely common in Docker images).

    Practically speaking, you’ll need to link dynamically against the same C library that’s being used on the system. If you bring another C library in (like through static linking, or including it as a pre-packaged dynamic library), you’re likely to encounter conflicts with the system-installed library.

    You can avoid that problem with statically-linked executables, but libraries are not executables, so they have less control over their immediate execution environment. That is, they need to stay compatible with the other libraries linked into the same project, which are almost certainly using the system C library.

    More precisely, you need to link against a version of the C library that is ABI-compatible with the one that’s present at runtime: If I compile my library against glibc 2.13 x86_64 Linux, I can be pretty confident it’ll run on glibc 2.15 x86_64 Linux, because glibc is backward-compatible. However, glibc is not forward-compatible, so it won’t run on glibc 2.10 x86_64 Linux. And of course, it won’t work on x86_32, musl, etc.

    This doesn’t apply to JUST the C library, but any dependency you need to link against. That could include a C++ standard library or other more exotic dependencies, depending on the library you’re trying to wrap.

    And since Java is used on so many different platforms…

    … you end up having to compile your library for every possible combination you’re willing to support.

    Look at the matrix of operating systems supported by SQLite-JDBC.

    They compile the SQLite C library separately for each of those targets! You can peek at the platform targets in the Makefile for a hint on how they’re cross-compiling. I find their approach very impressive, but it sure seems like a lot of work to maintain!

    Translate to JVM

    Emscripten compiles C code and libraries to JavaScript. At a very high level, it does this by providing C standard library functionality (primarily, the functionality normally provided by the OS kernel) and using JS/WASM as the compilation target.

    In JS, you basically have no choice: You can’t run native code in the JavaScript sandbox, so you have to provide everything as JS code.

    You’d expect a performance hit for this, but that’s OK for a lot of libraries. It’s OK for me, too: The library I want to bring to Java provides unique functionality, and doesn’t necessarily need to run fast.

    So I’d like to use a similar approach in Java:

    1. Compile a C library to JVM bytecode (or even plain Java code),
    2. Write glue to provide a better Java-style interface,
    3. Package it all up as a JAR

    This is totally feasible. I’ve found two projects that use translation to achieve this:

    • LLJVM translates LLVM IR (bitcode) to Java bytecode and provides the C standard library via newlib and custom Java code. Inactive since ~2010.
    • NestedVM translates MIPS binaries to Java bytecode. GCC can create MIPS binaries. Inactive since ~2009 with some more recent updates available on a fork.

    A lot of the discourse I’ve read focuses on how people want to call native C libraries from Java because the native code is expected to perform better, so this kind of “translation” approach typically gets dismissed: Why write your high-performance code in C in the first place if you’re going to run it in the JVM?

    The library I want to wrap:

    1. Doesn’t require high performance
    2. Wasn’t written by me, so I didn’t have the choice to write it in Java vs. C
    3. Has unique and specialized functionality that’s hard to replicate
    4. Already works in Emscripten

    So I’d love to try a translation approach and see how it works. It has some pretty significant advantages over using a native library:

    1. No difficulty compiling for all platform, OS, CPU architecture, compiler, and standard library configurations
    2. No difficulty porting, running, and testing on obscure platforms/configurations
    3. Low maintenance: Java code, even compiled, typically ages well and works unmodified for years/decades

    It sounds absolutely delightful!


  • Uncommon zsh shell techniques, part 1

    Some of these work in other shells too, but I only use zsh these days.

    Anonymous functions

    () { cp "$1" /tmp/ } filename
    

    This works the same as cp filename /tmp/, but it’s more convenient in some cases:

    • When you’re running the same command on many filenames, and want to use command history (up + enter) to modify the filename. You don’t have to position the cursor onto the filename mid-command - it’s just at the end.
    • When you use the argument multiple times, or need to use variable modifiers on the input:
      # Print the directory containing the passed file
      () { echo "${1:A:h}" } file.csv
      
      # Transcode to mp3, unless the source is already mp3
      () { newfn="${1:r}.mp3"; if [[ "$1" != "$newfn" ]]; then echo "$1 -> $newfn"; ffmpeg -i "$1" "$newfn"; fi } foo.flac
      

    It’s also safe to use with spaces & other sensitive characters.

    -c with variables

    # List all files without extensions
    find . -type f -exec zsh -c 'printf "%s\n" "${1:r}"' . '{}' ';'
    

    It’s tempting and common to place {} directly into the argument passed to zsh -c, like this:

    # DON'T DO THIS!
    find . -type f -exec zsh -c 'printf "%s\n" "{}"' ';'
    

    This will cause problems for filenames that contain special characters like ", because find (and many other programs) won’t escape them. How could it? It doesn’t know what escaping strategy to use, because it depends on the command you’re invoking. For example, we’re using zsh here, but if you were writing inline Python code you’d need to escape the string following Python rules instead of zsh.

    By passing the argument to zsh -c instead, you can use $1 in zsh as a variable with all the safety that comes along with that. You also get to use variable modifiers like :r.

    Note also:

    • I passed . to act as the $0 argument to the command-line script. I’m not using the value of it in the script, but I need to pass it so that the filename is passed as $1.
    • I used printf instead of echo because echo would treat a filename like -n as an option.

    Globbing flags and qualifiers

    I find the zsh documentation on filename generation pretty hard to read, but here are some examples I use that might help:

    Globbing flags

    Globbing flags appear right before the part of the glob you want to apply them to. I usually apply them to the whole pattern, so I put them right at the start.

    These require extended_glob (see docs) to be set.

    # Match all .jpg files, matched case-insensitively (so it also includes
    # *.JPG, *.Jpg, etc.), like the option nocaseglob.
    setopt extended_glob
    echo (#i)*.jpg
    

    Glob qualifiers

    Glob qualifiers are suffixes that modify how the glob works.

    # List all jpg and gif files. No matches = no arguments.
    echo *.jpg(N) *.gif(N)
    

    Adding (N) to a glob string makes it expand to no arguments if there are no matches (same as the null_glob option). Without this, you’ll typically either pass the raw argument *.jpg if there are no matches, or zsh won’t run the command and will raise an error instead.

    The exact default behaviour depends on the settings of the null_glob, nomatch, and csh_null_glob options.

    Together

    You can use them together:

    # Match GIF, JPG, JPEG, HEIF, AVIF, and PNG extensions
    # with case-insensitive matching,
    # and run with no arguments if no files match.
    setopt extended_glob
    echo "Image files:" (#i)*.(gif|jpe#g|heif|avif|png)(N)
    

  • Bluetooth codec scripts for pulseaudio

    I made some scripts to help me see what codecs are supported by my Bluetooth audio devices, and select the one I want.

    My devices were coming up with the sbc codec, which is the most basic one, but they support higher-bitrate codecs. Selecting one is a little clumsy on the headless, ssh-access-only Linux box that I’m playing audio from.

    My devices support, for example:

    sbc: SBC
    sbc_xq_453: SBC XQ 453kbps
    sbc_xq_512: SBC XQ 512kbps
    sbc_xq_552: SBC XQ 552kbps
    

    I’m surprised my devices only support sbc codecs and not aac/mp3/whatever else. Actually, I don’t know what’s even typical! Do other operating systems use other codecs? I don’t know! Maybe I’ll try to find out what codecs these devices use on macOS or Windows someday.

    It’s also possible I’m not seeing other options here because pulseaudio only supports sbc for Bluetooth. I have read that pipewire has better Bluetooth codec support, but I’m not currently willing to swap a working audio setup (pulseaudio) for one that might need tweaking (pipewire).

    The scripts are available on GitHub.

    Current mood: 😀 accomplished
    Current music: Big Wreck - Hey Mama


  • My first 4K monitor, on Windows

    I just got a pair of 4K monitors - one for a Mac Mini, and one for Windows.

    The Mac is hooked up over HDMI and I use it purely for desktop applications. It works fine.

    But on Windows, I’ve encountered a surprising number of issues.

    Problem 1: No display during boot

    I connected the monitor using DisplayPort because it seemed most appropriate for my video card, a GeForce GTX 960: it has three DisplayPort ports and only one HDMI port. I didn’t know whether the HDMI port supported 4K at 60 Hz (it does), but I knew the DisplayPort ports would.

    I swapped the monitor while the computer was on, and everything was fine… but when I rebooted, I had no display.

    FIX: It wasn’t super easy to find information about this, but eventually I found a post that pointed me towards an NVIDIA firmware update tool for DisplayPort 1.3 and 1.4 displays, which fixes the issue:

    Without the update, systems that are connected to a DisplayPort 1.3 / 1.4 monitor could experience blank screens on boot until the OS loads, or could experience a hang on boot.

    Problem 2: Euro Truck Simulator 2 stuck minimized

    UPDATE: Fixed: My Epson scanner software includes a tray icon. If I kill the process, this problem goes away. I guess it’s stealing focus when the resolution & scale change? Even though it isn’t actually showing a window? 🙄

    My video card can’t handle 4K resolutions at a reasonable framerate, so I’m running games at 1080p. Also, I often stream games to my living room TV using Steam, and it’s a 1080p TV so it fits better.

    When I launch Euro Truck Simulator 2, it immediately minimizes into the background, and any attempt to restore it brings it up for a brief moment but then it goes minimized again.

    It doesn’t happen if one of the following is true:

    • ETS2 is run at the desktop resolution—but at 4K it takes a severe framerate hit… or
    • Windows display scale is set to 100%—but at 27” 4K, 150% is far more usable. This is the workaround I’m using but I wish I didn’t have to!

    I don’t know if this is an ETS2 problem specifically, or a Windows problem. I assume other games will be affected too, but I’ve only tried Cities: Skylines and it has no such issue. ETS2 actually changes the desktop resolution for fullscreen, while Cities: Skylines uses a borderless mode that leaves the desktop resolution unchanged; this might explain the difference.

    Problem 3: 1080p not pixel perfect

    A 4K monitor can theoretically upscale 1080p using pixel doubling, where each 1080p pixel is displayed as four 4K pixels (doubled in both the X and Y axes). I want this because it looks clear and perfect, as though I’m using a 1080p monitor…

    … but my particular monitor (LG 27UL550-W) doesn’t do this - it performs smoothing/interpolation of some sort on the upscale, and as a result it looks blurry.

    I feel that my GPU drivers should be able to render at 1080p but output at 4K, but if they can, I haven’t found out how.

    UPDATE: Integer scaling is available in the NVIDIA control panel for Turing-architecture GPUs (GeForce 16xx, GeForce 20xx and up). I have a 960, so I’m outta luck!

    Problem 4: DisplayPort disconnects when monitor off

    When I turn off the monitor, the computer sees that as a disconnected display. This is a well-known hotplug detection feature.

    This isn’t really a problem when I’m sitting in front of it, but I like to stream games from the computer to my living room TV.

    When I do that, I want to turn off the locally-attached display. But if I do, games basically don’t work: they see no display connected and can’t select a display resolution, because there’s no display to set one on. So streaming just doesn’t work at all.

    Even if I’m not streaming, I prefer to have direct control over the power of my display, instead of having to use the display sleep timer to shut it off.

    WORKAROUND: Use HDMI. Long-term, though, I’m probably just gonna have to live with this problem, because I understand some features, like FreeSync, require DisplayPort. Some monitors have an option to turn off while appearing connected to the computer, but mine doesn’t!

    Other thoughts

    These are all small-ish problems. Some of them have workarounds or whatever, but like, they’re all surprising issues that I feel shouldn’t happen at all. And I’ve only had the monitor for one day!

    Hardware

    • MSI NVIDIA GeForce GTX 960
    • MSI Z270-A Pro motherboard
    • Windows 10 up-to-date
    • NVIDIA drivers up-to-date

    My old monitor is a 2560x1440 panel connected over dual-link DVI. It exhibited none of the above problems, but I used it at 100% scale, at native resolution, and without DisplayPort.


  • Un-mangling some mangled unicode

    Recently I got some data from an external source that I’m to review and correct prior to use. One of the things I’ve been addressing is weird Unicode encoding stuff.

    For example:

    b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
    

    Clearly this is supposed to have an apostrophe (’), but how on earth did it get turned into \xc3\x82\xc2\x80\xc2\x99?

    After poking at it with different coding systems for a while, I finally figured it out:

    # Mangle input
    ('’'
        .encode('utf-8')    # b'\xe2\x80\x99'
        .decode('latin-1')  # 'â\x80\x99'
        .upper()            # 'Â\x80\x99'
        .encode('utf-8')    # b'\xc3\x82\xc2\x80\xc2\x99'
    )
    

    I’ve never seen mangled Unicode get passed through .upper() before. I wasn’t around to see this data get created in the first place, but my guess is something like this happened:

    1. Software A accepted the input O’SAMPLA
    2. Software A exported the data using UTF-8 encoding
    3. Software B imported the data but incorrectly interpreted it using Latin-1 encoding
    4. Software B uppercased the data (typical for this software)
    5. Software B exported the data using UTF-8 encoding

    Here’s the reverse, to restore the original data:

    # Fix mangled input
    (b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
        .decode('utf-8')    # 'OÂ\x80\x99SAMPLA'
        .lower()            # 'oâ\x80\x99sampla'
        .encode('latin-1')  # b'o\xe2\x80\x99sampla'
        .decode('utf-8')    # 'o’sampla'
        .upper()            # 'O’SAMPLA'
    )
    

    This works for this particular input because Â needs to become â before the latin-1/utf-8 interpretation steps, but I don’t consider it appropriate to assume this will work for all inputs. Some inputs may not have been affected at all by upper(), and it would be incorrect to apply lower() to them.

    Unfortunately I can’t predict with total confidence whether applying lower() is appropriate for each input, so this data is gonna require manual review.
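    One way to narrow the manual review (a sketch of an idea, not a guarantee; unmangle is my name, not from any library) is a round-trip check: attempt the reverse pipeline, then re-apply the mangling and only trust the result if it reproduces the input bytes exactly.

    ```python
    def unmangle(data):
        """Reverse the utf-8 -> latin-1 -> upper() -> utf-8 mangling, or
        return None if this input doesn't survive a round trip and
        therefore needs manual review."""
        try:
            fixed = (data
                .decode('utf-8')
                .lower()
                .encode('latin-1')
                .decode('utf-8')
                .upper())
        except (UnicodeDecodeError, UnicodeEncodeError):
            return None
        # Re-apply the mangling; only trust the fix if it reproduces the input.
        remangled = (fixed
            .encode('utf-8')
            .decode('latin-1')
            .upper()
            .encode('utf-8'))
        return fixed if remangled == data else None

    print(unmangle(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'))  # O’SAMPLA
    print(unmangle(b'caf\xc3\xa9'))  # None: proper UTF-8, never mangled
    ```

    Anything that comes back None still goes in the manual-review pile; the check only automates the cases where the fix provably round-trips.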


subscribe via RSS