• On generating structured data from templates

    Recently I chatted with a friend about generating structured data from templates. Specifically, he observed that Jekyll’s atom feed is generated from an XML template. He posted about his experience.

    I’ve long felt that it’s extremely challenging to correctly use templates to generate structured data, meaning files like source code - HTML/XML/C/etc. I don’t particularly like templating, but I acknowledge its value, especially in HTML templating where you’re mostly writing static markup to represent the page layout. In fact, template-engine substitutions in HTML templates are only a small part of most HTML template files.

    The problem I have is that the template engine doesn’t actually know the context it’s substituting into. Very few HTML template engines parse the HTML and decide what escaping rules are appropriate for each piece of data being placed into the generated output.

    This post is kinda rambly, but I have some lightly-organized thoughts about templating that I wanted to put into words.

    A typical problem

    Let me give an example:

    <div id="name">Hello, {{ email }}!</div>
    <script>
        database.saveEmail('{{ email }}');
    </script>
    

    If email is 'Baz' <foo.bar@example.com>, what happens?

    I see two typical options, both of which produce the wrong result:

    • The value of email is substituted directly into the template with no modification. The resulting output is incorrect:

        <!-- WRONG! <foo...> is not a valid HTML tag! -->
        <div id="name">Hello, 'Baz' <foo.bar@example.com>!</div>
        <script>
            // WRONG! Incorrectly quoted; leads to syntax error!
            database.saveEmail(''Baz' <foo.bar@example.com>');
        </script>
      
    • The value of email is escaped for HTML. Django and many other template engines will do this for regular strings.

        <!-- Correct: The string is escaped for inclusion as plain text in HTML. -->
        <div id="name">Hello, 'Baz' &lt;foo.bar@example.com&gt;!</div>
        <script>
            // WRONG! Still incorrectly quoted, and now we're incorrectly
            // storing HTML entities in the database instead of the
            // original string!
            database.saveEmail(''Baz' &lt;foo.bar@example.com&gt;');
        </script>
      

    Comparison to SQL

    The problem is similar to SQL injection, although the stakes are usually a lot lower. The typical way to avoid SQL injection bugs when writing SQL is to use the SQL-aware interpolation functions provided by our SQL libraries:

    db.execute('SELECT * FROM foo WHERE email = ?', email)
    

    In SQL, this works because SQL allows every value in a query to be represented as a string literal with consistent quoting & escaping rules, regardless of the data type of the value being represented:

    select '3'::integer + '4'::integer;
    -- Result: 7
    

    But SQL-aware interpolation only works for data. Other SQL syntax and identifiers cannot be represented as string data, so this code:

    # This won't work!
    table_name = "user_table"
    db.execute("SELECT COUNT(*) FROM ?", table_name)
    

    results in an invalid query:

    SELECT COUNT(*) FROM 'user_table';
    -- ERROR:  syntax error at or near "'user_table'"
    

    With SQL libraries, you often avoid this by marking the interpolated string as “safe”, which indicates that you’ve already verified that it won’t lead to problems if it’s substituted raw:

    table_name = "user_table"
    db.execute("SELECT COUNT(*) FROM ?", AsIs(user_table))
    

    You do have to be really careful that table_name doesn’t include anything malicious, since its contents will be interpreted as raw SQL syntax. I might even suggest that we should be able to tag it as a different kind of identifier, like TableName(table_name), so the interpolating code can validate/quote/escape it for use ONLY as a table name.
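
    For what it’s worth, some SQL libraries already offer something along these lines. Here’s a minimal sketch using psycopg2’s sql module (assuming a psycopg2 cursor named cur); sql.Identifier quotes the value for use as an identifier rather than as a string literal:

    from psycopg2 import sql

    table_name = "user_table"
    # The table name is quoted/escaped as an identifier (-> "user_table"),
    # while email is still passed as ordinary query data.
    query = sql.SQL("SELECT COUNT(*) FROM {} WHERE email = %s").format(
        sql.Identifier(table_name))
    cur.execute(query, (email,))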

    Usage contexts

    The main problem I see with HTML and other languages is that there are way more different kinds of contexts that variables get substituted into.

    Above, I showed that one escaping rule isn’t sufficient when a variable gets substituted into both HTML and JavaScript.

    A common solution here is to indicate the context to your template engine - perhaps using a filter:

    <div id="name">Hello, {{ email | html }}!</div>
    <script>
        database.saveEmail('{{ email | javascript_string }}');
    </script>
    

    This works OK when the number of different usage contexts is small, like in this example, but I don’t like that you have to remember to use the right filter every time you write a substitution. If the default behavior is to escape for HTML, you’ll start omitting the | html part, and then it’s easy to accidentally miss the | javascript_string filter because it’s used so much less frequently.

    And if you do miss it, will you even notice? You’ll only see problems with strings that contain syntax that’s meaningful to JavaScript. So it becomes a bug that happens infrequently, which makes it harder to find later on. SQL would actually suffer from the same problem if you skipped the interpolation functions:

    # Don't do this! It's SO UNSAFE!
    # But it results in working code, and doesn't even break for most typical
    # inputs, and that's almost worse!
    
    email = request.GET['email']     # eg. foo@example.com
    db.execute(f"SELECT * FROM foo WHERE email = '{email}'")
    

    Too many contexts

    If you have a lot of different contexts that substitutions need to be placed into, it can be arduous to make sure they’re all correct:

    # NOTE: Function {{ fn_name | for_comment }} is generated from a template.
    def {{ fn_name | for_identifier }}():
        num1 = {{ num1 | for_number }}
        num_squared = num1 * num1
    
        logging.debug(r'{{ name | for_raw_string }}')
    
        print(f'Hey {{ name | for_string }}, your number {{ num1 | for_string }} squared is {num_squared}!')
    

    The idea here is that you need to escape/quote/etc. values depending on how they’re being used. Like, in a string, \ should be escaped to \\, but that would be inappropriate in a raw string. Numbers shouldn’t have spaces or anything in them.
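
    To make that concrete, here’s a rough, hypothetical sketch of a few of those filters in Python (the names come from the template above; the implementations are my own guesses at the minimum each context needs):

    import keyword

    def for_string(value):
        # Escape for use inside an existing single-quoted f-string literal:
        # backslashes, quotes, newlines, and braces all need treatment.
        return (value.replace('\\', '\\\\')
                     .replace("'", "\\'")
                     .replace('\n', '\\n')
                     .replace('{', '{{')
                     .replace('}', '}}'))

    def for_identifier(value):
        # Refuse anything that isn't a legal, non-keyword Python identifier.
        if not value.isidentifier() or keyword.iskeyword(value):
            raise ValueError(f'unsafe identifier: {value!r}')
        return value

    def for_number(value):
        # Round-trip through int/float so stray characters can't sneak in.
        return repr(int(value)) if isinstance(value, int) else repr(float(value))

    And for_raw_string is even trickier: you can’t escape anything inside a raw string, so about all a filter can do is reject values that contain a quote character or end with a backslash.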

    Not using templates

    There are libraries that allow you to generate HTML by writing code in your host language. This is comparable to generating JSON using a JSON library, or XML using an XML library, and it can also be a pain:

    doc = htmltag('html')
    body = htmltag('body').add_to(doc)
    
    # Static elements are too much work to create.
    div = htmltag('div', {
        'id': 'outermost'
    }).add_to(body)
    
    # Attribute values are too much work, but they'll be properly escaped on output.
    textinput = htmltag('input', {
        'type': 'text',
        'name': 'user_email',
        'value': email_address
    }).add_to(div)
    
    # When htmltext is rendered to HTML, its contents are escaped.
    htmltext(f'Hello, {email}!').add_to(div)
    
    response.send(doc.render_to_html())
    

    In my experience, nobody wants to write HTML in anything other than an HTML file. Totally understandable - editors have good syntax highlighting, feedback, etc. for HTML, and you’re mostly writing static HTML anyway with only a few substitutions here and there.

    What do I want, anyway?

    I don’t really know. I think templating has substantial problems.

    Frameworks usually have default settings that “just work” for most cases, so it’s easy to get complacent in less-common situations. Or people never really learn how templating works, and miss the gotchas when they need to substitute into different contexts, like JavaScript code.

    Even if you’re well-aware of the limitations and gotchas, it’s also easy to make a mistake and not notice until some unusual text shows up and breaks your output.


    By the way:

    <script>
        database.saveEmail('{{ email | javascript_string }}');
    </script>
    

    It’s inappropriate to replace < and > with HTML entities in a JavaScript string, so a correct javascript_string filter will pass them through verbatim. What happens, then, if email contains the text </script>?


  • Using a C library from Java

    Recently I’ve been considering making Java bindings to an open-source C library.

    It’s such a pain though.

    Native binding to C library

    Traditionally you’d do this with JNI:

    1. Compile the C library,
    2. Write some JNI glue in C and Java,
    3. Package it all up into a JAR

    Writing JNI isn’t trivial. I experimented with it and there are a lot of gotchas around memory management and string handling. I’m confident I could manage it but it’s a lot of work even to just move a UTF-8 string from C into Java without leaking memory or mis-handling exceptions.

    Java 22 reached General Availability recently (March 2024), and it includes the first non-preview release of the Java Foreign Function and Memory (FFM) API, which gives Java a libffi- or Python-ctypes-style mechanism - a role that Java Native Access (JNA) has also long filled.

    With that approach, you don’t write any glue code in C: Instead, you describe the C library’s exports in Java and use FFM/JNA to access them.

    So then, your process looks like:

    1. Compile the C library,
    2. Write FFM/JNA glue in Java,
    3. Package it all up into a JAR

    It’s still not perfect, though. In C, you can have platform- and implementation-dependent definitions of primitive types, standard library types, typedefs, etc. These are resolved at compile time, so JNI glue (which is itself compiled C code) gets the chance to adapt to the platform’s definitions in a general way.

    I wrote about this problem before: Using setjmp/longjmp from Java.

    setjmp is a pretty obscure example, though. Here’s an easier example: long int is 64 bits on Linux x86_64, but 32 bits on Windows x86_64, and also on both Linux & Windows x86_32. So if you want to call unsigned long strtoul(...), you need to know how big unsigned long is at runtime when you’re describing strtoul to FFM/JNA.
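
    For comparison, Python’s ctypes (mentioned above) ships that platform knowledge with it: you name c_ulong and ctypes resolves its size at runtime. A rough sketch (assuming a Unix-like libc; this is just an illustration, not the Java code I’d actually write):

    import ctypes
    import ctypes.util

    # unsigned long is 8 bytes on Linux/macOS x86_64 but 4 bytes on Windows x86_64;
    # ctypes.c_ulong resolves to the platform's size at runtime.
    print(ctypes.sizeof(ctypes.c_ulong))

    libc = ctypes.CDLL(ctypes.util.find_library('c'))
    libc.strtoul.restype = ctypes.c_ulong
    libc.strtoul.argtypes = [ctypes.c_char_p,
                             ctypes.POINTER(ctypes.c_char_p),
                             ctypes.c_int]
    print(libc.strtoul(b'1234', None, 10))   # -> 1234

    An FFM/JNA binding has to make the same decision itself, picking a 64-bit or 32-bit layout for unsigned long depending on the platform it finds itself running on.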

    In theory, types and sizes will vary depending on:

    • Operating system (Linux, Windows, macOS, …)
    • C library (glibc, musl, MSVC, …)
    • Compiler (gcc, clang, Visual C++, …)

    Each of the above typically chooses different behaviour depending on CPU architecture (x86_32, x86_64, 64-bit ARM, …).

    Compile the C library

    The C library you’re wrapping also needs to be compiled to match all of the above too.

    Most Linux distributions standardize on glibc, but musl is also common (like on Alpine Linux, which is extremely common in Docker images).

    Practically speaking, you’ll need to link dynamically against the same C library that’s being used on the system. If you bring another C library in (like through static linking, or including it as a pre-packaged dynamic library), you’re likely to encounter conflicts with the system-installed library.

    You can avoid that problem with statically-linked executables, but libraries are not executables, so they have less control over their immediate execution environment. That is, they need to stay compatible with other libraries that are also linked into the same process, which are almost certainly using the system C library.

    More precisely, you need to link against a version of the C library that is ABI-compatible with the one that’s present at runtime: If I compile my library against glibc 2.13 x86_64 Linux, I can be pretty confident it’ll run on glibc 2.15 x86_64 Linux, because glibc is backward-compatible. However, glibc is not forward-compatible, so it won’t run on glibc 2.10 x86_64 Linux. And of course, it won’t work on x86_32, musl, etc.

    This doesn’t apply to JUST the C library, but any dependency you need to link against. That could include a C++ standard library or other more exotic dependencies, depending on the library you’re trying to wrap.

    And since Java is used on so many different platforms…

    … you end up having to compile your library for every possible combination you’re willing to support.

    Look at the matrix of operating systems that SQLite-JDBC supports: they compile the SQLite C library separately for each of those targets! You can peek at the platform targets in the Makefile for a hint on how they’re cross-compiling. I find their approach very impressive, but it sure seems like a lot of work to maintain!

    Translate to JVM

    Emscripten compiles C code and libraries to JavaScript/WebAssembly. At a very high level, it does this by providing the C standard library functionality (primarily, the functionality normally provided by the OS kernel), and using JS/WASM as the compilation target.

    In JS, you basically have no choice: you can’t run native code in the JavaScript sandbox, so you have to provide everything as JS/WASM code.

    You’d expect a performance hit for this, but that’s OK for a lot of libraries. It’s true for me, too: the library I want to bring to Java offers unique functionality, and doesn’t necessarily need to run fast.

    So I’d like to use a similar approach in Java:

    1. Compile a C library to JVM bytecode (or even plain Java code),
    2. Write glue to provide a better Java-style interface,
    3. Package it all up as a JAR

    This is totally feasible. I’ve found two projects that use translation to achieve this:

    • LLJVM translates LLVM IR (bitcode) to Java bytecode and provides the C standard library via newlib & custom Java code. Inactive since ~2010.
    • NestedVM translates MIPS binaries to Java bytecode (GCC can produce MIPS binaries). Inactive since ~2009, with some more recent updates available on a fork.

    A lot of the discourse I’ve read focuses on how people want to call native C libraries from Java because the native code is expected to perform better, so this kind of “translation” approach typically gets dismissed: Why write your high-performance code in C in the first place if you’re going to run it in the JVM?

    The library I want to wrap:

    1. Doesn’t require high performance
    2. Wasn’t written by me, so I didn’t have the choice to write it in Java vs. C
    3. Has unique and specialized functionality that’s hard to replicate
    4. Already works in Emscripten

    So I’d love to try a translation approach and see how it works. It has some pretty significant advantages over using a native library:

    1. No difficulty compiling for every combination of OS, CPU architecture, compiler, and standard library
    2. No difficulty porting, running, and testing on obscure platforms/configurations
    3. Low maintenance: Java code, even compiled, typically ages well and works unmodified for years/decades

    It sounds absolutely delightful!


  • Uncommon zsh shell techniques, part 1

    Some of these work in other shells too, but I only use zsh these days.

    Anonymous functions

    () { cp "$1" /tmp/ } filename
    

    This works the same as cp filename /tmp/, but it’s more convenient in some cases:

    • When you’re running the same command on many filenames, and want to use command history (up + enter) to modify the filename. You don’t have to position the cursor onto the filename mid-command - it’s just at the end.
    • When you use the argument multiple times, or need to use variable modifiers on the input:
      # Print the directory containing the passed file
      () { echo "${1:A:h}" } file.csv
      
      # Transcode to mp3, unless the source is already mp3
      () { newfn="${1:r}.mp3"; if [[ "$1" != "$newfn" ]]; then echo "$1 -> $newfn"; ffmpeg -i "$1" "$newfn"; fi } foo.flac
      

    It’s also safe to use with spaces & other sensitive characters.

    zsh -c with variables

    # List all files without extensions
    find . -type f -exec zsh -c 'printf "%s\n" "${1:r}"' . '{}' ';'
    

    It’s tempting and common to place {} directly into the argument passed to zsh -c, like this:

    # DON'T DO THIS!
    find . -type f -exec zsh -c 'printf "%s\n" "{}"' ';'
    

    This will cause problems for filenames that contain special characters like ", because find (and many other programs) won’t escape them. How could it? It doesn’t know what escaping strategy to use, because it depends on the command you’re invoking. For example, we’re using zsh here, but if you were writing inline Python code you’d need to escape the string following Python rules instead of zsh.

    By passing the argument to zsh -c instead, you can use $1 in zsh as a variable with all the safety that comes along with that. You also get to use variable modifiers like :r.

    Note also:

    • I passed . to act as the $0 argument to the command-line script. I’m not using the value of it in the script, but I need to pass it so that the filename is passed as $1.
    • I used printf instead of echo because echo will try to interpret filenames like -n as options.

    Globbing flags and qualifiers

    I find the zsh documentation on filename generation pretty hard to read, but here are some examples I use that might help:

    Globbing flags

    Globbing flags appear right before the part of the glob you want to apply them to. I usually apply them to the whole pattern, so I put them right at the start.

    These require extended_glob (see docs) to be set.

    # Match all .jpg files, matched case-insensitively (so it also includes
    # *.JPG, *.Jpg, etc.), like the option nocaseglob.
    setopt extended_glob
    echo (#i)*.jpg
    

    Glob qualifiers

    Glob qualifiers are suffixes that modify how the glob works.

    # List all jpg and gif files. No matches = no arguments.
    echo *.jpg(N) *.gif(N)
    

    Adding (N) to a glob makes it expand to no arguments if there are no matches (same as the null_glob option). Without this, you’ll typically either pass the raw pattern *.jpg as an argument if there are no matches, or zsh won’t run the command at all and will raise an error instead.

    The exact default behaviour depends on the settings of the null_glob, nomatch, and csh_null_glob options.

    Together

    You can use them together:

    # Match GIF, JPG, JPEG, HEIF, AVIF, and PNG extensions
    # with case-insensitive matching,
    # and run with no arguments if no files match.
    setopt extended_glob
    echo "Image files:" (#i)*.(gif|jpe#g|heif|avif|png)(N)
    

  • Bluetooth codec scripts for pulseaudio

    I made some scripts to help me see what codecs are supported by my Bluetooth audio devices, and select the one I want.

    My devices were coming up with the sbc codec, which is the most basic one, but they support higher-bitrate codecs. Selection is a little clumsy on my headless, ssh-access-only Linux box that I’m playing audio from.

    My devices support for example:

    sbc: SBC
    sbc_xq_453: SBC XQ 453kbps
    sbc_xq_512: SBC XQ 512kbps
    sbc_xq_552: SBC XQ 552kbps
    

    I’m surprised my devices only support sbc codecs and not aac/mp3/whatever else. Actually, I don’t know what’s even typical! Do other operating systems use other codecs? I don’t know! Maybe I’ll try to find out what codecs these devices use on macOS or Windows someday.

    It’s also possible I’m not seeing other options here because pulseaudio only supports sbc for Bluetooth. I have read that pipewire has better Bluetooth codec support, but I’m not currently willing to swap a working audio setup (pulseaudio) for one that might need tweaking (pipewire).

    The scripts are available on GitHub.

    Current mood: 😀 accomplished
    Current music: Big Wreck - Hey Mama


  • My first 4K monitor, on Windows

    I just got a pair of 4K monitors - one for a Mac Mini, and one for Windows.

    The Mac is hooked up over HDMI and I use it purely for desktop applications. It works fine.

    But on Windows, I’ve encountered a surprising number of issues.

    Problem 1: No display during boot

    I connected the monitor using DisplayPort because it seemed most appropriate to my video card, a GeForce GTX 960. It has 3 DisplayPort ports, and only one HDMI port; and I didn’t know if the HDMI port supported 4K at 60 Hz (it does), but I knew the DisplayPort ports would do it.

    I swapped the monitor while the computer was on, and everything was fine… but when I rebooted, I had no display.

    FIX: It wasn’t super easy to find information about this but eventually I found a post that pointed me towards an NVIDIA firmware update tool for DisplayPort 1.3 and 1.4 displays that fixes the issue:

    Without the update, systems that are connected to a DisplayPort 1.3 / 1.4 monitor could experience blank screens on boot until the OS loads, or could experience a hang on boot.

    Problem 2: Euro Truck Simulator 2 stuck minimized

    UPDATE: Fixed: My Epson scanner software includes a tray icon. If I kill the process, this problem goes away. I guess it’s stealing focus when the resolution & scale change? Even though it isn’t actually showing a window? 🙄

    My video card can’t handle 4K resolutions at a reasonable framerate, so I’m running games at 1080p. Also, I often stream games to my living room TV using Steam, and it’s a 1080p TV so it fits better.

    When I launch Euro Truck Simulator 2, it immediately minimizes into the background, and any attempt to restore it brings it up for a brief moment but then it goes minimized again.

    It doesn’t happen if one of the following is true:

    • ETS2 is run at the desktop resolution—but at 4K it takes a severe framerate hit… or
    • Windows display scale is set to 100%—but at 27” 4K, 150% is far more usable. This is the workaround I’m using but I wish I didn’t have to!

    I don’t know if this is an ETS2 problem specifically, or a Windows problem. I assume other games will be affected too, but I’ve only tried Cities: Skylines and it has no such issue. ETS2 actually changes the desktop resolution for fullscreen, while Cities: Skylines uses a borderless mode that leaves the desktop resolution unchanged; this might explain the difference.

    Problem 3: 1080p not pixel perfect

    A 4K monitor can theoretically upscale 1080p using pixel doubling, where each 1080p pixel is displayed as four 4K pixels (doubled in both the X and Y axes). I want this because it looks clear and perfect, as though I’m using a 1080p monitor…

    … but my particular monitor (LG 27UL550-W) doesn’t do this - it performs smoothing/interpolation of some sort on the upscale, and as a result it looks blurry.

    I feel like my GPU drivers should be able to render at 1080p but output at 4K, but if they can, I haven’t found out how.

    UPDATE: Integer scaling is available in the NVIDIA control panel for Turing-architecture GPUs (GeForce 16xx, GeForce 20xx and up). I have a 960 so outta luck!!

    Problem 4: DisplayPort disconnects when monitor off

    When I turn off the monitor, the computer sees that as a disconnected display. This is a well-known hotplug detection feature.

    This isn’t really a problem when I’m sitting in front of it, but I like to stream games from the computer to my living room TV.

    When I do that, I want to turn off the locally-attached display. But if I do, games basically don’t work - they see no display connected and aren’t able to select a display resolution because there’s no display to switch to. So streaming just doesn’t work at all.

    Even if I’m not streaming, I prefer to have direct control over the power of my display, instead of having to use the display sleep timer to shut it off.

    WORKAROUND: Use HDMI, but long-term I’m probably just gonna have to live with this problem because I understand some features require DisplayPort, like FreeSync. Some monitors have an option to turn off while appearing connected to the computer, but mine doesn’t!

    Other thoughts

    These are all small-ish problems. Some of them have workarounds or whatever, but like, they’re all surprising issues that I feel shouldn’t happen at all. And I’ve only had the monitor for one day!

    Hardware

    • MSI NVIDIA GeForce GTX 960
    • MSI Z270-A Pro motherboard
    • Windows 10 up-to-date
    • NVIDIA drivers up-to-date

    My old monitor is a 2560x1440 panel connected over dual-link DVI. It exhibited none of the above problems, but I used it at 100% scale, at native resolution, and without DisplayPort.

