• On generating structured data from templates

    Recently I chatted with a friend about generating structured data from templates. Specifically, he observed that Jekyll’s atom feed is generated from an XML template. He posted about his experience.

    I’ve long felt that it’s extremely challenging to correctly use templates to generate structured data, meaning files like source code - HTML/XML/C/etc. I don’t particularly like templating, but I acknowledge its value, especially in HTML templating where you’re mostly writing static markup to represent the page layout. In fact, template-engine substitutions in HTML templates are only a small part of most HTML template files.

    The problem I have is that the template engine doesn’t actually know the context it’s substituting into. Very few HTML template engines parse the HTML and decide what escaping rules are appropriate for each piece of data being placed into the generated output.

    This post is kinda rambly, but I have some lightly-organized thoughts about templating that I wanted to put into words.

    A typical problem

    Let me give an example:

    <div id="name">Hello, {{ email }}!</div>
    <script>
        database.saveEmail('{{ email }}');
    </script>
    

    If email is 'Baz' <foo.bar@example.com>, what happens?

    I see two typical options, both of which produce the wrong result:

    • The value of email is substituted directly into the template with no modification. The resulting output is incorrect:

        <!-- WRONG! <foo...> is not a valid HTML tag! -->
        <div id="name">Hello, 'Baz' <foo.bar@example.com>!</div>
        <script>
            // WRONG! Incorrectly quoted; leads to syntax error!
            database.saveEmail(''Baz' <foo.bar@example.com>');
        </script>
      
    • The value of email is escaped for HTML. Django and many other template engines will do this for regular strings.

        <!-- Correct: The string is escaped for inclusion as plain text in HTML. -->
        <div id="name">Hello, 'Baz' &lt;foo.bar@example.com&gt;!</div>
        <script>
            // WRONG! Still incorrectly quoted, and now we're incorrectly
            // storing HTML entities in the database instead of the
            // original string!
            database.saveEmail(''Baz' &lt;foo.bar@example.com&gt;');
        </script>
      

    Comparison to SQL

    The problem is similar to SQL injection, although the stakes are usually a lot lower. The typical way to avoid SQL injection bugs when writing SQL is to use the SQL-aware interpolation functions provided by our SQL libraries:

    db.execute('SELECT * FROM foo WHERE email = ?', email)
    

    In SQL, this works because SQL allows every value in a query to be represented as a string literal with consistent quoting & escaping rules, regardless of the data type of the value being represented:

    select '3'::integer + '4'::integer;
    -- Result: 7
    

    But SQL-aware interpolation only works for data. Other SQL syntax and identifiers cannot be represented as string data, so this code:

    # This won't work!
    table_name = "user_table"
    db.execute("SELECT COUNT(*) FROM ?", table_name)
    

    results in an invalid query:

    SELECT COUNT(*) FROM 'user_table';
    -- ERROR:  syntax error at or near "'user_table'"
    

    With SQL libraries, you often avoid this by marking the interpolated string as “safe”, which indicates that you’ve already verified that it won’t lead to problems if it’s substituted raw:

    table_name = "user_table"
    db.execute("SELECT COUNT(*) FROM ?", AsIs(user_table))
    

    You do have to be really careful that table_name doesn’t include anything malicious, since its contents will be interpreted as raw SQL syntax. I might even suggest that we should be able to tag it as a different kind of identifier, like TableName(table_name), so the interpolating code can validate/quote/escape it for use ONLY as a table name.
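
    For what it’s worth, some SQL libraries already offer something along these lines. Here’s a minimal sketch using psycopg2’s sql module (assuming a psycopg2 cursor named cur); sql.Identifier quotes the value for use as an identifier rather than as a string literal:

    from psycopg2 import sql

    table_name = "user_table"
    # The table name is quoted/escaped as an identifier (-> "user_table"),
    # while email is still passed as ordinary query data.
    query = sql.SQL("SELECT COUNT(*) FROM {} WHERE email = %s").format(
        sql.Identifier(table_name))
    cur.execute(query, (email,))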

    Usage contexts

    The main problem I see with HTML and other languages is that there are way more different kinds of contexts that variables get substituted into.

    Above, I showed that one escaping rule isn’t sufficient when a variable gets substituted into both HTML and JavaScript.

    A common solution here is to indicate the context to your template engine - perhaps using a filter:

    <div id="name">Hello, {{ email | html }}!</div>
    <script>
        database.saveEmail('{{ email | javascript_string }}');
    </script>
    

    This works OK when the number of different usage contexts is small, like in this example, but I don’t like that you have to remember to use the right filter every time you write a substitution. If the default behavior is to escape for HTML, you’ll start omitting the | html part, and then it’s easy to accidentally miss the | javascript_string filter because it’s used so much less frequently.

    And if you do miss it, will you even notice? You’ll only see problems with strings that contain syntax that’s meaningful to JavaScript. So it becomes a bug that happens infrequently, which makes it harder to find later on. SQL would actually suffer from the same problem if you skipped the interpolation functions:

    # Don't do this! It's SO UNSAFE!
    # But it results in working code, and doesn't even break for most typical
    # inputs, and that's almost worse!
    
    email = request.GET['email']     # eg. foo@example.com
    db.execute(f"SELECT * FROM foo WHERE email = '{email}'")
    

    Too many contexts

    If you have a lot of different contexts that substitutions need to be placed into, it can be arduous to make sure they’re all correct:

    # NOTE: Function {{ fn_name | for_comment }} is generated from a template.
    def {{ fn_name | for_identifier }}():
        num1 = {{ num1 | for_number }}
        num_squared = num1 * num1
    
        logging.debug(r'{{ name | for_raw_string }}')
    
        print(f'Hey {{ name | for_string }}, your number {{ num1 | for_string }} squared is {num_squared}!')
    

    The idea here is that you need to escape/quote/etc. values depending on how they’re being used. Like, in a string, \ should be escaped to \\, but that would be inappropriate in a raw string. Numbers shouldn’t have spaces or anything in them.
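
    To make that concrete, here’s a rough, hypothetical sketch of a few of those filters in Python (the names come from the template above; the implementations are my own guesses at the minimum each context needs):

    import keyword

    def for_string(value):
        # Escape for use inside an existing single-quoted f-string literal:
        # backslashes, quotes, newlines, and braces all need treatment.
        return (value.replace('\\', '\\\\')
                     .replace("'", "\\'")
                     .replace('\n', '\\n')
                     .replace('{', '{{')
                     .replace('}', '}}'))

    def for_identifier(value):
        # Refuse anything that isn't a legal, non-keyword Python identifier.
        if not value.isidentifier() or keyword.iskeyword(value):
            raise ValueError(f'unsafe identifier: {value!r}')
        return value

    def for_number(value):
        # Round-trip through int/float so stray characters can't sneak in.
        return repr(int(value)) if isinstance(value, int) else repr(float(value))

    And for_raw_string is even trickier: you can’t escape anything inside a raw string, so about all a filter can do is reject values that contain a quote character or end with a backslash.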

    Not using templates

    There are libraries that allow you to generate HTML by writing code in your host language. This is comparable to generating JSON using a JSON library, or XML using an XML library, and it can also be a pain:

    doc = htmltag('html')
    body = htmltag('body').add_to(doc)
    
    # Static elements are too much work to create.
    div = htmltag('div', {
        'id': 'outermost'
    }).add_to(body)
    
    # Attribute values are too much work, but they'll be properly escaped on output.
    textinput = htmltag('input', {
        'type': 'text',
        'name': 'user_email',
        'value': email_address
    }).add_to(div)
    
    # When htmltext is rendered to HTML, its contents are escaped.
    htmltext(f'Hello, {email}!').add_to(div)
    
    response.send(doc.render_to_html())
    

    In my experience, nobody wants to write HTML in anything other than an HTML file. Totally understandable - editors have good syntax highlighting, feedback, etc. for HTML, and you’re mostly writing static HTML anyway with only a few substitutions here and there.

    What do I want, anyway?

    I don’t really know. I think templating has substantial problems.

    Frameworks usually have default settings that “just work” for most cases, so it’s easy to get complacent in less-common situations. Or people never really learn how templating works, and miss the gotchas when they need to substitute into different contexts, like JavaScript code.

    Even if you’re well-aware of the limitations and gotchas, it’s also easy to make a mistake and not notice until some unusual text shows up and breaks your output.


    By the way:

    <script>
        database.saveEmail('{{ email | javascript_string }}');
    </script>
    

    It’s inappropriate to replace < and > with HTML entities in a JavaScript string, so a correct javascript_string filter will pass them through verbatim. What happens, then, if email contains the text </script>?


  • Using a C library from Java

    Recently I’ve been considering making Java bindings to an open-source C library.

    It’s such a pain though.

    Native binding to C library

    Traditionally you’d do this with JNI:

    1. Compile the C library,
    2. Write some JNI glue in C and Java,
    3. Package it all up into a JAR

    Writing JNI isn’t trivial. I experimented with it and there are a lot of gotchas around memory management and string handling. I’m confident I could manage it but it’s a lot of work even to just move a UTF-8 string from C into Java without leaking memory or mis-handling exceptions.

    Java 22 reached General Availability recently (March 2024), and it includes the first non-preview release of the Java Foreign Function and Memory (FFM) API, which gives Java a libffi- or Python-ctypes-style mechanism - a role that Java Native Access (JNA) has also long filled.

    With that approach, you don’t write any glue code in C: Instead, you describe the C library’s exports in Java and use FFM/JNA to access them.

    So then, your process looks like:

    1. Compile the C library,
    2. Write FFM/JNA glue in Java,
    3. Package it all up into a JAR

    It’s still not perfect, though. In C, you can have platform- and implementation-dependent definitions of primitive types, standard library types, typedefs, etc. These are resolved at compile time, so JNI glue (which is itself compiled C code) gets the chance to adapt to the platform’s definitions in a general way.

    I wrote about this problem before: Using setjmp/longjmp from Java.

    setjmp is a pretty obscure example, though. Here’s an easier example: long int is 64 bits on Linux x86_64, but 32 bits on Windows x86_64, and also on both Linux & Windows x86_32. So if you want to call unsigned long strtoul(...), you need to know how big unsigned long is at runtime when you’re describing strtoul to FFM/JNA.
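
    For comparison, Python’s ctypes (mentioned above) ships that platform knowledge with it: you name c_ulong and ctypes resolves its size at runtime. A rough sketch (assuming a Unix-like libc; this is just an illustration, not the Java code I’d actually write):

    import ctypes
    import ctypes.util

    # unsigned long is 8 bytes on Linux/macOS x86_64 but 4 bytes on Windows x86_64;
    # ctypes.c_ulong resolves to the platform's size at runtime.
    print(ctypes.sizeof(ctypes.c_ulong))

    libc = ctypes.CDLL(ctypes.util.find_library('c'))
    libc.strtoul.restype = ctypes.c_ulong
    libc.strtoul.argtypes = [ctypes.c_char_p,
                             ctypes.POINTER(ctypes.c_char_p),
                             ctypes.c_int]
    print(libc.strtoul(b'1234', None, 10))   # -> 1234

    An FFM/JNA binding has to make the same decision itself, picking a 64-bit or 32-bit layout for unsigned long depending on the platform it finds itself running on.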

    In theory, types and sizes will vary depending on:

    • Operating system (Linux, Windows, macOS, …)
    • C library (glibc, musl, MSVC, …)
    • Compiler (gcc, clang, Visual C++, …)

    Each of the above typically chooses different behaviour depending on CPU architecture (x86_32, x86_64, 64-bit ARM, …).

    Compile the C library

    The C library you’re wrapping also needs to be compiled to match all of the above too.

    Most Linux distributions standardize on glibc, but musl is also common (like on Alpine Linux, which is extremely common in Docker images).

    Practically speaking, you’ll need to link dynamically against the same C library that’s being used on the system. If you bring another C library in (like through static linking, or including it as a pre-packaged dynamic library), you’re likely to encounter conflicts with the system-installed library.

    You can avoid that problem with statically-linked executables, but libraries are not executables, so they have less control over their immediate execution environment. That is, they need to stay compatible with other libraries that are also linked into the same process, which are almost certainly using the system C library.

    More precisely, you need to link against a version of the C library that is ABI-compatible with the one that’s present at runtime: If I compile my library against glibc 2.13 x86_64 Linux, I can be pretty confident it’ll run on glibc 2.15 x86_64 Linux, because glibc is backward-compatible. However, glibc is not forward-compatible, so it won’t run on glibc 2.10 x86_64 Linux. And of course, it won’t work on x86_32, musl, etc.

    This doesn’t apply to JUST the C library, but any dependency you need to link against. That could include a C++ standard library or other more exotic dependencies, depending on the library you’re trying to wrap.

    And since Java is used on so many different platforms…

    … you end up having to compile your library for every possible combination you’re willing to support.

    Look at the matrix of operating systems that SQLite-JDBC supports: they compile the SQLite C library separately for each of those targets! You can peek at the platform targets in the Makefile for a hint on how they’re cross-compiling. I find their approach very impressive, but it sure seems like a lot of work to maintain!

    Translate to JVM

    Emscripten compiles C code and libraries to JavaScript/WebAssembly. At a very high level, it does this by providing the C standard library functionality (primarily, the functionality normally provided by the OS kernel), and using JS/WASM as the compilation target.

    In JS, you basically have no choice: you can’t run native code in the JavaScript sandbox, so you have to provide everything as JS/WASM code.

    You’d expect a performance hit for this, but that’s OK for a lot of libraries. It’s true for me, too: the library I want to bring to Java offers unique functionality, and doesn’t necessarily need to run fast.

    So I’d like to use a similar approach in Java:

    1. Compile a C library to JVM bytecode (or even plain Java code),
    2. Write glue to provide a better Java-style interface,
    3. Package it all up as a JAR

    This is totally feasible. I’ve found two projects that use translation to achieve this:

    • LLJVM translates LLVM IR (bitcode) to Java bytecode and provides the C standard library via newlib & custom Java code. Inactive since ~2010.
    • NestedVM translates MIPS binaries to Java bytecode (GCC can produce MIPS binaries). Inactive since ~2009, with some more recent updates available on a fork.

    A lot of the discourse I’ve read focuses on how people want to call native C libraries from Java because the native code is expected to perform better, so this kind of “translation” approach typically gets dismissed: Why write your high-performance code in C in the first place if you’re going to run it in the JVM?

    The library I want to wrap:

    1. Doesn’t require high performance
    2. Wasn’t written by me, so I didn’t have the choice to write it in Java vs. C
    3. Has unique and specialized functionality that’s hard to replicate
    4. Already works in Emscripten

    So I’d love to try a translation approach and see how it works. It has some pretty significant advantages over using a native library:

    1. No difficulty compiling for every combination of OS, CPU architecture, compiler, and standard library
    2. No difficulty porting, running, and testing on obscure platforms/configurations
    3. Low maintenance: Java code, even compiled, typically ages well and works unmodified for years/decades

    It sounds absolutely delightful!


  • Uncommon zsh shell techniques, part 1

    Some of these work in other shells too, but I only use zsh these days.

    Anonymous functions

    () { cp "$1" /tmp/ } filename
    

    This works the same as cp filename /tmp/, but it’s more convenient in some cases:

    • When you’re running the same command on many filenames, and want to use command history (up + enter) to modify the filename. You don’t have to position the cursor onto the filename mid-command - it’s just at the end.
    • When you use the argument multiple times, or need to use variable modifiers on the input:
      # Print the directory containing the passed file
      () { echo "${1:A:h}" } file.csv
      
      # Transcode to mp3, unless the source is already mp3
      () { newfn="${1:r}.mp3"; if [[ "$1" != "$newfn" ]]; then echo "$1 -> $newfn"; ffmpeg -i "$1" "$newfn"; fi } foo.flac
      

    It’s also safe to use with spaces & other sensitive characters.

    zsh -c with variables

    # List all files without extensions
    find . -type f -exec zsh -c 'printf "%s\n" "${1:r}"' . '{}' ';'
    

    It’s tempting and common to place {} directly into the argument passed to zsh -c, like this:

    # DON'T DO THIS!
    find . -type f -exec zsh -c 'printf "%s\n" "{}"' ';'
    

    This will cause problems for filenames that contain special characters like ", because find (and many other programs) won’t escape them. How could it? It doesn’t know what escaping strategy to use, because it depends on the command you’re invoking. For example, we’re using zsh here, but if you were writing inline Python code you’d need to escape the string following Python rules instead of zsh.

    By passing the argument to zsh -c instead, you can use $1 in zsh as a variable with all the safety that comes along with that. You also get to use variable modifiers like :r.

    Note also:

    • I passed . to act as the $0 argument to the command-line script. I’m not using the value of it in the script, but I need to pass it so that the filename is passed as $1.
    • I used printf instead of echo because echo will try to interpret filenames like -n as options.

    Globbing flags and qualifiers

    I find the zsh documentation on filename generation pretty hard to read, but here are some examples I use that might help:

    Globbing flags

    Globbing flags appear right before the part of the glob you want to apply them to. I usually apply them to the whole pattern, so I put them right at the start.

    These require extended_glob (see docs) to be set.

    # Match all .jpg files, matched case-insensitively (so it also includes
    # *.JPG, *.Jpg, etc.), like the option nocaseglob.
    setopt extended_glob
    echo (#i)*.jpg
    

    Glob qualifiers

    Glob qualifiers are suffixes that modify how the glob works.

    # List all jpg and gif files. No matches = no arguments.
    echo *.jpg(N) *.gif(N)
    

    Adding (N) to a glob makes it expand to no arguments if there are no matches (same as the null_glob option). Without this, you’ll typically either pass the raw pattern *.jpg as an argument if there are no matches, or zsh won’t run the command at all and will raise an error instead.

    The exact default behaviour depends on the settings of the null_glob, nomatch, and csh_null_glob options.

    Together

    You can use them together:

    # Match GIF, JPG, JPEG, HEIF, AVIF, and PNG extensions
    # with case-insensitive matching,
    # and run with no arguments if no files match.
    setopt extended_glob
    echo "Image files:" (#i)*.(gif|jpe#g|heif|avif|png)(N)
    

  • Bluetooth codec scripts for pulseaudio

    I made some scripts to help me see what codecs are supported by my Bluetooth audio devices, and select the one I want.

    My devices were coming up with the sbc codec, which is the most basic one, but they support higher-bitrate codecs. Selection is a little clumsy on my headless, ssh-access-only Linux box that I’m playing audio from.

    My devices support for example:

    sbc: SBC
    sbc_xq_453: SBC XQ 453kbps
    sbc_xq_512: SBC XQ 512kbps
    sbc_xq_552: SBC XQ 552kbps
    

    I’m surprised my devices only support sbc codecs and not aac/mp3/whatever else. Actually, I don’t know what’s even typical! Do other operating systems use other codecs? I don’t know! Maybe I’ll try to find out what codecs these devices use on macOS or Windows someday.

    It’s also possible I’m not seeing other options here because pulseaudio only supports sbc for Bluetooth. I have read that pipewire has better Bluetooth codec support, but I’m not currently willing to swap a working audio setup (pulseaudio) for one that might need tweaking (pipewire).

    The scripts are available on GitHub.

    Current mood: 😀 accomplished
    Current music: Big Wreck - Hey Mama


  • My first 4K monitor, on Windows

    I just got a pair of 4K monitors - one for a Mac Mini, and one for Windows.

    The Mac is hooked up over HDMI and I use it purely for desktop applications. It works fine.

    But on Windows, I’ve encountered a surprising number of issues.

    Problem 1: No display during boot

    I connected the monitor using DisplayPort because it seemed most appropriate to my video card, a GeForce GTX 960. It has 3 DisplayPort ports, and only one HDMI port; and I didn’t know if the HDMI port supported 4K at 60 Hz (it does), but I knew the DisplayPort ports would do it.

    I swapped the monitor while the computer was on, and everything was fine… but when I rebooted, I had no display.

    FIX: It wasn’t super easy to find information about this but eventually I found a post that pointed me towards an NVIDIA firmware update tool for DisplayPort 1.3 and 1.4 displays that fixes the issue:

    Without the update, systems that are connected to a DisplayPort 1.3 / 1.4 monitor could experience blank screens on boot until the OS loads, or could experience a hang on boot.

    Problem 2: Euro Truck Simulator 2 stuck minimized

    UPDATE: Fixed: My Epson scanner software includes a tray icon. If I kill the process, this problem goes away. I guess it’s stealing focus when the resolution & scale change? Even though it isn’t actually showing a window? 🙄

    My video card can’t handle 4K resolutions at a reasonable framerate, so I’m running games at 1080p. Also, I often stream games to my living room TV using Steam, and it’s a 1080p TV so it fits better.

    When I launch Euro Truck Simulator 2, it immediately minimizes into the background, and any attempt to restore it brings it up for a brief moment but then it goes minimized again.

    It doesn’t happen if one of the following is true:

    • ETS2 is run at the desktop resolution—but at 4K it takes a severe framerate hit… or
    • Windows display scale is set to 100%—but at 27” 4K, 150% is far more usable. This is the workaround I’m using but I wish I didn’t have to!

    I don’t know if this is an ETS2 problem specifically, or a Windows problem. I assume other games will be affected too, but I’ve only tried Cities: Skylines and it has no such issue. ETS2 actually changes the desktop resolution for fullscreen, while Cities: Skylines uses a borderless mode that leaves the desktop resolution unchanged; this might explain the difference.

    Problem 3: 1080p not pixel perfect

    A 4K monitor can theoretically upscale 1080p using pixel doubling, where each 1080p pixel is displayed as four 4K pixels (doubled in both the X and Y axes). I want this because it looks clear and perfect, as though I’m using a 1080p monitor…

    … but my particular monitor (LG 27UL550-W) doesn’t do this - it performs smoothing/interpolation of some sort on the upscale, and as a result it looks blurry.

    I feel like my GPU drivers should be able to render at 1080p but output at 4K, but if they can, I haven’t found out how.

    UPDATE: Integer scaling is available in the NVIDIA control panel for Turing-architecture GPUs (GeForce 16xx, GeForce 20xx and up). I have a 960 so outta luck!!

    Problem 4: DisplayPort disconnects when monitor off

    When I turn off the monitor, the computer sees that as a disconnected display. This is a well-known hotplug detection feature.

    This isn’t really a problem when I’m sitting in front of it, but I like to stream games from the computer to my living room TV.

    When I do that, I want to turn off the locally-attached display. But if I do, games basically don’t work - they see no display connected and aren’t able to select a display resolution because there’s no display to switch to. So streaming just doesn’t work at all.

    Even if I’m not streaming, I prefer to have direct control over the power of my display, instead of having to use the display sleep timer to shut it off.

    WORKAROUND: Use HDMI, but long-term I’m probably just gonna have to live with this problem because I understand some features require DisplayPort, like FreeSync. Some monitors have an option to turn off while appearing connected to the computer, but mine doesn’t!

    Other thoughts

    These are all small-ish problems. Some of them have workarounds or whatever, but like, they’re all surprising issues that I feel shouldn’t happen at all. And I’ve only had the monitor for one day!

    Hardware

    • MSI NVIDIA GeForce GTX 960
    • MSI Z270-A Pro motherboard
    • Windows 10 up-to-date
    • NVIDIA drivers up-to-date

    My old monitor is a 2560x1440 panel connected over dual-link DVI. It exhibited none of the above problems, but I used it at 100% scale, at native resolution, and without DisplayPort.

