On generating structured data from templates

Recently I chatted with a friend about generating structured data from templates. Specifically, he observed that Jekyll’s atom feed is generated from an XML template. He posted about his experience.

I’ve long felt that it’s extremely challenging to correctly use templates to generate structured data, meaning files like source code - HTML/XML/C/etc. I don’t particularly like templating, but I acknowledge its value, especially in HTML templating where you’re mostly writing static markup to represent the page layout. In fact, template-engine substitutions in HTML templates are only a small part of most HTML template files.

The problem I have is that the template engine doesn’t actually know the context for how it’s making substitutions. No HTML template engine parses the HTML and decides what escaping rules are appropriate for each piece of data being placed into the generated output.

This post is kinda rambly, but I have some lightly-organized thoughts about templating that I wanted to put into words.

A typical problem

Let me give an example:

<div id="name">Hello, {{ email }}!</div>
<script>
    database.saveEmail('{{ email }}');
</script>

If email is 'Baz' <foo.bar@example.com>, what happens?

I see two typical options, both of which produce the wrong output:

The value of email is substituted directly into the template with no modification. The resulting output is incorrect:

  <!-- WRONG! <foo...> is not a valid HTML tag! -->
  <div id="name">Hello, 'Baz' <foo.bar@example.com>!</div>
  <script>
      // WRONG! Incorrectly quoted; leads to syntax error!
      database.saveEmail(''Baz' <foo.bar@example.com>');
  </script>

The value of email is escaped for HTML. Django and many other template engines will do this for regular strings.

  <!-- Correct: The string is escaped for inclusion as plain text in HTML. -->
  <div id="name">Hello, 'Baz' &lt;foo.bar@example.com&gt;!</div>
  <script>
      // WRONG! Still incorrectly quoted, and now we're incorrectly
      // storing HTML entities in the database instead of the
      // original string!
      database.saveEmail(''Baz' &lt;foo.bar@example.com&gt;');
  </script>
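Both outcomes are easy to reproduce. Here’s a minimal sketch in Python, assuming a toy render function (hypothetical, not any real engine’s API) that does plain string replacement, with an optional HTML-escaping pass like the one many engines apply by default:

```python
import html

TEMPLATE = """<div id="name">Hello, {{ email }}!</div>
<script>
    database.saveEmail('{{ email }}');
</script>"""


def render(template: str, email: str, autoescape: bool) -> str:
    """Toy stand-in for a template engine: one variable, one escaping rule."""
    # quote=False leaves ' and " alone, matching the output shown above.
    value = html.escape(email, quote=False) if autoescape else email
    return template.replace("{{ email }}", value)


email = "'Baz' <foo.bar@example.com>"
raw_output = render(TEMPLATE, email, autoescape=False)
escaped_output = render(TEMPLATE, email, autoescape=True)

# Either way the <script> block ends up with a broken string literal,
# because the same rule was applied to both substitution sites.
print(raw_output)
print(escaped_output)
```

The point of the sketch is that render has no idea which of the two substitution sites it’s filling in, so it can’t pick different rules for them.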

Comparison to SQL

The problem is similar to SQL injection, although the stakes are usually a lot lower. The typical way to avoid SQL injection bugs is to use the SQL-aware interpolation functions provided by our SQL libraries:

db.execute('SELECT * FROM foo WHERE email = ?', email)

In SQL, this works because SQL allows every value in a query to be represented as a string literal with consistent quoting & escaping rules, regardless of the data type of the value being represented:

select '3'::integer + '4'::integer;
-- Result: 7

But SQL-aware interpolation only works for data. Other SQL syntax and identifiers cannot be represented as string data, so this code:

# This won't work!
table_name = "user_table"
db.execute("SELECT COUNT(*) FROM ?", table_name)

results in an invalid query:

SELECT COUNT(*) FROM 'user_table';
-- ERROR:  syntax error at or near "'user_table'"
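The same split between data and identifiers can be seen with Python’s built-in sqlite3 module. This is only a sketch against an in-memory database; the table and values are made up for illustration, and the exact error differs between databases and drivers (SQLite rejects the placeholder itself rather than substituting a quoted string):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (email TEXT)")
conn.execute("INSERT INTO user_table VALUES (?)", ("foo@example.com",))

# A placeholder in a data position works fine.
rows = conn.execute(
    "SELECT * FROM user_table WHERE email = ?", ("foo@example.com",)
).fetchall()

# A placeholder in an identifier position is rejected.
try:
    conn.execute("SELECT COUNT(*) FROM ?", ("user_table",))
    identifier_error = None
except sqlite3.OperationalError as exc:
    identifier_error = exc

print(rows)
print(identifier_error)
```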

In SQL, you often avoid this by marking the interpolated string as “safe”, which indicates that you’ve already verified that it won’t lead to problems if it’s substituted raw:

table_name = "user_table"
db.execute("SELECT COUNT(*) FROM ?", AsIs(table_name))

You do have to be really careful that table_name doesn’t include anything malicious, since its contents will be interpreted as valid SQL syntax. I might even suggest that we should be able to tag it as a different kind of identifier, like TableName(table_name), so the interpolating code can validate/quote/escape it for use ONLY as a table name.
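To make that idea a bit more concrete, here’s a rough sketch of what a TableName wrapper could look like. The class, its validation rule, and the count_rows helper are all hypothetical; I’m not aware of a library that exposes exactly this, and a real version would need to follow the identifier rules of the database in question:

```python
import re
import sqlite3


class TableName:
    """Hypothetical wrapper: a string that may only be used as a table name."""

    _VALID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

    def __init__(self, name: str) -> None:
        # Assumed rule: plain ASCII identifiers only. Real databases allow more.
        if not self._VALID.fullmatch(name):
            raise ValueError(f"not a plain table name: {name!r}")
        self.name = name

    def quoted(self) -> str:
        # Double quotes are the standard SQL identifier quoting; the
        # validation above guarantees the name contains no quotes itself.
        return f'"{self.name}"'


def count_rows(conn: sqlite3.Connection, table: TableName) -> int:
    # The identifier is spliced in only after validation and quoting.
    query = f"SELECT COUNT(*) FROM {table.quoted()}"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (email TEXT)")
conn.execute("INSERT INTO user_table VALUES ('foo@example.com')")
print(count_rows(conn, TableName("user_table")))
```

Unlike a blanket “safe” marker, the wrapper says what the string is supposed to be, so the code doing the interpolation can refuse anything that doesn’t look like a table name.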

Usage contexts

The main problem I see with HTML and other languages is that there are way more different kinds of contexts that variables get substituted into.

Above, I showed that one escaping rule isn’t sufficient when a variable gets substituted into both HTML and JavaScript.

A common solution here is to indicate the context to your template engine - perhaps using a filter:

<div id="name">Hello, {{ email | html }}!</div>
<script>
    database.saveEmail('{{ email | javascript_string }}');
</script>
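As a rough idea of what those two filters might do, here’s a sketch in Python. The function names mirror the filters in the template but are my own; real engines ship their own versions with different names and rules, and this one handles quoting only:

```python
import html
import json


def html_filter(value: str) -> str:
    """Escape for use as text between HTML tags."""
    return html.escape(value, quote=False)


def javascript_string(value: str) -> str:
    """Escape for use inside a single-quoted JavaScript string literal."""
    # json.dumps produces a double-quoted literal with backslashes, double
    # quotes, and control characters escaped; strip its surrounding quotes.
    body = json.dumps(value)[1:-1]
    # Single quotes are not escaped by JSON, so handle them separately.
    return body.replace("'", "\\'")


email = "'Baz' <foo.bar@example.com>"
div_text = html_filter(email)
js_text = javascript_string(email)

print(div_text)
print(js_text)
```

Same input, two different outputs, and nothing but the template author’s memory decides which one gets used where.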

This works OK when the number of different usage contexts is small, like in this example, but I don’t like that you have to remember to use the right filter every time you code in a substitution. If the default behavior is to escape for HTML, you’ll start omitting the | html part, and then it’s easy to accidentally miss the | javascript_string filter because it’s used so much less frequently.

And if you do miss it, will you even notice? You’ll only see problems with strings that contain syntax that’s meaningful to JavaScript. So it becomes a bug that happens infrequently, which makes it harder to find later on. Naive string interpolation in SQL suffers from the same problem:

# Don't do this! It's SO UNSAFE!
# But it results in working code, and doesn't even break for most typical
# inputs, and that's almost worse!

email = request.GET['email']     # eg. foo@example.com
db.execute(f"SELECT * FROM foo WHERE email = '{email}'")
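Here’s that failure mode played out against an in-memory SQLite database. It’s a sketch with made-up rows, and unsafe_lookup is a hypothetical helper that does exactly what the snippet above warns against:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (email TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?)",
    [("foo@example.com",), ("bar@example.com",)],
)


def unsafe_lookup(email: str) -> list:
    # Deliberately unsafe: the value is pasted into the SQL text.
    return conn.execute(f"SELECT * FROM foo WHERE email = '{email}'").fetchall()


# Typical input: works, so the bug goes unnoticed.
typical = unsafe_lookup("foo@example.com")

# Input containing a quote: the query no longer parses.
try:
    unsafe_lookup("o'brien@example.com")
    quote_error = None
except sqlite3.OperationalError as exc:
    quote_error = exc

# Crafted input: the WHERE clause is rewritten and every row comes back.
injected = unsafe_lookup("' OR '1'='1")

print(typical, quote_error, injected)
```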

Too many contexts

If you have a lot of different contexts that substitutions need to be placed into, it can be arduous to make sure they’re all correct:

# NOTE: Function {{ fn_name | for_comment }} is generated from a template.
def {{ fn_name | for_identifier }}():
    num1 = {{ num1 | for_number }}
    num_squared = num1 * num1

    logging.debug(r'{{ name | for_raw_string }}')

    print(f'Hey {{ name | for_string }}, your squared number {{ num1 | for_string }} is {num1}!')

The idea here is that you need to escape/quote/etc. values depending on how they’re being used. Like, in a string, \ should be escaped to \\, but that would be inappropriate in a raw string. Numbers shouldn’t have spaces or anything in them.
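A few of those filters could look something like this. It’s only a sketch: the names match the template above, but the rules are my assumptions about what “correct” means in each context, and for_string here covers a plain single-quoted string only (inside an f-string you’d also have to double any { and }, which is one more context to remember):

```python
import keyword
import math


def for_identifier(value: str) -> str:
    """Accept only strings that are usable as a Python identifier."""
    if not value.isidentifier() or keyword.iskeyword(value):
        raise ValueError(f"not a valid identifier: {value!r}")
    return value


def for_number(value) -> str:
    """Accept only finite ints and floats, and render them as literals."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not a number: {value!r}")
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError(f"no literal syntax for: {value!r}")
    return repr(value)


def for_string(value: str) -> str:
    """Escape for use inside a single-quoted, non-raw string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n")


print(for_identifier("my_fn"), for_number(12), for_string("C:\\dir"))
```

Note that two of the three filters validate and reject rather than escape. There’s no sensible way to “escape” a space into an identifier or a number.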

Not using templates

There are libraries that allow you to generate HTML by writing code in your host language. This is comparable to generating JSON using a JSON library, or XML using an XML library, and it can also be a pain:

doc = htmltag('html')
body = htmltag('body').add_to(doc)

# Static elements are too much work to create.
div = htmltag('div', {
    'id': 'outermost'
}).add_to(body)

# Attribute values are too much work, but they'll be properly escaped on output.
textinput = htmltag('input', {
    'type': 'text',
    'name': 'user_email',
    'value': email_address
}).add_to(div)

# When htmltext is rendered to HTML, its contents are escaped.
htmltext(f'Hello, {email}!').add_to(div)

response.send(doc.render_to_html())
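The htmltag API above is made up, but Python’s standard xml.etree.ElementTree gives a feel for the same approach. This sketch uses it as a stand-in: it’s an XML library, so the output is XML-style markup rather than something tuned for HTML, and the element layout is just for illustration:

```python
import xml.etree.ElementTree as ET

email = "'Baz' <foo.bar@example.com>"

doc = ET.Element("html")
body = ET.SubElement(doc, "body")
div = ET.SubElement(body, "div", {"id": "outermost"})

# Attribute values and text are stored as plain strings...
ET.SubElement(div, "input", {"type": "text", "name": "user_email", "value": email})
greeting = ET.SubElement(div, "p")
greeting.text = f"Hello, {email}!"

# ...and escaped according to where they sit when the tree is serialized.
markup = ET.tostring(doc, encoding="unicode")
print(markup)
```

Because the library knows whether a string is an attribute value or text content, the escaping decision is never left to the person writing the code.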

In my experience, nobody wants to write HTML in anything other than an HTML file. Totally understandable - editors have good syntax highlighting, feedback, etc. for HTML, and you’re mostly writing static HTML anyway with only a few substitutions here and there.

What do I want anyway

I don’t really know. I think templating has substantial problems.

I feel like frameworks usually have default settings that “just work” for most cases, so it’s easy to get complacent in less-common situations. Or people fail to gain an understanding of how templating works, and of the gotchas when they need to substitute into different contexts, like JavaScript code.

Even if you’re well-aware of the limitations and gotchas, it’s also easy to make a mistake and not notice until some unusual text shows up and breaks your output.

By the way:

<script>
    database.saveEmail('{{ email | javascript_string }}');
</script>

It’s inappropriate to replace < and > with HTML entities in JavaScript strings, so they’ll get substituted verbatim in the string. What happens if email contains the text </script>?
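For what it’s worth, the HTML parser ends the script element at the first </script> it sees, whether or not it’s inside a JavaScript string. One common workaround is to write < as a JavaScript escape sequence instead. Here’s a sketch of that, with a hypothetical filter name, and with no claim that it covers every edge case of embedding data in a script block:

```python
import json


def javascript_string_in_script(value: str) -> str:
    """Escape for a single-quoted JS string that sits inside <script>."""
    body = json.dumps(value)[1:-1].replace("'", "\\'")
    # \u003c is the JavaScript escape for "<", so the decoded string is
    # unchanged but the HTML parser never sees "</script>" or "<!--".
    return body.replace("<", "\\u003c")


payload = "x</script><script>alert(1)</script>"
escaped = javascript_string_in_script(payload)
print(escaped)
```

So even the “JavaScript string” context turns out to be two contexts: a JavaScript string, and a JavaScript string embedded in HTML.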

Mike's dev journal
