Un-mangling some mangled unicode

Recently I received some data from an external source that I need to review and correct before use. One of the problems I’ve been addressing is mangled Unicode text.

For example:

b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'

Clearly this is supposed to have an apostrophe ’, but how on earth did it get turned into \xc3\x82\xc2\x80\xc2\x99?

After poking at it with different encodings for a while, I finally figured it out:

# Mangle input
('’'
    .encode('utf-8')    # b'\xe2\x80\x99'
    .decode('latin-1')  # 'â\x80\x99'
    .upper()            # 'Â\x80\x99'
    .encode('utf-8')    # b'\xc3\x82\xc2\x80\xc2\x99'
)

I’ve never seen mangled Unicode get passed through .upper() before. I wasn’t around to see this data get created in the first place, but my guess is something like this happened:

1. Software A accepted the input O’SAMPLA
2. Software A exported the data using UTF-8 encoding
3. Software B imported the data but incorrectly interpreted it using Latin-1 encoding
4. Software B uppercased the data (typical for this software)
5. Software B exported the data using UTF-8 encoding
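
To check that this guess is at least consistent with the bytes I received, the steps can be sketched as two small functions. The names software_a_export and software_b_roundtrip are made up for illustration; I don’t know how either program actually works internally.

```python
def software_a_export(text: str) -> bytes:
    """Hypothetical stand-in for Software A: export text as UTF-8."""
    return text.encode('utf-8')


def software_b_roundtrip(data: bytes) -> bytes:
    """Hypothetical stand-in for Software B, under the guess above."""
    wrongly_decoded = data.decode('latin-1')  # misreads UTF-8 bytes as Latin-1
    uppercased = wrongly_decoded.upper()      # 'â' becomes 'Â'
    return uppercased.encode('utf-8')         # exports the damaged text as UTF-8


# '\u2019' is the right single quotation mark (’)
mangled = software_b_roundtrip(software_a_export('O\u2019SAMPLA'))
print(mangled)  # b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
```

The output matches the bytes in the data, which supports the theory but doesn’t prove it.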

Here’s the reverse, to restore the original data:

# Fix mangled input
(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
    .decode('utf-8')    # 'OÂ\x80\x99SAMPLA'
    .lower()            # 'oâ\x80\x99sampla'
    .encode('latin-1')  # b'o\xe2\x80\x99sampla'
    .decode('utf-8')    # 'o’sampla'
    .upper()            # 'O’SAMPLA'
)

This works for this particular input because Â must become â again before the Latin-1 encode and UTF-8 decode steps can recover the original bytes. I don’t consider it safe to assume this will work for all inputs, though. Some inputs may not have been affected at all by upper(), and applying lower() to them would corrupt them instead of fixing them.
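
Here is a made-up example of an input where lower() does harm. JOSÉ is not from my data; I picked it because the UTF-8 lead byte for É reads as Ã in Latin-1, which is already uppercase, so upper() leaves it alone. The helper names are just for this sketch.

```python
def mangle(text: str) -> bytes:
    """Apply the suspected mangling steps to clean text."""
    return text.encode('utf-8').decode('latin-1').upper().encode('utf-8')


def fix_with_lower(raw: bytes) -> str:
    """The repair from above, including the lower() step."""
    return raw.decode('utf-8').lower().encode('latin-1').decode('utf-8').upper()


def fix_without_lower(raw: bytes) -> str:
    """The same repair with the lower()/upper() steps left out."""
    return raw.decode('utf-8').encode('latin-1').decode('utf-8')


mangled = mangle('JOS\u00c9')  # 'JOSÉ' becomes b'JOS\xc3\x83\xc2\x89'

try:
    fix_with_lower(mangled)
    lower_worked = True
except UnicodeDecodeError:
    # lower() turned 'Ã' (0xC3) into 'ã' (0xE3), which is not the right lead byte
    lower_worked = False

print(lower_worked)                # False
print(fix_without_lower(mangled))  # JOSÉ
```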

Unfortunately I can’t predict with total confidence whether applying lower() is appropriate for each input, so this data is gonna require manual review.
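
To make the manual review less tedious, one option is to compute both repairs for each value and see which of them actually decode. This is only a sketch: candidate_repairs is a hypothetical helper, it assumes the input is valid UTF-8, and it assumes the double-encoding theory above is what happened.

```python
def candidate_repairs(raw: bytes) -> dict:
    """Return the repairs that decode cleanly, keyed by strategy name."""
    text = raw.decode('utf-8')  # assumes raw is valid UTF-8
    variants = {
        'as-is': (text, False),          # upper() did not touch the mangled bytes
        'lowered': (text.lower(), True), # upper() did, so undo it first
    }
    repairs = {}
    for label, (variant, reupper) in variants.items():
        try:
            repaired = variant.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this strategy does not produce valid text
        # lower() discarded the original case, so restore the uppercase form
        repairs[label] = repaired.upper() if reupper else repaired
    return repairs


print(candidate_repairs(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'))  # {'lowered': 'O’SAMPLA'}
print(candidate_repairs(b'SMITH'))  # {'as-is': 'SMITH', 'lowered': 'SMITH'}
```

An empty result suggests the value wasn’t mangled this way at all, and plain ASCII comes back unchanged under both strategies. Anything else still gets looked at by a human, but at least the candidates are laid out side by side.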

Mike's dev journal
