Un-mangling some mangled Unicode
Recently I got some data from an external source that I’m to review and correct prior to use. One of the things I’ve been addressing is weird Unicode encoding stuff.
For example:
b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
Clearly this is supposed to have an apostrophe (’), but how on earth did it get turned into \xc3\x82\xc2\x80\xc2\x99?
After poking at it with different encodings for a while, I finally figured it out:
# Mangle input
('’'
.encode('utf-8') # b'\xe2\x80\x99'
.decode('latin-1') # 'â\x80\x99'
.upper() # 'Â\x80\x99'
.encode('utf-8') # b'\xc3\x82\xc2\x80\xc2\x99'
)
I’ve never seen mangled Unicode get passed through .upper() before. I wasn’t around to see this data get created in the first place, but my guess is something like this happened:
- Software A accepted the input O’SAMPLA
- Software A exported the data using UTF-8 encoding
- Software B imported the data but incorrectly interpreted it using Latin-1 encoding
- Software B uppercased the data (typical for this software)
- Software B exported the data using UTF-8 encoding
Here’s the reverse, to restore the original data:
# Fix mangled input
(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
.decode('utf-8') # 'OÂ\x80\x99SAMPLA'
.lower() # 'oâ\x80\x99sampla'
.encode('latin-1') # b'o\xe2\x80\x99sampla'
.decode('utf-8') # 'o’sampla'
.upper() # 'O’SAMPLA'
)
This works for this particular input because Â needs to become â before the latin-1/utf-8 interpretation steps, but I don’t consider it appropriate to assume this will work for all inputs. Some inputs may not have been affected at all by upper(), and it would be incorrect to apply lower() to them.
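To make that concrete, here is a small sketch with a made-up input, JOSÉ, which is not from the actual data. Assuming it went through the same pipeline, upper() would have left its mangled form unchanged, so reversing without lower() works while reversing with lower() fails:

```python
# Hypothetical value (not from the real data) run through the same mangling.
# The mojibake for É starts with Ã, which is already uppercase, so upper() is a no-op.
mangled = 'JOSÉ'.encode('utf-8').decode('latin-1').upper().encode('utf-8')

# Reversing WITHOUT lower() recovers the original text.
direct = mangled.decode('utf-8').encode('latin-1').decode('utf-8')

# Reversing WITH lower() turns Ã into ã, and the resulting bytes are no longer valid UTF-8.
try:
    mangled.decode('utf-8').lower().encode('latin-1').decode('utf-8')
    lower_works = True
except UnicodeDecodeError:
    lower_works = False
```

Here the failure is at least loud. The worrying case is an input where lower() produces a different string that still decodes.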
Unfortunately I can’t predict with total confidence whether applying lower() is appropriate for each input, so this data is gonna require manual review.
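One way to make the manual review less painful might be to try both reversals on each value and record which ones produce valid text. The sketch below is only a triage aid, not a decision procedure. The function name candidate_repairs is made up for illustration, and it assumes the input is valid UTF-8 and that the only mangling paths are the ones described above:

```python
def candidate_repairs(raw: bytes) -> dict:
    """Return each reversal strategy that yields valid text for raw.

    Assumes raw is valid UTF-8; the strategy names are illustrative only.
    """
    text = raw.decode('utf-8')
    candidates = {}
    for name, use_lower in (('direct', False), ('lowercased', True)):
        attempt = text.lower() if use_lower else text
        try:
            repaired = attempt.encode('latin-1').decode('utf-8')
        except UnicodeError:
            # Either the text has characters outside Latin-1, or the
            # re-interpreted bytes are not valid UTF-8: strategy does not apply.
            continue
        # The lowercased strategy needs upper() afterwards to restore the case.
        candidates[name] = repaired.upper() if use_lower else repaired
    return candidates


sample = candidate_repairs(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA')
plain = candidate_repairs(b'SMITH')
untouched = candidate_repairs('JOSÉ'.encode('utf-8'))
```

For the sample input only the lowercased strategy survives. Plain ASCII passes both strategies with identical results, and correctly encoded non-ASCII text passes neither, which suggests leaving it alone. Anything where the strategies disagree still needs a human to look at it.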