• Un-mangling some mangled unicode

    Recently I got some data from an external source that I’m to review and correct prior to use. One of the things I’ve been addressing is weird Unicode encoding stuff.

    For example:

    b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
    

    Clearly this is supposed to have an apostrophe (’), but how on earth did it get turned into \xc3\x82\xc2\x80\xc2\x99?

    After poking at it with different coding systems for a while, I finally figured it out:

    # Mangle input
    ('’'
        .encode('utf-8')    # b'\xe2\x80\x99'
        .decode('latin-1')  # 'â\x80\x99'
        .upper()            # 'Â\x80\x99'
        .encode('utf-8')    # b'\xc3\x82\xc2\x80\xc2\x99'
    )
    

    I’ve never seen mangled Unicode get passed through .upper() before. I wasn’t around to see this data get created in the first place, but my guess is something like this happened:

    1. Software A accepted the input O’SAMPLA
    2. Software A exported the data using UTF-8 encoding
    3. Software B imported the data but incorrectly interpreted it using Latin-1 encoding
    4. Software B uppercased the data (typical for this software)
    5. Software B exported the data using UTF-8 encoding

    Here’s the reverse, to restore the original data:

    # Fix mangled input
    (b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'
        .decode('utf-8')    # 'OÂ\x80\x99SAMPLA'
        .lower()            # 'oâ\x80\x99sampla'
        .encode('latin-1')  # b'o\xe2\x80\x99sampla'
        .decode('utf-8')    # 'o’sampla'
        .upper()            # 'O’SAMPLA'
    )
    

    This works for this particular input because Â needs to become â before the latin-1/utf-8 interpretation steps, but I don’t consider it appropriate to assume this will work for all inputs. Some inputs may not have been affected by upper() at all, and it would be incorrect to apply lower() to them.

    Unfortunately I can’t predict with total confidence whether applying lower() is appropriate for each input, so this data is gonna require manual review.
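    If the cleanup does get partially automated, one option is to attempt the reverse pipeline and treat any decode/encode failure as "route to manual review". Here's a sketch under the assumptions above (try_unmangle is my own name, not an existing function):

```python
def try_unmangle(data):
    """Attempt to reverse utf-8 -> latin-1 -> upper() -> utf-8 mangling.

    Returns None when the bytes don't fit that pattern, so those rows
    can be flagged for manual review instead of being silently changed.
    """
    try:
        return (data
            .decode('utf-8')    # undo the second utf-8 encoding
            .lower()            # undo Software B's upper()
            .encode('latin-1')  # recover the original utf-8 bytes
            .decode('utf-8')    # interpret them correctly this time
            .upper())           # restore the expected uppercase form
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None

print(try_unmangle(b'O\xc3\x82\xc2\x80\xc2\x99SAMPLA'))  # O’SAMPLA
```

    The caveat still applies: a plain-ASCII input passes through unchanged, so a non-None result doesn't prove the input was ever mangled. This only filters out bytes that clearly don't fit the pattern.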


  • Hurdles to making a multitasking environment on the NES

    I’ve been thinking about what would be required to make a multitasking environment/platform on the NES.

    Requirements:

    • Can load applications on-demand as independent processes
    • Can launch multiple instances of each application
    • Uses cooperative multitasking

    Realistically you’ll want the cartridge to have some RAM and allow bank switching for both RAM and ROM in order to increase the memory & storage available to programs.

    • The 6502 has a single stack that’s fixed to live from $100 to $1ff. This is mapped to console RAM and can’t be bank-switched. Each process wants its own stack, so they’ll either have to share this very limited space, or you’ll have to swap the contents of the stack when switching tasks.
      • Compared to x86, where you can update SS to any segment & SP to any location within the segment.
    • Similarly, process memory stored in system RAM will need to be swapped out on task switch. Memory located above $4020 could be bank-switched instead.
    • Access across banks is desirable, so we can jump to code in a currently-unloaded bank or simply read data from one.
      • We can add code to perform this work and store it in a fixed bank that’s always available, and make the compiler use that instead of a plain JSR.
      • Pointers will need to include bank information as well.
      • Compared to x86, where you can jump to a different segment directly without losing access to the caller’s segment.
    • Graphics/PPU state also needs to be associated with each process.
      • It’s probably easiest to give the active process the full screen, rather than letting background processes share the display (eg. overlapping/tiled windows). The CHR ROM (or other video data) for a background application would likely be swapped out in favor of the active process, so a background window couldn’t be drawn correctly anyway, and there are also challenges around sharing the current palette.
        • It might be possible to switch banks/contexts between scanlines, which would constrain windows to the full screen width but let them stack vertically.
      • A system menu UI could be handled using code & data in reserved always-available banks, like the code used to handle calls and jumps across banks.
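    To make the cross-bank call idea concrete, here's a rough model in C. This is purely illustrative: the names and the mapper interface are invented, and real code would be 6502 assembly writing to the cartridge mapper's bank-select register.

```c
#include <stdint.h>

typedef int (*entry_fn)(void);

/* Hypothetical "far pointer": the bank number travels with the target.
 * On real hardware the second field would be a 16-bit address; a C
 * function pointer stands in for it here. */
typedef struct {
    uint8_t  bank;
    entry_fn entry;
} far_ptr;

uint8_t current_bank = 0;

/* Stand-in for writing the mapper's bank-select register. */
void select_bank(uint8_t bank) { current_bank = bank; }

/* The fixed-bank trampoline: save the caller's bank, map the callee's
 * bank in, call, then restore, so the caller keeps access to its own
 * bank afterwards. A compiler targeting this scheme would emit a call
 * to this helper instead of a plain JSR for cross-bank calls. */
int far_call(far_ptr target) {
    uint8_t saved = current_bank;
    select_bank(target.bank);
    int result = target.entry();
    select_bank(saved);
    return result;
}

/* Example routine that only works while "its" bank is mapped in. */
int routine_in_bank_3(void) { return current_bank == 3 ? 42 : -1; }
```

    Calling far_call((far_ptr){ 3, routine_in_bank_3 }) from bank 0 runs the routine with bank 3 mapped and leaves bank 0 mapped again afterwards.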

  • Using setjmp/longjmp from JNA

    TLDR: I didn’t think it would work, and it didn’t.

    Today I had a goal to call an established C library from Java, but it uses setjmp and longjmp for error reporting.
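    For anyone who hasn't run into this idiom: the caller establishes a recovery point with setjmp, and deep inside a call the library longjmps straight back to it. A minimal sketch (the function names here are made up, not the actual library's API):

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf on_error;

/* Library-style internals: on bad input, longjmp instead of returning.
 * Control jumps straight back to the setjmp call site, skipping every
 * stack frame in between. */
int risky_parse(const char *s) {
    if (s == NULL || *s < '0' || *s > '9')
        longjmp(on_error, 1);
    return *s - '0';
}

/* The caller establishes the recovery point first. setjmp returns 0
 * when setting up, and the value passed to longjmp when unwinding. */
int parse_or_default(const char *s, int fallback) {
    if (setjmp(on_error) != 0)
        return fallback;        /* we got here via longjmp */
    return risky_parse(s);
}
```

    It's exactly this non-local control flow that's the problem for the JVM: longjmp unwinds through stack frames without running any of the cleanup those frames expect.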

    I had been hoping/planning to use Java Native Access to interact with the libraries. This is just a simple hobby project, so I want to keep it as simple as I realistically can. That means I don’t want to add a C build step to my project at all, not to mention having the build target multiple OS platforms and CPU architectures.

    But I didn’t really expect setjmp and longjmp to work in Java. I have no idea what the JVM does with the execution environment and I expected longjmp would interfere with it in a way that would very probably corrupt the JVM’s state.

    I tried it anyway. It didn’t work. The program crashed with SIGABRT after longjmp (running on Linux).


    I encountered some things I found a little more interesting than just “it doesn’t work”, though:

    jmp_buf’s size isn’t predictable

    setjmp requires that you allocate a jmp_buf to store the environment in.

    jmp_buf is defined in the system setjmp.h. On my 64-bit Linux system, sizeof(jmp_buf) == 200, and it’s defined as a 1-element array containing a struct, so it can be allocated easily then passed by reference.

    I dug into setjmp.h first to understand it more, and realized the size of jmp_buf isn’t really predictable:

    1. It varies by architecture even with the same C library, and
    2. The standard only requires jmp_buf to be an array type suitable for holding the environment; what it actually contains is unspecified, so it could amount to little more than an opaque handle.

    setjmp could be a macro

    The standard doesn’t specify whether setjmp is a function or a macro. JNA can only call functions: a macro is expanded by the preprocessor at build time, so it leaves no symbol for JNA to bind to.

    (I didn’t check how it’s implemented in other C libraries, like MSVCRT on Windows or libSystem on macOS.)


    Not exactly related, but I also happened to call fflush(stdout) from Java. It turns out that stdout is actually specified in C89/C99 to be a macro. In glibc it’s also exported as extern FILE *stdout, so I was able to use that, but then my code would not be strictly conforming.


    I guess I’m gonna have to write a C adapter library that’s more Java-friendly.
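    The adapter's job would be to keep the setjmp/longjmp machinery entirely on the C side, so the JVM only ever sees an ordinary return code. A sketch with invented names (the real library's entry points and error channel would differ):

```c
#include <setjmp.h>

static jmp_buf adapter_env;

/* Stand-in for a real library function that reports errors by calling
 * longjmp on the jmp_buf the caller registered. */
int library_op(int x) {
    if (x < 0)
        longjmp(adapter_env, 1);
    return x * 2;
}

/* Java-friendly wrapper: establish the recovery point here in C and
 * translate a longjmp into an error code. JNA can bind this function
 * like any other, and the jump never crosses into JVM frames. */
int safe_library_op(int x, int *result) {
    if (setjmp(adapter_env) != 0)
        return -1;   /* the library bailed out via longjmp */
    *result = library_op(x);
    return 0;        /* success; *result holds the answer */
}
```

    The cost is exactly what I was trying to avoid: this wrapper is a C build step, and it has to be compiled per platform.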



