# Temperature logging: redux

Previously as I was experimenting with logging the temperature using a Raspberry Pi (to monitor the temperatures experienced by fermenting cider), I noted that the Pi was something of a terrible hack, and it should be possible to do more efficiently with some slightly less common hardware.

I decided that improved version would be interesting to build for use at home, since it’s both kind of fun to collect data like that, and actually knowing the temperature in the house is useful at times. The end result of this project is that I can generate graphs like the one below of conditions around the house:

# Building a terrible ‘IoT’ temperature logger

I had approximately the following exchange with a co-worker a few days ago:

Them: “Hey, do you have a spare Raspberry Pi lying around?”
Me: [thinks] “..yes, actually.”
T: “Do you want to build a temperature logger with Prometheus and a DS18B20+?
M: “Uh, okay?”

It later turned out that that co-worker had been enlisted by yet another individual to provide a temperature logger for their project of brewing cider, to monitor the temperature during fermentation. Since I had all the hardware at hand (to wit, a Raspberry Pi 2 that I wasn’t using for anything and temperature sensors provided by the above co-worker), I threw something together. It also turned out that the deadline was quite short (brewing began just two days after this initial exchange), but I made it work in time.

# Considering my backup systems

With the recent news that Crashplan were doing away with their “Home” offering, I had reason to reconsider my choice of online backup backup provider. Since I haven’t written anything here lately and the results of my exploration (plus description of everything else I do to ensure data longevity) might be of interest to others looking to set up backup systems for their own data, a version of my notes from that process follows.

## The status quo

I run a Linux-based home server for all of my long-term storage, currently 15 terabytes of raw storage with btrfs RAID on top. The choice of btrfs and RAID allows me some degree of robustness against local disk failures and accidental damage to data.

If a disk fails I can replace it without losing data, and using btrfs’ RAID support it’s possible to use heterogenous disks, meaning when I need more capacity it’s possible to remove one disk (putting the volume into a degraded state) and add a new (larger) one and rebalance onto the new disk.

btrfs’ ability to take copy-on-write snapshots of subvolumes at any time makes it reasonable to take regular snapshots of everything, providing a first line of defense against accidental damage to data. I use Snapper to automatically create rolling snapshots of each of the major subvolumes:

• Synchronized files (mounted to other machines over the network) have 8 hourly, 7 daily, 4 weekly and 3 monthly snapshots available at any time.
• Staging items (for sorting into other locations) have a snapshot for each of the last two hours only, because those items change frequently and are of low value until considered further.
• Everything else keeps one snapshot from the last hour and each of the last 3 days.

This configuration strikes a balance according to my needs for accident recovery and storage demands plus performance. The frequently-changed items (synchronized with other machines and containing active projects) have a lot of snapshots because most individual files are small but may change frequently, so a large number of snapshots will tend to have modest storage needs. In addition, the chances of accidental data destruction are highest there. The other subvolumes are either more static or lower-value, so I feel little need to keep many snapshots of them.

I use Crashplan to back up the entire system to their “cloud”1 service for $5 per month. The rate at which I add data to the system is usually lower than the rate at which it can be uploaded back to Crashplan as a backup, so in most cases new data is backed up remotely within hours of being created. Finally, I have a large USB-connected external hard drive as a local offline backup. Also formatted with btrfs like the server (but with the entire disk encrypted), I can use btrfs send to send incremental backups to this external disk, even without the ability to send information from the external disk back. In practice, this means I can store the external disk somewhere else completely (possibly without an Internet connection) and occasionally shuttle diffs to it to update to a more recent version. I always unplug this disk from power and its host computer when not being updated, so it should only be vulnerable to physical damage and not accidental modification of its contents. # An illustrated guide to LLVM At the most recent Rust Sydney meetup (yesterday, “celebrating” Rust’s second birthday) I gave a talk intended to provide an introduction to using LLVM to build compilers, using Rust as the implementing language. The presentations were not recorded which might have been neat, but I’m publishing the slides and notes here for anybody who might find it interesting or useful. It is however not as illustrated as the title may seem to suggest. It’s embedded below, or you can view standalone in your browser or as a PDF, available with or without presenter notes. Navigate with the arrow keys on your keyboard or by swiping. Press ? to show additional keys for controls; in particular, s will open a presenter view that includes the plentiful notes I’ve included. In case of curiousity, I built the presentation with reveal.js and its preparation consumed a lot more time than I initially expected (though that’s not the fault of reveal). # Quick and dirty web image optimization Given a large pile of images that nominally live on a web server, I want to make them smaller and more friendly to serve to clients. This is hardly novel: for example, Google offer detailed advice on reducing the size of images for the web. I have mostly JPEG and PNG images, so, jpegtran and optipng are the tools of choice for bulk lossless compression. To locate and compress images, I’ll use GNU find and parallel to invoke those tools. For JPEGs I take a simple approach, preserving comment tags and creating a progressive JPEG (which can be displayed at a reduced resolution before the entire image has been downloaded). find -iname '*.jpg' -printf '%P,%s\n' \ -exec jpegtran -o -progressive -copy comments -outfile {} {} \; \ > delta.csv  # sax-ng Over on Cemetech, we’ve long had an embedded chat widget called “SAX” (“Simultaneous Asynchronous eXchange”). It behaves kind of like a traditional shoutbox, in that registered users can use the SAX widget to chat in near-real-time. There is also a bot that relays messages between the on-site widget and an IRC channel, which we call “saxjax”. The implementation of this, however, was somewhat lacking in efficiency. It was first implemented around mid2006, and saw essentially no updates until just recently. The following is a good example of how dated the implementation was: // code for Mozilla, etc if (window.XMLHttpRequest) { xmlhttp=new XMLHttpRequest() xmlhttp.onreadystatechange=state_Change xmlhttp.open("GET",url,true) xmlhttp.send(null) } else if (window.ActiveXObject) { // code for IE xmlhttp=new ActiveXObject("Microsoft.XMLHTTP") if (xmlhttp) { xmlhttp.onreadystatechange=state_Change xmlhttp.open("GET",url,true) xmlhttp.send() } }  The presence of ActiveXObject here implies it was written at a time when a large fraction of users would have been using Internet Explorer 5 or 6 (the first version of Internet Explorer released which supported the standard form of XMLHttpRequest was version 7). Around a year ago (that’s how long this post has been a draft for!), I took it upon myself to design and implement a more modern replacement for SAX. This post discusses that process and describes the design of the replacement, which I have called “sax-ng.” ## Legacy SAX The original SAX implementation, as alluded to above, is based on AJAX polling. On the server, a set of approximately the 30 most recent messages were stored in a MySQL database and a few PHP scripts managed retrieving and modifying messages in the database. This design was a logical choice when initially built, since the web site was running on a shared web host (supporting little more than PHP and MySQL) at the time. Eventually this design became a problem, as essentially every page containing SAX that is open at any given time regularly polls for new messages. Each poll calls into PHP on the server, which opens a database connection to perform one query. Practically, this means a very large number of database connections being opened at a fairly regular pace. In mid-2012 the connection count reached levels where the shared hosting provider were displeased with it, and requested that we either pay for a more expensive hosting plan or reduce resource usage. In response, we temporarily disabled SAX, then migrated the web site to a dedicated server provided by OVH, who had opened a new North American datacenter in July. We moved to the dedicated server in August of 2012. This infrastructure change kept the system running, and opened the door to a more sophisticated solution since we gained the ability to run proper server applications. Meanwhile, the limitations of saxjax (the IRC relay bot) slowly became more evident over time. The implementation was rather ad-hoc, in Python. It used two threads to implement relay, with a dismaying amount of shared state used to relay messages between the two threads. It tended to stop working correctly in case of an error in either thread, be it due to a transient error response from polling the web server for new messages, or an encoding-related exception thrown from the IRC client (since Python 2.x uses bytestrings for most tasks unless specifically told not to, and many string operations (particularly outputting the string to somewhere) can break without warning when used with data that is not 8-bit clean (that is, basically anything that isn’t ASCII). Practically, this meant that the bot would frequently end up in a state where it would only relay messages one way, or relay none at all. I put some time into making it more robust to these kinds of failures early in 2015, such that some of the time it would manage to catch these errors and outright restart (rather than try to recover from an inconsistent state). Doing so involved some pretty ugly hacks though, which prompted a return to some longtime thoughts on how SAX could be redesigned for greater efficiently and robustness. ## sax-ng For a long time prior to beginning this work, I frequently (semi-jokingly) suggested XMPP (Jabber) as a solution to the problems with SAX. At a high level this seems reasonable: XMPP is a chat protocol with a number of different implementations available, and is relatively easy to set up as a private chat service. On the other hand, the feature set of SAX imposes a few requirements which are not inherently available for any given chat service: 1. An HTTP gateway, so clients can run inside a web browser. 2. Group chat, not just one-to-one conversation capability. 3. External authentication (logging in to the web site should permit connection to chat as well). 4. Retrieval of chat history (so a freshly-loaded page can have some amount of chat history shown). As it turns out, ejabberd enables all of these, with relatively little customization. mod_http_bind provides an HTTP gateway as specified in XEP-0206, mod_muc implements multi-user chat as specified in XEP-0045 which also includes capabilities to send chat history to clients when they connect, and authentication can be handled by an external program which speaks a simple binary protocol and is invoked by ejabberd. Main implementation of the new XMPP-based system was done in about a week, perhaps 50 hours of concerted work total (though I may be underestimating). I had about a month of “downtime” at the beginning of this past summer, the last week of which was devoted to building sax-ng. ### ejabberd The first phase involved setting up an instance of ejabberd to support the rest of the system. I opted to run it inside Docker, ideally to make the XMPP server more self-contained and avoid much custom configuration on the server. Conveniently, somebody had already built a Docker configuration for ejabberd with a wealth of configuration switches, so it was relatively easy to set up. Implementing authentication against the web site was also easy, referring to the protocol description in the ejabberd developers guide. Since this hooks into the website’s authentication system (a highly modified version of phpBB), this script simply connects to the mysql server and runs queries against the database. Actual authentication is performed with phpBB SIDs (Session ID), rather than a user’s password. It was built this way because the SID and username are stored in a cookie, which is available to a client running in a web browser. This is probably also somewhat more secure than storing a password in the web browser, since the SID is changed regularly so data exposure via some vector cannot compromise a user’s web site password. Error handling in the authentication script is mostly nonexistent. The Erlang approach to such problems is mostly “restart the component if it fails”, so in case of a problem (of which the only real possibility is a database connection error) ejabberd will restart the authentication script and attempt to carry on. In practice this has proven to be perfectly reliable. In XMPP MUC (Multi-User Chat), users are free to choose any nickname they wish. For our application, there is really only one room and we wish to enforce that the nickname used in XMPP is the same as a user’s username on the web site. There ends up being no good way in ejabberd to require that a user take a given nickname, but we can ensure that it is impossible to impersonate other users by registering all site usernames as nicknames in XMPP. Registered nicknames may only be used by the user to which they are registered, so the only implementation question is in how to automatically register nicknames. I ended up writing a small patch to mod_muc_admin, providing an ejabberdctl subcommand to register a nickname. This patch is included in its entirety below. diff --git a/src/mod_muc_admin.erl b/src/mod_muc_admin.erl index 9c69628..3666ba0 100644 --- a/src/mod_muc_admin.erl +++ b/src/mod_muc_admin.erl @@ -15,6 +15,7 @@ start/2, stop/1, % gen_mod API muc_online_rooms/1, muc_unregister_nick/1, + muc_register_nick/3, create_room/3, destroy_room/3, create_rooms_file/1, destroy_rooms_file/1, rooms_unused_list/2, rooms_unused_destroy/2, @@ -38,6 +39,9 @@ %% Copied from mod_muc/mod_muc.erl -record(muc_online_room, {name_host, pid}). +-record(muc_registered, + {us_host = {\{<<"">>, <<"">>}, <<"">>} :: {\{binary(), binary()}, binary()} | '$1',
+         nick = <<"">> :: binary()}).

%%----------------------------
%% gen_mod
@@ -73,6 +77,11 @@ commands() ->
module = ?MODULE, function = muc_unregister_nick,
args = [{nick, binary}],
result = {res, rescode}},
+     #ejabberd_commands{name = muc_register_nick, tags = [muc],
+              desc = "Register the nick in the MUC service to the JID",
+              module = ?MODULE, function = muc_register_nick,
+              args = [{nick, binary}, {jid, binary}, {domain, binary}],
+              result = {res, rescode}},

#ejabberd_commands{name = create_room, tags = [muc_room],
desc = "Create a MUC room name@service in host",
@@ -193,6 +202,16 @@ muc_unregister_nick(Nick) ->
error
end.

+muc_register_nick(Nick, JID, Domain) ->
+    {jid, UID, Host, _,_,_,_} = jlib:string_to_jid(JID),
+    F = fun (MHost, MNick) ->
+                mnesia:write(#muc_registered{us_host=MHost,
+                                             nick=MNick})
+        end,
+    case mnesia:transaction(F, [{\{UID, Host}, Domain}, Nick]) of
+        {atomic, ok} -> ok;
+        {aborted, _Error} -> error
+    end.

%%----------------------------


It took me a while to work out how exactly to best implement this feature, but considering I had never worked in Erlang before it was reasonably easy. I do suspect some familiarity with Haskell and Rust provided background to more easily understand certain aspects of the language, though. The requirement that I duplicate the muc_registered record (since apparently Erlang provides no way to import records from another file) rubs me the wrong way, though.

In practice, then, a custom script traverses the web site database, invoking ejabberdctl to register the nickname for every existing user at server startup and then periodically or on demand when the server is running.

### Web interface

The web interface into XMPP was implemented with Strophe.js, communicating with ejabberd via HTTP-bind with the standard support in both the client library and server.

The old SAX design served a small amount of chat history with every page load so it was immediately visible without performing any additional requests after page load, but since the web server never receives chat data (it all goes into XMPP directly), this is no longer possible. The MUC specification allows a server to send chat history to clients when they join a room, but that still requires several HTTP round-trips (taking up to several seconds) to even begin receiving old lines.

I ended up storing a cache of messages in the browser, which is used to populate the set of displayed messages on initial page load. Whenever a message is received and displayed, its text, sender and a timestamp are added to the local cache. On page load, messages from this cache which are less than one hour old are displayed. The tricky part with this approach is avoiding duplication of lines when messages sent as part of room history already exist, but checking the triple of sender, text and timestamp seems to handle these cases quite reliably.

### webridge

The second major feature of SAX is to announce activity on the web site’s bulletin board, such as when people create or reply to threads. Since the entire system was previously managed by code tightly integrated with the bulletin board, a complete replacement of the relevant code was required.

In the backend, SAX functionality was implemented entirely in one PHP function, so replacing the implementation was relatively easy. The function’s signature was something like saxSay($type,$who, $what,$where), where type is a magic number indicating what kind of message it is, such as the creation of a new thread, a post in a thread or a message from a user. The interpretation of the other parameters depends on the message type, and tends to be somewhat inconsistent.

The majority of that function was a maze of comparisons against the message type, emitting a string which was eventually pushed into the chat system. Rather than attempt to make sense of that code, I decided to replace it with a switch statement over symbolic values (whereas the old code just used numbers with no indication of purpose), feeding simple invocations of sprintf. Finding the purpose of each of the message types was most challenging among that, but it wasn’t terribly difficult as I ended up searching the entire web site source code for references to saxSay and determined the meaning of the types from the caller’s context.

To actually feed messages from PHP into XMPP, I wrote a simple relay bot which reads messages from a UNIX datagram socket and repeats them into a MUC room. A UNIX datagram socket was selected because there need not be any framing information in messages coming in (just read a datagram and copy its payload), and this relay should not be accessible to anything running outside the same machine (hence a UNIX socket).

The bot is implemented in Python with Twisted, utilizing Twisted’s provided protocol support for XMPP. It is run as a service under twistd, with configuration provided via environment variables because I didn’t want to write anything to handle reading a more “proper” configuration file. When the PHP code calls saxSay, that function connects to a socket with path determined from web site configuration and writes the message into that socket. The relay bot (“webridge”) receives these messages and writes them into MUC.

### saxjax-ng

Since keeping a web page open for chatting is not particularly convenient, we also operate a bridge between the SAX chat and an IRC channel called saxjax. The original version of this relay bot was of questionable quality at best: the Python implementation ran two threads, each providing one-way communication though a list. No concurrency primitives, little sanity.

Prior to creation of sax-ng I had put some amount of effort in improving the reliability of that system, since an error in either thread would halt all processing of messages in the direction corresponding to the thread in which the error occurred. Given there was essentially no error handling anywhere in the program, this sort of thing happened with dismaying frequency.

saxjax-ng is very similar in design to webridge, in that it’s Twisted-based and uses the Twisted XMPP library. On the IRC side, it uses Twisted’s IRC library (shocking!). Both ends of this end up being very robust when combined with the components that provide automatic reconnection and a little bit of custom logic for rotating through a list of IRC servers. Twisted guarantees singlethreaded operation (that’s the whole point; it’s an async event loop), so relaying a message between the two connections is simply a matter of repeating it on the other connection.

## Contact with users

This system has been perfectly reliable since deployment, after a few changes. Most notably, the http-bind interface for ejabberd was initially exposed on port 5280 (the default for http-bind). Users behind certain restrictive firewalls can’t connect to that port, so we quickly reconfigured our web server to reverse-proxy to http-bind and solve that problem. Doing so also means the XMPP server doesn’t need its own copy of the server’s SSL certificate, which would otherwise require periodic maintenance to give a freshly-renewed certificate.

There are still some pieces of the web site that emit messages containing HTML entities in accordance with the old system. The new system.. doesn’t emit HTML entities because that should be the responsibility of something doing HTML presentation (Strong Opinion) and I haven’t bothered trying to find the things that are still emitting HTML-like strings.

The reconnect logic on the web client tends to act like it’s received multiples of every message that arrives after it’s tried to reconnect to XMPP, such as when a user puts their computer to sleep and later resumes; the web client tries to detect the lost connection and reopen it, and I think some event handlers are getting duplicated at that point. Haven’t bothered working on a fix for that either.

# Conclusion

ejabberd is a solid piece of software and not hard to customize. Twisted is a good library for building reliable network programs in Python, but has enough depth that some of its features lack useful documentation so finding what you need and figuring out how to use it can be difficult. This writeup has been languishing for too long so I’m done writing now.

# Web history archival and WARC management

I’ve been a sort of ‘rogue archivist’ along the lines of the Archive Team for some time, but generally lack the combination of motivation and free time to directly take part in their activities. That said, I do sometimes go on bursts of archival since these things do concern me; it’s just a question of when I’ll get manic enough to be useful and latch onto an archival task as the one to do. An earlier public example is when I mirrored ticalc.org.

The historical record contains plenty of instances where people maintained copies of their communications or other documentation which has proven useful to study, and in the digital world the same is likely to be true. With the ability to cheaply store large amounts of data, it is also relatively easy to generate collections in the hope of their future utility.

Something I first played with back in 2014 was extracting lists of web pages to archive from web browser history. From a public perspective this may not be particularly interesting, but if maintained over a period of time this data could be interesting as a snapshot of a typical-in-some-fashion individual’s daily life, or for purposes I can’t foresee.

Today I’m going to write a little about how I collect this data and reduce the space requirements. The products of this work that are source code can be found on Bitbucket.

# HodorCSE

Localization of software, while not trivial, is not a particularly novel problem. Where it gets more interesting is in resource-constrained systems, where your ability to display strings is limited by display resolution and memory limitations may make it difficult to include multiple localized copies of any given string in a single binary. All of this is then on top of the usual (admittedly slight in well-designed systems) difficulty in selecting a language at runtime and maintaining reasonably readable code.

This all comes to mind following discussion of providing translations of Doors CSE, a piece of software for the TI-84+ Color Silver Edition1 that falls squarely into the “embedded software” category. The simple approach (and the one taken in previous versions of Doors CS) to localizing it is just replacing the hard-coded strings and rebuilding.

As something of a joke, it was proposed to make additional “joke” translations, for languages such as Klingon or pirate. I proposed a Hodor translation, along the lines of the Hodor UI patch2 for Android. After making that suggestion, I decided to exercise my skills a bit and actually make one.

## Hodor (Implementation)3

Since I don’t have access to the source code of Doors CSE, I had to modify the binary to rewrite the strings. Referring the to file format guide, we are aware that TI-8x applications are mostly Intel hex, with a short header. Additionally, I know that these applications are cryptographically signed which implies I will need to resign the application when I have made my changes.

### Dumping contents

I installed the IntelHex module in a Python virtualenv to process the file into a format easier to modify, though I ended up not needing much capability from there. I simply used a hex editor to remove the header from the 8ck file (the first 0x4D bytes).

Simply trying to convert the 8ck payload to binary without further processing doesn’t work in this case, because Doors CSE is a multipage application. On these calculators Flash applications are split into 16-kilobyte pages which get swapped into the memory bank at 0x4000. Thus the logical address of the beginning of each page is 0x4000, and programs that are not aware of the special delimiters used in the TI format (to delimit pages) handle this poorly. The raw hex file (after removing the 8ck header) looks like this:

:020000020000FC
:20400000800F00007B578012010F8021088031018048446F6F727343534580908081020382
:2040200022090002008070C39D40C39A6DC3236FC30E70C3106EC3CA7DC3FD7DC3677EC370
:20404000A97EC3FF7EC35D40C35D40C33D78C34E78C36A78C37778C35D40C3A851C9C940F3
:2040600001634001067001C36D00CA7D00BC6E00024900097A00E17200487500985800BDF8
[snip]
:020000020001FB
:20402000EF7D4721B98411AE84010900EDB0EFAA4AC302723A9B87B7CA4940FE01CA4340B9
:20404000C30272CD4F40C30272CDB540C30272EF67452100002275FE3EA03273FECD63405B


Lines 1 and 7 here are the TI-specific page markers, indicating the beginning of pages 0 and 1, respectively. The lines following each of those contain 32 (20 hex) bytes of data starting at address 0x40000 (4000). I extracted the data from each page out to its own file with a text editor, minus the page delimiter. From there, I was able to use the hex2bin.py script provided with the IntelHex module to create two binary files, one for each page.

### Modifying strings

With two binary files, I was ready to modify some strings. The calculator’s character set mostly coincides with ASCII, so I used the strings program packaged with GNU binutils to examine the strings in the image.

$strings page00.bin HDoorsCSE ##6M#60> oJ:T Uo& dQ:T [snip] xImprove BASIC editor Display clock Enable lowercase Always launch Doors CSE Launch Doors CSE with PRGM]  With some knowledge of the strings in there, it was reasonably short work to find them with a hex editor (in this case I used HxD) and replace them with variants on the string “Hodor”. I also found that page 1 of the application contains no meaningful strings, so I ended up only needing to examine page 0. Some of the reported strings require care in modification, because they refer to system-invariant strings. For example, “OFFSCRPT” appears in there, which I know from experience is the magic name which may be given to an AppVar to make the calculator execute its contents when turned off. Thus I did not modify that string, in addition to a few others (names of authors, URLs, etc). ### Repacking I ran bin2hex.py to convert the modified page 0 binary back into hex, and pasted the contents of that file back into the whole-app hex file (replacing the original contents of page 0). From there, I had to re-sign the binary.4 WikiTI points out how easy that process is, so I installed rabbitsign and went on my merry way: $ rabbitsign -g -r -o HodorCSE.8ck HodorCSE.hex


### Testing

I loaded the app up in an emulator to give it a quick test, and was met by complete nonsense, as intended.

I’m providing the final modified 8ck here, for the amusement of my readers. I don’t suggest that anybody use it seriously, not for the least reason that I didn’t test it at all thoroughly to be sure I didn’t inadvertently break something.

## Extending the concept

It’s relatively easy to extend this concept to the calculator’s OS as well (and in fact similar string replacements have been done before) with the OS signing keys in hand. I lack the inclination to do so, but surely somebody else would be able to do something fun with it using the process I outlined here.

1. That name sounds stupider every time I write it out. Henceforth, it’s just “the CSE.”
2. The programmer of that one took is surprisingly far, such that all of the code that feasibly can be is also Hodor-filled
3. Hodor hodor hodor hodor. Hodor hodor hodor.
4. This signature doesn’t identify the author, as you might assume. Once upon a time TI provided the ability for application authors to pay some amount of money to get a signing key associated with them personally, but that system never saw wide use. Nowadays everybody signs their applications with the public “freeware” keys, just because the calculator requires that all apps be signed and the public keys must be stored on the calculator (of which the freeware keys are preinstalled on all of them).

# “A Sufficiently Smart Compiler”

On a bit of a lark today, I decided to see if I could get Spasm running in a web browser via Emscripten. I was successful, but found that something seemed to be optimizing out most of main() such that I had to hack in my own main function that performed the same critical functions and (for the sake of simplicity) hard-coded the relevant command-line options.

Looking into the problem a bit further, I observed that not all of main() was being removed; there was one critical line left in. The beginning of the function in source and the generated code were as follows.

C++ source:

int main (int argc, char **argv)
{
int curr_arg = 1;
bool case_sensitive = false;
bool is_storage_initialized = false;

use_colors = true;
extern WORD user_attributes;
user_attributes = save_console_attributes ();
atexit (restore_console_attributes_at_exit);

//if there aren't enough args, show info
if (argc < 2) {

Generated Javascript (asm.js):

function _main($argc,$argv) {
$argc =$argc | 0;
$argv =$argv | 0;
HEAP8[4296] = 1;
__Z23save_console_attributesv() | 0;
return 0;
}

Spasm is known to work in general, but I found it unlikely that LLVM’s optimizer would be optimizing this code wrong as well. Building with optimizations turned off generated correct code, so it was definitely the optimizer breaking this and not some silly bug in Emscripten. Looking a little deeper into the save_console_attributes function, we see the following code:

WORD save_console_attributes () {
#ifdef WIN32
CONSOLE_SCREEN_BUFFER_INFO csbiScreenBufferInfo;
GetConsoleScreenBufferInfo (GetStdHandle (STD_OUTPUT_HANDLE), &csbiScreenBufferInfo);
return csbiScreenBufferInfo.wAttributes;
#endif
}

Since I’m not building for a Windows target (Emscripten’s runtime environment resembles a Unix-like system), this was preprocesses down to an empty function (returning void), but it’s declared with a non-void return. Smells like undefined behavior! Let’s make this function return 0:

WORD save_console_attributes () {
#ifdef WIN32
CONSOLE_SCREEN_BUFFER_INFO csbiScreenBufferInfo;
GetConsoleScreenBufferInfo (GetStdHandle (STD_OUTPUT_HANDLE), &csbiScreenBufferInfo);
return csbiScreenBufferInfo.wAttributes;
#else
return 0;
#endif
}

With that single change, I now get useful code in main. Evidently LLVM’s optimizer was smart enough to recognize the call to that function invoked UB and optimized out the rest of main.

## Concluding

This issue illustrates nicely the dangers of a sufficiently smart compiler, where updates to your compiler might break otherwise-working code because it’s subtly broken. This is particularly of concern in C, where the compilers tend to go to extreme measures to optimize the generated code and there are a lot of ways to inadvertently invoke undefined behavior.

Static analyzers are a big help in finding these issues. Looking more closely at the compiler output from building Spasm, it emitted a warning regarding this function, as well as several potential buffer overflows of the following form:

strncat(s, "/", sizeof(s));

This looks correct (s is a static buffer), but is subtly broken because the length parameter taken by strncat should be the maximum allowed length of the string, excluding the null terminator. The third parameter should be sizeof(s) - 1 in this case, otherwise the string’s null terminator might be written out of bounds.

## Appendix

The code for my work on this is up on Bitbucket and might be of interest to some readers. I fear that by working on this project I’ve inadvertently committed to becoming the future maintainer of Spasm, which I find to contain a significant amount of poor-quality code. Perhaps I’ll have to write a replacement for Spasm in Rust, which I’ve been quite pleased with as a potential replacement for C, without the numerous pitfalls and rather more modern in its capabilities.

# Reverse-engineering Ren’py packages

Some time ago (September 3, 2013, apparently), I had just finished reading Analogue: A Hate Story (which I highly recommend, by the way) and was particularly taken with the art. At that point it seems my engineer’s instincts kicked in and it seemed reasonable to reverse-engineer the resource archives to extract the art for my own nefarious purposes.

A little examination of the game files revealed a convenient truth: it was built with Ren’Py, a (open-source) visual novel engine written in Python. Python is a language I’m quite familiar with, so the actual task promised to be well within my expertise.

## Code

Long story short, I’ve build some rudimentary tools for working with compiled Ren’py data. You can get it from my repository on BitBucket. Technically-inclined readers might also want to follow along in the code while reading.

## Background

There are a large number of games designed with Ren’py. It’s an easy tool to get started with and hack on, since the script language is fairly simple and because it’s open-source, more sophisticated users are free to bend it to their will. A few examples of (in my opinion) high-quality things built with the engine:

Since visual novels tend to live or die on the combination of art and writing, the ability to examine the assets outside the game environment offers interesting possibilities.

Since it was handy, I started my experimentation with Analogue.