
Friday, January 4, 2013

On dynamic code rewriting.

Any reader who has done a Computer Science course, or read about the history of the field, has most likely come across the Von Neumann architecture at some point. In short, rather than having code and data stored in separate places, they're stored together - you can think of the code as just more data that happens to be executable, and acts on other data. There are a number of advantages to this - for example, if your programs are stored as data, you can easily pick which to load (just as you did when opening your browser to read this post), and pretty much every modern computer uses an evolved version of this design. But there's one fundamental property of the Von Neumann architecture which I feel is severely under-represented: given that code can manipulate data, and is stored as data, code can therefore modify itself.
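For a toy illustration of what that means in practice, here's a sketch in Python (purely illustrative - no real system works quite this way): a function that builds a new definition of itself at runtime and installs it in its own place.

```python
# Code is data: on its first call, this function compiles a replacement
# definition of itself and rebinds its own name, so every later call runs
# the new body instead.
def greet():
    new_source = "def greet():\n    print('hello from the rewritten code')"
    exec(new_source, globals())   # build and install the replacement
    greet()                       # the global name now points at the new code

greet()   # first call triggers the rewrite, then runs the new version
greet()   # the original body is gone for good
```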

Some examples of self-modification

While altering the instructions of a running program may sound cool, it also sounds problematic, and examples of its usefulness may not be immediately obvious. As it happens, however, some modification may be going on right now - as the industry pushes towards making everything faster, a number of languages (including the JavaScript engines in browsers) employ a technique called just-in-time compilation, or JIT. While this alone isn't quite rewriting (more a translation from a slow language to a faster one), most JIT compilation also includes an optimisation step, where the translation can be introspected and rewritten for performance - and it is this type of modification that I'd like to see everywhere, allowing the human-written code to stay simple and clear while being optimised automatically.
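As a rough sketch of that optimisation step (illustrative Python, not how any real JIT is implemented): once a value is observed to be fixed at runtime, a specialised version of the function can be generated with that value folded in as a constant.

```python
# After observing at runtime that `factor` never changes, generate a
# specialised function with the value baked in as a constant, roughly as a
# JIT's optimisation step might constant-fold it.
def specialise_scaler(factor):
    source = f"def scale(x):\n    return x * {factor}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["scale"]

scale_by_3 = specialise_scaler(3)   # generated once, with 3 inlined
print(scale_by_3(14))               # 42
```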

Some downsides

As the wikipedia article on the Von Neumann architecture mentions: "Self-modifying code has largely fallen out of favor, since it is usually hard to understand and debug, as well as being inefficient under modern processor pipelining and caching schemes." This matches my impression of the downsides too - but while both are valid (especially at this point in time), I'm hoping they'll be overcome, for two separate reasons.

Firstly, 'hard to understand' is an interesting claim - really, I think 'complex' is what's meant. Computing systems (and programming languages) are continually getting more complex, but the tools for building them are getting easier to understand; our ability to write better tools is keeping up with our ability to write complex systems. If we can debug JIT compilation, then self-modifying code (when used in the right places) shouldn't be much harder to debug, particularly as provably-correct systems improve.

Secondly, it is indeed the case that current caching and pipelining have evolved around static code, and carry many decades of optimisations with them. This will be the main sticking point, but it seems to me that self-modifying code will, overall, end up executing fewer instructions (see below), so at some point there'll be enough demand for a new genre of optimisations, and eventually the speed will surpass the current static approach.

Note that there's also some nasty interplay with multithreaded code running in parallel over the same instructions - however, I feel that case (and some others) is covered by the analysis above. Here, for example, the same instructions could be loaded, then cached separately depending on what rewriting has taken place.

Some next steps

So the question I asked myself was: if you assume self-modifying code will become popular at some point, how are people likely to progress towards it incrementally? The most immediate answer seemed to pair with something I mentioned last post - the scoped stack. You see, there's a problem I run into continually when writing code: where to place logical assertions. For example, if I write an API, should I put bounds checking / nullity checks / other preconditions at the start of exposed methods? In most cases yes - but if that API is then used within another API with the same preconditions, the logic is duplicated, and either left in (at a performance cost) or hacked around, e.g. by adding versions without the checks (a small sketch of the duplication follows).
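To make the duplication concrete, a small sketch (hypothetical Python; the names are made up for illustration):

```python
# Both layers repeat the same precondition: the outer call has already
# validated the index, but the inner API can't know that, so it checks again.
def get_item(items, index):
    assert 0 <= index < len(items)   # precondition, checked redundantly
    return items[index]

def get_pair(items, index):
    assert 0 <= index < len(items)   # validated here...
    return items[0], get_item(items, index)   # ...and re-checked inside
```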

What I'd like to see instead is that these checks are written everywhere, but carry no performance cost once already checked - and the only way I can see that happening is by dynamically rewriting condition checks on long-lived scopes. Using the example from last post, say I'm answering requests that each have a User scope. In today's code, there's likely at least one path handling an un-set user (e.g. error / login flow) and a dozen or more checks within the request handling asserting the user has been set. Upon starting to handle a request, a code rewriter would either rewrite to the login flow if there is no user, or remove all but the first is-set check - reducing the instructions run while the original code remains clean.
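Here's a minimal sketch of that idea (illustrative Python, using name rebinding as a stand-in for true instruction rewriting; all names are hypothetical):

```python
current_user = None

def handle_request(action):
    # The one is-set check for this scope: divert to the login flow if the
    # user is missing, otherwise swap in the unchecked fast path so no
    # later is-set checks are ever executed.
    global handle_request
    if current_user is None:
        return login_flow()
    handle_request = handle_request_unchecked
    return handle_request(action)

def handle_request_unchecked(action):
    return f"{current_user} performed {action}"   # no is-set checks remain

def login_flow():
    return "please log in"
```

A real rewriter would scope the swap to a single request rather than rebinding globally, but the shape is the same: one check on scope entry, then straight-line code.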

Note that permissions checking should also benefit greatly from this: if you can add checks throughout a system to ensure correct access has been granted to the data you're using, yet have these self-modified down to a single check with little performance hit, it would no doubt lead to better-protected data access.

As a brief addition before the summary, two more fanciful predictions of where self-modifying code might gain popularity: (1) cloud sandboxing - executing code on third-party machines, rewritten to avoid making certain calls or re-optimised for the machine profiles it's running on; and (2) on-the-fly updates / code fetching - Chrome already does something like this, but while web apps can apply small patches by loading more javascript, native OS apps are far from this point (important given the growing popularity of mobile apps).

tl;dr (?)

In summary, my guess is that the pursuit of fast execution times, plus better tools, will eventually lead us down the path of self-modifying code, and hopefully to cool new applications of the techniques it provides. In general it should lead to faster, cleaner and more secure code - but don't hold your breath.

2 comments:

  1. I don't think JIT or most of the other stuff you're talking about is actually self-modifying code in the traditional sense (and in the sense of the wikipedia article). Dynamically generated and/or regenerated code is not self-modifying. Yes, from e.g. the kernel's POV, a process is modifying its own code, but really it's the compiler modifying the code of the application.

    More to the point, the code in memory is functionally equivalent each time it's regenerated; it's just optimised for different cases.

    The self-modifying code that has a bad rep is the stuff in http://en.wikipedia.org/wiki/Self-modifying_code . It's crazy stuff like a bash script that runs sed on itself (fun fact - bash scripts are interpreted line by line, and if the file is modified, bash will reread it and continue on from the next line, so if a script is taking an unexpectedly long time and you'd like it to mail you when it's done, you can just echo >> script "echo done | mail you@foo.com").

    Processors have been optimised for running tight loops that do not alter themselves during execution (well, the loops don't have to be so tight these days to fit in cache). I don't think there's anything there that prevents on-the-fly-regenerated code from working well. It's only worth regenerating if it's going to be called repeatedly anyway, and that should work just fine with the cache.

    Replies
    1. Thanks - indeed, I was a bit unsure about the JIT part, as it's usually thought of as language translation (e.g. scripting to native). Applying that to the same language, just with different optimisations based on runtime context, is the self-modifying I'm thinking about above.

      I guess it's just a terminology difference - the stuff you're calling dynamic regeneration, though done by the running program rather than a compiler, since the program has contextual knowledge from runtime. That said, even compiler-based incremental runtime rewriting would be a neat step in that direction.

      In the same way that (I'm guessing?) JS JITs optimise differently from site to site, it'd be nice if native apps could JIT from version to version, or user to user, or even change as user behaviour changes. These would ideally be either infrequent enough to be worth it, or dependent on a small enough state (e.g. a few booleans) to be cacheable by state.
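      Roughly, something like this (an illustrative Python sketch - the names and shape are made up, not from any real system):

      ```python
      # "Cacheable by state": keep one generated handler per small state tuple,
      # building a specialised variant only the first time a state is seen.
      _variants = {}

      def get_handler(logged_in, is_admin):
          key = (logged_in, is_admin)
          if key not in _variants:
              # Bake the booleans in as constants, as a per-state rewriter might.
              if not logged_in:
                  src = "lambda action: 'please log in'"
              elif is_admin:
                  src = "lambda action: f'admin ran {action}'"
              else:
                  src = "lambda action: f'user ran {action}'"
              _variants[key] = eval(src)
          return _variants[key]

      print(get_handler(True, False)("search"))   # user ran search
      ```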
