Sunday, November 24, 2013

On the Uniform Access Principle

The UAP is not a good idea for generalised use, and Properties (e.g. in C#, ObjC) should be avoided. There, I said it :) I think this is the post with the most non-standard viewpoint so far, so I wanted to state it very clearly up front, and I look forward to hearing why my logic is wrong (as I'm guessing there are a number of reasons), so as always, please add feedback after reading this post.

Uniform Access Principle definition

Firstly, some background. The Uniform Access Principle was coined by Bertrand Meyer, the creator of the Eiffel language, and states:
All services offered by a module should be available through a uniform notation, which does not betray whether they are implemented through storage or through computation. 
I'm not 100% certain of the exact reasons why this was considered a good idea, but from reading some sources it seems the benefits derive primarily from abstracting the source of the data away from callers - i.e. if you call getFoo(), you don't need to know whether it's stored directly or computed, which also means that the provider of the data can change that source without having to change all callers.

On hiding the data source

The main criticism of the UAP seems to be that hiding the cost of computing some data can be problematic - e.g. there is a big difference between returning a member and computing it via an RPC to somewhere that has to do a long calculation before the value is known. This is certainly a problem, and even more so in a network-based client-server world, but in itself it seems mostly solved by simply not hiding expensive calculations in get() methods. That is, a getCommentCount()#int that requires a database index look-up is more likely to be queryCommentCount(Callback<int>)#void - callers know this will take a while and must handle the result asynchronously, and everyone is happy.
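
To make that distinction concrete, here's a minimal Java sketch (the class and method names are mine, not from any real API, and it uses Integer since Java generics can't take int): the cheap stored value keeps its plain getter, while the expensive database-backed count is exposed asynchronously so the cost is visible at the call site.

interface Callback<T> {
  void onResult(T value);
}

class Post {
  private String title;   // stored locally, cheap to return

  String getTitle() {
    return title;          // plain accessor, no hidden cost
  }

  // The expensive value is deliberately not a get(): callers must pass a
  // callback, so they can't mistake it for a simple stored member.
  void queryCommentCount(Callback<Integer> callback) {
    new Thread(() -> callback.onResult(countCommentsInDatabase())).start();
  }

  private int countCommentsInDatabase() {
    return 42;             // stand-in for the real database index look-up
  }
}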

On hiding the dependencies

The more fundamental problem I see is making a calculated value look like a normal member, independent of the other members of the containing object. Firstly, if the calculation only looks up the value from an external source (i.e. it's not stored locally) then that's not a big issue, and is mostly addressed by fixing the querying as mentioned above - e.g. having some way to convert an UnresolvedFoo into a ResolvedFoo, from which the required members can be retrieved locally.

No, the main problem comes from calculated 'members' which depend on other members of the object. The best example of this is a collection's size, but also things like displayName (= firstName + ' ' + lastName) or BMI (= mass / (height * height)). Having the value of one 'member' depend on the values of others embeds a scoping issue - if the value of one of the dependencies changes, so does the value of the calculated member, and this should be made clear to callers of the API.
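
As a minimal sketch of the displayName example (a made-up Person class, not code from anywhere in particular), note how the computed value reads exactly like a stored member while silently depending on two others:

class Person {
  private String firstName;
  private String lastName;

  String getFirstName() { return firstName; }
  String getLastName()  { return lastName; }
  void setFirstName(String firstName) { this.firstName = firstName; }
  void setLastName(String lastName)   { this.lastName = lastName; }

  // Reads just like another member to callers, but its value silently
  // depends on firstName and lastName - that is the dependency UAP hides.
  String getDisplayName() {
    return firstName + " " + lastName;
  }
}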

On commutativity

To express this more clearly, there's a concept of commutativity of an API - that is, the order in which you call methods does not matter. Or, to put it another way, the methods are as independent/orthogonal as possible, which is good for maintainability, and is also theorised to improve scalability.

Obviously, there are some cases where this can't hold, but ideally they're as easy to spot as possible. For example, for a mutable member, setFoo() and getFoo() do not commute, but this is clear from the fact that they operate on the same state. Calling setFoo() resets the foo scope, and everything that was using the value (displaying it on a screen, using it as input for a calculation, ...) is now invalid and needs to acquire the new value.

So as long as our members are kept separate, commutativity is clear. With dependencies between a local member and a calculated member, however, things become much murkier - how is a caller supposed to know that calling setFirstName() no longer commutes with getDisplayName()? This knowledge is required in order to have the correct value when needed (e.g. to reset the display-name scope when the first name changes), but it is exactly this dependency that the UAP is hiding!
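
Continuing the hypothetical Person sketch from above, here's the failure mode from the caller's side - nothing in the API signals that the value already read has gone stale:

Person p = new Person();
p.setFirstName("Ada");
p.setLastName("Lovelace");

String shown = p.getDisplayName();  // rendered on screen as "Ada Lovelace"
p.setFirstName("Augusta");          // silently invalidates what was shown

// Nothing told the caller that setFirstName() and getDisplayName() do not
// commute, so 'shown' is now stale and stays on screen until something
// happens to re-read it.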

On serialization

Another, smaller area where the distinction between local members and computed properties is required is in serialization - e.g. across a network, or into persistent storage, etc. When sending or storing an object, generally only the individual members are written out; the calculated properties remain calculated, with the calculation (i.e. the function) defined once for the type. This distinction (sent as values on the wire vs. shipped as a function in code) is quite important, given that cross-language/cross-platform member serialization is mostly a solved problem (see previous post) but there is no equivalently good solution for function definitions (...if you know of one, it would be great to hear!).
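
As a sketch of what I mean (hand-rolled on the hypothetical Person class above, deliberately not using any particular serialization library), only the stored members go over the wire; the display name is re-derived on the receiving side from the function defined once for the type:

class PersonWire {
  // Only the stored members travel as data; getDisplayName() stays as code,
  // defined once for the Person type on both sides of the wire.
  static String toJson(Person p) {
    return "{\"firstName\":\"" + p.getFirstName() + "\","
         + "\"lastName\":\"" + p.getLastName() + "\"}";
  }
}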

Immutability

It is important to add that immutable data structures are a special case for this - for those, I see much less of a problem in knowing what is calculated and what is not, other than the calculation cost mentioned above. Because of the immutability, none of the values can change, and hence the update scoping between dependencies is no longer a problem. Note that immutable objects also have the nice property that, without a set() method, members with multiple dependencies do not have to decide their update semantics. In the example above, getBMI() depended on both height and weight, so it is not clear what setBMI() would do - it could lock weight and update height, or vice versa, or even some combination in the middle. One caution: these need to be actually immutable, not just behind an unmodifiable API. If the data can change, even if only remotely, then dependencies still need to be clear.
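
A quick sketch of the immutable version (again, made-up names): with no setters there is no update-scoping problem, and the awkward question of what setBMI() would mean never arises.

final class Measurements {
  private final double heightMetres;
  private final double massKilograms;

  Measurements(double heightMetres, double massKilograms) {
    this.heightMetres = heightMetres;
    this.massKilograms = massKilograms;
  }

  double getBMI() {
    return massKilograms / (heightMetres * heightMetres);
  }

  // "Changing" a value means building a new snapshot, so anything computed
  // from the old one stays consistent with the data it was computed from.
  Measurements withMass(double newMassKilograms) {
    return new Measurements(heightMetres, newMassKilograms);
  }
}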

Summary

Seriously, don't use the UAP. Keep your mutable members as members, and make sure they're all orthogonal (i.e. the value of one does not depend on the value of another). Serialization and persistence will be much easier, callers will have a better idea of when values update, and correctness in general should be simpler. Try to define members that actually represent what you need, as changing them to calculated values later will be harder - not so much because the clients will need to be updated, but because they have probably made assumptions about the life-cycles of the values they are using, which you have just broken (and which wouldn't have been noticed if you were using the UAP). And take immutable snapshots whenever possible.

Sunday, October 13, 2013

On Syncing (push vs pull)

The next topic on my to-do list is one that seems to come up an awful lot, underlying a huge range of performance issues at scale, yet as far as I can tell it has no good reusable infrastructure of solutions yet. Whether it's storing data locally in a web or phone client, or caching server-to-server communication, syncing that information between two places seems notoriously hard.

The problem

I have a complex-typed tree of data that is persisted in one place, but I want to use it in another. If it were immutable, this would be simple - I could read it once and persist that for as long as it makes sense (i.e. a simple cache). Things get much more complicated, however, when the values change - assuming the machine applying the change is the one providing access to the data originally, propagating this information to the remote user of the data is non-trivial.
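
To pin the problem down, here's a minimal sketch of the naive approach (DataTree and Origin are stand-in types I've made up): a local copy that is perfectly fine for immutable data, but silently goes stale the moment the origin applies an edit.

interface Origin { DataTree fetchSnapshot(); }
class DataTree { /* some complex-typed tree of data */ }

class NaiveCache {
  private DataTree cached;

  DataTree get(Origin origin) {
    if (cached == null) {
      cached = origin.fetchSnapshot();  // read once...
    }
    return cached;  // ...but nothing here notices later edits at the origin
  }
}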

Solution #1: Pull (passive data transfer)

The solution most widely used today is a simple pull - that is, the remote client asks for the latest version, and either receives it, or gets a response saying nothing has changed. This is nice for a number of reasons, but the primary one seems to be that you have to provide a way to get the data anyway, so this just reuses that path multiple times. ETags and 304 Not Modified are examples of infrastructure for this purpose. It also fits the client-server model quite nicely - all clients call one GET method, and the server responds identically without any additional state required (except for checking the etag match). Another clear win, coming from the passive aspect, is similar to lazily evaluated systems: as the client triggers the transfer, it only needs to happen when the data is going to be used; if it's not needed (e.g. off screen, or marked low priority) then the polling never needs to take place.

Unfortunately, even with this system there's still a bunch of setup - firstly, how often you ask for the latest version is a big question, trading off network traffic against freshness of the remote client. Other questions arise around how the etag (i.e. hash) is calculated, whether it's ok to apply changes to a stale object, and what the updates look like (see optimisations below).
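
As a rough sketch of the pull flow (the class and method names are mine; the headers are the standard ETag / If-None-Match pair), using plain HttpURLConnection:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PullClient {
  // Returns the fresh body, or null if the cached copy is still current.
  static String fetchIfChanged(URL url, String cachedEtag) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    if (cachedEtag != null) {
      conn.setRequestProperty("If-None-Match", cachedEtag);
    }
    if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
      return null;                                // 304: nothing has changed
    }
    // A real client would also remember conn.getHeaderField("ETag") here,
    // and decide how often to call this at all (the polling question above).
    try (InputStream in = conn.getInputStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }
}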

Solution #2: Push (active data transfer)

The opposite option is a push approach - any time data is read by a client, it tells the data's origin to keep an open channel, and once the origin processes an update to that data, it tells the client that it has changed. This is, at least conceptually, by far the most efficient approach for things which are read a lot and change infrequently; rather than communication being linear in time (divided by poll frequency), it is linear in edits. It is also the more favoured way of passing information locally in an event-driven system - every event listener works this way - and it is starting to gain traction client-side too (at least on the web) with things like Object.observe and binding frameworks like Ember.js or Angular.js.

It is not an easy switch, however - primarily, a data source now needs to keep a reference to each remote client that is listening, and these need to be two-way channels (pull doesn't have this requirement - if using a channel, every message is client-initiated); these references also need to be cleaned up properly, as anyone familiar with event handlers will know. Secondly, and this can be important, it is much harder to verify correctness on the client. In a pull system, you know the data you receive when asking is as fresh as possible, but if you miss a push event (due to a flaky connection, or a channel fault) then neither the client nor the server will know until the next mutation happens - which might not be for some time, given that this approach works best for infrequently modified data.
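
Here's an in-process sketch of the push shape (all names are mine, reusing the stand-in DataTree type from above; a real implementation would sit on a network channel rather than direct method calls): the origin keeps a reference to every subscriber, notifies them on each edit, and relies on unsubscribe() being called for cleanup - exactly the event-handler hygiene mentioned above.

import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

class PushOrigin {
  interface Listener { void onChanged(DataTree newValue); }

  private final Set<Listener> listeners = new CopyOnWriteArraySet<>();
  private DataTree current;

  void subscribe(Listener l)   { listeners.add(l); }
  void unsubscribe(Listener l) { listeners.remove(l); }  // easy to forget

  void applyEdit(DataTree newValue) {
    current = newValue;
    for (Listener l : listeners) {
      l.onChanged(newValue);  // communication is linear in edits, not time
    }
  }
}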

That said, I'd feel remiss if I didn't mention the example of Calendar API push notifications - while I don't write the API code, I am on the Calendar team, and this is a salient example of a system where push notifications are offered alongside pull ones.

Optimisations:

Neither of these is perfect, and there are plenty of trade-offs between the two, but one optimisation worth mentioning is becoming a bit more common, and certainly helps reduce the size of communication between the two - the use of PATCH semantics. I covered this briefly in "On types, part 2", but if you assume that an etag is the same as a version identifier, then responses from the data origin only need to contain the diff between the old and current versions, rather than a snapshot of the current version. For new clients (with undefined 'old' versions) the diff and the snapshot are equivalent, but for once-off edits, e.g. changing a single property, it is much more efficient to send just the operations needed to bring the locally cached value up to its new state. [Some might notice my use of 'operations' above - it is no accident that this also allows operational transform, a feature of the Wave product I once worked on for a bit.]

The interesting thing is that this approach benefits both the pull and push systems - in pull, the server responds with just the difference between the requested ETag and the current version (at the small cost of keeping some old versions around), and in push, the server only sends out diffs, with the client requesting older updates if it turns out it has missed some. There is a cost, however, which explains the low take-up of diff-based technologies: the update events are not standardised, so supporting this model for your own data types will likely take quite a bit of custom implementation work (though this may improve in the future - see Firebase for an interesting offering in this area).
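
A client-side sketch of those semantics (again, all names are mine, with the stand-in DataTree from above): replay the operations when the response is based on the version we already hold, otherwise fall back to the full snapshot.

import java.util.List;

class DiffingClient {
  interface Operation { void applyTo(DataTree tree); }

  static class SyncResponse {
    String baseVersion;         // null when this is a full snapshot
    String newVersion;
    List<Operation> operations; // the diff: edits to replay
    DataTree snapshot;          // only set for snapshot responses
  }

  private DataTree localTree;
  private String localVersion;

  void onSyncResponse(SyncResponse r) {
    if (r.baseVersion != null && r.baseVersion.equals(localVersion)) {
      for (Operation op : r.operations) {
        op.applyTo(localTree);   // much smaller than re-sending everything
      }
    } else {
      localTree = r.snapshot;    // we missed something: take the full state
    }
    localVersion = r.newVersion;
  }
}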

Miscellaneous thoughts:

It is worth pointing out that these apply not only to client-server applications, but also to server-server calls, and perhaps more surprisingly, to indexing calculations; you can think of an indexer as a client that updates its own state based on remote updates to the data it's interested in, so it also needs to act as a client in that regard, and fetch the updates either by push or pull. For a really good write-up of that aspect, see this article on how Twitter decides which technique to use for its different indexes.

Really though, I want this to be something that's very easy to configure. In the same family as RPCs (remote procedure calls), it seems like what people really want is Remote Objects. As mentioned, there are positive and negative aspects of both push and pull systems, so it would need to be heavily configurable - but switching between options, and the various features offered, should be as simple as a config file or code annotation. It still frustrates me every time I write an API client that I have to manually set up my own updating code, compared to e.g.:
@Live(mode = Modes.PUSH,
      backup_pull_snapshot = "10min",
      wire_operations = true)
MyObjectType myLiveObject;
to set up a live-updated object using push notifications, one that sends diffs over the wire and every 10 minutes makes a pull request to ensure it has the most recent version. Oh well, maybe eventually... :)

Friday, January 4, 2013

On dynamic code rewriting

Any reader who has taken a Computer Science course, or read about the history of the field, has most likely come across the Von Neumann architecture at some point. In short, rather than having code and data stored in separate places, they're stored together - you can think of the code as just more data that can be executed, and that acts on other data. There are a number of advantages to this - for example, if your programs are stored, you can easily pick which one to load (just as you did when opening your browser to read this post), and pretty much every modern computer uses an evolved version of this system. But there's one fundamental property of the Von Neumann architecture which I feel is severely under-represented: given that code can manipulate data, and is stored as data, code can therefore modify itself.

Some examples of self-modification

While altering the instructions of a running program may sound cool, it also sounds problematic, and useful examples may not be immediately obvious. As it happens, however, there may be some modification going on right now - as the industry pushes towards making everything faster, a number of languages (including the JavaScript engines in browsers) employ a technique called just-in-time compilation, or JIT. While this alone isn't quite rewriting (more a translation from a slow language to a faster one), most JIT compilation also allows an optimisation step, where the translation can be introspected and rewritten for performance, and it is this type of modification that I think should be everywhere - allowing the human-written code to be simple and clear, but then optimised automatically.

Some downsides

As the wikipedia article on the Von Neumann architecture mentions: "Self-modifying code has largely fallen out of favor, since it is usually hard to understand and debug, as well as being inefficient under modern processor pipelining and caching schemes." That matches my impression of the downsides too - but while both points are valid (especially at this point in time), I'm hoping they'll be overcome, for two separate reasons.

Firstly, 'hard to understand' is an interesting notion, and really I think 'complex' is what is meant above. The thing is, computing systems (and programming languages) are continually getting more complex, but the tools for creating them are getting much easier to understand - our ability to write better tools is keeping up with our ability to write complex systems. If we can debug JIT compilation, then self-modifying code (when used in the right places) shouldn't be much harder, particularly as provably-correct systems improve.

Secondly, it is indeed the case that current caching/pipelining has evolved alongside static code, and so have many decades of optimisations. This will be the main sticking point, but it feels to me that, overall, self-modifying code will end up executing less code (see below), and so at some point there'll be enough demand for a new genre of optimisations, and eventually the speed will surpass the current static approach.

Note that there's also some nasty interplay with multithreaded code running in parallel over the same instruction set - however, I feel that case (and some others) is covered by the analysis above. For example, here the same instructions could be loaded, and then cached depending on what rewriting has taken place.

Some next steps

So the question I asked myself was: if you assume that at some point self-modifying code will become popular, what are the likely ways people will incrementally progress towards it? The most immediate answer seemed to partner with something I mentioned last post - the scoped stack. You see, there's a problem I run into continually when writing code, and that is: where to place logical assertions. For example, if I write an API, should I put bounds checking / nullity checks / other preconditions at the start of exposed methods? In most cases yes, but if this API is then used within another API with the same preconditions, the logic is duplicated, and is either left in (at a performance cost) or hacked around, e.g. by adding versions of the methods without the checks.

What I'd like to see instead is that these checks are written everywhere, but with no performance cost when already checked - and the only way I see that happening is through dynamically rewriting condition checks on long-lived scopes. Using the example from last post, say I'm answering requests that each have a User scope. In today's code, there's likely at least one code path handling an un-set user (e.g. the error / login flow) and a dozen or more checks within the request handling asserting that the user has been set. Upon starting to handle a request, a code rewriter could then rewrite to the login flow if there is no user, or remove all but the first is-set check, thereby reducing the instructions run while the original code remains clean.
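
I can't show an actual rewriter in a few lines, but here's the shape of the code I'm imagining (everything below is hypothetical): the precondition is written at every entry point, and a scope-aware rewriter would keep only the first check per request - or jump straight to the login flow when the User scope is unset.

class RequestHandler {
  interface User {}
  interface Request { User user(); }

  void handle(Request request) {
    if (request.user() == null) {     // the one check that would survive;
      redirectToLogin(request);       // with no user, the rewriter jumps
      return;                         // straight to this flow
    }
    loadPreferences(request);
    renderPage(request);
  }

  private void loadPreferences(Request request) {
    checkUserSet(request);            // redundant within this scope -
    /* ... */                         // a rewriter could elide it
  }

  private void renderPage(Request request) {
    checkUserSet(request);            // likewise elided after the first
    /* ... */
  }

  private static void checkUserSet(Request request) {
    if (request.user() == null) throw new IllegalStateException("no user");
  }

  private void redirectToLogin(Request request) { /* error / login flow */ }
}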

Note that permissions checking should also benefit greatly from this - if you can add checks throughout a system to ensure that correct access has been granted to the data you're using, yet have these self-modified down to a single check with little performance hit, this would no doubt lead to better-protected data access.

As a brief addition before the summary, two more fanciful predictions of where self-modifying code might gain popularity: (1) cloud sandboxing - executing code on third-party machines, rewritten to avoid making certain calls or re-optimised for the different machine profiles it's running on; and (2) on-the-fly updates / code fetching - Chrome already does something like this, but while web apps can do small patches by loading more javascript, native OS apps are far from this point (important given the growing popularity of mobile apps).

tl;dr (?)

In summary, my guess is that the pursuit of fast execution times, plus better tools, will eventually lead us down the path of self-modifying code, and hopefully to cool new applications using the techniques it can provide. It should in general lead to faster, cleaner and more secure code - however, don't hold your breath.