Sunday, October 13, 2013

On Syncing (push vs pull)

The next topic on my to-do list of things to cover is one that seems to come up an awful lot, underlying a huge range of performance issues at scale, and yet as far as I can tell there is no good reusable infrastructure of solutions for it. Whether it be storing data locally in a web or phone client, or caching server-server communication, syncing that information between two places seems notoriously hard.

The problem

I have a complex-typed tree of data that is persisted in one place, but I want to use it in another. If it were immutable, this would be simple - I could read it once, and persist that copy for as long as it makes sense (i.e. a simple cache). Things get much more complicated however when the values change - assuming the machine applying the change is the one originally providing access to the data, propagating this information to the remote user of the data is non-trivial.

Solution #1: Pull (passive data transfer)

The solution most widely used today is a simple pull - that is, the remote client asks for the latest version, and either receives it, or a response saying nothing has changed. This is nice for a number of reasons, but the primary one seems to be that you have to provide a way to get the data anyway, so this just reuses that method multiple times. ETags and 304 Not Modified are examples of infrastructure for this purpose. It also fits the client-server model quite nicely - all clients call one GET method, and the server responds identically without any additional state required (other than checking for an etag match). Another clear win, coming from the passive aspect, is similar to lazily evaluated systems: as the client triggers the transfer, it only needs to happen when the data is going to be used; if it's not needed (e.g. off screen, or marked low priority) then the polling never needs to take place.

Unfortunately, even with this system, there's still a bunch of setup. Firstly, how often you ask for the latest version is a big question, trading off network traffic against freshness of the remote client. Other questions arise around how the etag (i.e. hash) is calculated, whether it's ok to apply changes to a stale object, and what the updates look like (see optimisations below).
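To make the pull flow concrete, here's a minimal sketch in Python. The `Origin` and `PullClient` names are invented for illustration, and the content hash standing in for a real ETag is an assumption - real servers compute etags however they like - but the conditional-GET shape (send your etag, get a cheap 304 back when nothing changed) is the same as the HTTP mechanism described above.

```python
import hashlib
import json


class Origin:
    """Holds the authoritative copy of the data and answers conditional GETs."""

    def __init__(self, data):
        self.data = data

    def etag(self):
        # A content hash stands in for a real ETag computation.
        blob = json.dumps(self.data, sort_keys=True).encode()
        return hashlib.sha1(blob).hexdigest()

    def get(self, if_none_match=None):
        """Returns (status, etag, body) - a 304 with no body when the tag matches."""
        tag = self.etag()
        if if_none_match == tag:
            return 304, tag, None
        return 200, tag, self.data


class PullClient:
    """Polls the origin, keeping a cached copy plus the etag it corresponds to."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = None
        self.etag = None

    def poll(self):
        status, tag, body = self.origin.get(if_none_match=self.etag)
        if status == 200:
            self.cache, self.etag = body, tag
        return status


origin = Origin({"title": "hello"})
client = PullClient(origin)

assert client.poll() == 200   # first poll transfers the full snapshot
assert client.poll() == 304   # nothing changed: cheap response, no body
origin.data["title"] = "world"
assert client.poll() == 200   # the change shows up as a new etag
```

Note that the freshness question from above lives entirely in how often `poll()` is called - the origin itself stays stateless apart from the etag comparison.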

Solution #2: Push (active data transfer)

The opposite option is a push approach - any time data is read by a client, it tells the data's origin to keep an open channel, and once the origin processes an update to that data, it tells the client that it has changed. This is, at least conceptually, by far the most efficient for things which are read a lot, and change infrequently; rather than communication being linear in time (divided by poll frequency), it is linear in edits. It is also the more favoured way of passing information locally in an event-driven system - every event listener works in this way, and is starting to gain traction client-side too (at least on web) with things like Object.observe and binding frameworks like Ember.js or Angular.js.

It is not an easy switch however. Primarily, a data source now needs to keep a reference to each remote client that is listening in, and these need to be two-way channels (pull doesn't have this requirement - if using a channel, every message is client-initiated); these channels also need to be cleaned up properly, as anyone familiar with event handlers will know. Secondly, and this can be important, it is much harder to verify correctness in a client. In a pull system, you know that the data you receive when asking is as fresh as possible, but if you miss a push event (due to a flaky connection, or a channel fault) then neither the client nor the server will know until a second mutation happens - which might not be for some time, given that this approach works best for infrequently modified data.
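Both problems - the bookkeeping of listeners, and silently missed events - can be sketched in a few lines. This is a toy, assuming in-process callbacks stand in for real two-way channels, and the version-numbering trick for detecting gaps is one common mitigation rather than anything standard; note how the client only discovers the missed update when the *next* push arrives, exactly the failure mode described above.

```python
class PushOrigin:
    """Keeps a 'channel' (here, a callback) per subscriber and notifies them
    on every edit, tagging each event with a monotonic version number."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)
        return self.data, self.version  # initial snapshot

    def unsubscribe(self, callback):
        self.listeners.remove(callback)  # channels must be cleaned up

    def update(self, new_data):
        self.data = new_data
        self.version += 1
        for cb in list(self.listeners):
            cb(new_data, self.version)


class PushClient:
    def __init__(self, origin):
        self.origin = origin
        self.cache, self.version = origin.subscribe(self.on_event)
        self.stale = False

    def on_event(self, data, version):
        if version != self.version + 1:
            # A gap in version numbers means we missed a push somewhere.
            self.stale = True
            return
        self.cache, self.version = data, version


origin = PushOrigin({"n": 1})
client = PushClient(origin)
origin.update({"n": 2})
assert client.cache == {"n": 2}

# Simulate a flaky channel: the origin edits while the client is unreachable.
origin.unsubscribe(client.on_event)
origin.update({"n": 3})              # this push is never delivered
origin.subscribe(client.on_event)
origin.update({"n": 4})              # only now does the gap become visible
assert client.stale
```

A real client flagged `stale` would typically fall back to a full pull to resynchronise - which is why push systems so often keep a pull endpoint around as a backstop.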

That said, I feel I'd be remiss if I didn't show the example of Calendar API push notifications - while I don't write API code, I am on the Calendar team, and this is a salient example of a system where push notifications are offered alongside pull ones.

Optimisations:

Neither of these is perfect, and there are plenty of trade-offs between the two, but one optimisation worth mentioning is becoming a bit more common, and certainly helps reduce the size of communication between the two - and that is the use of PATCH semantics. I covered this briefly in "On types, part 2", but if you assume that an etag is the same as a version identifier, then responses from the data origin only need to contain the diff between the old and current versions, rather than a snapshot of the current version. For new clients (with undefined 'old' versions) the diff and snapshot are equivalent, but for once-off edits, e.g. changing a single property, it is much more efficient to send just the operations needed to update the locally cached value to its new state. [Some might notice my use of 'operations' above - it is no accident that this also allows operational transform, a feature of the Wave product I once worked on for a bit.]

The interesting thing is that this approach benefits both the pull and push systems - in pull, the server responds with just the difference between the requested ETag and the current version (at the small cost of keeping some old versions around), and in push, the server only sends out diffs, with a client re-requesting anything it turns out to have missed. There is a cost however, which explains the low take-up rate of diff-based technologies: update events are not standardised, so supporting this model for your own data types will likely take quite a bit of custom implementation work (though this may improve in the future, e.g. see Firebase for an interesting offering in this area).
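As a sketch of those PATCH semantics, here's a toy origin that logs each edit as an operation rather than a snapshot, assuming a flat dictionary and a single invented `("set", key, value)` op type - real systems need a richer, custom op vocabulary, which is exactly the implementation cost mentioned above. The version number doubles as the etag, so a client can ask "what changed since version v?" and get back either a small op list or, if it's brand new, a full snapshot.

```python
class DiffOrigin:
    """Keeps a log of operations so it can answer 'what changed since v?'."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = 0
        self.ops = []  # ops[i] transforms version i into version i+1

    def set(self, key, value):
        # Each edit is recorded as a small operation, not a new snapshot.
        self.ops.append(("set", key, value))
        self.data[key] = value
        self.version += 1

    def changes_since(self, version):
        """Returns (current_version, payload): an op list for known clients,
        or a full snapshot for new clients with no 'old' version."""
        if version is None or version > self.version:
            return self.version, ("snapshot", dict(self.data))
        return self.version, ("ops", self.ops[version:])


def apply_changes(cache, payload):
    """Client side: replay a snapshot or a list of ops onto the local copy."""
    kind, body = payload
    if kind == "snapshot":
        return dict(body)
    for op, key, value in body:
        if op == "set":
            cache[key] = value
    return cache


origin = DiffOrigin({"title": "hello"})
version, payload = origin.changes_since(None)   # new client: full snapshot
cache = apply_changes({}, payload)

origin.set("title", "world")
origin.set("starred", True)
version, payload = origin.changes_since(version)
assert payload[0] == "ops" and len(payload[1]) == 2   # just two small ops
cache = apply_changes(cache, payload)
assert cache == {"title": "world", "starred": True}
```

The same `changes_since` call serves both models: a pull client invokes it on a timer, while a push origin can send each new op as it happens and fall back to this call when a client reports a gap.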

Miscellaneous thoughts:

It is worth pointing out that these apply not only to client-server applications, but also to server-server calls, and perhaps more surprisingly, to indexing calculations; you can think of an indexer as a client that updates its own state based on remote updates to data it's interested in, so it also needs to act as a client in that regard, and fetch the updates either by push or pull. For a really good writeup of that aspect, see this article on how Twitter decides which technique to use for its different indexes.

Really however, I want this to be something that's very easy to configure. In the same family as RPCs (remote procedure calls) it seems like all people want is Remote Objects. As mentioned, there are positive and negative aspects of push and pull systems, so it would need to be heavily configurable - but switching between options, and the various features offered, should be as simple as a config file or code annotation. It still frustrates me every time I write an API client that I have to manually set up my own updating code, compared to e.g:
@Live(mode = Modes.PUSH,
      backup_pull_snapshot = "10min",
      wire_operations = true)
MyObjectType myLiveObject;
to set up a live-updated object using push notifications, that sends diffs over the wire, and every 10min sends a pull request to ensure it is at the most recent version. Oh well, maybe eventually... :)

2 comments:

  1. Getting a read-only copy of data is mostly easy; the actual nasty problem is when you have editors (especially multiple editors).

    1. Agreed - if by 'read-only' you mean 'immutable'? A read-only view of mutable data is already quite tricky, if you want the view to reflect any changes - even with just one editor, although I guess only if that editor is remote.

      Multiple editors is then much, much harder to get working nicely (rather than last-wins), but is also too long to cover in this post :p
