
Sunday, November 24, 2013

On the Uniform Access Principle

The UAP is not a good idea for generalised use, and Properties (e.g. in C#, ObjC) should be avoided. There, I said it :) I think this is the post with the most non-standard viewpoint so far, so I wanted to start by stating it very clearly, and I look forward to hearing the reasons why my logic is wrong (I'm guessing there are a number), so as always, please add feedback after reading this post.

Uniform Access Principle definition

Firstly, some background. The Uniform Access Principle was coined by Bertrand Meyer, the creator of the Eiffel language, and states:
All services offered by a module should be available through a uniform notation, which does not betray whether they are implemented through storage or through computation. 
I'm not 100% certain of the exact reasons why this was considered the best idea, but from reading some sources the benefits seem primarily to derive from abstracting the source of the data away from callers - i.e. if you call getFoo(), you don't need to know whether it's stored directly or computed, which also means that the provider of the data can change that source without having to change all callers.

On hiding the data source

The main criticism of the UAP seems to be that hiding the cost of computing some data can be problematic - e.g. there is a big difference between returning a member, and computing it via RPC to somewhere that has to do a long calculation before the value is known. This is certainly a problem, particularly in an increasingly network-based client-server world, but in itself it seems mostly solved by simply not hiding expensive calculations in get() methods. That is, a getCommentCount()#int that requires a database index look-up is more likely to be queryCommentCount(Callback<int>)#void - callers know this will take a while and must provide an asynchronous solution, and everyone is happy.
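As a sketch of this convention in Python (the names, and the comment count itself, are invented for illustration - this is not any real API):

```python
import threading

class Post:
    def __init__(self, title):
        self.title = title  # cheap, stored locally

    def get_title(self):
        # get*() is reserved for values that are already in memory.
        return self.title

    def query_comment_count(self, callback):
        # query*() signals an expensive lookup (e.g. a database index
        # scan) and forces callers to handle it asynchronously.
        def work():
            count = 42  # stand-in for the slow computation
            callback(count)
        threading.Thread(target=work).start()
```

The method name alone tells callers which reads are cheap and which may block.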

On hiding the dependencies

The more fundamental problem I see is making a calculated value appear like a normal member, independent of the other members of the containing object. Firstly, if the calculation only looks up the value from an external source (i.e. it's not stored locally) then that's no issue, and is mostly solved by fixing the querying as mentioned above - e.g. having some way to convert an UnresolvedFoo into a ResolvedFoo, from which the required members can be retrieved locally.

No, the main problem is with calculated 'members' which depend on other members of the object. The best example of this is a collection's size, but there are also things like displayName (= firstName + ' ' + lastName) or BMI (= mass/(height * height)). Having the value of one 'member' depend on the values of others embeds a scoping issue: if the value of one of the dependencies changes, so does the value of the calculated member, and this should be made clear to callers of the API.
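To make the dependency concrete, here's a small Python sketch (Person and its fields are a made-up example, following the post's names):

```python
class Person:
    def __init__(self, first_name, last_name, mass_kg, height_m):
        self.first_name = first_name    # plain stored members
        self.last_name = last_name
        self.mass_kg = mass_kg
        self.height_m = height_m

    @property
    def display_name(self):
        # Reads like a member, but silently depends on two others.
        return self.first_name + ' ' + self.last_name

    @property
    def bmi(self):
        # Same problem: changes whenever mass or height changes.
        return self.mass_kg / (self.height_m * self.height_m)
```

Nothing at the call site distinguishes `p.bmi` from a stored field, which is exactly the uniformity the UAP asks for - and exactly the hidden dependency being argued against.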

On commutativity

To express this more clearly, there's a concept of commutativity of an API - that is, the order in which you call its methods does not matter. Or, to put it another way, the methods are as independent/orthogonal as possible, which is good for maintainability, but is also theorised to improve scalability.

Obviously, there are some cases where this can't hold, but ideally they're as easy to spot as possible. For example, for a mutable member, setFoo() and getFoo() do not commute, but this is clear from the fact that they operate on the same state. Calling setFoo() resets the foo scope, and everything that was using the value (displaying it on a screen, using it as input for a calculation, ...) is now invalid and needs to acquire the new value.

So as long as our members are kept separate, commutativity is clear. With dependencies between a local member and a calculated member, however, it becomes very murky - how should a caller know that calling setFirstName() no longer commutes with getDisplayName()? This knowledge is required in order to have the correct value when needed (e.g. to reset the display-name scope when the first name is changed), but it is exactly this dependency that the UAP is hiding!
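A minimal Python demonstration of the broken commutativity (illustrative code only) - the caller's cached value goes stale with no hint from the API surface:

```python
class Person:
    # A UAP-style API hiding a dependency between members.
    def __init__(self, first, last):
        self._first, self._last = first, last

    def set_first_name(self, first):
        self._first = first   # nothing hints that displayName moved too

    def get_display_name(self):
        return self._first + ' ' + self._last

p = Person('Grace', 'Hopper')
label = p.get_display_name()   # a caller caches what looks like a member
p.set_first_name('G.')         # ...and is now holding a stale value
```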

On serialization

Another smaller area where the distinction between local members and computed properties is required is in serialization; e.g. across a network, into persistent storage, etc. When sending or storing an object, generally only the individual members are serialized; the calculated properties remain calculated, with the definition of the calculation (i.e. the function) defined once for the type. This distinction (sent as values on the wire vs. shipped as a function in code) is quite important, given that cross-language/cross-platform member serialization is mostly a solved problem (see previous post) but there is no equivalently good solution for function definitions (...if you know one, it would be great to hear!).
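For example, a sketch in Python of this split - stored members go on the wire, the computed property travels once, as code (Rect and its field names are invented for illustration):

```python
import json

class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    @property
    def area(self):
        # Travels as code (this definition), once per type.
        return self.width * self.height

    def to_json(self):
        # Only the stored members go on the wire.
        return json.dumps({'width': self.width, 'height': self.height})

    @classmethod
    def from_json(cls, s):
        return cls(**json.loads(s))
```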

Immutability

It is important to add that immutable data structures are a special case here - for those, I see much less of a problem in knowing what is calculated and what is not, other than the calculation cost mentioned above. Because of the immutability, none of the values can change, and hence the update scoping between dependencies is no longer a problem. Note that immutable objects also have the nice property that, without a set() method, members with multiple dependencies do not have to define their update semantics. In the example above, getBMI() depended on both height and mass, so it is not clear what setBMI() would do - it could lock mass and update height, or vice versa, or even some combination in between. One caution: these need to be actually immutable, not just behind an unmodifiable API. If the data can change, even if only remotely, then the dependencies still need to be clear.
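With Python dataclasses, for instance, a frozen type can expose a computed property uniformly without the scoping problem (Measurement is a made-up type for illustration):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Measurement:
    # An immutable snapshot of the BMI inputs.
    mass_kg: float
    height_m: float

    @property
    def bmi(self):
        # Safe to expose uniformly: no setter exists, so the
        # dependency on mass and height can never go stale.
        return self.mass_kg / (self.height_m * self.height_m)
```

Any attempt to mutate a field raises, so callers can cache `bmi` freely.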

Summary

Seriously, don't use the UAP. Keep your mutable members as members, and make sure they're all orthogonal (i.e. the value of one does not depend on the value of another). Serialization and persistence will be much easier, callers will better know when values update, and correctness in general should be simpler. Try to define members that actually represent what you need, as changing them to calculated values later will be harder - not so much because the clients will need to be updated, but because they have probably made assumptions about the life-cycles of the values they are using, which you have just broken (and which would have gone unnoticed under the UAP). And take immutable snapshots whenever possible.

Sunday, October 13, 2013

On Syncing (push vs pull)

The next topic on my to-do list of things to cover is one that seems to come up an awful lot, underlying a huge range of performance issues at scale, yet as far as I can tell has no good reusable infrastructure of solutions yet. Whether it be storing data locally in a web or phone client, or caching server-server communication, syncing that information between two places seems notoriously hard.

The problem

I have a complex-typed tree of data that is persisted in one place, but I want to use it in another. If it were immutable, this would be simple - I could read it once, and persist that for as long as it makes sense (i.e. a simple cache). Things get much more complicated, however, when the values change - assuming the machine applying the change is the one providing access to the data originally, propagating this information to the remote user of the data is non-trivial.

Solution #1: Pull (passive data transfer)

The solution most widely used today is a simple pull - that is, the remote client asks for the latest version, and either receives it, or a response saying nothing has changed. This is nice for a number of reasons, but the primary one seems to be that you have to provide a way to get the data anyway, so this just reuses that multiple times. ETags and 304 Not Modified responses are examples of infrastructure for this purpose. It also fits the client-server model quite nicely - all clients call one GET method, and the server responds identically without any additional state required (except for checking the etag match). Another clear win, coming from the passive aspect, is similar to lazily evaluated systems: as the client triggers the transfer, it only needs to happen when the data is going to be used; if it's not needed (e.g. off screen, or marked low priority) then the polling never needs to take place.
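As an in-process sketch in Python (no real networking - Server and Client here are stand-ins for the two ends of an HTTP conversation, and the names are invented), conditional GET with ETags might look like:

```python
import hashlib

class Server:
    # Stand-in for the data origin; one GET method for everyone.
    def __init__(self, data):
        self.data = data

    def etag(self):
        return hashlib.sha1(self.data.encode()).hexdigest()

    def get(self, if_none_match=None):
        tag = self.etag()
        if if_none_match == tag:
            return 304, None, tag   # Not Modified: nothing changed
        return 200, self.data, tag

class Client:
    # Stand-in for the remote client, polling on its own schedule.
    def __init__(self, server):
        self.server = server
        self.cached = None
        self.tag = None

    def poll(self):
        status, body, tag = self.server.get(if_none_match=self.tag)
        if status == 200:
            self.cached, self.tag = body, tag
        return status
```

The server stays stateless per client; the only per-request state is the etag comparison.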

Unfortunately, even with this system, there's still a bunch of setup - firstly, how often you ask for the latest version is a big question, trading off network traffic against freshness of the remote client. Other questions arise around how the etag (i.e. hash) is calculated, whether it's ok to apply changes to a stale object, and what the updates look like (see optimisations below).

Solution #2: Push (active data transfer)

The opposite option is a push approach - any time data is read by a client, it tells the data's origin to keep an open channel, and once the origin processes an update to that data, it tells the client that it has changed. This is, at least conceptually, by far the most efficient for things which are read a lot and change infrequently; rather than communication being linear in time (divided by poll frequency), it is linear in edits. It is also the more favoured way of passing information locally in an event-driven system - every event listener works this way, and it is starting to gain traction client-side too (at least on the web) with things like Object.observe and binding frameworks like Ember.js or Angular.js.
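A minimal in-process sketch of the push model in Python (an event-listener pattern standing in for real two-way channels; names are illustrative):

```python
class Origin:
    # Stand-in for the data source; listeners stand in for channels.
    def __init__(self, value):
        self._value = value
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)
        listener(self._value)            # initial read on first contact

    def unsubscribe(self, listener):
        self._listeners.remove(listener)  # must be cleaned up properly!

    def set(self, value):
        self._value = value
        for listener in self._listeners:
            listener(value)              # cost is linear in edits
```

Note the extra state the origin now carries - one reference per listening client - which is exactly the burden pull avoids.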

It is not an easy switch, however - primarily, a data source now needs to keep a reference to each remote client that is listening in, and these need to be two-way channels (pull doesn't have this requirement, and if using a channel, every message is client-initiated); the channels also need to be cleaned up properly, as anyone familiar with event handlers will know. Secondly, and this can be important, it is much harder to verify correctness in a client. In a pull system, you know that the data you receive when asking is as fresh as possible, but if you miss a push event (due to a flaky connection, or a channel fault) then neither the client nor the server will know until a second mutation happens - which might not be for some time, given that this works best for infrequently modified data.

That said, I feel I'd be remiss if I didn't show the example of Calendar API push notifications - while I don't write API code, I am on the Calendar team, and this is a salient example of a system where push notifications are offered alongside pull ones.

Optimisations:

Neither of these is perfect, and there are plenty of trade-offs between the two, but one thing worth mentioning is an optimisation that is becoming a bit more common, and certainly helps reduce the size of communication between the two - the use of PATCH semantics. I covered this briefly in "On types, part 2", but if you assume that an etag is the same as a version identifier, then responses from the data origin only need to contain the diff between the old and current versions, rather than a snapshot of the current version. For new clients (with undefined 'old' versions) the diff and snapshot are equivalent, but for once-off edits, e.g. changing a single property, it is much more efficient to send just the operations needed to update the locally cached value to its new state. [Some might notice my use of 'operations' above - it is no accident that this also allows operational transform, a feature of the Wave product I once worked on for a bit.]

The interesting thing is that this approach benefits both the pull and push systems - in pull, the server responds with just the difference between the requested ETag and the current version (at a small cost of keeping some old versions), and in push, the server only sends out diffs, with a client requesting old updates if it turns out it has lost some. There is a cost, however, which explains the low take-up rate of diff-based technologies: the update events are not standardised, so supporting this model for your own data types will likely take quite a bit of custom implementation work (though this may improve in the future, e.g. see Firebase for an interesting offering in this area).
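As an illustration in Python (flat string-map objects only - real systems need nested diffs and a standardised operation format, which is exactly the missing piece mentioned above):

```python
def diff(old, new):
    # Operations turning 'old' into 'new' (flat dicts only).
    ops = {}
    for key in old.keys() | new.keys():
        if old.get(key) != new.get(key):
            # Conflating None with 'delete' is fine for this sketch,
            # though problematic in general.
            ops[key] = new.get(key)
    return ops

def apply_patch(obj, ops):
    patched = dict(obj)
    for key, value in ops.items():
        if value is None:
            patched.pop(key, None)
        else:
            patched[key] = value
    return patched
```

For a one-property edit, the ops dict carries a single entry rather than the whole snapshot.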

Miscellaneous thoughts:

It is worth pointing out that these apply not only to client-server applications, but also to server-server calls, and perhaps more surprisingly, to indexing calculations; you can think of an indexer as a client who updates its own state based on remote updates to data it's interested in, so it also needs to act as a client in that regard, and fetch the updates either by push or pull. For a really good writeup of that aspect, see this article on how Twitter decides which technique to use for its different indexes.

Really, however, I want this to be something that's very easy to configure. In the same family as RPCs (remote procedure calls), it seems like what people really want is Remote Objects. As mentioned, there are positive and negative aspects of both push and pull systems, so it would need to be heavily configurable - but switching between options, and the various features offered, should be as simple as a config file or a code annotation. It still frustrates me every time I write an API client that I have to manually set up my own updating code, compared to e.g.:
@Live(mode = Modes.PUSH,
      backup_pull_snapshot = "10min",
      wire_operations = true)
MyObjectType myLiveObject;
to set up a live-updated object using push notifications, that sends diffs over the wire, and every 10min sends a pull request to ensure it is at the most recent version. Oh well, maybe eventually... :)

Friday, January 4, 2013

On dynamic code rewriting.

Any reader who has done a Computer Science course at some point, or read about the history of the science, has most likely at some point come across the Von Neumann architecture. In short, rather than having code and data stored in separate places, they're stored together - you can think of the code as just more data that can be executed, and acts on other data. There are a number of advantages to this - for example, if your programs are stored, you can easily pick which to load (just as you would have done when opening your browser to read this post), and pretty much every modern computer is using an evolved version of this system. But there's one fundamental property of Von Neumann architecture which I feel is severely under-represented: Given that code can manipulate data, and is stored as data, code can therefore modify itself.

Some examples of self-modification

While altering the instructions of a running program may sound cool, it also sounds problematic, and examples of its usefulness may not be immediately obvious. As it happens, however, there may be some modification going on right now - you see, as the industry pushes towards making everything faster, a number of languages (including the javascript engines in browsers) employ a technique called just-in-time compilation, or JIT. While alone this isn't quite rewriting (more, translating from a slow language to a faster one), most JIT compilation also allows an optimisation step, where the translation can be introspected and rewritten for performance, and it is this type of modification that should be everywhere - allowing the human-written code to be simple and clear, but then optimised automatically.

Some downsides

As the wikipedia article for Von Neumann mentions: "Self-modifying code has largely fallen out of favor, since it is usually hard to understand and debug, as well as being inefficient under modern processor pipelining and caching schemes.". This is my impression of the downsides too - but while they're both valid (especially at this point in time), I'm hoping they'll both be overcome for two separate reasons.

Firstly, 'hard to understand' is an interesting concept, and really I think 'complex' is what's meant above. The thing is, computing systems (and programming languages) are continually getting more complex, but the tools for creating them are getting much easier to use; our ability to write better tools is keeping up with our ability to write complex systems. If we can debug JIT compilation, then self-modifying code (when used in the right places) shouldn't be much harder, particularly as provably-correct systems improve.

Finally, it is indeed the case that current caching/pipelining has evolved alongside static code, and has many decades of optimisations behind it. This will be the main sticking point, but it feels to me that, overall, self-modifying code will end up executing less code (see below), so at some point there'll be enough demand for a new genre of optimisations, and eventually the speed will surpass the current static version.

Note that there's also some nasty interplay with multithreaded code running in parallel over the same instruction set - however, I feel that case (and some others) is covered by the analyses above. For example, here the same instructions could be loaded, and then cached depending on what rewriting has taken place.

Some next steps

So the question I asked myself was: if you assume self-modifying code will at some point become popular, what are the likely ways people will incrementally progress towards it? The most immediate answer seemed to partner with something I mentioned last post - the scoped stack. You see, there's a problem I run into continually when writing code, and that is where to place logical assertions. For example, if I write an API, should I put bounds checking / nullity checks / other preconditions at the start of exposed methods? In most cases yes, but if this API is then used within another API with the same preconditions, the logic is duplicated, and is either left in (at a performance cost) or hacked around, e.g. by adding versions without the checks.

What I'd like to see instead is that these are written everywhere, but with no performance cost when already checked - and the only way I see this happening is through dynamically rewriting condition checking on long-lived scopes. Using the example from last post, say I'm answering requests that each have a User scope. In today's coding, there's likely at least one code path handling an un-set user (e.g. the error / login flow) and a dozen or more checks within the request handling to assert the user has been set. Upon starting to handle a request, a code rewriter would then rewrite to the login flow if there is no user, or remove all but the first is-set check, thereby reducing the instructions run while the original code remains clean.
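Python can't rewrite its own instructions, but swapping a method reference once the scope is established gives the flavour of what such a rewriter would do (Handler and its methods are invented purely for illustration):

```python
class Handler:
    def handle(self, user):
        if user is None:
            return 'redirect: login'   # the un-set path, handled once
        # User scope established: swap the checked helper for its
        # check-free twin for the rest of the scope, a crude stand-in
        # for rewriting the instructions themselves.
        self.greet = self._greet_fast
        return self.greet(user)

    def greet(self, user):
        # Default path still carries the (now redundant) precondition.
        assert user is not None
        return 'hello ' + user

    def _greet_fast(self, user):
        return 'hello ' + user
```

Every later call through `self.greet` in this scope skips the assertion entirely, while the checked version remains the default for anyone outside the scope.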

Note that permissions checking should also benefit greatly from this - if you can add checks throughout a system to ensure that correct access has been given to the data you're using, yet have these self-modified to a single check with little performance hit, this would no doubt lead to better protected data access.

For a brief addition before the summary, two more fanciful predictions of where self-modifying code might gain popularity: (1) Cloud sandboxing: executing code on third-party machines, rewritten to avoid making certain calls or re-optimise for different machine profiles they're running on, and (2) on-the-fly updates / code fetching - Chrome already does something like this, but while web apps can do small patches by loading more javascript, native OS apps are far from this point (important given the growing popularity of mobile apps).

tl;dr (?)

In summary, my guess is that the pursuit of fast execution times, plus better tools, will eventually lead us down the path of self-modifying code and hopefully cool new applications using the techniques it can provide. It should in general lead to faster, cleaner and more secure code, however don't hold your breath.

Sunday, September 30, 2012

On naming arguments.

I've recently started trying out one of the course offerings over at Coursera.org - a guide to functional programming, using Scala. The approach I take to coding certainly leans functional, so it seemed a good way to practise, but also to learn the approaches taken by a new, familiar-ish language.

The relevance here is that the recent lectures included an introduction to currying, and got me thinking about something which has been growing on me for a while now: the current default convention of ordered function arguments is inferior to using named function arguments.

A brief history of arguments

Firstly, some background. Traditionally, in languages like C, Basic and Java, a function is given a name and an argument list - they can be referred to in the function by name, but this is for convenience - things like bash allow explicit index references, and all callers always specify the arguments by list. The background for this, I believe, is the underlying machine instructions used for function calling - parameters are on a stack, so having them in an ordered list is an easy way for callers and callees to agree where in the stack they are situated.

That said, there appears to be a trend in languages like Python, Dart and Ruby towards passing parameters by name. If you have read my earlier post on types, you might recognise what's going on here - it's a simple change, instead of every function accepting a list<type> as arguments, it now accepts stringmap<type>, where the keys of the map are the (string) names of the arguments.

The next questions to ask are why newer languages have gone this way, whether it is really better (or whether we should revert to argument lists), and at what cost. The immediate cost is clear - maps can't easily be serialised to the function stack, so there's a potentially unacceptable price to pay; that said, I don't think this cost is as big as it first seems (and, I believe, it could actually be an improvement). Here's why: the way apps are written, the function stack architecture could do with a cleanup.

An inefficient stack?

This is what I was mentioning earlier when talking about the Scala course - you see, you can allow Scala functions to be curried by iteratively currying the first parameter(s). As an example:
def sum(a: Int)(b: Int): Int = ... // sum of the numbers between a and b
val triangle = sum(0) _ // triangle(n) = sum(0)(n) = 0 + ... + n
The second function is a currying of the first, where the first parameter is always set to zero. However, what happens if I want a function which sums up to a constant (e.g. negtri = sum(?)(0))? This requires a different syntax, and may result in a different underlying implementation. Why? Well, consider the most efficient implementation of the code above. Assuming a normal stack architecture, we could have the following:
triangle:
  push 0; // curry on the new param
  invoke sum; // call sum instead.
Now, say I have a function with 5 parameters, and I want to curry in two, not at the start. With this change, the function has to pop the arguments off the stack, then re-push everything in the new order. With longer call chains, multiple arguments and liberal usage of currying, I'd be intrigued to find out how much time the average program spends rearranging arguments. My guess is: it's non-trivial.

A scoped 'stack'

No, instead what is needed is for functions to know where on the stack their parameters can be read from - really a random-access list instead of a stack. This has a number of complications, though I feel that most can be solved with a few compilation tweaks to carry around more state, now that functions don't exist in isolation but within the scope of an executing program. Which is exactly what I see in the way much application code is being written - within contextual scopes. Given the prevalence of server architectures, request scope (or RPC scope) is an obvious example, but so is simple object scope: if your program is a data storage layer for UserProfile objects (say), then you will likely spend much of the execution time calling functions scoped within a given UserProfile object - getting/setting its values, updating its listeners, persisting, ... If we assume that the profile ID (or a reference to it) is needed for all of those functions, then rather than passing it around all over the place, it'd be ideal to fix its location on the stack in a known place, and leave it there for the entire profile scope.

The method above results in minimal rewriting of argument locations, and also enables currying arbitrary parameters (something that simply could not be done before), at the cost of callers needing to know which scopes to fill before calling a function. These scopes are then the name of the parameter (e.g. currentUser above) and map to a location in the 'stack' (/array).

There are still some implementation specifics to sort out, but I'm sure they're tractable, and for the switch to named parameters you get a possible improvement in parameter passing, and the ability to curry arbitrary subsets - a big win! Plus: adding a new parameter to the stringmap (by overwriting the old keyed value) is much simpler than with the old list (which requires linear insertion), and callers are less likely to make mistakes when they have to name parameters (..."I'm passing true, false, false, true... is that the right order?").
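Python's keyword arguments plus functools.partial already allow exactly this - currying an arbitrary named parameter, not just a prefix (sum_range is the sum example from above, with the bounds named; the names are mine):

```python
from functools import partial

def sum_range(lower_bound, upper_bound):
    # The sum example from above, with the bounds as named parameters.
    return sum(range(lower_bound, upper_bound + 1))

# Any named subset can be curried, not just the first parameter:
triangle = partial(sum_range, lower_bound=0)
up_to_zero = partial(sum_range, upper_bound=0)
```

Both curried forms use the same syntax, regardless of which parameter was fixed - the position in the list no longer matters, only the name.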

The biggest against / for?

It's probably a good time to mention another downside at this point, which is its wordiness - having to name each parameter you pass, instead of inferring the names from the index, is quite verbose: you've just doubled the amount of text per function call, with a bunch of duplication. There are a few things to mitigate it (better IDE support, defaulting the parameter name to the name of the variable, ...) but a verbose language is always going to deter programmers.

This last part leads in to the biggest gain however, in my eyes. You can always afford to make something a bit more verbose if you are using it to better map the digital program to the real-world objects/concepts that it is representing. In this case, sum above does not take two ordered ints - it takes a lowerBound and an upperBound; callers can specify which bounds they are giving, and currying code can name the concepts it is putting in scope. It'd be good to acknowledge that the argument list was a good, albeit flawed, approximation of named parameters, and the sooner we support the ('truer') stringmap version, the sooner we can write cleaner programs.

Hopefully how we can do that will be covered in the next post - on rewriting executing code. As always, I'm very keen to hear your thoughts on ordered vs. named arguments! Starting points: are there more pros/cons? how big of a problem is being verbose? is it feasible to rearchitect the stack frame?

Tuesday, July 31, 2012

On Types - Part 2

Recap

Just a brief reminder to please first read Part 1 - which gives a brief overview of the usefulness of primitives, lists and string maps in a type system, plus the 'null' concept that allows them to be optional. This post extends those ideas with new concepts that I've found useful (or missed) when using a type system.

Undefined 

Given a null / unset option for values, this gets us pretty much to where we need to be in terms of storing information to use within an application - and numerous modifications we can apply to that information; in particular, it's enough for a basic CRUD system. But if you look closely at that previous link, you might notice where it falls down: modification (PUT vs PATCH in http).

Simply put, a standard update ('put') is where you copy every value from the given item into the target one, whereas patch lets you conditionally update only a subset. Note that you can always do a patch manually via read-modify-put, though much of the semantics is lost (resulting in concurrency problems). In my experience, the patch operation itself is used really frequently in applications (e.g. updating just an address for a contact), so is definitely worth supporting, but the question then is how to distinguish which values to update.

One common solution is to indicate 'don't update' with a null value, which runs into the inevitable problem of then being unable to specify when to actually set a value to null! E.g. how would you patch a node to become a leaf? The alternative is to allow field-level PUTs - and while this is a great idea, and should be supported, it results in request explosion (one for each field, instead of one for each substantial thing) and loses the batch atomicity that's inherent in patch methods. Once you support batched PUTs, it's (almost) equivalent to a PATCH.

Javascript (/actionscript) however does this very well - by having 'null' = no value, but adding the concept of 'undefined'; patching with null then sets a field to null, while patching with undefined leaves the field unchanged. Implementation hacks aside, 'undefined' is a great concept - consider models where a default value can be used in the absence of a value, i.e. a field that, when undefined, gets a default (possibly null) value; in a hierarchy (think: CSS), it could even walk up ancestors until a defined value is found. Use with care though: as mentioned above, conflation with null can lead to problems if the two need to be distinct (as with patch).

Delete

As seen above, in a patch method there's a difference between setting a value to the literal 'null' and not having it set; 'undefined' was used for the latter. The next problem then is: how can you set a value to 'undefined' in a patch method? So far we can create typed items (by setting them to non-null), update and patch them, but there's no support yet for deletion (= setting to undefined, which can be different from setting to null). Essentially, what's required is a way to patch the value with the literal 'undefined', rather than the semantic version, which would be ignored. The question then is how to patch something to literal-undefined... by setting it to literal-literal-undefined? Ad infinitum.

Obviously, this solution has reached a problematic point, with a never-ending list of literals required. I'm not sure of the best solution to this, so feel free to add any to the comments. Maybe the answer lies in coupling each field with a 'defined' property, so a null value becomes {defined: true, value: null}, undefined is {defined: false, value: null} and literal undefined is {defined: true, value: undefined}. The literals then follow naturally, e.g. literal literal undefined is {defined: true, value: {defined: true, value: {defined: false, value: null}}} - which certainly does seem hacky; however each layer is increasingly rare, so it's workable.
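One way out of the regress, sketched in Python (which has no built-in undefined): use unique out-of-band sentinels for 'semantic undefined' and 'literal undefined' rather than nesting literals. This is an assumption on my part, not a standard solution, and the names are invented:

```python
UNDEFINED = object()   # semantic: 'no opinion, leave the field alone'
DELETE = object()      # literal: 'remove the field entirely'

def patch(target, updates):
    patched = dict(target)
    for key, value in updates.items():
        if value is UNDEFINED:
            continue               # field untouched by this patch
        elif value is DELETE:
            patched.pop(key, None)  # field set back to undefined
        else:
            patched[key] = value   # including an explicit None
    return patched
```

Because the sentinels can never appear as ordinary data values, no further layers of literals are needed.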

In the wild: ML/proto/objects/JSON/...

One thing worth adding when considering something like this is to see what decades of programming have already given us in this area - after all, coders have needed types for a long time now. As mentioned, the simplest example of these types in action is classes in OOP - primitives, arrays (lists) and objects/structs (string maps) are the basis for model types in dozens of languages.

That's not all, however - these types almost precisely describe what's available in Javascript, which is one of the reasons I believe it's as popular a language as it is: it has string-map objects and arrays, but is just lacking in primitives with its unusual number system. Its popular predecessor XML (and other markup languages) even has something similar - an element is a string map with attributes for key-value fields, but with the limitation of a single special list property, its children; and with HTML5 data attributes gaining popularity, I can only see the two converging.

Finally, there's another family of objects interesting here: those designed for language independence. For example, Google's protocol buffers and Facebook's Thrift both attempt to capture aspects of the type systems of multiple languages; each has message/struct-based key-value string maps as the fundamental structure, repeated fields, a variety of primitive types, etc... plus a number of useful additional features to consider: optional fields with isset handling of null vs. undefined, default values, 'required'-ness, sets, enums, serialization, ... Overall though, there's a large similarity between them and objects, JSON, Python dictionaries, etc.
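The optional/isset idea from those systems can be mimicked directly in plain JavaScript - a sketch of my own, not any proto/Thrift API: presence of a key distinguishes "set to null" from "never set".

```javascript
// A message-like stringmap: 'title' was explicitly set to null,
// while 'author' was never set at all.
const msg = { title: null };

// Key presence plays the role of proto-style isset/has_* checks.
const hasTitle = 'title' in msg;    // true  - set, value happens to be null
const hasAuthor = 'author' in msg;  // false - unset (reads as undefined)
```

This is exactly the null-vs-undefined distinction the patch discussion above relies on, surfacing naturally from the string-map structure.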

Future extensions?

That feels like enough for this post, and possibly even enough for types in general - mostly, I feel that the features listed (primitives, lists, stringmaps, null/undefined/literals) are enough to cover a huge number of applications, while giving enough control to require only a non-huge amount of understandable code.

There are a few considerations provided by other systems, though, that are likely to be worth future blog posts - for example, enums are available in many of the examples listed above, but haven't yet been mentioned. In particular, functions are definitely going to be a later topic, as the examples above are split on them - Javascript and Python both allow functional members, Java and protocol buffers do not, and XML sits in the middle: by default it doesn't, but XSLT provides a lot of what is required. To hint at my thoughts, I think the Uniform Access Principle is flawed, but to know why, you'll have to tune in later!

As always, questions/comments etc. are welcome on this post or on G+.

Saturday, July 7, 2012

On Types - Part 1

Welcome to the first instalment of my Code Architecture blog, a place for me to document thoughts I've gained while working on multiple projects, where I think app development is headed, and how to get there.
As always, please feel free to add discussions at the bottom of each post or suggestions on topics to cover.

This initial post is going to cover what I consider one of the most fundamental topics - data types. That is, the values you can represent within an application, and how to use them. 

Primitives - Numbers, Letters, ...

These are the most low-level types available - ones that map directly onto binary values in hardware. For example, if you have a single bit, you can have a boolean value. This can be extended to 8-, 16-, 32- and 64-bit values (or higher), which can be interpreted as a number of types - integers, floating-point numbers, characters, enums, ... but that's about it in terms of how they're stored.

This gets us to the first area for expansion that I see, which is variable-length primitives. For example, UTF-8 is a common way to have a variable-length character set, and a few languages have arbitrary-precision integers, though there seems to be no single agreed-upon way of representing arbitrary-bit values using only 2^n-bit numbers.
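As one concrete illustration of the fixed-vs-arbitrary-precision gap (using JavaScript's BigInt, which arrived well after this was written): doubles silently lose integer precision past 2^53, while an arbitrary-precision integer does not.

```javascript
// JavaScript numbers are 64-bit doubles: integers above 2^53 collapse.
const collides = (2 ** 53 + 1 === 2 ** 53); // true - precision lost

// BigInt is arbitrary-precision, so the same values stay distinct.
const distinct = (2n ** 53n + 1n !== 2n ** 53n); // true - exact arithmetic
```

The storage cost is the trade-off: a BigInt grows with its magnitude rather than fitting a fixed hardware word, which is exactly the variable-length-primitive problem described above.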

Complex - Lists (and Strings), Maps ...

There are a few more concepts that appear all over the place while writing software - the most basic is a list (or array, vector, ...). Simply put, it's an ordered collection of primitives, with the most typical example being a list of characters, or a string. These are used everywhere for text, but at a higher level, lists are required to represent anything ordered, and can be combined for higher dimensions - e.g. a list of lists for a grid/table.
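As a trivial sketch of the list-of-lists point - nesting one list inside another gives a second dimension:

```javascript
// A 2-D grid as a list of lists; grid[row][column] addresses a cell.
const grid = [
  [1, 2, 3],
  [4, 5, 6],
];
const cell = grid[1][2]; // second row, third column: 6
```

The same nesting extends to any number of dimensions, or to ragged structures like a table whose rows differ in length.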

For unordered items, however, by far the most useful type appears to be the Map, at least where model data for an application is concerned - and in particular, StringMaps, where the values are each keyed by a string. A StringMap (or Dictionary) is the fundamental type in both JavaScript and Python, as well as being the underlying structure of an object in O-O languages, with field names mapping to their values.

More types exist (collections, sets, binary trees, ...) but mostly these can be implemented in terms of the above, and are less common in usage - so a modern language mostly just needs primitives, lists, strings and stringmaps for its data model types. As an aside, perhaps for a later post, I feel that (nested) stringmaps are the best approximation so far of how humans index information themselves - e.g. chair = {legs: 4, colour: "black", cushion: true} etc... hence their prevalence and usefulness today.
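The chair example can be written directly as a JavaScript object literal - a stringmap of string keys to mixed-type values - and composed with lists to model larger structures (names here are illustrative only):

```javascript
// The chair stringmap from above, as a plain JavaScript object.
const chair = { legs: 4, colour: "black", cushion: true };

// Keys are strings, so dot and bracket access are interchangeable.
chair["material"] = "wood";

// Stringmaps and lists combine into nested models.
const room = { name: "office", chairs: [chair] };
const legs = room.chairs[0].legs; // 4
```

This nesting of stringmaps and lists is the whole of the JSON data model, which is part of why it maps so cleanly onto the types discussed here.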

Type extension: Nullability

A language with the value families defined above can represent a lot of real-world entities in a program - but there's still a problem in that all values are required; to properly represent optional values, a type system also needs the concept of nullability - to be used when a value is unset.

For an example of what this gives you, consider a recursively defined data structure like a binary tree. If all nodes require two child nodes, and those children require two children, and those four grandchildren require two each, and so on, then suddenly you simply can't define all the (infinite) objects required - there needs to be a way of optionally not having some. Of course, it's possible to manage without nullability - e.g. add an 'isSet' boolean to every type, and a singleton 'null' value to use when isSet is false (or in the tree example, have separate 'parent' and 'leaf' types) - but these all degenerate into essentially equivalent schemes, with the concept of nullability being the simplest generalisation. Linked lists, skip lists, B-trees, tries, etc., and anything using them, are all improved by nullability.
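The binary-tree point can be sketched in a few lines of JavaScript: null children terminate the recursion, so finite trees are expressible without separate 'parent' and 'leaf' types.

```javascript
// A tree node: both children are nullable, defaulting to unset.
function node(value, left = null, right = null) {
  return { value, left, right };
}

// A three-node tree: 2 at the root, 1 and 3 as leaves.
const tree = node(2, node(1), node(3));

// Null is the base case of any recursion over the structure.
function size(n) {
  return n === null ? 0 : 1 + size(n.left) + size(n.right);
}
```

Without the null base case, size() (and the tree itself) would have to recurse forever - which is exactly the infinite-objects problem described above.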

Nullability also assists with asynchronous data (later post coming), providing a convenient default value for things before they're fetched; it does often have the downside of forcing repeated null checks, though one possible way of dealing with those will come later, and other techniques exist (@NotNull, or the Elvis operator).
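To illustrate the repeated-null-check downside and the Elvis-style shorthand: modern JavaScript (well after this was written) grew optional chaining (?.) and nullish coalescing (??), which collapse the boilerplate. The response shape here is a made-up example:

```javascript
// Unfetched async data defaulting to null, per the above.
const response = { user: null };

// Verbose: explicit null/undefined checks at every step.
const name1 = (response.user !== null && response.user !== undefined)
  ? response.user.name
  : "anonymous";

// Elvis-style shorthand: short-circuits on null/undefined.
const name2 = response.user?.name ?? "anonymous";
```

Both forms yield the same result; the shorthand just moves the null handling into the operator, keeping nullable fields cheap to work with.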

The combination of the types above (primitives, complex types, nulls) gives you enough to write a lot of things in a language - but there are still more aspects to types, problems I see appearing that are still to be solved by different constructs. That said, this post is already pretty long, so tune in next time for types part 2! And as mentioned before, please post any thoughts or requests in the comments section below.