research!rsc tag:research.swtch.com,2012:research.swtch.com 2019-03-01T11:01:00-05:00 Russ Cox https://swtch.com/~rsc rsc@swtch.com Transparent Logs for Skeptical Clients tag:research.swtch.com,2012:research.swtch.com/tlog 2019-03-01T11:00:00-05:00 2019-03-01T11:02:00-05:00 How an untrusted server can publish a verifiably append-only log.
<p> Suppose we want to maintain and publish a public, append-only log of data. Suppose also that clients are skeptical about our correct implementation and operation of the log: it might be to our advantage to leave things out of the log, or to enter something in the log today and then remove it tomorrow. How can we convince the client we are behaving?
<p> This post is about an elegant data structure we can use to publish a log of <i>N</i> records with these three properties: <ol> <li> For any specific record <i>R</i> in a log of length <i>N</i>, we can construct a proof of length <i>O</i>(lg <i>N</i>) allowing the client to verify that <i>R</i> is in the log. <li> For any earlier log observed and remembered by the client, we can construct a proof of length <i>O</i>(lg <i>N</i>) allowing the client to verify that the earlier log is a prefix of the current log. <li> An auditor can efficiently iterate over the records in the log.</ol>
<p> (In this post, “lg <i>N</i>” denotes the base-2 logarithm of <i>N</i>, reserving the word “log” to mean only “a sequence of records.”)
<p> The <a href="https://www.certificate-transparency.org/">Certificate Transparency</a> project publishes TLS certificates in this kind of log. Google Chrome uses property (1) to verify that an <a href="https://en.wikipedia.org/wiki/Extended_Validation_Certificate">extended validation certificate</a> is recorded in a known log before accepting the certificate. Property (2) ensures that an accepted certificate cannot later disappear from the log undetected. Property (3) allows an auditor to scan the entire certificate log at any later time to detect misissued or stolen certificates. All this happens without blindly trusting that the log itself is operating correctly. Instead, the clients of the log—Chrome and any auditors—verify correct operation of the log as part of accessing it.
<p> This post explains the design and implementation of this verifiably tamper-evident log, also called a <i>transparent log</i>. To start, we need some cryptographic building blocks.
<a class=anchor href="#cryptographic_hashes_authentication_and_commitments"><h2 id="cryptographic_hashes_authentication_and_commitments">Cryptographic Hashes, Authentication, and Commitments</h2></a>
<p> A <i>cryptographic hash function</i> is a deterministic function H that maps an arbitrary-size message <i>M</i> to a small fixed-size output H(<i>M</i>), with the property that it is infeasible in practice to produce any pair of distinct messages <i>M<sub>1</sub></i> ≠ <i>M<sub>2</sub></i> with identical hashes H(<i>M<sub>1</sub></i>) = H(<i>M<sub>2</sub></i>). Of course, what is feasible in practice changes. In 1995, SHA-1 was a reasonable cryptographic hash function. In 2017, SHA-1 became a <i>broken</i> cryptographic hash function, when researchers identified and demonstrated a <a href="https://shattered.io/">practical way to generate colliding messages</a>. Today, SHA-256 is believed to be a reasonable cryptographic hash function. Eventually it too will be broken.
<p> A (non-broken) cryptographic hash function provides a way to bootstrap a small amount of trusted data into a much larger amount of data.
Suppose I want to share a very large file with you, but I am concerned that the data may not arrive intact, whether due to random corruption or a <a href="TODO">man-in-the-middle attack</a>. I can meet you in person and hand you, written on a piece of paper, the SHA-256 hash of the file. Then, no matter what unreliable path the bits take, you can check whether you got the right ones by recomputing the SHA-256 hash of the download. If it matches, then you can be certain, assuming SHA-256 has not been broken, that you downloaded the exact bits I intended. The SHA-256 hash <i>authenticates</i>—that is, it proves the authenticity of—the downloaded bits, even though it is only 256 bits and the download is far larger. <p> We can also turn the scenario around, so that, instead of distrusting the network, you distrust me. If I tell you the SHA-256 of a file I promise to send, the SHA-256 serves as a verifiable <i>commitment</i> to a particular sequence of bits. I cannot later send a different bit sequence and convince you it is the file I promised. <p> A single hash can be an authentication or commitment of an arbitrarily large amount of data, but verification then requires hashing the entire data set. To allow selective verification of subsets of the data, we can use not just a single hash but instead a balanced binary tree of hashes, known as a Merkle tree. <a class=anchor href="#merkle_trees"><h2 id="merkle_trees">Merkle Trees</h2></a> <p> A Merkle tree is constructed from <i>N</i> records, where <i>N</i> is a power of two. First, each record is hashed independently, producing <i>N</i> hashes. Then pairs of hashes are themselves hashed, producing <i>N</i>/2 new hashes. Then pairs of those hashes are hashed, to produce <i>N</i>/4 hashes, and so on, until a single hash remains. This diagram shows the Merkle tree of size <i>N</i> = 16: <p> <img name="tlog-16" class="center pad" width=518 height=193 src="tlog-16.png" srcset="tlog-16.png 1x, tlog-16@1.5x.png 1.5x, tlog-16@2x.png 2x, tlog-16@3x.png 3x, tlog-16@4x.png 4x"> <p> The boxes across the bottom represent the 16 records. Each number in the tree denotes a single hash, with inputs connected by downward lines. We can refer to any hash by its coordinates: level <i>L</i> hash number <i>K</i>, which we will abbreviate h(<i>L</i>, <i>K</i>). At level 0, each hash’s input is a single record; at higher levels, each hash’s input is a pair of hashes from the level below.<blockquote> <p> h(0, <i>K</i>) = H(record <i>K</i>)<br> h(<i>L</i>+1, <i>K</i>) = H(h(<i>L</i>, 2 <i>K</i>), h(<i>L</i>, 2 <i>K</i>+1))</blockquote> <p> To prove that a particular record is contained in the tree represented by a given top-level hash (that is, to allow the client to authenticate a record, or verify a prior commitment, or both), it suffices to provide the hashes needed to recompute the overall top-level hash from the record’s hash. For example, suppose we want to prove that a certain bit string <i>B</i> is in fact record 9 in a tree of 16 records with top-level hash <i>T</i>. We can provide those bits along with the other hash inputs needed to reconstruct the overall tree hash using those bits. 
Specifically, the client can derive as well as we can that:<blockquote> <p> T = h(4, 0)<br> = H(h(3, 0), h(3, 1))<br> = H(h(3, 0), H(h(2, 2), h(2, 3)))<br> = H(h(3, 0), H(H(h(1, 4), h(1, 5)), h(2, 3)))<br> = H(h(3, 0), H(H(H(h(0, 8), h(0, 9)), h(1, 5)), h(2, 3)))<br> = H(h(3, 0), H(H(H(h(0, 8), H(record 9)), h(1, 5)), h(2, 3)))<br> = H(h(3, 0), H(H(H(h(0, 8), H(<i>B</i>)), h(1, 5)), h(2, 3)))</blockquote> <p> If we give the client the values [h(3, 0), h(0, 8), h(1, 5), h(2, 3)], the client can calculate H(<i>B</i>) and then combine all those hashes using the formula and check whether the result matches <i>T</i>. If so, the client can be cryptographically certain that <i>B</i> really is record 9 in the tree with top-level hash <i>T</i>. In effect, proving that <i>B</i> is a record in the Merkle tree with hash <i>T</i> is done by giving a verifiable computation of <i>T</i> with H(<i>B</i>) as an input. <p> Graphically, the proof consists of the sibling hashes (circled in blue) of nodes along the path (highlighted in yellow) from the record being proved up to the tree root. <p> <img name="tlog-r9-16" class="center pad" width=518 height=202 src="tlog-r9-16.png" srcset="tlog-r9-16.png 1x, tlog-r9-16@1.5x.png 1.5x, tlog-r9-16@2x.png 2x, tlog-r9-16@3x.png 3x, tlog-r9-16@4x.png 4x"> <p> In general, the proof that a given record is contained in the tree requires lg <i>N</i> hashes, one for each level below the root. <p> Building our log as a sequence of records hashed in a Merkle tree would give us a way to write an efficient (lg <i>N</i>-length) proof that a particular record is in the log. But there are two related problems to solve: our log needs to be defined for any length <i>N</i>, not just powers of two, and we need to be able to write an efficient proof that one log is a prefix of another. <a class=anchor href="#merkle_tree-structured_log"><h2 id="merkle_tree-structured_log">A Merkle Tree-Structured Log</h2></a> <p> To generalize the Merkle tree to non-power-of-two sizes, we can write <i>N</i> as a sum of decreasing powers of two, then build complete Merkle trees of those sizes for successive sections of the input, and finally hash the at-most-lg <i>N</i> complete trees together to produce a single top-level hash. For example, 13 = 8 + 4 + 1: <p> <img name="tlog-13" class="center pad" width=434 height=193 src="tlog-13.png" srcset="tlog-13.png 1x, tlog-13@1.5x.png 1.5x, tlog-13@2x.png 2x, tlog-13@3x.png 3x, tlog-13@4x.png 4x"> <p> The new hashes marked “x” combine the complete trees, building up from right to left, to produce the overall tree hash. Note that these hashes necessarily combine trees of different sizes and therefore hashes from different levels; for example, h(3, x) = H(h(2, 2), h(0, 12)). <p> The proof strategy for complete Merkle trees applies equally well to these incomplete trees. For example, the proof that record 9 is in the tree of size 13 is [h(3, 0), h(0, 8), h(1, 5), h(0, 12)]: <p> <img name="tlog-r9-13" class="center pad" width=437 height=202 src="tlog-r9-13.png" srcset="tlog-r9-13.png 1x, tlog-r9-13@1.5x.png 1.5x, tlog-r9-13@2x.png 2x, tlog-r9-13@3x.png 3x, tlog-r9-13@4x.png 4x"> <p> Note that h(0, 12) is included in the proof because it is the sibling of h(2, 2) in the computation of h(3, x). <p> We still need to be able to write an efficient proof that the log of size <i>N</i> with tree hash <i>T</i> is a prefix of the log of size <i>N</i>′ (&gt; <i>N</i>) with tree hash <i>T</i>′. 
Earlier, proving that <i>B</i> is a record in the Merkle tree with hash <i>T</i> was done by giving a verifiable computation of <i>T</i> using H(<i>B</i>) as an input. To prove that the log with tree hash <i>T</i> is included in the log with tree hash <i>T</i>′, we can follow the same idea: give verifiable computations of <i>T</i> and <i>T</i>′, in which all the inputs to the computation of <i>T</i> are also inputs to the computation of <i>T</i>′. For example, consider the trees of size 7 and 13: <p> <img name="tlog-o7-13" class="center pad" width=437 height=193 src="tlog-o7-13.png" srcset="tlog-o7-13.png 1x, tlog-o7-13@1.5x.png 1.5x, tlog-o7-13@2x.png 2x, tlog-o7-13@3x.png 3x, tlog-o7-13@4x.png 4x"> <p> In the diagram, the “x” nodes complete the tree of size 13 with hash <i>T</i><sub>1</sub><sub>3</sub>, while the “y” nodes complete the tree of size 7 with hash <i>T</i><sub>7</sub>. To prove that <i>T</i><sub>7</sub>’s leaves are included in <i>T</i><sub>1</sub><sub>3</sub>, we first give the computation of <i>T</i><sub>7</sub> in terms of complete subtrees (circled in blue):<blockquote> <p> <i>T</i><sub>7</sub> = H(h(2, 0), H(h(1, 2), h(0, 6)))</blockquote> <p> Then we give the computation of <i>T</i><sub>1</sub><sub>3</sub>, expanding hashes as needed to expose the same subtrees. Doing so exposes sibling subtrees (circled in red):<blockquote> <p> <i>T</i><sub>1</sub><sub>3</sub> = H(h(3, 0), H(h(2, 2), h(0, 12)))<br> = H(H(h(2, 0), h(2, 1)), H(h(2, 2), h(0, 12)))<br> = H(H(h(2, 0), H(h(1, 2), h(1, 3))), H(h(2, 2), h(0, 12)))<br> = H(H(h(2, 0), H(h(1, 2), H(h(0, 6), h(0, 7)))), H(h(2, 2), h(0, 12)))</blockquote> <p> Assuming the client knows the trees have sizes 7 and 13, it can derive the required decomposition itself. We need only supply the hashes [h(2, 0), h(1, 2), h(0, 6), h(0, 7), h(2, 2), h(0, 12)]. The client recalculates the <i>T</i><sub>7</sub> and <i>T</i><sub>1</sub><sub>3</sub> implied by the hashes and checks that they match the originals. <p> Note that these proofs only use hashes for completed subtrees—that is, numbered hashes, never the “x” or “y” hashes that combine differently-sized subtrees. The numbered hashes are <i>permanent</i>, in the sense that once such a hash appears in a tree of a given size, that same hash will appear in all trees of larger sizes. In contrast, the “x” and “y” hashes are <i>ephemeral</i>—computed for a single tree and never seen again. The hashes common to the decomposition of two different-sized trees therefore must always be permanent hashes. The decomposition of the larger tree could make use of ephemeral hashes for the exposed siblings, but we can easily use only permanent hashes instead. In the example above, the reconstruction of <i>T</i><sub>1</sub><sub>3</sub> from the parts of <i>T</i><sub>7</sub> uses h(2, 2) and h(0, 12) instead of assuming access to <i>T</i><sub>1</sub><sub>3</sub>’s h(3, x). Avoiding the ephemeral hashes extends the maximum record proof size from lg <i>N</i> hashes to 2 lg <i>N</i> hashes and the maximum tree proof size from 2 lg <i>N</i> hashes to 3 lg <i>N</i> hashes. Note that most top-level hashes, including <i>T</i><sub>7</sub> and <i>T</i><sub>1</sub><sub>3</sub>, are themselves ephemeral hashes, requiring up to lg <i>N</i> permanent hashes to compute. The exceptions are the power-of-two-sized trees <i>T</i><sub>1</sub>, <i>T</i><sub>2</sub>, <i>T</i><sub>4</sub>, <i>T</i><sub>8</sub>, and so on. 
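<p> To make the proof arithmetic concrete, here is a minimal Go sketch of record-proof verification for the power-of-two case, with names of my own choosing rather than any published API. It expects the sibling hashes ordered from level 0 upward (for record 9 in the tree of 16, that is h(0, 8), h(1, 5), h(2, 3), h(3, 0)) and folds them into a candidate top-level hash to compare against <i>T</i>. For simplicity it hashes records and interior nodes with plain SHA-256, omitting the distinct leaf and interior prefixes that a production design such as RFC 6962 adds. <pre>
package tlog

import "crypto/sha256"

// Hash is one node of the hash tree.
type Hash [sha256.Size]byte

// leafHash returns h(0, K) = H(record K).
func leafHash(record []byte) Hash {
	return sha256.Sum256(record)
}

// nodeHash returns h(L+1, K) = H(h(L, 2 K), h(L, 2 K+1)).
func nodeHash(left, right Hash) Hash {
	var buf [2 * sha256.Size]byte
	copy(buf[:sha256.Size], left[:])
	copy(buf[sha256.Size:], right[:])
	return sha256.Sum256(buf[:])
}

// verifyRecord reports whether record is record number index in the
// complete tree of 2^len(proof) records with top-level hash root.
// proof holds the sibling hashes from level 0 up to just below the root.
func verifyRecord(root Hash, index int64, record []byte, proof []Hash) bool {
	h := leafHash(record)
	for _, sibling := range proof {
		if index%2 == 0 {
			h = nodeHash(h, sibling) // our node is a left child
		} else {
			h = nodeHash(sibling, h) // our node is a right child
		}
		index /= 2
	}
	return h == root
}
</pre> <p> Handling the non-power-of-two trees described above additionally requires the client to use the tree size to decide how the incomplete subtrees combine; that bookkeeping is omitted from this sketch.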
<a class=anchor href="#storing_a_log"><h2 id="storing_a_log">Storing a Log</h2></a> <p> Storing the log requires only a few append-only files. The first file holds the log record data, concatenated. The second file is an index of the first, holding a sequence of int64 values giving the start offset of each record in the first file. This index allows efficient random access to any record by its record number. While we could recompute any hash tree from the record data alone, doing so would require <i>N</i>–1 hash operations for a tree of size <i>N</i>. Efficient generation of proofs therefore requires precomputing and storing the hash trees in some more accessible form. <p> As we noted in the previous section, there is significant commonality between trees. In particular, the latest hash tree includes all the permanent hashes from all earlier hash trees, so it is enough to store “only” the latest hash tree. A straightforward way to do this is to maintain lg <i>N</i> append-only files, each holding the sequence of hashes at one level of the tree. Because hashes are fixed size, any particular hash can be read efficiently by reading from the file at the appropriate offset. <p> To write a new log record, we must append the record data to the data file, append the offset of that data to the index file, and append the hash of the data to the level-0 hash file. Then, if we completed a pair of hashes in the level-0 hash file, we append the hash of the pair to the level-1 hash file; if that completed a pair of hashes in the level-1 hash file, we append the hash of that pair to the level-2 hash file; and so on up the tree. Each log record write will append a hash to at least one and at most lg <i>N</i> hash files, with an average of just under two new hashes per write. (A binary tree with <i>N</i> leaves has <i>N</i>–1 interior nodes.) <p> It is also possible to interlace lg <i>N</i> append-only hash files into a single append-only file, so that the log can be stored in only three files: record data, record index, and hashes. See Appendix A for details. Another possibility is to store the log in a pair of database tables, one for record data and one for hashes (the database can provide the record index itself). <p> Whether in files or in database tables, the stored form of the log is append-only, so cached data never goes stale, making it trivial to have parallel, read-only replicas of a log. In contrast, writing to the log is inherently centralized, requiring a dense sequence numbering of all records (and in many cases also duplicate suppression). An implementation using the two-table database representation can delegate both replication and coordination of writes to the underlying database, especially if the underlying database is globally-replicated and consistent, like <a href="https://ai.google/research/pubs/pub39966">Google Cloud Spanner</a> or <a href="https://www.cockroachlabs.com/docs/stable/architecture/overview.html">CockroachDB</a>. <p> It is of course not enough just to store the log. We must also make it available to clients. <a class=anchor href="#serving_a_log"><h2 id="serving_a_log">Serving a Log</h2></a> <p> Remember that each client consuming the log is skeptical about the log’s correct operation. The log server must make it easy for the client to verify two things: first, that any particular record is in the log, and second, that the current log is an append-only extension of a previously-observed earlier log. 
<p> To be useful, the log server must also make it easy to find a record given some kind of lookup key, and it must allow an auditor to iterate over the entire log looking for entries that don’t belong.
<p> To do all this, the log server must answer five queries: <ol> <li> <p> <i>Latest</i>() returns the current log size and top-level hash, cryptographically signed by the server for non-repudiation. <li> <p> <i>RecordProof</i>(<i>R</i>, <i>N</i>) returns the proof that record <i>R</i> is contained in the tree of size <i>N</i>. <li> <p> <i>TreeProof</i>(<i>N</i>, <i>N</i>′) returns the proof that the tree of size <i>N</i> is a prefix of the tree of size <i>N</i>′. <li> <p> <i>Lookup</i>(<i>K</i>) returns the record index <i>R</i> matching lookup key <i>K</i>, if any. <li> <p> <i>Data</i>(<i>R</i>) returns the data associated with record <i>R</i>.</ol>
<a class=anchor href="#verifying_a_log"><h2 id="verifying_a_log">Verifying a Log</h2></a>
<p> The client uses the first three queries to maintain a cached copy of the most recent log it has observed and make sure that the server never removes anything from an observed log. To do this, the client caches the most recently observed log size <i>N</i> and top-level hash <i>T</i>. Then, before accepting data bits <i>B</i> as record number <i>R</i>, the client verifies that <i>R</i> is included in that log. If necessary (that is, if <i>R</i> ≥ its cached <i>N</i>), the client updates its cached <i>N</i>, <i>T</i> to those of the latest log, but only after verifying that the latest log includes everything from the current cached log. In pseudocode: <pre>validate(bits B as record R):
    if R ≥ cached.N:
        N, T = server.Latest()
        if server.TreeProof(cached.N, N) cannot be verified:
            fail loudly
        cached.N, cached.T = N, T
    if server.RecordProof(R, cached.N) cannot be verified using B:
        fail loudly
    accept B as record R
</pre>
<p> The client’s proof verification ensures that the log server is behaving correctly, at least as observed by the client. If a devious server can distinguish individual clients, it can still serve different logs to different clients, so that a victim client sees invalid entries never exposed to other clients or auditors. But if the server does lie to a victim, the fact that the victim requires any later log to include what it has seen before means the server must keep up the lie, forever serving an alternate log containing the lie. This makes eventual detection more likely. For example, if the victim ever arrived through a proxy or compared its cached log against another client, or if the server ever made a mistake about which clients to lie to, the inconsistency would be readily exposed. Requiring the server to sign the <i>Latest</i>() response makes it impossible for the server to disavow the inconsistency, except by claiming to have been compromised entirely.
<p> The client-side checks are a little bit like how a Git client maintains its own cached copy of a remote repository and then, before accepting an update during <code>git</code> <code>pull</code>, verifies that the remote repository includes all local commits. But the transparent log client only needs to download lg <i>N</i> hashes for the verification, while Git downloads all <i>N</i> – <i>cached</i>.<i>N</i> new data records, and more generally, the transparent log client can selectively read and authenticate individual entries from the log, without being required to download and store a full copy of the entire log.
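<p> Before moving on to tiling, here is one hypothetical shape for the five queries listed above, written as a Go interface that a client like the pseudocode’s validate routine might program against. The method signatures, the detached signature returned by Latest, the error handling, and the Hash type (from the earlier sketch) are assumptions of this sketch, not part of any defined protocol. <pre>
// Server lists the five queries a transparent log server answers.
type Server interface {
	// Latest returns the current log size and top-level hash,
	// along with the server's signature over them.
	Latest() (size int64, top Hash, sig []byte, err error)

	// RecordProof returns the proof that record r is contained
	// in the tree of size n.
	RecordProof(r, n int64) ([]Hash, error)

	// TreeProof returns the proof that the tree of size n is a
	// prefix of the larger tree of size n2.
	TreeProof(n, n2 int64) ([]Hash, error)

	// Lookup returns the index of the record matching key, if any.
	Lookup(key string) (int64, error)

	// Data returns the content of record r.
	Data(r int64) ([]byte, error)
}
</pre>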
<a class=anchor href="#tiling_a_log"><h2 id="tiling_a_log">Tiling a Log</h2></a> <p> As described above, storing the log requires simple, append-only storage linear in the total log size, and serving or accessing the log requires network traffic only logarithmic in the total log size. This would be a completely reasonable place to stop (and is where Certificate Transparency as defined in <a href="https://tools.ietf.org/html/rfc6962">RFC 6962</a> stops). However, one useful optimization can both cut the hash storage in half and make the network traffic more cache-friendly, with only a minor increase in implementation complexity. That optimization is based on splitting the hash tree into tiles, like <a href="https://medium.com/google-design/google-maps-cb0326d165f5#ccfa">Google Maps splits the globe into tiles</a>. <p> A binary tree can be split into tiles of fixed height <i>H</i> and width 2<sup><i>H</i></sup>. For example, here is the permanent hash tree for the log with 27 records, split into tiles of height 2: <p> <img name="tlog-tile-27" class="center pad" width=847 height=236 src="tlog-tile-27.png" srcset="tlog-tile-27.png 1x, tlog-tile-27@1.5x.png 1.5x, tlog-tile-27@2x.png 2x, tlog-tile-27@3x.png 3x, tlog-tile-27@4x.png 4x"> <p> We can assign each tile a two-dimensional coordinate, analogous to the hash coordinates we’ve been using: tile(<i>L</i>, <i>K</i>) denotes the tile at tile level <i>L</i> (hash levels <i>H</i>·<i>L</i> up to <i>H</i>·(<i>L</i>+1)), <i>K</i>th from the left. For any given log size, the rightmost tile at each level may not yet be complete: the bottom row of hashes may contain only <i>W</i> &lt; 2<sup><i>H</i></sup> hashes. In that case we will write tile(<i>L</i>, <i>K</i>)/<i>W</i>. (When the tile is complete, the “/<i>W</i>” is omitted, understood to be 2<sup><i>H</i></sup>.) <a class=anchor href="#storing_tiles"><h2 id="storing_tiles">Storing Tiles</h2></a> <p> Only the bottom row of each tile needs to be stored: the upper rows can be recomputed by hashing lower ones. In our example, a tile of height two stores 4 hashes instead of 6, a 33% storage reduction. For tiles of greater heights, the storage reduction asymptotically approaches 50%. The cost is that reading a hash that has been optimized away may require reading as much as half a tile, increasing I/O requirements. For a real system, height four seems like a reasonable balance between storage costs and increased I/O overhead. It stores 16 hashes instead of 30—a 47% storage reduction—and (assuming SHA-256) a single 16-hash tile is only 512 bytes (a single disk sector!). <p> The file storage described earlier maintained lg <i>N</i> hash files, one for each level. Using tiled storage, we only write the hash files for levels that are a multiple of the tile height. For tiles of height 4, we’d only write the hash files for levels 0, 4, 8, 12, 16, and so on. When we need a hash at another level, we can read its tile and recompute the hash. <a class=anchor href="#serving_tiles"><h2 id="serving_tiles">Serving Tiles</h2></a> <p> The proof-serving requests <i>RecordProof</i>(<i>R</i>, <i>N</i>) and <i>TreeProof</i>(<i>N</i>, <i>N</i>′) are not particularly cache-friendly. For example, although <i>RecordProof</i>(<i>R</i>, <i>N</i>) often shares many hashes with both <i>RecordProof</i>(<i>R</i>+1, <i>N</i>) and <i>RecordProof</i>(<i>R</i>, <i>N</i>+1), the three are distinct requests that must be cached independently. 
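<p> Both tile storage and the tile-serving requests discussed next rely on the same coordinate bookkeeping: which tile holds a given permanent hash, and which stored bottom-row hashes recompute it. As a concrete sketch of that bookkeeping (with made-up names, assuming tiles of the given height that store only their bottom row, as described above), a helper like the following might be used. <pre>
// pow2 returns 2 to the power e, for small non-negative e.
func pow2(e int64) int64 {
	p := int64(1)
	for ; e != 0; e-- {
		p *= 2
	}
	return p
}

// tileForHash maps the permanent hash h(l, k) to the tile(tileLevel, tileIndex)
// that contains it, for tiles of the given height. It also returns lo and hi:
// the hashes at bottom-row positions lo up to (but not including) hi must be
// combined to recompute h(l, k); when l is itself a stored level, hi = lo+1
// and the hash can be read directly.
func tileForHash(height, l, k int64) (tileLevel, tileIndex, lo, hi int64) {
	tileLevel = l / height
	inTile := l % height // height of h(l, k) above the tile's bottom row
	tileIndex = k / pow2(height-inTile)
	width := pow2(inTile) // h(l, k) covers width bottom-row hashes
	lo = k*width - tileIndex*pow2(height)
	hi = lo + width
	return tileLevel, tileIndex, lo, hi
}
</pre> <p> For the height-2 tiles pictured above, tileForHash(2, 3, 1) reports that h(3, 1) lives in tile(1, 0) and is recomputed from the bottom-row hashes at positions 2 and 3, that is, h(2, 2) and h(2, 3).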
<p> A more cache-friendly approach would be to replace <i>RecordProof</i> and <i>TreeProof</i> by a general request <i>Hash</i>(<i>L</i>, <i>K</i>), serving a single permanent hash. The client can easily compute which specific hashes it needs, and there are many fewer individual hashes than whole proofs (2 <i>N</i> vs <i>N</i><sup>2</sup>/2), which will help the cache hit rate. Unfortunately, switching to <i>Hash</i> requests is inefficient: obtaining a record proof used to take one request and now takes up to 2 lg <i>N</i> requests, while tree proofs take up to 3 lg <i>N</i> requests. Also, each request delivers only a single hash (32 bytes): the request overhead is likely significantly larger than the payload. <p> We can stay cache-friendly while reducing the number of requests and the relative request overhead, at a small cost in bandwidth, by adding a request <i>Tile</i>(<i>L</i>, <i>K</i>) that returns the requested tile. The client can request the tiles it needs for a given proof, and it can cache tiles, especially those higher in the tree, for use in future proofs. <p> For a real system using SHA-256, a tile of height 8 would be 8 kB. A typical proof in a large log of, say, 100 million records would require only three complete tiles, or 24 kB downloaded, plus one incomplete tile (192 bytes) for the top of the tree. And tiles of height 8 can be served directly from stored tiles of height 4 (the size suggested in the previous section). Another reasonable choice would be to both store and serve tiles of height 6 (2 kB each) or 7 (4 kB each). <p> If there are caches in front of the server, each differently-sized partial tile must be given a different name, so that a client that needs a larger partial tile is not given a stale smaller one. Even though the tile height is conceptually constant for a given system, it is probably helpful to be explicit about the tile height in the request, so that a system can transition from one fixed tile height to another without ambiguity. For example, in a simple GET-based HTTP API, we could use <code>/tile/H/L/K</code> to name a complete tile and <code>/tile/H/L/K.W</code> to name a partial tile with only <i>W</i> hashes. <a class=anchor href="#authenticating_tiles"><h2 id="authenticating_tiles">Authenticating Tiles</h2></a> <p> One potential problem with downloading and caching tiles is not being sure that they are correct. An attacker might be able to modify downloaded tiles and cause proofs to fail unexpectedly. We can avoid this problem by authenticating the tiles against the signed top-level tree hash after downloading them. Specifically, if we have a signed top-level tree hash <i>T</i>, we first download the at most (lg <i>N</i>)/<i>H</i> tiles storing the hashes for the complete subtrees that make up <i>T</i>. In the diagram of <i>T</i><sub>2</sub><sub>7</sub> earlier, that would be tile(2, 0)/1, tile(1, 1)/2, and tile(0, 6)/3. Computing <i>T</i> will use every hash in these tiles; if we get the right <i>T</i>, the hashes are all correct. These tiles make up the top and right sides of the tile tree for the given hash tree, and now we know they are correct. To authenticate any other tile, we first authenticate its parent tile (the topmost parents are all authenticated already) and then check that the result of hashing all the hashes in the tile produces the corresponding entry in the parent tile. 
Using the <i>T</i><sub>2</sub><sub>7</sub> example again, given a downloaded tile purporting to be tile(0, 1), we can compute<blockquote> <p> h(2, 1) = H(H(h(0, 4), h(0, 5)), H(h(0, 6), h(0, 7)))</blockquote> <p> and check whether that value matches the h(2, 1) recorded directly in an already-authenticated tile(1, 0). If so, that authenticates the downloaded tile. <a class=anchor href="#summary"><h2 id="summary">Summary</h2></a> <p> Putting this all together, we’ve seen how to publish a transparent (tamper-evident, immutable, append-only) log with the following properties: <ul> <li> A client can verify any particular record using <i>O</i>(lg <i>N</i>) downloaded bytes. <li> A client can verify any new log contains an older log using <i>O</i>(lg <i>N</i>) downloaded bytes. <li> For even a large log, these verifications can be done in 3 RPCs of about 8 kB each. <li> The RPCs used for verification can be made to proxy and cache well, whether for network efficiency or possibly for privacy. <li> Auditors can iterate over the entire log looking for bad entries. <li> Writing <i>N</i> records defines a sequence of <i>N</i> hash trees, in which the <i>n</i>th tree contains 2 <i>n</i> – 1 hashes, a total of <i>N</i><sup>2</sup> hashes. But instead of needing to store <i>N</i><sup>2</sup> hashes, the entire sequence can be compacted into at most 2 <i>N</i> hashes, with at most lg <i>N</i> reads required to reconstruct a specific hash from a specific tree. <li> Those 2 <i>N</i> hashes can themselves be compacted down to 1.06 <i>N</i> hashes, at a cost of potentially reading 8 adjacent hashes to reconstruct any one hash from the 2 <i>N</i>.</ul> <p> Overall, this structure makes the log server itself essentially untrusted. It can’t remove an observed record without detection. It can’t lie to one client without keeping the client on an alternate timeline forever, making detection easy by comparing against another client. The log itself is also easily proxied and cached, so that even if the main server disappeared, replicas could keep serving the cached log. Finally, auditors can check the log for entries that should not be there, so that the actual content of the log can be verified asynchronously from its use. <a class=anchor href="#further_reading"><h2 id="further_reading">Further Reading</h2></a> <p> The original sources needed to understand this data structure are all quite readable and repay careful study. Ralph Merkle introduced Merkle trees in his Ph.D. thesis, “<a href="http://www.merkle.com/papers/Thesis1979.pdf">Secrecy, authentication, and public-key systems</a>” (1979), using them to convert a digital signature scheme with single-use public keys into one with multiple-use keys. The multiple-use key was the top-level hash of a tree of 2<sup><i>L</i></sup> pseudorandomly generated single-use keys. Each signature began with a specific single-use key, its index <i>K</i> in the tree, and a proof (consisting of <i>L</i> hashes) authenticating the key as record <i>K</i> in the tree. Adam Langley’s blog post “<a href="https://www.imperialviolet.org/2013/07/18/hashsig.html">Hash based signatures</a>” (2013) gives a short introduction to the single-use signature scheme and how Merkle’s tree helped. <p> Scott Crosby and Dan Wallach introduced the idea of using a Merkle tree to store a verifiably append-only log in their paper, “<a href="http://static.usenix.org/event/sec09/tech/full_papers/crosby.pdf">Efficient Data Structures for Tamper-Evident Logging</a>” (2009). 
The key advance was the efficient proof that one tree’s log is contained as a prefix of a larger tree’s log.
<p> Ben Laurie, Adam Langley, and Emilia Kasper adopted this verifiable, transparent log in the design of the <a href="https://www.certificate-transparency.org/">Certificate Transparency (CT) system</a> (2012), detailed in <a href="https://tools.ietf.org/html/rfc6962">RFC 6962</a> (2013). CT’s computation of the top-level hashes for non-power-of-two-sized logs differs in minor ways from Crosby and Wallach’s paper; this post uses the CT definitions. Ben Laurie’s ACM Queue article, “<a href="https://queue.acm.org/detail.cfm?id=2668154">Certificate Transparency: Public, verifiable, append-only logs</a>” (2014), presents a high-level overview and additional motivation and context.
<p> Adam Eijdenberg, Ben Laurie, and Al Cutter’s paper “<a href="https://github.com/google/trillian/blob/master/docs/papers/VerifiableDataStructures.pdf">Verifiable Data Structures</a>” (2015) presents Certificate Transparency’s log as a general building block—a transparent log—for use in a variety of systems. It also introduces an analogous transparent map from arbitrary keys to arbitrary values, perhaps a topic for a future post.
<p> Google’s “General Transparency” server, <a href="https://github.com/google/trillian/blob/master/README.md">Trillian</a>, is a production-quality storage implementation for both transparent logs and transparent maps. The RPC service serves proofs, not hashes or tiles, but the server <a href="https://github.com/google/trillian/blob/master/docs/storage/storage.md">uses tiles in its internal storage</a>.
<p> To authenticate modules (software packages) in the Go language ecosystem, we are <a href="https://blog.golang.org/modules2019">planning to use a transparent log</a> to store the expected cryptographic hashes of specific module versions, so that a client can be cryptographically certain that it will download the same software tomorrow that it downloaded today. For that system’s network service, we plan to serve tiles directly, not proofs. This post effectively serves as an extended explanation of the transparent log, for reference from <a href="https://golang.org/design/25530-notary">the Go-specific design</a>.
<a class=anchor href="#appendix_a"><h2 id="appendix_a">Appendix A: Postorder Storage Layout</h2></a>
<p> The file-based storage described earlier held the permanent hash tree in lg <i>N</i> append-only files, one for each level of the tree. The hash h(<i>L</i>, <i>K</i>) would be stored in the <i>L</i>th hash file at offset <i>K</i> · <i>HashSize</i>.
<p> Crosby and Wallach pointed out that it is easy to merge the lg <i>N</i> hash tree levels into a single, append-only hash file by using the postorder numbering of the binary tree, in which a parent hash is stored immediately after its rightmost child. For example, the permanent hash tree after writing <i>N</i> = 13 records is laid out like:
<p> <img name="tlog-post-13" class="center pad" width=560 height=157 src="tlog-post-13.png" srcset="tlog-post-13.png 1x, tlog-post-13@1.5x.png 1.5x, tlog-post-13@2x.png 2x, tlog-post-13@3x.png 3x, tlog-post-13@4x.png 4x">
<p> In the diagram, each hash is numbered and aligned horizontally according to its location in the interlaced file.
<p> The postorder numbering makes the hash file append-only: each new record completes between 1 and lg <i>N</i> new hashes (on average 2), which are simply appended to the file, lower levels first.
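<p> As a sketch of that append step, the helper below lists which hash coordinates are written when record <i>n</i> (counting from zero) is added; they are appended to the postorder file in exactly this order, lower levels first. The function name is mine, chosen for illustration. <pre>
// hashesForRecord returns the coordinates (level, index) of the hashes
// appended when record n (0-based) is written: the new leaf hash first,
// then each parent whose pair of children it completes.
func hashesForRecord(n int64) [][2]int64 {
	var coords [][2]int64
	level, index := int64(0), n
	for {
		coords = append(coords, [2]int64{level, index})
		if index%2 == 0 {
			break // no right sibling exists yet, so nothing above completes
		}
		level, index = level+1, index/2
	}
	return coords
}
</pre> <p> For example, writing record 12 appends only h(0, 12), while writing record 7 appends h(0, 7), h(1, 3), h(2, 1), and h(3, 0), matching the 13-record diagram above.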
<p> Reading a specific hash from the file can still be done with a single read at a computable offset, but the calculation is no longer completely trivial. Hashes at level 0 are placed by adding in gaps for completed higher-level hashes, and a hash at any higher level immediately follows its right child hash:<blockquote> <p> seq(0, <i>K</i>) = <i>K</i> + <i>K</i>/2 + <i>K</i>/4 + <i>K</i>/8 + ... <br> seq(<i>L</i>, <i>K</i>) = seq(<i>L</i>–1, 2 <i>K</i> + 1) + 1 = seq(0, 2<sup><i>L</i></sup> (<i>K</i>+1) – 1) + <i>L</i></blockquote> <p> The interlaced layout also improves locality of access. Reading a proof typically means reading one hash from each level, all clustered around a particular leaf in the tree. If each tree level is stored separately, each hash is in a different file and there is no possibility of I/O overlap. But when the tree is stored in interlaced form, the accesses at the bottom levels will all be near each other, making it possible to fetch many of the needed hashes with a single disk read. <a class=anchor href="#appendix_b"><h2 id="appendix_b">Appendix B: Inorder Storage Layout</h2></a> <p> A different way to interlace the lg <i>N</i> hash files would be to use an inorder tree numbering, in which each parent hash is stored between its left and right subtrees: <p> <img name="tlog-in-13" class="center pad" width=602 height=157 src="tlog-in-13.png" srcset="tlog-in-13.png 1x, tlog-in-13@1.5x.png 1.5x, tlog-in-13@2x.png 2x, tlog-in-13@3x.png 3x, tlog-in-13@4x.png 4x"> <p> This storage order does not correspond to append-only writes to the file, but each hash entry is still write-once. For example, with 13 records written, as in the diagram, hashes have been stored at indexes 0–14, 16–22 and 24, but not yet at indexes 15 and 23, which will eventually hold h(4, 0) and h(3, 1). In effect, the space for a parent hash is reserved when its left subtree has been completed, but it can only be filled in later, once its right subtree has also been completed. <p> Although the file is no longer append-only, the inorder numbering has other useful properties. First, the offset math is simpler:<blockquote> <p> seq(0, <i>K</i>) = 2 <i>K</i> <br> seq(<i>L</i>, <i>K</i>) = 2<sup><i>L</i>+1</sup> <i>K</i> + 2<sup><i>L</i></sup> – 1</blockquote> <p> Second, locality is improved. Now each parent hash sits exactly in the middle of its child subtrees, instead of on the far right side. <a class=anchor href="#appendix_c"><h2 id="appendix_c">Appendix C: Tile Storage Layout</h2></a> <p> Storing the hash tree in lg <i>N</i> separate levels made converting to tile storage very simple: just don’t write (<i>H</i>–1)/<i>H</i> of the files. The simplest tile implementation is probably to use separate files, but it is worth examining what it would take to convert an interlaced hash storage file to tile storage. It’s not as straightforward as omitting a few files. It’s not enough to just omit the hashes at certain levels: we also want each tile to appear contiguously in the file. For example, for tiles of height 2, the first tile at tile level 1 stores hashes h(2, 0)–h(2, 3), but neither the postorder nor inorder interlacing would place those four hashes next to each other. <p> Instead, we must simply define that tiles are stored contiguously and then decide a linear tile layout order. For tiles of height 2, the tiles form a 4-ary tree, and in general, the tiles form a 2<sup><i>H</i></sup>-ary tree. 
We could use a postorder layout, as in Appendix A:<blockquote> <p> seq(0, <i>K</i>) = <i>K</i> + <i>K</i>/2<sup><i>H</i></sup> + <i>K</i>/2<sup>2<i>H</i></sup> + <i>K</i>/2<sup>3<i>H</i></sup> + ... <br> seq(<i>L</i>, <i>K</i>) = seq(<i>L</i>–1, 2<sup><i>H</i></sup> <i>K</i> + 2<sup><i>H</i></sup> – 1) + 1 = seq(0, 2<sup><i>H</i>·<i>L</i></sup> (<i>K</i>+1) – 1) + <i>L</i></blockquote>
<p> The postorder tile sequence places a parent tile immediately after its rightmost child tile, but the parent tile begins to be written after the leftmost child tile is completed. This means writing increasingly far ahead of the filled part of the hash file. For example, with tiles of height 2, the first hash of tile(2, 0) (postorder index 20) is written after filling tile(1, 0) (postorder index 4):
<p> <img name="tlog-tile-post-16" class="center pad" width=498 height=126 src="tlog-tile-post-16.png" srcset="tlog-tile-post-16.png 1x, tlog-tile-post-16@1.5x.png 1.5x, tlog-tile-post-16@2x.png 2x, tlog-tile-post-16@3x.png 3x, tlog-tile-post-16@4x.png 4x">
<p> The hash file catches up—there are no tiles written after index 20 until the hash file fills in entirely behind it—but then jumps ahead again—finishing tile 20 triggers writing the first hash into tile 84. In general only the first 1/2<sup><i>H</i></sup> or so of the hash file is guaranteed to be densely packed. Most file systems efficiently support files with large holes, but not all do: we may want to use a different tile layout to avoid arbitrarily large holes.
<p> Placing a parent tile immediately after its leftmost child’s completed subtree would eliminate all holes (other than incomplete tiles) and would seem to correspond to the inorder layout of Appendix B:
<p> <img name="tlog-tile-in1-16" class="center pad" width=498 height=126 src="tlog-tile-in1-16.png" srcset="tlog-tile-in1-16.png 1x, tlog-tile-in1-16@1.5x.png 1.5x, tlog-tile-in1-16@2x.png 2x, tlog-tile-in1-16@3x.png 3x, tlog-tile-in1-16@4x.png 4x">
<p> But while the tree structure is regular, the numbering is not. Instead, the offset math is more like the postorder traversal. A simpler but far less obvious alternative is to vary the exact placement of the parent tiles relative to the subtrees:
<p> <img name="tlog-tile-code-16" class="center pad" width=498 height=126 src="tlog-tile-code-16.png" srcset="tlog-tile-code-16.png 1x, tlog-tile-code-16@1.5x.png 1.5x, tlog-tile-code-16@2x.png 2x, tlog-tile-code-16@3x.png 3x, tlog-tile-code-16@4x.png 4x"><blockquote> <p> seq(<i>L</i>, <i>K</i>) = (<i>K</i> + (<i>K</i> + <i>B</i> – 2)/(<i>B</i> – 1))<sub><i>B</i></sub> || (1)<sub><i>B</i></sub><sup><i>L</i></sup></blockquote>
<p> Here, (<i>X</i>)<sub><i>B</i></sub> means <i>X</i> written as a base-<i>B</i> number, || denotes concatenation of base-<i>B</i> numbers, (1)<sub><i>B</i></sub><sup><i>L</i></sup> means the base-<i>B</i> digit 1 repeated <i>L</i> times, and the base is <i>B</i> = 2<sup><i>H</i></sup>.
<p> This encoding generalizes the inorder binary-tree traversal (<i>H</i> = 1, <i>B</i> = 2), preserving its regular offset math at the cost of losing its regular tree structure. Since we only care about doing the math, not exactly what the tree looks like, this is probably a reasonable tradeoff. For more about this surprising ordering, see my blog post, “<a href="https://research.swtch.com/treenum">An Encoded Tree Traversal</a>.” An Encoded Tree Traversal tag:research.swtch.com,2012:research.swtch.com/treenum 2019-02-25T12:00:00-05:00 2019-02-25T12:02:00-05:00 An unexpected tree traversal ordering.
<p> Every basic data structures course identifies three ways to traverse a binary tree. It’s not entirely clear how to generalize them to <i>k</i>-ary trees, and I recently noticed an unexpected ordering that I’d like to know more about. If you know of references to this ordering, please leave a comment or email me (<i>rsc@swtch.com</i>). <a class=anchor href="#binary_tree_orderings"><h2 id="binary_tree_orderings">Binary Tree Orderings</h2></a> <p> First a refresher about binary-tree orderings to set up an analogy to <i>k</i>-ary trees. <p> Preorder visits a node before its left and right subtrees: <p> <img name="treenum-b2-pre" class="center pad" width=625 height=138 src="treenum-b2-pre.png" srcset="treenum-b2-pre.png 1x, treenum-b2-pre@1.5x.png 1.5x, treenum-b2-pre@2x.png 2x, treenum-b2-pre@3x.png 3x, treenum-b2-pre@4x.png 4x"> <p> Inorder visits a node between its left and right subtrees: <p> <img name="treenum-b2-in" class="center pad" width=625 height=138 src="treenum-b2-in.png" srcset="treenum-b2-in.png 1x, treenum-b2-in@1.5x.png 1.5x, treenum-b2-in@2x.png 2x, treenum-b2-in@3x.png 3x, treenum-b2-in@4x.png 4x"> <p> Postorder visits a node after its left and right subtrees: <p> <img name="treenum-b2-post" class="center pad" width=625 height=138 src="treenum-b2-post.png" srcset="treenum-b2-post.png 1x, treenum-b2-post@1.5x.png 1.5x, treenum-b2-post@2x.png 2x, treenum-b2-post@3x.png 3x, treenum-b2-post@4x.png 4x"> <p> Each picture shows the same 16-leaf, 31-node binary tree, with the nodes numbered and also placed horizontally using the order visited in the given traversal. <p> It was observed long ago that one way to represent a tree in linear storage is to record the nodes in a fixed order (such as one of these), along with a separate array giving the number of children of each node. In the pictures, the trees are complete, balanced trees, so the number of children of each node can be derived from the number of total leaves. (For more, see Knuth Volume 1 §2.3.1; for references, see §2.3.1.6, and §2.6.) <p> It is convenient to refer to nodes in a tree by a two-dimensional coordinate (<i>l</i>, <i>n</i>), consisting of the level of the node (with 0 being the leaves) and its sequence number at that level. For example, the root of the 16-node tree has coordinate (4, 0), while the leaves are (0, 0) through (0, 15). <p> When storing a tree using a linearized ordering such as these, it is often necessary to be able to convert a two-dimensional coordinate to its index in the linear ordering. For example, the right child of the root—node (3, 1)—has number 16, 23, and 29 in the three different orderings. <p> The linearized pre-ordering of (<i>l</i>, <i>n</i>) is given by:<blockquote> <p> seq(<i>L</i>, 0) = 0 (<i>L</i> is height of tree)<br> seq(<i>l</i>, <i>n</i>) = seq(<i>l</i>+1, <i>n</i>/2) + 1 (<i>n</i> even)<br> seq(<i>l</i>, <i>n</i>) = seq(<i>l</i>+1, <i>n</i>/2) + 2<sup><i>l</i>+1</sup> (<i>n</i> odd)</blockquote> <p> This ordering is awkward because it changes depending on the height of the tree. <p> The linearized post-ordering of (<i>l</i>, <i>n</i>) is given by:<blockquote> <p> seq(0, <i>n</i>) = <i>n</i> + <i>n</i>/2 + <i>n</i>/4 + <i>n</i>/8 + ...<br> seq(<i>l</i>, <i>n</i>) = seq(<i>l</i>–1, 2 <i>n</i> + 1) + 1 = seq(0, 2<sup><i>l</i></sup> <i>n</i> + 2<sup><i>l</i></sup> – 1) + <i>l</i></blockquote> <p> This ordering is independent of the height of the tree, but the leaf numbering is still a little complex. <p> The linearized in-ordering is much more regular. 
It’s clear just looking at it that seq(0, <i>n</i>) = 2 <i>n</i>, and in fact a single equation applies to all levels:<blockquote> <p> seq(<i>l</i>, <i>n</i>) = 2<sup><i>l</i>+1</sup> <i>n</i> + 2<sup><i>l</i></sup> – 1</blockquote>
<p> If you need to linearize a complete binary tree, using the in-order traversal has the simplest math.
<a class=anchor href="#k-ary_tree_orderings"><h2 id="k-ary_tree_orderings"><i>k</i>-ary Tree Orderings</h2></a>
<p> The three binary orderings correspond to visiting the node after 0, 1, or 2 of its child subtrees. To generalize to <i>k</i>-ary trees, we can visit the node after any number of subtrees from 0 to <i>k</i>, producing <i>k</i>+1 orderings. For example, for <i>k</i> = 3, here are the four generalized orderings of a 27-leaf, 40-node 3-ary tree:
<p> Preorder (inorder-0):
<p> <img name="treenum-b3-pre" class="center pad" width=667 height=128 src="treenum-b3-pre.png" srcset="treenum-b3-pre.png 1x, treenum-b3-pre@1.5x.png 1.5x, treenum-b3-pre@2x.png 2x, treenum-b3-pre@3x.png 3x, treenum-b3-pre@4x.png 4x">
<p> Inorder-1:
<p> <img name="treenum-b3-in1" class="center pad" width=667 height=128 src="treenum-b3-in1.png" srcset="treenum-b3-in1.png 1x, treenum-b3-in1@1.5x.png 1.5x, treenum-b3-in1@2x.png 2x, treenum-b3-in1@3x.png 3x, treenum-b3-in1@4x.png 4x">
<p> Inorder-2:
<p> <img name="treenum-b3-in2" class="center pad" width=667 height=128 src="treenum-b3-in2.png" srcset="treenum-b3-in2.png 1x, treenum-b3-in2@1.5x.png 1.5x, treenum-b3-in2@2x.png 2x, treenum-b3-in2@3x.png 3x, treenum-b3-in2@4x.png 4x">
<p> Postorder (inorder-3):
<p> <img name="treenum-b3-post" class="center pad" width=667 height=128 src="treenum-b3-post.png" srcset="treenum-b3-post.png 1x, treenum-b3-post@1.5x.png 1.5x, treenum-b3-post@2x.png 2x, treenum-b3-post@3x.png 3x, treenum-b3-post@4x.png 4x">
<p> Just looking at the leaves of each, <i>none</i> of them has a nice simple form with regular gaps like the seq(0, <i>n</i>) = 2 <i>n</i> of in-order traversal for binary trees. Instead, both the possible “in-order” traversals end up with equations more like the post-order traversal. What happened? Where did the nice, regular pattern go?
<a class=anchor href="#unexpected_ordering"><h2 id="unexpected_ordering">An Unexpected Ordering</h2></a>
<p> In a binary tree, the in-order numbering has the property that after the first leaf, one new parent (non-leaf) node is introduced before each additional one leaf. This works out nicely because the number of parent nodes in a binary tree of <i>N</i> leaves is <i>N</i>–1. The number of parent nodes in a <i>k</i>-ary tree of <i>N</i> leaves is (<i>N</i>–1)/(<i>k</i>–1), so we could try to build a nicer numbering by, after the first leaf, introducing one new parent node before each additional <i>k</i>–1 leaf nodes. That is, the leaves would be numbered by<blockquote> <p> seq(0, <i>n</i>) = <i>n</i> + (<i>n</i>+<i>k</i>–2)/(<i>k</i>–1),</blockquote>
<p> which gives this leaf structure:
<p> <img name="treenum-b3-code-chop" class="center pad" width=667 height=14 src="treenum-b3-code-chop.png" srcset="treenum-b3-code-chop.png 1x, treenum-b3-code-chop@1.5x.png 1.5x, treenum-b3-code-chop@2x.png 2x, treenum-b3-code-chop@3x.png 3x, treenum-b3-code-chop@4x.png 4x">
<p> But how do we fill in the gaps? The first three triples—0, 2, 3 and 5, 6, 8 and 9, 11, 12—clearly get nodes 1, 4, 7, and 10 as their three parents and one grandparent, but which is the grandparent?
The binary in-order traversal was very self-similar, so let’s try the same thing here: after the first node, reserve one node for higher levels before each <i>k</i>–1 nodes at this level. That is, the parents are 1, 7, 10, and the grandparent is 4.
<p> Applying this process throughout the tree, we end up with this traversal order (inorder-G, for gap-induced):
<p> <img name="treenum-b3-code" class="center pad" width=667 height=128 src="treenum-b3-code.png" srcset="treenum-b3-code.png 1x, treenum-b3-code@1.5x.png 1.5x, treenum-b3-code@2x.png 2x, treenum-b3-code@3x.png 3x, treenum-b3-code@4x.png 4x">
<p> For contrast, here is the ordering from the previous section that visited each node after its first subtree (inorder-1):
<p> <img name="treenum-b3-in1" class="center pad" width=667 height=128 src="treenum-b3-in1.png" srcset="treenum-b3-in1.png 1x, treenum-b3-in1@1.5x.png 1.5x, treenum-b3-in1@2x.png 2x, treenum-b3-in1@3x.png 3x, treenum-b3-in1@4x.png 4x">
<p> The inorder-1 traversal has a regular tree structure but irregular numbering. In contrast, the inorder-G traversal has an irregular tree structure but very regular numbering that generalizes the binary inorder numbering:<blockquote> <p> seq(<i>l</i>, <i>n</i>) = <i>k</i><sup><i>l</i></sup> (<i>n</i> + (<i>n</i>+<i>k</i>–2)/(<i>k</i>–1)) + <i>k</i><sup><i>l</i>-1</sup> + <i>k</i><sup><i>l</i>-2</sup> + ... + <i>k</i><sup>0</sup></blockquote>
<p> For a binary tree, inorder-1 and inorder-G are the same: the traversal has both a regular tree structure and a regular numbering. But for <i>k</i>-ary trees, it seems you can pick only one.
<p> The regularity of the numbering is easiest to see in base <i>k</i>. For example, here is the binary inorder traversal with binary numbering:
<p> <img name="treenum-b2-d2-in" class="center pad" width=1003 height=138 src="treenum-b2-d2-in.png" srcset="treenum-b2-d2-in.png 1x, treenum-b2-d2-in@1.5x.png 1.5x, treenum-b2-d2-in@2x.png 2x, treenum-b2-d2-in@3x.png 3x, treenum-b2-d2-in@4x.png 4x">
<p> The bottom row uses numbers ending in 0; the next row up uses numbers ending in 01; then 011; and so on.
<p> For the 3-ary tree, it is the inorder-G traversal (not inorder-1 or inorder-2) that produces an equivalent pattern:
<p> <img name="treenum-b3-d3-in" class="center pad" width=1065 height=128 src="treenum-b3-d3-in.png" srcset="treenum-b3-d3-in.png 1x, treenum-b3-d3-in@1.5x.png 1.5x, treenum-b3-d3-in@2x.png 2x, treenum-b3-d3-in@3x.png 3x, treenum-b3-d3-in@4x.png 4x">
<p> The bottom row uses numbers ending in 0 or 2; the next row up uses numbers ending in 01 or 21; then 011 or 211; and so on. The general rule is that<blockquote> <p> seq(<i>l</i>, <i>n</i>) = (<i>n</i> + (<i>n</i>+<i>k</i>–2)/(<i>k</i>–1))<sub><i>k</i></sub> || (1)<sub><i>k</i></sub><sup><i>l</i></sup></blockquote>
<p> where (<i>x</i>)<sub><i>k</i></sub> means <i>x</i> written as a base-<i>k</i> number, || denotes concatenation of base-<i>k</i> numbers, and (1)<sub><i>k</i></sub><sup><i>l</i></sup> means the base-<i>k</i> digit 1 repeated <i>l</i> times.
<p> Through a roundabout way, then, we’ve ended up with a tree traversal that’s really just a nice base-<i>k</i> encoding. There must be existing uses of this encoding, but I’ve been unable to find any or even determine what its name is.
<a class=anchor href="#further_reading"><h2 id="further_reading">Further Reading?</h2></a>
<p> Does anyone know of any places this has appeared before? I’d love to read about them. Thanks.
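<p> As a numerical check of the inorder-G encoding described above, here is a small Go sketch; the function name is mine. It computes the leaf position and then appends <i>l</i> base-<i>k</i> ones. <pre>
// seqG returns the inorder-G position of node (level, n) in a k-ary tree:
// the base-k digits of the leaf formula followed by level ones.
func seqG(k, level, n int64) int64 {
	leaf := n + (n+k-2)/(k-1) // seq(0, n)
	ones := int64(0)          // the base-k number 11...1 with level digits
	pow := int64(1)           // k to the power level
	for i := int64(0); i != level; i++ {
		ones = ones*k + 1
		pow *= k
	}
	return leaf*pow + ones
}
</pre> <p> For the 3-ary example, seqG(3, 0, <i>n</i>) gives the leaves 0, 2, 3, 5, 6, 8, ...; seqG(3, 1, <i>n</i>) gives the parents 1, 7, 10, ...; and seqG(3, 2, 0) gives the grandparent 4, matching the inorder-G picture. For <i>k</i> = 2 it reduces to the binary in-order formula 2<sup><i>l</i>+1</sup> <i>n</i> + 2<sup><i>l</i></sup> – 1.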
Our Software Dependency Problem tag:research.swtch.com,2012:research.swtch.com/deps 2019-01-23T11:00:00-05:00 2019-01-23T11:02:00-05:00 Download and run code from strangers on the internet. What could go wrong? <p> For decades, discussion of software reuse was far more common than actual software reuse. Today, the situation is reversed: developers reuse software written by others every day, in the form of software dependencies, and the situation goes mostly unexamined. <p> My own background includes a decade of working with Google’s internal source code system, which treats software dependencies as a first-class concept,<a class=footnote id=body1 href="#note1"><sup>1</sup></a> and also developing support for dependencies in the Go programming language.<a class=footnote id=body2 href="#note2"><sup>2</sup></a> <p> Software dependencies carry with them serious risks that are too often overlooked. The shift to easy, fine-grained software reuse has happened so quickly that we do not yet understand the best practices for choosing and using dependencies effectively, or even for deciding when they are appropriate and when not. My purpose in writing this article is to raise awareness of the risks and encourage more investigation of solutions. <a class=anchor href="#what_is_a_dependency"><h2 id="what_is_a_dependency">What is a dependency?</h2></a> <p> In today’s software development world, a <i>dependency</i> is additional code that you want to call from your program. Adding a dependency avoids repeating work already done: designing, writing, testing, debugging, and maintaining a specific unit of code. In this article we’ll call that unit of code a <i>package</i>; some systems use terms like library or module instead of package. <p> Taking on externally-written dependencies is an old practice: most programmers have at one point in their careers had to go through the steps of manually downloading and installing a required library, like C’s PCRE or zlib, or C++’s Boost or Qt, or Java’s JodaTime or JUnit. These packages contain high-quality, debugged code that required significant expertise to develop. For a program that needs the functionality provided by one of these packages, the tedious work of manually downloading, installing, and updating the package is easier than the work of redeveloping that functionality from scratch. But the high fixed costs of reuse mean that manually-reused packages tend to be big: a tiny package would be easier to reimplement. <p> A <i>dependency manager</i> (sometimes called a package manager) automates the downloading and installation of dependency packages. As dependency managers make individual packages easier to download and install, the lower fixed costs make smaller packages economical to publish and reuse. <p> For example, the Node.js dependency manager NPM provides access to over 750,000 packages. One of them, <code>escape-string-regexp</code>, provides a single function that escapes regular expression operators in its input. The entire implementation is: <pre>var matchOperatorsRe = /[|\\{}()[\]^$+*?.]/g; module.exports = function (str) { if (typeof str !== 'string') { throw new TypeError('Expected a string'); } return str.replace(matchOperatorsRe, '\\$&amp;'); }; </pre> <p> Before dependency managers, publishing an eight-line code library would have been unthinkable: too much overhead for too little benefit. But NPM has driven the overhead approximately to zero, with the result that nearly-trivial functionality can be packaged and reused. 
In late January 2019, the <code>escape-string-regexp</code> package is explicitly depended upon by almost a thousand other NPM packages, not to mention all the packages developers write for their own use and don’t share. <p> Dependency managers now exist for essentially every programming language. Maven Central (Java), Nuget (.NET), Packagist (PHP), PyPI (Python), and RubyGems (Ruby) each host over 100,000 packages. The arrival of this kind of fine-grained, widespread software reuse is one of the most consequential shifts in software development over the past two decades. And if we’re not more careful, it will lead to serious problems. <a class=anchor href="#what_could_go_wrong"><h2 id="what_could_go_wrong">What could go wrong?</h2></a> <p> A package, for this discussion, is code you download from the internet. Adding a package as a dependency outsources the work of developing that code—designing, writing, testing, debugging, and maintaining—to someone else on the internet, someone you often don’t know. By using that code, you are exposing your own program to all the failures and flaws in the dependency. Your program’s execution now literally <i>depends</i> on code downloaded from this stranger on the internet. Presented this way, it sounds incredibly unsafe. Why would anyone do this? <p> We do this because it’s easy, because it seems to work, because everyone else is doing it too, and, most importantly, because it seems like a natural continuation of age-old established practice. But there are important differences we’re ignoring. <p> Decades ago, most developers already trusted others to write software they depended on, such as operating systems and compilers. That software was bought from known sources, often with some kind of support agreement. There was still a potential for bugs or outright mischief,<a class=footnote id=body3 href="#note3"><sup>3</sup></a> but at least we knew who we were dealing with and usually had commercial or legal recourses available. <p> The phenomenon of open-source software, distributed at no cost over the internet, has displaced many of those earlier software purchases. When reuse was difficult, there were fewer projects publishing reusable code packages. Even though their licenses typically disclaimed, among other things, any “implied warranties of merchantability and fitness for a particular purpose,” the projects built up well-known reputations that often factored heavily into people’s decisions about which to use. The commercial and legal support for trusting our software sources was replaced by reputational support. Many common early packages still enjoy good reputations: consider BLAS (published 1979), Netlib (1987), libjpeg (1991), LAPACK (1992), HP STL (1994), and zlib (1995). <p> Dependency managers have scaled this open-source code reuse model down: now, developers can share code at the granularity of individual functions of tens of lines. This is a major technical accomplishment. There are myriad available packages, and writing code can involve such a large number of them, but the commercial, legal, and reputational support mechanisms for trusting the code have not carried over. We are trusting more code with less justification for doing so. <p> The cost of adopting a bad dependency can be viewed as the sum, over all possible bad outcomes, of the cost of each bad outcome multiplied by its probability of happening (risk). 
<p> <img name="deps-cost" class="center pad" width=383 height=95 src="deps-cost.png" srcset="deps-cost.png 1x, deps-cost@1.5x.png 1.5x, deps-cost@2x.png 2x, deps-cost@3x.png 3x, deps-cost@4x.png 4x"> <p> The context where a dependency will be used determines the cost of a bad outcome. At one end of the spectrum is a personal hobby project, where the cost of most bad outcomes is near zero: you’re just having fun, bugs have no real impact other than wasting some time, and even debugging them can be fun. So the risk probability almost doesn’t matter: it’s being multiplied by zero. At the other end of the spectrum is production software that must be maintained for years. Here, the cost of a bug in a dependency can be very high: servers may go down, sensitive data may be divulged, customers may be harmed, companies may fail. High failure costs make it much more important to estimate and then reduce any risk of a serious failure. <p> No matter what the expected cost, experiences with larger dependencies suggest some approaches for estimating and reducing the risks of adding a software dependency. It is likely that better tooling is needed to help reduce the costs of these approaches, much as dependency managers have focused to date on reducing the costs of download and installation. <a class=anchor href="#inspect_the_dependency"><h2 id="inspect_the_dependency">Inspect the dependency</h2></a> <p> You would not hire a software developer you’ve never heard of and know nothing about. You would learn more about them first: check references, conduct a job interview, run background checks, and so on. Before you depend on a package you found on the internet, it is similarly prudent to learn a bit about it first. <p> A basic inspection can give you a sense of how likely you are to run into problems trying to use this code. If the inspection reveals likely minor problems, you can take steps to prepare for or maybe avoid them. If the inspection reveals major problems, it may be best not to use the package: maybe you’ll find a more suitable one, or maybe you need to develop one yourself. Remember that open-source packages are published by their authors in the hope that they will be useful but with no guarantee of usability or support. In the middle of a production outage, you’ll be the one debugging it. As the original GNU General Public License warned, “The entire risk as to the quality and performance of the program is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.”<a class=footnote id=body4 href="#note4"><sup>4</sup></a> <p> The rest of this section outlines some considerations when inspecting a package and deciding whether to depend on it. <a class=anchor href="#design"><h3 id="design">Design</h3></a> <p> Is package’s documentation clear? Does the API have a clear design? If the authors can explain the package’s API and its design well to you, the user, in the documentation, that increases the likelihood they have explained the implementation well to the computer, in the source code. Writing code for a clear, well-designed API is also easier, faster, and hopefully less error-prone. Have the authors documented what they expect from client code in order to make future upgrades compatible? (Examples include the C++<a class=footnote id=body5 href="#note5"><sup>5</sup></a> and Go<a class=footnote id=body6 href="#note6"><sup>6</sup></a> compatibility documents.) 
<a class=anchor href="#code_quality"><h3 id="code_quality">Code Quality</h3></a> <p> Is the code well-written? Read some of it. Does it look like the authors have been careful, conscientious, and consistent? Does it look like code you’d want to debug? You may need to. <p> Develop your own systematic ways to check code quality. For example, something as simple as compiling a C or C++ program with important compiler warnings enabled (for example, <code>-Wall</code>) can give you a sense of how seriously the developers work to avoid various undefined behaviors. Recent languages like Go, Rust, and Swift use an <code>unsafe</code> keyword to mark code that violates the type system; look to see how much unsafe code there is. More advanced semantic tools like Infer<a class=footnote id=body7 href="#note7"><sup>7</sup></a> or SpotBugs<a class=footnote id=body8 href="#note8"><sup>8</sup></a> are helpful too. Linters are less helpful: you should ignore rote suggestions about topics like brace style and focus instead on semantic problems. <p> Keep an open mind to development practices you may not be familiar with. For example, the SQLite library ships as a single 200,000-line C source file and a single 11,000-line header, the “amalgamation.” The sheer size of these files should raise an initial red flag, but closer investigation would turn up the actual development source code, a traditional file tree with over a hundred C source files, tests, and support scripts. It turns out that the single-file distribution is built automatically from the original sources and is easier for end users, especially those without dependency managers. (The compiled code also runs faster, because the compiler can see more optimization opportunities.) <a class=anchor href="#testing"><h3 id="testing">Testing</h3></a> <p> Does the code have tests? Can you run them? Do they pass? Tests establish that the code’s basic functionality is correct, and they signal that the developer is serious about keeping it correct. For example, the SQLite development tree has an incredibly thorough test suite with over 30,000 individual test cases as well as developer documentation explaining the testing strategy.<a class=footnote id=body9 href="#note9"><sup>9</sup></a> On the other hand, if there are few tests or no tests, or if the tests fail, that’s a serious red flag: future changes to the package are likely to introduce regressions that could easily have been caught. If you insist on tests in code you write yourself (you do, right?), you should insist on tests in code you outsource to others. <p> Assuming the tests exist, run, and pass, you can gather more information by running them with run-time instrumentation like code coverage analysis, race detection,<a class=footnote id=body10 href="#note10"><sup>10</sup></a> memory allocation checking, and memory leak detection. <a class=anchor href="#debugging"><h3 id="debugging">Debugging</h3></a> <p> Find the package’s issue tracker. Are there many open bug reports? How long have they been open? Are there many fixed bugs? Have any bugs been fixed recently? If you see lots of open issues about what look like real bugs, especially if they have been open for a long time, that’s not a good sign. On the other hand, if the closed issues show that bugs are rarely found and promptly fixed, that’s great. <a class=anchor href="#maintenance"><h3 id="maintenance">Maintenance</h3></a> <p> Look at the package’s commit history. How long has the code been actively maintained? Is it actively maintained now? 
Packages that have been actively maintained for an extended amount of time are more likely to continue to be maintained. How many people work on the package? Many packages are personal projects that developers create and share for fun in their spare time. Others are the result of thousands of hours of work by a group of paid developers. In general, the latter kind of package is more likely to have prompt bug fixes, steady improvements, and general upkeep. <p> On the other hand, some code really is “done.” For example, NPM’s <code>escape-string-regexp</code>, shown earlier, may never need to be modified again. <a class=anchor href="#usage"><h3 id="usage">Usage</h3></a> <p> Do many other packages depend on this code? Dependency managers can often provide statistics about usage, or you can use a web search to estimate how often others write about using the package. More users should at least mean more people for whom the code works well enough, along with faster detection of new bugs. Widespread usage is also a hedge against the question of continued maintenance: if a widely-used package loses its maintainer, an interested user is likely to step forward. <p> For example, libraries like PCRE or Boost or JUnit are incredibly widely used. That makes it more likely—although certainly not guaranteed—that bugs you might otherwise run into have already been fixed, because others ran into them first. <a class=anchor href="#security"><h3 id="security">Security</h3></a> <p> Will you be processing untrusted inputs with the package? If so, does it seem to be robust against malicious inputs? Does it have a history of security problems listed in the National Vulnerability Database (NVD)?<a class=footnote id=body11 href="#note11"><sup>11</sup></a> <p> For example, when Jeff Dean and I started work on Google Code Search<a class=footnote id=body12 href="#note12"><sup>12</sup></a>—<code>grep</code> over public source code—in 2006, the popular PCRE regular expression library seemed like an obvious choice. In an early discussion with Google’s security team, however, we learned that PCRE had a history of problems like buffer overflows, especially in its parser. We could have learned the same by searching for PCRE in the NVD. That discovery didn’t immediately cause us to abandon PCRE, but it did make us think more carefully about testing and isolation. <a class=anchor href="#licensing"><h3 id="licensing">Licensing</h3></a> <p> Is the code properly licensed? Does it have a license at all? Is the license acceptable for your project or company? A surprising fraction of projects on GitHub have no clear license. Your project or company may impose further restrictions on the allowed licenses of dependencies. For example, Google disallows the use of code licensed under AGPL-like licenses (too onerous) as well as WTFPL-like licenses (too vague).<a class=footnote id=body13 href="#note13"><sup>13</sup></a> <a class=anchor href="#dependencies"><h3 id="dependencies">Dependencies</h3></a> <p> Does the code have dependencies of its own? Flaws in indirect dependencies are just as bad for your program as flaws in direct dependencies. Dependency managers can list all the transitive dependencies of a given package, and each of them should ideally be inspected as described in this section. A package with many dependencies incurs additional inspection work, because those same dependencies incur additional risk that needs to be evaluated. 
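<p> To make that list concrete, here is a short sketch (illustrative only, not part of any dependency manager), using Go as the example ecosystem: it runs <code>go</code> <code>list</code> <code>-m</code> <code>all</code>, which in module-aware Go (1.11 and later) prints the main module followed by every module required directly or indirectly, and then reports the count. <pre>// Sketch: count how many modules this project depends on, directly or
// indirectly, by asking the module-aware go command (Go 1.11 or later).
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("go", "list", "-m", "all").Output()
	if err != nil {
		fmt.Println("go list failed:", err)
		return
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	// The first line names the main module; each remaining line is a
	// direct or indirect dependency that deserves some level of inspection.
	fmt.Printf("%s depends on %d modules\n", lines[0], len(lines)-1)
}
</pre> <p> Other ecosystems have equivalent listings (for example, <code>npm</code> <code>ls</code> for NPM). The point is that the full transitive list is cheap to produce and is usually longer than expected.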
<p> Many developers have never looked at the full list of transitive dependencies of their code and don’t know what they depend on. For example, in March 2016 the NPM user community discovered that many popular projects—including Babel, Ember, and React—all depended indirectly on a tiny package called <code>left-pad</code>, consisting of a single 8-line function body. They discovered this when the author of <code>left-pad</code> deleted that package from NPM, inadvertently breaking most Node.js users’ builds.<a class=footnote id=body14 href="#note14"><sup>14</sup></a> And <code>left-pad</code> is hardly exceptional in this regard. For example, 30% of the 750,000 packages published on NPM depend—at least indirectly—on <code>escape-string-regexp</code>. Adapting Leslie Lamport’s observation about distributed systems, a dependency manager can easily create a situation in which the failure of a package you didn’t even know existed can render your own code unusable. <a class=anchor href="#test_the_dependency"><h2 id="test_the_dependency">Test the dependency</h2></a> <p> The inspection process should include running a package’s own tests. If the package passes the inspection and you decide to make your project depend on it, the next step should be to write new tests focused on the functionality needed by your application. These tests often start out as short standalone programs written to make sure you can understand the package’s API and that it does what you think it does. (If you can’t or it doesn’t, turn back now!) It is worth then taking the extra effort to turn those programs into automated tests that can be run against newer versions of the package. If you find a bug and have a potential fix, you’ll want to be able to rerun these project-specific tests easily, to make sure that the fix did not break anything else. <p> It is especially worth exercising the likely problem areas identified by the basic inspection. For Code Search, we knew from past experience that PCRE sometimes took a long time to execute certain regular expression searches. Our initial plan was to have separate thread pools for “simple” and “complicated” regular expression searches. One of the first tests we ran was a benchmark, comparing <code>pcregrep</code> with a few other <code>grep</code> implementations. When we found that, for one basic test case, <code>pcregrep</code> was 70X slower than the fastest <code>grep</code> available, we started to rethink our plan to use PCRE. Even though we eventually dropped PCRE entirely, that benchmark remains in our code base today. <a class=anchor href="#abstract_the_dependency"><h2 id="abstract_the_dependency">Abstract the dependency</h2></a> <p> Depending on a package is a decision that you are likely to revisit later. Perhaps updates will take the package in a new direction. Perhaps serious security problems will be found. Perhaps a better option will come along. For all these reasons, it is worth the effort to make it easy to migrate your project to a new dependency. <p> If the package will be used from many places in your project’s source code, migrating to a new dependency would require making changes to all those different source locations. Worse, if the package will be exposed in your own project’s API, migrating to a new dependency would require making changes in all the code calling your API, which you might not control. To avoid these costs, it makes sense to define an interface of your own, along with a thin wrapper implementing that interface using the dependency. 
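<p> As a sketch of what such an interface and wrapper can look like—the <code>match</code> package below is illustrative, standing in for a real project, with the standard <code>regexp</code> package playing the role of the chosen dependency—consider: <pre>// Package match defines the narrow interface this project needs from a
// regular expression engine and hides the chosen dependency behind it.
package match

import "regexp" // the current choice; switching engines means editing only this file

// Matcher is the only behavior the rest of the project relies on.
type Matcher interface {
	Match(input string) bool
}

// Compile wraps the dependency's compile step behind our own API.
func Compile(pattern string) (Matcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &amp;engine{re: re}, nil
}

// engine adapts the regexp package to the Matcher interface.
type engine struct {
	re *regexp.Regexp
}

func (e *engine) Match(input string) bool {
	return e.re.MatchString(input)
}
</pre> <p> The rest of the project imports only <code>match</code>; substituting a different regular expression engine later means rewriting this one file.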
<p> Note that the wrapper should include only what your project needs from the dependency, not everything the dependency offers. Ideally, that allows you to substitute a different, equally appropriate dependency later, by changing only the wrapper. Migrating your per-project tests to use the new interface tests the interface and wrapper implementation and also makes it easy to test any potential replacements for the dependency. <p> For Code Search, we developed an abstract <code>Regexp</code> class that defined the interface Code Search needed from any regular expression engine. Then we wrote a thin wrapper around PCRE implementing that interface. The indirection made it easy to test alternate libraries, and it kept us from accidentally introducing knowledge of PCRE internals into the rest of the source tree. That in turn ensured that it would be easy to switch to a different dependency if needed. <a class=anchor href="#isolate_the_dependency"><h2 id="isolate_the_dependency">Isolate the dependency</h2></a> <p> It may also be appropriate to isolate a dependency at run-time, to limit the possible damage caused by bugs in it. For example, Google Chrome allows users to add dependencies—extension code—to the browser. When Chrome launched in 2008, it introduced the critical feature (now standard in all browsers) of isolating each extension in a sandbox running in a separate operating-system process.<a class=footnote id=body15 href="#note15"><sup>15</sup></a> An exploitable bug in a badly-written extension therefore did not automatically have access to the entire memory of the browser itself and could be stopped from making inappropriate system calls.<a class=footnote id=body16 href="#note16"><sup>16</sup></a> For Code Search, until we dropped PCRE entirely, our plan was to isolate at least the PCRE parser in a similar sandbox. Today, another option would be a lightweight hypervisor-based sandbox like gVisor.<a class=footnote id=body17 href="#note17"><sup>17</sup></a> Isolating dependencies reduces the associated risks of running that code. <p> Even with these examples and other off-the-shelf options, run-time isolation of suspect code is still too difficult and rarely done. True isolation would require a completely memory-safe language, with no escape hatch into untyped code. That’s challenging not just in entirely unsafe languages like C and C++ but also in languages that provide restricted unsafe operations, like Java when including JNI, or like Go, Rust, and Swift when including their “unsafe” features. Even in a memory-safe language like JavaScript, code often has access to far more than it needs. In November 2018, the latest version of the NPM package <code>event-stream</code>, which provided a functional streaming API for JavaScript events, was discovered to contain obfuscated malicious code that had been added two and a half months earlier. The code, which harvested large Bitcoin wallets from users of the Copay mobile app, was accessing system resources entirely unrelated to processing event streams.<a class=footnote id=body18 href="#note18"><sup>18</sup></a> One of many possible defenses to this kind of problem would be to better restrict what dependencies can access. <a class=anchor href="#avoid_the_dependency"><h2 id="avoid_the_dependency">Avoid the dependency</h2></a> <p> If a dependency seems too risky and you can’t find a way to isolate it, the best answer may be to avoid it entirely, or at least to avoid the parts you’ve identified as most problematic.
<p> For example, as we better understood the risks and costs associated with PCRE, our plan for Google Code Search evolved from “use PCRE directly,” to “use PCRE but sandbox the parser,” to “write a new regular expression parser but keep the PCRE execution engine,” to “write a new parser and connect it to a different, more efficient open-source execution engine.” Later we rewrote the execution engine as well, so that no dependencies were left, and we open-sourced the result: RE2.<a class=footnote id=body19 href="#note19"><sup>19</sup></a> <p> If you only need a tiny fraction of a dependency, it may be simplest to make a copy of what you need (preserving appropriate copyright and other legal notices, of course). You are taking on responsibility for fixing bugs, maintenance, and so on, but you’re also completely isolated from the larger risks. The Go developer community has a proverb about this: “A little copying is better than a little dependency.”<a class=footnote id=body20 href="#note20"><sup>20</sup></a> <a class=anchor href="#upgrade_the_dependency"><h2 id="upgrade_the_dependency">Upgrade the dependency</h2></a> <p> For a long time, the conventional wisdom about software was “if it ain’t broke, don’t fix it.” Upgrading carries a chance of introducing new bugs; without a corresponding reward—like a new feature you need—why take the risk? This analysis ignores two costs. The first is the cost of the eventual upgrade. In software, the difficulty of making code changes does not scale linearly: making ten small changes is less work and easier to get right than making one equivalent large change. The second is the cost of discovering already-fixed bugs the hard way. Especially in a security context, where known bugs are actively exploited, every day you wait is another day that attackers can break in. <p> For example, consider the year 2017 at Equifax, as recounted by executives in detailed congressional testimony.<a class=footnote id=body21 href="#note21"><sup>21</sup></a> On March 7, a new vulnerability in Apache Struts was disclosed, and a patched version was released. On March 8, Equifax received a notice from US-CERT about the need to update any uses of Apache Struts. Equifax ran source code and network scans on March 9 and March 15, respectively; neither scan turned up a particular group of public-facing web servers. On May 13, attackers found the servers that Equifax’s security teams could not. They used the Apache Struts vulnerability to breach Equifax’s network and then steal detailed personal and financial information about 148 million people over the next two months. Equifax finally noticed the breach on July 29 and publicly disclosed it on September 4. By the end of September, Equifax’s CEO, CIO, and CSO had all resigned, and a congressional investigation was underway. <p> Equifax’s experience drives home the point that although dependency managers know the versions they are using at build time, you need other arrangements to track that information through your production deployment process. For the Go language, we are experimenting with automatically including a version manifest in every binary, so that deployment processes can scan binaries for dependencies that need upgrading. Go also makes that information available at run-time, so that servers can consult databases of known bugs and self-report to monitoring software when they are in need of upgrades. 
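<p> As a sketch of that kind of run-time self-reporting, the program below uses <code>runtime/debug.ReadBuildInfo</code>, the API this experiment grew into; it requires a binary built with module support (Go 1.12 or later). A real server would feed the list to monitoring software or a vulnerability database rather than print it. <pre>// Sketch: report the module versions embedded in this binary, if any.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no module information embedded in this binary")
		return
	}
	fmt.Println("main module:", info.Main.Path, info.Main.Version)
	for _, dep := range info.Deps {
		// Each entry could be compared against a database of known bugs.
		fmt.Println("dependency:", dep.Path, dep.Version)
	}
}
</pre>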
<p> Upgrading promptly is important, but upgrading means adding new code to your project, which should mean updating your evaluation of the risks of using the dependency based on the new version. At a minimum, you’d want to skim the diffs showing the changes being made from the current version to the upgraded version, or at least read the release notes, to identify the most likely areas of concern in the upgraded code. If a lot of code is changing, so that the diffs are difficult to digest, that is also information you can incorporate into your risk assessment update. <p> You’ll also want to re-run the tests you’ve written that are specific to your project, to make sure the upgraded package is at least as suitable for the project as the earlier version. It also makes sense to re-run the package’s own tests. If the package has its own dependencies, it is entirely possible that your project’s configuration uses different versions of those dependencies (either older or newer ones) than the package’s authors use. Running the package’s own tests can quickly identify problems specific to your configuration. <p> Again, upgrades should not be completely automatic. You need to verify that the upgraded versions are appropriate for your environment before deploying them.<a class=footnote id=body22 href="#note22"><sup>22</sup></a> <p> If your upgrade process includes re-running the integration and qualification tests you’ve already written for the dependency, so that you are likely to identify new problems before they reach production, then, in most cases, delaying an upgrade is riskier than upgrading quickly. <p> The window for security-critical upgrades is especially short. In the aftermath of the Equifax breach, forensic security teams found evidence that attackers (perhaps different ones) had successfully exploited the Apache Struts vulnerability on the affected servers on March 10, only three days after it was publicly disclosed, but they’d only run a single <code>whoami</code> command. <a class=anchor href="#watch_your_dependencies"><h2 id="watch_your_dependencies">Watch your dependencies</h2></a> <p> Even after all that work, you’re not done tending your dependencies. It’s important to continue to monitor them and perhaps even re-evaluate your decision to use them. <p> First, make sure that you keep using the specific package versions you think you are. Most dependency managers now make it easy or even automatic to record the cryptographic hash of the expected source code for a given package version and then to check that hash when re-downloading the package on another computer or in a test environment. This ensures that your builds use the same dependency source code you inspected and tested. These kinds of checks prevented the <code>event-stream</code> attacker, described earlier, from silently inserting malicious code in the already-released version 3.3.5. Instead, the attacker had to create a new version, 3.3.6, and wait for people to upgrade (without looking closely at the changes). <p> It is also important to watch for new indirect dependencies creeping in: upgrades can easily introduce new packages upon which the success of your project now depends. They deserve your attention as well. In the case of <code>event-stream</code>, the malicious code was hidden in a different package, <code>flatmap-stream</code>, which the new <code>event-stream</code> release added as a new dependency. <p> Creeping dependencies can also affect the size of your project.
During the development of Google’s Sawzall<a class=footnote id=body23 href="#note23"><sup>23</sup></a>—a JIT’ed logs processing language—the authors discovered at various times that the main interpreter binary contained not just Sawzall’s JIT but also (unused) PostScript, Python, and JavaScript interpreters. Each time, the culprit turned out to be unused dependencies declared by some library Sawzall did depend on, combined with the fact that Google’s build system eliminated any manual effort needed to start using a new dependency. This kind of error is the reason that the Go language makes importing an unused package a compile-time error. <p> Upgrading is a natural time to revisit the decision to use a dependency that’s changing. It’s also important to periodically revisit any dependency that <i>isn’t</i> changing. Does it seem plausible that there are no security problems or other bugs to fix? Has the project been abandoned? Maybe it’s time to start planning to replace that dependency. <p> It’s also important to recheck the security history of each dependency. For example, Apache Struts disclosed different major remote code execution vulnerabilities in 2016, 2017, and 2018. Even if you have a list of all the servers that run it and update them promptly, that track record might make you rethink using it at all. <a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a> <p> Software reuse is finally here, and I don’t mean to understate its benefits: it has brought an enormously positive transformation for software developers. Even so, we’ve accepted this transformation without completely thinking through the potential consequences. The old reasons for trusting dependencies are becoming less valid at exactly the same time we have more dependencies than ever. <p> The kind of critical examination of specific dependencies that I outlined in this article is a significant amount of work and remains the exception rather than the rule. Indeed, I doubt there are any developers who actually make the effort to do this for every possible new dependency. I have only done a subset of these checks for a subset of my own dependencies. Most of the time the entirety of the decision is “let’s see what happens.” Too often, anything more than that seems like too much effort. <p> But the Copay and Equifax attacks are clear warnings of real problems in the way we consume software dependencies today. We should not ignore the warnings. I offer three broad recommendations. <ol> <li> <p> <i>Recognize the problem.</i> If nothing else, I hope this article has convinced you that there is a problem here worth addressing. We need many people to focus significant effort on solving it. <li> <p> <i>Establish best practices for today.</i> We need to establish best practices for managing dependencies using what’s available today. This means working out processes that evaluate, reduce, and track risk, from the original adoption decision through to production use. In fact, just as some engineers specialize in testing, it may be that we need engineers who specialize in managing dependencies. <li> <p> <i>Develop better dependency technology for tomorrow.</i> Dependency managers have essentially eliminated the cost of downloading and installing a dependency. Future development effort should focus on reducing the cost of the kind of evaluation and maintenance necessary to use a dependency. For example, package discovery sites might work to find more ways to allow developers to share their findings.
Build tools should, at the least, make it easy to run a package’s own tests. More aggressively, build tools and package management systems could also work together to allow package authors to test new changes against all public clients of their APIs. Languages should also provide easy ways to isolate a suspect package.</ol> <p> There’s a lot of good software out there. Let’s work together to find out how to reuse it safely. <p> <a class=anchor href="#references"><h2 id="references">References</h2></a> <ol> <li><a name=note1></a> Rachel Potvin and Josh Levenberg, “Why Google Stores Billions of Lines of Code in a Single Repository,” <i>Communications of the ACM</i> 59(7) (July 2016), pp. 78-87. <a href="https://doi.org/10.1145/2854146">https://doi.org/10.1145/2854146</a> <a class=back href="#body1">(⇡)</a> <li><a name=note2></a> Russ Cox, “Go &amp; Versioning,” February 2018. <a href="https://research.swtch.com/vgo">https://research.swtch.com/vgo</a> <a class=back href="#body2">(⇡)</a> <li><a name=note3></a> Ken Thompson, “Reflections on Trusting Trust,” <i>Communications of the ACM</i> 27(8) (August 1984), pp. 761–763. <a href="https://doi.org/10.1145/358198.358210">https://doi.org/10.1145/358198.358210</a> <a class=back href="#body3">(⇡)</a> <li><a name=note4></a> GNU Project, “GNU General Public License, version 1,” February 1989. <a href="https://www.gnu.org/licenses/old-licenses/gpl-1.0.html">https://www.gnu.org/licenses/old-licenses/gpl-1.0.html</a> <a class=back href="#body4">(⇡)</a> <li><a name=note5></a> Titus Winters, “SD-8: Standard Library Compatibility,” C++ Standing Document, August 2018. <a href="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility">https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility</a> <a class=back href="#body5">(⇡)</a> <li><a name=note6></a> Go Project, “Go 1 and the Future of Go Programs,” September 2013. <a href="https://golang.org/doc/go1compat">https://golang.org/doc/go1compat</a> <a class=back href="#body6">(⇡)</a> <li><a name=note7></a> Facebook, “Infer: A tool to detect bugs in Java and C/C++/Objective-C code before it ships.” <a href="https://fbinfer.com/">https://fbinfer.com/</a> <a class=back href="#body7">(⇡)</a> <li><a name=note8></a> “SpotBugs: Find bugs in Java Programs.” <a href="https://spotbugs.github.io/">https://spotbugs.github.io/</a> <a class=back href="#body8">(⇡)</a> <li><a name=note9></a> D. Richard Hipp, “How SQLite is Tested.” <a href="https://www.sqlite.org/testing.html">https://www.sqlite.org/testing.html</a> <a class=back href="#body9">(⇡)</a> <li><a name=note10></a> Alexander Potapenko, “Testing Chromium: ThreadSanitizer v2, a next-gen data race detector,” April 2014. <a href="https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html">https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html</a> <a class=back href="#body10">(⇡)</a> <li><a name=note11></a> NIST, “National Vulnerability Database – Search and Statistics.” <a href="https://nvd.nist.gov/vuln/search">https://nvd.nist.gov/vuln/search</a> <a class=back href="#body11">(⇡)</a> <li><a name=note12></a> Russ Cox, “Regular Expression Matching with a Trigram Index, or How Google Code Search Worked,” January 2012. 
<a href="https://swtch.com/~rsc/regexp/regexp4.html">https://swtch.com/~rsc/regexp/regexp4.html</a> <a class=back href="#body12">(⇡)</a> <li><a name=note13></a> Google, “Google Open Source: Using Third-Party Licenses.” <a href="https://opensource.google.com/docs/thirdparty/licenses/#banned">https://opensource.google.com/docs/thirdparty/licenses/#banned</a> <a class=back href="#body13">(⇡)</a> <li><a name=note14></a> Nathan Willis, “A single Node of failure,” LWN, March 2016. <a href="https://lwn.net/Articles/681410/">https://lwn.net/Articles/681410/</a> <a class=back href="#body14">(⇡)</a> <li><a name=note15></a> Charlie Reis, “Multi-process Architecture,” September 2008. <a href="https://blog.chromium.org/2008/09/multi-process-architecture.html">https://blog.chromium.org/2008/09/multi-process-architecture.html</a> <a class=back href="#body15">(⇡)</a> <li><a name=note16></a> Adam Langley, “Chromium’s seccomp Sandbox,” August 2009. <a href="https://www.imperialviolet.org/2009/08/26/seccomp.html">https://www.imperialviolet.org/2009/08/26/seccomp.html</a> <a class=back href="#body16">(⇡)</a> <li><a name=note17></a> Nicolas Lacasse, “Open-sourcing gVisor, a sandboxed container runtime,” May 2018. <a href="https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime">https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime</a> <a class=back href="#body17">(⇡)</a> <li><a name=note18></a> Adam Baldwin, “Details about the event-stream incident,” November 2018. <a href="https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident">https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident</a> <a class=back href="#body18">(⇡)</a> <li><a name=note19></a> Russ Cox, “RE2: a principled approach to regular expression matching,” March 2010. <a href="https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html">https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html</a> <a class=back href="#body19">(⇡)</a> <li><a name=note20></a> Rob Pike, “Go Proverbs,” November 2015. <a href="https://go-proverbs.github.io/">https://go-proverbs.github.io/</a> <a class=back href="#body20">(⇡)</a> <li><a name=note21></a> U.S. House of Representatives Committee on Oversight and Government Reform, “The Equifax Data Breach,” Majority Staff Report, 115th Congress, December 2018. <a href="https://oversight.house.gov/report/committee-releases-report-revealing-new-information-on-equifax-data-breach/">https://oversight.house.gov/report/committee-releases-report-revealing-new-information-on-equifax-data-breach/</a> <a class=back href="#body21">(⇡)</a> <li><a name=note22></a> Russ Cox, “The Principles of Versioning in Go,” GopherCon Singapore, May 2018. <a href="https://www.youtube.com/watch?v=F8nrpe0XWRg">https://www.youtube.com/watch?v=F8nrpe0XWRg</a> <a class=back href="#body22">(⇡)</a> <li><a name=note23></a> Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan, “Interpreting the Data: Parallel Analysis with Sawzall,” <i>Scientific Programming Journal</i>, vol. 13 (2005). <a href="https://doi.org/10.1155/2005/962135">https://doi.org/10.1155/2005/962135</a> <a class=back href="#body23">(⇡)</a></ol> <a class=anchor href="#coda"><h2 id="coda">Coda</h2></a> <p> This post is a draft of my current thinking on this topic. I hope that sharing it will provoke productive discussion, attract more attention to the general problem, and help me refine my own thoughts. 
I also intend to publish a revised copy of this as an article elsewhere. For both these reasons, unlike most of my blog posts, <i>this post is not Creative Commons-licensed</i>. Please link people to this post instead of making a copy. When a more final version is published, I will link to it here. <p class=copyright> © 2019 Russ Cox. All Rights Reserved. Why Add Versions To Go? tag:research.swtch.com,2012:research.swtch.com/vgo-why-versions 2018-06-07T10:20:00-04:00 2018-06-07T10:22:00-04:00 Why should Go understand package versions at all? (Go & Versioning, Part 10) <p> People sometimes ask me why we should add package versions to Go at all. Isn't Go doing well enough without versions? Usually these people have had a bad experience with versions in another language, and they associate versions with breaking changes. In this post, I want to talk a little about why we do need to add support for package versions to Go. Later posts will address why we won't encourage breaking changes. <p> The <code>go</code> <code>get</code> command has two failure modes caused by ignorance of versions: it can use code that is too old, and it can use code that is too new. For example, suppose we want to use a package D, so we run <code>go</code> <code>get</code> <code>D</code> with no packages installed yet. The <code>go</code> <code>get</code> command will download the latest copy of D (whatever <code>git</code> <code>clone</code> brings down), which builds successfully. To make our discussion easier, let's call that D version 1.0 and keep D's dependency requirements in mind (and in our diagrams). But remember that while we understand the idea of versions and dependency requirements, <code>go</code> <code>get</code> does not. <pre>$ go get D </pre> <p> <img name="vgo-why-1" class="center pad" width=200 height=39 src="vgo-why-1.png" srcset="vgo-why-1.png 1x, vgo-why-1@1.5x.png 1.5x, vgo-why-1@2x.png 2x, vgo-why-1@3x.png 3x, vgo-why-1@4x.png 4x"> <p> Now suppose that a month later, we want to use C, which happens to import D. We run <code>go</code> <code>get</code> <code>C</code>. The <code>go</code> <code>get</code> command downloads the latest copy of C, which happens to be C 1.8 and imports D. Since <code>go</code> <code>get</code> already has a downloaded copy of D, it uses that one instead of incurring the cost of a fresh download. Unfortunately, the build of C fails: C is using a new feature from D introduced in D 1.4, and <code>go</code> <code>get</code> is reusing D 1.0. The code is too old. <pre>$ go get C </pre> <p> <img name="vgo-why-2" class="center pad" width=201 height=96 src="vgo-why-2.png" srcset="vgo-why-2.png 1x, vgo-why-2@1.5x.png 1.5x, vgo-why-2@2x.png 2x, vgo-why-2@3x.png 3x, vgo-why-2@4x.png 4x"> <p> Next we try running <code>go</code> <code>get</code> <code>-u</code>, which downloads the latest copy of all the code involved, including code already downloaded. <pre>$ go get -u C </pre> <p> <img name="vgo-why-3" class="center pad" width=201 height=104 src="vgo-why-3.png" srcset="vgo-why-3.png 1x, vgo-why-3@1.5x.png 1.5x, vgo-why-3@2x.png 2x, vgo-why-3@3x.png 3x, vgo-why-3@4x.png 4x"> <p> Unfortunately, D 1.6 was released an hour ago and contains a bug that breaks C. Now the code is too new. Watching this play out from above, we know what <code>go</code> <code>get</code> needs to do: use D ≥ 1.4 but not D 1.6, so maybe D 1.4 or D 1.5. It's very difficult to tell <code>go</code> <code>get</code> that today, since it doesn't understand the concept of a package version. 
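<p> To make the contrast concrete, here is a sketch of how that missing information can be written down once tools understand versions, using the <code>go.mod</code> syntax from the vgo prototype described elsewhere in this series. C, D, and M are the placeholder names from the diagrams above, not real module paths. <pre>// C's go.mod records the minimum version of D that C needs:
module C

require D v1.4.0

// and our own module's go.mod needs only to name C:
module M

require C v1.8.0
</pre> <p> With C’s requirement recorded, a version-aware tool can choose D 1.4—new enough for C—and has no reason to jump ahead to the hour-old D 1.6 unless something else in the build asks for it.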
<p> Getting back to the original question in the post, <i>why add versions to Go?</i> <p> Because agreeing on a versioning system—a syntax for version identifiers, along with rules for how to order and interpret them—establishes a way for us to communicate more precisely with our tools, and with each other, about which copy of a package we mean. Versioning matters for correct builds, as we just saw, but it enables other interesting tools too. <p> For example, the obvious next step is to be able to list which versions of a package are being used in a given build and whether any of them have updates available. Generalizing that, it would be useful to have a tool that examines a list of builds, perhaps all the targets built at a given company, and assembles the same list. Such a list of versions can then feed into compliance checks, queries into bug databases, and so on. Embedding the version list in a built binary would even allow a program to make these checks on its own behalf while it runs. These all exist for other systems already, of course: I'm not claiming the ideas are novel. The point is that establishing agreement on a versioning system enables all these tools, which can even be built outside the language toolchain. <p> We can also move from query tools, which tell you about your code, to development tools, which update it for you. For example, an obvious next step is a tool to update a package's dependencies to their latest versions automatically whenever the package's tests and those of its dependencies continue to pass. Being able to describe versions might also enable tools that apply code cleanups. For example, having versions would let us write instructions “when using D version ≥ 1.4, replace the common client code idiom <code>x.Foo(1).Bar(2)</code> with <code>x.FooBar()</code>” that a tool like <code>go</code> <code>fix</code> could execute. <p> The goal of our work adding versions to the core Go toolchain—or, more generally, adding them to the shared working vocabulary of both Go developers and our tools—is to establish a foundation that helps with core issues like building working programs but also enables interesting external tools like these, and certainly others we haven't imagined yet. <p> If we're building a foundation for other tools, we should aim to make that foundation as versatile, strong, and robust as possible, to enable as many other tools as possible, with as little hindrance as possible to those tools. We're not just writing a single tool. We're defining the way all these tools will work together. This foundation is an API in the broad sense of something that programs must be written against. Like in any API, we want to choose a design that is powerful enough to enable many uses but at the same time simple, reliable, consistent, coherent, and predictable. Future posts will explore how vgo's design decisions aim for those properties. What is Software Engineering? tag:research.swtch.com,2012:research.swtch.com/vgo-eng 2018-05-30T10:00:00-04:00 2018-05-30T10:02:00-04:00 What is software engineering and what does Go mean by it? (Go & Versioning, Part 9) <p> Nearly all of Go’s distinctive design decisions were aimed at making software engineering simpler and easier. We've said this often. 
The canonical reference is Rob Pike's 2012 article, “<a href="https://talks.golang.org/2012/splash.article">Go at Google: Language Design in the Service of Software Engineering</a>.” But what is software engineering?<blockquote> <p> <i>Software engineering is what happens to programming <br>when you add time and other programmers.</i></blockquote> <p> Programming means getting a program working. You have a problem to solve, you write some Go code, you run it, you get your answer, you’re done. That’s programming, and that's difficult enough by itself. But what if that code has to keep working, day after day? What if five other programmers need to work on the code too? Then you start to think about version control systems, to track how the code changes over time and to coordinate with the other programmers. You add unit tests, to make sure bugs you fix are not reintroduced over time, not by you six months from now, and not by that new team member who’s unfamiliar with the code. You think about modularity and design patterns, to divide the program into parts that team members can work on mostly independently. You use tools to help you find bugs earlier. You look for ways to make programs as clear as possible, so that bugs are less likely. You make sure that small changes can be tested quickly, even in large programs. You're doing all of this because your programming has turned into software engineering. <p> (This definition and explanation of software engineering is my riff on an original theme by my Google colleague Titus Winters, whose preferred phrasing is “software engineering is programming integrated over time.” It's worth seven minutes of your time to see <a href="https://www.youtube.com/watch?v=tISy7EJQPzI&t=8m17s">his presentation of this idea at CppCon 2017</a>, from 8:17 to 15:00 in the video.) <p> As I said earlier, nearly all of Go’s distinctive design decisions have been motivated by concerns about software engineering, by trying to accommodate time and other programmers into the daily practice of programming. <p> For example, most people think that we format Go code with <code>gofmt</code> to make code look nicer or to end debates among team members about program layout. But the <a href="https://groups.google.com/forum/#!msg/golang-nuts/HC2sDhrZW5Y/7iuKxdbLExkJ">most important reason for <code>gofmt</code></a> is that if an algorithm defines how Go source code is formatted, then programs, like <code>goimports</code> or <code>gorename</code> or <code>go</code> <code>fix</code>, can edit the source code more easily, without introducing spurious formatting changes when writing the code back. This helps you maintain code over time. <p> As another example, Go import paths are URLs. If code said <code>import</code> <code>"uuid"</code>, you’d have to ask which <code>uuid</code> package. Searching for <code>uuid</code> on <a href="https://godoc.org">godoc.org</a> turns up dozens of packages. If instead the code says <code>import</code> <code>"github.com/pborman/uuid"</code>, now it’s clear which package we mean. Using URLs avoids ambiguity and also reuses an existing mechanism for giving out names, making it simpler and easier to coordinate with other programmers. <p> Continuing the example, Go import paths are written in Go source files, not in a separate build configuration file. This makes Go source files self-contained, which makes it easier to understand, modify, and copy them. These decisions, and more, were all made with the goal of simplifying software engineering. 
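<p> A tiny file illustrates that self-containedness (the specific <code>uuid</code> call is only an example): everything a tool or another programmer needs to know about where the dependency comes from appears in the source itself. <pre>package main

import (
	"fmt"

	// The URL-like path says exactly which uuid package is meant;
	// no separate build configuration file has to resolve the name.
	"github.com/pborman/uuid"
)

func main() {
	fmt.Println(uuid.NewRandom())
}
</pre>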
<p> In later posts I will talk specifically about why versions are important for software engineering and how software engineering concerns motivate the design changes from dep to vgo. The vgo proposal is accepted. Now what? tag:research.swtch.com,2012:research.swtch.com/vgo-accepted 2018-05-29T16:45:00-04:00 2018-05-29T16:47:00-04:00 What is the state of vgo? (Go & Versioning, Part 8) <p> Last week, the proposal review committee accepted the “vgo approach” elaborated on this blog in February and then summarized as <a href="https://golang.org/issue/24301">proposal #24301</a>. There has been some confusion about exactly what that means and what happens next. <p> In general, <a href="https://golang.org/s/proposal">a Go proposal</a> is a discussion about whether to adopt a particular approach and move on to writing, reviewing, and releasing a production implementation. Accepting a proposal does not mean the implementation is complete. (In some cases there is no implementation yet at all!) Accepting a proposal only means that we believe the design is appropriate and that the production implementation can proceed and be committed and released. Inevitably we find details that need adjustment during that process. <p> Vgo as it exists today is not the final implementation. It is a prototype to make the ideas concrete and to make it possible to experiment with the approach. Bugs and design flaws will necessarily be found and fixed as we move toward making it the official approach in the go command. For example, the original vgo prototype downloaded code from sites like GitHub using their APIs, for better efficiency and to avoid requiring users to have every possible version control system installed. Unfortunately, the GitHub API is far more restrictively rate-limited than plain <code>git</code> access, so the current vgo implementation has gone back to invoking <code>git</code>. Although we'd still <a href="https://blogs.msdn.microsoft.com/devops/2018/05/29/announcing-the-may-2018-git-security-vulnerability/">like to move away</a> from version control as the default mechanism for obtaining open source code, we won't do that until we have a viable replacement ready, to make any transition as smooth as possible. <p> More generally, the key reason for the vgo proposal is to add a common vocabulary and semantics around versions of Go code, so that developers and all kinds of tools can be precise when talking to each other about exactly which program should be built, run, or analyzed. Accepting the proposal is the beginning, not the end. <p> One thing I've heard from many people is that they want to start using vgo in their company or project but are held back by not having support for it in the toolchains their developers are using. The fact that vgo is integrated deeply into the go command, instead of being a separate vendor directory-writer, introduces a chicken-and-egg problem. To address that problem and make it as easy as possible for developers to try the vgo approach, we plan to include vgo functionality as an experimental opt-in feature in Go 1.11, with the hope of incorporating feedback and finalizing the feature for Go 1.12. (This rollout is analogous to how we included vendor directory functionality as an experimental opt-in feature in Go 1.5 and turned it on by default in Go 1.6.) We also plan to make <a href="https://golang.org/issue/25069">minimal changes to legacy <code>go</code> <code>get</code></a> so that it can obtain and understand code written using vgo conventions. 
Those changes will be included in the next point release for Go 1.9 and Go 1.10. <p> One thing I've heard from zero people is that <a href="https://research.swtch.com/vgo">they wish my blog posts were longer</a>. The original posts are quite dense and a number of important points are more buried than they should be. This post is the first of a series of much shorter posts to try to make focused points about specific details of the vgo design, approach, and process. Versioned Go Commands tag:research.swtch.com,2012:research.swtch.com/vgo-cmd 2018-02-23T10:09:00-05:00 2018-02-23T10:11:00-05:00 What does it mean to add versioning to the go command? (Go & Versioning, Part 7) <p> What does it mean to add versioning to the <code>go</code> command? The <a href="vgo-intro">overview post</a> gave a preview, but the followup posts focused mainly on underlying details: <a href="vgo-import">the import compatibility rule</a>, <a href="vgo-mvs">minimal version selection</a>, and <a href="vgo-module">defining go modules</a>. With those better understood, this post examines the details of how versioning affects the <code>go</code> command line and the reasons for those changes. <p> The major changes are: <ul> <li> <p> All commands (<code>go</code> <code>build</code>, <code>go</code> <code>run</code>, and so on) will download imported source code automatically, if the necessary version is not already present in the download cache on the local system. <li> <p> The <code>go</code> <code>get</code> command will serve mainly to change which version of a package should be used in future build commands. <li> <p> The <code>go</code> <code>list</code> command will add access to module information. <li> <p> A new <code>go</code> <code>release</code> command will automate some of the work a module author should do when tagging a new release, such as checking API compatibility. <li> <p> The <code>all</code> pattern is redefined to make sense in the world of modules. <li> <p> Developers can and will be encouraged to work in directories outside the GOPATH tree.</ul> <p> All these changes are implemented in the <code>vgo</code> prototype. <p> Deciding exactly how a build system should work is hard. The introduction of new build caching in Go 1.10 prompted some important, difficult decisions about the meaning of <code>go</code> commands, and the introduction of versioning does too. Before I explain some of the decisions, I want to start by explaining a guiding principle that I've found helpful recently, which I call the isolation rule:<blockquote> <p> <i>The result of a build command should depend only on the source files that are its logical inputs, never on hidden state left behind by previous build commands.</i> <p> <i>That is, what a command does in isolation—on a clean system loaded with only the relevant input source files—is what it should do all the time, no matter what else has happened on the system recently.</i></blockquote> <p> To see the wisdom of this rule, let me retell an old build story and show how the isolation rule explains what happened. <a class=anchor href="#old_build_story"><h2 id="old_build_story">An Old Build Story</h2></a> <p> Long ago, when compilers and computers were very slow, developers had scripts to build their whole programs from scratch, but if they were just modifying one source file, they might save time by manually recompiling just that file and then relinking the overall program, avoiding the cost of recompiling all the source files that hadn't changed.
These manual incremental builds were fast but error-prone: if you forgot to recompile a source file that you'd modified, the link of the final executable would use an out-of-date object file, the executable would demonstrate buggy behavior, and you might spend a long time staring at the (correct!) source code looking for a bug that you'd already fixed. <p> <a href="https://www.princeton.edu/~hos/mike/transcripts/feldman.htm">Stu Feldman once explained</a> what it was like in the early 1970s when he spent a few months working on a few-thousand-line Ratfor program:<blockquote> <p> I would go home for dinner at six or so, recompile the whole world in the background, shut up, and then drive home. It would take through the drive home and through dinner for anything to happen. This is because I kept making the classic error of debugging a correct program, because you'd forget to compile the change.</blockquote> <p> Transliterated to modern C tools (instead of Ratfor), Feldman would work on a large program by first compiling it from scratch: <pre>$ rm -f *.o &amp;&amp; cc *.c &amp;&amp; ld *.o </pre> <p> This build follows the isolation rule: starting from the same source files, it produces the same result, no matter what else has been run in that directory. <p> But then Feldman would make changes to specific source files and recompile only the modified ones, to save time: <pre>$ cc r2.c r3.c r5.c &amp;&amp; ld *.o </pre> <p> This incremental build does not follow the isolation rule. The correctness of the command depends on Feldman remembering which files they modified, and it's easy to forget one. But it was so much faster, everyone did it anyway, resorting to routines like Feldman's daily “build during dinner” to correct any mistakes. <p> Feldman continued:<blockquote> <p> Then one day, Steve Johnson came storming into my office in his usual way, saying basically, “Goddamn it, I just spent the whole morning debugging a correct program, again. Why doesn't anybody do something like this? ...”</blockquote> <p> And that's the story of how Stu Feldman invented <code>make</code>. <p> <code>Make</code> was a major advance because it provided fast, incremental builds that followed the isolation rule. Isolation is important because it means the build is properly abstracted: only the source code matters. As a developer, you can make changes to source code and not even think about details like stale object files. <p> However, the isolation rule is never an absolute. There is always some area where it applies, which I call the abstraction zone. When you step out of the abstraction zone, you are back to needing to keep state in your head. For <code>make</code>, the abstraction zone is a single directory. If you are working on a program made up of libraries in multiple directories, traditional <code>make</code> is no help. Most Unix programs in the 1970s fit in a single directory, so it just wasn't important for <code>make</code> to provide isolation semantics in multi-directory builds. <a class=anchor href="#go_builds_and_the_isolation_rule"><h2 id="go_builds_and_the_isolation_rule">Go Builds and the Isolation Rule</h2></a> <p> One way to view the history of design bug fixes in the <code>go</code> command is a sequence of steps extending its abstraction zone to better match developer expectations. <p> One of the advances of the <code>go</code> command was correct handling of source code spread across multiple directories, extending the abstraction zone beyond what <code>make</code> provided. 
Go programs are almost always spread across multiple directories, and when we used <code>make</code> it was very common to forget to install a package in one directory before trying to use it in another directory. We were all too familiar with “the classic error of debugging a correct program.” But even after fixing that, there were still many ways to step out of the <code>go</code> command's abstraction zone, with unfortunate consequences. <p> To take one example, if you had multiple directory trees listed in GOPATH, builds in one tree blindly assumed that installed packages in the others were up-to-date if present, but it would rebuild them if missing. This violation of the isolation rule caused no end of mysterious problems for projects using <code>godep</code>, which used a second GOPATH entry to simulate vendor directories. We fixed this in Go 1.5. <p> As another example, until very recently command-line flags were not part of the abstraction zone. If you start with a standard Go 1.9 distribution and run <pre>$ go build hello.go
$ go install -a -gcflags=-N std
$ go build hello.go
</pre> <p> the second <code>go</code> <code>build</code> command produces a different executable than the first. The first <code>hello</code> is linked against an optimized build of the Go runtime and standard library, while the second <code>hello</code> is linked against an unoptimized standard library. This violation of the isolation rule led to widespread use of <code>go</code> <code>build</code> <code>-a</code> (always rebuild everything), to reestablish isolation semantics. We fixed this in Go 1.10. <p> In both cases, the <code>go</code> command was “working as designed.” These were the kinds of details that we always kept mental track of when using other build systems, so it seemed reasonable to us not to abstract them away. In fact, when I designed the behavior, I thought it was a feature that <pre>$ go install -a -gcflags=-N std
$ go build hello.go
</pre> <p> let you build an optimized <code>hello</code> against an unoptimized standard library, and I sometimes took advantage of that. But, on the whole, Go developers disagreed. They did not expect to, nor want to, keep mental track of that state. For me, the isolation rule is useful because it gives a simple test that helps me cut through any mental contamination left by years of using less capable build systems: every command should have only one meaning, no matter what other commands have preceded it. <p> The isolation rule implies that some commands may need to be made more complex, so one command can serve where two commands did before. For example, if you follow the isolation rule, how <i>do</i> you build an optimized <code>hello</code> against an unoptimized standard library? We answered this in Go 1.10 by extending the <code>-gcflags</code> argument to start with an optional pattern that controls which packages the flags affect. To build an optimized hello against an unoptimized standard library, <code>go</code> <code>build</code> <code>-gcflags=std=-N</code> <code>hello.go</code>. <p> The isolation rule also implies that previously context-dependent commands need to settle on one context-independent meaning. A good general rule seems to be to use the one meaning that developers are most familiar with.
For example, a different variation of the flag problem is: <pre>$ go build -gcflags=-N hello.go $ rm -rf $GOROOT/pkg $ go build -gcflags=-N hello.go </pre> <p> In Go 1.9, the first <code>go</code> <code>build</code> command builds an unoptimized hello against the preinstalled, optimized standard library. The second <code>go</code> <code>build</code> command finds no preinstalled standard library, so it rebuilds the standard library, and the <code>-gcflags</code> applies to all packages built during the command, so the result is an unoptimized hello built against an unoptimized standard library. For Go 1.10, we had to choose which meaning would be the one true meaning. <p> Our original thought was that in the absence of a restricting pattern like <code>std=</code>, the <code>-gcflags=-N</code> should apply to all packages in the build, so that this command would always build an unoptimized hello against an unoptimized standard library. But most developers expect this command to apply the <code>-gcflags=-N</code> only to the argument of <code>go</code> <code>build</code>, namely <code>hello.go</code>, because that's how it works in the common case, when you have <i>not</i> just deleted <code>$GOROOT/pkg</code>. We decided to preserve this expectation, defining that when no pattern is given, the flags apply only to the packages or files named on the build command line. In Go 1.10, building <code>hello.go</code> with <code>-gcflags=-N</code> always builds an unoptimized hello against an optimized standard library, even if <code>$GOROOT/pkg</code> has been deleted and the standard library must be rebuilt on the spot. If you do want a completely unoptimized build, that's <code>-gcflags=all=-N</code>. <p> The isolation rule is also helpful for thinking through the design questions that arise in a versioned <code>go</code> command. As in the flag decisions, some commands need to be made more capable. Others have multiple meanings now and must be reduced to a single meaning. <a class=anchor href="#automatic_downloads"><h2 id="automatic_downloads">Automatic Downloads</h2></a> <p> The most significant implication of the isolation rule is that commands like <code>go</code> <code>build</code>, <code>go</code> <code>install</code>, and <code>go</code> <code>test</code> should download versioned dependencies as needed (that is, if not already downloaded and cached). <p> Suppose I have a brand new Go 1.10 installation and I write this program to <code>hello.go</code>: <pre>package main import ( "fmt" "rsc.io/quote" ) func main() { fmt.Println(quote.Hello()) } </pre> <p> This fails: <pre>$ go run hello.go hello.go:5: import "rsc.io/quote": import not found $ </pre> <p> But this succeeds: <pre>$ go get rsc.io/quote $ go run hello.go Hello, world. $ </pre> <p> I can explain this. After eight years of conditioning by use of <code>goinstall</code> and <code>go</code> <code>get</code>, it seemed obvious to me that this behavior was correct: <code>go</code> <code>get</code> downloads <code>rsc.io/quote</code> for us and stashes it away for use by future commands, so <i>of course</i> that must happen before <code>go</code> <code>run</code>. But I can explain the behavior of the optimization flag examples in the previous section too, and until a few months ago they also seemed obviously correct. After more thought, I now believe that any <code>go</code> command should be able to download versioned dependencies as needed. I changed my mind for a few reasons. <p> The first reason is the isolation rule. 
The fact that every other design mistake I've made in the <code>go</code> command violated the isolation rule strongly suggests that requiring a preparatory <code>go</code> <code>get</code> is a mistake too. <p> The second reason is that I've found it helpful to think of the downloaded versioned source code as living in a local cache that developers shouldn't need to think about at all. If it's really a cache, cache misses can't be failures. <p> The third reason is the mental bookkeeping required. Today's <code>go</code> command expects developers to keep track of which packages are and are not downloaded, just as earlier <code>go</code> commands expected developers to keep track of which compiler flags had been used during the most recent package installs. As programs grow and as we add more precision about versioning, the mental burden will grow, even though the <code>go</code> command is already tracking the same information. For example, I think this hypothetical session is a suboptimal developer experience: <pre>$ git clone https://github.com/rsc/hello $ cd hello $ go build go: rsc.io/sampler(v1.3.1) not installed $ go get go: installing rsc.io/sampler(v1.3.1) $ go build $ </pre> <p> If the command knows exactly what it needs, why make the user do it? <p> The fourth reason is that build systems in other languages already do this. When you check out a Rust repo and build it, <code>cargo</code> <code>build</code> automatically fetches the dependencies as part of the build, no questions asked. <p> The fifth reason is that downloading on demand allows downloading lazily, which in large programs may mean not downloading many dependencies at all. For example, the popular logging package <code>github.com/sirupsen/logrus</code> depends on <code>golang.org/x/sys</code>, but only when building on Solaris. The eventual <code>go.mod</code> file in <code>logrus</code> would list a specific version of <code>x/sys</code> as a dependency. When <code>vgo</code> sees <code>logrus</code> in a project, it will consult the <code>go.mod</code> file and determine which version satisfies an <code>x/sys</code> import. But all the users not building for Solaris will never see an <code>x/sys</code> import, so they can avoid the download of <code>x/sys</code> entirely. This optimization will become more important as the dependency graph grows. <p> I do expect resistance from developers who aren't yet ready to think about builds that download code on demand. We may need to make it possible to disable that with an environment variable, but downloads should be enabled by default. <a class=anchor href="#go_get"><h2 id="go_get">Changing Versions (<code>go</code> <code>get</code>)</h2></a> <p> Plain <code>go</code> <code>get</code>, without <code>-u</code>, violates the command isolation rule and must be fixed. Today: <ul> <li> If GOPATH is empty, <code>go</code> <code>get</code> <code>rsc.io/quote</code> downloads and builds the latest version of <code>rsc.io/quote</code> and its dependencies (for example, <code>rsc.io/sampler</code>). <li> If there is already a <code>rsc.io/quote</code> in GOPATH, from a <code>go</code> <code>get</code> last year, then the new <code>go</code> <code>get</code> builds the old version. 
<li> If <code>rsc.io/sampler</code> is already in GOPATH but <code>rsc.io/quote</code> is not, then <code>go</code> <code>get</code> downloads the latest <code>rsc.io/quote</code> and builds it against the old copy of <code>rsc.io/sampler</code>.</ul> <p> Overall, <code>go</code> <code>get</code> depends on the state of GOPATH, which breaks the command isolation rule. We need to fix that. Since <code>go</code> <code>get</code> has at least three meanings today, we have some latitude in defining new behavior. Today, <code>vgo</code> <code>get</code> fetches the latest version of the named modules and then the exact versions of any dependencies requested by those modules, subject to <a href="vgo-mvs">minimal version selection</a>. For example, <code>vgo</code> <code>get</code> <code>rsc.io/quote</code> always fetches the latest version of <code>rsc.io/quote</code> and then builds it with the exact version of <code>rsc.io/sampler</code> that <code>rsc.io/quote</code> has requested. <p> <code>Vgo</code> also allows module versions to be specified on the command line: <pre>$ vgo get rsc.io/quote@latest # default $ vgo get rsc.io/quote@v1.3.0 $ vgo get rsc.io/quote@'&lt;v1.6' # finds v1.5.2 </pre> <p> All of these also download (if not already cached) the specific version of <code>rsc.io/sampler</code> named in <code>rsc.io/quote</code>'s <code>go.mod</code> file. These commands modify the current module's <code>go.mod</code> file, and in that sense they do influence the operation of future commands. But that influence is through an explicit file that users are expected to know about and edit, not through hidden cache state. Note that if the version requested on the command line is earlier than the one already in <code>go.mod</code>, then <code>vgo</code> <code>get</code> does a downgrade, which will also downgrade other packages if needed, again following <a href="vgo-mvs">minimal version selection</a>. <p> In contrast to plain <code>go</code> <code>get</code>, the <code>go</code> <code>get</code> <code>-u</code> command behaves the same regardless of the state of the GOPATH source cache: it downloads the latest copy of the named packages and the latest copy of all their dependencies. Since it follows the command isolation rule, we should keep the same behavior: <code>vgo</code> <code>get</code> <code>-u</code> upgrades the named modules to their latest versions and also upgrades all of their dependencies. <p> One idea that has come up in the past few days is to introduce a mode halfway between <code>vgo</code> <code>get</code> (download the exact dependencies of the thing I asked for) and <code>vgo</code> <code>get</code> <code>-u</code> (download the latest dependencies). If we believe that authors are conscientious about being very careful with patch releases and only using them for critical, safe fixes, then it might make sense to have a <code>vgo</code> <code>get</code> <code>-p</code> that is like <code>vgo</code> <code>get</code> but then applies only patch-level upgrades. For example, if <code>rsc.io/quote</code> requires <code>rsc.io/sampler</code> v1.3.0 but v1.3.1 and v1.4.0 are also available, then <code>vgo</code> <code>get</code> <code>-p</code> <code>rsc.io/quote</code> would upgrade <code>rsc.io/sampler</code> to v1.3.1, not v1.4.0. If you think this would be useful, please let us know. <p> Of course, all the <code>vgo</code> <code>get</code> variants record the effect of their additions and upgrades in the <code>go.mod</code> file. 
In a sense, we've made these commands follow the isolation rule by introducing <code>go.mod</code> as an explicit, visible input that replaces a previously implicit, hidden input: the state of the entire GOPATH. <a class=anchor href="#module_information"><h2 id="module_information">Module Information (<code>go</code> <code>list</code>)</h2></a> <p> In addition to changing the versions being used, we need to provide some way to inspect the current ones. The <code>go</code> <code>list</code> command is already in charge of reporting useful information: <pre>$ go list -f {{.Dir}} rsc.io/quote /Users/rsc/src/rsc.io/quote $ go list -f {{context.ReleaseTags}} [go1.1 go1.2 go1.3 go1.4 go1.5 go1.6 go1.7 go1.8 go1.9 go1.10] $ </pre> <p> It probably makes sense to make module information available to the format template, and we should also provide shorthands for common operations like listing all the current module's dependencies. The <code>vgo</code> prototype already provides correct information for packages in dependency modules. For example: <pre>$ vgo list -f {{.Dir}} rsc.io/quote /Users/rsc/src/v/rsc.io/quote@v1.5.2 $ </pre> <p> It also has a few shorthands. First, <code>vgo</code> <code>list</code> <code>-t</code> lists all available tagged versions of a module: <pre>$ vgo list -t rsc.io/quote rsc.io/quote v1.0.0 v1.1.0 v1.2.0 v1.2.1 v1.3.0 v1.4.0 v1.5.0 v1.5.1 v1.5.2 $ </pre> <p> Second, <code>vgo</code> <code>list</code> <code>-m</code> lists the current module followed by its dependencies: <pre>$ vgo list -m MODULE VERSION github.com/you/hello - golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c rsc.io/quote v1.5.2 rsc.io/sampler v1.3.0 $ </pre> <p> Finally, <code>vgo</code> <code>list</code> <code>-m</code> <code>-u</code> adds a column showing the latest version of each module: <pre>$ vgo list -m -u MODULE VERSION LATEST github.com/you/hello - - golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c v0.0.0-20180208041248-4e4a3210bb54 rsc.io/quote v1.5.2 (2018-02-14 10:44) - rsc.io/sampler v1.3.0 (2018-02-13 14:05) v1.99.99 (2018-02-13 17:20) $ </pre> <p> In the long term, these should be shorthands for more general support in the format template, so that other programs can obtain the information in other forms. Today they are just special cases. <a class=anchor href="#preparing_new_versions"><h2 id="preparing_new_versions">Preparing New Versions (<code>go</code> <code>release</code>)</h2></a> <p> We want to encourage authors to issue tagged releases of their modules, so we need to make that as easy as possible. We intend to add a <code>go</code> <code>release</code> command that can take care of as much of the bookkeeping as needed. For example, it might: <ul> <li> <p> Check for backwards-incompatible type changes, compared to the previous release. We run a check like this when working on the Go standard library, and it is very helpful. <li> <p> Suggest whether this release should be a new point release or a new minor release (because there's new API or because many lines of code have changed). Or perhaps always suggest a new minor release unless the author asks for a point release, to keep a potential <code>go</code> <code>get</code> <code>-p</code> useful. <li> <p> Scan all source files in the module, even ones that aren't normally built, to make sure that all imports can be satisfied by the requirements listed in <code>go.mod</code>. 
Referring back to the example in the download section, this check would make sure that <code>logrus</code>'s <code>go.mod</code> lists <code>x/sys</code>.</ul> <p> As new best practices for releases arise, we can add them to <code>go</code> <code>release</code> so that authors always have just one step to check whether their module is ready for a new release. <a class=anchor href="#pattern_matching"><h2 id="pattern_matching">Pattern matching</h2></a> <p> Most <code>go</code> commands take a list of packages as arguments, and that list can include patterns, like <code>rsc.io/...</code> (all packages with import paths beginning with <code>rsc.io/</code>), or <code>./...</code> (all packages in the current directory or subdirectories), or <code>all</code> (all packages). We need to check that these make sense in the new world of modules. <p> Originally, patterns did not treat vendor directories specially, so that if <code>github.com/you/hello/vendor/rsc.io/quote</code> existed, then <code>go</code> <code>test</code> <code>github.com/you/hello/...</code> matched and tested it, as did <code>go</code> <code>test</code> <code>./...</code> when working in the <code>hello</code> source directory. The argument in favor of matching vendored code was that doing so avoided a special case and that it was actually useful to test your dependencies, as configured in your project, along with the rest of your project. The argument against matching vendored code was that many developers wanted an easy way to test just the code in their projects, assuming that dependencies have already been tested separately and are not changing. In Go 1.9, respecting that argument, we changed the <code>...</code> pattern not to walk into <code>vendor</code> directories, so that <code>go</code> <code>test</code> <code>github.com/you/hello/...</code> does not test vendored dependencies. This sets up nicely for <code>vgo</code>, which naturally would not match dependencies either, since they no longer live in a subdirectory of the main project. That is, there is no change in the behavior of <code>...</code> patterns when moving from <code>go</code> to <code>vgo</code>, because that change happened from Go 1.8 to Go 1.9 instead. <p> That leaves the pattern <code>all</code>. When we first wrote the <code>go</code> command, before <code>goinstall</code> and <code>go</code> <code>get</code>, it made sense to talk about building or testing “all packages.” Today, it makes much less sense: most developers work in a GOPATH that has a mix of many different things, including many packages downloaded and forgotten about. I expect that almost no one runs commands like <code>go</code> <code>install</code> <code>all</code> or <code>go</code> <code>test</code> <code>all</code> anymore: it catches too many things that don't matter. The real problem is that <code>go</code> <code>test</code> <code>all</code> violates the isolation rule: its meaning depends on the implicit state of GOPATH set up by previous commands, so no one depends on its meaning anymore. In the <code>vgo</code> prototype, we have redefined <code>all</code> to have a single, consistent meaning: all the packages in the current module, plus all the packages they depend on through a sequence of one or more imports. <p> The new <code>all</code> is exactly the packages a developer would need to test in order to sanity check that a particular combination of dependency versions works together, but it leaves out nearby packages that don't matter in the current module. 
For example, in the <a href="vgo1">overview post</a>, our <code>hello</code> module imported <code>rsc.io/quote</code> but not any other packages, and in particular not the buggy package <code>rsc.io/quote/buggy</code>. Running <code>go</code> <code>test</code> <code>all</code> in the <code>hello</code> module tests all packages in that module and then also <code>rsc.io/quote</code>. It omits <code>rsc.io/quote/buggy</code>, because that one is not needed, even indirectly, by the <code>hello</code> module, so it's irrelevant to test. This definition of <code>all</code> restores repeatability, and combined with Go 1.10's test caching, it should make <code>go</code> <code>test</code> <code>all</code> more useful than it ever has been. <a class=anchor href="#working_outside_gopath"><h2 id="working_outside_gopath">Working outside GOPATH</h2></a> <p> If there can be multiple versions of a package with a given import path, then it no longer makes sense to require the active development version of that package to reside in a specific directory. What if I need to work on bug fixes for both v1.3 and v1.4 at the same time? Clearly it must be possible to check out modules in different locations. In fact, at that point there's no need to work in GOPATH at all. <p> GOPATH was doing three things: it defined the versions of dependencies (now in <code>go.mod</code>), it held the source code for those dependencies (now in a separate cache), and it provided a way to infer the import path for code in a particular directory (remove the leading <code>$GOPATH/src</code>). As long as we have some mechanism to decide the import path for the code in the current directory, we can stop requiring that developers work in GOPATH. That mechanism is the <code>go.mod</code> file's <code>module</code> directive. If I'm in a directory named <code>buggy</code> and <code>../go.mod</code> says: <pre>module "rsc.io/quote" </pre> <p> then my directory's import path must be <code>rsc.io/quote/buggy</code>. <p> The <code>vgo</code> prototype enables work outside GOPATH today, as the examples in the <a href="vgo-intro">overview post</a> showed. In fact, when inferring a <code>go.mod</code> from other dependency information, <code>vgo</code> will look for import comments in the current directory or subdirectories to try to get its bearings. For example, this worked even before Upspin had introduced a <code>go.mod</code> file: <pre>$ cd $HOME $ git clone https://github.com/upspin/upspin $ cd upspin $ vgo test -short ./... </pre> <p> The <code>vgo</code> command inferred from import comments that the module is named <code>upspin.io</code>, and it inferred a list of dependency version requirements from <code>Gopkg.lock</code>. <a class=anchor href="#whats_next"><h2 id="whats_next">What's Next?</h2></a> <p> This is the last of my initial posts about the <code>vgo</code> design and prototype. There is more to work out, but inflicting 67 pages of posts on everyone seems like enough for one week. <p> I had planned to post a FAQ today and submit a Go proposal Monday, but I will be away next week after Monday. Rather than disappear for the first four days of official proposal discussion, I think I will post the proposal when I return. Please continue to ask questions on the mailing list threads or on these posts and to try the <code>vgo</code> prototype. <p> Thanks very much for all your interest and feedback so far. 
It's very important to me that we all work together to produce something that works well for Go developers and that is easy for us all to switch to. <p> <b>Update</b>, March 20, 2018: The official Go proposal is at <a href="https://golang.org/issue/24301">https://golang.org/issue/24301</a>, and the second comment on the issue will be the FAQ. Defining Go Modules tag:research.swtch.com,2012:research.swtch.com/vgo-module 2018-02-22T17:00:00-05:00 2018-02-22T17:02:00-05:00 How to specify what's in a module. (Go & Versioning, Part 6) <p> As introduced in the <a href="vgo-intro">overview post</a>, a Go <i>module</i> is a collection of packages versioned as a unit, along with a <code>go.mod</code> file listing other required modules. The move to modules is an opportunity for us to revisit and fix many details of how the <code>go</code> command manages source code. The current <code>go</code> <code>get</code> model will be about ten years old when we retire it in favor of modules. We need to make sure that the module design will serve us well for the next decade. In particular: <ul> <li> <p> We want to encourage more developers to tag releases of their packages, instead of expecting that users will just pick a commit hash that looks good to them. Tagging explicit releases makes clear what is expected to be useful to others and what is still under development. At the same time, it must still be possible—although maybe not convenient—to request specific commits. <li> <p> We want to move away from invoking version control tools such as <code>bzr</code>, <code>fossil</code>, <code>git</code>, <code>hg</code>, and <code>svn</code> to download source code. These fragment the ecosystem: packages developed using Bazaar or Fossil, for example, are effectively unavailable to users who cannot or choose not to install these tools. The version control tools have also been a source of <a href="https://golang.org/issue/22131">exciting</a> <a href="https://www.mercurial-scm.org/wiki/WhatsNew/Archive#Mercurial_3.2.3_.282014-12-18.29">security</a> <a href="https://git-blame.blogspot.com/2014/12/git-1856-195-205-214-and-221-and.html">problems</a>. It would be good to move them outside the security perimeter. <li> <p> We want to allow multiple modules to be developed in a single source code repository but versioned independently. While most developers will likely keep working with one module per repo, larger projects might benefit from having multiple modules in a single repo. For example, we'd like to keep <code>golang.org/x/text</code> a single repository but be able to version experimental new packages separately from established packages. <li> <p> We want to make it easy for individuals and companies to put caching proxies in front of <code>go</code> <code>get</code> downloads, whether for availability (use a local copy to ensure the download works tomorrow) or security (vet packages before they can be used inside a company). <li> <p> We want to make it possible, at some future point, to introduce a shared proxy for use by the Go community, similar in spirit to those used by Rust, Node, and other languages. At the same time, the design must work well without assuming such a proxy or registry. <li> <p> We want to eliminate vendor directories. They were introduced for reproducibility and availability, but we now have better mechanisms. 
Reproducibility is handled by proper versioning, and availability is handled by caching proxies.</ul> <p> This post presents the parts of the <code>vgo</code> design that address these issues. Everything here is preliminary: we will change the design if we find that it is not right. <a class=anchor href="#versioned_releases"><h2 id="versioned_releases">Versioned Releases</h2></a> <p> Abstraction boundaries let projects scale. Originally, all Go packages could be imported by all other Go packages. We introduced the <a href="https://golang.org/s/go14internal"><code>internal</code> directory convention</a> in Go 1.4 to eliminate the problem that developers who chose to structure a program as multiple packages had to worry about other users importing and depending on details of helper packages never meant for public use. <p> The Go community has a similar visibility problem now with repository commits. Today, it's very common for users to identify package versions by commit identifiers (usually Git hashes), with the result that developers who structure work as a sequence of commits need to worry, at least in the back of their minds, about users pinning to any of those commits, which again were never meant for public use. We need to change the expectations in the Go open source community, to establish a norm that authors tag releases and users prefer those. <p> I don't think this point, that users should be choosing from versions issued by authors instead of picking out individual commits from the Git history, is particularly controversial. The difficult part is shifting the norm. We need to make it easy for authors to tag commits and easy for users to use those tags. <p> The most common way authors share code today is on code hosting sites, especially GitHub. For code on GitHub, all authors will need to do is tag a commit and push the tag. We also plan to provide a tool, maybe called <code>go</code> <code>release</code>, to compare different versions of a module for API compatibility at the type level, to catch inadvertent breaking changes that are visible in the type system, and also to help authors decide whether a release should be a minor release (because it adds new API or changes many lines of code) or only a patch release. <p> For users, <code>vgo</code> itself operates entirely in terms of tagged versions. However, we know that at least during the transition from old practices to new, and perhaps indefinitely as a way to bootstrap new projects, an escape hatch will be necessary, to allow specifying a commit. This is possible in <code>vgo</code>, but it has been designed so as to make users prefer explicitly tagged versions. <p> Specifically, <code>vgo</code> understands the special pseudo-version <code>v0.0.0-</code><i>yyyymmddhhmmss</i><code>-</code><i>commit</i> as referring to the given commit identifier, which is typically a shortened Git hash and which must have a commit time matching the (UTC) timestamp. This form is a valid semantic version string for a prerelease of v0.0.0. 
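<p> To make the form concrete, here is a small sketch in Go of how such a pseudo-version string could be constructed from a commit time and a shortened hash. This is purely illustrative, not <code>vgo</code>'s own code, and the helper name is my own; the sample time and hash match the <code>dep</code> conversion example below: <pre>package main

import (
	"fmt"
	"time"
)

// pseudoVersion formats a commit time and a shortened commit hash
// in the v0.0.0-yyyymmddhhmmss-commit form described above.
// Illustrative only; not the code vgo itself uses.
func pseudoVersion(commitTime time.Time, shortHash string) string {
	return fmt.Sprintf("v0.0.0-%s-%s",
		commitTime.UTC().Format("20060102150405"), shortHash)
}

func main() {
	t := time.Date(2018, time.January, 16, 22, 59, 9, 0, time.UTC)
	fmt.Println(pseudoVersion(t, "922ceac0585d"))
	// prints v0.0.0-20180116225909-922ceac0585d
}
</pre>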
For example, this pair of <code>Gopkg.toml</code> stanzas: <pre>[[projects]] name = "google.golang.org/appengine" packages = [ "internal", "internal/base", "internal/datastore", "internal/log", "internal/remote_api", "internal/urlfetch", "urlfetch" ] revision = "150dc57a1b433e64154302bdc40b6bb8aefa313a" version = "v1.0.0" [[projects]] branch = "master" name = "github.com/google/go-github" packages = ["github"] revision = "922ceac0585d40f97d283d921f872fc50480e06e" </pre> <p> corresponds to these <code>go.mod</code> lines: <pre>require ( "google.golang.org/appengine" v1.0.0 "github.com/google/go-github" v0.0.0-20180116225909-922ceac0585d ) </pre> <p> The pseudo-version form is chosen so that the standard semver precedence rules compare two pseudo-versions by commit time, because the timestamp encoding makes string comparison match time comparison. The form also ensures that <code>vgo</code> will always prefer a tagged semantic version over an untagged pseudo-version, because even if v0.0.1 is very old, it has a greater semver precedence than any v0.0.0 prerelease. (Note also that this matches the choice made by <code>dep</code> when adding a new dependency to a project.) And of course pseudo-version strings are unwieldy: they stand out in <code>go.mod</code> files and <code>vgo</code> <code>list</code> <code>-m</code> output. All these inconveniences help encourage authors and users to prefer explicitly tagged versions, a bit like the extra step of having to write <code>import</code> <code>"unsafe"</code> encourages developers to prefer writing safe code. <a class=anchor href="#go.mod_file"><h2 id="go.mod_file">The <code>go.mod</code> File</h2></a> <p> A module version is defined by a tree of source files. The <code>go.mod</code> file describes the module and also indicates the root directory. When <code>vgo</code> is run in a directory, it looks in the current directory and then successive parents to find the <code>go.mod</code> marking the root. <p> The file format is line-oriented, with <code>//</code> comments only. Each line holds a single directive, which is a single verb (<code>module</code>, <code>require</code>, <code>exclude</code>, or <code>replace</code>, as defined by <a href="vgo-mvs">minimal version selection</a>), followed by arguments: <pre>module "my/thing" require "other/thing" v1.0.2 require "new/thing" v2.3.4 exclude "old/thing" v1.2.3 replace "bad/thing" v1.4.5 =&gt; "good/thing" v1.4.5 </pre> <p> The leading verb can be factored out of adjacent lines, leading to a block, like in Go imports: <pre>require ( "new/thing" v2.3.4 "old/thing" v1.2.3 ) </pre> <p> My goals for the file format were that it be (1) clear and simple, (2) easy for people to read, edit, manipulate, and diff, (3) easy for programs like <code>vgo</code> to read, modify, and write back, preserving comments and general structure, and (4) flexible enough to allow limited future growth. I looked at JSON, TOML, XML, and YAML but none of them seemed to have those four properties all at once. For example, the approach used in <code>Gopkg.toml</code> above leads to three lines for each requirement, making them harder to skim, sort, and diff. Instead I designed a minimal format reminiscent of the top of a Go program, but hopefully not close enough to be confusing. I adapted an existing comment-friendly parser. 
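<p> As a rough illustration of that third property, a toy parser for this line-oriented format fits in a few dozen lines of Go. This is only a sketch under the format rules just described, not the parser <code>vgo</code> actually uses (which also preserves comments and structure so files can be rewritten), but it shows how little machinery the format demands: <pre>package main

import (
	"bufio"
	"fmt"
	"strings"
)

// A Directive is one parsed go.mod line: a verb plus its arguments.
type Directive struct {
	Verb string
	Args []string
}

// parse reads the line-oriented format described above: // comments,
// one directive per line, and an optional "verb ( ... )" block that
// factors the verb out of adjacent lines.
func parse(text string) ([]Directive, error) {
	var ds []Directive
	block := "" // current factored-out verb, if inside a ( ... ) block
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "//"); i &gt;= 0 {
			line = line[:i] // strip comment
		}
		f := strings.Fields(line)
		switch {
		case len(f) == 0:
			// blank line
		case block != "" &amp;&amp; f[0] == ")":
			block = ""
		case block != "":
			ds = append(ds, Directive{block, f})
		case len(f) == 2 &amp;&amp; f[1] == "(":
			block = f[0]
		default:
			ds = append(ds, Directive{f[0], f[1:]})
		}
	}
	if block != "" {
		return nil, fmt.Errorf("unclosed %s ( ... ) block", block)
	}
	return ds, sc.Err()
}

func main() {
	ds, err := parse(`module "my/thing"
require (
	"new/thing" v2.3.4
	"old/thing" v1.2.3 // needs replacing
)
replace "bad/thing" v1.4.5 =&gt; "good/thing" v1.4.5
`)
	if err != nil {
		panic(err)
	}
	for _, d := range ds {
		fmt.Println(d.Verb, d.Args)
	}
}
</pre>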
<p> The eventual <code>go</code> command integration may change the file format, perhaps even adopting a more standard framing, but for compatibility we will keep the ability to read today's <code>go.mod</code> files, just as <code>vgo</code> can also read requirement information from <code>GLOCKFILE</code>, <code>Godeps/Godeps.json</code>, <code>Gopkg.lock</code>, <code>dependencies.tsv</code>, <code>glide.lock</code>, <code>vendor.conf</code>, <code>vendor.yml</code>, <code>vendor/manifest</code>, and <code>vendor/vendor.json</code> files. <a class=anchor href="#from_repository_to_modules"><h2 id="from_repository_to_modules">From Repository to Modules</h2></a> <p> Developers work in version control systems, and clearly <code>vgo</code> must make that as easy as possible. It is not reasonable to expect developers to prepare module archives themselves, for example. Instead, <code>vgo</code> makes it easy to export modules directly from any version control repository following some basic, unobtrusive conventions. <p> To start, it suffices to create a repository and tag a commit, using a semver-formatted tag like <code>v0.1.0</code>. The leading <code>v</code> is required, and having three numbers is also required. Although <code>vgo</code> itself accepts shorthands like <code>v0.1</code> on the command line, the canonical form <code>v0.1.0</code> must be used in repository tags, to avoid ambiguity. Only the tag is required. A <code>go.mod</code> file is not strictly required at this point, so that commits made without any use of <code>vgo</code> can still be used. Creating new tagged commits creates new module versions. Easy. <p> When developers reach v2, semantic import versioning means that a <code>/v2/</code> is added to the import path at the end of the module root prefix: <code>my/thing/v2/sub/pkg</code>. There are good reasons for this convention, as described in the <a href="vgo-import">earlier post</a>, but it is still a departure from existing tools. Realizing this, <code>vgo</code> will not use any v2 or later tag in a source code repository without first checking that it has a <code>go.mod</code> with a module path declaration ending in that major version (for example, <code>module</code> <code>"my/thing/v2"</code>). <code>Vgo</code> uses that declaration as evidence that the author is using semantic import versioning to name packages within that module. This is especially important for multi-package modules, since the import paths within the module must contain the <code>/v2/</code> element to avoid referring back to the v1 module. <p> We expect that most developers will prefer to follow the usual “major branch” convention, in which different major versions live in different branches. In this case, the root directory in a v2 branch would have a <code>go.mod</code> indicating v2, like this: <p> <img name="gitmod-1" class="center pad" width=591 height=416 src="gitmod-1.png" srcset="gitmod-1.png 1x, gitmod-1@1.5x.png 1.5x, gitmod-1@2x.png 2x, gitmod-1@3x.png 3x, gitmod-1@4x.png 4x"> <p> This is roughly how most developers already work. In the picture, the v1.0.0 tag points to a commit that predates <code>vgo</code>. It has no <code>go.mod</code> file at all, and that works fine. In the commit tagged v1.0.1, the author has added a <code>go.mod</code> file that says <code>module</code> <code>"my/thing"</code>. After that commit, however, the author forks a new v2 development branch. 
In addition to whatever code changes prompted v2 (including the replacement of <code>bar</code> with <code>quux</code>), the <code>go.mod</code> in that new branch is updated to say <code>module</code> <code>"my/thing/v2"</code>. The branches can then proceed independently. In truth, <code>vgo</code> really has no idea about branches. It just resolves the tag to a commit and then looks at the <code>go.mod</code> file in the commit. Again, the <code>go.mod</code> file is required for v2 and later so that <code>vgo</code> can use the <code>module</code> line as a sign that the code has been written with semantic import versioning in mind, so the imports in <code>foo</code> say <code>my/thing/v2/foo/quux</code>, not <code>my/thing/foo/quux</code>. <p> As an alternative, <code>vgo</code> also supports a “major subdirectory” convention, in which major versions above v1 are developed in subdirectories: <p> <img name="gitmod-2" class="center pad" width=376 height=768 src="gitmod-2.png" srcset="gitmod-2.png 1x, gitmod-2@1.5x.png 1.5x, gitmod-2@2x.png 2x, gitmod-2@3x.png 3x, gitmod-2@4x.png 4x"> <p> In this case, v2.0.0 is created not by forking the whole tree into a separate branch but by copying it into a subdirectory. Again the <code>go.mod</code> must be updated to say <code>"my/thing/v2"</code>. Afterward, v1.x.x tags pointing at commits address the files in the root directory, excluding <code>v2/</code>, while v2.x.x tags pointing at commits address the files in the <code>v2/</code> subdirectory only. The <code>go.mod</code> file lets <code>vgo</code> distinguish the two cases. It would also be meaningful to have a v1.x.x and a v2.x.x tag pointing at the same commit: they would address different subtrees of the commit. <p> We expect that developers may feel strongly about choosing one convention or the other. Instead of taking sides, <code>vgo</code> supports both. Note that for major versions above v2, the major subdirectory approach may provide a more graceful transition for users of <code>go</code> <code>get</code>. On the other hand, users of <code>dep</code> or vendoring tools should be able to consume repositories using either convention. Certainly we will make sure <code>dep</code> can. <a class=anchor href="#multiple-module_repositories"><h3 id="multiple-module_repositories">Multiple-Module Repositories</h3></a> <p> Developers may also find it useful to maintain a collection of modules in a single source code repository. We want <code>vgo</code> to support this possibility. In general, there is already wide variation in how different developers, teams, projects, and companies apply source control, and we do not believe it is productive to impose a single mapping like “one repository equals one module” onto all developers. Having some flexibility here should also help <code>vgo</code> adapt as best practices around source control continue to change. <p> In the major subdirectory convention, <code>v2/</code> contains the module <code>"my/thing/v2"</code>. A natural extension is to allow subdirectories not named for major versions. For example, we could add a <code>blue/</code> subdirectory that contains the module <code>"my/thing/blue"</code>, confirmed by a <code>blue/go.mod</code> file with that module path. In this case, the source control commit tags addressing that module would take the form <code>blue/v1.x.x</code>. Similarly, the tag <code>blue/v2.x.x</code> would address the <code>blue/v2/</code> subdirectory. 
The existence of the <code>blue/go.mod</code> file excludes the <code>blue/</code> tree from the outer <code>my/thing</code> module. <p> In the Go project, we intend to explore using this convention to allow repositories like <code>golang.org/x/text</code> to define multiple, independent modules. This lets us retain the convenience of coarse-grained source control but still promote different subtrees to v1 at different times. <a class=anchor href="#deprecated_versions"><h3 id="deprecated_versions">Deprecated Versions</h3></a> <p> Authors also need to be able to deprecate a version, to indicate that it should not be used anymore. This is not yet implemented in the <code>vgo</code> prototype, but one way it could work would be to define that on code hosting sites, the existence of a tag v1.0.0+deprecated (ideally pointing at the same commit as v1.0.0) would indicate that the commit is deprecated. It is of course important not to remove the tag entirely, because that will break builds. Deprecated modules would be highlighted in some way in <code>vgo</code> <code>list</code> <code>-m</code> <code>-u</code> output (“show me my modules and information about updates”), so that users would know to update. <p> Also, because programs will have access to their own module lists and versions at runtime, a program could also be configured to check its own module versions against some chosen authority and self-report in some way when it is running deprecated versions. Again, the details here are not worked out, but it's a good example of something that's possible once developers and tools share a vocabulary for describing versions. <a class=anchor href="#publishing"><h3 id="publishing">Publishing</h3></a> <p> Given a source control repository, developers need to be able to publish it in a form that <code>vgo</code> can consume. In the general case, we will provide a command that authors run to turn their source control repositories into file trees that can be served to <code>vgo</code> by any web server capable of serving static files. Similar to current <code>go</code> <code>get</code>, <code>vgo</code> expects a page with a <code>&lt;meta&gt;</code> tag to help translate from a module name to the tree of files for that module. For example, to look up <code>swtch.com/testmod</code>, the <code>vgo</code> command fetches the usual page: <pre>$ curl -sSL 'https://swtch.com/testmod?go-get=1' &lt;!DOCTYPE html&gt; &lt;meta name="go-import" content="swtch.com/testmod mod https://storage.googleapis.com/gomodules/rsc"&gt; Nothing to see here. $ </pre> <p> The <code>mod</code> server type indicates that modules are served in a file tree at that base URL. The relevant files at <i>storage.googleapis.com/gomodules/rsc</i> in this simple case are: <ul> <li> <a href="https://storage.googleapis.com/gomodules/rsc/swtch.com/testmod/@v/list"><code>.../swtch.com/testmod/@v/list</code></a> <li> <a href="https://storage.googleapis.com/gomodules/rsc/swtch.com/testmod/@v/v1.0.0.info"><code>.../swtch.com/testmod/@v/v1.0.0.info</code></a> <li> <a href="https://storage.googleapis.com/gomodules/rsc/swtch.com/testmod/@v/v1.0.0.mod"><code>.../swtch.com/testmod/@v/v1.0.0.mod</code></a> <li> <a href="https://storage.googleapis.com/gomodules/rsc/swtch.com/testmod/@v/v1.0.0.zip"><code>.../swtch.com/testmod/@v/v1.0.0.zip</code></a></ul> <p> The exact meaning of these URLs is discussed in the “Download Protocol” section later in the post. 
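<p> To make the discovery step concrete, here is a rough sketch, in Go, of a client that performs the <code>?go-get=1</code> lookup and extracts the base URL from a <code>go-import</code> tag naming the <code>mod</code> server type. It is only an illustration of the page format shown above, not <code>vgo</code>'s actual implementation (which uses a real HTML parser and handles redirects, vanity prefixes, and error cases), and the helper name is my own: <pre>package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

// findModBaseURL fetches the module path's ?go-get=1 page and scans it
// for a go-import meta tag whose content has the form
// "prefix mod baseURL", returning the advertised base URL.
func findModBaseURL(module string) (string, error) {
	resp, err := http.Get("https://" + module + "?go-get=1")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(body), "\n") {
		if !strings.Contains(line, `name="go-import"`) {
			continue
		}
		i := strings.Index(line, `content="`)
		if i &lt; 0 {
			continue
		}
		content := line[i+len(`content="`):]
		if j := strings.Index(content, `"`); j &gt;= 0 {
			content = content[:j]
		}
		if f := strings.Fields(content); len(f) == 3 &amp;&amp; f[1] == "mod" {
			return f[2], nil
		}
	}
	return "", fmt.Errorf("no go-import mod tag found for %s", module)
}

func main() {
	base, err := findModBaseURL("swtch.com/testmod")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(base)
	// With the server shown above, this prints
	// https://storage.googleapis.com/gomodules/rsc.
}
</pre>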
<a class=anchor href="#code_hosting_sites"><h3 id="code_hosting_sites">Code Hosting Sites</h3></a> <p> A huge amount of development happens on code hosting sites, and we want that work to integrate into <code>vgo</code> as smoothly as possible. Instead of expecting developers to publish modules elsewhere, <code>vgo</code> will have support for reading the information it needs from those sites directly, using their HTTP-based APIs. In general, archive downloads can be significantly faster than the existing version control checkouts. For example, working on a laptop with a gigabit internet connection, it takes 10 seconds to download the <a href="https://github.com/cockroachdb/cockroach">CockroachDB source tree</a> as a zip file from GitHub but almost four minutes to <code>git</code> <code>clone</code> it. Sites need only provide an archive of any form that can be fetched with a simple HTTP GET. Gerrit servers, for example, only support downloading gzipped tar files. <code>Vgo</code> translates downloaded archives into the standard form. <p> The initial prototype only includes support for GitHub and the Go project's Gerrit server, but we will add support for Bitbucket and other major hosting sites too, before shipping anything in the main Go toolchain. <p> With the combination of the lightweight repository conventions, which mostly match what developers are already doing, and the support for known code hosting sites, we expect that most open source activity will be unaffected by the move to modules, other than simply adding a <code>go.mod</code> to each repository. <p> Companies taking advantage of old <code>go</code> <code>get</code>'s direct use of <code>git</code> and other source control tools will need to adjust. Perhaps it would make sense to write a proxy that serves the <code>vgo</code> expectations but using version control tools. Companies could then run one of those to produce an experience much like using the open source hosting sites. <a class=anchor href="#module_archives"><h2 id="module_archives">Module Archives</h2></a> <p> The mapping from repositories to modules is a bit complex, because the way developers use source control varies. The end goal is to map all that complexity down into a common, single format for Go modules for use by proxies or other code consumers (for example, <i>godoc.org</i> or any code checking tools). <p> The standard format in the <code>vgo</code> prototype is zip archives in which all paths begin with the module path and version. For example, after running <code>vgo</code> <code>get</code> of <code>rsc.io/quote</code> v1.5.2, you can find the zip file in <code>vgo</code>'s download cache: <pre>$ unzip -l $GOPATH/src/v/cache/rsc.io/quote/@v/v1.5.2.zip 1479 00-00-1980 00:00 rsc.io/quote@v1.5.2/LICENSE 131 00-00-1980 00:00 rsc.io/quote@v1.5.2/README.md 240 00-00-1980 00:00 rsc.io/quote@v1.5.2/buggy/buggy_test.go 55 00-00-1980 00:00 rsc.io/quote@v1.5.2/go.mod 793 00-00-1980 00:00 rsc.io/quote@v1.5.2/quote.go 917 00-00-1980 00:00 rsc.io/quote@v1.5.2/quote_test.go $ </pre> <p> I used zip because it is well-specified, widely supported, and cleanly extensible if needed, and it allows random access to individual files. (In contrast, tar files, the other obvious choice, are none of these things and don't.) <a class=anchor href="#download_protocol"><h2 id="download_protocol">Download Protocol</h2></a> <p> To download information about modules, as well as the modules themselves, the <code>vgo</code> prototype issues only simple HTTP GET requests. 
A key design goal was to make it possible to serve modules from static hosting sites, so the requests have no URL query parameters. <p> As we saw earlier, custom domains can specify that a module is hosted at a particular base URL. As implemented in <code>vgo</code> today (but, like all of <code>vgo</code>, subject to change), that module-hosting server must serve four request forms: <ul> <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@v/list</code> fetches a list of all known versions, one per line. <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@v/</code><i>version</i><code>.info</code> fetches JSON-formatted metadata about that version. <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@v/</code><i>version</i><code>.mod</code> fetches the <code>go.mod</code> file for that version. <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@v/</code><i>version</i><code>.zip</code> fetches the zip file for that version.</ul> <p> The JSON information served in the <i>version</i><code>.info</code> form will likely evolve, but today it corresponds to this struct: <pre>type RevInfo struct { Version string // version string Name string // complete ID in underlying repository Short string // shortened ID, for use in pseudo-version Time time.Time // commit time } </pre> <p> The <code>vgo</code> <code>list</code> <code>-m</code> <code>-u</code> command shows the commit time of each available update by using the <code>Time</code> field. <p> A general module-hosting server may optionally respond to <i>version</i><code>.info</code> requests for non-semver versions as well. A <code>vgo</code> command like <pre>vgo get my/thing/v2@1459def </pre> <p> will fetch <code>1459def.info</code> and then derive a pseudo-version using the <code>Time</code> and <code>Short</code> fields. <p> There are two more optional request forms: <ul> <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@t/</code><i>yyyymmddhhmmss</i> returns the <code>.info</code> JSON for the latest version at or before the given timestamp. <li> <code>GET</code> <i>baseURL</i><code>/</code><i>module</i><code>/@t/</code><i>yyyymmddhhmmss</i><code>/</code><i>branch</i> does the same, but limiting the search to commits on a given branch.</ul> <p> These support the use of untagged commits in <code>vgo</code>. If <code>vgo</code> is adding a module and finds no tagged commits at all, it uses the first form to find the latest commit as of now. It does the same when looking for available updates, assuming there are still no tagged commits. The branch-limited form is used for the internal simulation of <i>gopkg.in</i>. These forms also support the command line syntaxes: <pre>vgo get my/thing/v2@2018-02-01T15:34:45 vgo get my/thing/v2@2018-02-01T15:34:45@branch </pre> <p> These might be a mistake, but they're in the prototype today, so I'm mentioning them. <a class=anchor href="#proxy_servers"><h2 id="proxy_servers">Proxy Servers</h2></a> <p> Both individuals and companies may prefer to download Go modules from proxy servers, whether for efficiency, availability, security, license compliance, or any other reason. Having a standard Go module format and a standard download protocol, as described in the last two sections, makes it trivial to introduce support for proxies. If the <code>$GOPROXY</code> environment variable is set, <code>vgo</code> fetches all modules from the server at the given base URL, not from their usual locations. 
For easy debugging, <code>$GOPROXY</code> can even be a <code>file:///</code> URL pointing at a local tree. <p> We intend to write a basic proxy server that serves from <code>vgo</code>'s own local cache, downloading new modules as needed. Sharing such a proxy among a set of computers would help reduce redundant downloads by the proxy's users but more importantly would ensure future availability, even if the original copies disappear. The proxy will also have an option not to allow downloads of new modules. In this mode, the proxy would limit the available modules to exactly those whitelisted by the proxy administrator. Both proxy modes are frequently requested features in corporate environments. <p> Perhaps some day it would make sense to establish a distributed collection of proxy servers used by default in <code>go</code> <code>get</code>, to ensure module availability and fast downloads for Go developers worldwide. But not yet. Today, we are focused on making sure that <code>go</code> <code>get</code> works as well as it can without assuming any kind of central proxy servers. <a class=anchor href="#end_of_vendoring"><h2 id="end_of_vendoring">The End of Vendoring</h2></a> <p> Vendor directories serve two purposes. First, they specify by their contents the exact version of the dependencies to use during <code>go</code> <code>build</code>. Second, they ensure the availability of those dependencies, even if the original copies disappear. On the other hand, vendor directories are also difficult to manage and bloat the repositories in which they appear. With the <code>go.mod</code> file specifying the exact version of dependencies to use during <code>vgo</code> <code>build</code>, and with proxy servers for ensuring availability, vendor directories are now almost entirely redundant. They can, however, serve one final purpose: to enable a smooth transition to the new versioned world. <p> When building a module, <code>vgo</code> (and later <code>go</code>) will completely ignore vendored dependencies; those dependencies will also not be included in the module's zip file. To make it possible for authors to move to <code>vgo</code> and <code>go.mod</code> while still supporting users who haven't converted, the new <code>vgo</code> <code>vendor</code> command populates a module's vendor directory with the packages users need to reproduce the <code>vgo</code>-based build. <a class=anchor href="#whats_next"><h2 id="whats_next">What's Next?</h2></a> <p> The details here may be revised, but today's <code>go.mod</code> files will be understood by any future tooling. Please start tagging your packages with release tags; add <code>go.mod</code> files if that makes sense for your project. <p> The next post in the series will cover changes to the <code>go</code> tool command line experience. Reproducible, Verifiable, Verified Builds tag:research.swtch.com,2012:research.swtch.com/vgo-repro 2018-02-21T21:28:00-05:00 2018-02-21T21:30:00-05:00 Consistent builds in versioned Go. (Go & Versioning, Part 5) <p> Once both Go developers and tools share a vocabulary around package versions, it's relatively straightforward to add support in the toolchain for reproducible, verifiable, and verified builds. In fact, the basics are already in the <code>vgo</code> prototype. <p> Since people sometimes disagree about the exact definitions of these terms, let's establish some basic terminology. For this post: <ul> <li> A <i>reproducible build</i> is one that, when repeated, produces the same result. 
<li> A <i>verifiable build</i> is one that records enough information to be precise about exactly how to repeat it. <li> A <i>verified build</i> is one that checks that it is using the expected source code.</ul> <p> <code>Vgo</code> delivers reproducible builds by default. The resulting binaries are verifiable, in that they record versions of the exact source code that went into the build. And it is possible to configure your repository so that users rebuilding your software verify that their builds match yours, using cryptographic hashes, no matter how they obtain the dependencies. <a class=anchor href="#reproducible_builds"><h2 id="reproducible_builds">Reproducible Builds</h2></a> <p> At the very least, we want to make sure that when you build my program, the build system decides to use the same versions of the code. <a href="vgo-mvs">Minimal version selection</a> delivers this property by default. The <code>go.mod</code> file alone is enough to uniquely determine which module versions should be used for the build (assuming dependencies are available), and that decision is stable even as new versions of a module are introduced into the ecosystem. This differs from most other systems, which adopt new versions automatically and need to be constrained to yield reproducible builds. I covered this in the minimal version selection post, but it's an important, subtle detail, so I'll try to give a short reprise here. <p> To make this concrete, let's look at a few real packages from Cargo, Rust's package manager. To be clear, I am not picking on Cargo. I think Cargo is an example of the current state of the art in package managers, and there's much to learn from it. If we can make Go package management as smooth as Cargo's, I'll be happy. But I also think that it is worth exploring whether we would benefit from choosing a different default when it comes to version selection. <p> Cargo prefers maximum versions in the following sense. Over at crates.io, the latest <a href="https://crates.io/crates/toml"><code>toml</code></a> is 0.4.5 as I write this post. It lists a dependency on <a href="https://crates.io/crates/serde"><code>serde</code></a> 1.0 or later; the latest <code>serde</code> is 1.0.27. If you start a new project and add a dependency on <code>toml</code> 0.4.1 or later, Cargo has a choice to make. According to the constraints, any of 0.4.1, 0.4.2, 0.4.3, 0.4.4, or 0.4.5 would be acceptable. All other things being equal, Cargo prefers the <a href="cargo-newest.html">newest acceptable version</a>, 0.4.5. Similarly, any of <code>serde</code> 1.0.0 through 1.0.27 are acceptable, and Cargo chooses 1.0.27. These choices change as new versions are introduced. If <code>serde</code> 1.0.28 is released tonight and I add toml 0.4.5 to a project tomorrow, I'll get 1.0.28 instead of 1.0.27. As described so far, Cargo's builds are not repeatable. Cargo's (entirely reasonable) answer to this problem is to have not just a constraint file (the manifest, <code>Cargo.toml</code>) but also a list of the exact artifacts to use in the build (the lock file, <code>Cargo.lock</code>). The lock file stops future upgrades; once it is written, your build stays on <code>serde</code> 1.0.27 even when 1.0.28 is released. <p> In contrast, minimal version selection prefers the minimum allowed version, which is the exact version requested by some <code>go.mod</code> in the project. That answer does not change as new versions are added. 
Given the same choices in the Cargo example, <code>vgo</code> would select <code>toml</code> 0.4.1 (what you requested) and then <code>serde</code> 1.0 (what <code>toml</code> requested). Those choices are stable, without a lock file. This is what I mean when I say that <code>vgo</code>'s builds are reproducible by default. <a class=anchor href="#verifiable_builds"><h2 id="verifiable_builds">Verifiable Builds</h2></a> <p> Go binaries have long included a string indicating the version of Go they were built with. Last year I wrote a tool <code>rsc.io/goversion</code> that fetches that information from a given executable or tree of executables. For example, on my Ubuntu Linux laptop, I can look to see which system utilities are implemented in Go: <pre>$ go get -u rsc.io/goversion $ goversion /usr/bin /usr/bin/containerd go1.8.3 /usr/bin/containerd-shim go1.8.3 /usr/bin/ctr go1.8.3 /usr/bin/go go1.8.3 /usr/bin/gofmt go1.8.3 /usr/bin/kbfsfuse go1.8.3 /usr/bin/kbnm go1.8.3 /usr/bin/keybase go1.8.3 /usr/bin/snap go1.8.3 /usr/bin/snapctl go1.8.3 $ </pre> <p> Now that the <code>vgo</code> prototype understands module versions, it includes that information in the final binary too, and the new <code>goversion</code> <code>-m</code> flag prints it back out. Using our “hello, world” program from the <a href="vgo-tour">tour</a>: <pre>$ go get -u rsc.io/goversion $ goversion ./hello ./hello go1.10 $ goversion -m hello ./hello go1.10 path github.com/you/hello mod github.com/you/hello (devel) dep golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c dep rsc.io/quote v1.5.2 dep rsc.io/sampler v1.3.0 $ </pre> <p> The main module, supposedly <code>github.com/you/hello</code>, has no version information, because it's the local development copy, not a specific version we downloaded. But if instead we build a command directly from a versioned module, then the listing does report versions for all modules: <pre>$ vgo build -o hello2 rsc.io/hello vgo: resolving import "rsc.io/hello" vgo: finding rsc.io/hello (latest) vgo: adding rsc.io/hello v1.0.0 vgo: finding rsc.io/hello v1.0.0 vgo: finding rsc.io/quote v1.5.1 vgo: downloading rsc.io/hello v1.0.0 $ goversion -m ./hello2 ./hello2 go1.10 path rsc.io/hello mod rsc.io/hello v1.0.0 dep golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c dep rsc.io/quote v1.5.2 dep rsc.io/sampler v1.3.0 $ </pre> <p> When we do integrate versions into the main Go toolchain, we will add APIs to access this information from inside a running binary, just like <a href="https://golang.org/pkg/runtime/#Version"><code>runtime.Version</code></a> provides access to the more limited Go version information. <p> For the purpose of attempting to reconstruct the binary, the information listed by <code>goversion</code> <code>-m</code> suffices: put the versions into a <code>go.mod</code> file and build the target named on the <code>path</code> line. But if the result is not the same binary, you might wonder about ways to narrow down what's different. What changed? <p> When <code>vgo</code> downloads each module, it computes a hash of the file tree corresponding to that module. 
That hash is also included in the binary, alongside the version information, and <code>goversion</code> <code>-mh</code> prints it: <pre>$ goversion -mh ./hello
hello go1.10
	path github.com/you/hello
	mod github.com/you/hello (devel)
	dep golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
	dep rsc.io/quote v1.5.2 h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
	dep rsc.io/sampler v1.3.1 h1:F0c3J2nQCdk9ODsNhU3sElnvPIxM/xV1c/qZuAeZmac=
$ goversion -mh ./hello2
hello go1.10
	path rsc.io/hello
	mod rsc.io/hello v1.0.0 h1:CDmhdOARcor1WuRUvmE46PK91ahrSoEJqiCbf7FA56U=
	dep golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
	dep rsc.io/quote v1.5.2 h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
	dep rsc.io/sampler v1.3.0 h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$
</pre> <p> The <code>h1:</code> prefix indicates which hash is being reported. Today, there is only “hash 1,” a SHA-256 hash of a list of files and the SHA-256 hashes of their contents. If we need to update to a new hash later, the prefix will help us tell old from new hashes. <p> I must stress that these hashes are self-reported by the build system. If someone gives you a binary with certain hashes in its build information, there's no guarantee they are accurate. They are useful information to support later verification, not a signature that can be trusted on its own. <a class=anchor href="#verified_builds"><h2 id="verified_builds">Verified Builds</h2></a> <p> An author distributing a program in source form might want to let users verify that they are building it with exactly the expected dependencies. We know <code>vgo</code> will make the same decisions about which versions of dependencies to use, but there is still the problem of mapping a version like v1.5.2 to an actual source tree. What if the author of v1.5.2 changes the tag to point at a different file tree? What if a malicious middlebox intercepts the download request and delivers a different zip file? What if the user has accidentally edited the source files in the local copy of v1.5.2? The <code>vgo</code> prototype supports this kind of verification too. <p> The final form may be somewhat different, but if you create a file named <code>go.modverify</code> next to <code>go.mod</code>, then builds will keep that file up-to-date with known hashes for specific versions of modules: <pre>$ echo &gt;go.modverify
$ vgo build
$ tcat go.modverify  # go get rsc.io/tcat, or use cat
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
rsc.io/quote v1.5.2 h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
rsc.io/sampler v1.3.0 h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$
</pre> <p> The <code>go.modverify</code> file is a log of the hash of all versions ever encountered: lines are only added, never removed.
If we update <code>rsc.io/sampler</code> to v1.3.1, then the log will now contain hashes for both versions: <pre>$ vgo get rsc.io/sampler@v1.3.1
$ tcat go.modverify
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
rsc.io/quote v1.5.2 h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
rsc.io/sampler v1.3.0 h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
rsc.io/sampler v1.3.1 h1:F0c3J2nQCdk9ODsNhU3sElnvPIxM/xV1c/qZuAeZmac=
$
</pre> <p> When <code>go.modverify</code> exists, <code>vgo</code> checks that all downloaded modules used in a given build are consistent with entries already in the file. For example, if we change the first character of the <code>rsc.io/quote</code> hash from <code>w</code> to <code>v</code>: <pre>$ vgo build
vgo: verifying rsc.io/quote v1.5.2: module hash mismatch
	downloaded: h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
	go.modverify: h1:v5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
$
</pre> <p> Or suppose we fix that one but then modify the v1.3.0 hash. Now our build succeeds, because v1.3.0 is not being used by the build, so its line is (correctly) ignored. But if we try to downgrade to v1.3.0, then the build verification will correctly begin failing: <pre>$ vgo build
$ vgo get rsc.io/sampler@v1.3.0
vgo: verifying rsc.io/sampler v1.3.0: module hash mismatch
	downloaded: h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
	go.modverify: h1:8uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$
</pre> <p> Developers who want to ensure that others rebuild their program with exactly the same sources they did can store a <code>go.modverify</code> in their repository. Then others building from the same repo will automatically get verified builds. For now, only the <code>go.modverify</code> in the top-level module of the build applies. But note that <code>go.modverify</code> lists all dependencies, including indirect dependencies, so the whole build is verified. <p> The <code>go.modverify</code> feature helps detect unexpected mismatches between downloaded dependencies on different machines. It compares the hashes in <code>go.modverify</code> against hashes computed and saved at module download time. It is also useful to check that downloaded modules have not changed on the local machine since they were downloaded. This is less about security from attacks and more about avoiding mistakes. For example, because source file paths appear in stack traces, it's common to open those files when debugging. If you accidentally (or, I suppose, intentionally) modify the file during the debugging session, it would be nice to be able to detect that later. The <code>vgo</code> <code>verify</code> command does this: <pre>$ go get -u golang.org/x/vgo # fixed a bug, sorry! :-)
$ vgo verify
all modules verified
$
</pre> <p> If a source file changes, <code>vgo</code> <code>verify</code> notices: <pre>$ echo &gt;&gt;$GOPATH/src/v/rsc.io/quote@v1.5.2/quote.go
$ vgo verify
rsc.io/quote v1.5.2: dir has been modified (/Users/rsc/src/v/rsc.io/quote@v1.5.2)
$
</pre> <p> If we restore the file, all is well: <pre>$ gofmt -w $GOPATH/src/v/rsc.io/quote@v1.5.2/quote.go
$ vgo verify
all modules verified
$
</pre> <p> If cached zip files are modified after download, <code>vgo</code> <code>verify</code> notices that too, although I can't plausibly explain how that might happen: <pre>$ zip $GOPATH/src/v/cache/rsc.io/quote/@v/v1.5.2.zip /etc/resolv.conf
  adding: etc/resolv.conf (deflated 36%)
$ vgo verify
rsc.io/quote v1.5.2: zip has been modified (/Users/rsc/src/v/cache/rsc.io/quote/@v/v1.5.2.zip)
$
</pre> <p> Because <code>vgo</code> keeps the original zip file after unpacking it, if <code>vgo</code> <code>verify</code> decides that only one of the zip file and the directory tree has been modified, it could even print a diff of the two. <a class=anchor href="#whats_next"><h2 id="whats_next">What's Next?</h2></a> <p> This is already implemented in <code>vgo</code>. You can try it out and use it. As with the rest of <code>vgo</code>, feedback about what doesn't work right (or works great) is appreciated. <p> The functionality presented here is more the start of something than a finished feature. A cryptographic hash of the file tree is a building block. The <code>go.modverify</code> file built on top of it checks that developers all build a particular module with precisely the same dependencies, but there's no verification when downloading a new version of a module (unless someone else already added it to <code>go.modverify</code>), nor is there any sharing of expected hashes between modules. <p> The exact details of how to fix those two shortcomings are not obvious. It may make sense to allow some kind of cryptographic signatures of the file tree, and to verify that an upgrade finds a version signed with the same key as the previous version. Or it may make sense to adopt an approach along the lines of <a href="https://theupdateframework.github.io/">The Update Framework (TUF)</a>, although using their network protocols directly is not practical. Or, instead of using per-repo <code>go.modverify</code> logs, it might make sense to establish some kind of shared global log, a bit like <a href="https://www.certificate-transparency.org/">Certificate Transparency</a>, or to use a public identity server like <a href="https://upspin.io/">Upspin</a>. There are many avenues we might explore, but all this is getting a little ahead of ourselves. For now we are focused on successfully integrating versioning into the <code>go</code> command. Minimal Version Selection tag:research.swtch.com,2012:research.swtch.com/vgo-mvs 2018-02-21T16:41:00-05:00 2018-02-21T16:43:00-05:00 How do builds select which versions to use? (Go & Versioning, Part 4) <p> A <a href="vgo-intro">versioned Go command</a> must decide which module versions to use in each build. I call this list of modules and versions for use in a given build the <i>build list</i>. For stable development, today's build list must also be tomorrow's build list. But then developers must also be allowed to change the build list: to upgrade all modules, to upgrade one module, or to downgrade one module.
<p> The <i>version selection</i> problem therefore is to define the meaning of, and to give algorithms implementing, these four operations on build lists: <ol> <li> Construct the current build list. <li> Upgrade all modules to their latest versions. <li> Upgrade one module to a specific newer version. <li> Downgrade one module to a specific older version.</ol> <p> The last two operations specify one module to upgrade or downgrade, but doing so may require upgrading, downgrading, adding, or removing other modules, ideally as few as possible, to satisfy dependencies. <p> This post presents <i>minimal version selection</i>, a new, simple approach to the version selection problem. Minimal version selection is easy to understand and predict, which should make it easy to work with. It also produces <i>high-fidelity builds</i>, in which the dependencies a user builds are as close as possible to the ones the author developed against. It is also efficient to implement, using nothing more complex than recursive graph traversals, so that a complete minimal version selection implementation in Go is only a few hundred lines of code. <p> Minimal version selection assumes that each module declares its own dependency requirements: a list of minimum versions of other modules. Modules are assumed to follow the <a href="vgo-import">import compatibility rule</a>—packages in any newer version should work as well as older ones—so a dependency requirement gives only a minimum version, never a maximum version or a list of incompatible later versions. <p> Then the definitions of the four operations are: <ol> <li> To construct the build list for a given target: start the list with the target itself, and then append each requirement's own build list. If a module appears in the list multiple times, keep only the newest version. <li> To upgrade all modules to their latest versions: construct the build list, but read each requirement as if it requested the latest module version. <li> To upgrade one module to a specific newer version: construct the non-upgraded build list and then append the new module's build list. If a module appears in the list multiple times, keep only the newest version. <li> To downgrade one module to a specific older version: rewind the required version of each top-level requirement until that requirement's build list no longer refers to newer versions of the downgraded module.</ol> <p> These operations are simple, efficient, and easy to implement. <a class=anchor href="#example"><h2 id="example">Example</h2></a> <p> Before we examine minimal version selection in more detail, let's look at why a new approach is necessary. We'll use the following set of modules as a running example throughout the post: <p> <img name="version-select-1" class="center pad" width=463 height=272 src="version-select-1.png" srcset="version-select-1.png 1x, version-select-1@1.5x.png 1.5x, version-select-1@2x.png 2x, version-select-1@3x.png 3x, version-select-1@4x.png 4x"> <p> The diagram shows the module requirement graph for seven modules (dotted boxes) with one or more versions. Following semantic versioning, all versions of a given module share a major version number. We are developing module A 1, and we will run commands to update its dependency requirements. The diagram shows both A 1's current requirements and the requirements declared by various versions of released modules B 1 through F 1. 
<p> Because the major version is part of the module's identifier, we must know that we are working on A 1 as opposed to A 2, but otherwise the exact version of A is unspecified—our work is unreleased. Similarly, different major versions are just different modules: for the purposes of these algorithms, B 1 is no more related to B 2 than to C 1. We could replace B 1 through F 1 in the diagram with A 2 through A 7 at a significant loss in clarity but without any change in how the algorithms handle the example. Because all the modules in the example do have major version 1, from now on we will omit the major version when possible, shortening A 1 to A. Our current development copy of A requires B 1.2 and C 1.2. B 1.2 in turn requires D 1.3. An earlier version, B 1.1, required D 1.1. And so on. Note that F 1.1 requires G 1.1, but G 1.1 also requires F 1.1. Declaring this kind of cycle can be important when singleton functionality moves from one module to another. Our algorithms must not assume the module requirement graph is acyclic. <a class=anchor href="#low-fidelity_builds"><h2 id="low-fidelity_builds">Low-Fidelity Builds</h2></a> <p> Go's current version selection algorithm is simplistic, providing two different version selection algorithms, neither of which is right. <p> The first algorithm is the default behavior of <code>go</code> <code>get</code>: if you have a local version, use that one, or else download and use the latest version. This mode can use versions that are too old: if you have B 1.1 installed and run <code>go</code> <code>get</code> to download A, <code>go</code> <code>get</code> would not update to B 1.2, causing a failed or buggy build. <p> The second algorithm is the behavior of <code>go</code> <code>get</code> <code>-u</code>: download and use the latest version of everything. This mode fails by using versions that are too new: if you run <code>go</code> <code>get</code> <code>-u</code> to download A, it will correctly update to B 1.2, but it will also update to C 1.3 and E 1.3, which aren't what A asks for, may not have been tested, and may not work. <p> I call both these outcomes low-fidelity builds: viewed as attempts to reproduce the build that A's author used, these builds differ for no good reason. After we've seen the details of the minimal version selection algorithms, we'll look at why they produce high-fidelity builds instead. <a class=anchor href="#algorithms"><h2 id="algorithms">Algorithms</h2></a> <p> Now let's look at the algorithms in more detail. <a class=anchor href="#algorithm_1"><h3 id="algorithm_1">Algorithm 1: Construct Build List</h3></a> <p> There are two useful (and equivalent) ways to define build list construction: as a recursive process and as a graph traversal. <p> The recursive definition of build list construction is as follows. Construct the rough build list for M by starting an empty list, adding M, and then appending the build list for each of M's requirements. Simplify the rough build list to produce the final build list, by keeping only the newest version of any listed module. <p> <img name="version-select-list" class="center pad" width=467 height=278 src="version-select-list.png" srcset="version-select-list.png 1x, version-select-list@1.5x.png 1.5x, version-select-list@2x.png 2x, version-select-list@3x.png 3x, version-select-list@4x.png 4x"> <p> The recursive construction of build lists is useful mainly as a mental model. 
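<p> In Go, that recursive definition transliterates almost directly. The sketch below is illustrative only: module versions are abbreviated to strings like “B1.2”, the requirement graph is a hypothetical fragment loosely based on the diagram, and real version comparison is more careful than the string comparison used here. <pre>
// A sketch of the recursive build list definition, for illustration only.
package main

import (
	"fmt"
	"sort"
)

// reqs maps each module version to the versions it directly requires.
// This is a hypothetical fragment loosely based on the example diagram.
var reqs = map[string][]string{
	"A":    {"B1.2", "C1.2"},
	"B1.2": {"D1.3"},
	"C1.2": {"D1.4"},
	"D1.3": {"E1.2"},
	"D1.4": {"E1.2"},
}

// buildList is a literal transcription of the recursive definition:
// start with the target, append each requirement's own build list,
// and then keep only the newest version of each module. It recomputes
// sublists over and over and would loop forever on a cycle; it is a
// mental model, not an implementation.
func buildList(target string) []string {
	rough := []string{target}
	for _, r := range reqs[target] {
		rough = append(rough, buildList(r)...)
	}
	return keepNewest(rough)
}

// keepNewest keeps the newest listed version of each module.
// (Comparing version strings directly is good enough for the
// single-digit versions used in this sketch.)
func keepNewest(list []string) []string {
	newest := map[string]string{} // module name -> newest version seen
	for _, m := range list {
		name, vers := m[:1], m[1:]
		if old, ok := newest[name]; !ok || vers > old {
			newest[name] = vers
		}
	}
	var out []string
	for name, vers := range newest {
		out = append(out, name+vers)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(buildList("A")) // [A B1.2 C1.2 D1.4 E1.2]
}
</pre>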
A literal implementation of that definition would be too inefficient, potentially requiring time exponential in the size of an acyclic module requirement graph and running forever on a cyclic graph. <p> An equivalent, more efficient construction is based on graph reachability. The rough build list for M is also just the list of all modules reachable in the requirement graph starting at M and following arrows. This can be computed by a trivial recursive traversal of the graph, taking care not to visit a node that has already been visited. For example, A's rough build list is the highlighted module versions found by starting at A and following the highlighted arrows: <p> <img name="version-select-2" class="center pad" width=463 height=272 src="version-select-2.png" srcset="version-select-2.png 1x, version-select-2@1.5x.png 1.5x, version-select-2@2x.png 2x, version-select-2@3x.png 3x, version-select-2@4x.png 4x"> <p> (The simplification from rough build list to final build list remains the same.) <p> Note that this algorithm only visits each module in the rough build list once, and only those modules, so the execution time is proportional to the rough build list size |<i>B</i>| plus the number of arrows that must be traversed (at most |<i>B</i>|<sup>2</sup>). The algorithm completely ignores versions left off the rough build list: for example, it loads information about D 1.3, D 1.4, and E 1.2, but it does not load information about D 1.2, E 1.1 or E 1.3. In a dependency management setting, where loading information about each module version may mean a separate network round trip, avoiding unnecessary module versions is an important optimization. <a class=anchor href="#algorithm_2"><h3 id="algorithm_2">Algorithm 2. Upgrade All Modules</h3></a> <p> Upgrading all modules is perhaps the most common modification made to build lists. It is what <code>go</code> <code>get</code> <code>-u</code> does today. <p> We compute an upgraded build list by upgrading the module requirement graph and then applying the previous algorithm. An upgraded module requirement graph is one in which every arrow pointing at any version of a module has been replaced by one pointing at the latest version of that module. (It is then also possible to discard all older versions from the graph, but the build list construction won't look at them anyway, so there's no need to clean up the graph.) <p> For example, here is the upgraded module requirement graph, with the original build list still marked in yellow and the upgraded build list now marked in red: <p> <img name="version-select-3" class="center pad" width=463 height=272 src="version-select-3.png" srcset="version-select-3.png 1x, version-select-3@1.5x.png 1.5x, version-select-3@2x.png 2x, version-select-3@3x.png 3x, version-select-3@4x.png 4x"> <p> Although this tells us the upgraded build list, it does not yet tell us how to cause future builds to use that build list instead of the old build list (still marked in yellow). To upgrade the graph we changed the requirements for all modules, but an upgrade during development of module A must somehow be recorded only in A's requirement list (in A's <code>go.mod</code> file) in a way that causes Algorithm 1 to produce the build list we want, to pick the red modules instead of the yellow ones. To decide what to add to A's requirement list to cause that effect, we introduce a helper, Algorithm R. <a class=anchor href="#algorithm_r"><h3 id="algorithm_r">Algorithm R. 
Compute a Minimal Requirement List</h3></a> <p> Given a build list compatible with the module requirement graph below the target, we want to compute a requirement list for the target that will induce that build list. It is always sufficient to list every module in the build list other than the target itself. For example, the upgrade we considered above could add C 1.3 (replacing C 1.2), D 1.4, E 1.3, F 1.1, and G 1.1 to A's requirement list. But in general not all of these additions are necessary, and we want to list as few additional modules as possible. For example, F 1.1 implies G 1.1 (and vice versa), so we need not list both. At first glance it seems natural to start by adding the module versions marked in red but not yellow (on the new list but missing from the old list). That heuristic would incorrectly drop D 1.4, which is implied by the old requirement C 1.2 but not by the new requirement C 1.3. <p> Instead, it is correct to visit the modules in reverse postorder—that is, to visit a module only after considering all modules that point into it—and only keep a module if it is not implied by modules already visited. For an acyclic graph, the result is a unique, minimal set of additions. For a cyclic graph, the reverse-postorder traversal must break cycles, and then the set of additions is unique and minimal for the modules not involved in cycles. As long as the result is correct and stable, we'll accept non-minimal answers in the case of cycles. In this example, the upgrade needs to add C 1.3 (replacing C 1.2), D 1.4, and E 1.3. It can drop F 1.1 (implied by C 1.3) and G 1.1 (also implied by C 1.3). <a class=anchor href="#algorithm_3"><h3 id="algorithm_3">Algorithm 3. Upgrade One Module</h3></a> <p> Instead of upgrading all modules, cautious developers typically want to upgrade only one module, with as few other changes to the build list as possible. For example, we may want to upgrade to C 1.3, and we do not want that operation to make unnecessary changes like upgrading to E 1.3. Like in Algorithm 2, we can upgrade one module by upgrading the requirement graph, constructing a build list from it (Algorithm 1), and then reducing that list back to a set of requirements for the top-level module (Algorithm R). To upgrade the requirement graph, we add one new arrow from the top-level module to the upgraded module version. <p> For example, if we want to change A's build to upgrade to C 1.3, here is the upgraded requirement graph: <p> <img name="version-select-4" class="center pad" width=463 height=272 src="version-select-4.png" srcset="version-select-4.png 1x, version-select-4@1.5x.png 1.5x, version-select-4@2x.png 2x, version-select-4@3x.png 3x, version-select-4@4x.png 4x"> <p> Like before, the new build list's modules are marked in red, and the old build list's are in yellow. <p> The upgrade's effect on the build list is the unique minimal way to make the upgrade, adding the new module version and any implied requirements but nothing else. Note that when constructing the upgraded graph, we must only add new arrows, not replace or remove old ones. For example, if the new arrow from A to C 1.3 replaced the old arrow from A to C 1.2, the upgraded build list would omit D 1.4. That is, the upgrade of C would downgrade D, an unexpected, unwanted, and non-minimal change. Once we've computed the build list for the upgrade, we can run Algorithm R (above) to decide how to update the requirements list. 
In this case we'd end up replacing C 1.2 with C 1.3 but then also adding a new requirement on D 1.4, to avoid the accidental downgrade of D. Note that this selective upgrade only updates other modules to C's minimum requirements: the upgrade of C does not simply fetch the latest of each of C's dependencies. <a class=anchor href="#algorithm_4"><h3 id="algorithm_4">Algorithm 4. Downgrade One Module</h3></a> <p> We may also discover, perhaps after upgrading all modules, that the latest module version is buggy and must be avoided. In that situation, we need to be able to downgrade to an earlier version of the module. Downgrading one module may require downgrading other modules, but we want to downgrade as few other modules as possible. Like upgrades, downgrades must make their changes to the build list by modifying a target's requirements list. Unlike upgrades, downgrades must work by removing requirements, not adding them. This observation leads to a very simple downgrade algorithm that considers each of the target's requirements individually. If a requirement is incompatible with the proposed downgrade—that is, if the requirement's build list includes a now-disallowed module version—then try successively older versions until finding one that is compatible with the downgrade. <p> For example, starting with the original build graph, suppose we discover that there is a problem with D 1.4, actually introduced in D 1.3, and so we decide to downgrade to D 1.2. Our target module A depends on B 1.2 and C 1.2. To downgrade from D 1.4 to D 1.2, we must find earlier versions of B and C that do not require (directly or indirectly) versions of D later than D 1.2. <p> Although we can consider each requirement separately, it is more efficient to consider the module requirement graph as a whole. In our example, the downgrade rule amounts to crossing out the unavailable versions of D and then following arrows backwards from unavailable modules to find and cross out other unavailable modules. At the end, the latest versions of A's requirements that remain can be recorded as the new requirements. <p> <img name="version-select-5" class="center pad" width=463 height=272 src="version-select-5.png" srcset="version-select-5.png 1x, version-select-5@1.5x.png 1.5x, version-select-5@2x.png 2x, version-select-5@3x.png 3x, version-select-5@4x.png 4x"> <p> In this case, downgrading to D 1.2 implies downgrading to B 1.1 and C 1.1. To avoid an unnecessary downgrade to E 1.1, we must also add a new requirement on E 1.2. We can apply Algorithm R to find the minimal set of new requirements to write to <code>go.mod</code>. <p> Note that if we'd first upgraded to C 1.3, then the downgrade to D 1.2 would have continued to use C 1.3, which doesn't use any version of D at all. But downgrades are constrained to only downgrade packages, not also upgrade them; if an upgrade before downgrade is needed, the user must ask for it explicitly. <a class=anchor href="#theory"><h2 id="theory">Theory</h2></a> <p> Minimal version selection is <i>very</i> simple. It achieves simplicity by eliminating all flexibility about what the answer must be: the build list is exactly the versions specified in the requirements. A real system needs more flexibility, for example the ability to exclude certain module versions or replace others. Before we add those, it is worth examining the theoretical basis for the current system's simplicity, so we understand which kinds of extensions preserve that simplicity and which do not. 
<p> If you are familiar with the way most other systems approach version selection, or if you remember my <a href="version-sat">Version SAT</a> post from a year ago, probably the most striking feature of minimal version selection is that it does not solve general Boolean satisfiability, or SAT. As I explained in my earlier post, it takes very little for a version search to fall into solving SAT; version searches in these systems are inherently intricate, complex problems for which we know no general efficient solutions. If we want to avoid this fate, we need to know where the boundaries are, where not to step as we explore the design space. Conveniently, <a href="https://en.wikipedia.org/wiki/Schaefer%27s_dichotomy_theorem">Schaefer's Dichotomy Theorem</a> describes those boundaries precisely. It identifies six restricted classes of Boolean formulas for which satisfiability can be decided in polynomial time and then proves that for any class of formulas beyond those, satisfiability is NP-complete. To avoid NP-completeness, we need to limit the version selection problem to stay within one of Schaefer's restricted classes. <p> It turns out that minimal version selection lies in the intersection of three of the six tractable SAT subproblems: 2-SAT, Horn-SAT, and Dual-Horn-SAT. The formula corresponding to a build in minimal version selection is the AND of a set of clauses, each of which is either a single positive literal (this version must be installed, such as during an upgrade), a single negative literal (this version is not available, such as during a downgrade), or the OR of one negative and one positive literal (an implication: if this version is installed, this other version must also be installed). The formula is a 2-CNF formula, because each clause has at most two variables. The formula is also a Horn formula, because each clause has at most one positive literal. The formula is also a dual-Horn formula, because each clause has at most one negative literal. That is, every satisfiability problem posed by minimal version selection can be solved by your choice of three different efficient algorithms. It is even simpler and more efficient to specialize further, as we did above, taking advantage of the very limited structure of these problems. <p> Although 2-SAT is the most well-known example of a SAT subproblem with an efficient solution, the fact that these problems are both Horn and dual-Horn formulas is more interesting. Every Horn formula has a unique satisfying assignment with the fewest variables set to true. This proves that there is a unique minimal answer for constructing a build list, as well as for each upgrade. The unique minimal upgrade does not use a newer version of a given module unless absolutely necessary. Conversely, every dual-Horn formula also has a unique satisfying assignment with the fewest variables set to <i>false</i>. This proves that there is a unique minimal answer for each downgrade. The unique minimal downgrade does not use an older version of a given module unless absolutely necessary. If we want to extend minimal version selection, for example with the ability to exclude certain modules, we can only keep the uniqueness and minimality properties by continuing to use constraints expressible as both Horn and dual-Horn formulas.
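<p> To make the clause forms concrete, here is one clause of each kind, written in terms of the running example (this listing is mine, for illustration of the encoding only):<blockquote><p> A &nbsp;(the target itself must be installed)<br> ¬ D 1.3 &nbsp;(D 1.3 must not be used, say after a downgrade)<br> ¬ B 1.2 ∨ D 1.3 &nbsp;(if B 1.2 is installed, then D 1.3 must be installed)</blockquote> <p> Each clause has at most two literals, with at most one positive and at most one negative, which is exactly what keeps the build formula 2-CNF, Horn, and dual-Horn at the same time.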
<p> (Digression: The problem minimal version selection solves is <a href="https://en.wikipedia.org/wiki/NL-complete">NL-complete</a>: it's in NL because it's a subset of 2-SAT, and it's NL-hard because st-connectivity can be trivially transformed into a minimal version selection build list construction problem. It's delightful that we've replaced an NP-complete problem with an NL-complete problem, but there's little practical value to knowing that: being in NL only guarantees a polynomial-time solution, and we already have a linear-time one.) <a class=anchor href="#excluding_modules"><h2 id="excluding_modules">Excluding Modules</h2></a> <p> Minimal version selection always selects the minimal (oldest) module version that satisfies the overall requirements of a build. If that version is buggy in some way, an upgrade or downgrade operation can modify the top-level target's requirements list to force selection of a different version. <p> It can also be useful to record explicitly that the version is buggy, to avoid reintroducing it in any future upgrade or downgrade operations. But we want to do that in a way that keeps the uniqueness and minimality properties of the previous section, so we must use constraints that are both Horn and dual-Horn formulas. That means build constraints can only be unconditional positive assertions (X: X must be installed), unconditional negative assertions (¬ Y: Y must not be installed), and positive implications (X → Z, equivalently ¬ X ∨ Z: if X is installed, then Z must be installed). Negative implications (X → ¬ Y, equivalently ¬ X ∨ ¬ Y: if X is installed, then Y must <i>not</i> be installed) cannot be added as constraints without breaking the form. Module exclusions must therefore be unconditional: they must be decided independent of selections made during build list construction. <p> What we <i>can</i> do is allow a module to declare its own <i>local</i> list of excluded module versions. By local, I mean that the list is consulted only for builds within that module; a larger build using the module only as a dependency would ignore the exclusion list. In our example, if A's build consulted D 1.3's list, then the exact set of exclusions would depend on whether the build selected, say, D 1.3 or D 1.4, making the exclusions conditional and leading to an NP-complete search problem. Only the top-level module is guaranteed to be in the build, so only the top-level module's exclusion list is used. Note that it would be fine to consult exclusion lists from other sources, such as a global exclusion list loaded over the network, as long as the decision to use the list is made before the build begins and the list content does not depend on which modules are selected during the build. <p> Despite all the focus on making exclusions unconditional, it might seem like we already have conditional exclusions: C 1.2 requires D 1.4 and so implicitly excludes D 1.3. But our algorithms do not treat this as an exclusion. When Algorithm 1 runs, it adds both D 1.3 (for B) and D 1.4 (for C) to the rough build list, along with their minimum requirements. The final simplification pass removes D 1.3 only because D 1.4 is present. The difference here between declaring an incompatibility and declaring a minimum requirement is critical. Declaring that C 1.2 must not be built with D 1.3 only describes how to fail. Declaring that C 1.2 must be built with D 1.4 instead describes how to succeed. <p> Exclusions then must be unconditional. 
Knowing that fact is important, but it does not tell us exactly how to implement exclusions. A simple answer is to add exclusions as the build constraints, with clauses like “D 1.3 must not be installed.” Unfortunately, adding that clause alone would make modules that require D 1.3, like B 1.2, uninstallable. We need to express somehow that B 1.2 can choose D 1.4. The simple way to do that is to revise the build constraint, changing “B 1.2 → D 1.3” to “B 1.2 → D 1.3 ∨ D 1.4” and in general allowing all future versions of D. But that clause (equivalently, ¬ B 1.2 ∨ D 1.3 ∨ D 1.4) has two positive literals, making the overall build formula not a Horn formula anymore. It is still a dual-Horn formula, so we can still define a linear-time build list construction, but that construction—and therefore the question of how to perform an upgrade—would no longer be guaranteed to have a unique, minimal answer. <p> Instead of implementing exclusions as new build constraints, we can implement them by changing existing ones. That is, we can modify the requirements graph, just as we did for upgrades and downgrades. If a specific module is excluded, then we can remove it from the module requirement graph but also change any existing requirements on that module to require the next newer version instead. For example, if we excluded D 1.3, then we'd also update B 1.2 to require D 1.4: <p> <img name="version-select-6" class="center pad" width=463 height=272 src="version-select-6.png" srcset="version-select-6.png 1x, version-select-6@1.5x.png 1.5x, version-select-6@2x.png 2x, version-select-6@3x.png 3x, version-select-6@4x.png 4x"> <p> If the latest version of a module is removed, then any modules requiring that version also need to be removed, as in the downgrade algorithm. For example, if G 1.1 were removed, then C 1.3 would need to be removed as well. <p> Once the exclusions have been applied to the module requirement graph, the algorithms proceed as before. <a class=anchor href="#replacing_modules"><h2 id="replacing_modules">Replacing Modules</h2></a> <p> During development of A, suppose we find a bug in D 1.4, and we want to test a potential fix. We need some way to replace D 1.4 in our build with an unreleased copy U. We can allow a module to declare this as a replacement: “proceed as if D 1.4's source code and requirements have been replaced by U's.” <p> Like exclusions, replacements can be implemented by modifying the module requirement graph in a preprocessing step, not by adding complexity to the algorithms that process the graph. Also like exclusions, the replacement list is local to one module. The build of A consults the replacement list from A but not from B 1.2, C 1.2, or any of the other modules in the build. This avoids making replacements conditional, which would be difficult to implement, and it also avoids the possibility of conflicting replacements: what if B 1.2 and C 1.2 specify different replacements for E 1.2? More generally, keeping exclusions and replacements local to one module limits the control that module exerts on other builds. <a class=anchor href="#who_controls_your_build"><h2 id="who_controls_your_build">Who Controls Your Build?</h2></a> <p> The dependencies of a top-level module must be given some control over the top-level build. B 1.2 needs to be able to make sure it is built with D 1.3 or later, not with D 1.2. Otherwise we end up with the current <code>go</code> <code>get</code>'s stale dependency failure mode. 
<p> At the same time, for builds to remain predictable and understandable, we cannot give dependencies arbitrary, fine-grained control over the top-level build. That leads to conflicts and surprises. For example, suppose B declares that it requires an even version of D, while C declares that it requires a prime version of D. D is frequently updated and is up to D 1.99. Using B or C in isolation, it's always possible to use a relatively recent version of D (D 1.98 or D 1.97, respectively). But when A uses both B and C, the build silently selects the much older (and buggier) D 1.2 instead. That's an extreme example, but it raises the question: why should the authors of B and C be given such extreme control over A's build? As I write this post, there is an <a href="https://github.com/kubernetes/client-go/issues/325">open bug report</a> that the Kubernetes Go client declares a requirement on a specific, two-year-old version of <code>gopkg.in/yaml.v2</code>. When a developer tried to use a new feature of that YAML library in a program that already used the Kubernetes Go client, even after attempting to upgrade to the latest possible version, code using the new feature failed to compile, because “latest” had been constrained by the Kubernetes requirement. In this case, the use of a two-year-old YAML library version may be entirely reasonable within the context of the Kubernetes code base, and clearly the Kubernetes authors should have complete control over their own builds, but that level of control does not make sense to extend to other developers' builds. <p> In the design of module requirements, exclusions, and replacements, I've tried to balance the competing concerns of allowing dependencies enough control to ensure a successful build without allowing them so much control that they harm the build. Minimum requirements combine without conflict, so it is feasible (even easy) to gather them from all dependencies. But exclusions and replacements can and do conflict, so we allow them to be specified only by the top-level module. <p> A module author is therefore in complete control of that module's build when it is the main program being built, but not in complete control of other users' builds that depend on the module. I believe this distinction will make minimal version selection scale to much larger, more distributed code bases than existing systems. <a class=anchor href="#high-fidelity_builds"><h2 id="high-fidelity_builds">High-Fidelity Builds</h2></a> <p> Let's return now to the question of high-fidelity builds. <p> At the start of the post we saw that, using <code>go</code> <code>get</code> to build A, it was possible to use dependencies different than the ones A's author had used, without a good reason. I called this a low-fidelity build, because it is a poor reproduction of the original build of A. Using minimal version selection, builds are instead high-fidelity. The module requirements, which are included with the module's source code, uniquely determine how to build it directly. The user's build of A will match the author's build exactly: a reproducible build. But high-fidelity means more. <p> Having a reproducible build is generally understood to be a binary property, for a whole-program build: a user's build is exactly the same as the author's, or it isn't. What about when building a library module as part of a larger program? It would be helpful for a user's build of a library to match the author's whenever possible.
Then the user runs the same code (including dependencies) that the author developed and tested with. In a larger project, of course, it may be impossible for a user's build of a library to match the author's build exactly. Another part of that build may force the use of a newer dependency, making the user's build of the library deviate from the author's build. Let's refer to a build as high-fidelity when it deviates from the author's own build only to satisfy a requirement elsewhere in the build. <p> Consider again our original example: <p> <img name="version-select-1" class="center pad" width=463 height=272 src="version-select-1.png" srcset="version-select-1.png 1x, version-select-1@1.5x.png 1.5x, version-select-1@2x.png 2x, version-select-1@3x.png 3x, version-select-1@4x.png 4x"> <p> In this example, the build of A combines B 1.2 and D 1.4, even though B's author was using D 1.3. That change is necessary because A also uses C 1.2, which requires D 1.4. The build of A is still a high-fidelity build of B 1.2: it deviates by using D 1.4, but only because it must. In contrast, if the build used E 1.3, as <code>go</code> <code>get</code> <code>-u</code>, Dep, and Cargo typically do, that build would be low-fidelity: it deviates unnecessarily. <p> Minimal version selection provides high-fidelity builds by using the oldest version available that meets the requirements. The release of a new version has no effect on the build. In contrast, most other systems, including Cargo and Dep, use the <a href="cargo-newest.html">newest version available</a> that meets requirements listed in a “manifest file.” The release of a new version changes their build decisions. To get reproducible builds, these systems add a second mechanism, the “lock file,” which lists the specific versions a build should use. The lock file ensures reproducible builds for whole programs, but it is ignored for library modules; the <a href="http://doc.crates.io/faq.html#why-do-binaries-have-cargolock-in-version-control-but-not-libraries">Cargo FAQ explains</a> that this is “precisely because a library should <b>not</b> be deterministically recompiled for all users of the library.” It's true that a perfect reproduction is not always possible, but by giving up entirely, the Cargo approach admits unnecessary deviation from the library author's builds. That is, it delivers low-fidelity builds. In our example, when A first adds B 1.2 or C 1.2 to its build, Cargo will see that they require E 1.2 or later and will choose E 1.3. Until directed otherwise, however, it seems better to continue to build with E 1.2, as the authors of B and C did. Using the oldest allowed version also eliminates the redundancy of having two different files (manifest and lock) that both specify which module versions to use. <p> Automatically using newer versions also makes it easy for minimum requirements to be wrong. Suppose we start working on A using B 1.1, the latest version at the time, and we record that A requires only B 1.1. But then B 1.2 comes out and we start using it in our own builds and lock file, without updating the manifest. At this point there is no longer any development or testing of A with B 1.1. We may start using features or depending on bug fixes from B 1.2, but now A incorrectly lists its minimum requirement as B 1.1. If users always also choose newer versions than the minimum requirement, then there is not much harm done: they'll use B 1.2 as well. But when the system does try to use the declared minimum, it will break.
For example, when a user attempts a limited update of A, the system cannot see that updating to B 1.2 is also required. More generally, whenever the minimum versions (in the manifest) and the built versions (in the lock file) differ, why should we believe that building with the minimum versions will produce a working library? To try to detect this problem, <a href="https://github.com/rust-lang/cargo/issues/4100">Cargo developers have proposed</a> that <code>cargo</code> <code>publish</code> try a build with the minimum versions of all dependencies before publishing. That will detect when A starts using a new feature in B 1.2—building with B 1.1 will fail—but it will not detect when A starts depending on a new bug fix. <p> The fundamental problem is that preferring the newest allowed version of a module during version selection produces a low-fidelity build. Lock files are a partial solution, targeting whole-program builds; additional build checks like in <code>cargo</code> <code>publish</code> are also a partial solution. A more complete solution is to use the version of the module the author did. That makes a user's build as close as possible to the author's build: a high-fidelity build. <a class=anchor href="#upgrade_speed"><h2 id="upgrade_speed">Upgrade Speed</h2></a> <p> Given that minimal version selection takes the minimum allowed version of each dependency, it's easy to think that this would lead to use of very old copies of packages, which in turn might lead to unnecessary bugs or security problems. In practice, however, I think the opposite will happen, because the minimum allowed version is the <i>maximum</i> of all the constraints, so the one lever of control made available to all modules in a build is the ability to force the use of a newer version of a dependency than would otherwise be used. I expect that users of minimal version selection will end up with programs that are almost as up-to-date as their friends using more aggressive systems like Cargo. <p> For example, suppose you are writing a program that depends on a handful of other modules, all of which depend on some very common module, like <code>gopkg.in/yaml.v2</code>. Your program's build will use the <i>newest</i> YAML version among the ones requested by your module and that handful of dependencies. Even just one conscientious dependency can force your build to update many other dependencies. This is the opposite of the Kubernetes Go client problem I mentioned earlier. <p> If anything, minimal version selection would instead suffer the opposite problem, that this “max of the minimums” answer serves as a ratchet that forces dependencies forward too quickly. But I think in practice dependencies will move forward at just the right speed, which ends up being just the right amount slower than Cargo and friends. <a class=anchor href="#upgrade_timing"><h2 id="upgrade_timing">Upgrade Timing</h2></a> <p> A key feature of minimal version selection is that upgrades do not happen until a developer asks for them to happen. You don't get an untested version of a module unless you asked for that module to be upgraded. <p> For example, in Cargo, if package B depends on package C 2.9 and you add B to your build, you don't get C 2.9. You get the <a href="cargo-newest.html">latest allowed version</a> at that moment, maybe C 2.15. Maybe C 2.15 was released just a few minutes ago and the author hasn't yet been told about an important bug. That's too bad for you and your build.
On the other hand, in minimal version selection, module B's <code>go.mod</code> file will list the exact version of C that B's author developed and tested with. You'll get that version. Or maybe some other module in your program developed and tested with a newer version of C. Then you'll get that version. But you will never get a version of C that some module in the program did not explicitly request in its <code>go.mod</code> file. This should mean you only ever get a version of C that worked for someone else, not the very latest version that maybe hasn't worked for anyone. <p> To be clear, my purpose here is not to pick on Cargo, which I think is a very well-designed system. I'm using Cargo here as an example of a model that many developers are familiar with, to try to convey what would be different in minimal version selection. <a class=anchor href="#minimality"><h2 id="minimality">Minimality</h2></a> <p> I call this system minimal version selection because the system as a whole appears to be minimal: I don't see how to remove anything more without breaking it. Some people will undoubtedly say that too much has been removed already, but so far it seems perfectly able to handle the real-world cases I've examined. We'll find out more by experimenting with the <code>vgo</code> prototype. <p> The key to minimal version selection is its preference for the minimum allowed version of a module. When I compared <code>go</code> <code>get</code> <code>-u</code>'s “upgrade everything to latest” approach to Cargo's “manifest and lock” approach in the context of a system that can rely on the <a href="vgo-import">import compatibility rule</a>, I realized that both manifest and lock exist for the same purpose: to work around the “upgrade everything to latest” default behavior. The manifest describes which newer versions are unneeded, and the lock describes which newer versions are unwanted. Instead, why not change the default? Use the minimum version allowed, typically the exact version the author used, and leave timing of upgrades completely to user control. This approach leads to reproducible builds without lock files, and more generally to high-fidelity builds that deviate from the author's own build only when required. <p> More than anything else, I wanted to find a version selection algorithm that was understandable. Predictable. Boring. Where other systems instead seem to optimize for displays of raw flexibility and power, minimal version selection aims to be invisible. I hope it succeeds. Go and Dogma tag:research.swtch.com,2012:research.swtch.com/dogma 2017-01-09T09:00:00-05:00 2017-01-09T09:02:00-05:00 Programming language dogmatics. <p> [<i>Cross-posting from last year’s <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">Go contributors AMA</a> on Reddit, because it’s still important to remember.</i>] <p> One of the perks of working on Go these past years has been the chance to have many great discussions with other language designers and implementers, for example about how well various design decisions worked out or the common problems of implementing what look like very different languages (for example both Go and Haskell need some kind of “green threads”, so there are more shared runtime challenges than you might expect). In one such conversation, when I was talking to a group of early Lisp hackers, one of them pointed out that these discussions are basically never dogmatic. 
Designers and implementers remember working through the good arguments on both sides of a particular decision, and they’re often eager to hear about someone else’s experience with what happens when you make that decision differently. Contrast that kind of discussion with the heated arguments or overly zealous statements you sometimes see from users of the same languages. There’s a real disconnect, possibly because the users don’t have the experience of weighing the arguments on both sides and don’t realize how easily a particular decision might have gone the other way. <p> Language design and implementation is engineering. We make decisions using evaluations of costs and benefits or, if we must, using predictions of those based on past experience. I think we have an important responsibility to explain both sides of a particular decision, to make clear that the arguments for an alternate decision are actually good ones that we weighed and balanced, and to avoid the suggestion that particular design decisions approach dogma. I hope <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">the Reddit AMA</a> as well as discussion on <a href="https://groups.google.com/group/golang-nuts">golang-nuts</a> or <a href="http://stackoverflow.com/questions/tagged/go">StackOverflow</a> or the <a href="https://forum.golangbridge.org/">Go Forum</a> or at <a href="https://golang.org/wiki/Conferences">conferences</a> help with that. <p> But we need help from everyone. Remember that none of the decisions in Go are infallible; they’re just our best attempts at the time we made them, not wisdom received on stone tablets. If someone asks why Go does X instead of Y, please try to present the engineering reasons fairly, including for Y, and avoid argument solely by appeal to authority. It’s too easy to fall into the “well that’s just not how it’s done here” trap. And now that I know about and watch for that trap, I see it in nearly every technical community, although some more than others. A Tour of Acme tag:research.swtch.com,2012:research.swtch.com/acme 2012-09-17T11:00:00-04:00 2012-09-17T11:00:00-04:00 A video introduction to Acme, the Plan 9 text editor <p class="lp"> People I work with recognize my computer easily: it's the one with nothing but yellow windows and blue bars on the screen. That's the text editor acme, written by Rob Pike for Plan 9 in the early 1990s. Acme focuses entirely on the idea of text as user interface. It's difficult to explain acme without seeing it, though, so I've put together a screencast explaining the basics of acme and showing a brief programming session. Remember as you watch the video that the 854x480 screen is quite cramped. Usually you'd run acme on a larger screen: even my MacBook Air has almost four times as much screen real estate. </p> <center> <div style="border: 1px solid black; width: 853px; height: 480px;"><iframe width="853" height="480" src="https://www.youtube.com/embed/dP1xVpMPn8M?rel=0" frameborder="0" allowfullscreen></iframe></div> </center> <p class=pp> The video doesn't show everything acme can do, nor does it show all the ways you can use it. Even small idioms like where you type text to be loaded or executed vary from user to user. To learn more about acme, read Rob Pike's paper &ldquo;<a href="/acme.pdf">Acme: A User Interface for Programmers</a>&rdquo; and then try it. </p> <p class=pp> Acme runs on most operating systems. 
If you use <a href="http://plan9.bell-labs.com/plan9/">Plan 9 from Bell Labs</a>, you already have it. If you use FreeBSD, Linux, OS X, or most other Unix clones, you can get it as part of <a href="http://swtch.com/plan9port/">Plan 9 from User Space</a>. If you use Windows, I suggest trying acme as packaged in <a href="http://code.google.com/p/acme-sac/">acme stand alone complex</a>, which is based on the Inferno programming environment. </p> <p class=lp><b>Mini-FAQ</b>: <ul> <li><i>Q. Can I use scalable fonts?</i> A. On the Mac, yes. If you run <code>acme -f /mnt/font/Monaco/16a/font</code> you get 16-point anti-aliased Monaco as your font, served via <a href="http://swtch.com/plan9port/man/man4/fontsrv.html">fontsrv</a>. If you'd like to add X11 support to fontsrv, I'd be happy to apply the patch. <li><i>Q. Do I need X11 to build on the Mac?</i> A. No. The build will complain that it cannot build &lsquo;snarfer&rsquo; but it should complete otherwise. You probably don't need snarfer. </ul> <p class=pp> If you're interested in history, the predecessor to acme was called help. Rob Pike's paper &ldquo;<a href="/help.pdf">A Minimalist Global User Interface</a>&rdquo; describes it. See also &ldquo;<a href="/sam.pdf">The Text Editor sam</a>&rdquo; </p> <p class=pp> <i>Correction</i>: the smiley program in the video was written by Ken Thompson. I got it from Dennis Ritchie, the more meticulous archivist of the pair. </p> Minimal Boolean Formulas tag:research.swtch.com,2012:research.swtch.com/boolean 2011-05-18T00:00:00-04:00 2011-05-18T00:00:00-04:00 Simplify equations with God <p><style type="text/css"> p { line-height: 150%; } blockquote { text-align: left; } pre.alg { font-family: sans-serif; font-size: 100%; margin-left: 60px; } td, th { padding-left; 5px; padding-right: 5px; vertical-align: top; } #times td { text-align: right; } table { padding-top: 1em; padding-bottom: 1em; } #find td { text-align: center; } </style> <p class=lp> <a href="http://oeis.org/A056287">28</a>. That's the minimum number of AND or OR operators you need in order to write any Boolean function of five variables. <a href="http://alexhealy.net/">Alex Healy</a> and I computed that in April 2010. Until then, I believe no one had ever known that little fact. This post describes how we computed it and how we almost got scooped by <a href="http://research.swtch.com/2011/01/knuth-volume-4a.html">Knuth's Volume 4A</a> which considers the problem for AND, OR, and XOR. </p> <h3>A Naive Brute Force Approach</h3> <p class=pp> Any Boolean function of two variables can be written with at most 3 AND or OR operators: the parity function on two variables X XOR Y is (X AND Y') OR (X' AND Y), where X' denotes &ldquo;not X.&rdquo; We can shorten the notation by writing AND and OR like multiplication and addition: X XOR Y = X*Y' + X'*Y. </p> <p class=pp> For three variables, parity is also a hardest function, requiring 9 operators: X XOR Y XOR Z = (X*Z'+X'*Z+Y')*(X*Z+X'*Z'+Y). </p> <p class=pp> For four variables, parity is still a hardest function, requiring 15 operators: W XOR X XOR Y XOR Z = (X*Z'+X'*Z+W'*Y+W*Y')*(X*Z+X'*Z'+W*Y+W'*Y'). </p> <p class=pp> The sequence so far prompts a few questions. Is parity always a hardest function? Does the minimum number of operators alternate between 2<sup>n</sup>&#8722;1 and 2<sup>n</sup>+1? </p> <p class=pp> I computed these results in January 2001 after hearing the problem from Neil Sloane, who suggested it as a variant of a similar problem first studied by Claude Shannon. 
</p> <p class=pp> The program I wrote to compute a(4) computes the minimum number of operators for every Boolean function of n variables in order to find the largest minimum over all functions. There are 2<sup>4</sup> = 16 settings of four variables, and each function can pick its own value for each setting, so there are 2<sup>16</sup> different functions. To make matters worse, you build new functions by taking pairs of old functions and joining them with AND or OR. 2<sup>16</sup> different functions means 2<sup>16</sup>&#183;2<sup>16</sup> = 2<sup>32</sup> pairs of functions. </p> <p class=pp> The program I wrote was a mangling of the Floyd-Warshall all-pairs shortest paths algorithm. That algorithm is: </p> <pre class="indent alg"> // Floyd-Warshall all pairs shortest path func compute(): for each node i for each node j dist[i][j] = direct distance, or &#8734; for each node k for each node i for each node j d = dist[i][k] + dist[k][j] if d &lt; dist[i][j] dist[i][j] = d return </pre> <p class=lp> The algorithm begins with the distance table dist[i][j] set to an actual distance if i is connected to j and infinity otherwise. Then each round updates the table to account for paths going through the node k: if it's shorter to go from i to k to j, it saves that shorter distance in the table. The nodes are numbered from 0 to n, so the variables i, j, k are simply integers. Because there are only n nodes, we know we'll be done after the outer loop finishes. </p> <p class=pp> The program I wrote to find minimum Boolean formula sizes is an adaptation, substituting formula sizes for distance. </p> <pre class="indent alg"> // Algorithm 1 func compute() for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 loop changed = false for each function f for each function g d = size[f] + 1 + size[g] if d &lt; size[f OR g] size[f OR g] = d changed = true if d &lt; size[f AND g] size[f AND g] = d changed = true if not changed return </pre> <p class=lp> Algorithm 1 runs the same kind of iterative update loop as the Floyd-Warshall algorithm, but it isn't as obvious when you can stop, because you don't know the maximum formula size beforehand. So it runs until a round doesn't find any new functions to make, iterating until it finds a fixed point. </p> <p class=pp> The pseudocode above glosses over some details, such as the fact that the per-function loops can iterate over a queue of functions known to have finite size, so that each loop omits the functions that aren't yet known. That's only a constant factor improvement, but it's a useful one. </p> <p class=pp> Another important detail missing above is the representation of functions. The most convenient representation is a binary truth table. For example, if we are computing the complexity of two-variable functions, there are four possible inputs, which we can number as follows. </p> <center> <table> <tr><th>X <th>Y <th>Value <tr><td>false <td>false <td>00<sub>2</sub> = 0 <tr><td>false <td>true <td>01<sub>2</sub> = 1 <tr><td>true <td>false <td>10<sub>2</sub> = 2 <tr><td>true <td>true <td>11<sub>2</sub> = 3 </table> </center> <p class=pp> The functions are then the 4-bit numbers giving the value of the function for each input. For example, function 13 = 1101<sub>2</sub> is true for all inputs except X=false Y=true. Three-variable functions correspond to 3-bit inputs generating 8-bit truth tables, and so on. </p> <p class=pp> This representation has two key advantages. 
The first is that the numbering is dense, so that you can implement a map keyed by function using a simple array. The second is that the operations &ldquo;f AND g&rdquo; and &ldquo;f OR g&rdquo; can be implemented using bitwise operators: the truth table for &ldquo;f AND g&rdquo; is the bitwise AND of the truth tables for f and g. </p> <p class=pp> That program worked well enough in 2001 to compute the minimum number of operators necessary to write any 1-, 2-, 3-, and 4-variable Boolean function. Each round takes asymptotically O(2<sup>2<sup>n</sup></sup>&#183;2<sup>2<sup>n</sup></sup>) = O(2<sup>2<sup>n+1</sup></sup>) time, and the number of rounds needed is O(the final answer). The answer for n=4 is 15, so the computation required on the order of 15&#183;2<sup>2<sup>5</sup></sup> = 15&#183;2<sup>32</sup> iterations of the innermost loop. That was plausible on the computer I was using at the time, but the answer for n=5, likely around 30, would need 30&#183;2<sup>64</sup> iterations to compute, which seemed well out of reach. At the time, it seemed plausible that parity was always a hardest function and that the minimum size would continue to alternate between 2<sup>n</sup>&#8722;1 and 2<sup>n</sup>+1. It's a nice pattern. </p> <h3>Exploiting Symmetry</h3> <p class=pp> Five years later, though, Alex Healy and I got to talking about this sequence, and Alex shot down both conjectures using results from the theory of circuit complexity. (Theorists!) Neil Sloane added this note to the <a href="http://oeis.org/history?seq=A056287">entry for the sequence</a> in his Online Encyclopedia of Integer Sequences: </p> <blockquote> <tt> %E A056287 Russ Cox conjectures that X<sub>1</sub> XOR ... XOR X<sub>n</sub> is always a worst f and that a(5) = 33 and a(6) = 63. But (Jan 27 2006) Alex Healy points out that this conjecture is definitely false for large n. So what is a(5)? </tt> </blockquote> <p class=lp> Indeed. What is a(5)? No one knew, and it wasn't obvious how to find out. </p> <p class=pp> In January 2010, Alex and I started looking into ways to speed up the computation for a(5). 30&#183;2<sup>64</sup> is too many iterations but maybe we could find ways to cut that number. </p> <p class=pp> In general, if we can identify a class of functions f whose members are guaranteed to have the same complexity, then we can save just one representative of the class as long as we recreate the entire class in the loop body. What used to be: </p> <pre class="indent alg"> for each function f for each function g visit f AND g visit f OR g </pre> <p class=lp> can be rewritten as </p> <pre class="indent alg"> for each canonical function f for each canonical function g for each ff equivalent to f for each gg equivalent to g visit ff AND gg visit ff OR gg </pre> <p class=lp> That doesn't look like an improvement: it's doing all the same work. But it can open the door to new optimizations depending on the equivalences chosen. For example, the functions &ldquo;f&rdquo; and &ldquo;&#172;f&rdquo; are guaranteed to have the same complexity, by <a href="http://en.wikipedia.org/wiki/De_Morgan's_laws">DeMorgan's laws</a>. 
If we keep just one of those two on the lists that &ldquo;for each function&rdquo; iterates over, we can unroll the inner two loops, producing: </p> <pre class="indent alg"> for each canonical function f for each canonical function g visit f OR g visit f AND g visit &#172;f OR g visit &#172;f AND g visit f OR &#172;g visit f AND &#172;g visit &#172;f OR &#172;g visit &#172;f AND &#172;g </pre> <p class=lp> That's still not an improvement, but it's no worse. Each of the two loops considers half as many functions but the inner iteration is four times longer. Now we can notice that half of tests aren't worth doing: &ldquo;f AND g&rdquo; is the negation of &ldquo;&#172;f OR &#172;g,&rdquo; and so on, so only half of them are necessary. </p> <p class=pp> Let's suppose that when choosing between &ldquo;f&rdquo; and &ldquo;&#172;f&rdquo; we keep the one that is false when presented with all true inputs. (This has the nice property that <code>f ^ (int32(f) &gt;&gt; 31)</code> is the truth table for the canonical form of <code>f</code>.) Then we can tell which combinations above will produce canonical functions when f and g are already canonical: </p> <pre class="indent alg"> for each canonical function f for each canonical function g visit f OR g visit f AND g visit &#172;f AND g visit f AND &#172;g </pre> <p class=lp> That's a factor of two improvement over the original loop. </p> <p class=pp> Another observation is that permuting the inputs to a function doesn't change its complexity: &ldquo;f(V, W, X, Y, Z)&rdquo; and &ldquo;f(Z, Y, X, W, V)&rdquo; will have the same minimum size. For complex functions, each of the 5! = 120 permutations will produce a different truth table. A factor of 120 reduction in storage is good but again we have the problem of expanding the class in the iteration. This time, there's a different trick for reducing the work in the innermost iteration. Since we only need to produce one member of the equivalence class, it doesn't make sense to permute the inputs to both f and g. Instead, permuting just the inputs to f while fixing g is guaranteed to hit at least one member of each class that permuting both f and g would. So we gain the factor of 120 twice in the loops and lose it once in the iteration, for a net savings of 120. (In some ways, this is the same trick we did with &ldquo;f&rdquo; vs &ldquo;&#172;f.&rdquo;) </p> <p class=pp> A final observation is that negating any of the inputs to the function doesn't change its complexity, because X and X' have the same complexity. The same argument we used for permutations applies here, for another constant factor of 2<sup>5</sup> = 32. </p> <p class=pp> The code stores a single function for each equivalence class and then recomputes the equivalent functions for f, but not g. </p> <pre class="indent alg"> for each canonical function f for each function ff equivalent to f for each canonical function g visit ff OR g visit ff AND g visit &#172;ff AND g visit ff AND &#172;g </pre> <p class=lp> In all, we just got a savings of 2&#183;120&#183;32 = 7680, cutting the total number of iterations from 30&#183;2<sup>64</sup> = 5&#215;10<sup>20</sup> to 7&#215;10<sup>16</sup>. If you figure we can do around 10<sup>9</sup> iterations per second, that's still 800 days of CPU time. 
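</p> <p class=pp> To make these bit tricks concrete, here is a small Go sketch of the 5-variable truth-table representation and the complement canonicalization described above. The names are illustrative rather than taken from the actual program, and it assumes, as in the postscript at the end of this post, that the first input corresponds to the low bit of the truth-table index: </p> <pre class=indent>
package main

import "fmt"

// A Func is the 32-bit truth table of a 5-variable Boolean function:
// bit i holds the function's value on the input whose five bits are i.
type Func uint32

// canon returns whichever of f and its complement ^f is false on the
// all-true input (bit 31), using the f ^ (int32(f) &gt;&gt; 31) trick.
func canon(f Func) Func {
    return f ^ Func(int32(f)&gt;&gt;31)
}

func main() {
    // Truth tables of the five single-variable functions (size 0),
    // with the first input in the low bit of the index.
    v := [5]Func{0xaaaaaaaa, 0xcccccccc, 0xf0f0f0f0, 0xff00ff00, 0xffff0000}

    // The four combinations visited in the inner loop; AND and OR are
    // single bitwise operations on the truth tables.
    f, g := v[0], v[1]
    for _, fg := range []Func{f | g, f &amp; g, ^f &amp; g, f &amp; ^g} {
        fmt.Printf("%08x -&gt; canonical %08x\n", fg, canon(fg))
    }
}
</pre> <p class=lp> Permutations and inversions of the inputs can likewise be applied directly to the 32-bit tables with shifts and masks, as the postscript at the end of this post shows.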
</p> <p class=pp> The full algorithm at this point is: </p> <pre class="indent alg"> // Algorithm 2 func compute(): for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 loop changed = false for each canonical function f for each function ff equivalent to f for each canonical function g d = size[ff] + 1 + size[g] changed |= visit(d, ff OR g) changed |= visit(d, ff AND g) changed |= visit(d, ff AND &#172;g) changed |= visit(d, &#172;ff AND g) if not changed return func visit(d, fg): if size[fg] != &#8734; return false record fg as canonical for each function ffgg equivalent to fg size[ffgg] = d return true </pre> <p class=lp> The helper function &ldquo;visit&rdquo; must set the size not only of its argument fg but also all equivalent functions under permutation or inversion of the inputs, so that future tests will see that they have been computed. </p> <h3>Methodical Exploration</h3> <p class=pp> There's one final improvement we can make. The approach of looping until things stop changing considers each function pair multiple times as their sizes go down. Instead, we can consider functions in order of complexity, so that the main loop builds first all the functions of minimum complexity 1, then all the functions of minimum complexity 2, and so on. If we do that, we'll consider each function pair at most once. We can stop when all functions are accounted for. </p> <p class=pp> Applying this idea to Algorithm 1 (before canonicalization) yields: </p> <pre class="indent alg"> // Algorithm 3 func compute() for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 for k = 1 to &#8734; for each function f for each function g of size k &#8722; size(f) &#8722; 1 if size[f AND g] == &#8734; size[f AND g] = k nsize++ if size[f OR g] == &#8734; size[f OR g] = k nsize++ if nsize == 2<sup>2<sup>n</sup></sup> return </pre> <p class=lp> Applying the idea to Algorithm 2 (after canonicalization) yields: </p> <pre class="indent alg"> // Algorithm 4 func compute(): for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 for k = 1 to &#8734; for each canonical function f for each function ff equivalent to f for each canonical function g of size k &#8722; size(f) &#8722; 1 visit(k, ff OR g) visit(k, ff AND g) visit(k, ff AND &#172;g) visit(k, &#172;ff AND g) if nvisited == 2<sup>2<sup>n</sup></sup> return func visit(d, fg): if size[fg] != &#8734; return record fg as canonical for each function ffgg equivalent to fg if size[ffgg] == &#8734; size[ffgg] = d nvisited += 2 // counts ffgg and &#172;ffgg return </pre> <p class=lp> The original loop in Algorithms 1 and 2 considered each pair f, g in every iteration of the loop after they were computed. The new loop in Algorithms 3 and 4 considers each pair f, g only once, when k = size(f) + size(g) + 1. This removes the leading factor of 30 (the number of times we expected the first loop to run) from our estimation of the run time. Now the expected number of iterations is around 2<sup>64</sup>/7680 = 2.4&#215;10<sup>15</sup>. If we can do 10<sup>9</sup> iterations per second, that's only 28 days of CPU time, which I can deliver if you can wait a month. </p> <p class=pp> Our estimate does not include the fact that not all function pairs need to be considered. For example, if the maximum size is 30, then the functions of size 14 need never be paired against the functions of size 16, because any result would have size 14+1+16 = 31.
So even 2.4&#215;10<sup>15</sup> is an overestimate, but it's in the right ballpark. (With hindsight I can report that only 1.7&#215;10<sup>14</sup> pairs need to be considered, but also that our estimate of 10<sup>9</sup> iterations per second was optimistic. The actual calculation ran for 20 days, an average of about 10<sup>8</sup> iterations per second.) </p> <h3>Endgame: Directed Search</h3> <p class=pp> A month is still a long time to wait, and we can do better. Near the end (after k is bigger than, say, 22), we are exploring the fairly large space of function pairs in hopes of finding a fairly small number of remaining functions. At that point it makes sense to change from the bottom-up &ldquo;bang things together and see what we make&rdquo; to the top-down &ldquo;try to make this one of these specific functions.&rdquo; That is, the core of the current search is: </p> <pre class="indent alg"> for each canonical function f for each function ff equivalent to f for each canonical function g of size k &#8722; size(f) &#8722; 1 visit(k, ff OR g) visit(k, ff AND g) visit(k, ff AND &#172;g) visit(k, &#172;ff AND g) </pre> <p class=lp> We can change it to: </p> <pre class="indent alg"> for each missing function fg for each canonical function g for all possible f such that one of these holds * fg = f OR g * fg = f AND g * fg = &#172;f AND g * fg = f AND &#172;g if size[f] == k &#8722; size(g) &#8722; 1 visit(k, fg) next fg </pre> <p class=lp> By the time we're at the end, exploring all the possible f to make the missing functions&#8212;a directed search&#8212;is much less work than the brute force of exploring all combinations. </p> <p class=pp> As an example, suppose we are looking for f such that fg = f OR g. The equation is only possible to satisfy if fg OR g == fg. That is, if g has any extraneous 1 bits, no f will work, so we can move on. Otherwise, the remaining condition is that f AND &#172;g == fg AND &#172;g. That is, for the bit positions where g is 0, f must match fg. The other bits of f (the bits where g has 1s) can take any value. We can enumerate the possible f values by recursively trying all possible values for the &ldquo;don't care&rdquo; bits. </p> <pre class="indent alg"> func find(x, any, xsize): if size(x) == xsize return x while any != 0 bit = any AND &#8722;any // rightmost 1 bit in any any = any AND &#172;bit if f = find(x OR bit, any, xsize) succeeds return f return failure </pre> <p class=lp> It doesn't matter which 1 bit we choose for the recursion, but finding the rightmost 1 bit is cheap: it is isolated by the (admittedly surprising) expression &ldquo;any AND &#8722;any.&rdquo; </p> <p class=pp> Given <code>find</code>, the loop above can try these four cases: </p> <center> <table id=find> <tr><th>Formula <th>Condition <th>Base x <th>&ldquo;Any&rdquo; bits <tr><td>fg = f OR g <td>fg OR g == fg <td>fg AND &#172;g <td>g <tr><td>fg = f OR &#172;g <td>fg OR &#172;g == fg <td>fg AND g <td>&#172;g <tr><td>&#172;fg = f OR g <td>&#172;fg OR g == &#172;fg <td>&#172;fg AND &#172;g <td>g <tr><td>&#172;fg = f OR &#172;g <td>&#172;fg OR &#172;g == &#172;fg <td>&#172;fg AND g <td>&#172;g </table> </center> <p class=lp> Rewriting the Boolean expressions to use only the four OR forms means that we only need to write the &ldquo;adding bits&rdquo; version of find. </p> <p class=pp> The final algorithm is: </p> <pre class="indent alg"> // Algorithm 5 func compute(): for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 // Generate functions.
for k = 1 to max_generate for each canonical function f for each function ff equivalent to f for each canonical function g of size k &#8722; size(f) &#8722; 1 visit(k, ff OR g) visit(k, ff AND g) visit(k, ff AND &#172;g) visit(k, &#172;ff AND g) // Search for functions. for k = max_generate+1 to &#8734; for each missing function fg for each canonical function g fsize = k &#8722; size(g) &#8722; 1 if fg OR g == fg if f = find(fg AND &#172;g, g, fsize) succeeds visit(k, fg) next fg if fg OR &#172;g == fg if f = find(fg AND g, &#172;g, fsize) succeeds visit(k, fg) next fg if &#172;fg OR g == &#172;fg if f = find(&#172;fg AND &#172;g, g, fsize) succeeds visit(k, fg) next fg if &#172;fg OR &#172;g == &#172;fg if f = find(&#172;fg AND g, &#172;g, fsize) succeeds visit(k, fg) next fg if nvisited == 2<sup>2<sup>n</sup></sup> return func visit(d, fg): if size[fg] != &#8734; return record fg as canonical for each function ffgg equivalent to fg if size[ffgg] != &#8734; size[ffgg] = d nvisited += 2 // counts ffgg and &#172;ffgg return func find(x, any, xsize): if size(x) == xsize return x while any != 0 bit = any AND &#8722;any // rightmost 1 bit in any any = any AND &#172;bit if f = find(x OR bit, any, xsize) succeeds return f return failure </pre> <p class=lp> To get a sense of the speedup here, and to check my work, I ran the program using both algorithms on a 2.53 GHz Intel Core 2 Duo E7200. </p> <center> <table id=times> <tr><th> <th colspan=3>&#8212;&#8212;&#8212;&#8212;&#8212; # of Functions &#8212;&#8212;&#8212;&#8212;&#8212;<th colspan=2>&#8212;&#8212;&#8212;&#8212; Time &#8212;&#8212;&#8212;&#8212; <tr><th>Size <th>Canonical <th>All <th>All, Cumulative <th>Generate <th>Search <tr><td>0 <td>1 <td>10 <td>10 <tr><td>1 <td>2 <td>82 <td>92 <td>&lt; 0.1 seconds <td>3.4 minutes <tr><td>2 <td>2 <td>640 <td>732 <td>&lt; 0.1 seconds <td>7.2 minutes <tr><td>3 <td>7 <td>4420 <td>5152 <td>&lt; 0.1 seconds <td>12.3 minutes <tr><td>4 <td>19 <td>25276 <td>29696 <td>&lt; 0.1 seconds <td>30.1 minutes <tr><td>5 <td>44 <td>117440 <td>147136 <td>&lt; 0.1 seconds <td>1.3 hours <tr><td>6 <td>142 <td>515040 <td>662176 <td>&lt; 0.1 seconds <td>3.5 hours <tr><td>7 <td>436 <td>1999608 <td>2661784 <td>0.2 seconds <td>11.6 hours <tr><td>8 <td>1209 <td>6598400 <td>9260184 <td>0.6 seconds <td>1.7 days <tr><td>9 <td>3307 <td>19577332 <td>28837516 <td>1.7 seconds <td>4.9 days <tr><td>10 <td>7741 <td>50822560 <td>79660076 <td>4.6 seconds <td>[ 10 days ? ] <tr><td>11 <td>17257 <td>114619264 <td>194279340 <td>10.8 seconds <td>[ 20 days ? ] <tr><td>12 <td>31851 <td>221301008 <td>415580348 <td>21.7 seconds <td>[ 50 days ? ] <tr><td>13 <td>53901 <td>374704776 <td>790285124 <td>38.5 seconds <td>[ 80 days ? ] <tr><td>14 <td>75248 <td>533594528 <td>1323879652 <td>58.7 seconds <td>[ 100 days ? ] <tr><td>15 <td>94572 <td>667653642 <td>1991533294 <td>1.5 minutes <td>[ 120 days ? ] <tr><td>16 <td>98237 <td>697228760 <td>2688762054 <td>2.1 minutes <td>[ 120 days ? ] <tr><td>17 <td>89342 <td>628589440 <td>3317351494 <td>4.1 minutes <td>[ 90 days ? ] <tr><td>18 <td>66951 <td>468552896 <td>3785904390 <td>9.1 minutes <td>[ 50 days ? ] <tr><td>19 <td>41664 <td>287647616 <td>4073552006 <td>23.4 minutes <td>[ 30 days ? ] <tr><td>20 <td>21481 <td>144079832 <td>4217631838 <td>57.0 minutes <td>[ 10 days ? 
] <tr><td>21 <td>8680 <td>55538224 <td>4273170062 <td>2.4 hours <td>2.5 days <tr><td>22 <td>2730 <td>16099568 <td>4289269630 <td>5.2 hours <td>11.7 hours <tr><td>23 <td>937 <td>4428800 <td>4293698430 <td>11.2 hours <td>2.2 hours <tr><td>24 <td>228 <td>959328 <td>4294657758 <td>22.0 hours <td>33.2 minutes <tr><td>25 <td>103 <td>283200 <td>4294940958 <td>1.7 days <td>4.0 minutes <tr><td>26 <td>21 <td>22224 <td>4294963182 <td>2.9 days <td>42 seconds <tr><td>27 <td>10 <td>3602 <td>4294966784 <td>4.7 days <td>2.4 seconds <tr><td>28 <td>3 <td>512 <td>4294967296 <td>[ 7 days ? ] <td>0.1 seconds </table> </center> <p class=pp> The bracketed times are estimates based on the work involved: I did not wait that long for the intermediate search steps. The search algorithm is quite a bit worse than generate until there are very few functions left to find. However, it comes in handy just when it is most useful: when the generate algorithm has slowed to a crawl. If we run generate through formulas of size 22 and then switch to search for 23 onward, we can run the whole computation in just over half a day of CPU time. </p> <p class=pp> The computation of a(5) identified the sizes of all 616,126 canonical Boolean functions of 5 inputs. In contrast, there are <a href="http://oeis.org/A000370">just over 200 trillion canonical Boolean functions of 6 inputs</a>. Determining a(6) is unlikely to happen by brute force computation, no matter what clever tricks we use. </p> <h3>Adding XOR</h3> <p class=pp>We've assumed the use of just AND and OR as our basis for the Boolean formulas. If we also allow XOR, functions can be written using many fewer operators. In particular, a hardest function for the 1-, 2-, 3-, and 4-input cases&#8212;parity&#8212;is now trivial. Knuth examines the complexity of 5-input Boolean functions using AND, OR, and XOR in detail in <a href="http://www-cs-faculty.stanford.edu/~uno/taocp.html">The Art of Computer Programming, Volume 4A</a>. Section 7.1.2's Algorithm L is the same as our Algorithm 3 above, given for computing 4-input functions. Knuth mentions that to adapt it for 5-input functions one must treat only canonical functions and gives results for 5-input functions with XOR allowed. So another way to check our work is to add XOR to our Algorithm 4 and check that our results match Knuth's. 
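</p> <p class=pp> On the truth-table representation, XOR is one more single-instruction bitwise operation, and complementing either operand of an XOR only complements the result, so presumably a single extra case in the inner loop, visiting ff XOR g, covers all the sign combinations. That is one way to read the extension, not a description of the actual programs; the short Go check below just illustrates the identities involved: </p> <pre class=indent>
package main

import "fmt"

func main() {
    // 32-bit truth tables for two of the five inputs.
    var v, w uint32 = 0xaaaaaaaa, 0xcccccccc

    // XOR of two functions is a single bitwise operation on their tables.
    fmt.Printf("v XOR w = %08x\n", v^w)

    // Complementing an input of XOR only complements its output, so all
    // four sign combinations of v XOR w fall in one complement class.
    fmt.Println(^v ^ w == ^(v^w)) // true
    fmt.Println(v ^ ^w == ^(v^w)) // true
}
</pre> <p class=lp> Both comparisons print true.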
</p> <p class=pp> Because the minimum formula sizes are smaller (at most 12), the computation of sizes with XOR is much faster than before: </p> <center> <table> <tr><th> <th><th colspan=5>&#8212;&#8212;&#8212;&#8212;&#8212; # of Functions &#8212;&#8212;&#8212;&#8212;&#8212;<th> <tr><th>Size <th width=10><th>Canonical <th width=10><th>All <th width=10><th>All, Cumulative <th width=10><th>Time <tr><td align=right>0 <td><td align=right>1 <td><td align=right>10 <td><td align=right>10 <td><td> <tr><td align=right>1 <td><td align=right>3 <td><td align=right>102 <td><td align=right>112 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>2 <td><td align=right>5 <td><td align=right>1140 <td><td align=right>1252 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>3 <td><td align=right>20 <td><td align=right>11570 <td><td align=right>12822 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>4 <td><td align=right>93 <td><td align=right>109826 <td><td align=right>122648 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>5 <td><td align=right>366 <td><td align=right>936440 <td><td align=right>1059088 <td><td align=right>0.1 seconds <tr><td align=right>6 <td><td align=right>1730 <td><td align=right>7236880 <td><td align=right>8295968 <td><td align=right>0.7 seconds <tr><td align=right>7 <td><td align=right>8782 <td><td align=right>47739088 <td><td align=right>56035056 <td><td align=right>4.5 seconds <tr><td align=right>8 <td><td align=right>40297 <td><td align=right>250674320 <td><td align=right>306709376 <td><td align=right>24.0 seconds <tr><td align=right>9 <td><td align=right>141422 <td><td align=right>955812256 <td><td align=right>1262521632 <td><td align=right>95.5 seconds <tr><td align=right>10 <td><td align=right>273277 <td><td align=right>1945383936 <td><td align=right>3207905568 <td><td align=right>200.7 seconds <tr><td align=right>11 <td><td align=right>145707 <td><td align=right>1055912608 <td><td align=right>4263818176 <td><td align=right>121.2 seconds <tr><td align=right>12 <td><td align=right>4423 <td><td align=right>31149120 <td><td align=right>4294967296 <td><td align=right>65.0 seconds </table> </center> <p class=pp> Knuth does not discuss anything like Algorithm 5, because the search for specific functions does not apply to the AND, OR, and XOR basis. XOR is a non-monotone function (it can both turn bits on and turn bits off), so there is no test like our &ldquo;<code>if fg OR g == fg</code>&rdquo; and no small set of &ldquo;don't care&rdquo; bits to trim the search for f. The search for an appropriate f in the XOR case would have to try all f of the right size, which is exactly what Algorithm 4 already does. </p> <p class=pp> Volume 4A also considers the problem of building minimal circuits, which are like formulas but can use common subexpressions additional times for free, and the problem of building the shallowest possible circuits. See Section 7.1.2 for all the details. </p> <h3>Code and Web Site</h3> <p class=pp> The web site <a href="http://boolean-oracle.swtch.com">boolean-oracle.swtch.com</a> lets you type in a Boolean expression and gives back the minimal formula for it. It uses tables generated while running Algorithm 5; those tables and the programs described in this post are also <a href="http://boolean-oracle.swtch.com/about">available on the site</a>. 
</p> <h3>Postscript: Generating All Permutations and Inversions</h3> <p class=pp> The algorithms given above depend crucially on the step &ldquo;<code>for each function ff equivalent to f</code>,&rdquo; which generates all the ff obtained by permuting or inverting inputs to f, but I did not explain how to do that. We already saw that we can manipulate the binary truth table representation directly to turn <code>f</code> into <code>&#172;f</code> and to compute combinations of functions. We can also manipulate the binary representation directly to invert a specific input or swap a pair of adjacent inputs. Using those operations we can cycle through all the equivalent functions. </p> <p class=pp> To invert a specific input, let's consider the structure of the truth table. The index of a bit in the truth table encodes the inputs for that entry. For example, the low bit of the index gives the value of the first input. So the even-numbered bits&#8212;at indices 0, 2, 4, 6, ...&#8212;correspond to the first input being false, while the odd-numbered bits&#8212;at indices 1, 3, 5, 7, ...&#8212;correspond to the first input being true. Changing just that bit in the index corresponds to changing the single variable, so indices 0, 1 differ only in the value of the first input, as do 2, 3, and 4, 5, and 6, 7, and so on. Given the truth table for f(V, W, X, Y, Z) we can compute the truth table for f(&#172;V, W, X, Y, Z) by swapping adjacent bit pairs in the original truth table. Even better, we can do all the swaps in parallel using a bitwise operation. To invert a different input, we swap larger runs of bits. </p> <center> <table> <tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>) <tr><td>f(&#172;V, W, X, Y, Z) <td><td><code>(f&amp;0x55555555)&lt;&lt;&nbsp;1 | (f&gt;&gt;&nbsp;1)&amp;0x55555555</code> <tr><td>f(V, &#172;W, X, Y, Z) <td><td><code>(f&amp;0x33333333)&lt;&lt;&nbsp;2 | (f&gt;&gt;&nbsp;2)&amp;0x33333333</code> <tr><td>f(V, W, &#172;X, Y, Z) <td><td><code>(f&amp;0x0f0f0f0f)&lt;&lt;&nbsp;4 | (f&gt;&gt;&nbsp;4)&amp;0x0f0f0f0f</code> <tr><td>f(V, W, X, &#172;Y, Z) <td><td><code>(f&amp;0x00ff00ff)&lt;&lt;&nbsp;8 | (f&gt;&gt;&nbsp;8)&amp;0x00ff00ff</code> <tr><td>f(V, W, X, Y, &#172;Z) <td><td><code>(f&amp;0x0000ffff)&lt;&lt;16 | (f&gt;&gt;16)&amp;0x0000ffff</code> </table> </center> <p class=lp> Being able to invert a specific input lets us consider all possible inversions by building them up one at a time. The <a href="http://oeis.org/A003188">Gray code</a> lets us enumerate all possible 5-bit input codes while changing only 1 bit at a time as we move from one input to the next: </p> <center> 0, 1, 3, 2, 6, 7, 5, 4, <br> 12, 13, 15, 14, 10, 11, 9, 8, <br> 24, 25, 27, 26, 30, 31, 29, 28, <br> 20, 21, 23, 22, 18, 19, 17, 16 </center> <p class=lp> This minimizes the number of inversions we need: to consider all 32 cases, we only need 31 inversion operations. In contrast, visiting the 5-bit input codes in the usual binary order 0, 1, 2, 3, 4, ... would often need to change multiple bits, like when changing from 3 to 4. </p> <p class=pp> To swap a pair of adjacent inputs, we can again take advantage of the truth table. For a pair of inputs, there are four cases: 00, 01, 10, and 11. We can leave the 00 and 11 cases alone, because they are invariant under swapping, and concentrate on swapping the 01 and 10 bits. The first two inputs change most often in the truth table: each run of 4 bits corresponds to those four cases. 
In each run, we want to leave the first and fourth alone and swap the second and third. For later inputs, the four cases consist of sections of bits instead of single bits. </p> <center> <table> <tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>) <tr><td>f(<b>W, V</b>, X, Y, Z) <td><td><code>f&amp;0x99999999 | (f&amp;0x22222222)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x22222222</code> <tr><td>f(V, <b>X, W</b>, Y, Z) <td><td><code>f&amp;0xc3c3c3c3 | (f&amp;0x0c0c0c0c)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x0c0c0c0c</code> <tr><td>f(V, W, <b>Y, X</b>, Z) <td><td><code>f&amp;0xf00ff00f | (f&amp;0x00f000f0)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x00f000f0</code> <tr><td>f(V, W, X, <b>Z, Y</b>) <td><td><code>f&amp;0xff0000ff | (f&amp;0x0000ff00)&lt;&lt;8 | (f&gt;&gt;8)&amp;0x0000ff00</code> </table> </center> <p class=lp> Being able to swap a pair of adjacent inputs lets us consider all possible permutations by building them up one at a time. Again it is convenient to have a way to visit all permutations by applying only one swap at a time. Here Volume 4A comes to the rescue. Section 7.2.1.2 is titled &ldquo;Generating All Permutations,&rdquo; and Knuth delivers many algorithms to do just that. The most convenient for our purposes is Algorithm P, which generates a sequence that considers all permutations exactly once with only a single swap of adjacent inputs between steps. Knuth calls it Algorithm P because it corresponds to the &ldquo;Plain changes&rdquo; algorithm used by <a href="http://en.wikipedia.org/wiki/Change_ringing">bell ringers in 17th century England</a> to ring a set of bells in all possible permutations. The algorithm is described in a manuscript written around 1653! </p> <p class=pp> We can examine all possible permutations and inversions by nesting a loop over all permutations inside a loop over all inversions, and in fact that's what my program does. Knuth does one better, though: his Exercise 7.2.1.2-20 suggests that it is possible to build up all the possibilities using only adjacent swaps and inversion of the first input. Negating arbitrary inputs is not hard, though, and still does minimal work, so the code sticks with Gray codes and Plain changes. </p></p> Zip Files All The Way Down tag:research.swtch.com,2012:research.swtch.com/zip 2010-03-18T00:00:00-04:00 2010-03-18T00:00:00-04:00 Did you think it was turtles? <p><p class=lp> Stephen Hawking begins <i><a href="http://www.amazon.com/-/dp/0553380168">A Brief History of Time</a></i> with this story: </p> <blockquote> <p class=pp> A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: &ldquo;What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.&rdquo; The scientist gave a superior smile before replying, &ldquo;What is the tortoise standing on?&rdquo; &ldquo;You're very clever, young man, very clever,&rdquo; said the old lady. &ldquo;But it's turtles all the way down!&rdquo; </p> </blockquote> <p class=lp> Scientists today are pretty sure that the universe is not actually turtles all the way down, but we can create that kind of situation in other contexts. 
For example, here we have <a href="http://www.youtube.com/watch?v=Y-gqMTt3IUg">video monitors all the way down</a> and <a href="http://www.amazon.com/gp/customer-media/product-gallery/0387900926/ref=cm_ciu_pdp_images_all">set theory books all the way down</a>, and <a href="http://blog.makezine.com/archive/2009/01/thousands_of_shopping_carts_stake_o.html">shopping carts all the way down</a>. </p> <p class=pp> And here's a computer storage equivalent: look inside <a href="http://swtch.com/r.zip"><code>r.zip</code></a>. It's zip files all the way down: each one contains another zip file under the name <code>r/r.zip</code>. (For the die-hard Unix fans, <a href="http://swtch.com/r.tar.gz"><code>r.tar.gz</code></a> is gzipped tar files all the way down.) Like the line of shopping carts, it never ends, because it loops back onto itself: the zip file contains itself! And it's probably less work to put together a self-reproducing zip file than to put together all those shopping carts, at least if you're the kind of person who would read this blog. This post explains how. </p> <p class=pp> Before we get to self-reproducing zip files, though, we need to take a brief detour into self-reproducing programs. </p> <h3>Self-reproducing programs</h3> <p class=pp> The idea of self-reproducing programs dates back to the 1960s. My favorite statement of the problem is the one Ken Thompson gave in his 1983 Turing Award address: </p> <blockquote> <p class=pp> In college, before video games, we would amuse ourselves by posing programming exercises. One of the favorites was to write the shortest self-reproducing program. Since this is an exercise divorced from reality, the usual vehicle was FORTRAN. Actually, FORTRAN was the language of choice for the same reason that three-legged races are popular. </p> <p class=pp> More precisely stated, the problem is to write a source program that, when compiled and executed, will produce as output an exact copy of its source. If you have never done this, I urge you to try it on your own. The discovery of how to do it is a revelation that far surpasses any benefit obtained by being told how to do it. The part about &ldquo;shortest&rdquo; was just an incentive to demonstrate skill and determine a winner. </p> </blockquote> <p class=lp> <b>Spoiler alert!</b> I agree: if you have never done this, I urge you to try it on your own. The internet makes it so easy to look things up that it's refreshing to discover something yourself once in a while. Go ahead and spend a few days figuring out. This blog will still be here when you get back. (If you don't mind the spoilers, the entire <a href="http://cm.bell-labs.com/who/ken/trust.html">Turing award address</a> is worth reading.) </p> <center> <br><br> <i>(Spoiler blocker.)</i> <br> <a href="http://www.robertwechsler.com/projects.html"><img src="http://research.swtch.com/applied_geometry.jpg"></a> <br> <i><a href="http://www.robertwechsler.com/projects.html">http://www.robertwechsler.com/projects.html</a></i> <br><br> </center> <p class=pp> Let's try to write a Python program that prints itself. It will probably be a <code>print</code> statement, so here's a first attempt, run at the interpreter prompt: </p> <pre class=indent> &gt;&gt;&gt; print '<span style="color: #005500">hello</span>' hello </pre> <p class=lp> That didn't quite work. 
But now we know what the program is, so let's print it: </p> <pre class=indent> &gt;&gt;&gt; print "<span style="color: #005500">print 'hello'</span>" print 'hello' </pre> <p class=lp> That didn't quite work either. The problem is that when you execute a simple print statement, it only prints part of itself: the argument to the print. We need a way to print the rest of the program too. </p> <p class=pp> The trick is to use recursion: you write a string that is the whole program, but with itself missing, and then you plug it into itself before passing it to print. </p> <pre class=indent> &gt;&gt;&gt; s = '<span style="color: #005500">print %s</span>'; print s % repr(s) print 'print %s' </pre> <p class=lp> Not quite, but closer: the problem is that the string <code>s</code> isn't actually the program. But now we know the general form of the program: <code>s = '<span style="color: #005500">%s</span>'; print s % repr(s)</code>. That's the string to use. </p> <pre class=indent> &gt;&gt;&gt; s = '<span style="color: #005500">s = %s; print s %% repr(s)</span>'; print s % repr(s) s = 's = %s; print s %% repr(s)'; print s % repr(s) </pre> <p class=lp> Recursion for the win. </p> <p class=pp> This form of self-reproducing program is often called a <a href="http://en.wikipedia.org/wiki/Quine_(computing)">quine</a>, in honor of the philosopher and logician W. V. O. Quine, who discovered the paradoxical sentence: </p> <blockquote> &ldquo;Yields falsehood when preceded by its quotation&rdquo;<br>yields falsehood when preceded by its quotation. </blockquote> <p class=lp> The simplest English form of a self-reproducing quine is a command like: </p> <blockquote> Print this, followed by its quotation:<br>&ldquo;Print this, followed by its quotation:&rdquo; </blockquote> <p class=lp> There's nothing particularly special about Python that makes quining possible. The most elegant quine I know is a Scheme program that is a direct, if somewhat inscrutable, translation of that sentiment: </p> <pre class=indent> ((lambda (x) `<span style="color: #005500">(</span>,x <span style="color: #005500">'</span>,x<span style="color: #005500">)</span>) '<span style="color: #005500">(lambda (x) `(,x ',x))</span>) </pre> <p class=lp> I think the Go version is a clearer translation, at least as far as the quoting is concerned: </p> <pre class=indent> /* Go quine */ package main import "<span style="color: #005500">fmt</span>" func main() { fmt.Printf("<span style="color: #005500">%s%c%s%c\n</span>", q, 0x60, q, 0x60) } var q = `<span style="color: #005500">/* Go quine */ package main import "fmt" func main() { fmt.Printf("%s%c%s%c\n", q, 0x60, q, 0x60) } var q = </span>` </pre> <p class=lp>(I've colored the data literals green throughout to make it clear what is program and what is data.)</p> <p class=pp>The Go program has the interesting property that, ignoring the pesky newline at the end, the entire program is the same thing twice (<code>/* Go quine */ ... q = `</code>). That got me thinking: maybe it's possible to write a self-reproducing program using only a repetition operator. And you know what programming language has essentially only a repetition operator? The language used to encode Lempel-Ziv compressed files like the ones used by <code>gzip</code> and <code>zip</code>. 
</p> <h3>Self-reproducing Lempel-Ziv programs</h3> <p class=pp> Lempel-Ziv compressed data is a stream of instructions with two basic opcodes: <code>literal(</code><i>n</i><code>)</code> followed by <i>n</i> bytes of data means write those <i>n</i> bytes into the decompressed output, and <code>repeat(</code><i>d</i><code>,</code> <i>n</i><code>)</code> means look backward <i>d</i> bytes from the current location in the decompressed output and copy the <i>n</i> bytes you find there into the output stream. </p> <p class=pp> The programming exercise, then, is this: write a Lempel-Ziv program using just those two opcodes that prints itself when run. In other words, write a compressed data stream that decompresses to itself. Feel free to assume any reasonable encoding for the <code>literal</code> and <code>repeat</code> opcodes. For the grand prize, find a program that decompresses to itself surrounded by an arbitrary prefix and suffix, so that the sequence could be embedded in an actual <code>gzip</code> or <code>zip</code> file, which has a fixed-format header and trailer. </p> <p class=pp> <b>Spoiler alert!</b> I urge you to try this on your own before continuing to read. It's a great way to spend a lazy afternoon, and you have one critical advantage that I didn't: you know there is a solution. </p> <center> <br><br> <i>(Spoiler blocker.)</i> <br> <a href=""><img src="http://research.swtch.com/the_best_circular_bike(sbcc_sbma_students_roof).jpg"></a> <br> <i><a href="http://www.robertwechsler.com/thebest.html">http://www.robertwechsler.com/thebest.html</a></i> <br><br> </center> <p class=lp>By the way, here's <a href="http://swtch.com/r.gz"><code>r.gz</code></a>, gzip files all the way down. <pre class=indent> $ gunzip &lt; r.gz &gt; r $ cmp r r.gz $ </pre> <p class=lp>The nice thing about <code>r.gz</code> is that even broken web browsers that ordinarily decompress downloaded gzip data before storing it to disk will handle this file correctly! </p> <p class=pp>Enough stalling to hide the spoilers. Let's use this shorthand to describe Lempel-Ziv instructions: <code>L</code><i>n</i> and <code>R</code><i>n</i> are shorthand for <code>literal(</code><i>n</i><code>)</code> and <code>repeat(</code><i>n</i><code>,</code> <i>n</i><code>)</code>, and the program assumes that each code is one byte. <code>L0</code> is therefore the Lempel-Ziv no-op; <code>L5</code> <code>hello</code> prints <code>hello</code>; and so does <code>L3</code> <code>hel</code> <code>R1</code> <code>L1</code> <code>o</code>. </p> <p class=pp> Here's a Lempel-Ziv program that prints itself. (Each line is one instruction.) 
</p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L4</span></code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L0</span></code></td><td></td><td><code>L0 L0 L0 L0</code></td></tr> </table> </center> <br> <p class=lp> (The two columns Code and Output contain the same byte sequence.) </p> <p class=pp> The interesting core of this program is the 6-byte sequence <code>L4 R4 L4 R4 L4 R4</code>, which prints the 8-byte sequence <code>R4 L4 R4 L4 R4 L4 R4 L4</code>. That is, it prints itself with an extra byte before and after. </p> <p class=pp> When we were trying to write the self-reproducing Python program, the basic problem was that the print statement was always longer than what it printed. We solved that problem with recursion, computing the string to print by plugging it into itself. Here we took a different approach. The Lempel-Ziv program is particularly repetitive, so that a repeated substring ends up containing the entire fragment. The recursion is in the representation of the program rather than its execution. Either way, that fragment is the crucial point. Before the final <code>R4</code>, the output lags behind the input. Once it executes, the output is one code ahead. 
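</p> <p class=pp> It is easy to check this little program mechanically. Here is a throwaway Go interpreter for the simplified opcodes; the one-byte encoding it uses (L<i>n</i> as 0x80|<i>n</i>, R<i>n</i> as 0x40|<i>n</i>) is an arbitrary choice for the check, not the DEFLATE encoding used later for the real zip files: </p> <pre class=indent>
package main

import (
    "bytes"
    "fmt"
)

// A hypothetical one-byte encoding of the simplified opcodes:
// Ln = 0x80|n (emit the next n code bytes literally),
// Rn = 0x40|n (copy the last n bytes of output again).
const (
    L0 = 0x80
    L4 = 0x84
    R4 = 0x44
)

// run decompresses a program written in the simplified language.
func run(prog []byte) []byte {
    var out []byte
    for i := 0; i &lt; len(prog); {
        op := prog[i]
        n := int(op &amp; 0x3f)
        i++
        if op&amp;0x80 != 0 { // literal: copy n code bytes to the output
            out = append(out, prog[i:i+n]...)
            i += n
        } else { // repeat: copy the last n output bytes again
            out = append(out, out[len(out)-n:]...)
        }
    }
    return out
}

func main() {
    prog := []byte{
        L0, L0, L0, // three no-ops
        L4, L0, L0, L0, L4, // print L0 L0 L0 L4
        R4, // repeat: print L0 L0 L0 L4 again
        L4, R4, L4, R4, L4, // print R4 L4 R4 L4
        R4, // repeat: print R4 L4 R4 L4 again
        L4, L0, L0, L0, L0, // print L0 L0 L0 L0
    }
    fmt.Println(bytes.Equal(run(prog), prog)) // true: it prints itself
}
</pre> <p class=lp> It prints true: the twenty decompressed bytes are exactly the twenty code bytes.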
</p> <p class=pp> The <code>L0</code> no-ops are plugged into a more general variant of the program, which can reproduce itself with the addition of an arbitrary three-byte prefix and suffix: </p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500"><i>aa bb cc</i> L4</span></code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 <i>xx yy zz</i></span></code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr> </table> </center> <br> <p class=lp> (The byte sequence in the Output column is <code><i>aa bb cc</i></code>, then the byte sequence from the Code column, then <code><i>xx yy zz</i></code>.) </p> <p class=pp> It took me the better part of a quiet Sunday to get this far, but by the time I got here I knew the game was over and that I'd won. From all that experimenting, I knew it was easy to create a program fragment that printed itself minus a few instructions or even one that printed an arbitrary prefix and then itself, minus a few instructions. The extra <code>aa bb cc</code> in the output provides a place to attach such a program fragment. Similarly, it's easy to create a fragment to attach to the <code>xx yy zz</code> that prints itself, minus the first three instructions, plus an arbitrary suffix. We can use that generality to attach an appropriate header and trailer. </p> <p class=pp> Here is the final program, which prints itself surrounded by an arbitrary prefix and suffix. <code>[P]</code> denotes the <i>p</i>-byte compressed form of the prefix <code>P</code>; similarly, <code>[S]</code> denotes the <i>s</i>-byte compressed form of the suffix <code>S</code>. 
</p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print prefix</span></i></td> <td></td> <td><code>[P]</code></td> <td></td> <td><code>P</code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print </i>p<i>+1 bytes</i></span></td> <td></td> <td><code>L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> <span style="color: #005500">[P] L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td> <td></td> <td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>repeat last </i>p<i>+1 printed bytes</i></span></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> <td></td> <td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td> <td></td> <td><code>L1 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td> <td></td> <td><code>L1 <span style="color: #005500">L1</span></code></td> <td></td> <td><code>L1</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code><span style="color: #005500"> L1 L1 L4</span></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td> <td></td> <td><code>R4 L4 R4 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R4 L4 R4 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R4 L0 L0 L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"></span></code></td> <td></td> <td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td> <td></td> <td><code>L0</code></td> <td></td> <td></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td> <td></td> 
<td><code>L0</code></td> <td></td> <td></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print </i>s<i>+1 bytes</i></span></td> <td></td> <td><code>L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"> [S]</span></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>repeat last </i>s<i>+1 bytes</i></span></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print suffix</span></i></td> <td></td> <td><code>[S]</code></td> <td></td> <td><code>S</code></td> </tr> </table> </center> <br> <p class=lp> (The byte sequence in the Output column is <code><i>P</i></code>, then the byte sequence from the Code column, then <code><i>S</i></code>.) </p> <h3>Self-reproducing zip files</h3> <p class=pp> Now the rubber meets the road. We've solved the main theoretical obstacle to making a self-reproducing zip file, but there are a couple practical obstacles still in our way. </p> <p class=pp> The first obstacle is to translate our self-reproducing Lempel-Ziv program, written in simplified opcodes, into the real opcode encoding. <a href="http://www.ietf.org/rfc/rfc1951.txt">RFC 1951</a> describes the DEFLATE format used in both gzip and zip: a sequence of blocks, each of which is a sequence of opcodes encoded using Huffman codes. Huffman codes assign different length bit strings to different opcodes, breaking our assumption above that opcodes have fixed length. But wait! We can, with some care, find a set of fixed-size encodings that says what we need to be able to express. </p> <p class=pp> In DEFLATE, there are literal blocks and opcode blocks. The header at the beginning of a literal block is 5 bytes: </p> <center> <img src="http://research.swtch.com/zip1.png"> </center> <p class=pp> If the translation of our <code>L</code> opcodes above are 5 bytes each, the translation of the <code>R</code> opcodes must also be 5 bytes each, with all the byte counts above scaled by a factor of 5. (For example, <code>L4</code> now has a 20-byte argument, and <code>R4</code> repeats the last 20 bytes of output.) The opcode block with a single <code>repeat(20,20)</code> instruction falls well short of 5 bytes: </p> <center> <img src="http://research.swtch.com/zip2.png"> </center> <p class=lp>Luckily, an opcode block containing two <code>repeat(20,10)</code> instructions has the same effect and is exactly 5 bytes: </p> <center> <img src="http://research.swtch.com/zip3.png"> </center> <p class=lp> Encoding the other sized repeats (<code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span> and <code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span>) takes more effort and some sleazy tricks, but it turns out that we can design 5-byte codes that repeat any amount from 9 to 64 bytes. 
For example, here are the repeat blocks for 10 bytes and for 40 bytes: </p> <center> <img src="http://research.swtch.com/zip4.png"> <br> <img src="http://research.swtch.com/zip5.png"> </center> <p class=lp> The repeat block for 10 bytes is two bits too short, but every repeat block is followed by a literal block, which starts with three zero bits and then padding to the next byte boundary. If a repeat block ends two bits short of a byte but is followed by a literal block, the literal block's padding will insert the extra two bits. Similarly, the repeat block for 40 bytes is five bits too long, but they're all zero bits. Starting a literal block five bits too late steals the bits from the padding. Both of these tricks only work because the last 7 bits of any repeat block are zero and the bits in the first byte of any literal block are also zero, so the boundary isn't directly visible. If the literal block started with a one bit, this sleazy trick wouldn't work. </p> <p class=pp>The second obstacle is that zip archives (and gzip files) record a CRC32 checksum of the uncompressed data. Since the uncompressed data is the zip archive, the data being checksummed includes the checksum itself. So we need to find a value <i>x</i> such that writing <i>x</i> into the checksum field causes the file to checksum to <i>x</i>. Recursion strikes back. </p> <p class=pp> The CRC32 checksum computation interprets the entire file as a big number and computes the remainder when you divide that number by a specific constant using a specific kind of division. We could go through the effort of setting up the appropriate equations and solving for <i>x</i>. But frankly, we've already solved one nasty recursive puzzle today, and <a href="http://www.youtube.com/watch?v=TQBLTB5f3j0">enough is enough</a>. There are only four billion possibilities for <i>x</i>: we can write a program to try each in turn, until it finds one that works. </p> <p class=pp> If you want to recreate these files yourself, there are a few more minor obstacles, like making sure the tar file is a multiple of 512 bytes and compressing the rather large zip trailer to at most 59 bytes so that <code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span> is at most <code>R</code><span style="font-size: 0.8em;">64</span>. But they're just a simple matter of programming. </p> <p class=pp> So there you have it: <code><a href="http://swtch.com/r.gz">r.gz</a></code> (gzip files all the way down), <code><a href="http://swtch.com/r.tar.gz">r.tar.gz</a></code> (gzipped tar files all the way down), and <code><a href="http://swtch.com/r.zip">r.zip</a></code> (zip files all the way down). I regret that I have been unable to find any programs that insist on decompressing these files recursively, ad infinitum. It would have been fun to watch them squirm, but it looks like much less sophisticated <a href="http://en.wikipedia.org/wiki/Zip_bomb">zip bombs</a> have spoiled the fun. </p> <p class=pp> If you're feeling particularly ambitious, here is <a href="http://swtch.com/rgzip.go">rgzip.go</a>, the <a href="http://golang.org/">Go</a> program that generated these files. I wonder if you can create a zip file that contains a gzipped tar file that contains the original zip file. Ken Thompson suggested trying to make a zip file that contains a slightly larger copy of itself, recursively, so that as you dive down the chain of zip files each one gets a little bigger. (If you do manage either of these, please leave a comment.) </p> <br> <p class=lp><font size=-1>P.S. 
I can't end the post without sharing my favorite self-reproducing program: the one-line shell script <code>#!/bin/cat</code></font>. </p></p> </div> </div> </div> UTF-8: Bits, Bytes, and Benefits tag:research.swtch.com,2012:research.swtch.com/utf8 2010-03-05T00:00:00-05:00 2010-03-05T00:00:00-05:00 The reasons to switch to UTF-8 <p><p class=pp> UTF-8 is a way to encode Unicode code points&#8212;integer values from 0 through 10FFFF&#8212;into a byte stream, and it is far simpler than many people realize. The easiest way to make it confusing or complicated is to treat it as a black box, never looking inside. So let's start by looking inside. Here it is: </p> <center> <table cellspacing=5 cellpadding=0 border=0> <tr height=10><th colspan=4></th></tr> <tr><th align=center colspan=2>Unicode code points</th><th width=10><th align=center>UTF-8 encoding (binary)</th></tr> <tr height=10><td colspan=4></td></tr> <tr><td align=right>00-7F</td><td>(7 bits)</td><td></td><td align=right>0<i>tuvwxyz</i></td></tr> <tr><td align=right>0080-07FF</td><td>(11 bits)</td><td></td><td align=right>110<i>pqrst</i>&nbsp;10<i>uvwxyz</i></td></tr> <tr><td align=right>0800-FFFF</td><td>(16 bits)</td><td></td><td align=right>1110<i>jklm</i>&nbsp;10<i>npqrst</i>&nbsp;10<i>uvwxyz</i></td></tr> <tr><td align=right valign=top>010000-10FFFF</td><td>(21 bits)</td><td></td><td align=right valign=top>11110<i>efg</i>&nbsp;10<i>hijklm</i> 10<i>npqrst</i>&nbsp;10<i>uvwxyz</i></td></tr> <tr height=10><td colspan=4></td></tr> </table> </center> <p class=lp> The convenient properties of UTF-8 are all consequences of the choice of encoding. </p> <ol> <li><i>All ASCII files are already UTF-8 files.</i><br> The first 128 Unicode code points are the 7-bit ASCII character set, and UTF-8 preserves their one-byte encoding. </li> <li><i>ASCII bytes always represent themselves in UTF-8 files. They never appear as part of other UTF-8 sequences.</i><br> All the non-ASCII UTF-8 sequences consist of bytes with the high bit set, so if you see the byte 0x7A in a UTF-8 file, you can be sure it represents the character <code>z</code>. </li> <li><i>ASCII bytes are always represented as themselves in UTF-8 files. They cannot be hidden inside multibyte UTF-8 sequences.</i><br> The ASCII <code>z</code> 01111010 cannot be encoded as a two-byte UTF-8 sequence 11000001 10111010. Code points must be encoded using the shortest possible sequence. A corollary is that decoders must detect overlong sequences as invalid. In practice, it is useful for a decoder to use the Unicode replacement character, code point FFFD, as the decoding of an invalid UTF-8 sequence rather than stop processing the text. </li> <li><i>UTF-8 is self-synchronizing.</i><br> Let's call a byte of the form 10<i>xxxxxx</i> a continuation byte. Every UTF-8 sequence is a byte that is not a continuation byte followed by zero or more continuation bytes. If you start processing a UTF-8 file at an arbitrary point, you might not be at the beginning of a UTF-8 encoding, but you can easily find one: skip over continuation bytes until you find a non-continuation byte. (The same applies to scanning backward.) </li> <li><i>Substring search is just byte string search.</i><br> Properties 2, 3, and 4 imply that given a string of correctly encoded UTF-8, the only way those bytes can appear in a larger UTF-8 text is when they represent the same code points. So you can use any 8-bit-safe, byte-at-a-time search function, like <code>strchr</code> or <code>strstr</code>, to run the search. 
</li> <li><i>Most programs that handle 8-bit files safely can handle UTF-8 safely.</i><br> This also follows from Properties 2, 3, and 4. I say &ldquo;most&rdquo; programs, because programs that take apart a byte sequence expecting one character per byte will not behave correctly, but very few programs do that. It is far more common to split input at newline characters, or split whitespace-separated fields, or do other similar parsing around specific ASCII characters. For example, Unix tools like cat, cmp, cp, diff, echo, head, tail, and tee can process UTF-8 files as if they were plain ASCII files. Most operating system kernels should also be able to handle UTF-8 file names without any special arrangement, since the only operations done on file names are comparisons and splitting at <code>/</code>. In contrast, tools like grep, sed, and wc, which inspect arbitrary individual characters, do need modification. </li> <li><i>UTF-8 sequences sort in code point order.</i><br> You can verify this by inspecting the encodings in the table above. This means that Unix tools like join, ls, and sort (without options) don't need to handle UTF-8 specially. </li> <li><i>UTF-8 has no &ldquo;byte order.&rdquo;</i><br> UTF-8 is a byte encoding. It is not little endian or big endian. Unicode defines a byte order mark (BOM) code point FEFF, which is used to determine the byte order of a stream of raw 16-bit values, like UCS-2 or UTF-16. It has no place in a UTF-8 file. Some programs like to write a UTF-8-encoded BOM at the beginning of UTF-8 files, but this is unnecessary (and annoying to programs that don't expect it). </li> </ol> <p class=lp> UTF-8 does give up the ability to do random access using code point indices. Programs that need to jump to the <i>n</i>th Unicode code point in a file or on a line&#8212;text editors are the canonical example&#8212;will typically convert incoming UTF-8 to an internal representation like an array of code points and then convert back to UTF-8 for output, but most programs are simpler when written to manipulate UTF-8 directly. </p> <p class=pp> Programs that make UTF-8 more complicated than it needs to be are typically trying to be too general, not wanting to make assumptions that might not be true of other encodings. But there are good tools to convert other encodings to UTF-8, and it is slowly becoming the standard encoding: even the fraction of web pages written in UTF-8 is <a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">nearing 50%</a>. UTF-8 was explicitly designed to have these nice properties. Take advantage of them. </p> <p class=pp> For more on UTF-8, see &ldquo;<a href="http://plan9.bell-labs.com/sys/doc/utf.html">Hello World or Καλημέρα κόσμε or こんにちは 世界</a>,&rdquo; by Rob Pike and Ken Thompson, and also this <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt">history</a>. </p> <br> <font size=-1> <p class=lp> Notes: Property 6 assumes the tools do not strip the high bit from each byte. Such mangling was common years ago but is very uncommon now. Property 7 assumes the comparison is done treating the bytes as unsigned, but such behavior is mandated by the ANSI C standard for <code>memcmp</code>, <code>strcmp</code>, and <code>strncmp</code>. 
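</p> </font> <p class=pp> To make the encoding table at the top of this post concrete, here is a small Go sketch of my own (not taken from any library) that encodes a code point by following the bit patterns in the table and checks the result against the standard <code>unicode/utf8</code> package. It is only a sketch: a real encoder must also reject surrogates and out-of-range values, which this one does not. </p> <pre class=indent>
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes for r, following the table above.
// Sketch only: it does not reject surrogates (D800-DFFF) or values
// above 10FFFF, which a real encoder must treat as errors.
func encode(r rune) []byte {
	switch {
	case r &lt; 0x80: // 0tuvwxyz
		return []byte{byte(r)}
	case r &lt; 0x800: // 110pqrst 10uvwxyz
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r &lt; 0x10000: // 1110jklm 10npqrst 10uvwxyz
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // 11110efg 10hijklm 10npqrst 10uvwxyz
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F, 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
}

func main() {
	for _, r := range []rune{'z', 'é', '世', '𝄞'} {
		var buf [utf8.UTFMax]byte
		n := utf8.EncodeRune(buf[:], r) // standard library, for comparison
		fmt.Printf("%U: % x (stdlib % x)\n", r, encode(r), buf[:n])
	}
}
</pre> <p class=lp> Decoding is the same table read right to left, which is a good way to convince yourself of properties 2, 3, and 4 by hand. </p> <font size=-1><p class=lp>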
</p> </font></p> Computing History at Bell Labs tag:research.swtch.com,2012:research.swtch.com/bell-labs 2008-04-09T00:00:00-04:00 2008-04-09T00:00:00-04:00 Doug McIlroy's rememberances <p><p class=pp> In 1997, on his retirement from Bell Labs, <a href="http://www.cs.dartmouth.edu/~doug/">Doug McIlroy</a> gave a fascinating talk about the &ldquo;<a href="https://web.archive.org/web/20081022192943/http://cm.bell-labs.com/cm/cs/doug97.html"><b>History of Computing at Bell Labs</b></a>.&rdquo; Almost ten years ago I transcribed the audio but never did anything with it. The transcript is below. </p> <p class=pp> My favorite parts of the talk are the description of the bi-quinary decimal relay calculator and the description of a team that spent over a year tracking down a race condition bug in a missile detector (reliability was king: today you'd just stamp &ldquo;cannot reproduce&rdquo; and send the report back). But the whole thing contains many fantastic stories. It's well worth the read or listen. I also like his recollection of programming using cards: &ldquo;It's the kind of thing you can be nostalgic about, but it wasn't actually fun.&rdquo; </p> <p class=pp> For more information, Bernard D. Holbrook and W. Stanley Brown's 1982 technical report &ldquo;<a href="cstr99.pdf">A History of Computing Research at Bell Laboratories (1937-1975)</a>&rdquo; covers the earlier history in more detail. </p> <p><i>Corrections added August 19, 2009. Links updated May 16, 2018.</i></p> <br> <br> <p class=lp><i>Transcript of &ldquo;<a href="https://web.archive.org/web/20081022192943/http://cm.bell-labs.com/cm/cs/doug97.html">History of Computing at Bell Labs:</a>&rdquo;</i></p> <p class=pp> Computing at Bell Labs is certainly an outgrowth of the <a href="https://web.archive.org/web/20080622172015/http://cm.bell-labs.com/cm/ms/history/history.html">mathematics department</a>, which grew from that first hiring in 1897, G A Campbell. When Bell Labs was formally founded in 1925, what it had been was the engineering department of Western Electric. When it was formally founded in 1925, almost from the beginning there was a math department with Thornton Fry as the department head, and if you look at some of Fry's work, it turns out that he was fussing around in 1929 with trying to discover information theory. It didn't actually gel until twenty years later with Shannon.</p> <p class=pp><span style="font-size: 0.7em;">1:10</span> Of course, most of the mathematics at that time was continuous. One was interested in analyzing circuits and propagation. And indeed, this is what led to the growth of computing in Bell Laboratories. The computations could not all be done symbolically. There were not closed form solutions. There was lots of numerical computation done. The math department had a fair stable of computers, which in those days meant people. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">2:00</span> And in the late '30s, <a href="http://en.wikipedia.org/wiki/George_Stibitz">George Stibitz</a> had an idea that some of the work that they were doing on hand calculators might be automated by using some of the equipment that the Bell System was installing in central offices, namely relay circuits. He went home, and on his kitchen table, he built out of relays a binary arithmetic circuit. He decided that binary was really the right way to compute. 
However, when he finally came to build some equipment, he determined that binary to decimal conversion and decimal to binary conversion was a drag, and he didn't want to put it in the equipment, and so he finally built in 1939, a relay calculator that worked in decimal, and it worked in complex arithmetic. Do you have a hand calculator now that does complex arithmetic? Ten-digit, I believe, complex computations: add, subtract, multiply, and divide. The I/O equipment was teletypes, so essentially all the stuff to make such machines out of was there. Since the I/O was teletypes, it could be remotely accessed, and there were in fact four stations in the West Street Laboratories of Bell Labs. West Street is down on the left side of Manhattan. I had the good fortune to work there one summer, right next to a district where you're likely to get bowled over by rolling ?beads? hanging from racks or tumbling ?cabbages?. The building is still there. It's called <a href="http://query.nytimes.com/gst/fullpage.html?res=950DE3DB1F38F931A35751C0A96F948260">Westbeth Apartments</a>. It's now an artist's colony.</p> <p class=pp><span style="font-size: 0.7em;">4:29</span> Anyway, in West Street, there were four separate remote stations from which the complex calculator could be accessed. It was not time sharing. You actually reserved your time on the machine, and only one of the four terminals worked at a time. In 1940, this machine was shown off to the world at the AMS annual convention, which happened to be held in Hanover at Dartmouth that year, and mathematicians could wonder at remote computing, doing computation on an electromechanical calculator at 300 miles away.</p> <p class=pp><span style="font-size: 0.7em;">5:22</span> Stibitz went on from there to make a whole series of relay machines. Many of them were made for the government during the war. They were named, imaginatively, Mark I through Mark VI. I have read some of his patents. They're kind of fun. One is a patent on conditional transfer. [laughter] And how do you do a conditional transfer? Well these gadgets were, the relay calculator was run from your fingers, I mean the complex calculator. The later calculators, of course, if your fingers were a teletype, you could perfectly well feed a paper tape in, because that was standard practice. And these later machines were intended really to be run more from paper tape. And the conditional transfer was this: you had two teletypes, and there's a code that says "time to read from the other teletype". Loops were of course easy to do. You take paper and [laughter; presumably Doug curled a piece of paper to form a physical loop]. These machines never got to the point of having stored programs. But they got quite big. I saw, one of them was here in 1954, and I did see it, behind glass, and if you've ever seen these machines in the, there's one in the Franklin Institute in Philadelphia, and there's one in the Science Museum in San Jose, you know these machines that drop balls that go wandering sliding around and turning battle wheels and ringing bells and who knows what. It kind of looked like that. It was a very quiet room, with just a little clicking of relays, which is what a central office used to be like. It was the one air-conditioned room in Murray Hill, I think. This machine ran, the Mark VI, well I think that was the Mark V, the Mark VI actually went to Aberdeen. This machine ran for a good number of years, probably six, eight. And it is said that it never made an undetected error. 
[laughter]</p> <p class=pp><span style="font-size: 0.7em;">8:30</span> What that means is that it never made an error that it did not diagnose itself and stop. Relay technology was very very defensive. The telephone switching system had to work. It was full of self-checking, and so were the calculators, so were the calculators that Stibitz made.</p> <p class=pp><span style="font-size: 0.7em;">9:04</span> Arithmetic was done in bi-quinary, a two out of five representation for decimal integers, and if there weren't exactly two out of five relays activated it would stop. This machine ran unattended over the weekends. People would bring their tapes in, and the operator would paste everybody's tapes together. There was a beginning of job code on the tape and there was also a time indicator. If the machine ran out of time, it automatically stopped and went to the next job. If the machine caught itself in an error, it backed up to the current job and tried it again. They would load this machine on Friday night, and on Monday morning, all the tapes, all the entries would be available on output tapes.</p> <p class=pp>Question: I take it they were using a different representation for loops and conditionals by then.</p> <p class=pp>Doug: Loops were done actually by they would run back and forth across the tape now, on this machine.</p> <p class=pp><span style="font-size: 0.7em;">10:40</span> Then came the transistor in '48. At Whippany, they actually had a transistorized computer, which was a respectable minicomputer, a box about this big, running in 1954, it ran from 1954 to 1956 solidly as a test run. The notion was that this computer might fly in an airplane. And during that two-year test run, one diode failed. In 1957, this machine called <a href="http://www.cedmagic.com/history/tradic-transistorized.html">TRADIC</a>, did in fact fly in an airplane, but to the best of my knowledge, that machine was a demonstration machine. It didn't turn into a production machine. About that time, we started buying commercial machines. It's wonderful to think about the set of different architectures that existed in that time. The first machine we got was called a <a href="http://www.columbia.edu/acis/history/cpc.html">CPC from IBM</a>. And all it was was a big accounting machine with a very special plugboard on the side that provided an interpreter for doing ten-digit decimal arithmetic, including opcodes for the trig functions and square root.</p> <p class=pp><span style="font-size: 0.7em;">12:30</span> It was also not a computer as we know it today, because it wasn't stored program, it had twenty-four memory locations as I recall, and it took its program instead of from tapes, from cards. This was not a total advantage. A tape didn't get into trouble if you dropped it on the floor. [laughter]. CPC, the operator would stand in front of it, and there, you would go through loops by taking cards out, it took human intervention, to take the cards out of the output of the card reader and put them in the ?top?. I actually ran some programs on the CPC ?...?. It's the kind of thing you can be nostalgic about, but it wasn't actually fun. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">13:30</span> The next machine was an <a href="http://www.columbia.edu/acis/history/650.html">IBM 650</a>, and here, this was a stored program, with the memory being on drum. There was no operating system for it. It came with a manual: this is what the machine does. 
And Michael Wolontis made an interpreter called the <a href="http://hopl.info/showlanguage2.prx?exp=6497">L1 interpreter</a> for this machine, so you could actually program in, the manual told you how to program in binary, and L1 allowed you to give something like 10 for add and 9 for subtract, and program in decimal instead. And of course that machine required interesting optimization, because it was a nice thing if the next program step were stored somewhere -- each program step had the address of the following step in it, and you would try to locate them around the drum so to minimize latency. So there were all kinds of optimizers around, but I don't think Bell Labs made ?...? based on this called "soap" from Carnegie Mellon. That machine didn't last very long. Fortunately, a machine with core memory came out from IBM in about '56, the 704. Bell Labs was a little slow in getting one, in '58. Again, the machine came without an operating system. In fact, but it did have Fortran, which really changed the world. It suddenly made it easy to write programs. But the way Fortran came from IBM, it came with a thing called the Fortran Stop Book. This was a list of what happened, a diagnostic would execute the halt instruction, the operator would go read the panel lights and discover where the machine had stopped, you would then go look up in the stop book what that meant. Bell Labs, with George Mealy and Glenn Hanson, made an operating system, and one of the things they did was to bring the stop book to heel. They took the compiler, replaced all the stop instructions with jumps to somewhere, and allowed the program instead of stopping to go on to the next trial. By the time I arrived at Bell Labs in 1958, this thing was running nicely.</p> <p class=pp><span style="font-size: 0.7em;">16:36</span> Bell Labs continued to be a major player in operating systems. This was called BESYS. BE was the share abbreviation for Bell Labs. Each company that belonged to Share, which was the IBM users group, ahd a two letter abbreviation. It's hard to imagine taking all the computer users now and giving them a two-letter abbreviation. BESYS went through many generations, up to BESYS 5, I believe. Each one with innovations. IBM delivered a machine, the 7090, in 1960. This machine had interrupts in it, but IBM didn't use them. But BESYS did. And that sent IBM back to the drawing board to make it work. [Laughter]</p> <p class=pp><span style="font-size: 0.7em;">17:48</span> Rob Pike: It also didn't have memory protection.</p> <p class=pp>Doug: It didn't have memory protection either, and a lot of people actually got IBM to put memory protection in the 7090, so that one could leave the operating system resident in the presence of a wild program, an idea that the PC didn't discover until, last year or something like that. [laughter]</p> <p class=pp>Big players then, <a href="http://en.wikipedia.org/wiki/Richard_Hamming">Dick Hamming</a>, a name that I'm sure everybody knows, was sort of the numerical analysis guru, and a seer. He liked to make outrageous predictions. He predicted in 1960, that half of Bell Labs was going to be busy doing something with computers eventually. ?...? exaggerating some ?...? abstract in his thought. He was wrong. Half was a gross underestimate. Dick Hamming retired twenty years ago, and just this June he completed his full twenty years term in the Navy, which entitles him again to retire from the Naval Postgraduate Institute in Monterey. 
Stibitz, incidentally died, I think within the last year. He was doing medical instrumentation at Dartmouth essentially, near the end.</p> <p class=pp><span style="font-size: 0.7em;">20:00</span> Various problems intrigued, besides the numerical problems, which in fact were stock in trade, and were the real justification for buying machines, until at least the '70s I would say. But some non-numerical problems had begun to tickle the palette of the math department. Even G A Campbell got interested in graph theory, the reason being he wanted to think of all the possible ways you could take the three wires and the various parts of the telephone and connect them together, and try permutations to see what you could do about reducing side ?...? by putting things into the various parts of the circuit, and devised every possibly way of connecting the telephone up. And that was sort of the beginning of combinatorics at Bell Labs. John Reardon, a mathematician parlayed this into a major subject. Two problems which are now deemed as computing problems, have intrigued the math department for a very long time, and those are the minimum spanning tree problem, and the wonderfully ?comment about Joe Kruskal, laughter?</p> <p class=pp><span style="font-size: 0.7em;">21:50</span> And in the 50s Bob Prim and Kruskal, who I don't think worked on the Labs at that point, invented algorithms for the minimum spanning tree. Somehow or other, computer scientists usually learn these algorithms, one of the two at least, as Dijkstra's algorithm, but he was a latecomer.</p> <p class=pp>Another pet was the traveling salesman. There's been a long list of people at Bell Labs who played with that: Shen Lin and Ron Graham and David Johnson and dozens more, oh and ?...?. And then another problem is the Steiner minimum spanning tree, where you're allowed to add points to the graph. Every one of these problems grew, actually had a justification in telephone billing. One jurisdiction or another would specify that the way you bill for a private line network was in one jurisdiction by the minimum spanning tree. In another jurisdiction, by the traveling salesman route. NP-completeness wasn't a word in the vocabulary of ?...? [laughter]. And the <a href="http://en.wikipedia.org/wiki/Steiner_tree">Steiner problem</a> came up because customers discovered they could beat the system by inventing offices in the middle of Tennessee that had nothing to do with their business, but they could put the office at a Steiner point and reduce their phone bill by adding to what the service that the Bell System had to give them. So all of these problems actually had some justification in billing besides the fun.</p> <p class=pp><span style="font-size: 0.7em;">24:15</span> Come the 60s, we actually started to hire people for computing per se. I was perhaps the third person who was hired with a Ph.D. to help take care of the computers and I'm told that the then director and head of the math department, Hendrick Bode, had said to his people, "yeah, you can hire this guy, instead of a real mathematician, but what's he gonna be doing in five years?" [laughter]</p> <p class=pp><span style="font-size: 0.7em;">25:02</span> Nevertheless, we started hiring for real in about '67. Computer science got split off from the math department. I had the good fortune to move into the office that I've been in ever since then. Computing began to make, get a personality of its own. One of the interesting people that came to Bell Labs for a while was Hao Wang. 
Is his name well known? [Pause] One nod. Hao Wang was a philosopher and logician, and we got a letter from him in England out of the blue saying "hey you know, can I come and use your computers? I have an idea about theorem proving." There was theorem proving in the air in the late 50s, and it was mostly pretty thin stuff. Obvious that the methods being proposed wouldn't possibly do anything more difficult than solve tic-tac-toe problems by enumeration. Wang had a notion that he could mechanically prove theorems in the style of Whitehead and Russell's great treatise Principia Mathematica in the early patr of the century. He came here, learned how to program in machine language, and took all of Volume I of Principia Mathematica -- if you've ever hefted Principia, well that's about all it's good for, it's a real good door stop. It's really big. But it's theorem after theorem after theorem in propositional calculus. Of course, there's a decision procedure for propositional calculus, but he was proving them more in the style of Whitehead and Russell. And when he finally got them all coded and put them into the computer, he proved the entire contents of this immense book in eight minutes. This was actually a neat accomplishment. Also that was the beginning of all the language theory. We hired people like <a href="http://www1.cs.columbia.edu/~aho/">Al Aho</a> and <a href="http://infolab.stanford.edu/~ullman/">Jeff Ullman</a>, who probed around every possible model of grammars, syntax, and all of the things that are now in the standard undergraduate curriculum, were pretty well nailed down here, on syntax and finite state machines and so on were pretty well nailed down in the 60s. Speaking of finite state machines, in the 50s, both Mealy and Moore, who have two of the well-known models of finite state machines, were here.</p> <p class=pp><span style="font-size: 0.7em;">28:40</span> During the 60s, we undertook an enormous development project in the guise of research, which was <a href="http://www.multicians.org/">MULTICS</a>, and it was the notion of MULTICS was computing was the public utility of the future. Machines were very expensive, and ?indeed? like you don't own your own electric generator, you rely on the power company to do generation for you, and it was seen that this was a good way to do computing -- time sharing -- and it was also recognized that shared data was a very good thing. MIT pioneered this and Bell Labs joined in on the MULTICS project, and this occupied five years of system programming effort, until Bell Labs pulled out, because it turned out that MULTICS was too ambitious for the hardware at the time, and also with 80 people on it was not exactly a research project. 
But, that led to various people who were on the project, in particular <a href="http://en.wikipedia.org/wiki/Ken_Thompson">Ken Thompson</a> -- right there -- to think about how to -- <a href="http://en.wikipedia.org/wiki/Dennis_Ritchie">Dennis Ritchie</a> and Rudd Canaday were in on this too -- to think about how you might make a pleasant operating system with a little less resources.</p> <p class=pp><span style="font-size: 0.7em;">30:30</span> And Ken found -- this is a story that's often been told, so I won't go into very much of unix -- Ken found an old machine cast off in the corner, the <a href="http://en.wikipedia.org/wiki/GE-600_series">PDP-7</a>, and put up this little operating system on it, and we had immense <a href="http://en.wikipedia.org/wiki/GE-600_series">GE635</a> available at the comp center at the time, and I remember as the department head, muscling in to use this little computer to be, to get to be Unix's first user, customer, because it was so much pleasanter to use this tiny machine than it was to use the big and capable machine in the comp center. And of course the rest of the story is known to everybody and has affected all college campuses in the country.</p> <p class=pp><span style="font-size: 0.7em;">31:33</span> Along with the operating system work, there was a fair amount of language work done at Bell Labs. Often curious off-beat languages. One of my favorites was called <a href="http://hopl.murdoch.edu.au/showlanguage.prx?exp=6937&language=BLODI-B">Blodi</a>, B L O D I, a block diagram compiler by Kelly and Vyssotsky. Perhaps the most interesting early uses of computers in the sense of being unexpected, were those that came from the acoustics research department, and what the Blodi compiler was invented in the acoustic research department for doing digital simulations of sample data system. DSPs are classic sample data systems, where instead of passing analog signals around, you pass around streams of numerical values. And Blodi allowed you to say here's a delay unit, here's an amplifier, here's an adder, the standard piece parts for a sample data system, and each one was described on a card, and with description of what it's wired to. It was then compiled into one enormous single straight line loop for one time step. Of course, you had to rearrange the code because some one part of the sample data system would feed another and produce really very efficient 7090 code for simulating sample data systems. By in large, from that time forth, the acoustic department stopped making hardware. It was much easier to do signal processing digitally than previous ways that had been analog. Blodi had an interesting property. It was the only programming language I know where -- this is not my original observation, Vyssotsky said -- where you could take the deck of cards, throw it up the stairs, and pick them up at the bottom of the stairs, feed them into the computer again, and get the same program out. Blodi had two, aside from syntax diagnostics, it did have one diagnostic when it would fail to compile, and that was "somewhere in your system is a loop that consists of all delays or has no delays" and you can imagine how they handled that.</p> <p class=pp><span style="font-size: 0.7em;">35:09</span> Another interesting programming language of the 60s was <a href="http://www.knowltonmosaics.com/">Ken Knowlten</a>'s <a href="http://beflix.com/beflix.php">Beflix</a>. 
This was for making movies on something with resolution kind of comparable to 640x480, really coarse, and the programming notion in here was bugs. You put on your grid a bunch of bugs, and each bug carried along some data as baggage, and then you would do things like cellular automata operations. You could program it or you could kind of let it go by itself. If a red bug is next to a blue bug then it turns into a green bug on the following step and so on. <span style="font-size: 0.7em;">36:28</span> He and Lillian Schwartz made some interesting abstract movies at the time. It also did some interesting picture processing. One wonderful picture of a reclining nude, something about the size of that blackboard over there, all made of pixels about a half inch high each with a different little picture in it, picked out for their density, and so if you looked at it close up it consisted of pickaxes and candles and dogs, and if you looked at it far enough away, it was a <a href="http://blog.the-eg.com/2007/12/03/ken-knowlton-mosaics/">reclining nude</a>. That picture got a lot of play all around the country.</p> <p class=pp>Lorinda Cherry: That was with Leon, wasn't it? That was with <a href="https://en.wikipedia.org/wiki/Leon_Harmon">Leon Harmon</a>.</p> <p class=pp>Doug: Was that Harmon?</p> <p class=pp>Lorinda: ?...?</p> <p class=pp>Doug: Harmon was also an interesting character. He did more things than pictures. I'm glad you reminded me of him. I had him written down here. Harmon was a guy who among other things did a block diagram compiler for writing a handwriting recognition program. I never did understand how his scheme worked, and in fact I guess it didn't work too well. [laughter] It didn't do any production ?things? but it was an absolutely immense sample data circuit for doing handwriting recognition. Harmon's most famous work was trying to estimate the information content in a face. And every one of these pictures which are a cliche now, that show a face digitized very coarsely, go back to Harmon's <a href="https://web.archive.org/web/20080807162812/http://www.doubletakeimages.com/history.htm">first psychological experiments</a>, when he tried to find out how many bits of picture he needed to try to make a face recognizable. He went around and digitized about 256 faces from Bell Labs and did real psychological experiments asking which faces could be distinguished from other ones. I had the good fortune to have one of the most distinguishable faces, and consequently you'll find me in freshman psychology texts through no fault of my own.</p> <p class=pp><span style="font-size: 0.7em;">39:15</span> Another thing going on the 60s was the halting beginning here of interactive computing. And again the credit has to go to the acoustics research department, for good and sufficient reason. They wanted to be able to feed signals into the machine, and look at them, and get them back out. They bought yet another weird architecture machine called the <a href="http://www.piercefuller.com/library/pb250.html">Packard Bell 250</a>, where the memory elements were <a href="http://en.wikipedia.org/wiki/Delay_line_memory">mercury delay lines</a>.</p> <p class=pp>Question: Packard Bell?</p> <p class=pp>Doug: Packard Bell, same one that makes PCs today.</p> <p class=pp><span style="font-size: 0.7em;">40:10</span> They hung this off of the comp center 7090 and put in a scheme for quickly shipping jobs into the job stream on the 7090. 
The Packard Bell was the real-time terminal that you could play with and repair stuff, ?...? off the 7090, get it back, and then you could play it. From that grew some graphics machines also, built by ?...? et al. And it was one of the old graphics machines in fact that Ken picked up to build Unix on.</p> <p class=pp><span style="font-size: 0.7em;">40:55</span> Another thing that went on in the acoustics department was synthetic speech and music. <a href="http://csounds.com/mathews/index.html">Max Mathews</a>, who was the the director of the department has long been interested in computer music. In fact since retirement he spent a lot of time with Pierre Boulez in Paris at a wonderful institute with lots of money simply for making synthetic music. He had a language called Music 5. Synthetic speech or, well first of all simply speech processing was pioneered particularly by <a href="http://en.wikipedia.org/wiki/John_Larry_Kelly,_Jr">John Kelly</a>. I remember my first contact with speech processing. It was customary for computer operators, for the benefit of computer operators, to put a loudspeaker on the low bit of some register on the machine, and normally the operator would just hear kind of white noise. But if you got into a loop, suddenly the machine would scream, and this signal could be used to the operator "oh the machines in a loop. Go stop it and go on to the next job." I remember feeding them an Ackermann's function routine once. [laughter] They were right. It was a silly loop. But anyway. One day, the operators were ?...?. The machine started singing. Out of the blue. &ldquo;Help! I'm caught in a loop.&rdquo;. [laughter] And in a broad Texas accent, which was the recorded voice of John Kelly.</p> <p class=pp><span style="font-size: 0.7em;">43:14</span> However. From there Kelly went on to do some speech synthesis. Of course there's been a lot more speech synthesis work done since, by <span style="font-size: 0.7em;">43:31</span> folks like Cecil Coker, Joe Olive. But they produced a record, which unfortunately I can't play because records are not modern anymore. And everybody got one in the Bell Labs Record, which is a magazine, contained once a record from the acoustics department, with both speech and music and one very famous combination where the computer played and sang "A Bicycle Built For Two".</p> <p class=pp>?...?</p> <p class=pp><span style="font-size: 0.7em;">44:32</span> At the same time as all this stuff is going on here, needless to say computing is going on in the rest of the Labs. it was about early 1960 when the math department lost its monopoly on computing machines and other people started buying them too, but for switching. The first experiments with switching computers were operational in around 1960. They were planned for several years prior to that; essentially as soon as the transistor was invented, the making of electronic rather than electromechanical switching machines was anticipated. Part of the saga of the switching machines is cheap memory. These machines had enormous memories -- thousands of words. [laughter] And it was said that the present worth of each word of memory that programmers saved across the Bell System was something like eleven dollars, as I recall. And it was worthwhile to struggle to save some memory. Also, programs were permanent. You were going to load up the switching machine with switching program and that was going to run. You didn't change it every minute or two. 
And it would be cheaper to put it in read only memory than in core memory. And there was a whole series of wild read-only memories, both tried and built. The first experimental Essex System had a thing called the flying spot store which was large photographic plates with bits on them and CRTs projecting on the plates and you would detect underneath on the photodetector whether the bit was set or not. That was the program store of Essex. The program store of the first ESS systems consisted of twistors, which I actually am not sure I understand to this day, but they consist of iron wire with a copper wire wrapped around them and vice versa. There were also experiments with an IC type memory called the waffle iron. Then there was a period when magnetic bubbles were all the rage. As far as I know, although microelectronics made a lot of memory, most of the memory work at Bell Labs has not had much effect on ?...?. Nice tries though.</p> <p class=pp><span style="font-size: 0.7em;">48:28</span> Another thing that folks began to work on was the application of (and of course, right from the start) computers to data processing. When you owned equipment scattered through every street in the country, and you have a hundred million customers, and you have bills for a hundred million transactions a day, there's really some big data processing going on. And indeed in the early 60s, AT&T was thinking of making its own data processing computers solely for billing. Somehow they pulled out of that, and gave all the technology to IBM, and one piece of that technology went into use in high end equipment called tractor tapes. Inch wide magnetic tapes that would be used for a while.</p> <p class=pp><span style="font-size: 0.7em;">49:50</span> By in large, although Bell Labs has participated until fairly recently in data processing in quite a big way, AT&T never really quite trusted the Labs to do it right because here is where the money is. I can recall one occasion when during strike of temporary employees, a fill-in employee like from the Laboratories and so on, lost a day's billing tape in Chicago. And that was a million dollars. And that's generally speaking the money people did not until fairly recently trust Bell Labs to take good care of money, even though they trusted the Labs very well to make extremely reliable computing equipment for switches. The downtime on switches is still spectacular by any industry standards. The design for the first ones was two hours down in 40 years, and the design was met. Great emphasis on reliability and redundancy, testing.</p> <p class=pp><span style="font-size: 0.7em;">51:35</span> Another branch of computing was for the government. The whole Whippany Laboratories [time check] Whippany, where we took on contracts for the government particularly in the computing era in anti-missile defense, missile defense, and underwater sound. Missile defense was a very impressive undertaking. It was about in the early '63 time frame when it was estimated the amount of computation to do a reasonable job of tracking incoming missiles would be 30 M floating point operations a second. In the day of the Cray that doesn't sound like a great lot, but it's more than your high end PCs can do. And the machines were supposed to be reliable. They designed the machines at Whippany, a twelve-processor multiprocessor, to no specs, enormously rugged, one watt transistors. This thing in real life performed remarkably well. There were sixty-five missile shots, tests across the Pacific Ocean ?...? 
and Lorinda Cherry here actually sat there waiting for them to come in. [laughter] And only a half dozen of them really failed. As a measure of the interest in reliability, one of them failed apparently due to processor error. Two people were assigned to look at the dumps, enormous amounts of telemetry and logging information were taken during these tests, which are truly expensive to run. Two people were assigned to look at the dumps. A year later they had not found the trouble. The team was beefed up. They finally decided that there was a race condition in one circuit. They then realized that this particular kind of race condition had not been tested for in all the simulations. They went back and simulated the entire hardware system to see if its a remote possibility of any similar cases, found twelve of them, and changed the hardware. But to spend over a year looking for a bug is a sign of what reliability meant.</p> <p class=pp><span style="font-size: 0.7em;">54:56</span> Since I'm coming up on the end of an hour, one could go on and on and on,</p> <p class=pp>Crowd: go on, go on. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">55:10</span> Doug: I think I'd like to end up by mentioning a few of the programs that have been written at Bell Labs that I think are most surprising. Of course there are lots of grand programs that have been written.</p> <p class=pp>I already mentioned the block diagram compiler.</p> <p class=pp>Another really remarkable piece of work was <a href="eqn.pdf">eqn</a>, the equation typesetting language, which has been imitated since, by Lorinda Cherry and Brian Kernighan. The notion of taking an auditory syntax, the way people talk about equations, but only talk, this was not borrowed from any written notation before, getting the auditory one down on paper, that was very successful and surprising.</p> <p class=pp>Another of my favorites, and again Lorinda Cherry was in this one, with Bob Morris, was typo. This was a program for finding spelling errors. It didn't know the first thing about spelling. It would read a document, measure its statistics, and print out the words of the document in increasing order of what it thought the likelihood of that word having come from the same statistical source as the document. The words that did not come from the statistical source of the document were likely to be typos, and now I mean typos as distinct from spelling errors, where you actually hit the wrong key. Those tend to be off the wall, whereas phonetic spelling errors you'll never find. And this worked remarkably well. Typing errors would come right up to the top of the list. A really really neat program.</p> <p class=pp><span style="font-size: 0.7em;">57:50</span> Another one of my favorites was by Brenda Baker called <a href="http://doi.acm.org/10.1145/800168.811545">struct</a>, which took Fortran programs and converted them into a structured programming language called Ratfor, which was Fortran with C syntax. This seemed like a possible undertaking, like something you do by the seat of the pants and you get something out. In fact, folks at Lockheed had done things like that before. But Brenda managed to find theorems that said there's really only one form, there's a canonical form into which you can structure a Fortran program, and she did this. 
It took your Fortran program, completely mashed it, put it out perhaps in almost certainly a different order than it was in Fortran connected by GOTOs, without any GOTOs, and the really remarkable thing was that authors of the program who clearly knew the way they wrote it in the first place, preferred it after it had been rearranged by Brendan. I was astonished at the outcome of that project.</p> <p class=pp><span style="font-size: 0.7em;">59:19</span> Another first that happened around here was by Fred Grampp, who got interested in computer security. One day he decided he would make a program for sniffing the security arrangements on a computer, as a service: Fred would never do anything crooked. [laughter] This particular program did a remarkable job, and founded a whole minor industry within the company. A department was set up to take this idea and parlay it, and indeed ever since there has been some improvement in the way computer centers are managed, at least until we got Berkeley Unix.</p> <p class=pp><span style="font-size: 0.7em;">60:24</span> And the last interesting program that I have time to mention is one by <a href="http://www.cs.jhu.edu/~kchurch/">Ken Church</a>. He was dealing with -- text processing has always been a continuing ?...? of the research, and in some sense it has an application to our business because we're handling speech, but he got into consulting with the department in North Carolina that has to translate manuals. There are millions of pages of manuals in the Bell System and its successors, and ever since we've gone global, these things had to get translated into many languages.</p> <p class=pp><span style="font-size: 0.7em;">61:28</span> To help in this, he was making tools which would put up on the screen, graphed on the screen quickly a piece of text and its translation, because a translator, particularly a technical translator, wants to know, the last time we mentioned this word how was it translated. You don't want to be creative in translating technical text. You'd like to be able to go back into the archives and pull up examples of translated text. And the neat thing here is the idea for how do you align texts in two languages. You've got the original, you've got the translated one, how do you bring up on the screen, the two sentences that go together? And the following scam worked beautifully. This is on western languages. <span style="font-size: 0.7em;">62:33</span> Simply look for common four letter tetragrams, four letter combinations between the two and as best as you can, line them up as nearly linearly with the lengths of the two types as possible. And this <a href="church-tetragram.pdf">very simple idea</a> works like storm. Something for nothing. I like that.</p> <p class=pp><span style="font-size: 0.7em;">63:10</span> The last thing is one slogan that sort of got started with Unix and is just rife within the industry now. Software tools. We were making software tools in Unix before we knew we were, just like the Molière character was amazed at discovering he'd been speaking prose all his life. [laughter] But then <a href="http://www.amazon.com/-/dp/020103669X">Kernighan and Plauger</a> came along and christened what was going on, making simple generally useful and compositional programs to do one thing and do it well and to fit together. They called it software tools, made a book, wrote a book, and this notion now is abroad in the industry. And it really did begin all up in the little attic room where you [points?] 
sat for many years writing up here.</p> <p class=pp> Oh I forgot to. I haven't used any slides. I've brought some, but I don't like looking at bullets and you wouldn't either, and I forgot to show you the one exhibit I brought, which I borrowed from Bob Kurshan. When Bell Labs was founded, it had of course some calculating machines, and it had one wonderful computer. This. That was bought in 1918. There's almost no other computing equipment from any time prior to ten years ago that still exists in Bell Labs. This is an <a href="http://infolab.stanford.edu/pub/voy/museum/pictures/display/2-5-Mechanical.html">integraph</a>. It has two styluses. You trace a curve on a piece of paper with one stylus and the other stylus draws the indefinite integral here. There was somebody in the math department who gave this service to the whole company, with about 24 hours turnaround time, calculating integrals. Our recent vice president Arno Penzias actually did, he calculated integrals differently, with a different background. He had a chemical balance, and he cut the curves out of the paper and weighed them. This was bought in 1918, so it's eighty years old. It used to be shiny metal, it's a little bit rusty now. But it still works.</p> <p class=pp><span style="font-size: 0.7em;">66:30</span> Well, that's a once over lightly of a whole lot of things that have gone on at Bell Labs. It's just such a fun place that one I said I just could go on and on. If you're interested, there actually is a history written. This is only one of about six volumes, <a href="http://www.amazon.com/gp/product/0932764061">this</a> is the one that has the mathematical computer sciences, the kind of things that I've mostly talked about here. A few people have copies of them. For some reason, the AT&T publishing house thinks that because they're history they're obsolete, and they stopped printing them. [laughter]</p> <p class=pp>Thank you, and that's all.</p></p> Using Uninitialized Memory for Fun and Profit tag:research.swtch.com,2012:research.swtch.com/sparse 2008-03-14T00:00:00-04:00 2008-03-14T00:00:00-04:00 An unusual but very useful data structure <p><p class=lp> This is the story of a clever trick that's been around for at least 35 years, in which array values can be left uninitialized and then read during normal operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time. </p> <p class=pp> Alfred Aho, John Hopcroft, and Jeffrey Ullman's 1974 book <i>The Design and Analysis of Computer Algorithms</i> hints at the trick in an exercise (Chapter 2, exercise 2.12): </p> <blockquote> Develop a technique to initialize an entry of a matrix to zero the first time it is accessed, thereby eliminating the <i>O</i>(||<i>V</i>||<sup>2</sup>) time to initialize an adjacency matrix. </blockquote> <p class=lp> Jon Bentley's 1986 book <a href="http://www.cs.bell-labs.com/cm/cs/pearls/"><i>Programming Pearls</i></a> expands on the exercise (Column 1, exercise 8; <a href="http://www.cs.bell-labs.com/cm/cs/pearls/sec016.html">exercise 9</a> in the Second Edition): </p> <blockquote> One problem with trading more space for less time is that initializing the space can itself take a great deal of time. 
Show how to circumvent this problem by designing a technique to initialize an entry of a vector to zero the first time it is accessed. Your scheme should use constant time for initialization and each vector access; you may use extra space proportional to the size of the vector. Because this method reduces initialization time by using even more space, it should be considered only when space is cheap, time is dear, and the vector is sparse. </blockquote> <p class=lp> Aho, Hopcroft, and Ullman's exercise talks about a matrix and Bentley's exercise talks about a vector, but for now let's consider just a simple set of integers. </p> <p class=pp> One popular representation of a set of <i>n</i> integers ranging from 0 to <i>m</i> is a bit vector, with 1 bits at the positions corresponding to the integers in the set. Adding a new integer to the set, removing an integer from the set, and checking whether a particular integer is in the set are all very fast constant-time operations (just a few bit operations each). Unfortunately, two important operations are slow: iterating over all the elements in the set takes time <i>O</i>(<i>m</i>), as does clearing the set. If the common case is that <i>m</i> is much larger than <i>n</i> (that is, the set is only sparsely populated) and iterating or clearing the set happens frequently, then it could be better to use a representation that makes those operations more efficient. That's where the trick comes in. </p> <p class=pp> Preston Briggs and Linda Torczon's 1993 paper, &ldquo;<a href="http://citeseer.ist.psu.edu/briggs93efficient.html"><b>An Efficient Representation for Sparse Sets</b></a>,&rdquo; describes the trick in detail. Their solution represents the sparse set using an integer array named <code>dense</code> and an integer <code>n</code> that counts the number of elements in <code>dense</code>. The <i>dense</i> array is simply a packed list of the elements in the set, stored in order of insertion. If the set contains the elements 5, 1, and 4, then <code>n = 3</code> and <code>dense[0] = 5</code>, <code>dense[1] = 1</code>, <code>dense[2] = 4</code>: </p> <center> <img src="http://research.swtch.com/sparse0.png" /> </center> <p class=pp> Together <code>n</code> and <code>dense</code> are enough information to reconstruct the set, but this representation is not very fast. To make it fast, Briggs and Torczon add a second array named <code>sparse</code> which maps integers to their indices in <code>dense</code>. Continuing the example, <code>sparse[5] = 0</code>, <code>sparse[1] = 1</code>, <code>sparse[4] = 2</code>. Essentially, the set is a pair of arrays that point at each other: </p> <center> <img src="http://research.swtch.com/sparse0b.png" /> </center> <p class=pp> Adding a member to the set requires updating both of these arrays: </p> <pre class=indent> add-member(i): &nbsp;&nbsp;&nbsp;&nbsp;dense[n] = i &nbsp;&nbsp;&nbsp;&nbsp;sparse[i] = n &nbsp;&nbsp;&nbsp;&nbsp;n++ </pre> <p class=lp> It's not as efficient as flipping a bit in a bit vector, but it's still very fast and constant time. 
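Here, for concreteness, is a Go sketch of my own of the representation and <code>add-member</code> as described so far (the names are mine, not Briggs and Torczon's); the remaining operations follow the pseudocode below just as directly: </p> <pre class=indent>
package main

import "fmt"

// SparseSet is the Briggs-Torczon representation: dense[0:n] lists the
// members in insertion order, and sparse[i] records where i lives in
// dense (meaningful only if i is actually a member).
type SparseSet struct {
	dense  []int
	sparse []int
	n      int
}

// NewSparseSet makes a set for values in [0, m).
func NewSparseSet(m int) *SparseSet {
	return &SparseSet{dense: make([]int, m), sparse: make([]int, m)}
}

// Add mirrors add-member above; it assumes 0 &lt;= i &lt; m and that i is
// not already a member.
func (s *SparseSet) Add(i int) {
	s.dense[s.n] = i
	s.sparse[i] = s.n
	s.n++
}

func main() {
	s := NewSparseSet(10)
	for _, v := range []int{5, 1, 4} {
		s.Add(v)
	}
	fmt.Println(s.dense[:s.n]) // prints [5 1 4], in insertion order
}
</pre> <p class=lp> One caveat about the sketch: Go's <code>make</code> zeroes both slices, so a Go version gets the constant-time clear described below but not the skipped initialization that the trick was invented for.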
</p> <p class=pp> To check whether <code>i</code> is in the set, you verify that the two arrays point at each other for that element: </p> <pre class=indent> is-member(i): &nbsp;&nbsp;&nbsp;&nbsp;return sparse[i] &lt; n && dense[sparse[i]] == i </pre> <p class=lp> If <code>i</code> is not in the set, then <i>it doesn't matter what <code>sparse[i]</code> is set to</i>: either <code>sparse[i]</code> will be bigger than <code>n</code> or it will point at a value in <code>dense</code> that doesn't point back at it. Either way, we're not fooled. For example, suppose <code>sparse</code> actually looks like: </p> <center> <img src="http://research.swtch.com/sparse1.png" /> </center> <p class=lp> <code>Is-member</code> knows to ignore members of sparse that point past <code>n</code> or that point at cells in <code>dense</code> that don't point back, ignoring the grayed out entries: <center> <img src="http://research.swtch.com/sparse2.png" /> </center> <p class=pp> Notice what just happened: <code>sparse</code> can have <i>any arbitrary values</i> in the positions for integers not in the set, those values actually get used during membership tests, and yet the membership test behaves correctly! (This would drive <a href="http://valgrind.org/">valgrind</a> nuts.) </p> <p class=pp> Clearing the set can be done in constant time: </p> <pre class=indent> clear-set(): &nbsp;&nbsp;&nbsp;&nbsp;n = 0 </pre> <p class=lp> Zeroing <code>n</code> effectively clears <code>dense</code> (the code only ever accesses entries in dense with indices less than <code>n</code>), and <code>sparse</code> can be uninitialized, so there's no need to clear out the old values. </p> <p class=pp> This sparse set representation has one more trick up its sleeve: the <code>dense</code> array allows an efficient implementation of set iteration. </p> <pre class=indent> iterate(): &nbsp;&nbsp;&nbsp;&nbsp;for(i=0; i&lt;n; i++) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;yield dense[i] </pre> <p class=pp> Let's compare the run times of a bit vector implementation against the sparse set: </p> <center> <table> <tr> <td><i>Operation</i> <td align=center width=10> <td align=center><i>Bit Vector</i> <td align=center width=10> <td align=center><i>Sparse set</i> </tr> <tr> <td>is-member <td> <td align=center><i>O</i>(1) <td> <td align=center><i>O</i>(1) </tr> <tr> <td>add-member <td> <td align=center><i>O</i>(1) <td> <td align=center><i>O</i>(1) </tr> <tr> <td>clear-set <td><td align=center><i>O</i>(<i>m</i>) <td><td align=center><i>O</i>(1) </tr> <tr> <td>iterate <td><td align=center><i>O</i>(<i>m</i>) <td><td align=center><i>O</i>(<i>n</i>) </tr> </table> </center> <p class=lp> The sparse set is as fast or faster than bit vectors for every operation. The only problem is the space cost: two words replace each bit. Still, there are times when the speed differences are enough to balance the added memory cost. Briggs and Torczon point out that liveness sets used during register allocation inside a compiler are usually small and are cleared very frequently, making sparse sets the representation of choice. </p> <p class=pp> Another situation where sparse sets are the better choice is work queue-based graph traversal algorithms. Iteration over sparse sets visits elements in the order they were inserted (above, 5, 1, 4), so that new entries inserted during the iteration will be visited later in the same iteration. 
In contrast, iteration over bit vectors visits elements in integer order (1, 4, 5), so that new elements inserted during traversal might be missed, requiring repeated iterations. </p> <p class=pp> Returning to the original exercises, it is trivial to change the set into a vector (or matrix) by making <code>dense</code> an array of index-value pairs instead of just indices. Alternatively, one might add the value to the <code>sparse</code> array or to a new array. The relative space overhead isn't as bad if you would have been storing values anyway. </p> <p class=pp> Briggs and Torczon's paper implements additional set operations and examines performance speedups from using sparse sets inside a real compiler. </p></p> Play Tic-Tac-Toe with Knuth tag:research.swtch.com,2012:research.swtch.com/tictactoe 2008-01-25T00:00:00-05:00 2008-01-25T00:00:00-05:00 The only winning move is not to play. <p><p class=lp>Section 7.1.2 of the <b><a href="http://www-cs-faculty.stanford.edu/~knuth/taocp.html#vol4">Volume 4 pre-fascicle 0A</a></b> of Donald Knuth's <i>The Art of Computer Programming</i> is titled &#8220;Boolean Evaluation.&#8221; In it, Knuth considers the construction of a set of nine boolean functions telling the correct next move in an optimal game of tic-tac-toe. In a footnote, Knuth tells this story:</p> <blockquote><p class=lp>This setup is based on an exhibit from the early 1950s at the Museum of Science and Industry in Chicago, where the author was first introduced to the magic of switching circuits. The machine in Chicago, designed by researchers at Bell Telephone Laboratories, allowed me to go first; yet I soon discovered there was no way to defeat it. Therefore I decided to move as stupidly as possible, hoping that the designers had not anticipated such bizarre behavior. In fact I allowed the machine to reach a position where it had two winning moves; and it seized <i>both</i> of them! Moving twice is of course a flagrant violation of the rules, so I had won a moral victory even though the machine had announced that I had lost.</p></blockquote> <p class=lp> That story alone is fairly amusing. But turning the page, the reader finds a quotation from Charles Babbage's <i><a href="http://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp36384">Passages from the Life of a Philosopher</a></i>, published in 1864:</p> <blockquote><p class=lp>I commenced an examination of a game called &#8220;tit-tat-to&#8221; ... to ascertain what number of combinations were required for all the possible variety of moves and situations. I found this to be comparatively insignificant. ... A difficulty, however, arose of a novel kind. When the automaton had to move, it might occur that there were two different moves, each equally conducive to his winning the game. ... Unless, also, some provision were made, the machine would attempt two contradictory motions.</p></blockquote> <p class=lp> The only real winning move is not to play.</p></p> Crabs, the bitmap terror! tag:research.swtch.com,2012:research.swtch.com/crabs 2008-01-09T00:00:00-05:00 2008-01-09T00:00:00-05:00 A destructive, pointless violation of the rules <p><p class=lp>Today, window systems seem as inevitable as hierarchical file systems, a fundamental building block of computer systems. But it wasn't always that way.
This paper could only have been written in the beginning, when everything about user interfaces was up for grabs.</p> <blockquote><p class=lp>A bitmap screen is a graphic universe where windows, cursors and icons live in harmony, cooperating with each other to achieve functionality and esthetics. A lot of effort goes into making this universe consistent, the basic law being that every window is a self contained, protected world. In particular, (1) a window shall not be affected by the internal activities of another window. (2) A window shall not be affected by activities of the window system not concerning it directly, i.e. (2.1) it shall not notice being obscured (partially or totally) by other windows or obscuring (partially or totally) other windows, (2.2) it shall not see the <i>image</i> of the cursor sliding on its surface (it can only ask for its position).</p> <p class=pp> Of course it is difficult to resist the temptation to break these rules. Violations can be destructive or non-destructive, useful or pointless. Useful non-destructive violations include programs printing out an image of the screen, or magnifying part of the screen in a <i>lens</i> window. Useful destructive violations are represented by the <i>pen</i> program, which allows one to scribble on the screen. Pointless non-destructive violations include a magnet program, where a moving picture of a magnet attracts the cursor, so that one has to continuously pull away from it to keep working. The first pointless, destructive program we wrote was <i>crabs</i>.</p> </blockquote> <p class=lp>As the crabs walk over the screen, they leave gray behind, &#8220;erasing&#8221; the apps underfoot:</p> <blockquote><img src="http://research.swtch.com/crabs1.png"> </blockquote> <p class=lp> For the rest of the story, see Luca Cardelli's &#8220;<a style="font-weight: bold;" href="http://lucacardelli.name/Papers/Crabs.pdf">Crabs: the bitmap terror!</a>&#8221; (6.7MB). Additional details in &#8220;<a href="http://lucacardelli.name/Papers/Crabs%20%28History%20and%20Screen%20Dumps%29.pdf">Crabs (History and Screen Dumps)</a>&#8221; (57.1MB).</p></p>