Mounttable on Cloud Bigtable

This package and its sub-packages contain a mounttable server implementation that uses Google's Cloud Bigtable service for storage. It is fast and scalable to millions of nodes and, with enough replicas, millions of requests per second.

Schema

Bigtable is not a relational database. Each table has only one key, and all operations are atomic only at the row level. There is no way to mutate multiple rows together atomically.

See the Overview of Cloud Bigtable for more information.

Our table has one row per node. The row key is a hash of the node name followed by the name itself. This spreads rows evenly across all tablet servers with no risk of name collision.

The table has three column families:

Metadata m: used to store information about the row:
- Version v: The version changes every time the row is mutated. It is used to detect conflicts related to concurrent access.
- Creator c: The cell's value is the name of its creator and its timestamp is the node creation time.
- Sticky s: When this column is present, the node is not automatically garbage collected.
- Permissions p: The access.Permissions of the node.
Servers s: Each mounted server has its own column. The column name is the server's address. The value contains the mount flags. The timestamp is the mount deadline.
Children c: Each child has its own column. The column name is the name of the child, without the path. The timestamp is the child creation time.

Example:

Key	Version	Creator	Sticky	Permissions	Mounted Server...	Child...
540f1a56/	54321	user (ts)	1	{“Admin”:...		foo (ts1)
1234abcd/foo	123	user (ts1)		{“Admin”:...		bar (ts2)
46d523e3/foo/bar	5436	user (ts2)		{“Admin”:...	/example.com:123 (deadline)

Mutations

All operations use optimistic concurrency control. If a conflicting change happens during a mutation, the whole operation is restarted.

Mutation on N
- Get Node N
- Check caller's permissions
- Apply mutation on N if node version hasn't changed

If the node version changed, it means that another mutation was applied between the time when we retrieved the node and when we tried to apply our mutation. When that happens, we restart to whole operation, starting with retrieving the node again.

Mounting / Unmounting a server

Mounting a server consists of adding a new cell to the node's row. The column family is s, the column name is the address of the server, the timestamp is the mount deadline, and the value contains the mount flags.

Unmounting a server consists of deleting the server's column.

Adding / Removing a child node

Adding or removing a node requires two mutations: one on the parent, one on the child.

When adding a node, we first add it to the parent, and then create a new row with the same timestamp.

When deleting a node, we first delete the row, and then delete the child column on the parent.

If the server process dies between the two mutations, it will leave the parent with a reference to a child row that doesn't exist. As a consequence, the parent will never be seen as “empty” and will not be automatically garbage collected. This will be corrected when:

the child is re-created, or
the parent is forcibly deleted.

Hot rows & caching

Some nodes are expected to be accessed significantly more than others, e.g. the root node and its immediate children are traversed more often than nodes that are further down the tree. The bigtable rows associated with these nodes are “hotter” which can lead to traffic imbalance and poor performance.

This problem is alleviated with a small cache in the bigtable client. High frequency or concurrent requests for the same rows can be bundled together to reduce both latency and bigtable load at the same time.

Opportunistic garbage collection

A node can be garbage-collected when it has no children, no mounted servers, and hasn't been marked as sticky. A node is sticky when someone explicitly called SetPermissions on it.

The garbage collection happens opportunistically. When a mounttable server accessed a node that is eligible for garbage collection while processing a request, this node is removed before the ongoing request completes.