# Mounttable on Cloud Bigtable
This package and its sub-packages contain a [mounttable server] implementation
that uses Google's [Cloud Bigtable] service for storage. It is fast and scales
to millions of nodes and, with enough replicas, millions of requests per second.
## Schema
Bigtable is not a relational database. Each table is indexed by a single row
key, and operations are atomic only at the row level. There is no way to mutate
multiple rows together atomically.
See the [Overview of Cloud Bigtable] for more information.
Our table has one row per node. The row key is a hash of the node name followed
by the name itself. This spreads rows evenly across all tablet servers with no
risk of name collision.
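As a concrete illustration, here is a minimal Go sketch of this key scheme,
assuming a 4-byte hash rendered as hex (matching the example keys below); the
actual hash function used by the implementation is not specified here:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rowKey builds the Bigtable row key for a node: the short hash prefix
// spreads rows evenly across tablet servers, and appending the full name
// rules out collisions. SHA-256 truncated to 4 bytes is an assumption
// made only for this sketch.
func rowKey(name string) string {
	sum := sha256.Sum256([]byte(name))
	return hex.EncodeToString(sum[:4]) + name
}

func main() {
	fmt.Println(rowKey("/foo"))     // 8 hex characters followed by "/foo"
	fmt.Println(rowKey("/foo/bar")) // unrelated prefix: keys don't sort by path
}
```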
The table has four column families:
* Metadata `m`: used to store information about the row:
  * ID `i`: A 4-byte ID for the node. It doesn't have to be globally unique.
  * Version `v`: The version changes every time the row is mutated. It is
    used to detect conflicts related to concurrent access.
  * Creator `c`: The cell's value is the name of the node's creator, and its
    timestamp is the node's creation time.
  * Sticky `s`: When this column is present, the node is not automatically
    garbage collected.
* Permissions `p`: The [access.Permissions] of the node.
* Servers `s`: Each mounted server has its own column. The column name is the
  server's address, the value contains the [mount flags], and the timestamp is
  the mount deadline.
* Children `c`: Each child has its own column. The column name is the name of
  the child, without the path. The timestamp is the child's creation time.
Example:
| Key | ID | Version | Creator | Sticky | Permissions | Mounted Server... | Child... |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 540f1a56/ | id1 | 54321 | user | 1 | {"Admin":... | | (id2)foo |
| 1234abcd/foo | id2 | 123 | user | | {"Admin":... | | (id3)bar |
| 46d523e3/foo/bar | id3 | 5436 | user | | {"Admin":... | /example.com:123 (deadline) | |
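To show how this layout is consumed, the sketch below reads one node's row with
the official `cloud.google.com/go/bigtable` client and walks the `c` and `s`
families; the package name and the way the table handle is obtained are
hypothetical:

```go
package mtsketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

// printNode reads a node's row and lists its children and mounted servers,
// following the column layout above. The caller is assumed to have opened
// the node table and computed the row key as sketched earlier.
func printNode(ctx context.Context, tbl *bigtable.Table, key string) error {
	row, err := tbl.ReadRow(ctx, key)
	if err != nil {
		return err
	}
	for _, item := range row["c"] { // one column per child, named after the child
		fmt.Printf("child %q, created %v\n", item.Column[len("c:"):], item.Timestamp.Time())
	}
	for _, item := range row["s"] { // one column per server; timestamp is the mount deadline
		fmt.Printf("server %q, deadline %v, flags %q\n",
			item.Column[len("s:"):], item.Timestamp.Time(), item.Value)
	}
	return nil
}
```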
Counters are stored in another table, one row with one column per counter.
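Counter updates map naturally onto Bigtable's atomic read-modify-write; in the
sketch below (same file as above), the row key and column family are assumed
names, since only "one row with one column per counter" is specified:

```go
// incrementCounter atomically adds delta to one counter cell in the separate
// counters table. The "counters" row key and the "c" family are hypothetical
// names used only for this sketch.
func incrementCounter(ctx context.Context, tbl *bigtable.Table, name string, delta int64) error {
	rmw := bigtable.NewReadModifyWrite()
	rmw.Increment("c", name, delta)
	_, err := tbl.ApplyReadModifyWrite(ctx, "counters", rmw)
	return err
}
```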
## Mutations
All operations use optimistic concurrency control. If a conflicting
change happens during a mutation, the whole operation is restarted.
* Mutation on N:
  * Get node N.
  * Check the caller's permissions.
  * Apply the mutation on N if the node version hasn't changed.
If the node version changed, it means that another mutation was applied between
the time when we retrieved the node and the time when we tried to apply ours.
When that happens, we restart the whole operation, starting with retrieving the
node again.
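The loop can be sketched with the Go client's server-side conditional
mutations, which make the version check and the write atomic; the decimal
version encoding and the placement of the permission check are assumptions for
this sketch:

```go
package mtsketch

import (
	"context"
	"regexp"
	"strconv"
	"time"

	"cloud.google.com/go/bigtable"
)

// applyIfUnchanged retries mutate until it applies without a conflicting
// change. The version cell m:v acts as the optimistic lock: the mutation
// runs server-side only if the cell still holds the value we read;
// otherwise we restart from the read.
func applyIfUnchanged(ctx context.Context, tbl *bigtable.Table, key string, mutate func(*bigtable.Mutation)) error {
	for {
		row, err := tbl.ReadRow(ctx, key)
		if err != nil {
			return err
		}
		var version []byte
		for _, item := range row["m"] {
			if item.Column == "m:v" {
				version = item.Value
			}
		}
		// (check the caller's permissions against row["p"] here)

		mut := bigtable.NewMutation()
		mutate(mut)
		// Bump the version as part of the same atomic row mutation.
		mut.Set("m", "v", bigtable.Now(),
			[]byte(strconv.FormatInt(time.Now().UnixNano(), 10)))

		cond := bigtable.ChainFilters(
			bigtable.FamilyFilter("m"),
			bigtable.ColumnFilter("v"),
			bigtable.LatestNFilter(1),
			bigtable.ValueFilter(regexp.QuoteMeta(string(version))),
		)
		var matched bool
		err = tbl.Apply(ctx, key, bigtable.NewCondMutation(cond, mut, nil),
			bigtable.GetCondMutationResult(&matched))
		if err != nil {
			return err
		}
		if matched {
			return nil // no concurrent change: our mutation was applied
		}
		// The version changed under us: restart the whole operation.
	}
}
```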
### Mounting / Unmounting a server
Mounting a server consists of adding a new cell to the node's row. The column
family is `s`, the column name is the address of the server, the timestamp is
the mount deadline, and the value contains the [mount flags].
Unmounting a server consists of deleting the server's column.
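In client terms these are single-cell row mutations; a sketch, building on the
helpers above, with the flag encoding left opaque:

```go
// mountMutation writes the server's cell in family "s": the column name is
// the server address, the timestamp is the mount deadline, and the value
// holds the encoded mount flags (encoding not shown here).
func mountMutation(server string, flags []byte, deadline time.Time) *bigtable.Mutation {
	mut := bigtable.NewMutation()
	mut.Set("s", server, bigtable.Time(deadline), flags)
	return mut
}

// unmountMutation deletes the server's column from family "s".
func unmountMutation(server string) *bigtable.Mutation {
	mut := bigtable.NewMutation()
	mut.DeleteCellsInColumn("s", server)
	return mut
}
```

Either mutation would then be applied through the version-checked retry loop
from the previous section.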
### Adding / Removing a child node
Adding or removing a node requires two mutations: one on the parent, one on the
child.
When adding a node, we first add it to the parent, and then create a new row
with the same timestamp.
When deleting a node, we first delete the row, and then delete the child column
on the parent.
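A sketch of the write ordering for adding a child, reusing the
`applyIfUnchanged` helper from the Mutations section; the ID generation and the
exact cell contents are assumptions:

```go
// addChild performs the two row mutations in the order described above:
// parent first, then the child row, reusing one timestamp for both. If the
// process dies in between, the parent is left with a dangling child column,
// which the paragraphs below explain is recoverable.
func addChild(ctx context.Context, tbl *bigtable.Table, parentKey, name, childKey string, id []byte) error {
	now := bigtable.Now()
	// 1. Add the child column (named after the child) on the parent.
	if err := applyIfUnchanged(ctx, tbl, parentKey, func(mut *bigtable.Mutation) {
		mut.Set("c", name, now, id)
	}); err != nil {
		return err
	}
	// 2. Create the child row with the same timestamp. A plain Apply is used
	// here because the row doesn't exist yet; the real server presumably
	// also guards against concurrent creation.
	mut := bigtable.NewMutation()
	mut.Set("m", "i", now, id)
	return tbl.Apply(ctx, childKey, mut)
}
```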
If the server process dies between the two mutations, it will leave the parent
with a reference to a child row that doesn't exist. As a consequence, the parent
will never be seen as "empty" and will not be automatically garbage collected.
This will be corrected when:
* the child is re-created, or
* the parent is forcibly deleted.
## Hot rows & caching
Some nodes are expected to be accessed significantly more than others, e.g. the
root node and its immediate children are traversed more often than nodes that
are further down the tree. The Bigtable rows associated with these nodes are
"hotter", which can lead to traffic imbalance and poor performance.
This problem is alleviated with a small cache in the Bigtable client.
High-frequency or concurrent requests for the same rows can be bundled together
to reduce both latency and Bigtable load at the same time.
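The cache itself isn't described here, but the bundling idea can be sketched
with `golang.org/x/sync/singleflight`, which collapses concurrent lookups of
the same key into a single Bigtable read (the real client-side cache may work
differently):

```go
package hotcache

import (
	"context"

	"cloud.google.com/go/bigtable"
	"golang.org/x/sync/singleflight"
)

// bundlingReader collapses concurrent reads of the same row into one
// Bigtable request. A real cache would also keep results for a short TTL
// to absorb high-frequency sequential reads of hot rows.
type bundlingReader struct {
	tbl *bigtable.Table
	g   singleflight.Group
}

func (b *bundlingReader) readRow(ctx context.Context, key string) (bigtable.Row, error) {
	v, err, _ := b.g.Do(key, func() (interface{}, error) {
		return b.tbl.ReadRow(ctx, key)
	})
	if err != nil {
		return nil, err
	}
	return v.(bigtable.Row), nil
}
```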
## Opportunistic garbage collection
A node can be garbage-collected when it has no children, no mounted servers, and
hasn't been marked as _sticky_. A node is _sticky_ when someone explicitly
called [SetPermissions] on it.
Garbage collection happens opportunistically: when a mounttable server, while
processing a request, accesses a node that is eligible for garbage collection,
it removes that node before the ongoing request completes.
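The eligibility test falls directly out of the schema; a sketch over a row that
has already been read (same file as the earlier sketches; column names as
above, expired server cells assumed to have been cleaned up already):

```go
// eligibleForGC reports whether a node's row may be garbage-collected:
// no children, no mounted servers, and no sticky marker in m:s.
func eligibleForGC(row bigtable.Row) bool {
	if len(row["c"]) > 0 || len(row["s"]) > 0 {
		return false // still has children or mounted servers
	}
	for _, item := range row["m"] {
		if item.Column == "m:s" {
			return false // sticky: SetPermissions was explicitly called
		}
	}
	return true
}
```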
[Cloud Bigtable]: https://cloud.google.com/bigtable/docs/
[Overview of Cloud Bigtable]: https://cloud.google.com/bigtable/docs/api-overview
[mounttable server]: https://github.com/vanadium/go.v23/blob/master/services/mounttable/service.vdl
[SetPermissions]: https://github.com/vanadium/go.v23/blob/master/services/permissions/service.vdl#L58
[access.Permissions]: https://github.com/vanadium/go.v23/blob/master/security/access/types.vdl#L130
[mount flags]: https://github.com/vanadium/go.v23/blob/master/naming/types.vdl#L9