# Mounttable on Cloud Bigtable
This package and its sub-packages contain a [mounttable server] implementation
that uses Google's [Cloud Bigtable] service for storage. It is fast and scales
to millions of nodes and, with enough replicas, millions of requests per second.
## Schema
Bigtable is not a relational database. Each table is indexed by a single row
key, and operations are atomic only at the row level. There is no way to mutate
multiple rows together atomically.
See the [Overview of Cloud Bigtable] for more information.
Our table has one row per node. The row key is a hash of the node name followed
by the name itself. This spreads rows evenly across all tablet servers with no
risk of name collision.
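As a concrete illustration, here is a minimal Go sketch of this key scheme,
assuming a 4-byte hash rendered as hex (matching the example keys below); the
actual hash function used by the implementation is not specified here:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rowKey builds the Bigtable row key for a node: the short hash prefix
// spreads rows evenly across tablet servers, and appending the full name
// rules out collisions. SHA-256 truncated to 4 bytes is an assumption
// made only for this sketch.
func rowKey(name string) string {
	sum := sha256.Sum256([]byte(name))
	return hex.EncodeToString(sum[:4]) + name
}

func main() {
	fmt.Println(rowKey("/foo"))     // 8 hex characters followed by "/foo"
	fmt.Println(rowKey("/foo/bar")) // unrelated prefix: keys don't sort by path
}
```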
The table has four column families:
* Metadata `m`: used to store information about the row:
  * ID `i`: A 4-byte ID for the node. It doesn't have to be globally unique.
  * Version `v`: The version changes every time the row is mutated. It is
    used to detect conflicts related to concurrent access.
  * Creator `c`: The cell's value is the name of the node's creator, and its
    timestamp is the node's creation time.
  * Sticky `s`: When this column is present, the node is not automatically
    garbage collected.
* Permissions `p`: The [access.Permissions] of the node.
* Servers `s`: Each mounted server has its own column. The column name is the
  server's address, the value contains the [mount flags], and the timestamp is
  the mount deadline.
* Children `c`: Each child has its own column. The column name is the name of
  the child, without the path. The timestamp is the child's creation time.
Example:
| Key | ID | Version | Creator | Sticky | Permissions | Mounted Server... | Child... |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 540f1a56/ | id1 | 54321 | user | 1 | {"Admin":... | | (id2)foo |
| 1234abcd/foo | id2 | 123 | user | | {"Admin":... | | (id3)bar |
| 46d523e3/foo/bar | id3 | 5436 | user | | {"Admin":... | /example.com:123 (deadline) | |
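To show how this layout is consumed, the sketch below reads one node's row with
the official `cloud.google.com/go/bigtable` client and walks the `c` and `s`
families; the package name and the way the table handle is obtained are
hypothetical:

```go
package mtsketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

// printNode reads a node's row and lists its children and mounted servers,
// following the column layout above. The caller is assumed to have opened
// the node table and computed the row key as sketched earlier.
func printNode(ctx context.Context, tbl *bigtable.Table, key string) error {
	row, err := tbl.ReadRow(ctx, key)
	if err != nil {
		return err
	}
	for _, item := range row["c"] { // one column per child, named after the child
		fmt.Printf("child %q, created %v\n", item.Column[len("c:"):], item.Timestamp.Time())
	}
	for _, item := range row["s"] { // one column per server; timestamp is the mount deadline
		fmt.Printf("server %q, deadline %v, flags %q\n",
			item.Column[len("s:"):], item.Timestamp.Time(), item.Value)
	}
	return nil
}
```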
Counters are stored in another table, one row with one column per counter.
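Counter updates map naturally onto Bigtable's atomic read-modify-write; in the
sketch below (same file as above), the row key and column family are assumed
names, since only "one row with one column per counter" is specified:

```go
// incrementCounter atomically adds delta to one counter cell in the separate
// counters table. The "counters" row key and the "c" family are hypothetical
// names used only for this sketch.
func incrementCounter(ctx context.Context, tbl *bigtable.Table, name string, delta int64) error {
	rmw := bigtable.NewReadModifyWrite()
	rmw.Increment("c", name, delta)
	_, err := tbl.ApplyReadModifyWrite(ctx, "counters", rmw)
	return err
}
```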
## Mutations
All operations use optimistic concurrency control. If a conflicting
change happens during a mutation, the whole operation is restarted.
* Mutation on N:
  * Get node N.
  * Check the caller's permissions.
  * Apply the mutation on N if the node version hasn't changed.
If the node version changed, it means that another mutation was applied between
the time when we retrieved the node and the time when we tried to apply ours.
When that happens, we restart the whole operation, starting with retrieving the
node again.
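The loop can be sketched with the Go client's server-side conditional
mutations, which make the version check and the write atomic; the decimal
version encoding and the placement of the permission check are assumptions for
this sketch:

```go
package mtsketch

import (
	"context"
	"regexp"
	"strconv"
	"time"

	"cloud.google.com/go/bigtable"
)

// applyIfUnchanged retries mutate until it applies without a conflicting
// change. The version cell m:v acts as the optimistic lock: the mutation
// runs server-side only if the cell still holds the value we read;
// otherwise we restart from the read.
func applyIfUnchanged(ctx context.Context, tbl *bigtable.Table, key string, mutate func(*bigtable.Mutation)) error {
	for {
		row, err := tbl.ReadRow(ctx, key)
		if err != nil {
			return err
		}
		var version []byte
		for _, item := range row["m"] {
			if item.Column == "m:v" {
				version = item.Value
			}
		}
		// (check the caller's permissions against row["p"] here)

		mut := bigtable.NewMutation()
		mutate(mut)
		// Bump the version as part of the same atomic row mutation.
		mut.Set("m", "v", bigtable.Now(),
			[]byte(strconv.FormatInt(time.Now().UnixNano(), 10)))

		cond := bigtable.ChainFilters(
			bigtable.FamilyFilter("m"),
			bigtable.ColumnFilter("v"),
			bigtable.LatestNFilter(1),
			bigtable.ValueFilter(regexp.QuoteMeta(string(version))),
		)
		var matched bool
		err = tbl.Apply(ctx, key, bigtable.NewCondMutation(cond, mut, nil),
			bigtable.GetCondMutationResult(&matched))
		if err != nil {
			return err
		}
		if matched {
			return nil // no concurrent change: our mutation was applied
		}
		// The version changed under us: restart the whole operation.
	}
}
```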
### Mounting / Unmounting a server
Mounting a server consists of adding a new cell to the node's row. The column
family is `s`, the column name is the address of the server, the timestamp is
the mount deadline, and the value contains the [mount flags].
Unmounting a server consists of deleting the server's column.
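In client terms these are single-cell row mutations; a sketch, building on the
helpers above, with the flag encoding left opaque:

```go
// mountMutation writes the server's cell in family "s": the column name is
// the server address, the timestamp is the mount deadline, and the value
// holds the encoded mount flags (encoding not shown here).
func mountMutation(server string, flags []byte, deadline time.Time) *bigtable.Mutation {
	mut := bigtable.NewMutation()
	mut.Set("s", server, bigtable.Time(deadline), flags)
	return mut
}

// unmountMutation deletes the server's column from family "s".
func unmountMutation(server string) *bigtable.Mutation {
	mut := bigtable.NewMutation()
	mut.DeleteCellsInColumn("s", server)
	return mut
}
```

Either mutation would then be applied through the version-checked retry loop
from the previous section.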
### Adding / Removing a child node
Adding or removing a node requires two mutations: one on the parent, one on the
child.
When adding a node, we first add it to the parent, and then create a new row
with the same timestamp.
When deleting a node, we first delete the row, and then delete the child column
on the parent.
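A sketch of the write ordering for adding a child, reusing the
`applyIfUnchanged` helper from the Mutations section; the ID generation and the
exact cell contents are assumptions:

```go
// addChild performs the two row mutations in the order described above:
// parent first, then the child row, reusing one timestamp for both. If the
// process dies in between, the parent is left with a dangling child column,
// which the paragraphs below explain is recoverable.
func addChild(ctx context.Context, tbl *bigtable.Table, parentKey, name, childKey string, id []byte) error {
	now := bigtable.Now()
	// 1. Add the child column (named after the child) on the parent.
	if err := applyIfUnchanged(ctx, tbl, parentKey, func(mut *bigtable.Mutation) {
		mut.Set("c", name, now, id)
	}); err != nil {
		return err
	}
	// 2. Create the child row with the same timestamp. A plain Apply is used
	// here because the row doesn't exist yet; the real server presumably
	// also guards against concurrent creation.
	mut := bigtable.NewMutation()
	mut.Set("m", "i", now, id)
	return tbl.Apply(ctx, childKey, mut)
}
```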
If the server process dies between the two mutations, it will leave the parent
with a reference to a child row that doesn't exist. As a consequence, the parent
will never be seen as "empty" and will not be automatically garbage collected.
This will be corrected when:
* the child is re-created, or
* the parent is forcibly deleted.
## Hot rows & caching
Some nodes are expected to be accessed significantly more than others, e.g. the
root node and its immediate children are traversed more often than nodes that
are further down the tree. The Bigtable rows associated with these nodes are
"hotter", which can lead to traffic imbalance and poor performance.
This problem is alleviated with a small cache in the Bigtable client.
High-frequency or concurrent requests for the same rows can be bundled together
to reduce both latency and Bigtable load at the same time.
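The cache itself isn't described here, but the bundling idea can be sketched
with `golang.org/x/sync/singleflight`, which collapses concurrent lookups of
the same key into a single Bigtable read (the real client-side cache may work
differently):

```go
package hotcache

import (
	"context"

	"cloud.google.com/go/bigtable"
	"golang.org/x/sync/singleflight"
)

// bundlingReader collapses concurrent reads of the same row into one
// Bigtable request. A real cache would also keep results for a short TTL
// to absorb high-frequency sequential reads of hot rows.
type bundlingReader struct {
	tbl *bigtable.Table
	g   singleflight.Group
}

func (b *bundlingReader) readRow(ctx context.Context, key string) (bigtable.Row, error) {
	v, err, _ := b.g.Do(key, func() (interface{}, error) {
		return b.tbl.ReadRow(ctx, key)
	})
	if err != nil {
		return nil, err
	}
	return v.(bigtable.Row), nil
}
```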
## Opportunistic garbage collection
A node can be garbage-collected when it has no children, no mounted servers, and
hasn't been marked as _sticky_. A node is _sticky_ when someone explicitly
called [SetPermissions] on it.
Garbage collection happens opportunistically: when a mounttable server, while
processing a request, accesses a node that is eligible for garbage collection,
it removes that node before the ongoing request completes.
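The eligibility test falls directly out of the schema; a sketch over a row that
has already been read (same file as the earlier sketches; column names as
above, expired server cells assumed to have been cleaned up already):

```go
// eligibleForGC reports whether a node's row may be garbage-collected:
// no children, no mounted servers, and no sticky marker in m:s.
func eligibleForGC(row bigtable.Row) bool {
	if len(row["c"]) > 0 || len(row["s"]) > 0 {
		return false // still has children or mounted servers
	}
	for _, item := range row["m"] {
		if item.Column == "m:s" {
			return false // sticky: SetPermissions was explicitly called
		}
	}
	return true
}
```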
[Cloud Bigtable]: https://cloud.google.com/bigtable/docs/
[Overview of Cloud Bigtable]: https://cloud.google.com/bigtable/docs/api-overview
[mounttable server]: https://github.com/vanadium/go.v23/blob/master/services/mounttable/service.vdl
[SetPermissions]: https://github.com/vanadium/go.v23/blob/master/services/permissions/service.vdl#L58
[access.Permissions]: https://github.com/vanadium/go.v23/blob/master/security/access/types.vdl#L130
[mount flags]: https://github.com/vanadium/go.v23/blob/master/naming/types.vdl#L9