runtimes/google/ipc/benchmarks/README.txt - release.go.x.ref - Git at Google

 This directory contains code uses to measure the performance of the Veyron IPC
 stack.

 ================================================================================

 The ipc_test.go file uses GO's testing package to run benchmarks. Each
 benchmark involves one server and one client. The server has two very simple
 methods that echo the data received from the client back to the client.

 client ---- Echo(payload) ----> server
 client <--- return payload ---- server

 There are two versions of the Echo method:
  - Echo(payload []byte) ([]byte], error)
  - EchoStream() <[]byte,[]byte> error

 The first benchmarks use the non-streaming version of Echo with a varying
 payload size. The second benchmarks use the streaming version with varying
 number of chunks and payload sizes. The third one is for measuring the
 performance with multiple clients hosted in the same process.

 All these benchmarks create a VC before measurements begin. So, the VC creation
 overhead is excluded.

 $ veyron go test -test.bench=. -timeout=1h -test.cpu=1 -test.benchtime=5s \
   veyron/runtimes/google/ipc/benchmarks

 Benchmark____1B     2000           5144219 ns/op           0.00 MB/s
 Benchmark___10B     2000           5526448 ns/op           0.00 MB/s
 Benchmark___1KB     2000           4528221 ns/op           0.44 MB/s
 Benchmark_100KB     1000           7569096 ns/op          26.42 MB/s
 Benchmark____1_chunk_____1B         1000           8945290 ns/op           0.00 MB/s
 Benchmark____1_chunk____10B         1000           9711084 ns/op           0.00 MB/s
 Benchmark____1_chunk____1KB         1000           8541689 ns/op           0.23 MB/s
 Benchmark____1_chunk___10KB         1000           8972995 ns/op           2.23 MB/s
 Benchmark___10_chunks____1B         1000          13114807 ns/op           0.00 MB/s
 Benchmark___10_chunks___10B         1000          13219493 ns/op           0.02 MB/s
 Benchmark___10_chunks___1KB         1000          13292236 ns/op           1.50 MB/s
 Benchmark___10_chunks__10KB          500          15733197 ns/op          12.71 MB/s
 Benchmark__100_chunks____1B          500          45078939 ns/op           0.00 MB/s
 Benchmark__100_chunks___10B          200          49113585 ns/op           0.04 MB/s
 Benchmark__100_chunks___1KB          100          57982457 ns/op           3.45 MB/s
 Benchmark__100_chunks__10KB          100          81632487 ns/op          24.50 MB/s
 Benchmark__per_chunk____1B         50000            357880 ns/op           0.01 MB/s
 Benchmark__per_chunk___10B         20000            476941 ns/op           0.04 MB/s
 Benchmark__per_chunk___1KB         10000            806491 ns/op           2.48 MB/s
 Benchmark__per_chunk__10KB         10000           1185081 ns/op          16.88 MB/s
 Benchmark____1B_mux___10_chunks___10B       1000          20235386 ns/op           0.00 MB/s
 Benchmark____1B_mux___10_chunks___1KB        500          21346428 ns/op           0.00 MB/s
 Benchmark____1B_mux__100_chunks___10B        100          72942436 ns/op           0.00 MB/s
 Benchmark____1B_mux__100_chunks___1KB        100          81538481 ns/op           0.00 MB/s

 'Benchmark___1KB' shows that it takes an average of 4.528 ms to
 execute a simple Echo RPC with a 1 KB payload.

 'Benchmark___10_chunks___1KB' shows that a streaming RPC with the
 same payload (i.e. 10 chunks of 1 KB) takes about 13.292 ms on average.

 'Benchmark__per_chunk___1KB' shows that sending a stream of 1 KB chunks
 takes an average of 0.806 ms per chunk.

 'Benchmark____1B_mux___10_chunks___1KB' shows that it takes an average
 of 21.346 ms to execute a simple Echo RPC with a 1 B payload while streaming
 10 chunks of 1 KB payloads continuously in the same process.

 bm/main.go does the same benchmarks as ipc_test.go but with more varying
 configurations and optional histogram outputs.

 $ veyron go run veyron/runtimes/google/ipc/benchmarks/bm/main.go \
   -test.cpu=1,2 -test.benchtime=5s -histogram

 RESULTS.txt has the latest benchmark results with main.go

 ================================================================================

 bmserver/main.go and bmclient/main.go are simple command-line tools to run the
 benchmark server and client as separate processes. Unlike the benchmarks above,
 this test includes the startup cost of name resolution, creating the VC, etc. in
 the first RPC.

 $ veyron go run veyron/runtimes/google/ipc/benchmarks/bmserver/main.go \
   -veyron.tcp.address=localhost:8888 -acl='{"In":{"...":"R"}}'

 (In a different shell)
 $ veyron go run veyron/runtimes/google/ipc/benchmarks/bmclient/main.go \
   -server=/localhost:8888 -iterations=100 -chunk_count=0 -payload_size=10
 iterations: 100  chunk_count: 0  payload_size: 10
 elapsed time: 619.75741ms
 Histogram (unit: ms)
 Count: 100  Min: 4  Max: 54  Avg: 5.65
 ------------------------------------------------------------
 [  4,   5)   42   42.0%   42.0%  ####
 [  5,   6)   32   32.0%   74.0%  ###
 [  6,   7)    8    8.0%   82.0%  #
 [  7,   9)   13   13.0%   95.0%  #
 [  9,  11)    3    3.0%   98.0%
 [ 11,  14)    1    1.0%   99.0%
 [ 14,  18)    0    0.0%   99.0%
 [ 18,  24)    0    0.0%   99.0%
 [ 24,  32)    0    0.0%   99.0%
 [ 32,  42)    0    0.0%   99.0%
 [ 42,  55)    1    1.0%  100.0%
 [ 55,  72)    0    0.0%  100.0%
 [ 72,  94)    0    0.0%  100.0%
 [ 94, 123)    0    0.0%  100.0%
 [123, 161)    0    0.0%  100.0%
 [161, 211)    0    0.0%  100.0%
 [211, inf)    0    0.0%  100.0%


 On a Raspberry Pi, everything is much slower. The same tests show the following
 results:

 $ ./benchmarks.test -test.bench=. -test.cpu=1 -test.benchtime=5s 2>/dev/null
 PASS
 Benchmark____1B             500          21316148 ns/op
 Benchmark___10B             500          23304638 ns/op
 Benchmark__100B             500          21860446 ns/op
 Benchmark___1KB             500          24000346 ns/op
 Benchmark__10KB             200          37530575 ns/op
 Benchmark_100KB             100         136243310 ns/op
 Benchmark_N_RPCs____1_chunk_____1B           500          19957506 ns/op
 Benchmark_N_RPCs____1_chunk____10B           500          22868392 ns/op
 Benchmark_N_RPCs____1_chunk___100B           500          19635412 ns/op
 Benchmark_N_RPCs____1_chunk____1KB           500          22572190 ns/op
 Benchmark_N_RPCs____1_chunk___10KB           500          37570948 ns/op
 Benchmark_N_RPCs___10_chunks___1KB           100          51670740 ns/op
 Benchmark_N_RPCs__100_chunks___1KB            50         364938740 ns/op
 Benchmark_N_RPCs_1000_chunks___1KB             2        3586374500 ns/op
 Benchmark_1_RPC_N_chunks_____1B    10000           1034042 ns/op
 Benchmark_1_RPC_N_chunks____10B     5000           1894875 ns/op
 Benchmark_1_RPC_N_chunks___100B     5000           2857289 ns/op
 Benchmark_1_RPC_N_chunks____1KB     5000           6465839 ns/op
 Benchmark_1_RPC_N_chunks___10KB      100          80019430 ns/op
 Benchmark_1_RPC_N_chunks__100KB Killed

 The simple 1 KB RPCs take an average of 24 ms. The streaming equivalent takes
 about 22 ms, and streaming many 1 KB chunks takes about 6.5 ms per chunk.


 $ ./bmserver --address=localhost:8888 --acl='{"...":"A"}'

 $ ./bmclient --server=/localhost:8888 --count=10 --payload_size=1000
 CallEcho 0 2573406000
 CallEcho 1 44669000
 CallEcho 2 54442000
 CallEcho 3 33934000
 CallEcho 4 47985000
 CallEcho 5 61324000
 CallEcho 6 51654000
 CallEcho 7 47043000
 CallEcho 8 44995000
 CallEcho 9 53166000

 On the pi, the first RPC takes ~2.5 sec to execute.
	This directory contains code uses to measure the performance of the Veyron IPC
	stack.

	================================================================================

	The ipc_test.go file uses GO's testing package to run benchmarks. Each
	benchmark involves one server and one client. The server has two very simple
	methods that echo the data received from the client back to the client.

	client ---- Echo(payload) ----> server
	client <--- return payload ---- server

	There are two versions of the Echo method:
	- Echo(payload []byte) ([]byte], error)
	- EchoStream() <[]byte,[]byte> error

	The first benchmarks use the non-streaming version of Echo with a varying
	payload size. The second benchmarks use the streaming version with varying
	number of chunks and payload sizes. The third one is for measuring the
	performance with multiple clients hosted in the same process.

	All these benchmarks create a VC before measurements begin. So, the VC creation
	overhead is excluded.

	$ veyron go test -test.bench=. -timeout=1h -test.cpu=1 -test.benchtime=5s \
	veyron/runtimes/google/ipc/benchmarks

	Benchmark____1B 2000 5144219 ns/op 0.00 MB/s
	Benchmark___10B 2000 5526448 ns/op 0.00 MB/s
	Benchmark___1KB 2000 4528221 ns/op 0.44 MB/s
	Benchmark_100KB 1000 7569096 ns/op 26.42 MB/s
	Benchmark____1_chunk_____1B 1000 8945290 ns/op 0.00 MB/s
	Benchmark____1_chunk____10B 1000 9711084 ns/op 0.00 MB/s
	Benchmark____1_chunk____1KB 1000 8541689 ns/op 0.23 MB/s
	Benchmark____1_chunk___10KB 1000 8972995 ns/op 2.23 MB/s
	Benchmark___10_chunks____1B 1000 13114807 ns/op 0.00 MB/s
	Benchmark___10_chunks___10B 1000 13219493 ns/op 0.02 MB/s
	Benchmark___10_chunks___1KB 1000 13292236 ns/op 1.50 MB/s
	Benchmark___10_chunks__10KB 500 15733197 ns/op 12.71 MB/s
	Benchmark__100_chunks____1B 500 45078939 ns/op 0.00 MB/s
	Benchmark__100_chunks___10B 200 49113585 ns/op 0.04 MB/s
	Benchmark__100_chunks___1KB 100 57982457 ns/op 3.45 MB/s
	Benchmark__100_chunks__10KB 100 81632487 ns/op 24.50 MB/s
	Benchmark__per_chunk____1B 50000 357880 ns/op 0.01 MB/s
	Benchmark__per_chunk___10B 20000 476941 ns/op 0.04 MB/s
	Benchmark__per_chunk___1KB 10000 806491 ns/op 2.48 MB/s
	Benchmark__per_chunk__10KB 10000 1185081 ns/op 16.88 MB/s
	Benchmark____1B_mux___10_chunks___10B 1000 20235386 ns/op 0.00 MB/s
	Benchmark____1B_mux___10_chunks___1KB 500 21346428 ns/op 0.00 MB/s
	Benchmark____1B_mux__100_chunks___10B 100 72942436 ns/op 0.00 MB/s
	Benchmark____1B_mux__100_chunks___1KB 100 81538481 ns/op 0.00 MB/s

	'Benchmark___1KB' shows that it takes an average of 4.528 ms to
	execute a simple Echo RPC with a 1 KB payload.

	'Benchmark___10_chunks___1KB' shows that a streaming RPC with the
	same payload (i.e. 10 chunks of 1 KB) takes about 13.292 ms on average.

	'Benchmark__per_chunk___1KB' shows that sending a stream of 1 KB chunks
	takes an average of 0.806 ms per chunk.

	'Benchmark____1B_mux___10_chunks___1KB' shows that it takes an average
	of 21.346 ms to execute a simple Echo RPC with a 1 B payload while streaming
	10 chunks of 1 KB payloads continuously in the same process.

	bm/main.go does the same benchmarks as ipc_test.go but with more varying
	configurations and optional histogram outputs.

	$ veyron go run veyron/runtimes/google/ipc/benchmarks/bm/main.go \
	-test.cpu=1,2 -test.benchtime=5s -histogram

	RESULTS.txt has the latest benchmark results with main.go

	================================================================================

	bmserver/main.go and bmclient/main.go are simple command-line tools to run the
	benchmark server and client as separate processes. Unlike the benchmarks above,
	this test includes the startup cost of name resolution, creating the VC, etc. in
	the first RPC.

	$ veyron go run veyron/runtimes/google/ipc/benchmarks/bmserver/main.go \
	-veyron.tcp.address=localhost:8888 -acl='{"In":{"...":"R"}}'

	(In a different shell)
	$ veyron go run veyron/runtimes/google/ipc/benchmarks/bmclient/main.go \
	-server=/localhost:8888 -iterations=100 -chunk_count=0 -payload_size=10
	iterations: 100 chunk_count: 0 payload_size: 10
	elapsed time: 619.75741ms
	Histogram (unit: ms)
	Count: 100 Min: 4 Max: 54 Avg: 5.65
	------------------------------------------------------------
	[ 4, 5) 42 42.0% 42.0% ####
	[ 5, 6) 32 32.0% 74.0% ###
	[ 6, 7) 8 8.0% 82.0% #
	[ 7, 9) 13 13.0% 95.0% #
	[ 9, 11) 3 3.0% 98.0%
	[ 11, 14) 1 1.0% 99.0%
	[ 14, 18) 0 0.0% 99.0%
	[ 18, 24) 0 0.0% 99.0%
	[ 24, 32) 0 0.0% 99.0%
	[ 32, 42) 0 0.0% 99.0%
	[ 42, 55) 1 1.0% 100.0%
	[ 55, 72) 0 0.0% 100.0%
	[ 72, 94) 0 0.0% 100.0%
	[ 94, 123) 0 0.0% 100.0%
	[123, 161) 0 0.0% 100.0%
	[161, 211) 0 0.0% 100.0%
	[211, inf) 0 0.0% 100.0%


	On a Raspberry Pi, everything is much slower. The same tests show the following
	results:

	$ ./benchmarks.test -test.bench=. -test.cpu=1 -test.benchtime=5s 2>/dev/null
	PASS
	Benchmark____1B 500 21316148 ns/op
	Benchmark___10B 500 23304638 ns/op
	Benchmark__100B 500 21860446 ns/op
	Benchmark___1KB 500 24000346 ns/op
	Benchmark__10KB 200 37530575 ns/op
	Benchmark_100KB 100 136243310 ns/op
	Benchmark_N_RPCs____1_chunk_____1B 500 19957506 ns/op
	Benchmark_N_RPCs____1_chunk____10B 500 22868392 ns/op
	Benchmark_N_RPCs____1_chunk___100B 500 19635412 ns/op
	Benchmark_N_RPCs____1_chunk____1KB 500 22572190 ns/op
	Benchmark_N_RPCs____1_chunk___10KB 500 37570948 ns/op
	Benchmark_N_RPCs___10_chunks___1KB 100 51670740 ns/op
	Benchmark_N_RPCs__100_chunks___1KB 50 364938740 ns/op
	Benchmark_N_RPCs_1000_chunks___1KB 2 3586374500 ns/op
	Benchmark_1_RPC_N_chunks_____1B 10000 1034042 ns/op
	Benchmark_1_RPC_N_chunks____10B 5000 1894875 ns/op
	Benchmark_1_RPC_N_chunks___100B 5000 2857289 ns/op
	Benchmark_1_RPC_N_chunks____1KB 5000 6465839 ns/op
	Benchmark_1_RPC_N_chunks___10KB 100 80019430 ns/op
	Benchmark_1_RPC_N_chunks__100KB Killed

	The simple 1 KB RPCs take an average of 24 ms. The streaming equivalent takes
	about 22 ms, and streaming many 1 KB chunks takes about 6.5 ms per chunk.


	$ ./bmserver --address=localhost:8888 --acl='{"...":"A"}'

	$ ./bmclient --server=/localhost:8888 --count=10 --payload_size=1000
	CallEcho 0 2573406000
	CallEcho 1 44669000
	CallEcho 2 54442000
	CallEcho 3 33934000
	CallEcho 4 47985000
	CallEcho 5 61324000
	CallEcho 6 51654000
	CallEcho 7 47043000
	CallEcho 8 44995000
	CallEcho 9 53166000

	On the pi, the first RPC takes ~2.5 sec to execute.