Published by Ostap Cherkashin on 2010-12-27.
The first cut of the Bandicoot database (v0) provides enough functionality to start building some applications. Though, as with any new system, several important questions arise: stability, performance, …? This blog post is about the performance of the system. With regard to stability, I can only say that even though a fair number of structured tests are available, the system is still not recommended for production use.
With Bandicoot, application developers control the execution path. This is fairly different from other databases, where this responsibility is split between a human (the programmer) and another "living organism" (the query optimizer). Sometimes the two misunderstand each other, and as a result we get two beliefs:
I will try to address the first one here. As for the second one (the v0 tag comment says "brute force preview"), I will try to avoid the discussion for a little while ;-)
In relational algebra there is only one data structure, the relation [1], and several operators: select, join, etc. [2]. Relations represent your data, and operators take relations as input and produce new relations as output. From a performance perspective, one of the most important properties of a relation is the number of tuples (or rows) in its body: the execution time of an operator depends heavily on this number (it is also used to express the algorithmic complexity below). To improve a function's performance, one should reduce the data sets passed from one operator to another as early as possible. This is a fairly intuitive idea, and it greatly impacts performance. To help with this, and to facilitate reuse of computations, Bandicoot introduces temporary variables [3]. Apart from improving performance, they also make code more readable.
The table below lists the complexities of the relational operators in Bandicoot v0. I will use #n to denote the number of tuples in a relation variable n.
| Operator         | Sample Expression             | Big-O         |
|------------------|-------------------------------|---------------|
| join             | m * n                         | O(#m * #n)    |
| minus (semidiff) | m - n                         | O(#m * #n)    |
| union            | m + n                         | O(#m + #m*#n) |
| select           | n select(x < 0)               | O(#n)         |
| extend           | n extend(x = 1)               | O(#n)         |
| project          | n project(x, y)               | O(#n^2)       |
| summary          | (m, n) summary(x = max(y, 0)) | O(#m * #n)    |
As you can see, most of the algorithms are quadratic in nature, so the limits of v0 are quite low in terms of the number of tuples. I am working on a patch to improve this situation; meanwhile, you can master temporary variables, which is good practice in any case.
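To make this concrete, here is a minimal sketch of the early-reduction idea, written in the expression syntax from the table above. The relation variables m and n, the attribute x, and the assignment and comment syntax are all assumptions for illustration; v0 may spell temporary variables slightly differently:

```
# all names below (m, n, x, t) are hypothetical
# naive: join first, filter later - the O(#m * #n) join runs on the full inputs
slow = (m * n) select(x > 0);

# better: filter first into a temporary variable - the join sees fewer tuples
t = m select(x > 0);
fast = t * n;
```

If the select keeps only a small fraction of m, the join touches proportionally fewer tuple pairs, which matters a great deal given the quadratic costs above.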
## Figures
To turn the formulas above into actual numbers, you can use the performance tests shipped with the system source code. One of the interesting ones is perf/relation: it shows the execution times for various workloads on different relational operators. Here is the output from a run on my laptop (Core i7 @ 2GHz, compiled with the -Os gcc option). Note that all test tuples have two attributes (a string and an integer) and are pre-generated before each test execution.
```
$ cd bandicoot
$ ctl pack
$ bin/test/perf/relation
write 100000 tuples in 18ms
read 100000 tuples in 21ms
write 1000000 tuples in 135ms
read 1000000 tuples in 175ms
join 1000x1000 tuples in 20ms
join 10000x10000 tuples in 1786ms
semidiff 1000x1000 tuples (res=500 tuples) in 15ms
semidiff 10000x10000 tuples (res=5000 tuples) in 1533ms
project 1000 tuples from 1000 in 9ms
project 10000 tuples from 10000 in 876ms
select 1 tuples from 100000 in 11ms
select 1 tuples from 1000000 in 97ms
union 1000x1000 tuples (res=1500 tuples) in 20ms
union 10000x10000 tuples (res=15000 tuples) in 1752ms
extend 100000 tuples in 37ms
extend 1000000 tuples in 372ms
sum 1000x1000 tuples in 20ms
sum 10000x10000 tuples in 1817ms
```
The first four rows show the read/write performance for raw data volumes (I have an SSD drive, and there are no fsync/fdatasync calls). The rest are the execution times of the relational operators.
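A quick sanity check against the table above: under O(#m * #n), growing both join inputs from 1,000 to 10,000 tuples should cost roughly (10000/1000)^2 = 100x more, and the measured ratio is 1786ms / 20ms ≈ 89x. The linear operators scale as expected too: select over 10x more tuples takes 97ms / 11ms ≈ 9x longer, and extend 372ms / 37ms ≈ 10x.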
## Resources