persistence-benchmark

How are stars counted? How InterSystems Caché eXTreme is used in Gaia

Astronomers’ tools

5 years ago, on December 19, 2013, the ESA launched an orbital telescope called Gaia. Learn more about the Gaia mission on the official website of the European Space Agency or in the article by Vitaly Egorov (Billion pixels for a billion stars).

However, few people know what technology the agency chose for storing and processing the data collected by Gaia. Two years before the launch, in 2011, the developers were considering a number of candidates (see “Astrostatistics and Data Mining” by Luis Manuel Sarro, Laurent Eyer, William O’Mullane, Joris De Ridder, pp. 111-112):

Comparing the technologies side-by-side produced the following results (source):

Technology Time
DB2 13min55s
PostgreSQL 8 14min50s
PostgreSQL 9 6min50s
Hadoop 3min37s
Cassandra 3min37s
Caché 2min25s

The first four will probably sound familiar even to schoolchildren. But what is Caché XEP?

Java technologies in Caché

If you look at the Java API stack provided by InterSystems, you will see the following:

Summing up, here’s what we have:

This approach offers some advantages in comparison with comparable relational (higher access speed) and various NoSQL solutions (instant access to data in the relational style).

The “nuance” of configuring Caché eXTreme prior to connecting is the environment set-up:

Additionally, you may need to increase the stack and heap size of the JVM (-Xss2m -Xmx768m).

Some practice

The authors were interested in how Caché eXTreme would behave while writing an uninterrupted stream of data in comparison with other data processing technologies. We used historical stock price data in the CSV format from the website of the “Finam” holding. Sample data file:

<TICKER>,<PER>,<DATE>,<TIME>,<LAST>,<VOL>
NASDAQ100,0,20130802,09:31:07,3 125.300000000,0
NASDAQ100,0,20130802,09:32:08,3 122.860000000,806 906
NASDAQ100,0,20130802,09:33:09,3 123.920000000,637 360
NASDAQ100,0,20130802,09:34:10,3 124.090000000,421 928
NASDAQ100,0,20130802,09:35:11,3 125.180000000,681 585

The code of the Caché class modeling the above structure might look like this:

Class com.intersystems.persistence.objbinding.Event Extends %Persistent [ ClassType = persistent, DdlAllowed, Final, SqlTableName = Event ]
{
Property Ticker As %String(MAXLEN = 32);

Property Per As %Integer(MAXVAL = 2147483647, MINVAL = -2147483648);

Property TimeStamp As %TimeStamp;

Property Last As %Double;

Property Vol As %Integer(MAXVAL = 9223372036854775807, MINVAL = -9223372036854775810);
}

We also wrote some basic and naive test code. This “naive” approach can be justified by the fact that we are not really measuring the speed of the code generated by JIT, but the speed at which the code that is completely unrelated to JVM (with the exception of Apache Derby) can write to the disk. Here’s how the test program window looks like:

Caché eXTreme Persistence Benchmark

Our contenders:

Note that since tests are somewhat approximate, we saw no practical purpose in providing exact numbers: the margin of error is fairly high, while the goal of the article is to demonstrate the general tendency. For the same reasons, we are not specifying the exact version of JDK and the settings of the garbage collector: the server-side JVM 8u191 with -Xmx2048m -Xss128m reached a very similar level of performance on Linux and Mac OS X. One million events were saved in each test; several warm-up runs (up to 10) were performed before each test of a particular database. As for Caché settings, the routine cache was increased to 256 MB and the 8kb database cache was expanded to 1024 MB.

Our testing yielded the following results (the write speed values are expressed in events per second (eps)):

Technology Time, s (less is better) Write speed, eps (more is better)
Apache Derby 140±30 7100±1300
Oracle 780±50 1290±80
Caché JDBC 61±8 17000±2000
Caché eXTreme 6.7±0.8 152000±17000
Caché eXTreme, transaction journaling disabled 6.3±0.6 162000±14000
  1. Derby offers speeds varying from 6200 to 8000 eps.

  2. Oracle turned out to be as fast as 1290 eps.

  3. Caché in the JDBC mode gives you a higher speed (from 15000 to 18000 eps), but there is a trade-off: the default transaction isolation level, as mentioned above, is READ_UNCOMMITTED.

  4. The next option, Caché eXTreme, gives us 127000 to 167000 eps.

  5. Finally, we took some risk and disabled the transaction log (for a given client process), and managed to achieve the write speed of 172000 eps on a test system.

Those who are interested in more accurate numbers can view the source code. You will need the following to build and run: