Evaluating Cassandra and RethinkDB on a Large Dataset

by Nadeem Nazeer

As more and more data-intensive applications get developed, one often comes across datasets substantially larger than those in typical web applications. You can push MySQL to its limits, storing millions of records in a table and joining them to support big data queries, but there are considerable limits to what you can do with MySQL.

For big data applications, it is important to consider the performance of your Select, Insert and Update queries. Evaluating a database solution for large datasets means investigating the performance of exactly these queries.

In one of our healthcare data solutions, we identified that we needed a database that could achieve the following performance levels:

  • 1000 inserts/sec
  • 1000 updates/sec
  • 0.001 sec per select.

To achieve the above, we looked into NoSQL database solutions and evaluated two popular options: Cassandra and RethinkDB.

Instead of testing these on standalone deployments, we opted for cluster-based configurations. So, before starting any internal benchmarking, there was some straightforward setup work to get these database clusters running. You can learn how to install Cassandra in 5 minutes here.

These two databases have different architectures and take different approaches to storing data: Cassandra is a key-based (wide-column) store, while RethinkDB is a document store.
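To make that difference concrete, here is a rough sketch of how the same record might be represented in each system. The "records" table and its fields are purely illustrative, not our actual healthcare schema.

```python
# Illustrative only: a hypothetical "records" table, not our actual schema.

# Cassandra: rows live in a schema-defined table; the primary key decides
# which node (and shard) owns each row.
CASSANDRA_TABLE = """
CREATE TABLE records (
    id   uuid PRIMARY KEY,
    name text,
    age  int
)
"""

# RethinkDB: the same record is simply a JSON document in a table,
# with no fixed schema beyond the primary key field ("id" by default).
rethinkdb_document = {
    "id": "4f7a2c1e",   # any value can serve as the key
    "name": "record-1",
    "age": 42,
}
```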

To evaluate the two databases, we ran a series of tests with the configuration and input datasets described below.

Configuration used: 4 shards, 1 replica per shard.
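For reference, here is a sketch of how that layout can be expressed in each system; the "bench" keyspace/database and "records" table are placeholders, not our production names. In Cassandra, data is partitioned across nodes automatically and the replica count is a property of the keyspace, whereas RethinkDB lets you set shard and replica counts explicitly per table.

```python
from cassandra.cluster import Cluster   # pip install cassandra-driver
from rethinkdb import RethinkDB         # pip install rethinkdb

# Cassandra: rows are sharded across the ring by partition key;
# "1 replica per shard" maps to a replication factor of 1 on the keyspace.
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS bench
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS bench.records (id uuid PRIMARY KEY, name text, age int)
""")

# RethinkDB: shard and replica counts are set per table.
r = RethinkDB()
conn = r.connect("localhost", 28015)
r.db_create("bench").run(conn)
r.db("bench").table_create("records", shards=4, replicas=1).run(conn)
```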

The specifics of our internal benchmark for the problem statement above:

INSERTION:

We inserted 1 million records for the benchmark tests.
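The benchmark loop itself was simple; the sketch below shows its general shape. The generated placeholder values stand in for the real healthcare records, and it reuses the session and conn handles from the setup sketch above.

```python
import time
import uuid

def bench_cassandra_inserts(session, n=1_000_000):
    # Prepared statement: parsed once, executed n times.
    insert = session.prepare(
        "INSERT INTO bench.records (id, name, age) VALUES (?, ?, ?)"
    )
    start = time.time()
    for i in range(n):
        session.execute(insert, (uuid.uuid4(), "record-%d" % i, i % 100))
    return time.time() - start   # seconds per n insertions

def bench_rethinkdb_inserts(r, conn, n=1_000_000):
    table = r.db("bench").table("records")
    start = time.time()
    for i in range(n):
        # RethinkDB auto-generates the "id" field when it is omitted.
        table.insert({"name": "record-%d" % i, "age": i % 100}).run(conn)
    return time.time() - start
```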

Results:

Seconds per 1 million insertions

              Cassandra    RethinkDB
Iteration 1   1748.56      2374.99
Iteration 2   1766.16      1945.35
Iteration 3   1866.16      2022.46
Average       1793.63      2114.26

 

SELECTION:

Select * with condition on <indexed column>
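In both engines this query shape relies on a secondary index on the condition column. A sketch, using a hypothetical indexed "age" column and the same handles as above:

```python
# One-time secondary index creation on the condition column.
session.execute("CREATE INDEX IF NOT EXISTS ON bench.records (age)")
r.db("bench").table("records").index_create("age").run(conn)
r.db("bench").table("records").index_wait("age").run(conn)

# Cassandra (CQL): SELECT * restricted on the indexed column.
rows = session.execute("SELECT * FROM bench.records WHERE age = %s", (42,))

# RethinkDB (ReQL): get_all() against the secondary index returns whole documents.
docs = list(r.db("bench").table("records").get_all(42, index="age").run(conn))
```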

Results:

Seconds per 999 selections

              Cassandra    RethinkDB
Iteration 1   7.51         3
Iteration 2   6.58         5
Iteration 3   6.68         4
Average       6.92         4

 

Select <indexed column> with condition on <indexed column>
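This variant, like the non-indexed-column one that follows, only changes which column is projected. Continuing the same sketch:

```python
# Cassandra: name the column instead of "*".
rows = session.execute("SELECT age FROM bench.records WHERE age = %s", (42,))

# RethinkDB: pluck() keeps only the requested field of each matching document.
docs = list(
    r.db("bench").table("records")
     .get_all(42, index="age")
     .pluck("age")
     .run(conn)
)
```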

Results:

Seconds per 999 selections

              Cassandra    RethinkDB
Iteration 1   5.99         4.86
Iteration 2   5.04         3.05
Iteration 3   5.20         4.58
Average       5.4          4.17

 

 

Select <non-indexed column> with condition on <indexed column>

Results:

Seconds per 999 selections

              Cassandra    RethinkDB
Iteration 1   6.7          3.32
Iteration 2   6.06         4.04
Iteration 3   6.06         2.74
Average       6.28         3.37

 

UPDATING:

Update <non-indexed column> where <indexed column>
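A sketch of this update shape, with one caveat: a CQL UPDATE must address rows by primary key, so on the Cassandra side the indexed condition column is effectively the key column, while RethinkDB can update any indexed selection in place. The indexed-column variant that follows differs only in which field is set.

```python
# Cassandra: fetch a key, then update a non-key column for that row.
row = session.execute("SELECT id FROM bench.records LIMIT 1").one()
session.execute(
    "UPDATE bench.records SET name = %s WHERE id = %s",
    ("updated-name", row.id),
)

# RethinkDB: update every document matched by the indexed condition.
(
    r.db("bench").table("records")
     .get_all(42, index="age")
     .update({"name": "updated-name"})
     .run(conn)
)
```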

Results:

Seconds per 999 updates

              Cassandra    RethinkDB
Iteration 1   1.56         5.70
Iteration 2   1.76         6.04
Iteration 3   1.49         5.43
Average       1.60         5.72

 

Update <indexed column> where <indexed column>

Results:

Seconds per 999 updates

              Cassandra    RethinkDB
Iteration 1   1.95         6.05
Iteration 2   2.48         5.82
Iteration 3   2.16         7.72
Average       2.20         6.53

 

Running these iterations over the two databases with our sample data initially looked promising for insertions, but as we moved on to selections and updates, the results lagged behind what we had initially expected. Some of the causes were valid and obvious, while others were new to us. In the NoSQL world some things are unpredictable where varying datasets are concerned, but our journey continues.

Next, we look forward to evaluating Apache Spark. Stay tuned.

Shahid Ashraf contributed to this blog. 
