Evaluating Cassandra and RethinkDB on a Large Dataset

by Nadeem Nazeer
As more and more data-intensive applications get developed, one often comes across datasets considerably larger than those in typical web applications. You can push MySQL to its limits, store millions of records in a table, and do joins to support big data queries, but there are considerable limits to what you can do with MySQL.
For big data applications, the performance of your select, insert, and update queries is critical. For large datasets, evaluating a DB solution requires investigating the performance of such queries.
In one of our healthcare data solutions, we identified that we needed a database that could achieve the following performance levels:
- 1000 inserts/sec
- 1000 updates/sec
- 0.001 sec per select
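Checking a database against targets like these calls for a simple throughput harness. The sketch below is hypothetical (not from the original benchmark code): it times a batch of operations against an in-memory dict as a stand-in; against a real cluster, `op()` would wrap a driver call such as `session.execute(...)`.

```python
import time
import uuid

def measure_ops_per_sec(op, n=10_000):
    """Time n executions of op() and return the achieved ops/sec."""
    start = time.perf_counter()
    for _ in range(n):
        op()
    elapsed = time.perf_counter() - start
    return n / elapsed

# Stand-in store: a plain dict. In a real test, insert_record() would
# issue an INSERT through the Cassandra or RethinkDB driver instead.
store = {}

def insert_record():
    store[uuid.uuid4().hex] = {"value": 42}

rate = measure_ops_per_sec(insert_record)
print(f"{rate:,.0f} inserts/sec against the in-memory stand-in")
```

The same harness, pointed at each cluster in turn, gives comparable inserts/sec and updates/sec figures to hold against the targets above.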
To achieve this, we looked into NoSQL database solutions and evaluated two popular options: Cassandra and RethinkDB.
Instead of testing these on standalone deployments, we opted for cluster-based configurations. So, before starting our internal benchmarking of these databases, there were a few simple tasks involved in setting up the clusters. You can learn how to install Cassandra in 5 minutes here.
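For Cassandra, turning standalone nodes into a cluster mostly comes down to a few settings in `cassandra.yaml` on each node. The excerpt below is illustrative only; the cluster name and IP addresses are placeholders, not our actual setup.

```yaml
# cassandra.yaml (excerpt) -- illustrative values for a small test cluster.
# Every node lists the same seeds and its own listen/rpc addresses.
cluster_name: 'BenchmarkCluster'
num_tokens: 256
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"
listen_address: 10.0.0.3
rpc_address: 10.0.0.3
endpoint_snitch: GossipingPropertyFileSnitch
```

RethinkDB nodes join a cluster at startup instead, by pointing each new node at an existing one with its `--join` flag.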
These two databases have different architectures and use different approaches to store data: Cassandra is a key-based wide-column store, while RethinkDB is a document store.
To evaluate the two DBs, we ran a series of tests using the configuration and input data sets described below.
Configuration used: 4 shards, 1 replication per shard.
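The two systems express this layout differently: Cassandra sets replication per keyspace (sharding across nodes is automatic via the partitioner), while RethinkDB configures shards and replicas per table. A minimal sketch, with a hypothetical keyspace name:

```python
# Hypothetical sketch of how a 4-shard / 1-replica layout maps to each DB.

def cassandra_keyspace_ddl(keyspace, replication_factor):
    """Build the CQL that sets the replication factor for a keyspace."""
    return (
        f"CREATE KEYSPACE {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}}"
    )

print(cassandra_keyspace_ddl("benchmark", 1))
# With a live RethinkDB connection `conn`, the equivalent per-table call is:
#   r.table("records").reconfigure(shards=4, replicas=1).run(conn)
```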
Our internal benchmark, specific to the above problem statement:
We inserted 1 million records for the benchmark tests, then ran the following operations:
- Select * with condition on <indexed column>
- Select <indexed column> with condition on <indexed column>
- Select <non-indexed column> with condition on <indexed column>
- Update <non-indexed column> with condition on <indexed column>
- Update <indexed column> with condition on <indexed column>
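In CQL terms, the five benchmark shapes look roughly like the templates below. The table and column names are our own placeholders, not from the actual benchmark:

```python
# Hypothetical CQL templates for the five benchmark query shapes.
# "records", "indexed_col", and "plain_col" are placeholder names.
# Note two Cassandra-specific caveats: selecting on a non-primary-key
# column needs a secondary index (or ALLOW FILTERING), and UPDATE only
# accepts primary-key columns in its WHERE clause.
BENCHMARK_QUERIES = [
    "SELECT * FROM records WHERE indexed_col = ?",
    "SELECT indexed_col FROM records WHERE indexed_col = ?",
    "SELECT plain_col FROM records WHERE indexed_col = ?",
    "UPDATE records SET plain_col = ? WHERE indexed_col = ?",
    "UPDATE records SET indexed_col = ? WHERE indexed_col = ?",
]

for query in BENCHMARK_QUERIES:
    print(query)
```

These restrictions on non-indexed access paths are part of why the select and update results diverge so much between the two systems.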
Iterating over these DBs with our sample data initially looked promising for insertions, but as we moved on to selects and updates, we found the results lagged behind what we had expected. Some of the causes were valid, obvious ones, while others were new to us. In the NoSQL world, some things are unpredictable where varying datasets are concerned, but our journey continues.
Next, we look forward to evaluating Apache Spark. Stay tuned.
Shahid Ashraf contributed to this blog.