Polyglot Persistence: Integrating Low-Latency NoSQL Systems (Cassandra and InfiniteGraph)
Cassandra’s integration with Hadoop’s MapReduce is a successful example of two overlapping NoSQL data-stores sharing common features. However, some features are challenging to support across different NoSQL architectures, for instance the expression of the data model. Similar problems have been solved before in the programming world with the concept of polyglot programming. The database analog, polyglot persistence, is now gaining widespread acceptance as well. In an effort to learn more about NoSQL and polyglot storage patterns, I began to wonder whether it would be possible to extract data from Cassandra for analysis in a graph database. In other words, I wanted to implement my own polyglot persistence application by fusing InfiniteGraph and Cassandra. To understand what each data-store can give to the other, I first want to define their commonalities. InfiniteGraph and Cassandra share similarities in their underlying architecture. Both data-stores offer:
- distributed server nodes (for scalability and availability)
- a foundation in proven underlying technologies (Dynamo for Cassandra, Objectivity for InfiniteGraph)
- configurability for both disk and memory driven deployment
- tuning features for durability, consistency, read / write performance
- avoidance of JOINs
- low-latency solutions
The direct benefits of these common features are linear scalability, massive write performance, and low operational cost. There are some notable differences between InfiniteGraph and Cassandra, though. Presently, Cassandra has more choices for asynchronous operations, and its availability options are more mature. Both are considered to be on the rich end of the NoSQL data model spectrum (with simple key-value stores at the other end). My interpretation is that InfiniteGraph’s model and programming interface are more intuitive to the general object-oriented programmer, especially if your use cases are graph oriented: Friend Of A Friend (FOAF) analysis, recommendation engines, fraud detection, correlating IP addresses among log files, clickstream or ad-serving metrics, etc.
Caveat: I have tried hard to understand each database; the fact is that both of these data-stores are quite comprehensive and summarizing their capabilities is not trivial. Additionally, both data-stores are under active development. I expect InfiniteGraph’s asynchronous features will expand and Cassandra’s rich client interface will improve as Avro becomes available.
In general, merging the two data-stores was straightforward and useful. After extracting the ‘friends’ columns out of Cassandra, I used InfiniteGraph to perform path analysis to find a common friend amongst friends. A particularly useful feature of the hybrid system was the ability to learn from the graph and update the original Cassandra store, for instance by finding a shorter path between friends in InfiniteGraph and then updating the columns in Cassandra. For a friend suggestion application, the shortest path might be the critical one. However, InfiniteGraph can hop vertices and edges quickly, so calculating the longer paths was possible too. Alternative longer paths are useful for certain applications, like tracking money laundering or other law enforcement needs. I found that I could easily try multi-path analysis, update Cassandra, then run my original queries in Cassandra over the new data. As a potential follow-up post, I plan to try using my new hybrid NoSQL database as a caching system, with InfiniteGraph’s streaming capabilities augmenting Cassandra’s limitations.
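To make the extract, analyze, and update loop concrete, here is a minimal sketch in plain Python. It is not the prototype's code and it does not call the Cassandra or InfiniteGraph client APIs: the `friends_rows` dictionary is a made-up stand-in for the ‘friends’ columns pulled out of Cassandra, an in-memory adjacency map stands in for the graph, and `write_back` with its `path_to:<name>` column naming is purely hypothetical. The shortest path uses a breadth-first search, which is my assumption of one reasonable technique rather than a description of how InfiniteGraph navigates.

```python
from collections import deque

# Hypothetical stand-in for rows extracted from Cassandra:
# row key -> values of that row's 'friends' columns.
friends_rows = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def build_graph(rows):
    """Turn friend lists into an undirected adjacency map (vertex -> neighbours)."""
    graph = {}
    for person, friends in rows.items():
        graph.setdefault(person, set())
        for friend in friends:
            graph[person].add(friend)
            graph.setdefault(friend, set()).add(person)
    return graph


def common_friends(graph, a, b):
    """Friends that a and b share directly."""
    return graph.get(a, set()) & graph.get(b, set())


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of vertices, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbour in sorted(graph[current]):  # sorted for repeatable results
            if neighbour not in parents:
                parents[neighbour] = current
                queue.append(neighbour)
    return None


def all_simple_paths(graph, start, goal, max_hops):
    """Yield every cycle-free path from start to goal using at most max_hops edges."""
    def walk(path):
        current = path[-1]
        if current == goal:
            yield list(path)
            return
        if len(path) - 1 >= max_hops:
            return
        for neighbour in sorted(graph[current]):
            if neighbour not in path:
                path.append(neighbour)
                yield from walk(path)
                path.pop()

    if start in graph and goal in graph:
        yield from walk([start])


def write_back(columns, path):
    """Record a learned path under a hypothetical 'path_to:<goal>' column.

    `columns` is a plain dict standing in for the column family to update.
    """
    columns.setdefault(path[0], {})["path_to:" + path[-1]] = ",".join(path)


graph = build_graph(friends_rows)
learned = {}
path = shortest_path(graph, "alice", "erin")
if path is not None:
    write_back(learned, path)
```

In a real deployment, `build_graph` would be replaced by loading vertices and edges into the graph database and `write_back` by a column insert through a Cassandra client. Calling `all_simple_paths` with a larger `max_hops` corresponds to the longer-path analysis described above.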
The working prototype can be found on GitHub via “git clone git@github.com:toddstavish/Cassandra-Graph-Extract.git” (without the quotes). If you have any questions or suggestions, you can find me here. I look forward to trying this code base against a larger dataset with multinode deployments on both sides of the InfiniteGraph / Cassandra equation.
