How To Use A Graph Database to Integrate And Analyze Relational Exports

Graph databases can be used to analyze data from disparate data sources. In this use case, three relational databases have been exported to CSV. Each relational export is ingested into its own sharded subgraph to increase performance and avoid lock contention when merging the datasets. Unique keys that appear in more than one data source provide the mechanism for linking the subgraphs produced from parsing the CSV files. A REST server then sends the merged graph to a visualization application for analysis.

The necessary components are below:

  • InfiniteGraph supplies the distributed graph database
  • RESTlet provides web service access to the data
  • GSON and custom parsing code produce the JSON representation
  • Gephi is used for interactive visualization and data exploration

A graph index stores the keys that interlink the relational exports. Because the index is shared, the ingest can be split across multiple client-side threads. On the server side, InfiniteGraph offers parallel data loading, which scales horizontally to match the multithreaded ingest. After the ingest, the Gephi streaming API is used to request the graph data from the REST server. Gephi has a variety of built-in graph algorithms and customizable visualization settings. The following screenshot shows Gephi visualizing the consolidated graph. The source code for this sample application can be found on GitHub. The code is completely self-contained, including sample CSV data.

[Screenshot: Gephi visualizing the consolidated graph]
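The key-based linking described above can be sketched in plain Python. This is a minimal, single-threaded illustration under stated assumptions, not the sample application's code and not the InfiniteGraph API: the export contents, the customer_id link key, and the helper names ingest and link are all made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample exports; the column names are assumptions for illustration.
EXPORTS = {
    "customers": "customer_id,name\nc1,Ada\nc2,Grace\n",
    "orders": "order_id,customer_id\no1,c1\no2,c1\no3,c2\n",
    "tickets": "ticket_id,customer_id\nt1,c2\n",
}
LINK_KEY = "customer_id"  # assumed key shared by all three exports


def ingest(source, text, key_index):
    """Parse one export into its own subgraph and index each row by its link key."""
    vertices = {}
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text))):
        vertex_id = f"{source}:{row_number}"
        vertices[vertex_id] = row
        key_index[row[LINK_KEY]].append((source, vertex_id))
    return vertices


def link(key_index):
    """Create an edge between vertices from different exports that share a key."""
    edges = []
    for entries in key_index.values():
        for i, (left_source, left_id) in enumerate(entries):
            for right_source, right_id in entries[i + 1:]:
                if left_source != right_source:
                    edges.append((left_id, right_id))
    return edges


# The shared index is the only state the separate ingests have in common.
key_index = defaultdict(list)
subgraphs = {name: ingest(name, text, key_index) for name, text in EXPORTS.items()}
edges = link(key_index)

for edge in edges:
    print(edge)
```

Each export is parsed independently, and only the shared index is consulted when the edges are created, which is what lets the ingests run separately before the subgraphs are merged.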

 

 


Instructions for running Accumulo on MapR M5 VM

1. Download and launch MapR M5 VM

http://www.mapr.com/download

2. Download Accumulo 1.4.0 release candidate 2

http://people.apache.org/~kturner/1.4.0rc2/accumulo-1.4.0-incubating-dist-RC2...

3. tar xzvf accumulo-1.4.0-incubating-dist-RC2.tar.gz

4. sudo mv accumulo-1.4.0-incubating/ /opt

5. cd /opt/accumulo-1.4.0-incubating/conf

6. cp examples/512MB/standalone/* .

7. Add the following to .bashrc or conf/accumulo-env.sh:

export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
export ZOOKEEPER_HOME=/opt/mapr/zookeeper/zookeeper-3.3.2/
export ACCUMULO_HOME=/opt/accumulo-1.4.0-incubating
export ACCUMULO_LOG_DIR=/opt/accumulo-1.4.0-incubating/logs/

8. In conf/accumulo-site.xml, change the ZooKeeper port from 2181 to 5181 (MapR runs ZooKeeper on this port):

<property>
  <name>instance.zookeeper.host</name>
  <value>localhost:5181</value>
  <description>comma separated list of zookeeper servers</description>
</property>

9. In conf/accumulo-site.xml, add the following to change the tablet server port from 9997 (the default) to 9996:

<property>
  <name>tserver.port.client</name>
  <value>9996</value>
</property>

* Something else is already running on port 9997 in the MapR VM.

10. Run bin/accumulo init:

mapr@mapr-desktop:/opt/accumulo-1.4.0-incubating/bin$ ./accumulo init
[util.Initialize] INFO : Hadoop Filesystem is maprfs:///
[util.Initialize] INFO : Accumulo data dir is /accumulo
[util.Initialize] INFO : Zookeeper server is localhost:5181

11. Check out accumulo trunk:

svn co https://svn.apache.org/repos/asf/incubator/accumulo/trunk/

12. cd trunk then mvn clean install

13. mv accumulo-1.4.0-incubating accumulo-1.4.0-RC2

14. mv trunk accumulo-1.4.0-incubating

15. cp accumulo-1.4.0-RC2/conf/* accumulo-1.4.0-incubating/conf

16. cp -r accumulo-1.4.0-RC2/lib/native/ ../../accumulo-1.4.0-incubating/lib/

17. accumulo-1.4.0-incubating/bin/start-all.sh (this launches the master, tablet server, and logger in the foreground; the script does not return)

18. check the accumulo-1.4.0-incubating/logs for any weirdness

19. if something is wrong, pkill -9 -f accumulo.start works better than stop-all.sh

20. I used the Accumulo shell to test. My complete setup can be downloaded here:

http://stavi.sh/downloads/accumulo-1.4.0-mapr.tar.gz
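The two accumulo-site.xml edits from steps 8 and 9 can also be scripted rather than made by hand. The following is a rough sketch using only the Python standard library; the helper name set_property is hypothetical, and the sample document is a trimmed stand-in that assumes the file's root element is <configuration> with <property> children, as in the fragments above.

```python
import xml.etree.ElementTree as ET


def set_property(root, name, value):
    """Update the named <property> under root, or append it if it is missing."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_element = prop.find("value")
            if value_element is None:
                value_element = ET.SubElement(prop, "value")
            value_element.text = value
            return prop
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


# Trimmed stand-in for conf/accumulo-site.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>instance.zookeeper.host</name>
    <value>localhost:2181</value>
  </property>
</configuration>"""

root = ET.fromstring(SAMPLE)
set_property(root, "instance.zookeeper.host", "localhost:5181")  # step 8
set_property(root, "tserver.port.client", "9996")                # step 9
print(ET.tostring(root, encoding="unicode"))
```

To apply this to the real file, ET.parse and tree.write would replace the in-memory sample; keeping a backup copy of the original first is a sensible precaution.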

Notes:
* Copying the native libs from RC2 is not necessary if you have C++ development tools installed (make, a compiler, system headers, etc.). The build script in src/assemble builds the native libs.

* The tablet server's Thrift port does not have to be hard-coded; you can use the following to have it find an open port:

<property>
 <name>tserver.port.search</name>
 <value>true</value>
</property>

* A class cast exception occurs when Accumulo accesses MapR HDFS. It has been fixed (ACCUMULO-476); the fix will be in 1.4.1, not 1.4.0.
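Step 9 moves the tablet server to port 9996 because something else on the VM already holds 9997. Before hard-coding a port, it can help to check whether it is taken. This is a small sketch using only the standard socket module; the function name port_in_use and the choice of localhost are assumptions for illustration.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (9996, 9997):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

Run it on the VM before starting Accumulo. A port reported as in use at that point belongs to some other service.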

 


Synchronous microcontroller communication interfaces. Everything you wanted to know, but were afraid to ask.

Synchronous interfaces (suitable for peripherals on the same board or <1 m away):
I2C (Inter-Integrated Circuit, "I squared C")
SPI (Serial Peripheral Interface bus)
U(S)ART (Universal Synchronous/Asynchronous Receiver/Transmitter), aka Serial Communication Interface (SCI); the (S) marks the variant that can also run synchronously

I2C
* true bus, requires only two pins for multiple devices (address space)
* 100 kbit/s, 400 kbit/s, 1 Mbit/s
* half-duplex
* high noise sensitivity, lower data integrity
* slave ack / rx confirmation

SPI
* not a true bus, N devices require 3+N pins
* 10 Mbit/s
* full-duplex capability
* any message size, suited to longer data streams (I2C is limited to 8-bit words)

USART
* standard serial connection (also over USB)
* few hundred bits per second (bps) up to 1.5Mbps
* error checking with parity bit
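
Two of the points above lend themselves to a quick worked example: the pin counts for I2C versus SPI, and the parity check used on a serial link. The sketch below is illustrative only. It assumes even parity (odd parity is equally common), and the function names are made up for the example.

```python
def even_parity_bit(byte):
    """Return the bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, parity_bit):
    """Check a received byte against the even-parity bit sent with it."""
    return even_parity_bit(byte) == parity_bit


def pins_required(interface, devices):
    """Host pins needed to talk to a number of peripherals."""
    if interface == "i2c":
        return 2            # shared clock and data lines, devices are addressed
    if interface == "spi":
        return 3 + devices  # shared clock, MOSI, MISO, plus one select line each
    raise ValueError(f"unknown interface: {interface}")


sent = 0b01000001              # ASCII 'A', which has two 1 bits
bit = even_parity_bit(sent)    # parity bit transmitted alongside the byte
corrupted = sent ^ 0b00000100  # a single bit flipped in transit

print(parity_ok(sent, bit), parity_ok(corrupted, bit))   # True False
print(pins_required("i2c", 4), pins_required("spi", 4))  # 2 7
```

A single parity bit catches any odd number of flipped bits but misses an even number, which is why it counts as error checking rather than error correction.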


Unit Testing Arduino With Python

Uses Python to dynamically load Arduino hex images for unit testing. Supports Windows and OS X via a configuration file. To find out where the Arduino IDE places your hex images, hold the Shift key while you click the build (play) button. You can hold Shift for upload as well to find out where your avrdude (the command-line upload utility) is located. Alternatively, you can look at the included config files and substitute your install location (I am using Arduino 0020).

http://github.com/toddstavish/Python-Arduino-Unit-Testing
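
As a rough illustration of the idea, and not the code in the repository above, loading a hex image from Python comes down to building an avrdude command line and running it before each test. The helper names, the default part, programmer, and baud rate, and all paths below are assumptions; use the values the IDE prints when you hold Shift during upload.

```python
import subprocess


def avrdude_command(avrdude, conf, port, hex_path,
                    part="atmega328p", programmer="arduino", baud=57600):
    """Build the avrdude argument list that flashes hex_path to the board."""
    return [
        avrdude,
        "-C", conf,                     # avrdude configuration file
        "-p", part,                     # target microcontroller
        "-c", programmer,               # programmer / bootloader protocol
        "-P", port,                     # serial port the board is attached to
        "-b", str(baud),                # upload baud rate
        "-D",                           # skip the automatic chip erase
        "-U", f"flash:w:{hex_path}:i",  # write an Intel hex image to flash
    ]


def upload(command):
    """Run avrdude and raise CalledProcessError if the upload fails."""
    subprocess.run(command, check=True)


# Hypothetical paths; a per-platform config file would normally supply these.
command = avrdude_command("avrdude", "avrdude.conf", "/dev/ttyUSB0", "sketch.hex")
print(" ".join(command))
```

A test case would call upload(command) in its setup step and then talk to the board over the serial port to check the sketch's behavior.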


Algerian Cities Spreadsheet, Latitude / Longitude, And Map

Spreadsheet:
http://www.stavi.sh/downloads/algerian-cities-lat-long.xls

Google Docs:
https://spreadsheets0.google.com/ccc?key=tGdADtMhFEl1a10ruijx2Kw&hl=en

Socrata:

Powered by Socrata


Afghan Cities Spreadsheet, Map, Latitude And Longitude

Spreadsheet:
http://www.stavi.sh/downloads/Cities_Within_Afghanistan_Latitude_Longitude.xls

Google Docs:
https://spreadsheets.google.com/ccc?key=0At92oU3FPZ4QdHhfMFJ0a2lUcXNuN0J2bHRR...

Socrata:

Powered by Socrata


Real-time Relationship Analytics From Large-scale Graph Processing

Cassandra excels at storing large, active, decentralized datasets. Additionally, Cassandra’s rich data model allows efficient use for many applications beyond simple associative arrays. One interesting application is the processing of large-scale graph structures.

I have devised a graph application layer that uses InfiniteGraph to extract and process social network analysis data stored in Cassandra. The technical benefits of this social-graph-extract layer and its graph-oriented processing have been articulated separately.

Social network analysis is one application of a more general category, relationship analytics, as defined by Curt Monash. The relationship analytics problem domain maps well to the unique features of the Cassandra-InfiniteGraph hybrid system:

  • dedicated vertex/edge API
  • data can be clustered according to vertex/edge proximity
  • disk-based/memory-centric access
  • peer-to-peer communication from InfiniteGraph node to Cassandra node
  • bidirectional updates between raw Cassandra data and InfiniteGraph analytics
  • parallel streaming and caching from InfiniteGraph
  • modeling flexibility to support a variety of sources
  • redundancy and high-availability
  • precision and speed for graph analytics
  • finding extremely long paths, all paths, unknown paths, or paths of nontrivial or indeterminate length

Current business problems that can utilize these features:

  • analyzing high-frequency trading
  • discovering high degrees of mutual interconnection in social networks
  • data mining subtle retail correlations
  • product recommendation engines
  • determining terrorist or criminal behavior inferred from known relationships
  • finding a pattern of relationships for fraud detection
  • investigating the directed relationships between proteins and genes
  • checking which entity has the shortest average connection to a group of others for cyber security (botnet controller)
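
To make the last item concrete, here is a minimal sketch of finding the entity with the shortest average connection to a group, using breadth-first search over an in-memory adjacency list. It is plain Python rather than the Cassandra / InfiniteGraph API, and the toy network and function names are assumptions for illustration.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search: hop count from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[vertex] + 1
                queue.append(neighbor)
    return dist


def closest_to_group(graph, candidates, group):
    """Return (candidate, average hops) with the lowest average distance to group."""
    best = None
    for candidate in candidates:
        dist = distances_from(graph, candidate)
        if not all(member in dist for member in group):
            continue  # skip candidates that cannot reach the whole group
        average = sum(dist[member] for member in group) / len(group)
        if best is None or average < best[1]:
            best = (candidate, average)
    return best


# Hypothetical undirected network stored as adjacency lists.
GRAPH = {
    "ctrl": ["bot1", "bot2", "bot3"],
    "bot1": ["ctrl", "host"],
    "bot2": ["ctrl"],
    "bot3": ["ctrl"],
    "host": ["bot1"],
}

print(closest_to_group(GRAPH, ["ctrl", "host"], ["bot1", "bot2", "bot3"]))
# ('ctrl', 1.0)
```

In the toy network the controller sits one hop from every bot, so it wins with an average of 1.0. On a large graph the same question is where a dedicated graph store's traversal support matters.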

The working codebase for this Cassandra / InfiniteGraph integration can be retrieved from GitHub. Forking of the main project is welcome (including downstream updates). If you have any questions or suggestions, please contact @toddstavish.



Python Arduino Serial Port Text Communication (Send from PDE; Receive via PySerial)

Test code to make sure Python and Arduino are working properly. The PDE sketch sends a string out of the serial port. The Python code uses pyserial to receive it and then prints the message to standard out. Pick it up on GitHub via “git clone git://github.com/toddstavish/Python-Arduino-Serial-Text-Send-Receive.git”. If you have any questions or suggestions, you can find me here.
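
For readers who want the shape of the receiving side without cloning the repository, here is a minimal sketch. It is not the repository's code: the device path, the 9600 baud rate, and the function names are assumptions, and the sketch on the board is assumed to end its message with a newline.

```python
import sys


def read_message(port):
    """Read one newline-terminated line from a serial-like object and decode it."""
    raw = port.readline()  # bytes; empty if the read timed out
    return raw.decode("ascii", errors="replace").strip()


def main(device, baud=9600):
    # pyserial is imported here so read_message can be tried without hardware.
    import serial

    with serial.Serial(device, baud, timeout=2) as port:
        print(read_message(port))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the serial device as the first argument, e.g. /dev/ttyUSB0 or COM3.
    main(sys.argv[1])
```

Because read_message only needs an object with a readline method, it can be exercised against an in-memory byte stream before any board is plugged in.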


Country Spreadsheet List With Latitude, Longitude, ISO 3166 Codes, and Flag Images

Spreadsheet to help create heatmaps in Socrata or Google geomap visualizations.

Spreadsheet download:

Spreadsheet with flag images (Countries.xls)

Google Docs:

Country List Latitude Longitude ISO 3166 Codes

Socrata:

Country List ISO 3166 Codes Latitude Longitude

Scribd:

Country List Latitude Longitude Flags ISO 3166 Codes

Previews:

Country List Latitude Longitude Flags ISO 3166 Codes
