NoSQL Solution for Persisting Graphs at Scale


There are two general types of containers for storing graphs:

  1. true graph databases: e.g., Neo4J, agamemnon, GraphDB, and AllegroGraph; these not only store a graph but also understand what a graph is, so you can query the database directly, e.g., how many nodes lie on the shortest path between node X and node Y?

  2. static graph containers: Twitter's MySQL-adapted FlockDB is the best-known exemplar here. These DBs can store and retrieve graphs just fine, but to query the graph itself, you first have to retrieve the graph from the DB and then use a library (e.g., Python's excellent NetworkX) to run the query.

The redis-based graph container I discuss below is in the second category, though apparently redis is also well-suited for containers in the first category, as evidenced by redis-graph, a remarkably small python package for implementing a graph database in redis.

redis will work beautifully here.

Redis is a heavy-duty, durable data store suitable for production use, yet it's also simple enough to use for command-line analysis.

Redis differs from other databases in that it has multiple data structure types; the one I would recommend here is the hash data type. Using this redis data structure allows you to very closely mimic a "list of dictionaries", a conventional schema for storing graphs, in which each item in the list is a dictionary of edges keyed to the node from which those edges originate.
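To make that schema concrete, here is a minimal pure-Python sketch with no redis involved. The three-node data and the helper name `neighbors` are made up for illustration; each outer key plays the role of a redis hash key and each inner dictionary plays the role of that hash's fields.

```python
# Each node maps to a dictionary of flags:
# 1 means "edge to that node", 0 means "no edge".
graph = {
    "a": {"a": 0, "b": 1, "c": 1},
    "b": {"a": 1, "b": 0, "c": 0},
    "c": {"a": 1, "b": 0, "c": 0},
}

def neighbors(adjacency, node):
    """Return the sorted list of nodes that `node` has an edge to."""
    # int() also copes with the string flags ("0"/"1") a database may return
    return sorted(target for target, flag in adjacency[node].items() if int(flag))

print(neighbors(graph, "a"))
```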

You first need to install redis and the python client. The DeGizmo Blog has an excellent "up-and-running" tutorial which includes a step-by-step guide on installing both.

Once redis and its python client are installed, start a redis server, which you do like so:

  • cd to the directory in which you installed redis (/usr/local/bin on *nix if you installed via make install); next

  • type redis-server at the shell prompt, then press enter

You should now see the server log tailing in your shell window.

    >>> import numpy as NP
    >>> import networkx as NX
    >>> # start a redis client & connect to the server:
    >>> from redis import StrictRedis as redis
    >>> r1 = redis(db=1, host="localhost", port=6379)

In the snippet below, I have stored a four-node graph; each line calls hmset on the redis client and stores one node and the edges connected to that node ("0" => no edge, "1" => edge). (In practice, of course, you would abstract these repetitive calls in a function; here I'm showing each call because it's likely easier to understand that way.)

    >>> r1.hmset("n1", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
          True
    >>> r1.hmset("n2", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
          True
    >>> r1.hmset("n3", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
          True
    >>> r1.hmset("n4", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
          True
    >>> # retrieve the edges for a given node:
    >>> r1.hgetall("n2")
          {'n1': '1', 'n2': '0', 'n3': '0', 'n4': '1'}
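The repetitive calls could be wrapped roughly like this. It is only a sketch: `store_graph` is a hypothetical helper name, and it assumes a client that accepts hset(name, mapping=...), which recent redis-py releases offer as the replacement for the deprecated hmset.

```python
def store_graph(client, adjacency):
    """Write one hash per node and return the number of nodes written.

    `client` is assumed to expose hset(name, mapping=...); with an older
    redis-py client you would call client.hmset(node, edges) instead.
    """
    count = 0
    for node, edges in adjacency.items():
        client.hset(node, mapping=edges)  # one hash per node, one field per target
        count += 1
    return count
```

With the client from the session above, this would be called as store_graph(r1, {"n1": {"n1": 0, "n2": 1}, "n2": {"n1": 1, "n2": 0}}).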

Now that the graph is persisted, retrieve it from the redis DB as a NetworkX graph.

There are many ways to do this; below I did it in two steps:

  1. extract the data from the redis database into an adjacency matrix, implemented as a 2D NumPy array; then

  2. convert that directly to a NetworkX graph using a NetworkX built-in function:

Reduced to code, these two steps are:

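    >>> # fix a node order so that rows and columns of the matrix line up
    >>> # (neither r1.keys() nor hgetall() guarantees any ordering):
    >>> nodes = sorted(r1.keys("*"))
    >>> AM = NP.array([[int(v) for v in r1.hmget(node, nodes)] for node in nodes])
    >>> # now convert this adjacency matrix back to a networkx graph
    >>> # (NetworkX 3.x removed from_numpy_matrix; use NX.from_numpy_array there):
    >>> G = NX.from_numpy_matrix(AM)
    >>> # verify that G in fact holds the original graph:
    >>> type(G)
          <class 'networkx.classes.graph.Graph'>
    >>> G.nodes()
          [0, 1, 2, 3]
    >>> G.edges()
          [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]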
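The extraction step depends on rows and columns sharing one node order. Below is a minimal pure-Python sketch of that idea, independent of redis and NumPy. The function name `to_adjacency_matrix` and the sample `rows` dictionary are illustrative; `rows` stands in for what you would collect by calling hgetall once per node, assuming string keys (for example from a client created with decode_responses=True).

```python
def to_adjacency_matrix(rows):
    """Turn {node: {node: flag}} into (ordered node list, nested-list matrix)."""
    nodes = sorted(rows)  # one fixed order, used for both rows and columns
    matrix = [[int(rows[src].get(dst, 0)) for dst in nodes] for src in nodes]
    return nodes, matrix

# Flags are strings here because that is how redis hands hash values back.
rows = {
    "n2": {"n1": "1", "n2": "0", "n3": "0"},
    "n1": {"n1": "0", "n2": "1", "n3": "1"},
    "n3": {"n1": "1", "n2": "0", "n3": "0"},
}
nodes, matrix = to_adjacency_matrix(rows)
print(nodes)
print(matrix)
```

The nested list can be handed to NumPy's array constructor and from there to NetworkX, as in the session above.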

When you end a redis session, you can shut down the server from the client like so:

>>> r1.shutdown()

redis saves to disk just before it shuts down so this is a good way to ensure all writes were persisted.

So where is the redis DB? It is stored in the default location with the default file name: dump.rdb, in the directory from which the server was started (the default dir setting is ./), unless a config file says otherwise.
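If you are unsure where a running server will write its snapshot, you can ask the server itself. A minimal sketch, assuming a redis-py client whose config_get returns a dictionary of settings; the helper name `snapshot_path` is made up for illustration.

```python
import os

def snapshot_path(client):
    """Return the path of the RDB file the connected server writes to."""
    # CONFIG GET dir / dbfilename report the server's current settings
    directory = client.config_get("dir")["dir"]
    filename = client.config_get("dbfilename")["dbfilename"]
    return os.path.join(directory, filename)
```

With the client from the session above, this would be called as snapshot_path(r1).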

To change this, edit the redis.conf file (included with the redis source distribution); go to the line starting with:

    # The filename where to dump the DB
    dbfilename dump.rdb

change dump.rdb to anything you wish, but leave the .rdb extension in place.

Next, to change the file path, find this line in redis.conf:

# Note that you must specify a directory here, not a file name

The line below that is the directory location for the redis database. Edit it so that it gives the location you want. Save your revisions and rename this file, but keep the .conf extension. You can store this config file anywhere you wish; just provide the full path and name of this custom config file on the same line when you start a redis server.

So the next time you start a redis server, you must do it like so (from the shell prompt):

    $> cd /usr/local/bin    # or the directory in which you installed redis
    $> redis-server /path/to/redis.conf

Finally, the Python Package Index lists a package specifically for implementing a graph database in redis. The package is called redis-graph and i have not used it.


I would be interested to see the best way of using the hard drive. In the past I have made multiple graphs and saved them as .dot files. Then kind of mixed some of them in memory somehow. Not the best solution though.

    from random import random
    import networkx as nx

    def make_graph():
        G = nx.DiGraph()
        N = 10
        # make a random graph
        for i in range(N):
            for j in range(i):
                if 4 * random() < 1:
                    G.add_edge(i, j)
        # newer NetworkX releases expose write_dot/read_dot under
        # nx.drawing.nx_pydot (or nx.drawing.nx_agraph) instead of the top level
        nx.write_dot(G, "savedgraph.dot")
        return G

    try:
        G = nx.read_dot("savedgraph.dot")
    except Exception:
        # This will fail if you don't use the same seed but have created the
        # graph in the past. You could use the Singleton design pattern here.
        G = make_graph()

    print(G.adj)
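On the seed issue mentioned in that comment: one way to get the same random graph on every run is to draw the edges from a seeded generator. This is only a sketch using the standard library; `random_edges` and its parameters are hypothetical names, and p=0.25 mirrors the 4*random()<1 test above.

```python
from random import Random

def random_edges(n, p, seed):
    """Return a reproducible list of directed edges (i, j) with j < i."""
    rng = Random(seed)  # private generator, so the global random state is untouched
    return [(i, j) for i in range(n) for j in range(i) if rng.random() < p]

edges = random_edges(10, 0.25, seed=42)
print(len(edges))
```

The resulting list can be loaded into a graph with G.add_edges_from(edges).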