Parse large RDF in Python

python xml sax rdf

If performance is your main concern, I'd recommend Raptor with the Redland Python bindings. Raptor is written in C and is much faster than RDFLib, and the Python bindings let you use it without writing any C.

Another tip for improving performance: avoid parsing RDF/XML and use another RDF serialization such as Turtle or N-Triples. N-Triples in particular parses much faster than RDF/XML because its syntax is simpler.

You can convert your RDF/XML to N-Triples using rapper, a command-line tool that ships with Raptor:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
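If you would rather drive that conversion from a Python script, something like the following should work. It is only a sketch: the helper names `build_rapper_command` and `convert_to_ntriples` are made up for illustration, and it assumes the rapper executable is installed and on your PATH.

```python
import subprocess


def build_rapper_command(source_path):
    """Return the argument list for the rapper call shown above."""
    return ["rapper", "-i", "rdfxml", "-o", "ntriples", source_path]


def convert_to_ntriples(source_path, target_path):
    """Run rapper and write its standard output to target_path.

    Assumption: the rapper executable is installed and on PATH.
    check=True raises CalledProcessError if rapper exits with an error.
    """
    with open(target_path, "wb") as target:
        subprocess.run(build_rapper_command(source_path), stdout=target, check=True)
```

Calling `convert_to_ntriples("YOUR_FILE.rdf", "YOUR_FILE.ntriples")` would then do the same thing as the shell redirect.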

The N-Triples file will contain triples like:

<s1> <p> <o> .
<s2> <p2> "literal" .

Parsers tend to handle this one-triple-per-line structure very efficiently. It is also easier on memory than RDF/XML, because each line is a self-contained triple and the parser does not have to keep track of nested XML elements.
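To give a feel for why this format is cheap to process, here is a rough sketch of reading it one line at a time in plain Python. The `iter_triples` name and the regular expression are my own simplification, not part of Raptor or RDFLib, and the pattern does not cover every corner of the N-Triples grammar, so use a real parser for anything serious.

```python
import re

# Simplified pattern: subject, predicate, then everything up to the final " ."
# Assumption: good enough for illustration, not a full N-Triples grammar.
TRIPLE = re.compile(r'^\s*(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')


def iter_triples(lines):
    """Yield (subject, predicate, object) strings, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            yield match.groups()


sample = [
    '<s1> <p> <o> .',
    '<s2> <p2> "literal" .',
]
triples = list(iter_triples(sample))
```

Because `iter_triples` is a generator, you can pass it an open file object and only one line is held in memory at a time.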

The code below is a simple example using the Redland Python bindings:

import RDF

# The parser name can be "ntriples", "turtle", "rdfxml", ...
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")

for triple in model:
    print(triple.subject, triple.predicate, triple.object)

The base URI is used to resolve any relative URIs that appear inside your RDF document. See the Python Redland bindings API documentation for details.
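If the role of the base URI is unclear, this small illustration with the standard library's `urljoin` shows the general idea. It is not how Redland resolves URIs internally, and the base URI with a `/data/` path is an invented example.

```python
from urllib.parse import urljoin

# Hypothetical base URI, chosen only for illustration
base_uri = "http://your_base_uri.org/data/"

# A relative URI is resolved against the base
resolved_relative = urljoin(base_uri, "alice")

# A fragment-only reference keeps the base path
resolved_fragment = urljoin(base_uri, "#me")

# An absolute URI is left untouched
resolved_absolute = urljoin(base_uri, "http://example.org/bob")
```

So a document that only contains absolute URIs is not affected by the base URI you pass in.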

If you don't care much about performance, use RDFLib; it is simple and easy to use.

I second the suggestion to try rdflib. It's nice for quick prototyping, and the BerkeleyDB backend store scales pretty well into the millions of triples if you don't want to load the whole graph into memory.

import rdflib

# "Sleepycat" is the BerkeleyDB-backed store; newer rdflib releases may register it as "BerkeleyDB"
graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("big.rdf")

# print out all the triples in the graph
for subject, predicate, obj in graph:
    print(subject, predicate, obj)

In my experience, SAX is great for performance but it's a pain to write. Unless I am having issues, I tend to avoid programming with it.
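For a sense of what SAX code looks like, here is a minimal sketch using the standard library's `xml.sax`. It only collects the `rdf:about` value of each `rdf:Description` element; the `SubjectCollector` class, the `list_subjects` helper, and the sample document are invented for this example. Handling real RDF/XML (typed nodes, nested descriptions, and so on) takes a lot more state tracking, which is where the pain comes from.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class SubjectCollector(ContentHandler):
    """Collect the rdf:about value of every rdf:Description element."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (namespace_uri, localname) tuple
        if name == (RDF_NS, "Description"):
            about = attrs.get((RDF_NS, "about"))
            if about is not None:
                self.subjects.append(about)


def list_subjects(stream):
    """Parse a byte stream event by event without building a tree."""
    handler = SubjectCollector()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.subjects


# Tiny made-up document, only for demonstration
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/s1"><ex:p>literal</ex:p></rdf:Description>
  <rdf:Description rdf:about="http://example.org/s2"/>
</rdf:RDF>"""

subjects = list_subjects(io.BytesIO(SAMPLE))
```

The parser never holds the whole document in memory, which is the performance upside; the downside is that you have to rebuild any structure you need yourself inside the handler callbacks.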

"Very large" depends on how much RAM the machine has. Assuming your computer has more than 1 GB of memory, lxml, pyxml, or some other library will be fine for 200 MB files.
