In this post I want to share the Java code we used to convert the Open Library dump. Open Library has the largest freely available book catalog dataset I know of. It is the most complete catalog you can get: I could find even the most exotic philosophy of science books I have to read.
The Java source code is here: hu.sztaki.oltools.tar (You will need a few jars to compile, but nothing exotic.)
You can download all the data in JSON format, with one JSON record per line; the dump is about 40-50M lines long. However, before you download anything, please note that they provide an API to search and access their data, and you can build your stuff upon it. Naturally, if the API is enough for you, you should use it, as they keep the database behind it up to date.
But sometimes you want to do research, as we did. This data is a gold mine for people dealing with Named Entity Recognition. There is also a nice subject ontology in it, for which you can build statistics, etc.
If your case is similar, please read on.
So a typical book record looks like this:
/type/work /works/OL10017586W 1 2009-12-11T02:00:04.646565 {"title": "Enemies Within Us", "created": {"type": "/type/datetime", "value": "2009-12-11T02:00:04.646565"}, "last_modified": {"type": "/type/datetime", "value": "2009-12-11T02:00:04.646565"}, "latest_revision": 1, "key": "/works/OL10017586W", "authors": [{"type": "/type/author_role", "author": {"key": "/authors/OL3985980A"}}], "type": {"key": "/type/work"}, "revision": 1}
As you can see, the author (and also the subject, if it is known) is referenced by an id. For authors and subjects there are entries of type /type/author and /type/subject in the dump. I wanted these references resolved, producing one XML element for each book with all the data in it. This way I could feed the data into a Solr instance.
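To make the record layout concrete, here is a minimal sketch of pulling one dump line apart: four leading fields (type, key, revision, timestamp) followed by the JSON payload. The class DumpLineDemo and its methods are hypothetical names and not part of hu.sztaki.oltools. The regular expression stands in for a real JSON parser only to keep the example free of extra jars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of hu.sztaki.oltools.
class DumpLineDemo {
    // Matches author references such as "key": "/authors/OL3985980A".
    private static final Pattern AUTHOR_KEY =
            Pattern.compile("\"key\"\\s*:\\s*\"(/authors/[^\"]+)\"");

    /** Splits a dump line into type, key, revision, timestamp and JSON payload. */
    static String[] splitFields(String line) {
        // The first four fields contain no whitespace, so a limit of 5
        // keeps the JSON payload (which does contain spaces) in one piece.
        String[] fields = line.split("\\s+", 5);
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields, got " + fields.length);
        }
        return fields;
    }

    /** Collects the author ids referenced in the JSON payload. */
    static List<String> authorKeys(String json) {
        List<String> keys = new ArrayList<>();
        Matcher m = AUTHOR_KEY.matcher(json);
        while (m.find()) {
            keys.add(m.group(1));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Shortened version of the record shown above.
        String line = "/type/work /works/OL10017586W 1 2009-12-11T02:00:04.646565 "
                + "{\"title\": \"Enemies Within Us\", \"authors\": [{\"type\": \"/type/author_role\", "
                + "\"author\": {\"key\": \"/authors/OL3985980A\"}}], \"type\": {\"key\": \"/type/work\"}}";
        String[] fields = splitFields(line);
        System.out.println(fields[0] + " " + fields[1] + " -> " + authorKeys(fields[4]));
    }
}
```

A regex like this would also match the pattern if it appeared inside a title or description, so for real processing a proper JSON library is the safer choice.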
The first step is to convert the JSON to XML. To do that, use the JSON2XML class as follows:
JSON2XML ol_dump_latest.txt ol_dump_latest.xml
Now you have converted the text to a very simple but large (currently 61GB) XML document.
During the process you will get character encoding errors. I tried to work around them, but with no luck. The problem breaks a very small share of the entries (in the region of 0.01%), which was acceptable for me.
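One possible mitigation, offered only as a sketch: the actual cause of these errors is not known, but a frequent source of trouble when writing arbitrary text into XML is characters that XML 1.0 does not allow (most control characters, unpaired surrogates). Assuming that is at least part of the problem, a filter like the following could be applied to every string before it is written out. XmlCharFilter is a hypothetical class, not part of hu.sztaki.oltools.

```java
// Hypothetical helper; assumes the errors come from characters that are illegal in XML 1.0.
class XmlCharFilter {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every illegal code point with U+FFFD, the Unicode replacement character. */
    static String sanitize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> out.appendCodePoint(isLegalXmlChar(cp) ? cp : 0xFFFD));
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0001 is a control character that XML 1.0 forbids.
        System.out.println(sanitize("Enemies\u0001Within Us"));
    }
}
```

Replacing rather than dropping the offending characters keeps the damage visible in the output, so affected entries can still be found later.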
The next step is to loop through this document and read all the authors and subjects, then start the loop again and dump out the books with the names of the authors and subjects substituted for their ids.
Main /mnt/workdir2/LibraryOfCongress/LibraryOfCongress/ol_dump_latest.xml
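The two-pass idea can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions, not the code in the tarball: the Entry class, its fields, and the author name "Example Author" are made up for the example, and a real run would stream the XML twice instead of holding a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-pass resolution; not the actual Main class.
class TwoPassDemo {
    /** Simplified stand-in for one dump entry; the field names are assumptions. */
    static final class Entry {
        final String type;       // e.g. /type/work, /type/author, /type/subject
        final String key;        // e.g. /authors/OL3985980A
        final String label;      // title of a work, or name of an author/subject
        final List<String> refs; // ids referenced by a work
        Entry(String type, String key, String label, List<String> refs) {
            this.type = type; this.key = key; this.label = label; this.refs = refs;
        }
    }

    /** Pass 1: build a lookup table from id to name for authors and subjects. */
    static Map<String, String> collectNames(List<Entry> entries) {
        Map<String, String> names = new HashMap<>();
        for (Entry e : entries) {
            if (e.type.equals("/type/author") || e.type.equals("/type/subject")) {
                names.put(e.key, e.label);
            }
        }
        return names;
    }

    /** Pass 2: emit each work with its references replaced by names. */
    static List<String> resolveWorks(List<Entry> entries, Map<String, String> names) {
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.type.equals("/type/work")) continue;
            List<String> resolved = new ArrayList<>();
            for (String ref : e.refs) {
                // Keep the raw id when the referenced entry is missing from the dump.
                resolved.add(names.getOrDefault(ref, ref));
            }
            out.add(e.label + " | " + String.join(", ", resolved));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("/type/work", "/works/OL10017586W", "Enemies Within Us",
                        Arrays.asList("/authors/OL3985980A")),
                // The name below is a placeholder, not the real author.
                new Entry("/type/author", "/authors/OL3985980A", "Example Author",
                        Collections.<String>emptyList()));
        System.out.println(resolveWorks(entries, collectNames(entries)));
    }
}
```

Two passes are needed because a work can appear in the dump before the author it refers to, as in the example list above. The lookup table for all authors and subjects has to fit in memory, which is worth keeping in mind at this scale.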
After this step you will find your files in a directory called “entries”. The files are arranged in a directory hierarchy divided into subdirs of 100K files, which are further split into 1K sub-subdirs.
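A layout like this can be derived from a running entry counter. The sketch below is one reading of the description, assuming each top-level subdir covers 100K consecutive entries and each sub-subdir holds 1K of them; the exact naming used by the real code may differ. ShardedPathDemo and pathFor are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch; the directory naming in the real code may differ.
class ShardedPathDemo {
    static final int FILES_PER_TOP_DIR = 100_000; // assumption: 100K entries per subdir
    static final int FILES_PER_SUB_DIR = 1_000;   // assumption: 1K entries per sub-subdir

    /** Maps the n-th entry (0-based) to entries/<top>/<sub>/<n>.xml. */
    static Path pathFor(long n) {
        long top = n / FILES_PER_TOP_DIR;
        long sub = (n % FILES_PER_TOP_DIR) / FILES_PER_SUB_DIR;
        return Paths.get("entries", Long.toString(top), Long.toString(sub), n + ".xml");
    }

    public static void main(String[] args) {
        System.out.println(pathFor(0));
        System.out.println(pathFor(123_456));
    }
}
```

Capping the number of files per directory keeps directory listings manageable, though it does not reduce the total number of inodes needed.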
One thing to be aware of: if you use this code unchanged you will get about 40M files. On an ext3 or ext4 file system with default settings you might run out of inodes!
Good luck! If you need help, just contact me.