Beta test of Sztakipedia toolbar starts

After a lot of work, the Sztakipedia public beta test finally starts today. I have made an intro video for the occasion, as well as an “Installation” and a “Technical Details” video, which you can find in the “Screencasts” menu.

So the intro video is here:

Of course, there are still plenty of issues with the system, but I hope that even at this stage it is more of a benefit than a drawback for Wiki editors :)

The Open Library JSON bulk to XML tool

In this post I want to share the Java code with which we converted the Open Library dump. Open Library has the largest freely available book catalog dataset I know of. It is the most complete thing you can get… I could even find the most exotic philosophy of science books I have to read.

The Java source code is here: hu.sztaki.oltools.tar (you will need a few jars to compile it, but nothing exotic).

You can download all the data in a JSON format that contains one JSON record per line; the dump is about 40-50M lines. However, before you download anything, please note that they provide an API to search and access their data, and you can build your stuff on top of it. Naturally, if the API is enough for you, you should use that, as they keep the database behind it up to date.

But sometimes you want to do research, as we did. This data is a gold mine for people dealing with Named Entity Recognition, and there is also a nice subject ontology in it over which you can build statistics, etc. If your case is similar, please read on.

So a typical book record looks like this:

/type/work /works/OL10017586W 1 2009-12-11T02:00:04.646565 {"title": "Enemies Within Us", "created": {"type": "/type/datetime", "value": "2009-12-11T02:00:04.646565"}, "last_modified": {"type": "/type/datetime", "value": "2009-12-11T02:00:04.646565"}, "latest_revision": 1, "key": "/works/OL10017586W", "authors": [{"type": "/type/author_role", "author": {"key": "/authors/OL3985980A"}}], "type": {"key": "/type/work"}, "revision": 1}

As you can see, the author (and also the subject, if it is known) is referenced by an id. For authors and subjects there are entries of type /type/author and /type/subject in the dump. I wanted these references resolved, so that the output is one XML element for each book with all the data in it. This way I could feed the data into a Solr instance.
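
To get a feel for the format before touching the tools, here is a minimal sketch (not part of hu.sztaki.oltools) that streams the dump line by line and prints the title plus the raw author keys of each work record. It assumes the columns are tab-separated with the JSON payload as the last column, and uses the org.json library:

import java.io.BufferedReader;
import java.io.FileReader;
import org.json.JSONArray;
import org.json.JSONObject;

// Minimal sketch: stream the dump and print title + author keys of /type/work records.
public class DumpReaderSketch {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            // Assumed layout: type, key, revision, timestamp, JSON payload, tab-separated.
            String[] cols = line.split("\t");
            if (!"/type/work".equals(cols[0])) continue;
            JSONObject record = new JSONObject(cols[cols.length - 1]);
            String title = record.optString("title", "(untitled)");
            JSONArray authors = record.optJSONArray("authors");
            if (authors == null) continue;
            for (int i = 0; i < authors.length(); i++) {
                // The author is only a reference like /authors/OL3985980A here;
                // the actual name must be resolved from a /type/author record.
                String key = authors.getJSONObject(i).getJSONObject("author").getString("key");
                System.out.println(title + " -> " + key);
            }
        }
        in.close();
    }
}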

The first step is to convert the JSON to XML. To do that, use the JSON2XML class the following way:

JSON2XML ol_dump_latest.txt ol_dump_latest.xml


Now you have converted the text to a very simple but large (currently 61 GB) XML document.
During the process you will get character encoding errors. I have made efforts to overcome this, but with no luck. The problem breaks a very small percentage (in the region of 0.01%) of the entries, which was acceptable for me.
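
If you are curious what this step roughly does, here is a simplified sketch of the idea (not the actual JSON2XML class; it uses org.json's XML helper, so the real tool's output format may differ): parse each line's JSON payload, serialize it as an entry element, and simply skip the few lines that fail.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.json.JSONObject;
import org.json.XML;

// Simplified sketch of the JSON-to-XML step: one <entry> element per record,
// broken lines are counted and skipped.
public class Json2XmlSketch {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
        Writer out = new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        out.write("<entries>\n");
        String line;
        long broken = 0;
        while ((line = in.readLine()) != null) {
            try {
                String json = line.substring(line.indexOf('{')); // the JSON is the last column
                out.write(XML.toString(new JSONObject(json), "entry"));
                out.write("\n");
            } catch (Exception e) {
                broken++; // a tiny fraction of lines has encoding or JSON problems
            }
        }
        out.write("</entries>\n");
        out.close();
        in.close();
        System.err.println("Skipped " + broken + " broken lines");
    }
}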

The next step is to loop through this document, read all the authors and subjects, then start the loop again and dump out the books with the names of the authors and subjects substituted in.


Main /mnt/workdir2/LibraryOfCongress/LibraryOfCongress/ol_dump_latest.xml

After this step you will find your files in a directory called “entries”. The files are organized into a directory hierarchy: subdirectories of 100K files each, which are further split into sub-subdirectories of 1K files.
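
The bucketing itself is simple. Here is a sketch of the idea (the real tool may compute the paths slightly differently) for turning a running entry counter into a path that keeps every directory comfortably small:

import java.io.File;

// Sketch of the output layout: no directory gets more than 100K entries,
// and each of those is split into 1K-entry sub-buckets.
public class EntryPathSketch {
    static File pathFor(long n) {
        long bucket = n / 100000;             // top-level dir: 100K entries each
        long subBucket = (n % 100000) / 1000; // second level: 1K entries each
        return new File("entries/" + bucket + "/" + subBucket + "/" + n + ".xml");
    }

    public static void main(String[] args) {
        File f = pathFor(12345678L);
        f.getParentFile().mkdirs();           // creates entries/123/45 on demand
        System.out.println(f);                // entries/123/45/12345678.xml
    }
}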

One thing to be aware of: if you use this code unchanged you will get about 40M files. On an ext3 or ext4 file system with default settings you might run out of inodes!

Good luck, and if you need help, just contact me!

Open Library data to book search

So we have managed to download the catalog data of Open Library in JSON format. The plan is to integrate it into our book search functionality, alongside the data from the British National Bibliography (BNB).

However, the challenge is much bigger. If I understand the data right, we now have to deal with about 25 million entries (the BNB contains about 3 million). Moreover, in this export (a single 35 GB file) the authors are referenced by their identifiers in the book entries, and before we can push everything into our Solr server, we need to resolve those references. And on top of that, we need to convert the entries into XML.
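
Once the references are resolved and the XML is built, pushing a record into Solr is the easy part. Here is a hedged sketch using a recent SolrJ client; the core URL and the field names (id, title, author) are placeholders, not our actual schema:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Hedged sketch: index one resolved book record with SolrJ.
public class SolrFeedSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "/works/OL10017586W");
        doc.addField("title", "Enemies Within Us");
        doc.addField("author", "Author Name"); // the name substituted for /authors/OL3985980A
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}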

I started with a clever bash script to split up the file. It is now running in two instances, and I estimate that it will take about 40 hours to finish on altair, which is powerful, but the only desktop machine we could dedicate to batch tasks.
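
For reference, the splitting itself is nothing fancy. A minimal Java sketch of the same idea (the real thing is a bash script, and the 5-million-line chunk size here is an arbitrary assumption):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Minimal sketch: split a huge line-based dump into fixed-size chunks
// so several converter instances can run in parallel.
public class SplitSketch {
    public static void main(String[] args) throws Exception {
        final long LINES_PER_CHUNK = 5000000L;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = null;
        String line;
        long lineNo = 0;
        int chunk = 0;
        while ((line = in.readLine()) != null) {
            if (lineNo % LINES_PER_CHUNK == 0) {   // start a new chunk file
                if (out != null) out.close();
                out = new PrintWriter(args[0] + ".part" + (chunk++));
            }
            out.println(line);
            lineNo++;
        }
        if (out != null) out.close();
        in.close();
    }
}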

A new site for pedia.sztaki.hu

We needed to make some adjustments to the deployment of Sztakipedia. As a result, we now have two quite powerful KVM machines, with four logical cores and 4 GB of RAM each, serving Sztakipedia. One hosts the pedia.sztaki.hu frontend and the API endpoint; the other runs Solr and various background processors. There is a third machine, a clone of pedia.sztaki.hu, on which we test new functionality. We do not have a spare background processor machine, however, but practice shows that it is not necessary: since all of the background processors have a UIMA interface, we can test them on our desktops, and if the XML looks right and there are no memory leaks, we can plug them into the server.

Also, as part of the re-deployment of Sztakipedia, I have installed this small blog site and started blogging :) Enjoy!