How many articles on computer science can there possibly be?

Meet the PAF Peacock


I just got back from a 3-day trip to the beautiful and mysterious PAF, in the countryside near Reims, about 2 hours east of Paris. The place is made for artists, dancers, and musicians to work, think, and play. And at only 16€ a night, it's an incredible deal. An ancient monastery, it’s still filled with objects of previous grandeur like out-of-tune pianos and stringy tapestries, but who can say no to a kick-ass ping-pong table?

But as their website says, it’s a place for “production”, not for “vacation”. And that’s what the 9 of us were there to do: import Wikipedia into KnowNodes, visualize it on a graph, and let users quickly find these articles when making connections.

Know your Nodes

So what is this KnowNodes project, you may be asking. Dor Garbash’s dreamchild, it is a connectivist orgasm, a sort of map of human thought and knowledge, but focused on the connections between resources (scientific articles, blog posts, videos, etc.) rather than on the resources themselves. Students could use it to find new learning resources, researchers could use it to explore the boundaries of knowledge in their field, and the rest of us just might love jumping from one crazy idea to the next.

Much like a new social network, one recurring problem with getting this kind of project off the ground is that it needs good-quality content to jumpstart it. And what could be a better source of quality information than the world’s largest crowdsourcing project ever, Wikipedia?

Now down to the gory details. How big is Wikipedia, anyway? Well, according to the “Statistics” page, the English-language site is at over 30 million pages and growing. Wikipedia (and the MediaWiki platform behind it) is very open with its data. You can easily download nightly data dumps of the database in different formats. But here’s the rub: the articles alone (not counting user pages, talk pages, attachments, previous versions, and who knows what else) still weigh in at 42 GB of XML. That was a bit too much for our poor little free-plan Heroku instance to handle.

So, we came up with a better idea: why not just focus on a particular domain, such as computer science? That way we could demonstrate the value of the approach without overloading our own tiny DB. Now, we realized that we couldn’t just start at the computer-science article and branch outwards, because with the 6-degrees nature of the world, we would soon end up importing the Kevin Bacon article. But Wikipedia has thoughtfully created a system of categories and sub-categories and sub-sub-categories, and anyway, how many articles under the Computer Science category could there possibly be?

1+6+2+5+9+17+3+34+…

Hmmm, let’s find out. We wrote a node.js script that uses the open Wikimedia API. The only way to find all the articles in the Computer Science category hierarchy is to recursively ask the API for the categories within it, do the same with its children, and so on, until we reach the bottom.

The nodemw module came in really handy, as it wraps many of the most common API operations so you don’t have to make the HTTP requests yourself. It also queues all requests that you make, and only executes one at a time. That prevents Wikipedia from banning your IP (which is good) but also slows you way down (not so good).

Enough talk, here’s what we came up with:

(GitHub Gist 5792361)
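To give a flavour of it, here’s a minimal standalone sketch of the same idea. It talks to the MediaWiki API directly with plain fetch calls instead of going through nodemw and its request queue, so take it as an illustration rather than the code we actually ran; the category name is just an example, and (as you’ll see below) it doesn’t handle cycles yet.

```js
// Minimal sketch: recursively list every article under a category.
// Talks to the MediaWiki API directly (no nodemw, no request queue).
const API = 'https://en.wikipedia.org/w/api.php';

// Fetch all members of a category, following API continuation.
async function categoryMembers(category) {
  let members = [];
  let cmcontinue;
  do {
    const params = new URLSearchParams({
      action: 'query',
      list: 'categorymembers',
      cmtitle: category,
      cmlimit: '500',
      format: 'json',
    });
    if (cmcontinue) params.set('cmcontinue', cmcontinue);
    const res = await fetch(`${API}?${params}`);
    const data = await res.json();
    members = members.concat(data.query.categorymembers);
    cmcontinue = data.continue && data.continue.cmcontinue;
  } while (cmcontinue);
  return members;
}

// Naive recursion: articles go in the list, subcategories get crawled in turn.
// (No cycle handling yet.)
async function crawl(category, articles) {
  for (const member of await categoryMembers(category)) {
    if (member.ns === 14) {           // namespace 14 = category
      await crawl(member.title, articles);
    } else if (member.ns === 0) {     // namespace 0 = article
      articles.push(member.title);
      console.log(articles.length, member.title);
    }
  }
}

const articles = [];
crawl('Category:Computer science', articles)
  .then(() => console.log('Done:', articles.length, 'articles'));
```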

And so we launched the script, saw that it was listing the articles, and walked away happily, bragging about how quickly we had coded our part as we watched the peacocks scaring each other in the courtyard.

When we came back a few hours later and the article count had surpassed 250,000, Weipeng suspected there might be a problem. He started printing out the categories as we imported them, and sure enough we saw duplicates. That was the first sign that something was wrong. The second was when we saw that we had somehow imported an article on “Gender Identity”. That doesn’t sound a lot like computer science, does it?

On further inspection, we found that our conception of how the category system worked was very wrong. It turns out that categories can have several parents, that pages can be in multiple categories, and that categories might even loop around on themselves. This is very different than the simple tree that we had been imagining.

Time for a new approach: we simply limited the depth of our exploration. Stopping at 5 levels gave us about 110k articles, and 6 levels gave us 192k. We couldn’t find any automatic criterion to say whether all of these articles really should be part of the system, but this was about the number we were hoping for, so we stopped there.
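In code terms the fix is small: carry a depth counter down the recursion and remember which categories we’ve already visited, so loops and shared parents can’t drag us around in circles. A sketch, building on the crawl above (categoryMembers is the same helper, and the cutoff of 6 matches what we settled on):

```js
// Same crawl as before, but with a visited set (categories can repeat or
// loop back on themselves) and a depth cutoff to stay near computer science.
const MAX_DEPTH = 6;

async function crawl(category, articles, seen, depth) {
  if (depth > MAX_DEPTH || seen.has(category)) return;
  seen.add(category);
  for (const member of await categoryMembers(category)) {
    if (member.ns === 14) {
      await crawl(member.title, articles, seen, depth + 1);
    } else if (member.ns === 0 && !seen.has(member.title)) {
      seen.add(member.title);          // avoid duplicate articles too
      articles.push(member.title);
    }
  }
}

const articles = [];
crawl('Category:Computer science', articles, new Set(), 0)
  .then(() => console.log(articles.length, 'articles within', MAX_DEPTH, 'levels'));
```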

Wikipedia -> KnowNodes

Now that we had a list of articles, it was time to actually put them into the database. Time-wise, it probably would have made sense to go through the XML dump in order to avoid making live API requests. But that wouldn’t help us if a user was looking at a new article outside of the ones we had imported. And so we created a dynamic system.

The code in this case might not make a lot of sense to anyone who hasn’t worked on the project, but the idea is simple enough: convert the title to the URL of the article, download the 1st paragraph (as a description), and insert it into our database. The 2nd part turned out to be much harder than we had thought. Wikipedia uses its own “Wikitext” format, which you wouldn’t want to show by itself. There actually are quite a few libraries to convert from Wikitext to plain text (or to HTML), but very few of the JavaScript ones worked reliably in our case. The best we found was txtwiki.js, which really is quite good, except even it fails on infoboxes (which unfortunately are often placed first on the page, messing up our “take the 1st paragraph” approach). In the end, Weipeng found that we could simply ask for the “parsed” or HTML version of the page, and take the text inside the first “<p>” tag we found.
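Here’s a rough sketch of that last trick: ask the API for the parsed (HTML) version of the page and keep the first <p> that actually contains text. The tag-stripping regex is deliberately crude and only meant for illustration.

```js
// Sketch: fetch the parsed HTML of an article and pull out the first
// real paragraph to use as a description.
const API = 'https://en.wikipedia.org/w/api.php';

async function firstParagraph(title) {
  const params = new URLSearchParams({
    action: 'parse',
    page: title,
    prop: 'text',
    format: 'json',
  });
  const res = await fetch(`${API}?${params}`);
  const data = await res.json();
  const html = data.parse.text['*'];            // the rendered page body

  // Walk the <p> tags and keep the first one with actual text in it
  // (infoboxes and empty paragraphs often come first).
  const paragraphs = html.match(/<p[^>]*>[\s\S]*?<\/p>/g) || [];
  for (const p of paragraphs) {
    const text = p.replace(/<[^>]+>/g, '').trim();   // strip tags, crudely
    if (text.length > 0) return text;
  }
  return null;
}

firstParagraph('Computer science').then(console.log);
```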

Importing a bunch of isolated Wikipedia articles does not create a map of knowledge; making connections between them does. The Wikipedia API provides at least 3 different kinds of links: internal (to other pages on the site), external (to the general internet, as well as to partner sites like WikiBooks), and backlinks (other Wikipedia articles that point to it). We query all 3, find which ones already exist in the database, and set up a link between them.

Code-wise, there’s not much to show that isn’t tied intimately into KnowNodes. Nodemw is missing a method to get internal links, though, so here’s what we wrote:

(GitHub Gist 5792472)
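For a sense of what that method has to do, here’s a rough equivalent written against the raw MediaWiki API (prop=links with continuation) rather than wired into nodemw; the function name is just a placeholder.

```js
// Sketch: list the internal (wiki-to-wiki) links of an article via
// prop=links, following continuation — roughly what a nodemw helper
// for internal links would have to wrap.
const API = 'https://en.wikipedia.org/w/api.php';

async function internalLinks(title) {
  let links = [];
  let plcontinue;
  do {
    const params = new URLSearchParams({
      action: 'query',
      titles: title,
      prop: 'links',
      pllimit: '500',
      plnamespace: '0',   // only links to other articles
      format: 'json',
    });
    if (plcontinue) params.set('plcontinue', plcontinue);
    const res = await fetch(`${API}?${params}`);
    const data = await res.json();
    const page = Object.values(data.query.pages)[0];
    links = links.concat((page.links || []).map((l) => l.title));
    plcontinue = data.continue && data.continue.plcontinue;
  } while (plcontinue);
  return links;
}

internalLinks('Computer science')
  .then((links) => console.log(links.length, 'internal links'));
```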

One foot in front of the other

The last step in this journey was going through the article lists we had generated and making the calls to our own API to load the Wikipedia articles. This seems straightforward enough, except that it is bizarrely difficult to read a text file line by line in node.js. Search StackOverflow and you’ll find a bunch of different approaches, including using the lazy module, which works pretty well. But since I knew that our system could only make one Wikipedia request at a time, and that each Wikipedia article involves at least 4 requests (for the article and the 3 types of links), there was no point overloading the server. I just wanted to read one line at a time.

Line Reader to the rescue. It has a very minimalist API, but one that lets you declare asynchronously when the next line should be read, which made it perfect for my needs.

(GitHub Gist 5792485)
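Roughly, the loop looks like this, assuming line-reader’s asynchronous callback form; processArticle() here is just a stand-in for the call into our own KnowNodes API.

```js
// Sketch: feed the article list to the importer one line at a time.
const lineReader = require('line-reader');

function processArticle(title, done) {
  // In the real script this would call the KnowNodes API, which in turn
  // makes its (queued) Wikipedia requests. Here we just log and move on.
  console.log('importing', title);
  setTimeout(done, 100);
}

lineReader.eachLine('articles.txt', function (line, last, next) {
  processArticle(line.trim(), function () {
    if (last) console.log('All done!');
    next();               // only now does line-reader read the following line
  });
});
```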


Bonus: Drunken graph walking

While Weipeng and I were puzzling over inexplicable errors, Bruno was pondering a bigger question: now that we have all these links, how do we know which ones are more important than others? Among the many planned features for KnowNodes is a voting system for the links, but couldn’t we get a good idea from the link structure that already exists on Wikipedia?

Bruno came up with a “friends of friends” approach: given an article A and an article B that it links to, count the number of articles that A links to that also link to B. What’s nice about this approach is that it imitates a random walk along the graph. Imagine you are on the Wikipedia article for A: what is the chance that, by following links, you will end up at B in 2 clicks?

In practice, these numbers tend to be very asymmetrical. A subject like “Python” may have a lot of links towards “Computer Science”, but only a small fraction of “Computer Science” links lead to “Python”.

We considered coding this into the Wikipedia importer, but there’s no reason the approach shouldn’t work for any type of node and link in the system. And why not learn about querying a graph database in the process?

This was Bruno’s and my first time writing Cypher queries, so I doubt this is the best way to do it, but here’s what we came up with:

(GitHub Gist 5792493)
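To show the shape of the idea, here’s a hedged sketch of the same calculation sent from node.js with the neo4j-driver package, not the exact query we wrote. The :LINKS_TO relationship type, the title property, and the connection details are assumptions standing in for KnowNodes’ real schema.

```js
// Sketch: score a link A -> B by asking how many of A's outgoing links
// also link on to B, i.e. the chance a 2-click walk from A lands on B.
// The :LINKS_TO type and the title property are made-up stand-ins.
const neo4j = require('neo4j-driver');

const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password')   // adjust for your setup
);

async function linkWeight(titleA, titleB) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (a {title: $titleA})-[:LINKS_TO]->(c)
       WITH collect(DISTINCT c) AS neighbours
       MATCH (b {title: $titleB})
       RETURN size(neighbours) AS total,
              size([n IN neighbours WHERE (n)-[:LINKS_TO]->(b)]) AS common`,
      { titleA, titleB }
    );
    const record = result.records[0];
    const total = record.get('total').toNumber();
    const common = record.get('common').toNumber();
    return total === 0 ? 0 : common / total;   // rough 2-click probability
  } finally {
    await session.close();
  }
}

linkWeight('Python (programming language)', 'Computer science')
  .then((w) => console.log('weight:', w))
  .then(() => driver.close());
```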

Although Cypher’s documentation is a little thin (mostly on examples), it actually makes a lot of sense once you start working with it. The way links are drawn graphically right in the query is a big plus, and the rest of it is reminiscent of SQL for those of us who have used “normal” DBs before.

And there you go. 3 days of good work and good times. Next step? Let’s get a good search box on the KnowNodes front page!
