{"id":87,"date":"2013-06-16T18:12:23","date_gmt":"2013-06-16T16:12:23","guid":{"rendered":"http:\/\/heavenstone.net\/jesse\/?p=87"},"modified":"2025-08-01T19:00:59","modified_gmt":"2025-08-01T17:00:59","slug":"how-many-articles-on-computer-science-can-there-possibly-be","status":"publish","type":"post","link":"https:\/\/heavenstone.net\/jesse\/how-many-articles-on-computer-science-can-there-possibly-be\/","title":{"rendered":"How many articles on computer science can there possibly be?"},"content":{"rendered":"<p><strong>Meet the PAF Peacock<\/strong><\/p>\n<p><a href=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-100\" alt=\"peacock\" src=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock-300x195.jpg\" width=\"300\" height=\"195\" srcset=\"https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock-300x195.jpg 300w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock-1024x668.jpg 1024w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock-459x300.jpg 459w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/peacock.jpg 1758w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>I just got back from a 3-day trip to the beautiful and mysterious <a href=\"http:\/\/www.pa-f.net\/\">PAF<\/a>, in the countryside near Reims, about 2 hours east of Paris. The place is made for artists, dancers, and musicians to work, think, and play. And at only 16\u20ac a night, an incredible deal. An ancient monastery, it&#8217;s still filled with objects of previous grandeur like out-of-tune pianos and threadbare tapestries, but who can say no to a kick-ass ping-pong table?<\/p>\n<p>But as their website says, it&#8217;s a place for &#8220;production&#8221;, not for &#8220;vacation&#8221;. And that&#8217;s what the 9 of us were there to do. 
Import Wikipedia into KnowNodes, visualize it on a graph, and let users quickly find these articles when making connections.<\/p>\n<p><strong>Know your Nodes<\/strong><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" alt=\"\" src=\"http:\/\/www.knownodes.com\/img\/KnownodesLogo.jpg\" width=\"751\" height=\"161\" \/><\/p>\n<p>So what is <a href=\"http:\/\/www.knownodes.com\/\">this KnowNodes project<\/a>, you may be asking. <a href=\"https:\/\/twitter.com\/garbash\">Dor Garbash<\/a>&#8217;s dreamchild, it is a connectivist orgasm, a sort of map of human thought and knowledge, but focused on the connections between resources (scientific articles, blog posts, videos, etc.) rather than on those resources themselves. Students could use it to find new learning resources, researchers could use it to explore the boundaries of knowledge in their field, and the rest of us just might love jumping from one crazy idea to the next.<\/p>\n<p>Much like a new social network, one recurring problem with getting this kind of project off the ground is that it needs good-quality content to jumpstart it. And what better source of quality information could there be than the world&#8217;s largest crowdsourcing project ever, Wikipedia?<\/p>\n<p>Now down to the gory details. How big is Wikipedia, anyway? Well, according to <a title=\"the &quot;Statistics&quot; page\" href=\"http:\/\/en.wikipedia.org\/wiki\/Wikipedia:Statistics\">the &#8220;Statistics&#8221; page<\/a>, the English-language site alone is at over 4 million articles and growing. Wikipedia (and the Wikimedia platform behind it) is very open with its data. You can easily <a href=\"http:\/\/en.wikipedia.org\/wiki\/Wikipedia:Database_download#English-language_Wikipedia\">download nightly data dumps of their database in different formats<\/a>. 
But here&#8217;s the rub: the articles alone (not counting user pages, talk pages, attachments, previous versions, and who knows what else) still weigh in at 42 GB of XML. That was a bit too much for our poor little free-plan Heroku instance to handle.<\/p>\n<p>So, we came up with a better idea: why not just focus on a particular domain, such as computer science? That way we could demonstrate the value of the approach without overloading our own tiny DB. Now, we realized that we couldn&#8217;t just start at the computer-science article and branch outwards, because with the 6-degrees nature of the world, we would soon end up importing the Kevin Bacon article. But Wikipedia has thoughtfully created a system of categories and sub-categories and sub-sub-categories, and anyway, how many articles under the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Category:Computer_science\">Computer Science category<\/a> could there possibly be?<\/p>\n<p><strong>1+6+2+5+9+17+3+34+&#8230;<\/strong><\/p>\n<p>Hmmm, let&#8217;s find out. We wrote a node.js script that uses the open <a href=\"http:\/\/www.mediawiki.org\/wiki\/API:Main_page\">Wikimedia API<\/a>. The only way to find all the articles in the Computer Science category hierarchy is to recursively ask the API for the categories within it, do the same with its children, and so on, until we reach the bottom.<\/p>\n<p>The <strong><a href=\"https:\/\/github.com\/macbre\/nodemw\">nodemw<\/a><\/strong>\u00a0module came in really handy, as it wraps many of the most common API operations so you don&#8217;t have to make the HTTP requests yourself. It also queues all requests that you make, and only executes one at a time. 
That prevents Wikipedia from banning your IP (which is good) but also slows you way down (not so good).<\/p>\n<p>Enough talk, here&#8217;s what we came up with:<\/p>\n<script src=\"https:\/\/gist.github.com\/5792361.js\"><\/script><noscript><pre><code class=\"language-javascript javascript\">var bot = require(&#039;nodemw&#039;);\nvar fs=require(&#039;fs&#039;);\n\nvar client = new bot({\n  server: &#039;en.wikipedia.org&#039;,  \/\/ host name of MediaWiki-powered site\n  path: &#039;\/w&#039;,                  \/\/ path to api.php script\n  debug: false                \/\/ is more verbose when set to true\n});\n\n\/\/ get the object being the first key\/value entry of a given object\nvar getFirstItem = function(object) {\n  for(var key in object) return object[key];\n};\n\nfunction getLinks(title, callback) {\n  query = { \n    action: &#039;query&#039;,\n    prop: &#039;links&#039;,\n    titles: title,\n  };\n\n  function makeCall() {\n    client.api.call(query, function(data, info, next) {\n      for(var i in getFirstItem(data.pages).links)\n      {\n        callback(getFirstItem(data.pages).links[i]);\n      }\n  \n      if(next &amp;&amp; next[&quot;query-continue&quot;]) {\n        query.plcontinue = next[&quot;query-continue&quot;].links.plcontinue;\n        makeCall();\n      }\n    });\n  }\n\n  makeCall();\n}\n\nfunction getSubCategories(prefix,callback){\n  if(typeof prefix === &#039;function&#039;) {\n  \tcallback = prefix;\n\t}\n\n  client.api.call({\n\t\taction: &#039;query&#039;,\n\t\tcmnamespace:&#039;14|0&#039;,\n\t\tlist: &#039;categorymembers&#039;,\n\t\tcmtitle:prefix,\n\t\tcmlimit: 500\n\t}, function(data) {\n\t\tcallback &amp;&amp; callback(data &amp;&amp; data.categorymembers || []);\n\t});\n}\n\n\nvar pageCount = 0;\nvar catCount = 0;\nvar existingCategories = {};\n\nfunction writeAllPagesAndCat(catagory,callback) {\n\tgetSubCategories(catagory,function(data) {\n\t\tfor(var i in data) {\n\t\t\tif(data[i].ns===0) {\n\t\t\t\t\/*\n\t\t\t\tthis 
is the article in this catagory\n\t\t\t\twrite article into the article.txt,format: articleTitle,catagory\n\t\t\t\tarticle might be in the different catagory\n        *\/ \n        fs.appendFile(&quot;article.txt&quot;, data[i].title+&quot;,&quot;+catagory+&quot;\\n&quot;, function(){});\n        pageCount++;\n\t\t\t} else {\n\t\t\t\t\/* this is the catagory\n\t\t\t\t write catagory into the catagory.txt,format: catagory,parentCatagory\n\t\t\t\t*\/\n         \/\/ check if we already has the catogory, check in the txt files\n        if(!existingCategories[data[i].title]){\n       \t  \/\/ add it into the exsistingCategories\n       \t  existingCategories[data[i].title] = true;\n          fs.appendFile(&quot;catagory.txt&quot;, data[i].title+&quot;,&quot;+catagory+&quot;\\n&quot;, function(){});\n          \/\/ get it&#039;s subcatagories\n          writeAllPagesAndCat(data[i].title, callback);\n          catCount++;\n        }\n\t\t\t}\n\t\t}\n\t\tcallback(catagory);\n\t})\n}\n\nwriteAllPagesAndCat(&#039;Category:Computer_science&#039;, function(catagory) {\n\tconsole.log(catagory, pageCount, catCount);\n});\n<\/code><\/pre><\/noscript>\n<p>And so we launched the script, saw that it was listing the articles, and walked away happily, bragging about how quickly we had coded our part as we watched the peacocks scaring each other in the courtyard.<\/p>\n<p>And when we came back a few hours later, and the article count had surpassed 250000, Weipeng suspected there may be a problem. He started printing out the categories as we imported them, and sure enough we saw duplicates. That was the first sign that something was wrong. The second was when we saw that we had somehow imported an article on &#8220;Gender Identity&#8221;. That doesn&#8217;t sound a lot like computer science, does it?<\/p>\n<p>On further inspection, we found that our conception of how the category system worked was very wrong. 
It turns out that categories can have several parents, that pages can be in multiple categories, and that categories might even loop around on themselves. This is very different from the simple tree that we had been imagining.<\/p>\n<p>Time for a new approach: we simply limit the depth of our exploration. Stopping at 5 levels gave us about 110k articles, and 6 levels gave us 192k. We couldn&#8217;t find any automatic criteria to say whether all these articles really should be part of the system, but this was about the number that we were hoping for, so we stopped there.<\/p>\n<p><strong>Wikipedia -&gt; KnowNodes<\/strong><\/p>\n<p>Now that we had a list of articles, it was time to actually put them into the database. Time-wise, it probably would have made sense to go through the XML dump in order to avoid making live API requests. But that wouldn&#8217;t have helped when a user looked at a new article outside the set we had imported. And so we created a dynamic system.<\/p>\n<p>The code in this case might not make a lot of sense to anyone who hasn&#8217;t worked on the project, but the idea is simple enough. Convert the title to the URL of the article, download the 1st paragraph (as a description), and insert it into our database. The 2nd part turned out to be much harder than we had thought. Wikipedia uses its own &#8220;Wikitext&#8221; format, which you wouldn&#8217;t want to show by itself. There actually are quite a few libraries to convert from Wikitext to plain text (or to HTML), but very few of the JavaScript ones worked reliably in our case. The best we found was <strong><a href=\"https:\/\/github.com\/kakiray\/txtwiki.js\">txtwiki.js<\/a><\/strong>, which really is quite good, except even it fails on infoboxes (which unfortunately are often placed first on the page, messing up our &#8220;take the 1st paragraph&#8221; approach). 
In the end, Weipeng found that we could simply ask for the &#8220;parsed&#8221; or HTML version of the page, and take the text inside the first &#8220;&lt;p&gt;&#8221; tag we found.<\/p>\n<p>Importing a bunch of isolated Wikipedia articles does not create a map of knowledge; making connections between them does. The <a href=\"http:\/\/www.mediawiki.org\/wiki\/API:Properties\">Wikipedia API provides at least 3 different kinds of links<\/a>: internal (to other pages on the site), external (to the general internet, as well as to partner sites like WikiBooks), and backlinks (other Wikipedia articles that point to it). We query all 3, find which ones already exist in the database, and set up a link between them.<\/p>\n<p>Code-wise, there&#8217;s not much to show that isn&#8217;t tied intimately into KnowNodes. Nodemw is missing a method to get internal links, though, so here is what we wrote:<\/p>\n<script src=\"https:\/\/gist.github.com\/5792472.js\"><\/script><noscript><pre><code class=\"language-coffeescript coffeescript\">getInternalLinks = (title, callback) -&gt;\n  query =\n    action: &#039;query&#039;\n    prop: &#039;links&#039;\n    titles: title\n    pllimit: 5000\n  client.api.call query, (data) -&gt; \n    titles = (link.title for link in getFirstItem(data.pages).links)\n    callback(titles)<\/code><\/pre><\/noscript>\n<p><strong>One foot in front of the other<\/strong><\/p>\n<p>The last step in this journey was going through the article lists we had generated and making the calls to our own API to load the Wikipedia article. This seems straightforward enough, except that it is bizarrely difficult to read a text file line by line in node.js. 
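As an aside, that first-paragraph trick can be sketched as a tiny pure function. This is a hypothetical standalone version with made-up sample HTML (the helper name `firstParagraph` and the `sample` string are illustrations, not the KnowNodes code):

```javascript
// Sketch: given the HTML that MediaWiki's action=parse returns,
// take the text inside the first <p> tag as a plain-text description.
function firstParagraph(html) {
  var match = /<p[^>]*>([\s\S]*?)<\/p>/.exec(html);
  if (!match) return null;
  return match[1]
    .replace(/<[^>]+>/g, '') // strip inline tags (links, bold, ...)
    .replace(/\s+/g, ' ')    // collapse whitespace
    .trim();
}

// Infoboxes render as tables, so they never match the <p> pattern
var sample = '<div><table>infobox markup</table>' +
  '<p><b>Computer science</b> is the study of <a href="/wiki/Computation">computation</a>.</p>' +
  '<p>Second paragraph.</p></div>';

console.log(firstParagraph(sample)); // prints: Computer science is the study of computation.
```

A regex is obviously a crude substitute for a real HTML parser, but it captures the point: the infobox markup never matches, so the first real paragraph wins.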
<a href=\"http:\/\/stackoverflow.com\/questions\/6156501\/read-a-file-one-line-at-a-time-in-node-js\">Search StackOverflow<\/a> and you&#8217;ll find a bunch of different approaches, including using the <a href=\"https:\/\/github.com\/pkrumins\/node-lazy\"><strong>lazy<\/strong><\/a>\u00a0module, which works pretty well. But since I knew that our system could only make one Wikipedia request at a time, and that each Wikipedia article involves at least 4 requests (for the article and the 3 types of links), there was no point in overloading the server. I just wanted to read one line at a time.<\/p>\n<p><strong><a href=\"https:\/\/github.com\/nickewing\/line-reader\">Line Reader<\/a> <\/strong>to the rescue. It has a very minimalist API, but one that lets you asynchronously decide when the next line should be read, which made it perfect for my needs.<\/p>\n<script src=\"https:\/\/gist.github.com\/5792485.js\"><\/script><noscript><pre><code class=\"language-javascript javascript\">var request = require(&quot;request&quot;);\nvar lineReader =  require(&#039;line-reader&#039;);\nvar fs = require(&quot;fs&quot;);\n\nvar HOST = &quot;http:\/\/localhost:3000&quot;;\nvar LOGIN = { \/* your login and password here *\/ };\n\n\nfunction login(callback) {\n  request.post(HOST + &quot;\/login&quot;, { form: LOGIN }, function(error, response, body) {\n    if(body == &quot;ERROR&quot;) return callback(&quot;Error logging in&quot;);\n\n    responseObj = JSON.parse(body);\n    if(responseObj.KN_ID &amp;&amp; callback) callback(null);\n  });\n}\n\n\nfunction createWikiNode(title, callback) {\n  request.post(HOST + &quot;\/knownodes\/wikinode&quot;, { form: { title: title } }, function(error, response, body) {\n    try {\n      responseObj = JSON.parse(body);\n      if(responseObj.success.KN_ID) {\n        console.log(&quot;Created Wikinode &quot; + responseObj.success.KN_ID + &quot; (&quot; + title + &quot;)&quot;);\n        return callback &amp;&amp; callback(null);\n      }      \n    } 
catch(err) { \n      return callback &amp;&amp; callback(err)\n    }\n\n    return callback &amp;&amp; callback(&quot;Unknown error&quot;)\n  });\n}\n\n\/\/ Read command-line arguments\nif(process.argv.length &lt; 3) \n{\n  console.log(&quot;usage: node makeNodes.js &lt;file&gt; [start] [count]&quot;);\n  process.exit(1);\n}\n\nvar start = process.argv.length &gt; 3 ? parseInt(process.argv[3]) : 0;\nvar count = process.argv.length &gt; 4 ? parseInt(process.argv[4]) : Number.MAX_VALUE;\n\nconsole.log(&quot;Logging in...&quot;)\nlogin(function(err) {\n  if(err) return console.log(err);\n\n  console.log(&quot;Logged in&quot;);\n\n  var lineNumber = 0;\n  lineReader.eachLine(process.argv[2], function(line, last, cb) {\n    lineNumber++;\n\n    \/\/ Skip lines before start\n    if(lineNumber &lt; start) return cb(true); \n\n    \/\/ Expecting a CSV format with the article title first\n    var array = line.toString().split(&#039;,&#039;);\n    \/\/ HACK: some titles have commas in them :(  This should be fixed on the exporter end, but at least filter them out here to avoid errors\n    if(array.length &gt; 2) \n    {\n      console.log(lineNumber + &quot;. 
ERROR: TOO MANY COMMAS- &quot; + line);\n      return cb(true)\n    }\n\n    console.log(lineNumber + &quot;: Creating Wikinode for &quot; + array[0]);\n    createWikiNode(array[0], function(err) { \n      if(err) console.log(&quot;ERROR: Skipping&quot;, array[0]); \n\n      cb(lineNumber &lt; start + count); \n    });\n  });\n});<\/code><\/pre><\/noscript>\n<p><a href=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_213719.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-101\" alt=\"20130610_213719\" src=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_213719-300x225.jpg\" width=\"300\" height=\"225\" srcset=\"https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_213719-300x225.jpg 300w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_213719-1024x768.jpg 1024w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_213719-400x300.jpg 400w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><strong>Bonus: Drunken graph walking<\/strong><\/p>\n<p>While Weipeng and I were puzzling over inexplicable errors, Bruno was pondering a bigger question: Now that we have all these links, how do we know which are more important than others? Among the many planned features for KnowNodes is a voting system for the links, but couldn&#8217;t we get a good idea from the link structure that already exists on Wikipedia?<\/p>\n<p>Bruno came up with a &#8220;friends of friends&#8221; approach: Given an article A and an article B that it links to, count the number of articles that A links to that also link to B. What&#8217;s nice about this approach is that it imitates a random walk along the graph. 
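As a toy illustration of the metric (the `relevance` function and the mini-graph below are made up for this sketch, not the project's code or data), the ratio can be computed over a plain adjacency map:

```javascript
// Toy "friends of friends" relevance: for a link A -> B, relevance is
// the fraction of A's outgoing links that themselves also link to B.
function relevance(links, a, b) {
  var neighbors = links[a] || [];
  if (neighbors.length === 0) return 0;
  var common = neighbors.filter(function (c) {
    return (links[c] || []).indexOf(b) !== -1;
  });
  return common.length / neighbors.length;
}

// Made-up mini-graph: article title -> articles it links to
var links = {
  'Python':           ['Computer science', 'Algorithm', 'Guido van Rossum'],
  'Algorithm':        ['Computer science'],
  'Guido van Rossum': ['Python'],
  'Computer science': ['Algorithm', 'Turing machine'],
  'Turing machine':   ['Algorithm']
};

console.log(relevance(links, 'Python', 'Computer science')); // prints 0.3333333333333333
console.log(relevance(links, 'Computer science', 'Python')); // prints 0
```

Even on this toy graph the scores come out asymmetrical, which matches what we saw on the real data.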
Imagine you are on the Wikipedia article for A: <a href=\"https:\/\/github.com\/CyberCRI\/KnowNodes\/wiki\/Relations-relevance\">what is the chance that by following the links you will end up at B in 2 clicks?<\/a><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" alt=\"\" src=\"https:\/\/dl.dropboxusercontent.com\/u\/19894029\/knownodes.png\" width=\"637\" height=\"437\" \/><\/p>\n<p>In practice, these numbers tend to be very asymmetrical. A subject like &#8220;Python&#8221; may have a lot of links towards &#8220;Computer Science&#8221;, but only a small fraction of &#8220;Computer Science&#8221; links lead to &#8220;Python&#8221;.<\/p>\n<p>We considered coding this into the Wikipedia importer, but there&#8217;s no reason that the approach shouldn&#8217;t work for any type of node and link in the system. And why not learn about querying a graph database in the process?<\/p>\n<p>This was the first time Bruno and I had written Cypher queries, so I doubt this is the best way to do it, but this is what we came up with:<\/p>\n<script src=\"https:\/\/gist.github.com\/5792493.js\"><\/script><noscript><pre><code class=\"language-text text\">START root = node(*) \nMATCH (root)--&gt;(link)--&gt;(friend)\nWHERE root.nodeType! = &quot;kn_Post&quot; \n  AND link.nodeType! = &quot;kn_Edge&quot;\n  AND friend.nodeType! = &quot;kn_Post&quot;\nWITH root, link, friend\n\nMATCH (root)--&gt;(linkA)--&gt;(common)--&gt;(linkB)--&gt;(friend)\nWHERE \n  linkA.nodeType! = &quot;kn_Edge&quot;\n  AND common.nodeType! = &quot;kn_Post&quot; \n  AND linkB.nodeType! = &quot;kn_Edge&quot;\nWITH root, link, friend, count(DISTINCT common) as commonCount\n\nMATCH (root)--&gt;(linkC)--&gt;(other)\nWHERE \n  linkC.nodeType! = &quot;kn_Edge&quot;\n  AND other.nodeType! 
= &quot;kn_Post&quot; \nWITH root, link, friend, commonCount, count(DISTINCT other) as totalCount\n\nSET link.relevance = commonCount * 1.0 \/ totalCount\n\nRETURN root, friend, commonCount, totalCount, commonCount * 1.0 \/ totalCount as strength <\/code><\/pre><\/noscript>\n<p>Although Cypher lacks some documentation (mostly in the form of examples), it actually makes a lot of sense once you start working with it. The graphic representation of the links is a big plus, and the rest of it is reminiscent of SQL for those of us who have used &#8220;normal&#8221; DBs before.<\/p>\n<p>And there you go. 3 days of good work and good times. Next step? Let&#8217;s get a good search box on the KnowNodes front page!<\/p>\n<p><a href=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_091714.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-99\" alt=\"20130610_091714\" src=\"http:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_091714-300x225.jpg\" width=\"300\" height=\"225\" srcset=\"https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_091714-300x225.jpg 300w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_091714-1024x768.jpg 1024w, https:\/\/heavenstone.net\/jesse\/wp-content\/uploads\/2013\/06\/20130610_091714-400x300.jpg 400w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Meet the PAF Peacock I just got back from a 3 day trip at the beautiful and mysterious PAF, in the countryside near Reims, about 2 hours east of Paris. The place is made for artists, dancers, and musicians to work, think, and play. And at only 16\u20ac a night, an incredible deal. 
An ancient [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-87","post","type-post","status-publish","format-standard","hentry","category-learning"],"_links":{"self":[{"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/posts\/87","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/comments?post=87"}],"version-history":[{"count":10,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/posts\/87\/revisions"}],"predecessor-version":[{"id":247,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/posts\/87\/revisions\/247"}],"wp:attachment":[{"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/media?parent=87"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/categories?post=87"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/heavenstone.net\/jesse\/wp-json\/wp\/v2\/tags?post=87"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}