Extracting data from Wikipedia using curl, grep, cut and other bash commands (loige.co)
218 points by loige on Aug 15, 2016 | 70 comments



Because I was randomly curious: you can extract this data from the structured HTML with some DOM selectors in a similarly haphazard way.

Start with: https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olymp...

Run this JS one-liner:

  [].slice.call(document.querySelectorAll('table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr td:nth-child(n+2) > a:nth-child(1), table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr:nth-child(3) td > a:nth-child(1)')).map(function(e) { return e.innerText; }).reduce(function(res,el) { res[el] = res[el] ? res[el] + 1 : 1; return res; }, {});

The result is an object with the medalists as keys and the counts as values. JS objects are unordered, so sorting is left as an exercise for the reader.


Nice. Nasty, but nice :)

I can't help but notice a small bug... Driulis Gonzalez, for example, has medalled 4 times, but your script gives her count as only 3. Similarly, Gévrise Émane isn't listed by your script. Something to do with split table cells, I suspect.

Still, it's inspiring. I've often used bash and perl to scrape data from web pages. I'll definitely consider JS in future.


Wish I could upvote this more. That is some serious queryselector fu.


This is you-can't-parse-HTML-with-regex [1] level hideous, only worse, because Mediawiki markup is essentially a Turing-complete programming language thanks to template inclusion, parser functions [2], etc.

The only remotely sane way to do this is to use the Mediawiki API [3] to get the pages you want, then use an actual parser like mwlib [4] to extract the content you need. Wikidata and DBpedia are also promising efforts, but both have a long way to go in terms of coverage.

[1] http://stackoverflow.com/questions/1732348/regex-match-open-...

[2] https://www.mediawiki.org/wiki/Help:Extension:ParserFunction...

[3] https://www.mediawiki.org/wiki/API:Main_page

[4] https://www.mediawiki.org/wiki/Alternative_parsers


It's even worse. The syntax of MediaWiki markup (parser functions, etc.) will vary depending on extensions installed in a given instance. And there is no way to obtain a list of parser functions and extension tags (the latter of which look like HTML tags, but parse differently). There is literally no way to reliably parse MediaWiki markup offline. [0] can give you a parse tree — but only of a top-level page, you can't recursively expand templates with it. And even that parse tree only gives you information about which templates should be expanded. After expanding templates you have to parse the whole thing again to actually interpret the formatting markup. Oh, and expand strip markers: [1]. Because some extensions which define their own tags will want to output raw, unfiltered HTML. But tags are parsed in the template expansion stage and raw HTML isn't allowed in the formatting stage, so what do the extensions do? They output "strip markers" that are later substituted for the raw HTML.

Most alternative parsing libraries (at least all the ones I looked at, and that includes Wikimedia's own Parsoid!) don't bother with implementing all that complexity. Which means that they can often be tripped up by sufficiently tricky markup; and this turns out to be quite a low bar.

Parsing MediaWiki markup properly is insanity on a level comparable only to TeX and Unix shell scripts. Even PHP is saner.

[0] https://www.mediawiki.org/wiki/API:Parsing_wikitext

[1] https://www.mediawiki.org/wiki/Strip_marker


> And there is no way to obtain a list of parser functions and extension tags (the latter of which look like HTML tags, but parse differently).

FWIW, every sane extension reports its existence to [[Special:Version]], which at least includes a list of extension tags at the bottom, but you'd need to implement all of them in an alternative parser.


This isn't for production-level data migration. It's for smooshing some source text into a shape which is useful to you.

Parsing HTML with regexps is fine if you're just curious roughly how many images are in a page. It's great for quick command line experiments. It's just not good when you need to be "doing it properly".


I used to work with Wiki markup for a living. The time you think you'll save with regex hackery is quickly chewed up by the time wasted eternally tweaking your regexes to catch yet another corner case -- it's much better just to parse for real from the get go, just like it's much better to use a real HTML/XML parser than trying to do the same job badly with regexes.


I used to make scrapers for a living, and trust me, it all depends on the particular situation and your requirements. Real HTML parsers are easier and safer for general kinds of work, but they quickly get very heavy on memory when parsing big DOM trees. If all you need from a page is a few strings, e.g. just a product price (a very common task), using regexes is a far superior approach performance-wise. It's both faster and uses less memory (so you can run more parallel workers), and if you write it well it's immune to many small HTML/design changes, as long as the pattern you look for doesn't change.
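For illustration, a minimal sketch of the kind of thing I mean (the URL and markup here are invented):

  # Hypothetical product page; the real URL and markup will differ.
  url='https://example.com/product/12345'

  # Pull out just the price with a narrow pattern instead of building a full DOM tree.
  # grep -o prints only the matching part; sed strips the surrounding tags.
  curl -s "$url" \
    | grep -o '<span class="price">[^<]*</span>' \
    | sed 's/<[^>]*>//g' \
    | head -n 1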


I'm sure that works well on product pages, which are just the same template reused for every single product. I'm afraid it will fail spectacularly on wiki pages, which are handcrafted by humans with all the completely unpredictable randomness that entails.


I mostly agree; I've been using regexes to parse a known, limited subset of HTML tags in a known, limited format (one tag per line, forced padding, no attrs, etc.).

But on the other hand, this is often how long-term "proper" solutions are born — evolved from something cobbled together in a couple of hours.


Oh yeah, the golden rule for quick hacks is NEVER show them to someone non-technical, especially someone non-technical who's in your chain of command.

I learned that one after mocking up some UI screens using VB6 and then having to explain over and over again that no, just because we showed you some buttons on a page doesn't mean that the program (which had to be a Java applet, mind you) was "almost done."


There are already literally dozens of "proper" MediaWiki parsers out there: https://www.mediawiki.org/wiki/Alternative_parsers


...or we can just use SPARQL and DBpedia! (http://wiki.dbpedia.org/) There are questions where you'll have to scrape more than one page to get an answer, and things could get really complicated with shell commands.

DBpedia is a triplestore that lets us perform simple queries against Wikipedia data, like listing music bands based in a particular city:

  SELECT ?name ?place
  WHERE {
    ?place rdfs:label "Denver"@en .
    ?band dbo:hometown ?place .
    ?band rdf:type dbo:Band .
    ?band rdfs:label ?name .
    FILTER langMatches(lang(?name),'en')
  }
or queries that involve multiple subjects, categories etc.


DBpedia provides very low coverage of Wikipedia information.


I looked at DBpedia, but it was non-obvious to me what statements to use. We can also use SPARQL with Wikidata, although the coverage isn't particularly great. I threw together an example query for medalists, and Michael Phelps doesn't make the list because he doesn't have the appropriate "participant of"/"award received" statements:

https://query.wikidata.org/#SELECT%20%3Fhuman%20%3FhumanLabe...

  SELECT ?human ?humanLabel ?count WHERE {
    {
      SELECT ?human (COUNT(*) as ?count) WHERE {
        ?event wdt:P31 wd:Q18536594 . # All items that are instance of Olympic sporting event
  
        ?medal wdt:P279 wd:Q636830 .    # All items that are subclass of Olympic medal 

        ?human p:P1344 ?participantStat . # Humans with a participant of statement
        ?participantStat ps:P1344 ?event . # .. that has any of the values of ?event
        ?participantStat pq:P166 ?medal . # .. with the award received qualifier of any of the values of ?medal
      }
      GROUP BY ?human
    }
  
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  ORDER BY DESC(?count)
  LIMIT 100
EDIT: The dbpedia search should be something like: http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbp...

  SELECT ?human ?count WHERE
  {
    {
      SELECT ?human (count(*) as ?count) WHERE {
        ?event rdf:type dbo:OlympicEvent
        {
          ?event dbo:bronzeMedalist ?human .  
        } UNION {
          ?event dbo:silverMedalist ?human
        } UNION {
          ?event dbo:goldMedalist ?human
        }
      }
      GROUP BY ?human
    }
  }
  ORDER BY DESC(?count)
  LIMIT 100


It's not surprising that Phelps is not on the list: you're grouping by ?human and he is not... jk. Maybe there is an ontology issue, i.e. new triples are using a different class, or that triplestore is not up to date.


> You will not need to open an editor and write a long script and then to have an interpreter like Node.js to run it, sometime the bash command line is just enough you need!

This is a bad attitude to have for data processing work, where QA is necessary and the accuracy of the output is important. A 50 LOC scraper with comments and explicitly defined inputs and outputs from functions is far preferable to an 8 LOC scraper that those without bash knowledge will be unable to parse.

And the 8 LOC bash script is not much of a time savings, as this post demonstrates; you still have to check each step's output manually to find which data to parse / handle edge cases.


> far preferable

Well, that depends. Shell scripts can be commented, and they can be built progressively and interactively by building a pipeline. That's a great choice for a one-off task, and in my experience much faster than many other approaches.

It also works well for tasks close to the system. For example, our users are able to download large archives of data, and we keep over 99% of such downloads indefinitely. We delete downloads > 100GB after they're more than 6 months old.

With a shell script run by cron, that's achieved with find + rm + curl|jq (to tell the API the download is deleted).
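A rough sketch of what that job looks like (the paths and the API endpoint here are made up, and the exact curl/jq details depend on your API):

  #!/bin/bash
  # Remove downloads over 100GB that are older than ~6 months,
  # then tell the (hypothetical) API about each one.
  find /srv/downloads -type f -size +100G -mtime +180 -print0 |
    while IFS= read -r -d '' f; do
      id=$(basename "$f")
      rm -- "$f"
      # Notify the API and sanity-check the response with jq.
      curl -s -X POST "https://api.example.com/downloads/$id/deleted" |
        jq -e '.ok' > /dev/null || echo "API update failed for $id" >&2
    done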


To clarify, by data processing I mean extract-transform-load, where the outputs of the extract/transform phase(s) may not be immediately obvious. Bash as a simple automation tool for simple tasks is fine.


Most interesting data-analysis problems require multiple iterations. It's not a bad idea to take a couple quick exploratory passes at your data with interactive command-line tools, look at its shape, figure out what areas you need to take a closer look at, and then write real programs to work specifically with those.

I've gotten surprisingly far using just curl/cat/cut/sed/awk/wc & friends. When I need to build on top of that, I go write a real program, but the UNIX-fu tells me what I need to build on top of.
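For example, a first pass over a new CSV usually looks something like this for me (the file name and column numbers are made up):

  # How big is it, and what does it look like?
  wc -l data.csv
  head -n 5 data.csv

  # Which values show up in column 3, and how often?
  cut -d, -f3 data.csv | sort | uniq -c | sort -rn | head

  # Rough total of a numeric column with awk
  awk -F, '{ sum += $5 } END { print sum }' data.csv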


I've used Makefiles to coordinate a zoo of perl/bash scripts before. It's pretty effective.


Makefiles are underrated. The concept is so simple – it is basically a dependency graph with attached shell scripts – yet so powerful. Not just for building software, but also for everyday tasks where you need some files to be updated under some condition.
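A minimal sketch of the idea for a data task (the file names here are invented; recipe lines must be tab-indented):

  # Placeholder URL; point this at the real page
  URL = https://example.org/some/page

  # page.html is fetched only if it doesn't exist yet
  page.html:
  	curl -s '$(URL)' > page.html

  # medals.csv is rebuilt only when the page or the extraction script changes
  medals.csv: page.html extract.sh
  	./extract.sh page.html > medals.csv

  report.txt: medals.csv
  	sort -t, -k2 -rn medals.csv | head -20 > report.txt

  .PHONY: clean
  clean:
  	rm -f page.html medals.csv report.txt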

I recommend mk (the original make replacement for Plan 9, available for different OSes through Plan9Port [0,1]). It has a slightly more uniform syntax, and can check that the dependency graph is well-founded.

[0] https://github.com/9fans/plan9port

[1] https://swtch.com/plan9port/


I've been getting into Makefiles more over the past year, and I'm still a noob at it. I see them as a necessary evil - they're baroque and they hurt... and yet they make other things easier. I'm using them less to compile stuff and more to package/upload things, but they certainly have their place.


Another advantage of make is that every host has it. Like vim.

mk is not there yet.


Did you mean "vi"? Lots of systems don't install vim by default.


And contrariwise, lots of hosts install "vi" which actually seems to be vim. Aah, 'standards'. :P


Not since 1990. :)

And even if that is the case, I'm still better off being a vim user on a system that only has vi and ed than any other editor's user.


Now I have to ask: is it just to run/execute things, or is there other stuff you do with it?


You wouldn't do calculation in a Makefile, if that's what you mean. Make is not well-suited to use as a general-purpose programming language.


100% this. Coming from someone who's made disgusting chained commands to parse access logs for relevant information: you _need_ better documentation than you can provide with a single line of commands.

I have so many `zgrep ... | awk ... | sort` scripts that I legitimately couldn't tell you what they do anymore, just what the correct output is.

I really like the feeling of proving to myself that I know how to use my command line well, but in most cases I end up spending longer trying to remember the differences between OS X's built-in `sed` and the `sed` I know, and blah. Lots of wasted time.
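The in-place flag is the classic example of that difference:

  # GNU sed (Linux): the backup suffix is optional and attached to -i
  sed -i 's/foo/bar/g' file.txt

  # BSD sed (OS X): -i takes a separate (possibly empty) backup suffix argument
  sed -i '' 's/foo/bar/g' file.txt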

I've started keeping a "scratch" git repo with scripts for one-offs, so the next time I say "oh I want to go through this CSV and run X on each line with Y" I can look at my old code and replace the necessary parts.


You can put comments in Bash scripts, including in multi-line pipelines:

  echo x |\
    # Comment here
    cat | # Or like this
    cat
An interactive shell with the option interactive_comments set (the default) ignores these comments, which is useful if you wish to copy+paste+execute bits of scripts.


What? For a one-off, one-time "I wonder who has the most medals" curiosity script?

> 8 LOC scraper that those without bash knowledge will be unable to parse.

And someone without knowledge of Node.js will be unable to parse the 50-line JavaScript monstrosity.

> And the 8 LOC bash script is not much of a time savings as this post demonstrates; you still have to check each function output manually to find which data to parse / handle edge cases.

It only took that long because he detailed every step as a tutorial. If you have basic literacy of the shell, it's no time at all.

To refute you, I decided to solve it myself, using only the "hint" at the beginning of the article to use ?action=raw to work with the source wikitext (I had not yet read the rest of the article; I had not seen his solution).

It took me literally 2 minutes (setting url='https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_j... for readability on HN):

    [2016-08-15 17:52] curl -s ${url}
    [2016-08-15 17:52] curl -s ${url}|grep flagIOCmedalist
    [2016-08-15 17:52] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'
    [2016-08-15 17:53] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'|cut -d'|' -f2|sed 's/[[\]]//g'
    [2016-08-15 17:53] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'|cut -d'|' -f2
    [2016-08-15 17:54] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'|cut -d'|' -f2|sed 's/\(\[\|\]\)//g'
    [2016-08-15 17:54] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'|cut -d'|' -f2|sed 's/\(\[\|\]\)//g'|sort |uniq -c
    [2016-08-15 17:54] curl -s ${url}|grep -o '{{flagIOCmedalist.*}}'|cut -d'|' -f2|sed 's/\(\[\|\]\)//g'|sort |uniq -c|sort -n
You can see the only place I fumbled a bit was with escaping brackets inside of brackets in sed, which is admittedly a little wonky.

Sure, for something that might need to run repeatedly, it probably doesn't handle future edge cases that might arise. But it's not production, it's a one-off exploring the data script. Not every one-off program you write needs to be production quality.

And being worried about "those without bash knowledge"... don't be afraid to use your operating system!

That said, this line of solutions has a big edge case: it relies on the editors of Wikipedia being consistent and formatting each row as one line in the source. If I were to solve this totally on my own, I'd choose to ignore the "hint", work with the rendered HTML, and do most of it with a nokogiri one-liner.

Knowing that there's an ?action=something to get just the page HTML without the navigation and such, I spent about 3 minutes finding the "render" action (documented here: https://www.mediawiki.org/wiki/Manual:Parameters_to_index.ph... ). Then another 3-ish minutes poking around the DOM in my browser to get an idea of what I'm working with, and visually inspecting the layout of the article; each cell with a medalist has two links, the first to the medalist, and the second to the country. Then it took me a whopping 5 minutes to hammer out the rest of the one-liner.

(Similarly, url='https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_j... for HN readability)

    [2016-08-15 18:14] curl -s ${url}
    [2016-08-15 18:15] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text}'
    [2016-08-15 18:16] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text}'|grep -vE '^[0-9]{4} '
    [2016-08-15 18:17] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text.sub!("\n", " ")}'
    [2016-08-15 18:18] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text.sub("\n", " ")}'
    [2016-08-15 18:18] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text}'|grep -vE -e '^[0-9]{4} ' -e '^details$'
    [2016-08-15 18:19] curl -s ${url}|nokogiri -e '$_.css("table.wikitable td a:first-child").each{|a| puts a.text}'|grep -vE -e '^[0-9]{4} ' -e '^details$'|sort |uniq -c|sort -n
And most of that was fumbling around thinking that the year/details links were one link with a newline in it, when they are in fact two separate links.

For a grand total of 11 minutes; generously.

My point is: Don't be afraid to play around with your tools and data! Have fun! Not everything needs to be production quality! And being written primarily in bash doesn't necessarily mean that it isn't production quality!


Which company/industry do you work in, where people who would be parsing your code are unable to grok something as basic as bash?


I don't think it's really bash, more the utilities' inconsistent APIs and single-letter flags (which are not descriptive at all). Each command has its own API with different options, so it takes a lot of exposure to really learn them.


If you don't like single-letter flags, most commands have longer, more descriptive flags that do the same thing. For example, you can either do:

   make -I dir
or

   make --include-dir=dir
Choose whichever one makes most sense.


This.

Anyone who uses single-letter options in a script should be punished.

Single letters are for one-time typing.


Single letters are sometimes better known than their full-option equivalents. Quick, without a man page: what's the short alias for `tar --get`?


I have no idea because I memorized 'tar zxvf' a long time ago [1] and I have to google everything else.

[1] https://xkcd.com/1597/


I thought you'd post this one https://xkcd.com/1168/


These days tar autodetects zipfiles, and you don't usually need to see the files as they come out, so you can do 'tar xf' instead...


I wouldn't use "--get". If I wanted to use the long option, I'd use "--extract".


Until you find yourself on (Free)BSD, where it doesn't look like the long options are supported: https://www.freebsd.org/cgi/man.cgi?make(1)



It depends:

    .
is portable, whereas:

    source
is not... which was a surprise. :)


This is the kind of query that excites me for WikiData's development.

http://wikidata.org


see also the SPARQL search: https://query.wikidata.org/


There is also the option of using Parsoid.

https://github.com/wikimedia/parsoid

That is MediaWiki's official off-wiki parser, which can turn wikitext into HTML or HTML back into wikitext. It would be reasonably simple to hook into its API and use it for data extraction instead.


Is converting Wikitext to HTML/RDFa really going to help with this task? I'd say it's actually clearer how to get the data out of the original Wikitext.


My 2 cents: the cut/grep lines could be replaced by a sed/awk one-liner such as:

  sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p'


Agreed. I was a long-time abuser of cut, but I have moved to relying on sed instead. I find that it's generally a lot more robust if you think through your expressions. For certain cases awk will also do the job. Perl one-liners do seem convenient, but they have never been my cup of tea.


There is an active project sponsored by the Wikimedia Foundation called PyWikiBot that I've been a contributor to and user of for over a decade now. If you want to do anything and everything with Wikipedia, look no further than: https://github.com/wikimedia/pywikibot-core


You should try Wikidata for any type of query that can't be answered using Google and where all the information itself is already on Wikipedia. It's way faster (if you know SPARQL) and way more powerful and flexible. It's surprising there aren't more people talking about it; triplestores are awesome.


... there's an API making half of this superfluous. You can do pretty much any MediaWiki reading or writing through it. (All Wikipedia bots are required to use it, for instance.)

https://en.wikipedia.org/w/api.php

The article text is a raw blob of wikitext you have to process, but you don't have to go to stupid lengths trying to parse HTML without a browser.
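For example, something along these lines should hand you the raw wikitext as JSON; the parameters here are from memory, so double-check them against the API docs:

  # Fetch a page's wikitext through the API (substitute the page title you want),
  # then pull the text out of the JSON with jq.
  curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=PAGE_TITLE&prop=wikitext&format=json' \
    | jq -r '.parse.wikitext["*"]'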


But he didn't parse any HTML in the article.


Extracting data from Wikipedia with type providers: http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/


An xpath like `//table/tr/td[2]/a[1]/text()` seems like it would be a lot simpler.
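If you want to try that from the shell, xmllint can evaluate XPath against HTML. A rough sketch (the expression will likely need tweaking, e.g. for tbody, and some xmllint versions concatenate the matched text nodes without separators):

  # url is whatever page you're scraping; xmllint's '-' reads the document from stdin
  curl -s "$url" \
    | xmllint --html --xpath '//table//tr/td[2]/a[1]/text()' - 2>/dev/null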


This is my go-to approach every time I have to parse HTML or XML. I still don't understand why people don't use something as simple as Google Spreadsheets and write a simple XPath to load tabular data using =IMPORTXML().


A large part of bioinformatics data processing involves these commands. They seem a little cryptic but they get the job done. I would also like to mention Datamash: https://www.gnu.org/software/datamash/
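For anyone who hasn't tried it, Datamash covers a lot of the sort | uniq -c | awk territory in one command; a quick taste:

  # Sum and mean of a column of numbers
  seq 10 | datamash sum 1 mean 1

  # Group by column 1 and count rows per group (-s sorts the input first)
  printf 'a\t1\nb\t2\na\t3\n' | datamash -s -g 1 count 2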


Isn't this a poster child example for the semantic web?


errrrrrrrrk. Extracting raw wiki-markup and trying to use it? Not the greatest of ideas. The only true parser of that language is MediaWiki itself. Doing it yourself is a recipe for a massive headache.


I was thinking the same thing. Wikimedia makes all of Wikipedia available for download; you don't need to screen scrape. LOL. I guess they had some other wikis they wanted to get data from, but the "main" worldwide Wikipedia site that everyone thinks of as Wikipedia makes the data freely downloadable, and I've downloaded it before.


A couple of years ago I found Perl was fastest at processing Wikipedia dumps.

It also didn't require having a JVM preloaded to make startup times acceptable during development (naming no other tools).

I do use shell tools to process data, a lot. They're particularly good for exploratory programming and initial analysis of new datasets.


cut, awk, grep, and perl can churn through an initial data dump like nobody's business.


Or, for a lot of the structured elements, you could use DBPedia.


There is also a wikipedia library for Python. An example of its use:

Using the wikipedia Python library (to search for oranges :)

https://jugad2.blogspot.in/2015/11/using-wikipedia-python-li...

And there may be libraries for other languages too, since the above library wraps a Wikipedia API:

https://en.wikipedia.org/wiki/Wikipedia:API


I actually added some of your alternative solutions to the bottom of the article, thanks for commenting :)


Python has excellent packages like mwparserfromhell and wikitables for this kind of processing.


I appreciate this for the novelty factor, but, somebody show this dude how to use a spreadsheet!


Honestly, what novelty factor?



