HTML Command Line Utilities (w3.org)
153 points by TheZenPsycho on March 11, 2014 | 33 comments



I found these the other day and wonder how they have largely slipped under the radar. Of particular interest are hxpipe and hxunpipe, which make "scraping" tasks absurdly easy by converting HTML to a form easily manipulable by sed, grep, and other fun unix utilities.

update: tracking the score of this post on the front page using this:

    curl -s https://news.ycombinator.com/news | hxnormalize | hxpipe | grep -C 20 "TheZenPsycho" | grep points
    -9\n                    points
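
The same thing can be done more directly with hxselect from the same suite. A sketch (hxnormalize -x makes the output well-formed XML, which hxselect needs; the CSS selector is a guess at HN's markup and may need adjusting):

    curl -s https://news.ycombinator.com/news | hxnormalize -x | hxselect -c -s '\n' 'td.subtext span' | grep points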


> how these have largely slipped under the radar

Your single example sells this more effectively than the entire OP. cf. http://bost.ocks.org/mike/example/


That site is inspiring. How many wonderful things can be built with the right mix of technology and passion.

Thanks for sharing.


You might like to try httpie in place of curl:

    https://pypi.python.org/pypi/httpie/
For example:

    http --body https://news.ycombinator.com/news | hxnormalize ...


Tangentially: I liked the style of their C code.


You know what makes me sad now? That there doesn't seem to be anything like these for CSS files, in particular for extracting references to external files and images, and for moving a CSS file from one directory to another while maintaining relative links.
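
For the extraction half, a crude approximation is possible with grep and sed; a hypothetical sketch (style.css is a placeholder, and this only lists url(...) references, with no rewriting):

    grep -oE "url\([^)]*\)" style.css | sed -E "s/^url\(['\"]?//; s/['\"]?\)$//"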


In Debian and Ubuntu those are in the official repositories. You can install them with

    sudo apt-get install html-xml-utils


For Arch Linux, they are available in the AUR.

https://aur.archlinux.org/packages/html-xml-utils/


Same with Red Hat:

    yum install html-xml-utils


Mac OS X + Homebrew:

  brew install html-xml-utils


From the README page:

  hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
Being unfamiliar with either Perl or AWK, could anyone point me to an explanation or example of why it is easier to parse, and what format it generates? Would it be easy to write a similar utility to, say, convert it to a Lua table?


The idea is that those utilities work in the UNIX way, which means that they are line-oriented.

The following two xml documents are equivalent:

    <a><b><c /></b><d>foo</d></a>
and

    <a>
      <b> <c /> </b>
      <d>foo</d>
    </a>
But recognizing that equivalence with classical UNIX tools, which are line-oriented, is quite difficult, so you'll have a hard time doing operations such as "replace 'foo' with 'bar' if it appears as the text node of a 'd' tag".

So the idea of hxpipe is that it is supposed to give you a line-oriented and similar representation of those two documents to work with.
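
For example, that replace operation could then become a small awk filter over the hxpipe stream, converted back with hxunpipe. A minimal sketch (doc.html is a placeholder; it ignores nested 'd' elements):

    hxpipe doc.html |
      awk '/^\(d$/ {in_d=1}
           /^\)d$/ {in_d=0}
           in_d && /^-/ {sub(/foo/, "bar")}
           {print}' |
      hxunpipe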

But it actually fails to do that properly (at least for my taste). I largely prefer the output of xml2. Compare:

    # first doc, output of hxpipe
    (a
    (b
    |c
    )b
    (d
    -foo
    )d
    )a
    -\n

    # second doc, output of hxpipe
    (a
    -\n  
    (b
    - 
    |c
    - 
    )b
    -\n  
    (d
    -foo
    )d
    -\n
    )a
    -\n

    # output of xml2, for both documents
    /a/b/c
    /a/d=foo
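
With the xml2 format, that same replace collapses into a one-line sed, and the companion tool 2xml turns the result back into XML. A rough sketch (doc.xml is a placeholder):

    xml2 < doc.xml | sed '/\/d=/s/foo/bar/' | 2xml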


Many thanks for the detailed reply. That makes a lot of sense.


Thanks! I've always created hokey awk scripts that split on <, but I really like that xml2 output!


Lots of other fun little utilities from the author, Bert Bos:

http://www.w3.org/People/Bos/#htmlutils


There's also tidy-html5, developed by W3C: https://github.com/w3c/tidy-html5


I tried to use this recently, but failed to build it. Probably my bad.


This builds with gcc 4.8.2 (and earlier), since the gcc stdlib has definitions for the min/max functions. But with clang you need to provide a MIN function/macro for your system type (hxindex.c does not link without it).

Specifically on Mac OS X you need to modify the hxindex.c like this:

    --- hxindex.c	2013-07-25 17:22:53.000000000 -0400
    +++ hxindex.c.patched	2014-03-11 10:05:55.000000000 -0400
    @@ -43,6 +43,7 @@
      * Version: $Id: hxindex.c,v 1.20 2013-07-25 21:04:05 bbos Exp $
      *
      **/
    +#include <sys/param.h>
     #include "config.h"
     #include <assert.h>
     #include <locale.h>
    @@ -439,7 +440,7 @@
     
       /* Count how many subterms are equal to the previous entry */
       i = 0;
    -  while (i < min(term->nrkeys, globalprevious->nrkeys) &&
    +  while (i < MIN(term->nrkeys, globalprevious->nrkeys) &&
	     !folding_cmp(term->sortkeys + i, 1, globalprevious->sortkeys + i, 1))
	 i++;
Basically, you need the sys/param.h include, and the min() calls have to be changed to MIN().


AFAIK the MIN macro is not standard (not sure though), so I made this:

  --- hxindex.c.orig	2014-03-11 17:55:17.305697689 +0200
  +++ hxindex.c	2014-03-11 17:58:30.331318646 +0200
  @@ -103,6 +103,10 @@
   #define SECNO "secno"		/* Class of elements that define section # */
   #define NO_NUM "no-num"		/* Class of elements without a section # */
   
  +#ifndef MIN
  +# define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
  +#endif
  +
   typedef struct _indexterm {
     string url;
     int importance;		/* 1 (low) or 2 (high) */
  @@ -435,7 +439,7 @@
   
     /* Count how many subterms are equal to the previous entry */
     i = 0;
  -  while (i < min(term->nrkeys, globalprevious->nrkeys) &&
  +  while (i < MIN(term->nrkeys, globalprevious->nrkeys) &&
   	 !folding_cmp(term->sortkeys + i, 1, globalprevious->sortkeys + i, 1))
       i++;
I also created an Arch package with this patch:

> https://dl.dropboxusercontent.com/u/127354948/arch-repo/html...


Hmmm, weird, I can't compile on Mac OS (clang); it complains about undefined iofuncs in openurl. I can't figure out exactly what it is missing: it seems to be a library, but which one, and why? I could install the Homebrew version, but it seems to have a bug that makes hxselect not work correctly :/


$ brew install -g html-xml-utils

seems to work on OS X 10.9.2


That's version 6.4, whereas the most recent is 6.6


I think the only dependency is curl.


That's also the impression I got from the sources, and I have libcurl. I'm quite puzzled at where the problem is coming from. Just did a system upgrade (I was a minor release of Mac OS behind) and a brew update and upgrade and I'm still missing this minor "something." Oh well, the only thing that does not work correctly is hxselect... I just faked it with grep.


brew install html-xml-utils


sudo port install html-xml-utils


brew++


Mh, interesting, I need to check this out. Currently I'm using xml2 and 2xml and classic unix tools (sed, grep, cut…) to deal with HTML in Bash scripts and Makefiles (this is how my personal website is regenerated automatically when I commit or push modifications, by calling `make` in the corresponding git hooks).
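
For the curious, such a hook can be as small as this; a sketch assuming a post-commit hook in .git/hooks, made executable:

    #!/bin/sh
    # rebuild the site from the top of the work tree after each commit
    make -C "$(git rev-parse --show-toplevel)"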


OH! I think I like the line format of those even better.


Indeed, I just tried hxpipe and its output is quite a mess. Almost impossible to work with compared to xml2!


Nevertheless, there are other gems such as hxselect and hxwls.
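
hxwls, for example, lists the links in a document, which makes for pleasant one-liners:

    curl -s http://www.w3.org/People/Bos/ | hxwls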


To find out what's for lunch:

    wget -qO- http://fazer.se/fleminggatan | hxselect 'div.OrangeHeader tr' | lynx -dump -stdin


These look pretty useful for some scripts. Thanks for linking, I've never heard of them before.



