We just started out one and a half weeks ago, joining the Pragmatic Programmer's ...

angrycoder · on Nov 16, 2010

Thanks for your work, I am really enjoying it so far. As a counterpoint to some of the comments regarding the choice of language, I had the opposite response: oh neat, I get to learn Haskell and natural language processing at the same time.

If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.

microtonal · on Nov 16, 2010

Should be fixed in 10 minutes. Thanks!

roel_v · on Nov 16, 2010

You seem to know a lot about NLP and I've asked this question in various places and never even found anyone who knew just a little, so I hope you don't mind that I ask you a small question on whether my problem can even be solved with NLP at all.

I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (zip codes before the city or after, no zip codes at all, P.O. box instead of street name, ...). All in text format (no graphics, no OCR problem) but inside HTML tags, in various forms (as a row in a table, inside one or multiple <div>'s, as a <ul>, etc).

- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?

microtonal · on Nov 16, 2010

Without knowing all the details, this is probably something you could do with a regular language (such as regular expressions).

Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.
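
To make the regular-expression suggestion concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the helper names (`visible_text`, `find_addresses`), the US-style five-digit zip code, and the short list of street suffixes are hypothetical and would need adapting to the pages actually being scraped. It first strips the HTML tags, so that a table row, a <div>, or a <ul> all reduce to plain text, and then matches either a P.O. box or a numbered street, with the zip code before the city, after it, or missing.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text content of `html`, with chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical pattern: assumes 5-digit zip codes and a few English suffixes.
ADDRESS = re.compile(
    r"""
    (?P<street>
        P\.?\s?O\.?\s?Box\s+\d+              # P.O. box instead of a street
      | \d+\s+(?:[A-Z][a-z]+\s+)+(?:Street|St|Avenue|Ave|Road|Rd)\b\.?
    )
    [,\s]+
    (?:(?P<zip_before>\d{5})\s+)?            # zip code before the city ...
    (?P<city>[A-Z][a-z]+)
    (?:[,\s]+(?P<zip_after>\d{5}))?          # ... or after it, or absent
    """,
    re.VERBOSE,
)


def find_addresses(html):
    """Return a list of dicts with street, zip (or None), and city."""
    results = []
    for m in ADDRESS.finditer(visible_text(html)):
        results.append({
            "street": m.group("street"),
            "zip": m.group("zip_before") or m.group("zip_after"),
            "city": m.group("city"),
        })
    return results
```

A pattern like this only covers the formats it was written for (single-word city names, for example), so whether it reaches a "98% solution" depends on how varied the real pages are.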

_corbett · on Nov 16, 2010

this could be an NLP problem, although if you can find an adequate solution with a regular expression or context-free grammar, that's the easier route.

a lot of modern NLP is based on statistical methods and driven by training data, meaning a training corpus of example addresses identified within the context of these webpages would be the starting place if you went one of those routes. you might start by looking up some academic papers in this area to see if it's been done and methods published.

nervechannel · on Nov 16, 2010

Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.

"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.

_corbett · on Nov 17, 2010

yea for sure, I meant in my advice that a non-statistical solution might be good to start out with.
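
As a rough illustration of the statistical-CFG point above, the Python sketch below scores the two readings of "Fruit flies like bananas" with a toy probabilistic grammar. The rule probabilities are invented for this example and are not taken from any corpus; the names `RULE_PROB` and `tree_probability` are likewise hypothetical. The probability of a parse is simply the product of the probabilities of the rules used in it.

```python
# Toy PCFG: maps (left-hand side, right-hand side) to a made-up probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N", "N")): 0.4,
    ("NP", ("N",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("N", ("fruit",)): 0.3,
    ("N", ("flies",)): 0.3,
    ("N", ("bananas",)): 0.4,
    ("V", ("like",)): 0.9,
    ("V", ("flies",)): 0.1,
    ("P", ("like",)): 1.0,
}


def tree_probability(tree):
    """Multiply the probabilities of every rule used in a parse tree.

    A tree is a tuple (label, child, ...); a child is a subtree or a word.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p


# Reading 1: the insects called fruit flies are fond of bananas.
INSECTS_ENJOY = (
    "S",
    ("NP", ("N", "fruit"), ("N", "flies")),
    ("VP", ("V", "like"), ("NP", ("N", "bananas"))),
)

# Reading 2: fruit travels through the air the way bananas do.
FRUIT_FLIES = (
    "S",
    ("NP", ("N", "fruit")),
    ("VP", ("V", "flies"), ("PP", ("P", "like"), ("NP", ("N", "bananas")))),
)

best = max([INSECTS_ENJOY, FRUIT_FLIES], key=tree_probability)
```

With these invented numbers the first reading scores higher, so `best` is `INSECTS_ENJOY`; a real parser would estimate the probabilities from a treebank instead.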

jimmyjim · on Nov 16, 2010

Hey Daniel,

I actually remember reading your Slackware book a few years back. I've no doubt that the quality of this text will be as superb as that one's! Cheers!