We just started a week and a half ago, when we joined the Pragmatic Programmer's writing month. We thought a 'release early, release often' approach would be best, which is why there are only a few in-progress chapters.
We will keep you posted, and thanks for the encouragement!
Thanks for your work; I am really enjoying it so far. As a counterpoint to some of the comments about the choice of language, I had the opposite response: oh neat, I get to learn Haskell and natural language processing at the same time.
If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.
You seem to know a lot about NLP. I've asked this question in various places and never found anyone who knew even a little, so I hope you don't mind my asking whether my problem can be solved with NLP at all.
I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (ZIP code before or after the city, no ZIP code at all, a P.O. box instead of a street name, ...). Everything is in text format (no graphics, so no OCR problem) but sits inside HTML tags in various forms (as a row in a table, inside one or multiple <div>s, as a <ul>, etc.).
- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?
Without knowing all the details, this is probably something you could do with a regular language (for example, with regular expressions).
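To make the regular-expression route a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the names `TAG`, `ADDRESS` and `extract_addresses` are hypothetical, and the pattern only covers US-style 5-digit ZIP codes, a comma between street and city, and capitalized street and city words. Real pages will need a much richer pattern.

```python
import re
from html import unescape

# Crude tag stripper: replaces anything that looks like <...> with a space.
TAG = re.compile(r"<[^>]+>")

# Hypothetical address pattern (an assumption, not a complete grammar).
ADDRESS = re.compile(
    r"""
    (?P<street>
        P\.?\s?O\.?\s?Box\s+\d+                       # P.O. Box 123 / PO Box 123
      | \d+\s+(?:[A-Z][a-z]+\.?\s+)*?[A-Z][a-z]+\.?   # 12 Main Street / 7 Elm Rd.
    )
    \s*,\s*                                           # comma between street and city
    (?:(?P<zip_before>\d{5})\s+)?                     # optional ZIP before the city
    (?P<city>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)           # one or more capitalized words
    (?:,?\s+(?P<zip_after>\d{5}))?                    # optional ZIP after the city
    """,
    re.VERBOSE,
)


def extract_addresses(html):
    """Strip tags, normalize whitespace, and return all pattern matches."""
    text = unescape(TAG.sub(" ", html))
    text = re.sub(r"\s+", " ", text)
    results = []
    for m in ADDRESS.finditer(text):
        results.append({
            "street": m.group("street"),
            "city": m.group("city"),
            # The ZIP may sit on either side of the city, or be missing (None).
            "zip": m.group("zip_before") or m.group("zip_after"),
        })
    return results


if __name__ == "__main__":
    page = "<ul><li>P.O. Box 123, Shelbyville 54321</li></ul>"
    print(extract_addresses(page))
```

Stripping the tags first means a table row, a run of divs, and a list item all reduce to the same flat text, so one pattern can serve all three layouts. Whether that gets anywhere near 98% depends entirely on how varied the real addresses are.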
Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.
This could be an NLP problem, although if you can find an adequate solution with a regular expression or a context-free grammar, that's the easier route.
A lot of modern NLP is based on statistical methods and driven by training data, so if you went one of those routes, the starting place would be a training corpus of example addresses identified within the context of these web pages. You might start by looking up some academic papers in this area to see whether it has been done and the methods published.
Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.
"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.
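To illustrate the "Fruit flies like bananas" point, here is a toy probabilistic CFG in Python that scores two hand-built parse trees of that sentence. The rule probabilities are invented for this example and not estimated from any corpus, and the names `RULES`, `tree_probability`, `INSECT_READING` and `VERB_READING` are hypothetical. A real statistical parser would learn the probabilities from a treebank and search over all parses rather than score two given ones.

```python
# Toy PCFG: (left-hand side, right-hand side) -> probability.
# All numbers are made up; the rules for each left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N", "N")): 0.3,     # compound noun, e.g. "fruit flies"
    ("NP", ("N",)): 0.7,
    ("VP", ("V", "NP")): 0.7,    # verb + object
    ("VP", ("V", "PP")): 0.3,    # verb + prepositional phrase
    ("PP", ("P", "NP")): 1.0,
    ("N", ("fruit",)): 0.4,
    ("N", ("flies",)): 0.3,
    ("N", ("bananas",)): 0.3,
    ("V", ("like",)): 0.95,
    ("V", ("flies",)): 0.05,
    ("P", ("like",)): 1.0,
}


def tree_probability(tree):
    """Probability of a parse tree: the product of the rules it uses."""
    label, *children = tree
    # The right-hand side is each child's label, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p


# Reading 1: the insects called fruit flies are fond of bananas.
INSECT_READING = (
    "S",
    ("NP", ("N", "fruit"), ("N", "flies")),
    ("VP", ("V", "like"), ("NP", ("N", "bananas"))),
)

# Reading 2: fruit moves through the air the way bananas do.
VERB_READING = (
    "S",
    ("NP", ("N", "fruit")),
    ("VP", ("V", "flies"), ("PP", ("P", "like"), ("NP", ("N", "bananas")))),
)

if __name__ == "__main__":
    print("insect reading:", tree_probability(INSECT_READING))
    print("verb reading:  ", tree_probability(VERB_READING))
```

With these invented numbers the insect reading scores higher, mostly because "flies" is rarely a verb in this toy lexicon. Both trees are grammatical, and only the probabilities separate them.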