> The classic "I'll write my own csv parser - how hard can it be?"
I did this as part of my work. It was easy.
To be very clear: the CSV files in question are outputs from another tool, so they are much more "well-behaved" and "well-defined" (e.g. no escaping, in particular none for newlines; well-known separators; well-known encoding; ...) than many CSV files you find on the internet.
On the other hand, some columns need a little bit of "special" handling (you could also do this as a post-processing step, but it is faster to attach a handler to a column that does this handling directly during parsing).
Under these circumstances (very well-behaved CSV files, but wanting the ability to do some processing as part of the CSV reading), any existing CSV-parsing library would likely either be a sledgehammer to crack a nut, or would have to be modified to suit the requirements.
So writing my own (very simple) CSV reader implementation was the right choice.
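For the curious, here is a minimal sketch in Python of the shape such a reader can take; the function name, the handler mechanism, and the default separator are my own illustration, not the actual implementation:

```python
from typing import Callable, Dict, Iterator, Optional

def read_simple_csv(
    path: str,
    sep: str = ";",                 # assumed well-known separator
    encoding: str = "utf-8",        # assumed well-known encoding
    handlers: Optional[Dict[str, Callable[[str], object]]] = None,
) -> Iterator[Dict[str, object]]:
    """Reader for well-behaved CSV: no quoting, no escaped newlines.

    'handlers' attaches per-column processing directly to the parse,
    instead of running it as a separate post-processing pass.
    """
    handlers = handlers or {}
    with open(path, encoding=encoding) as f:
        header = next(f).rstrip("\n").split(sep)
        for lineno, line in enumerate(f, start=2):
            fields = line.rstrip("\n").split(sep)
            if len(fields) != len(header):
                # input is supposed to be well-behaved, so any mismatch
                # is worth a loud error rather than a silent guess
                raise ValueError(
                    f"line {lineno}: expected {len(header)} fields, got {len(fields)}"
                )
            row: Dict[str, object] = dict(zip(header, fields))
            for col, fn in handlers.items():
                row[col] = fn(row[col])
            yield row

# e.g. read_simple_csv("export.csv", handlers={"value": float})
```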
You were incredibly lucky. I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.
> I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.
To be fair: problematic CSV files do occur. But for the functionality that the program provides, it suffices that in such a situation an error message is shown to the user that helps them track down where the problem with the CSV file is. Or, if the reading does not fail, the user can see in the visualization of the read data where the error in the CSV file was.
In other words: what is not expected is that the program has to gracefully
- automatically figure out the "intended behaviour" of the CSV file (column separators, encoding, escaping, ...),
And for anyone who's not convinced by CSV, consider parsing XML with a regex. "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."
I've said it many times myself and eventually been burned by it each time. I'm not saying it's always wrong, but stop and think whether or not you can _really_ trust that "little piece of data" not to grow...
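A toy illustration in Python of how the growth usually bites (made-up data; ElementTree stands in for whatever real parser applies):

```python
import re
import xml.etree.ElementTree as ET

# Day one: the regex is all you need.
doc_v1 = "<order><id>42</id></order>"
print(re.search(r"<id>(.*?)</id>", doc_v1).group(1))  # 42

# Later the producer adds an attribute; the regex now silently finds nothing.
doc_v2 = '<order><id type="int">42</id></order>'
print(re.search(r"<id>(.*?)</id>", doc_v2))           # None

# A real parser does not care about the attribute.
print(ET.fromstring(doc_v2).findtext("id"))           # 42
```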
It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.
Once strings are in play, it quickly gets very hairy though, with quoting and escaping that's all over the place.
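For example (a made-up record, with Python's csv module standing in for any conforming parser):

```python
import csv
import io

line = '1,"Smith, John","He said ""hi""",-3.2e-5\n'

# Naive split: the quoted comma and the doubled quotes wreck the fields.
print(line.rstrip("\n").split(","))
# ['1', '"Smith', ' John"', '"He said ""hi"""', '-3.2e-5']

# A conforming CSV reader applies the quoting and escaping rules.
print(next(csv.reader(io.StringIO(line))))
# ['1', 'Smith, John', 'He said "hi"', '-3.2e-5']
```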
Badly formed, damaged, or truncated files are another caution area— are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?
Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.
Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.
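A sketch of that pattern in Python; the file names, the expected field count, and the validation rule are invented for the example:

```python
import csv

def load_with_spool(path: str, spool_path: str, expected_fields: int = 4):
    """Keep the valid rows, divert anything suspicious to a side file for review."""
    good, excluded = [], 0
    with open(path, newline="") as src, open(spool_path, "w", newline="") as spool:
        spool_writer = csv.writer(spool)
        for recno, row in enumerate(csv.reader(src), start=1):
            # Hypothetical validation: right field count, numeric first column.
            if len(row) == expected_fields and row[0].strip().isdigit():
                good.append(row)
            else:
                excluded += 1
                spool_writer.writerow([recno, *row])
    print(f"kept {len(good)} rows, excluded {excluded} for manual review")
    return good
```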
It's not usually realistic to force your data source into compliance, nor is manually fixing it in between typically a worthwhile pursuit.