
> I/O of specialized formats comes to mind quickly

The classic "I'll write my own csv parser - how hard can it be?"




> The classic "I'll write my own csv parser - how hard can it be?"

I did, as part of my work. It was easy.

To be very clear: the CSV files that are used are outputs from another tool, so they are much more "well-behaved" and "well-defined" (e.g. no escaping, in particular none for newlines; well-known separators; well-known encoding; ...) than many CSV files that you find on the internet.

On the other hand, some columns need a little bit of "special" handling (you could also do this as a post-processing step, but it is faster to attach a handler to a column so that this handling happens directly during parsing).

Under these circumstances (very well-behaved CSV files, but wanting the capability to do some processing as part of the CSV reading), any existing library for parsing CSV would likely either be a sledgehammer to crack a nut, or would have to be modified to suit the requirements.

So writing my own (very simple) CSV reader implementation was the right choice.
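For a sense of scale, such a reader can be a couple of dozen lines. Here's a minimal sketch in Python of what I mean, assuming a known single-character separator and no quoting or escaping; names like `read_csv` and `handlers` are illustrative, not the actual code:

```python
# Minimal sketch: reader for well-behaved CSV (known single-character
# separator, known encoding, no quoting/escaping), with optional
# per-column handlers applied during parsing.
def read_csv(path, sep=";", encoding="utf-8", handlers=None):
    handlers = handlers or {}  # column name -> callable applied to each value
    with open(path, encoding=encoding) as f:
        header = next(f).rstrip("\n").split(sep)
        for line in f:
            row = dict(zip(header, line.rstrip("\n").split(sep)))
            for col, fn in handlers.items():
                if col in row:
                    row[col] = fn(row[col])
            yield row

# e.g. convert a column to float while reading:
# rows = list(read_csv("data.csv", handlers={"price": float}))
```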


> very well-behaved CSV files

You were incredibly lucky. I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.


> I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.

To be fair: problematic CSV files do occur. But for the functionality that the program provides, it suffices that in such a situation an error message is shown that helps the user track down where the problem with the CSV file is. Or, if the reading does not fail, the user can see in the visualization of the read data where the error in the CSV file was.

In other words, the program is not expected to gracefully

- automatically figure out the "intended behaviour" (column separators, encoding, escaping, ...) of the CSV input,

- automatically correct incorrect input files.


CSV is _way_ hairier than folks think it is!!

And for anyone who's not convinced by CSV, consider parsing XML with a regex. "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

I've said it many times myself and eventually been burned by it each time. I'm not saying it's always wrong, but stop and think about whether you can _really_ trust that "little piece of data" not to grow...
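A contrived illustration of how this rots: the regex below works on the happy path, then silently matches a commented-out element once the document grows (while the live element, which gained an attribute, no longer matches at all):

```python
import re

# Happy path: the "little piece of data" comes out fine.
doc = "<user><email>a@example.com</email></user>"
print(re.search(r"<email>(.*?)</email>", doc).group(1))  # a@example.com

# Later, the document grows: the old address is commented out and the
# live element gains an attribute. The regex now returns the wrong value.
doc = ("<user><!-- <email>old@example.com</email> -->"
       "<email type='work'>b@example.com</email></user>")
print(re.search(r"<email>(.*?)</email>", doc).group(1))  # old@example.com
```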


> "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

relevant:

> ruby-saml was using two different XML parsers during the code path of signature verification. Namely, REXML and Nokogiri

where "REXML" does exactly what you described, and hilarity ensued

Sign in as anyone: Bypassing SAML SSO authentication with parser differentials - https://news.ycombinator.com/item?id=43374519 - March 2025 (126 comments)


The plural of regex is regrets...


What are some footguns? It does seem easy


It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.

Once strings are in play, it quickly gets very hairy though, with quoting and escaping that's all over the place.
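A small illustration of why: once fields can contain the delimiter or doubled-quote escapes, a naive split on commas returns garbage, while a real parser (here Python's csv module) copes:

```python
import csv, io

line = '1,"Smith, John","He said ""hi""",2024-01-01'

print(line.split(","))
# ['1', '"Smith', ' John"', '"He said ""hi"""', '2024-01-01']  -- wrong

print(next(csv.reader(io.StringIO(line))))
# ['1', 'Smith, John', 'He said "hi"', '2024-01-01']           -- right
```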

Badly formed, damaged, or truncated files are another caution area: are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?


Even with numbers, some locales use a comma `,` as the decimal separator and some use the dot `.`, so that can cause headaches out of the box.
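In Python, for instance, `float()` only understands the dot, and locale-aware parsing needs `locale.atof` (which in turn assumes the relevant locale is actually installed on the system):

```python
import locale

# float("1,5") raises ValueError: float() only accepts "." as the
# decimal point. Locale-aware parsing (assumes a German locale is
# installed on the system, which may not be true everywhere):
locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
print(locale.atof("1.234,5"))  # 1234.5 -- "." is the thousands separator here
```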


Beyond the basic implementation of quoting and escaping, those are things you also have to worry about if you use someone else's csv parser.

And if you implement your own, you get to choose the answers you want.


What do you mean "allowed to bail"?

Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.


Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.

It's not usually realistic to force your data source into compliance, nor is manually fixing it in between typically a worthwhile pursuit.
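A sketch of that "drop and spool" approach in Python; `validate` and the other names here are illustrative, not a prescription:

```python
import csv

# Keep valid rows, collect failures for manual review on the side,
# and report the exclusion rate instead of aborting the whole run.
def load_with_spool(path, validate):
    good, failures = [], []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            try:
                good.append(validate(row))
            except Exception as exc:
                failures.append((lineno, row, str(exc)))
    total = len(good) + len(failures)
    if total:
        pct = 100 * len(failures) / total
        print(f"excluded {len(failures)}/{total} rows ({pct:.1f}%)")
    return good, failures
```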


multiline values, comma vs semicolon, value delimiter escaping
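The first of those is the one that surprises people most: a single logical record can span several physical lines, so splitting the input on newlines before parsing is already a bug. E.g., with Python's csv module and a semicolon delimiter:

```python
import csv, io

# A quoted field containing a newline is still one record.
data = 'id;note\n1;"first line\nsecond line"\n'
print(list(csv.reader(io.StringIO(data), delimiter=";")))
# [['id', 'note'], ['1', 'first line\nsecond line']]
```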



