Hacker News

What are some footguns? It does seem easy.



It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.

Once strings are in play, though, it quickly gets very hairy, with quoting and escaping conventions that are all over the place.
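A minimal sketch of the hairiness, using Python's stdlib `csv` module (the data line is made up for illustration): a single record with an embedded comma, an escaped quote, and a newline inside a quoted field is all legal CSV, and a naive `split(",")` mangles every one of those cases.

```python
import csv
import io

# Hypothetical record: embedded comma, doubled-quote escape, and a newline
# inside a quoted field -- all legal CSV.
raw = '1,"Smith, Jane","She said ""hi""","line one\nline two"\n'

# Naive split: quotes stay attached and the embedded comma splits a field.
naive = raw.strip().split(",")

# A real CSV parser handles all three cases.
proper = next(csv.reader(io.StringIO(raw)))

print(len(naive))   # 5 -- wrong field count, quoting ignored
print(proper)       # ['1', 'Smith, Jane', 'She said "hi"', 'line one\nline two']
```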

Badly formed, damaged, or truncated files are another caution area: are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?
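The bail-or-tolerate choice shows up directly in parser options. A sketch with Python's stdlib `csv` module (the malformed line is invented): by default the parser silently "repairs" a stray character after a closing quote, while `strict=True` raises an error the caller can decide how to handle.

```python
import csv

# Hypothetical malformed row: a stray 'x' after the closing quote.
line = '1,"oops"x,3'

# Lenient (the default): the parser quietly folds the stray character in.
lenient = next(csv.reader([line]))   # ['1', 'oopsx', '3']

# Strict: the same row raises csv.Error instead of guessing.
err = None
try:
    next(csv.reader([line], strict=True))
except csv.Error as exc:
    err = str(exc)

print(lenient, err)
```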


Even with numbers, some locales use a comma `,` as the decimal separator and some use the dot `.`, so that can cause headaches out of the box.
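A common variant of this, sketched in Python (the file contents and field names are hypothetical): European exports often pair a `;` field delimiter with a `,` decimal mark, and `float()` assumes the dot convention, so the value needs an explicit conversion step (a locale-aware parser, or the simple replace shown here).

```python
import csv
import io

# Hypothetical European-style export: ';' delimiter, ',' decimal mark.
raw = 'name;price\nwidget;3,14\n'

reader = csv.reader(io.StringIO(raw), delimiter=';')
header = next(reader)
row = next(reader)

# float('3,14') would raise ValueError; normalize the decimal mark first.
price = float(row[1].replace(',', '.'))
print(price)   # 3.14
```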


Beyond the basic implementation of quoting and escaping, those are things you have to worry about even if you use someone else's CSV parser.

And if you implement your own, you get to choose the answers you want.


What do you mean "allowed to bail"?

Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.


Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.

It's not usually realistic to force your data source into compliance, nor is manually fixing it in between typically a worthwhile pursuit.
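The drop-and-log approach above can be sketched with Python's stdlib `csv` module (the input lines are invented): parse record by record with `strict=True`, keep the good rows, and spool failures with their line numbers for later review. One caveat, relevant to this thread: feeding lines individually like this breaks quoted multiline fields, which would need a streaming reader instead.

```python
import csv

# Hypothetical input: line 3 has a stray character after a closing quote.
lines = ['a,b,c\n', '1,"ok, quoted",3\n', '1,"bad"x,3\n', '4,5,6\n']

good, failures = [], []
for lineno, line in enumerate(lines, 1):
    try:
        # strict=True makes the parser raise instead of guessing.
        row = next(csv.reader([line], strict=True))
    except csv.Error as exc:
        # Side spool for manual review after the fact.
        failures.append((lineno, line, str(exc)))
        continue
    good.append(row)

print(len(good), len(failures))   # 3 1
```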


multiline values, comma vs semicolon, value delimiter escaping
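Two of those in one sketch, using Python's stdlib `csv` module (the record is made up): a semicolon-delimited file with a quoted multiline value. The embedded newline is data, not a record boundary, so splitting the file on `\n` before parsing would corrupt it; a proper reader configured with the right delimiter handles it.

```python
import csv
import io

# Hypothetical record: ';' delimiter, newline inside a quoted field.
raw = 'id;note\n1;"first line\nsecond line"\n'

rows = list(csv.reader(io.StringIO(raw), delimiter=';'))
print(rows)   # [['id', 'note'], ['1', 'first line\nsecond line']]
```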




