Hacker News new | past | comments | ask | show | jobs | submit login

I think people who are downvoting me don't understand how ludicrously retarded most people who output CSV are. CSV is not RFC4180. It's whatever bullshit text file your client has handed you and convinced your project manager is your problem to parse, not their problem to generate even remotely correctly. There is no CSV library capable of handling "CSV". Every time someone asks you for it, you better kick and scream or expect to do a custom job.



I think people are being a fit unfair downvoting you right now (I bumped you up) - but I also disagree with you.

When I'm working with a file that purports to be CSV/TSV, with python, I reach out to the CSV module, specify the dialect that created it - and instantly get all the power of being able to identify refer to all the fields, rows without having to otherwise worry about parsing them.

Is it 100% bulletproof - definitely not - but, then again, I'm not writing a life safety system. And I've also never had the Python CSV parser break on any reasonable file I've sent it.

I'm truly thankful for robust CSV/TSV parsers. Throwaway code like this just works - in particular handles parsing the column headers to automatically build the dict for me.

   sitesFN=['gateway.tsv','relay.tsv']
   dsites={}
   for fn in sitesFN:
       f=open(fn,'r')
       reader=csv.DictReader(f,dialect='excel-tab')
       for row in reader:
           dsites[row['NIC_Serial_No']]=row['Device_Name']
Perhaps what you are trying to say that people are failing to hear, is that you can't rely on a CSV parser to, a priori, handle all possible files that purport to be "TSV/CSV" - in that I agree with you, and that you will always need to examine the file and determine if the built in parser will handle it.

But - What if it turns out the standard library CSV parser handles the "CSV" file just fine - in that case it seems to make a lot of sense to use it, rather than taking the time in writing your own (along with the bugs that come from re-writing anything).

And, speaking just for myself - again, I've never seen a CSV/TSV file that the python didn't handle just fine - not to say they aren't out there - you just have to go out of your way to create them.


> And, speaking just for myself - again, I've never seen a CSV/TSV file that the python didn't handle just fine - not to say they aren't out there - you just have to go out of your way to create them.

Indeed. Python's CSV module supports a "strict" mode that will yell at you more often, but by default, it is disabled. When disabled, the parser will greatly prefer a parse over a correct parse. I took the same route with my CSV parser in Rust (with the intention of adding a strict mode later), because that's by far the most useful implementation. There's nothing more annoying then trying to slurp in a CSV file from somewhere and having your CSV library choke on it.


The csv library of python has handled every csv i've ever thrown on it, csv is "standardized" enough for that. Just set two things, the delimiter, the escape quote method and be done with it. The output is a list of dictionaries with the column headers as keys, very elegant. The best part is that using the same input you did to read a file you can use to save/modify the file and be sure it will look the same when your client re-opens it. A regexp would take twice the time to write and wouldn't give you half of those features, and it would probably fail at escaping sooner or later, for the same reason as xml can't be parsed with regexps.


While I agree with you, I'm having a lot of trouble handling unicode and windows' latin-1 encodings...


This post sums up why.

http://programmers.stackexchange.com/a/215171

CSV is not just simple fields separated by commas with some quotes thrown in.


Well, if you look at, eg, Python's CSV parsing library, it's been more than enough to cover my needs so far, and handles different CSV flavours. It is much nicer and less error-prone to use than using regexps.


You are getting downvoted to hell but I can see what you're meaning. Generally if someone hands you a CSV file there is no guarantee that something mental isn't happening as there is no "CSV" standard. So you're saying that when your task is "process the client's CSV file" you might not necessarily be able to rely on a library handling it correctly, and that you should prepare to get your hands dirty (perhaps hacking together something with a regex or two).


He actually has a point there. There are so many different versions of "CSV" floating around that I'm not at all sure I'd want to deal with a parser that could handle most of them. Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?

Yeah.


> I'm not at all sure I'd want to deal with a parser that could handle most of them.

Python's CSV parser will handle almost anything you throw at it and it is widely used to great success.

> Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?

Just about every single CSV file that I've ever had to read was generated by someone other than me. Frequently (but not always), they come from a non-technical person.

Sometimes those CSV files even have NUL bytes in them. Yeah. Really. I swear. It's awful and Python's CSV parser fell over when trying to read them. (You can bet that my parser won't.)

> He actually has a point there.

His point is to use regexes instead of a proper CSV parser. I'm hard pressed to think of a reason to ever do such a thing:

1. A regex is much harder to get correct than using a standard CSV parser. 2. A regex will probably be slower than a fast CSV parser.


"Python's CSV parser will handle almost anything you throw at it and it is widely used to great success."

I like the word "almost". It's kept me in cheesy-puffs for years, now. :-)

I was actually only speaking to the post I replied to. CSV is a mess.


A good CSV Parser (Python's) let's you specify the dialect of CSV file - it doesn't make assumptions as to the format of the CSV file.


I think you just answered your own question. Parsing CSV is not a problem solved by "any standard regex library".




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: