> Rust is one of those languages where I need to work to make the compiler happy, but once I manage that, the code generally works on the first try.
I can confirm this. I have a CSV parser[1] that is maybe twice as fast as Python's CSV parser (which is written in C)+. There's nothing magical going on: with Rust, I can expose a safe iterator over fields in a record without allocating.

[1] - https://github.com/BurntSushi/rust-csv

The docs explain the different access patterns (start with convenience and move toward performance): http://burntsushi.net/rustdoc/csv/#iteratoring-over-records

+ - Still working on gathering evidence...
These are all things that a proper understanding of regex and a minimal understanding of streaming file I/O can cover. The whole "if you think regex is the solution to your problem, now you have two problems" thing has gotten out of hand. Regex is not that hard.
It's simpler to handle quotes and backslash-escaped commas with a custom parser. And then there's the domain knowledge baked into the library. Does your regex solution produce Excel-compatible CSV files when you have leading zeros? That's important to some people.
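For illustration, here's a minimal sketch (field names made up) of how little code a quote-aware custom splitter takes; handling the same cases in a regex gets hairy fast:

    def split_csv_line(line):
        # Minimal sketch of a custom parser: split one line on commas,
        # honoring double quotes. It ignores escaped quotes, embedded
        # newlines, and everything else a real-world file might contain.
        fields, field, in_quotes = [], [], False
        for ch in line.rstrip('\n'):
            if ch == '"':
                in_quotes = not in_quotes
            elif ch == ',' and not in_quotes:
                fields.append(''.join(field))
                field = []
            else:
                field.append(ch)
        fields.append(''.join(field))
        return fields

    print(split_csv_line('name,"Smith, John",42'))
    # ['name', 'Smith, John', '42']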
I think the people downvoting me don't understand how ludicrously sloppy most people who output CSV are. CSV is not RFC 4180. It's whatever bullshit text file your client has handed you and convinced your project manager is your problem to parse, not their problem to generate even remotely correctly. There is no CSV library capable of handling "CSV". Every time someone asks you for it, you'd better kick and scream or expect to do a custom job.
I think people are being a bit unfair downvoting you right now (I bumped you up), but I also disagree with you.
When I'm working with a file that purports to be CSV/TSV in Python, I reach for the CSV module, specify the dialect that created it, and instantly get the power of being able to identify and refer to all the fields and rows without otherwise having to worry about parsing them.
Is it 100% bulletproof? Definitely not. But, then again, I'm not writing a life-safety system. And I've also never had the Python CSV parser break on any reasonable file I've sent it.
I'm truly thankful for robust CSV/TSV parsers. Throwaway code like this just works; in particular, it handles parsing the column headers to automatically build the dict for me:
    import csv

    sitesFN = ['gateway.tsv', 'relay.tsv']
    dsites = {}
    for fn in sitesFN:
        with open(fn, 'r', newline='') as f:
            reader = csv.DictReader(f, dialect='excel-tab')
            for row in reader:
                # Map each NIC serial number to its device name.
                dsites[row['NIC_Serial_No']] = row['Device_Name']
Perhaps what you are trying to say, and what people are failing to hear, is that you can't rely on a CSV parser to handle, a priori, all possible files that purport to be "TSV/CSV". On that I agree with you: you will always need to examine the file and determine whether the built-in parser will handle it.
But what if it turns out the standard library CSV parser handles the "CSV" file just fine? In that case it seems to make a lot of sense to use it, rather than taking the time to write your own (along with the bugs that come from rewriting anything).
And, speaking just for myself again: I've never seen a CSV/TSV file that Python's parser didn't handle just fine. Not to say they aren't out there; you just have to go out of your way to create them.
> And, speaking just for myself again: I've never seen a CSV/TSV file that Python's parser didn't handle just fine. Not to say they aren't out there; you just have to go out of your way to create them.
Indeed. Python's CSV module supports a "strict" mode that will yell at you more often, but it is disabled by default. When disabled, the parser greatly prefers producing some parse over producing a strictly correct one. I took the same route with my CSV parser in Rust (with the intention of adding a strict mode later), because that's by far the most useful default. There's nothing more annoying than trying to slurp in a CSV file from somewhere and having your CSV library choke on it.
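To make the difference concrete, here's a minimal sketch in Python (the malformed line is made up):

    import csv
    import io

    data = 'a,"b"x,c\n'  # stray character after a closing quote

    # Default (non-strict): the parser recovers and produces *a* parse.
    print(list(csv.reader(io.StringIO(data))))
    # [['a', 'bx', 'c']]

    # With strict=True, the same input raises csv.Error instead.
    try:
        list(csv.reader(io.StringIO(data), strict=True))
    except csv.Error as e:
        print('strict parse failed:', e)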
Python's csv library has handled every CSV I've ever thrown at it; CSV is "standardized" enough for that. Just set two things, the delimiter and the quoting/escaping method, and be done with it. The output is a list of dictionaries with the column headers as keys, which is very elegant. The best part is that you can reuse the same settings you used to read a file to save or modify it, and be sure it will look the same when your client re-opens it. A regexp would take twice the time to write, wouldn't give you half of those features, and would probably fail at escaping sooner or later, for the same reason XML can't be parsed with regexps.
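A minimal sketch of that round trip (the filename, delimiter, and column name are all made up):

    import csv

    # Read with explicit settings...
    with open('clients.csv', newline='') as f:
        reader = csv.DictReader(f, delimiter=';', quotechar='"')
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        row['name'] = row['name'].strip()  # hypothetical tweak

    # ...and write back with the same settings, so the file keeps its shape.
    with open('clients.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames,
                                delimiter=';', quotechar='"')
        writer.writeheader()
        writer.writerows(rows)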
Well, if you look at, e.g., Python's CSV parsing library, it's been more than enough to cover my needs so far, and it handles different CSV flavours. It is much nicer and less error-prone than regexps.
You are getting downvoted to hell, but I can see what you mean. Generally, if someone hands you a CSV file, there is no guarantee that something mental isn't happening, as there is no single "CSV" standard in practice. So you're saying that when your task is "process the client's CSV file", you might not necessarily be able to rely on a library handling it correctly, and you should be prepared to get your hands dirty (perhaps hacking together something with a regex or two).
He actually has a point there. There are so many different versions of "CSV" floating around that I'm not at all sure I'd want to deal with a parser that could handle most of them. Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?
> I'm not at all sure I'd want to deal with a parser that could handle most of them.
Python's CSV parser will handle almost anything you throw at it and it is widely used to great success.
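It can even take a guess at an unfamiliar dialect for you; a minimal sketch (the sample data is made up):

    import csv
    import io

    sample = 'a;b;c\n1;2;3\n'  # made-up file with an unknown delimiter

    # csv.Sniffer inspects a sample of the file and guesses the dialect.
    dialect = csv.Sniffer().sniff(sample)
    print(dialect.delimiter)  # ';'
    print(list(csv.reader(io.StringIO(sample), dialect=dialect)))
    # [['a', 'b', 'c'], ['1', '2', '3']]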
> Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?
Just about every single CSV file that I've ever had to read was generated by someone other than me. Frequently (but not always), they come from a non-technical person.
Sometimes those CSV files even have NUL bytes in them. Yeah. Really. I swear. It's awful and Python's CSV parser fell over when trying to read them. (You can bet that my parser won't.)
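For anyone stuck with such files, one workaround (a sketch; the byte string is made up) is to filter the NULs out before the csv module ever sees them:

    import csv
    import io

    raw = b'a,b\x00,c\n1,2,3\n'  # made-up data with an embedded NUL

    # The csv module raises "line contains NUL" on input like this,
    # so strip NULs from the stream before handing it over.
    text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8')
    cleaned = (line.replace('\x00', '') for line in text)
    print(list(csv.reader(cleaned)))
    # [['a', 'b', 'c'], ['1', '2', '3']]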
> He actually has a point there.
His point is to use regexes instead of a proper CSV parser. I'm hard-pressed to think of a reason to ever do such a thing:
1. A regex is much harder to get correct than a standard CSV parser (see the sketch after this list).
2. A regex will probably be slower than a fast CSV parser.
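A minimal sketch of point 1, with a made-up record:

    import csv
    import io
    import re

    line = 'name,"Smith, John",42\n'

    # A naive regex split breaks on the comma inside the quoted field:
    print(re.split(r',', line.strip()))
    # ['name', '"Smith', ' John"', '42']

    # A real CSV parser gets the quoting right:
    print(next(csv.reader(io.StringIO(line))))
    # ['name', 'Smith, John', '42']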
A regexp would work, but it's usually the wrong level of abstraction to operate at. One wants to say "for the next 2,000 rows, retrieve columns 2 and 4, and the column labeled 'foo'", not write a regexp.
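In Python, that intent maps almost directly onto the csv module plus itertools; a sketch with made-up data:

    import csv
    import io
    from itertools import islice

    f = io.StringIO('x,y,foo,z\n' + '1,2,3,4\n' * 5000)  # made-up data
    reader = csv.reader(f)
    header = next(reader)
    foo_idx = header.index('foo')

    # "For the next 2,000 rows, retrieve columns 2 and 4, and the
    # column labeled 'foo'" -- expressed directly, no regexp involved.
    for row in islice(reader, 2000):
        col2, col4, foo = row[1], row[3], row[foo_idx]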