> Columns are separated by \u001F (ASCII unit separator)
> Rows are separated by \u001E (ASCII record separator)
That's a nightmare to try to edit yourself in a text editor?
I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
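To make that concrete, here is a minimal sketch of how such a format could be written and parsed. The details are my assumptions, not a spec: tab as the field separator, every value double-quoted, and only the backslash and the double quote escaped.

```python
# Hypothetical encoder/decoder for the proposed format: tab-separated,
# every value quoted, only '"' and '\' escaped with a backslash.

def write_field(value: str) -> str:
    # Escape backslashes first, then quotes, then wrap in quotes.
    return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'

def write_row(values) -> str:
    return '\t'.join(write_field(v) for v in values)

def parse_row(record: str):
    fields, buf, i = [], [], 0
    while i < len(record):
        assert record[i] == '"', "every field starts with a quote"
        i += 1
        while record[i] != '"':
            if record[i] == '\\':        # escaped quote or backslash
                i += 1
            buf.append(record[i])
            i += 1
        fields.append(''.join(buf))
        buf.clear()
        i += 2                           # skip the closing quote and the tab
    return fields

row = ['say "hi"', 'back\\slash', 'tab\tand\nnewline are fine']
assert parse_row(write_row(row)) == row
```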
I deal with CSVs all day every day. I'm known for these two posts:

https://donatstudios.com/Falsehoods-Programmers-Believe-Abou...

https://donatstudios.com/CSV-An-Encoding-Nightmare
Some friends and I actually started an RFC about 11 years ago for a CSV enhancement with an HTTP inspired header section with metadata including encoding. UTF-8 wasn't as clear of a winner back then. Never went anywhere.
> I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
Encode rather than escape, such as encoding an arbitrary byte as %xx where xx is two hex digits. Use this encoding for any '%' characters in the values, as well as for any field separators, record separators, and other bytes that have special meaning in your format.
Encoding rather than escaping means that given a record I can split it into fields using the built-in string splitting method of whatever language I'm using. Dealing with a format that can have the field separators escaped in the values will usually present less opportunity to use the language's efficient built-in string functions.
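A quick sketch of that encode-rather-than-escape idea, assuming the %xx scheme suggested above; the separator characters and helper names here are just illustrative:

```python
# Percent-encode '%' and the separator characters inside values, so records
# and fields can be split with the language's built-in split.

FIELD_SEP, RECORD_SEP = '\x1f', '\x1e'

def enc(value: str) -> str:
    return ''.join('%%%02X' % ord(c) if c in ('%', FIELD_SEP, RECORD_SEP) else c
                   for c in value)

def dec(value: str) -> str:
    head, *rest = value.split('%')
    return head + ''.join(chr(int(p[:2], 16)) + p[2:] for p in rest)

def dump(rows) -> str:
    return RECORD_SEP.join(FIELD_SEP.join(enc(f) for f in row) for row in rows)

def load(text: str):
    return [[dec(f) for f in rec.split(FIELD_SEP)] for rec in text.split(RECORD_SEP)]

rows = [['50% off', 'has a \x1f in it'], ['plain', 'text']]
assert load(dump(rows)) == rows
```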
One major limitation with quoted values that can themselves contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.
Some systems ban embedded record delimiters, for this reason.
What's the difference between quoted and escaped delimiters? (Keeping in mind that escape sequences can themselves be escaped, ad infinitum. You can't simply seek to an escape sequence and depend algorithmically on a small, fixed lookbehind.)
I think the parent meant that if newlines were encoded as "\n" (with a backslash) then you could always split on (actual) newlines and process the records in parallel without having to tokenize the quotes first.
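For example, under hypothetical escaping rules where only \n, \t, and \\ may appear inside values, every real newline in the file is a record boundary, so any newline-aligned slice of the file parses on its own (and could be handed to a separate worker):

```python
# Sketch: embedded newlines/tabs/backslashes are stored as the two-character
# sequences '\n', '\t', '\\', so chunks split on real newlines parse independently.

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\':
            out.append({'n': '\n', 't': '\t', '\\': '\\'}[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

def parse_chunk(chunk: str):
    # Safe on any slice of the file that starts and ends at a record boundary.
    return [[unescape(f) for f in rec.split('\t')]
            for rec in chunk.split('\n') if rec]

data = 'a\\nb\tc\nplain\trow\nlast\t\\\\backslash\n'
# Cut the file at any newline; the pieces parse independently and concatenate.
first, rest = data.split('\n', 1)
assert parse_chunk(first) + parse_chunk(rest) == parse_chunk(data)
```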
"That's a nightmare to try to edit yourself in a text editor?"
I have found these "unusual" separators are useful, e.g., I use the ASCII file separator (FS). I use tr and sed to add/remove/change separators. If I had to edit a table interactively I would change the separator to something visible before editing. That said, in nvi(1) or less(1), US is displayed as highlighted ^_ and RS as highlighted ^^. It is not difficult to work with if the data is just text, as it is for me. One could also use a hex editor like bvi.
On large tables, I prefer to make edits non-interactively. I use ed(1) scripts.
Unfortunately, the UNIX sort command's -t option only supports a limited range of characters as separators. US, RS and FS are not among them. If I want to use UNIX sort I have to change the separator to one that sort accepts before sorting.
The complaints about CSV I read on HN and elsewhere seem to be complaints about what people put into them, i.e., lack of enforced rules about what is acceptable, not about the format itself.
In the printing world, inkjet printers for industrial use rely on a file format that is all RS, GS, US, and FS characters. It had no line breaks; instead it used an RS character at the beginning of each record. It would routinely break if people tried to open it in a text editor. Nothing wants to deal with a 300 MB file consisting of a single line. I ended up writing my own library to manipulate the files and used a lot of tr, sed, and awk on the command line. It was a pain only because modern editors have forgotten control codes.
>> Columns are separated by \u001F (ASCII unit separator)
>> Rows are separated by \u001E (ASCII record separator)
>
> That's a nightmare to try to edit yourself in a text editor?
It is. When I needed to edit a file with field separators (a little tool I made used field separators) I found that Vim was great, because I could copy an existing character into a specific register, and then never use that register for anything else.
Excellent posts. Coincidentally, a couple of weeks ago, while evaluating an HTTP response for a web service, I noticed that for tabular data CSV is much more efficient than JSON; yet there is a lack of HTTP header support for CSV responses that could provide clients with supplementary information in order to keep parsers adaptable.
If you have a copy of the said RFC, I would like to refer to it.
We’re talking about a tabular data file format. If you want to include arbitrary binary, use a binary data file. Or base64 encoded data. Most datasets you’d use data like this for are small enough to fit into memory, so let’s not get carried away.
(I happen to use tab delimited files to store data that can’t fit into memory, but that’s okay too)
Yes. I think we're agreeing. I was responding to this: "Any binary allowed between the quotes." Binary data can't generally be dropped directly into a text format without some kind of organized encoding.
Yeah, I think so… I thought they meant using a quote as a flag for "binary data lies ahead", which really seemed odd to me. But it is completely possible in a custom file type. But yes, in this case the entire file wouldn't be UTF-8, even if all of the non-quoted data would be.
In retrospect, the idea of random binary data enclosed in quotes is what I’m mainly responding to — which I think we can all agree is a bad idea. (If you need to do that, encode it!)
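For example, a binary payload can be made safe for any text-based format by encoding it into the field first; base64 is the usual choice (a sketch, not tied to any particular CSV dialect):

```python
# If a value really must carry arbitrary bytes, encode them into text first
# (base64 here) instead of relying on quoting to contain raw binary.
import base64

blob = bytes(range(256))                         # arbitrary binary payload
field = base64.b64encode(blob).decode('ascii')   # plain ASCII, safe in any text format
assert base64.b64decode(field) == blob
```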
That's just text. In one sense, yes, it's arbitrary bits. But it's also clearly text. It's probably not what was referred to. If you have binary data that's encoded as text, it seems obvious that that can be embedded in text. It probably wouldn't be worth mentioning that "Any binary allowed between the quotes."
Actually, I think that it is, if it supports UTF-8 without BOM and does not try to do things like automatically converting quotation marks into non-ASCII quotation marks, etc.
However, if you do not want Unicode, using a proper ASCII text editor would be better, to avoid many of the problems with Unicode when loading an unknown file or copying unknown content via the clipboard, etc., which may contain homoglyphs, reverse text direction overrides, and so on.
(I use vim with exclusively ASCII-only mode, and do not use a Unicode locale, on my computer.)
I just mean, if you _require_ plugins in order to be able to edit the content, then the content can't easily be described as text. It is fine to use a specialized application to edit a file of a non-text format (I have nothing against that), but you have then left the realm of the text editor.
As an example of what I mean, if someone wrote a vim plugin that allowed a user to interact with a sqlite file and change the schema or edit raw row values from vim, it could be a really valuable and useful plugin. But the presence or absence of a plugin for some given text editor doesn't change whether a given format is generally considered a format suitable for being edited in a text editor. What it does instead is convert vim into an editor of non-text files.
I admit that the proposed file format is much closer to being editable in a text editor than a binary format such as sqlite, but the fact that the characters cannot be typed without special functionality suggests that the format is not suitable for a text editor.
The described format _is_ editable in vim without a plugin. It's just a little awkward, because everything will be on a single line, and you have to use more complicated key commands to enter the delimiters (for example `<C-v><C-6>` for \x1e).