> Columns are separated by \u001F (ASCII unit separator)
> Rows are separated by \u001E (ASCII record separator)
That's a nightmare to try to edit yourself in a text editor?
I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
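To make that concrete, here is a minimal sketch of how such a format could be written and parsed. The details are my assumptions, not a spec: tab as the field separator, every value double-quoted, and only the backslash and the double quote escaped.

```python
# Hypothetical encoder/decoder for the proposed format: tab-separated,
# every value quoted, only '"' and '\' escaped with a backslash.

def write_field(value: str) -> str:
    # Escape backslashes first, then quotes, then wrap in quotes.
    return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'

def write_row(values) -> str:
    return '\t'.join(write_field(v) for v in values)

def parse_row(record: str):
    fields, buf, i = [], [], 0
    while i < len(record):
        assert record[i] == '"', "every field starts with a quote"
        i += 1
        while record[i] != '"':
            if record[i] == '\\':        # escaped quote or backslash
                i += 1
            buf.append(record[i])
            i += 1
        fields.append(''.join(buf))
        buf.clear()
        i += 2                           # skip the closing quote and the tab
    return fields

row = ['say "hi"', 'back\\slash', 'tab\tand\nnewline are fine']
assert parse_row(write_row(row)) == row
```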
I deal with CSVs all day every day. I'm known for these two posts:

https://donatstudios.com/Falsehoods-Programmers-Believe-Abou...

https://donatstudios.com/CSV-An-Encoding-Nightmare
Some friends and I actually started an RFC about 11 years ago for a CSV enhancement with an HTTP inspired header section with metadata including encoding. UTF-8 wasn't as clear of a winner back then. Never went anywhere.
> I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
Encode rather than escape, such as encoding an arbitrary byte as %xx where xx is two hex digits. Use this encoding for any '%' characters in the values, as well as for any field separators, record separators, and other bytes that have special meaning in your format.
Encoding rather than escaping means that given a record I can split it into fields using the built-in string splitting method of whatever language I'm using. Dealing with a format that can have the field separators escaped in the values will usually present less opportunity to use the language's efficient built-in string functions.
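A quick sketch of that encode-rather-than-escape idea, assuming the %xx scheme suggested above; the separator characters and helper names here are just illustrative:

```python
# Percent-encode '%' and the separator characters inside values, so records
# and fields can be split with the language's built-in split.

FIELD_SEP, RECORD_SEP = '\x1f', '\x1e'

def enc(value: str) -> str:
    return ''.join('%%%02X' % ord(c) if c in ('%', FIELD_SEP, RECORD_SEP) else c
                   for c in value)

def dec(value: str) -> str:
    head, *rest = value.split('%')
    return head + ''.join(chr(int(p[:2], 16)) + p[2:] for p in rest)

def dump(rows) -> str:
    return RECORD_SEP.join(FIELD_SEP.join(enc(f) for f in row) for row in rows)

def load(text: str):
    return [[dec(f) for f in rec.split(FIELD_SEP)] for rec in text.split(RECORD_SEP)]

rows = [['50% off', 'has a \x1f in it'], ['plain', 'text']]
assert load(dump(rows)) == rows
```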
One major limitation with quoted values that can themselves contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.
Some systems ban embedded record delimiters, for this reason.
What's the difference between quoted and escaped delimiters? (Keeping in mind that escape sequences can themselves be escaped, ad infinitum. You can't simply seek to an escape sequence and depend algorithmically on a small, fixed lookbehind.)
I think the parent meant that if newlines were encoded as "\n" (with a backslash) then you could always split on (actual) newlines and process the records in parallel without having to tokenize the quotes first.
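For example, under hypothetical escaping rules where only \n, \t, and \\ may appear inside values, every real newline in the file is a record boundary, so any newline-aligned slice of the file parses on its own (and could be handed to a separate worker):

```python
# Sketch: embedded newlines/tabs/backslashes are stored as the two-character
# sequences '\n', '\t', '\\', so chunks split on real newlines parse independently.

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\':
            out.append({'n': '\n', 't': '\t', '\\': '\\'}[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

def parse_chunk(chunk: str):
    # Safe on any slice of the file that starts and ends at a record boundary.
    return [[unescape(f) for f in rec.split('\t')]
            for rec in chunk.split('\n') if rec]

data = 'a\\nb\tc\nplain\trow\nlast\t\\\\backslash\n'
# Cut the file at any newline; the pieces parse independently and concatenate.
first, rest = data.split('\n', 1)
assert parse_chunk(first) + parse_chunk(rest) == parse_chunk(data)
```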
"That's a nightmare to try to edit yourself in a text editor?"
I have found these "unusual" separators are useful, e.g., I use the ASCII file separator (FS). I use tr and sed to add/remove/change separators. If I had to edit a table interactively I would change the separator to something visible before editing. That said, in nvi(1) or less(1), US is displayed as highlighted ^_ and RS as highlighted ^^. It is not difficult to work with if the data is just text, as it is for me. One could also use a hex editor like bvi.
On large tables, I prefer to make edits non-interactively. I use ed(1) scripts.
Unfortunately, the UNIX sort command's -t option only supports a limited range of characters as separators. US, RS and FS are not among them. If I want to use UNIX sort I have to change the separator to one that sort accepts before sorting.
The complaints about CSV I read on HN and elsewhere seem to be complaints about what people put into them, i.e., lack of enforced rules about what is acceptable, not about the format itself.
In the printing world, inkjet printers for industrial use rely on a file format that is all RS, GS, US, and FS characters. It had no line breaks; instead it used an RS character at the beginning of each record. It would routinely break if people tried to open it in a text editor. Nothing wants to deal with a 300 MB file consisting of a single line. I ended up writing my own library to manipulate the files and used a lot of tr, sed, and awk on the command line. It was a pain only because modern editors have forgotten control codes.
>> Columns are separated by \u001F (ASCII unit separator)
>> Rows are separated by \u001E (ASCII record separator)
>
> That's a nightmare to try to edit yourself in a text editor?
It is. When I needed to edit a file with field separators (a little tool I made used field separators) I found that Vim was great, because I could copy an existing character into a specific register, and then never use that register for anything else.
Excellent posts. Coincidentally, a couple of weeks ago, while evaluating an HTTP response for a web service, I noticed that for tabular data CSV is much more efficient than JSON; yet there is a lack of HTTP header support for CSV responses that could provide clients with supplementary information in order to keep parsers adaptable.
If you have a copy of the said RFC, I would like to refer to it.
We’re talking about a tabular data file format. If you want to include arbitrary binary, use a binary data file. Or base64 encoded data. Most datasets you’d use data like this for are small enough to fit into memory, so let’s not get carried away.
(I happen to use tab delimited files to store data that can’t fit into memory, but that’s okay too)
Yes. I think we're agreeing. I was responding to this: "Any binary allowed between the quotes." Binary data can't generally be dropped directly into a text format without some kind of organized encoding.
Yeah, I think so… I thought they meant using a quote as a flag for "binary data lies ahead", which really seemed odd to me. But it is completely possible in a custom file type. But yes, in this case the entire file wouldn't be UTF-8, even if all of the non-quoted data would be.
In retrospect, the idea of random binary data enclosed in quotes is what I’m mainly responding to — which I think we can all agree is a bad idea. (If you need to do that, encode it!)
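For example, a binary payload can be made safe for any text-based format by encoding it into the field first; base64 is the usual choice (a sketch, not tied to any particular CSV dialect):

```python
# If a value really must carry arbitrary bytes, encode them into text first
# (base64 here) instead of relying on quoting to contain raw binary.
import base64

blob = bytes(range(256))                         # arbitrary binary payload
field = base64.b64encode(blob).decode('ascii')   # plain ASCII, safe in any text format
assert base64.b64decode(field) == blob
```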
That's just text. In one sense, yes, it's arbitrary bits. But it's also clearly text. It's probably not what was referred to. If you have binary data that's encoded as text, it seems obvious that that can be embedded in text. It probably wouldn't be worth mentioning that "Any binary allowed between the quotes."
Actually, I think that it is, if it supports UTF-8 without BOM and does not try to do things like automatically converting quotation marks into non-ASCII quotation marks, etc.
However, if you do not want Unicode, using a proper ASCII text editor would be better, to avoid many of the problems with Unicode when loading an unknown file or copying unknown content via the clipboard, etc., which may contain homoglyphs, reverse text direction overrides, and so on.
(I use vim with exclusively ASCII-only mode, and do not use a Unicode locale, on my computer.)
I just mean, if you _require_ plugins in order to be able to edit the content, then the content can't easily be described as text. It is fine to use a specialized application to edit a file of a non-text format (I have nothing against that), but you have then left the realm of the text editor.
As an example of what I mean, if someone wrote a vim plugin that allowed a user to interact with a sqlite file and change the schema or edit raw row values from vim, it could be a really valuable and useful plugin. But the presence or absence of a plugin for some given text editor doesn't change whether a given format is generally considered a format suitable for being edited in a text editor. What it does instead is convert vim into an editor of non-text files.
I admit that the proposed file format is much closer to being editable in a text editor than a binary format such as sqlite, but the fact that the characters cannot be typed without special functionality suggests that the format is not suitable for a text editor.
The described format _is_ editable in vim without a plugin. It's just a little awkward, because everything will be on a single line, and you have to use more complicated key commands to enter the delimiters (for example `<C-v><C-6>` for \x1e).