> with no restrictions on the text in fields or the need to try and escape characters.
Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?
It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
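To make the escaping concern concrete, here's a minimal sketch (the separator constants and names are my own) of how the naive split-on-separator approach breaks:

```python
# Minimal sketch: ASCII unit/record separators work only while the
# data is guaranteed not to contain them.
US = "\x1f"  # unit separator (between fields)
RS = "\x1e"  # record separator (between rows)

def write_rows(rows):
    return RS.join(US.join(fields) for fields in rows)

def read_rows(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["alice", "engineering"], ["bob", "sales"]]
assert read_rows(write_rows(rows)) == rows  # fine for "clean" data

# A field containing a separator silently corrupts the table:
bad = [["alice", "note\x1fwith a stray separator"]]
assert read_rows(write_rows(bad)) != bad  # one field became two
```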
See, this is why I once used moon-viewing-ceremony-separated-values (MVCSV). The Moon Viewing Ceremony emoji (🎑) was unlikely to show up in my dataset, and not only is the emoji visible, it's quite visually pleasing.
Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.
The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.
> but these specific characters are not text. they exist solely to be delimiters.
Even if people used them only for their intended purpose, someone could use them as delimiters within the text you want to store (e.g.: a list of tags in a filename) - unless I'm misunderstanding.
> It would be like trying to escape a column in your spreadsheet.
Other formats do allow escaping their delimiters, so that you can use that character literally or even nest a string of that format within an entry.
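For example, RFC 4180-style quoting lets a CSV field contain commas, quotes, newlines, or even a whole nested CSV document (illustrated here with Python's csv module):

```python
import csv
import io

# A field may contain the delimiter, quotes, or an entire nested CSV table.
nested = "a,b\r\nc,d"  # a whole CSV document stored as one field
buf = io.StringIO()
csv.writer(buf).writerow(["outer", nested, 'she said "hi"'])
# Fields with special characters get quoted; embedded quotes are doubled.

row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row[1] == nested  # the nested document round-trips losslessly
```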
> But there is never any reason to use these characters literally, they are just delimiters.
I can put the literal separator characters in my comment (maybe because I'm demonstrating the format, referencing the characters, or just being capricious), and now someone scraping and storing comments has a need to use those characters literally. Sometimes it's fine to ignore certain content or store it lossily, but often not.
yes, you would still need to clean your inputs before randomly adding them to your table. Your contrived example brings me back to my original assertion that as long as you’re ok with those characters not being valid data it works fine. So, sure, if someone really wanted to store those two literal non-visible characters in a text file, that would not work. Everyone else could just not do that.
> yes, you would still need to clean your inputs before randomly adding them to your table.
Lossy is fine in some cases, but in many cases you do actually need the specific text you're trying to store - not just something similar to it. Hence my objection to "never any reason to use these characters literally".
> Your contrived example [...] if someone really wanted to store those two literal non-visible characters in a text file
Needing to store these specific characters is rare, but needing to store arbitrary text (possibly from adversarial/mischievous parties, or just a large enough dataset that encountering all edge cases is inevitable) is common. For instance, for security reasons a log shouldn't break or have a blind spot for folders with those characters.
> as long as you’re ok with those characters not being valid data it works fine
Which is what I'm saying in my original comment with "or alternatively, a restriction for the stored text not to have them".
if you’re storing arbitrary text from untrusted sources you will always need to clean it first. Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc. It wouldn’t have negated the need for more advanced storage and serialization formats.
> if you’re storing arbitrary text from untrusted sources you will always need to clean it first
Reversible escaping of characters is pretty common (though not always needed; length-before-text formats don't require it). But "cleaning" in the sense of deleting characters, such that you can no longer get back the original string, is definitely not required for all formats, and is a fairly undesirable property.
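As a sketch of what reversible escaping could look like for the ASCII separators (the scheme here is illustrative, not any standard):

```python
US, RS, ESC = "\x1f", "\x1e", "\\"

def escape(field: str) -> str:
    # Escape the escape character first so decoding is unambiguous.
    return (field.replace(ESC, ESC + ESC)
                 .replace(US, ESC + "u")
                 .replace(RS, ESC + "r"))

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == ESC:
            out.append({ESC: ESC, "u": US, "r": RS}[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

# Every string round-trips exactly - nothing gets deleted.
for s in ["plain", "has\x1fsep", "back\\slash", "\\u literal", "\x1e\x1f"]:
    assert unescape(escape(s)) == s
```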
> Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
You'd want to use some format that doesn't have the problem this one has, yeah. IMO ASCII delimited text just isn't really anywhere on the Pareto front of formats you'd want to use - it's unpleasant to work with manually, and once you're writing the file through code or a tabular editor you may as well use a format that can handle arbitrary text.
> I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc
I think you could say the same of RFC 4180. In reality, I don't see why this wouldn't also spawn dialects, like people adding newlines between rows so they can open it in a text editor without it being in one huge long line, or inventing an escaping scheme so that it can handle arbitrary text.
I feel like this (and some of the replies to this) is missing the point a bit.
I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.
The goal was to solve the problem of delimited text formats not allowing things like commas, quotes, newlines, tabs, pipes, etc. in fields.
I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing a machine-creatable and machine-readable format (emphasis on machine as opposed to human).
Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).
But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.
If you need a machine-readable format, why not go with escaping like most other formats, or length-before-text, to include all characters - instead of a format that fails on some (albeit rare) characters?
Both of those are fine, but they add additional complexity (even if small), whereas there is very little, if any, complexity added by using these two characters as delimiters.
For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators - even before having to add extra application logic to handle this method being fallible (e.g.: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
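For comparison, here's what a length-before-text framing could look like (the wire format and names are my own invention): every byte value is legal in a field, so there is no fallible case to handle.

```python
import struct

# Each field is a 4-byte big-endian length followed by that many UTF-8 bytes.
def pack_fields(fields):
    out = bytearray()
    for f in fields:
        raw = f.encode("utf-8")
        out += struct.pack(">I", len(raw)) + raw
    return bytes(out)

def unpack_fields(data):
    fields, i = [], 0
    while i < len(data):
        (n,) = struct.unpack_from(">I", data, i)
        i += 4
        fields.append(data[i:i + n].decode("utf-8"))
        i += n
    return fields

weird = ["normal", "tabs\tand,commas", "even \x1f and \x1e", "emoji 🎑"]
assert unpack_fields(pack_fields(weird)) == weird  # no invalid inputs exist
```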
> For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators
Yes.
> even before having to add extra application logic to handle this method being fallible (e.g.: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
I don't see why you'd need that. CSV does not have anything like that.
That's a higher-level concern. This ASCII-delimited format (like CSV) is supposed to be a stupid row/column format. And also simpler to implement than CSV.
If you're using a fallible CSV dialect, your application does need to handle that case in some way (or in some cases it may be fine just to let it crash). Something like length-before-text is convenient because you don't have to worry about that case.
Yup exactly, it just pushes the problem around, without solving it.
The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse.
The title says ASCII Delimited Text not ASCII Delimited Binary Data.
For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc), and I normally strip chars in that category before saving some text (for single-line text). [\p{Cc}&&[^\n]] (unanchored) is a character class that can be used to strip all control chars except for the newline.
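In stdlib Python terms (the built-in re module lacks \p{Cc}; the third-party regex module does support it), the same stripping could look like this sketch:

```python
import unicodedata

def strip_control(text: str, keep: str = "") -> str:
    # Drop every char in Unicode category Cc, except chars listed in `keep`.
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

assert strip_control("a\x1fb\x1ec\nd") == "abcd"
assert strip_control("a\x1fb\nc", keep="\n") == "ab\nc"  # single-line-ish
```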
You can disallow those metacharacters in the data proper. Then you have a format that can store any UTF-8 or whatever, except the non-whitespace control codes, without any escaping. That solves a problem in an opinionated way. Just like how JSON is opinionated (UTF-8 only).
You can convert to another format if you need something crazier than rows and columns consisting of normal text.
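A sketch of that opinionated stance, assuming you reject rather than silently strip (the function name is mine):

```python
SEPARATORS = {"\x1f", "\x1e"}  # the reserved unit/record separators

def check_field(field: str) -> str:
    # Opinionated: the separators are simply not valid data.
    # Rejecting (rather than stripping) keeps every accepted field lossless.
    if SEPARATORS & set(field):
        raise ValueError("field contains a reserved separator character")
    return field

check_field("any other text, including \t tabs and \n newlines, is fine")
```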
Then I don't understand. Like the sibling comment said, the really "problematic" character in TSV is the line feed. But a tab can occur as well.
The format that I described does not already exist in the form of TSV. And further, based on your original comment, I would have thought that both TSV and this format would be discarded as not useful.