> with no restrictions on the text in fields or the need to try and escape characters.
Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?
It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
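To make the escaping concern concrete, here's a minimal sketch (the separator constants and names are my own) of how the naive split-on-separator approach breaks:

```python
# Minimal sketch: ASCII unit/record separators work only while the
# data is guaranteed not to contain them.
US = "\x1f"  # unit separator (between fields)
RS = "\x1e"  # record separator (between rows)

def write_rows(rows):
    return RS.join(US.join(fields) for fields in rows)

def read_rows(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["alice", "engineering"], ["bob", "sales"]]
assert read_rows(write_rows(rows)) == rows  # fine for "clean" data

# A field containing a separator silently corrupts the table:
bad = [["alice", "note\x1fwith a stray separator"]]
assert read_rows(write_rows(bad)) != bad  # one field became two
```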
See, this is why I once used moon-viewing-ceremony-separated-values (MVCSV). The Moon Viewing Ceremony emoji (🎑) was unlikely to show up in my dataset, and not only is the emoji visible, it's quite visually pleasing.
Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.
The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.
> but these specific characters are not text. they exist solely to be delimiters.
Even if people used them only for their intended purpose, someone could use them as delimiters within the text you want to store (e.g.: a list of tags in a filename) - unless I'm misunderstanding.
> It would be like trying to escape a column in your spreadsheet.
Other formats do allow escaping their delimiters, so that you can use that character literally or even nest a string of that format within an entry.
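For example, RFC 4180-style quoting lets a CSV field contain commas, quotes, newlines, or even a whole nested CSV document (illustrated here with Python's csv module):

```python
import csv
import io

# A field may contain the delimiter, quotes, or an entire nested CSV table.
nested = "a,b\r\nc,d"  # a whole CSV document stored as one field
buf = io.StringIO()
csv.writer(buf).writerow(["outer", nested, 'she said "hi"'])
# Fields with special characters get quoted; embedded quotes are doubled.

row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row[1] == nested  # the nested document round-trips losslessly
```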
> But there is never any reason to use these characters literally, they are just delimiters.
I can put the literal separator characters in my comment (maybe because I'm demonstrating the format, referencing the characters, or just being capricious), and now someone scraping and storing comments has a need to use those characters literally. Sometimes it's fine to ignore certain content or store it lossily, but often not.
yes, you would still need to clean your inputs before randomly adding them to your table. Your contrived example brings me back to my original assertion that as long as you’re ok with those characters not being valid data it works fine. So, sure, if someone really wanted to store those two literal non-visible characters in a text file, that would not work. Everyone else could just not do that.
> yes, you would still need to clean your inputs before randomly adding them to your table.
Lossy is fine in some cases, but in many cases you do actually need the specific text you're trying to store - not just something similar to it. Hence my objection to "never any reason to use these characters literally".
> Your contrived example [...] if someone really wanted to store those two literal non-visible characters in a text file
Needing to store these specific characters is rare, but needing to store arbitrary text (possibly from adversarial/mischievous parties, or just a large enough dataset that encountering all edge cases is inevitable) is common. For instance, for security reasons a log shouldn't break or have a blind spot for folders with those characters.
> as long as you’re ok with those characters not being valid data it works fine
Which is what I'm saying in my original comment with "or alternatively, a restriction for the stored text not to have them".
if you’re storing arbitrary text from untrusted sources you will always need to clean it first. Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc. It wouldn’t have negated the need for more advanced storage and serialization formats.
> if you’re storing arbitrary text from untrusted sources you will always need to clean it first
Reversible escaping of characters is pretty common (though not always needed; length-before-text formats don't require it). But "cleaning" in the sense of deleting characters, such that you can no longer get back the original string, is definitely not required for all formats, and is a fairly undesirable property.
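As a sketch of what reversible escaping could look like for the ASCII separators (the scheme here is illustrative, not any standard):

```python
US, RS, ESC = "\x1f", "\x1e", "\\"

def escape(field: str) -> str:
    # Escape the escape character first so decoding is unambiguous.
    return (field.replace(ESC, ESC + ESC)
                 .replace(US, ESC + "u")
                 .replace(RS, ESC + "r"))

def unescape(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == ESC:
            out.append({ESC: ESC, "u": US, "r": RS}[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

# Every string round-trips exactly - nothing gets deleted.
for s in ["plain", "has\x1fsep", "back\\slash", "\\u literal", "\x1e\x1f"]:
    assert unescape(escape(s)) == s
```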
> Plus in those instances you’ll probably want a db or json or whatever works well with your language anyway.
You'd want to use some format that doesn't have the problem this one has, yeah. IMO ASCII delimited text just isn't really anywhere on the Pareto front of formats you'd want to use - it's unpleasant to work with manually, and once you're writing the file through code or a tabular editor you may as well use a format that can handle arbitrary text.
> I guess I’m saying that if this had caught on early with ubiquitous support it could have saved us from the mess that is csv/tsv/etc
I think you could say the same of RFC 4180. In reality, I don't see why this wouldn't also spawn dialects, like people adding newlines between rows so they can open it in a text editor without it being in one huge long line, or inventing an escaping scheme so that it can handle arbitrary text.
I feel like this (and some of the replies to this) is missing the point a bit.
I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.
The goal was to solve the problem of delimited text formats not allowing things like commas, quotes, newlines, tabs, pipes, etc. in fields.
I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing a machine-creatable and machine-readable format (emphasis on machine as opposed to human).
Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).
But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.
If you need a machine-readable format, why not go with escaping like most other formats, or length-before-text, to include all characters - instead of a format that fails on some (albeit rare) characters?
Both of those are fine, but they add additional complexity (even if small), whereas there is very little, if any, complexity added by using these two characters as delimiters.
For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators - even before having to add extra application logic to handle this method being fallible (e.g.: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
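For comparison, here's what a length-before-text framing could look like (the wire format and names are my own invention): every byte value is legal in a field, so there is no fallible case to handle.

```python
import struct

# Each field is a 4-byte big-endian length followed by that many UTF-8 bytes.
def pack_fields(fields):
    out = bytearray()
    for f in fields:
        raw = f.encode("utf-8")
        out += struct.pack(">I", len(raw)) + raw
    return bytes(out)

def unpack_fields(data):
    fields, i = [], 0
    while i < len(data):
        (n,) = struct.unpack_from(">I", data, i)
        i += 4
        fields.append(data[i:i + n].decode("utf-8"))
        i += n
    return fields

weird = ["normal", "tabs\tand,commas", "even \x1f and \x1e", "emoji 🎑"]
assert unpack_fields(pack_fields(weird)) == weird  # no invalid inputs exist
```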
> For machine-readable formats, I'd argue length-before-text is simpler to parse than splitting by separators
Yes.
> even before having to add extra application logic to handle this method being fallible (e.g.: if you want to use it to store filenames, you now need a check and pop-up about some OS-valid filenames not being supported).
I don't see why you'd need that. CSV does not have anything like that.
That's a higher-level concern. This ASCII-delimited format (like CSV) is supposed to be a stupid row/column format. And also simpler to implement than CSV.
If you're using a fallible CSV dialect, your application does need to handle that case in some way (or in some cases it may be fine just to let it crash). Something like length-before-text is convenient because you don't have to worry about that case.
Yup exactly, it just pushes the problem around, without solving it.
The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse.
The title says ASCII Delimited Text not ASCII Delimited Binary Data.
For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc), and I normally strip chars in that category before saving some text (for single-line text). [\p{Cc}&&[^\n]] (unanchored) is a character class that can be used to strip all control chars except for the newline.
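In stdlib Python terms (the built-in re module lacks \p{Cc}; the third-party regex module does support it), the same stripping could look like this sketch:

```python
import unicodedata

def strip_control(text: str, keep: str = "") -> str:
    # Drop every char in Unicode category Cc, except chars listed in `keep`.
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

assert strip_control("a\x1fb\x1ec\nd") == "abcd"
assert strip_control("a\x1fb\nc", keep="\n") == "ab\nc"  # single-line-ish
```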
You can disallow those metacharacters in the data proper. Then you have a format that can store any UTF-8 or whatever, except the non-whitespace control codes, without any escaping. That solves a problem in an opinionated way. Just like how JSON is opinionated (UTF-8 only).
You can convert to another format if you need something crazier than rows and columns consisting of normal text.
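A sketch of that opinionated stance, assuming you reject rather than silently strip (the function name is mine):

```python
SEPARATORS = {"\x1f", "\x1e"}  # the reserved unit/record separators

def check_field(field: str) -> str:
    # Opinionated: the separators are simply not valid data.
    # Rejecting (rather than stripping) keeps every accepted field lossless.
    if SEPARATORS & set(field):
        raise ValueError("field contains a reserved separator character")
    return field

check_field("any other text, including \t tabs and \n newlines, is fine")
```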
Then I don't understand. Like the sibling comment said, the really "problematic" character in TSV is the line feed. But a tab can occur as well.
The format that I described does not already exist in the form of TSV. And further, based on your original comment, I would have thought that both TSV and this format would be discarded as not useful.