Do I understand correctly that Arc strings are sequences of octets?
If so: I really don't want to be a negativity guy but it seems like every language that has made an 8-bit string the default string type has regretted it later because it is so painful to change it without breaking code. Okay, Paul says that he won't mind breaking code. Maybe he means it, but it doesn't make any sense to me to knowingly and consciously repeat a design mistake that dozens of other people have made and regretted.
It really just takes one day to get this right. You need to distinguish between the raw bytes read from a device and the true string type (which needs to be at least 21 bits per character, enough to hold any Unicode code point). You need a trivial converter from one to the other (which you can presumably steal from MzScheme) and back.
That's it. You get this right at the beginning and you never have to backtrack or break code.
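For readers who want the byte/string split made concrete: here is a minimal sketch using Python 3, which adopted exactly this design (distinct `bytes` and `str` types with explicit converters). This is an illustration of the principle, not a description of how Arc or MzScheme actually implement it.

```python
# Octets as read from a device (file, socket, etc.) -- the raw byte type.
raw = b'caf\xc3\xa9'

# The "trivial converter": decode octets into the true string type,
# a sequence of Unicode code points.
text = raw.decode('utf-8')

assert text == 'café'
assert len(text) == 4          # four code points...
assert len(raw) == 5           # ...but five octets

# ...and the converter back, used when writing out.
assert text.encode('utf-8') == raw
```

Once the two types are distinct, the encoding question arises only at the I/O boundary; everything in between works on code points.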
My apologies in advance if this post is based on incorrect premises. I'm trying to help.
So should I infer that the only reason UTF-8 is mentioned is that the reader APIs do not let you select the codec? Or is even that provided in which case it is accurate to say that Arc supports Unicode-in-general?
Arc uses MzScheme's reader (it modifies the readtable slightly to support []-syntax). AFAIK you cannot access the reader API from inside Arc. The reason UTF-8 is mentioned is that it is the default encoding when MzScheme reads or writes files or streams.
I don't think anyone at this point would claim that Arc supports Unicode-in-general.
Could you offer a better solution? What would your solution offer that octets do not? Random character access? No, because no Unicode encoding offers easy random character access (user-perceived characters may be made of several code points, which, in some encodings, are themselves made of more than one code unit). Glyph, word and sentence segmentation? I guess not.
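The point about random access can be seen in any variable-width encoding. A small Python sketch (Python is used here purely for illustration; the same holds for any language exposing raw UTF-8 octets):

```python
s = 'naïve'
b = s.encode('utf-8')

assert len(s) == 5             # five code points
assert len(b) == 6             # six octets: 'ï' (U+00EF) takes two bytes

# Byte indexing lands in the middle of a character:
assert b[2] == 0xC3            # first byte of the two-byte sequence for 'ï'
assert b[2:4] == 'ï'.encode('utf-8')

# Code-point indexing, by contrast, is well defined:
assert s[2] == 'ï'
```

So octet indexing into UTF-8 is O(1) but meaningless, while character indexing requires a scan; no encoding gives you both cheaply.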
Abstraction. Having Unicode strings (i.e. strings that are a sequence of Unicode code points rather than octets) allows you to work with strings without worrying about encoding (except when doing IO, which is where encoding matters).
If you treat strings as octets, OTOH, even simple operations like concatenating two strings can turn into a headache if the strings are in two different encodings. And how do you keep track of the encodings of individual strings? Madness lies down that road.
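The concatenation headache can be made concrete with a small sketch (again in Python, purely as an illustration of the octets-vs-code-points distinction):

```python
# The same character, 'é', as octets in two different encodings:
latin1_bytes = 'é'.encode('latin-1')   # b'\xe9'
utf8_bytes = 'é'.encode('utf-8')       # b'\xc3\xa9'

# Naive octet concatenation yields a byte sequence that is valid
# in neither encoding -- the madness referred to above:
garbage = latin1_bytes + utf8_bytes    # b'\xe9\xc3\xa9'

# Decoding each string at the boundary makes concatenation trivial,
# because both sides become plain sequences of code points:
merged = latin1_bytes.decode('latin-1') + utf8_bytes.decode('utf-8')
assert merged == 'éé'
```

With a real string type, the encoding of each source only matters at the moment of decoding; after that, concatenation cannot go wrong.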