Do I understand correctly that Arc strings are sequences of octets?
If so: I really don't want to be a negativity guy but it seems like every language that has made an 8-bit string the default string type has regretted it later because it is so painful to change it without breaking code. Okay, Paul says that he won't mind breaking code. Maybe he means it, but it doesn't make any sense to me to knowingly and consciously repeat a design mistake that dozens of other people have made and regretted.
It really just takes one day to get this right. You need to distinguish between the raw bytes read from a device and the true string type (which needs to be at least 21 bits per character, enough to hold any Unicode code point). You need a trivial converter from one to the other (which you can presumably steal from MzScheme) and back.
That's it. You get this right at the beginning and you never have to backtrack or break code.
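For readers who want the byte/string split made concrete: here is a minimal sketch using Python 3, which adopted exactly this design (distinct `bytes` and `str` types with explicit converters). This is an illustration of the principle, not a description of how Arc or MzScheme actually implement it.

```python
# Octets as read from a device (file, socket, etc.) -- the raw byte type.
raw = b'caf\xc3\xa9'

# The "trivial converter": decode octets into the true string type,
# a sequence of Unicode code points.
text = raw.decode('utf-8')

assert text == 'café'
assert len(text) == 4          # four code points...
assert len(raw) == 5           # ...but five octets

# ...and the converter back, used when writing out.
assert text.encode('utf-8') == raw
```

Once the two types are distinct, the encoding question arises only at the I/O boundary; everything in between works on code points.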
My apologies in advance if this post is based on incorrect premises. I'm trying to help.
So should I infer that the only reason UTF-8 is mentioned is that the reader APIs do not let you select the codec? Or is even that provided in which case it is accurate to say that Arc supports Unicode-in-general?
Arc uses MzScheme's reader (it modifies the readtable slightly to support []-syntax). AFAIK you cannot access the reader API from inside Arc. The reason UTF-8 is mentioned is that it is the default encoding when MzScheme reads or writes files or streams.
I don't think anyone at this point would claim that Arc supports Unicode-in-general.
Could you offer a better solution? What would your solution offer that octets do not? Random character access? No, because no Unicode encoding offers easy random character access (user-perceived characters may be made of several code points, which, in some encodings, are themselves made of more than one code unit). Glyph, word and sentence segmentation? I guess not.
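The point about random access can be seen in any variable-width encoding. A small Python sketch (Python is used here purely for illustration; the same holds for any language exposing raw UTF-8 octets):

```python
s = 'naïve'
b = s.encode('utf-8')

assert len(s) == 5             # five code points
assert len(b) == 6             # six octets: 'ï' (U+00EF) takes two bytes

# Byte indexing lands in the middle of a character:
assert b[2] == 0xC3            # first byte of the two-byte sequence for 'ï'
assert b[2:4] == 'ï'.encode('utf-8')

# Code-point indexing, by contrast, is well defined:
assert s[2] == 'ï'
```

So octet indexing into UTF-8 is O(1) but meaningless, while character indexing requires a scan; no encoding gives you both cheaply.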
Abstraction. Having Unicode strings (i.e. strings that are a sequence of Unicode code points rather than octets) allows you to work with strings without worrying about encoding (except when doing IO, which is where encoding matters).
If you treat strings as octets, OTOH, even simple operations like concatenating two strings can turn into a headache if the strings are in two different encodings. And how do you keep track of the encodings of individual strings? Madness lies down that road.
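The concatenation headache can be made concrete with a small sketch (again in Python, purely as an illustration of the octets-vs-code-points distinction):

```python
# The same character, 'é', as octets in two different encodings:
latin1_bytes = 'é'.encode('latin-1')   # b'\xe9'
utf8_bytes = 'é'.encode('utf-8')       # b'\xc3\xa9'

# Naive octet concatenation yields a byte sequence that is valid
# in neither encoding -- the madness referred to above:
garbage = latin1_bytes + utf8_bytes    # b'\xe9\xc3\xa9'

# Decoding each string at the boundary makes concatenation trivial,
# because both sides become plain sequences of code points:
merged = latin1_bytes.decode('latin-1') + utf8_bytes.decode('utf-8')
assert merged == 'éé'
```

With a real string type, the encoding of each source only matters at the moment of decoding; after that, concatenation cannot go wrong.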