I’d argue that any string comparison which does not take into account collation is inherently broken. Even in the pure ASCII English-language case, a naïve comparison on values won’t give desirable results since abc123 will come before abc99 even though a reader would expect otherwise. Just because we’ve tolerated crappy string sorting for sixty years doesn’t mean we should continue to do so.
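To make the abc123 / abc99 point concrete, here is a minimal Go sketch of a "natural" comparison that treats runs of ASCII digits as numbers. The names `naturalLess` and `isDigit` are made up for this example, and it is nowhere near real collation (no locale, no Unicode awareness); it only shows how far plain byte ordering is from what a reader expects.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isDigit reports whether b is an ASCII digit.
func isDigit(b byte) bool { return b >= '0' && b <= '9' }

// naturalLess (hypothetical helper) reports whether a sorts before b when
// runs of ASCII digits are compared as numbers and all other bytes are
// compared one by one. Strings that differ only in leading zeros of a
// digit run ("a01" vs "a1") are treated as equal.
func naturalLess(a, b string) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if isDigit(a[i]) && isDigit(b[j]) {
			// Find the end of each digit run.
			si, sj := i, j
			for i < len(a) && isDigit(a[i]) {
				i++
			}
			for j < len(b) && isDigit(b[j]) {
				j++
			}
			// Without leading zeros, a longer run is a larger number,
			// and equal-length runs compare correctly as strings.
			na := strings.TrimLeft(a[si:i], "0")
			nb := strings.TrimLeft(b[sj:j], "0")
			if len(na) != len(nb) {
				return len(na) < len(nb)
			}
			if na != nb {
				return na < nb
			}
			continue
		}
		if a[i] != b[j] {
			return a[i] < b[j]
		}
		i++
		j++
	}
	// One string is a prefix of the other: the shorter one sorts first.
	return len(a)-i < len(b)-j
}

func main() {
	names := []string{"abc123", "abc99", "abc9"}

	sort.Strings(names) // plain byte-wise ordering
	fmt.Println(names)  // [abc123 abc9 abc99]

	sort.Slice(names, func(i, j int) bool { return naturalLess(names[i], names[j]) })
	fmt.Println(names) // [abc9 abc99 abc123]
}
```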
For some applications (e.g. sticking stuff in an ordered data structure), you just need some consistent ordering and don't care much about exactly which one.
I've been developing my own version string parser for a couple weeks now, in golang.
It's ridiculous to what lengths you have to go to understand which part of a string comes earlier or later.
Simple example: semantic versioning allows "1.2.3alpha" and also "1.2.3-beta", but which one comes first now...
- Is 1.2.3 > 1.2.3omega?
- Is 1.2.3 > 1.2.3beta?
- Is 1.2.3gamma > 1.2.3?
In the Linux world it gets even funnier because they invented SONAME fields that reflect breaking API changes instead of forcing packages to comply with semantic versioning syntax. Oftentimes there is a package version of e.g. 0.4.7 that has an SONAME of 12.7 on the filesystem.
Add to that the ~prerelease suffix syntax in Debian-based distros, which are maintained downstream, and all the +buildid or .commithash or -revision123 suffixes, and you've landed in string comparison hell.
When I started I would have never guessed that this is such a complex problem to solve in golang.
Perhaps I missed it, but I thought [Semantic Versioning](https://semver.org/) required a "-" between the patch number and a prerelease identifier since at least version 1.0.0 (with 1.0.0-beta allowing a "." instead of a "-"), no?
Yes, version string comparison is hard because people have all sorts of unstandardized ideas about version strings. Not sure why you seem to believe there’s golang-specific difficulty here.
> semantic versioning allows "1.2.3alpha" and also "1.2.3-beta"
No, according to the spec, the hyphen is mandatory: "A pre-release version MAY be denoted by appending a hyphen and a series of dot separated identifiers immediately following the patch version."
The wording is ambiguous, but the BNF later in the spec [1] agrees with your interpretation. A valid version is three numbers separated by dots, optionally followed by a hyphen and dot-separated pre-release identifiers, and then optionally by a plus sign and dot-separated build identifiers.
No, that is a misreading. The "MAY" indicates that the prerelease identifier itself is optional. However, if you do append one, it must include a leading hyphen.
1.2.3gamma is not a pre-release version, it is a malformed version string (assuming SemVer). A proper SemVer is something like [0-9]+[.][0-9]+[.][0-9]+(-[0-9a-zA-Z]+)?([+][0-9a-zA-Z]+)?
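For what it's worth, a slightly stricter check can be sketched in Go as below. This is a simplified pattern written for this comment, not the regex published on semver.org: it allows dots and hyphens inside the pre-release and build parts and rejects leading zeros in X.Y.Z, but it does not enforce every rule (for example, it does not reject leading zeros in numeric pre-release identifiers). The name `looksLikeSemVer` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// semverRE is a simplified approximation of the SemVer grammar:
// X.Y.Z without leading zeros, then an optional "-pre.release" part
// and an optional "+build.metadata" part.
var semverRE = regexp.MustCompile(
	`^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)` +
		`(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?` +
		`(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$`)

// looksLikeSemVer (hypothetical helper) reports whether s matches the
// simplified pattern above.
func looksLikeSemVer(s string) bool {
	return semverRE.MatchString(s)
}

func main() {
	for _, s := range []string{"1.2.3", "1.2.3-beta", "1.2.3-beta.1+build.5", "1.2.3gamma", "01.2.3"} {
		fmt.Printf("%-22s %v\n", s, looksLikeSemVer(s))
	}
}
```

With this pattern, "1.2.3gamma" and "01.2.3" are rejected while the hyphenated forms are accepted.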
> 2. A normal version number MUST take the form X.Y.Z where X, Y, and Z are non-negative integers, and MUST NOT contain leading zeroes. X is the major version, Y is the minor version, and Z is the patch version.
I'd use capture regex to get the first three numerals and capture the remaining string. If the remaining string exists, you can easily ignore the expected leading dash and your malformed semver suffixes will work and those can be trivially compared/sorted.
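A rough Go sketch of that idea, under some assumptions that are mine rather than the spec's: the type `looseVersion` and the functions `parseLoose` and `compareLoose` are made up for this example, a version with no suffix is taken to sort after one with a suffix (as SemVer does for pre-releases), and suffixes are compared as plain strings. It also lumps any "+build" part into the suffix, which real SemVer precedence would ignore.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// versionRE captures three dot-separated numbers and whatever follows them.
var versionRE = regexp.MustCompile(`^([0-9]+)\.([0-9]+)\.([0-9]+)(.*)$`)

// looseVersion is a hypothetical type for this sketch.
type looseVersion struct {
	nums   [3]int
	suffix string // remainder, with one leading '-' removed if present
}

// parseLoose splits s into its three numbers and a free-form suffix.
func parseLoose(s string) (looseVersion, error) {
	m := versionRE.FindStringSubmatch(s)
	if m == nil {
		return looseVersion{}, fmt.Errorf("not a recognisable version: %q", s)
	}
	var v looseVersion
	for i := 0; i < 3; i++ {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return looseVersion{}, err
		}
		v.nums[i] = n
	}
	v.suffix = strings.TrimPrefix(m[4], "-")
	return v, nil
}

// compareLoose returns -1, 0, or 1. Numbers are compared first.
// Assumption: an empty suffix (a release) sorts after any non-empty
// suffix; two non-empty suffixes are compared byte-wise.
func compareLoose(a, b looseVersion) int {
	for i := range a.nums {
		if a.nums[i] != b.nums[i] {
			if a.nums[i] < b.nums[i] {
				return -1
			}
			return 1
		}
	}
	switch {
	case a.suffix == b.suffix:
		return 0
	case a.suffix == "":
		return 1
	case b.suffix == "":
		return -1
	case a.suffix < b.suffix:
		return -1
	default:
		return 1
	}
}

func main() {
	a, _ := parseLoose("1.2.3alpha")
	b, _ := parseLoose("1.2.3-beta")
	c, _ := parseLoose("1.2.3")
	fmt.Println(compareLoose(a, b), compareLoose(b, c), compareLoose(c, a)) // -1 -1 1
}
```

Whether "1.2.3" should really sort after "1.2.3gamma" is exactly the policy question raised upthread; the sketch just picks one answer and makes it explicit.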
What about Go makes this different? That's how I'd solve this in any language.
Agreed - especially when it is potentially unknown what might follow the first three numerals. Any performance hit would be mitigated by the corresponding reduction of the downstream logic.
Because with (Unicode) strings, "\u006e\u0303" is defined to be canonically equivalent to "\u00f1", for example. If you do a bytewise comparison, as the above comment suggested, you may not reach the same result ¯\_(ツ)_/¯
Whether those two strings are or are not equivalent depends on the context. If we're assuming (as the GP did) a very generic context where we simply want to store arbitrary strings in a sorted data structure, then there is no reason to assume they are supposed to be interpreted as Unicode.
For a simple example, suppose you are storing strings in a TreeMap for efficient retrieval, and the list holds strings that still need Unicode normalization before they can be properly interpreted as human text. When you add "\u00f1" to the list, you wouldn't want the collection to say that it's already there just because it already contains "\u006e\u0303".