
Quick link to what I think is the most interesting class in the CLR:

https://github.com/dotnet/coreclr/blob/master/src/mscorlib/s...




Since I wrote a string hash code function recently[1], I was interested to see what they use. Like many, they're using[2] the venerable djb hash. That sent me down a rabbit hole where I discovered that Bernstein was 19 when he wrote about it. Wow.

[1]: https://github.com/munificent/wren/blob/master/src/wren_valu...

[2]: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/s...
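
For anyone who hasn't seen it, here is a minimal sketch of the textbook djb2 string hash (start at 5381, multiply by 33, add the next byte). This is the commonly published form, not the CLR's code: the linked source is elided above, and the function name and the 32-bit mask are my own assumptions.

```python
def djb2(text: str) -> int:
    """Textbook djb2: h = h * 33 + byte, truncated to 32 bits."""
    h = 5381
    for byte in text.encode("utf-8"):
        # The mask emulates 32-bit unsigned overflow (an assumption of this sketch).
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


print(djb2(""))   # the empty string hashes to the initial value
print(djb2("a"))
```

The multiply-by-33 step is often written as `(h << 5) + h`, which is the same arithmetic.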


From their GetHashCode():

    // We want to ensure we can change our hash function daily.
    // This is perfectly fine as long as you don't persist the
    // value from GetHashCode to disk or count on String A
    // hashing before string B. Those are bugs in your code.
    hash1 ^= ThisAssembly.DailyBuildNumber;

I'd love to hear the story behind this one :D


I don't know the story, but the logic behind it is simple: If you want to guarantee no one depends on GetHashCode staying static between runs of an application, change it all the time.


But now I can circumvent this hack by xoring with ThisAssembly.DailyBuildNumber again. :)
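
A quick sketch of why that works: XOR with a fixed value is its own inverse. The numbers below are made up for illustration; they are not a real hash or a real build number.

```python
raw_hash = 0x1234ABCD      # hypothetical unmasked hash value
build_number = 20150203    # hypothetical stand-in for ThisAssembly.DailyBuildNumber

masked = raw_hash ^ build_number    # what a debug build would hand out
recovered = masked ^ build_number   # xoring with the same number undoes it

print(masked != raw_hash, recovered == raw_hash)
```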


I think the story is even simpler than that as the code in question is prefaced with: #if DEBUG

The shipped product doesn't include this "randomness".


Also, isn't it there to protect against hash collisions, which can hurt the runtime of many algorithms and enable a DoS?


No, it has no effect on hash collisions. Two hashes that are equal before being xored with the daily build number will still be equal afterwards.
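
To make that concrete, here is a small sketch with made-up hash values and a made-up build number. XOR with a constant maps equal inputs to equal outputs and distinct inputs to distinct outputs, so the number of collisions cannot change.

```python
hashes = [7, 7, 42, 1000, 42]   # hypothetical hash values containing two collisions
build_number = 20150203         # hypothetical build number

masked = [h ^ build_number for h in hashes]

# Count collisions as "values beyond the first occurrence of each hash".
collisions_before = len(hashes) - len(set(hashes))
collisions_after = len(masked) - len(set(masked))

print(collisions_before, collisions_after)
```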


It won't help much there as the number is static for a particular build.


The hash randomization for security is above this part of the code.


IIRC, a few years ago a denial-of-service attack appeared, probably originally for Phyton, but it was soon ported to other languages.

The idea is that the hash is good enough for normal use, but it's not a cryptographic hash and it's easy to find collisions. Then you can make a lot of requests with strings that have the same hash value. Now the hash operations are O(N) instead of O(~1) and everything is slower.

Using an unpredictable hash calculation makes this attack more difficult.
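
Here is a rough sketch of the effect, under my own assumptions: a toy chained hash table, a deliberately weak hash where anagrams collide, and keyed BLAKE2b standing in for "an unpredictable hash calculation". None of this is how any particular runtime implements it.

```python
import hashlib
import itertools
import os


class ChainedTable:
    """Toy hash set with separate chaining that counts key comparisons."""

    def __init__(self, hash_fn, bucket_count=64):
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(bucket_count)]
        self.comparisons = 0

    def insert(self, key):
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for existing in bucket:
            self.comparisons += 1
            if existing == key:
                return
        bucket.append(key)


def weak_hash(key):
    # Deliberately weak and predictable: anagrams always collide.
    return sum(ord(ch) for ch in key)


SECRET = os.urandom(16)  # per-process secret the attacker cannot see


def keyed_hash(key):
    # Keyed BLAKE2b stands in for any hash mixed with an unpredictable seed.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, key=SECRET).digest()
    return int.from_bytes(digest, "big")


# 720 attacker-chosen keys that all share a single weak_hash value.
attack_keys = ["".join(p) for p in itertools.permutations("abcdef")]

flooded = ChainedTable(weak_hash)
protected = ChainedTable(keyed_hash)
for key in attack_keys:
    flooded.insert(key)
    protected.insert(key)

print(flooded.comparisons, protected.comparisons)
```

With the weak hash every key lands in the same bucket, so each insert walks the whole chain and the total work grows quadratically. With the keyed hash the attacker can't precompute which keys will collide, so the same keys spread across the buckets.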


Your typo can be easily fixed by a Python one-liner:

    >>> import math
    >>> (lambda w:w[2:]+w[:2])(''.join(sorted("Phyton",key=lambda c:math.sin(ord(c)^50))))
    'Python'


It's in a #if DEBUG statement so it would not change. Historically, even if they shipped the debug symbols, the assembly would have been built in release. Now, I suppose you could build it in debug.


The Reference Source site is much easier to navigate. Method names are hyperlinks:

http://referencesource.microsoft.com/#mscorlib/system/string...


Admittedly, I've never been big in C#, but I'm AMAZED by how many repeated string literals there are.

E.g.: Environment.GetResourceString("ArgumentOutOfRange_Index")

That string appears in multiple places in that class, and the same pattern holds for the others. Wouldn't logic suggest moving everything like the above into a constant repository, for clarity and to reduce the potential for human error in future additions?


It's pretty common in C# to use Constants to represent strings. Cuts down on the repetition as you said and you get the Intellisense too. I'd be curious to know the reasoning for using the literals as well.


And for comparison:

http://grepcode.com/file/repository.grepcode.com/java/root/j...

It's sort of surprising to me how much huger the .NET version is, in terms of code. Virtually all the lines in the Java version are API docs. The .NET version doesn't seem to have them (they must be elsewhere?) but it does have a lot more code and that code is much lower level.

Not sure what that means, if anything, but it's interesting.


That does answer an unanswered question I had on SO about string hashing. If strings are immutable, why isn't the hash code memoized? Seems like it would make HashSet/Dictionary lookups using string keys much faster.


Presumably because it's not worth the memory hit to store the hash.


That was my assumption, too. They do memoize the length, but I'm sure those bytes add up, having run into OutOfMemoryExceptions building huge amounts of strings before.
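
For what the trade-off looks like, here is a minimal sketch of a memoized hash on an immutable string wrapper. The class, the counter, and the djb2-style hash are all hypothetical and only there to show the idea: one extra field per instance buys you a single hash computation no matter how many lookups you do.

```python
class MemoizedString:
    """Immutable string wrapper that computes its hash once and caches it."""

    __slots__ = ("_value", "_cached_hash")
    hash_computations = 0  # class-level counter, only for this demonstration

    def __init__(self, value):
        self._value = value
        self._cached_hash = None  # the extra per-instance field being paid for

    def __hash__(self):
        if self._cached_hash is None:
            MemoizedString.hash_computations += 1
            h = 5381
            for ch in self._value:
                h = (h * 33 + ord(ch)) & 0xFFFFFFFF
            self._cached_hash = h
        return self._cached_hash

    def __eq__(self, other):
        return isinstance(other, MemoizedString) and self._value == other._value


key = MemoizedString("hello")
table = {key: 1}          # first hash: computed and cached
found = table[key]        # later lookups reuse the cached value
present = key in table

print(MemoizedString.hash_computations)
```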


Strings know their length in the CLR because they are represented as BSTRs

http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strin...

This lets them interoperate with OLE Automation.


You have to store the length, since C# strings can contain '\0' (although the hashing code doesn't take this into account!)
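
A small sketch of what that implies, assuming a hash loop that stops at the first '\0'. Whether the CLR code really behaves that way is the claim above, not something this verifies; the helper names are made up.

```python
def length_until_nul(text):
    """Length as a NUL-terminated scan would report it."""
    idx = text.find("\0")
    return len(text) if idx == -1 else idx


def hash_until_nul(text):
    """djb2-style hash that (hypothetically) stops at the first NUL."""
    h = 5381
    for ch in text:
        if ch == "\0":
            break
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


first = "ab\0cd"
second = "ab\0xy"

print(len(first), length_until_nul(first))
print(first != second, hash_until_nul(first) == hash_until_nul(second))
```

A stored length sees all five characters, while a terminator-based scan sees two. A hash that stops at the terminator gives every string with the same prefix before the '\0' the same value.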



