Bingo. All my experience is on Linux, and I've never written anything for Windows. So recently, when I needed to port a small C program to Windows, I told ChatGPT "Hey, port this to Windows for me". I wouldn't trust the result (I'd rewrite it myself), but it let me find out which Win32 API functions I'd be calling, and why I'd be calling them, faster than searching MSDN would have.
The only AI-assisted software work I've seen actually have a benefit is the way my coworker uses Supermaven, where it's basically Intellisense, but it suggests filling in the function parameters for you as well. He'll type `MergeEx` and it will suggest not just `MergeExample(` as Intellisense would have done, but `MergeExample(oldExample, newExample, mergeOptions)`, based on which variable names in scope line up with the parameter types. Then he presses Tab and moves on, saving 10-15 seconds of typing. Repeat that multiple times through the day and it might be a 10% improvement, with no time lost fiddling with prompts to get the AI to correct its mistakes. (Here, if the suggestion is wrong, he just ignores it and keeps typing; the second he types a character that wasn't the next one in the suggestion, it goes away and a new suggestion might be calculated. The cognitive load of ignoring an incorrect suggestion is minimal.)
Rephrase it as "YouTube doesn't care about you, just about putting ads in front of your face" and it's not a contradiction. As long as you don't get irritated enough to go away and stop using YouTube entirely, they don't care about improving your viewing experience.
Another way to phrase it is the classic line "If you're not paying for it, you aren't the customer, you're the product."
Upvoted despite your final sentence being incorrect. :-) You're absolutely right that React is miles better than Angular, but Svelte and Vue (which feel very similar to each other; I just switched from a project written in Svelte to one written in Vue and a lot of my knowledge is carrying over) are quite a lot easier than React. When I write in React I have to think about the hooks and when I'm initializing them; when I write in Svelte or Vue the $state/ref() systems just work and I don't have to think about them. I can even initialize a $state inside an if block if I need to. I admit I'm no React expert, though, so let me ask you: if you needed to create a piece of state inside an if block in React, how would you do it? Is the only answer "Move the hook call outside the if block, and just don't use it if the if block doesn't run"?
We'd probably have to talk about it IRL; the question you're asking kind of implies you're coming at the whole thing from the wrong direction, and so any answer I give would be unsatisfying unless I could somehow transfer the entire philosophy in my head to you.
The short answer is that what you're asking doesn't really make sense to me. I think of the lifetime of state as analogous to member variables of a class. In the same way that member variables in a class are always visible to the entire class, a state variable in React is always visible to the entire component. You wouldn't want it to be scoped to an if block any more than you'd want a member variable to be scoped to an if block.
Maybe now you're thinking "well, that's limited in an annoying way, and Svelte is better". Maybe. But I suspect that for any problem you think needs conditionally-scoped state, there's a nice clean solution in React; I very rarely run into problems in React that I can't express in nice, clean code.
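For what it's worth, here's a minimal sketch of the usual React answer to the question above (my own hypothetical component and prop names, not anything from this thread): the useState call is made unconditionally at the top of the component, and the condition decides how the state gets used or rendered, not whether it exists.

    import { useState } from "react";

    // Hypothetical component: the hook is always called, so the hook order
    // stays stable across renders; the if only controls what gets rendered.
    function Details({ showDetails }: { showDetails: boolean }) {
      const [expanded, setExpanded] = useState(false);

      if (!showDetails) {
        return <p>Nothing to show.</p>;
      }
      return (
        <button onClick={() => setExpanded((e) => !e)}>
          {expanded ? "Hide" : "Show"} details
        </button>
      );
    }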
That someone can initialize state in an if block is not a good thing. React won against Angular 1.0 because noobs abused two-way binding, making a fking mess everywhere. Now in React they abuse useEffect, but it's a bit easier to control. I currently work in Svelte; I never use two-way binding and am careful to package state management well, but I like it. It's similar to React with MobX, but more performant, although it has no good component libraries. SvelteKit is also generally fine.
Which is why I've seen lots of people recommend testing your software with emojis, particularly recently-added emojis (many of the earlier emojis were in the basic multilingual plane, but a lot of newer emojis are outside the BMP, i.e. the "astral" planes). It's particularly fun to use the (U+1F4A9) emoji for such testing, because of what it implies about the libraries that can't handle it correctly.
EDIT: Heh. The U+1F4A9 emoji that I included in my comment was stripped out. For those who don't recognize that codepoint on sight (can't "see" the Matrix just from its code yet?), that emoji's official name is U+1F4A9 PILE OF POO.
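A quick sketch of why astral-plane emoji make such good test input (my own example, in TypeScript): U+1F4A9 is outside the BMP, so in UTF-16-based string types it occupies two code units, and naive length or indexing code gets it wrong.

    const poo = "\u{1F4A9}";
    console.log(poo.length);                        // 2: UTF-16 code units, not characters
    console.log([...poo].length);                   // 1: iterating by code point
    console.log(poo.codePointAt(0)!.toString(16));  // "1f4a9"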
Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.
That is indeed why they limited it, but that was a mistake. I want to call UTF-16 a mistake all on its own, but since the 16-bit encoding it grew out of (UCS-2) predated UTF-8, I can't entirely do so. But limiting the Unicode range to only what's allowed in UTF-16 was shortsighted. They should, instead, have allowed UTF-8 to continue to address 31 bits, and if the standard grew past 21 bits, then UTF-16 would be deprecated. (Going into depth would take an essay, and at this point nobody cares about hearing it, so I'll refrain).
Interestingly, in theory UTF-8 could be extended to 36 bits: the FLAC format uses an encoding similar to UTF-8 but extended to allow up to 36 bits (which takes seven bytes) to encode frame numbers: https://www.ietf.org/rfc/rfc9639.html#section-9.1.5
This means that frame numbers in a FLAC file can go up to 2^36-1, so a FLAC file can have up to 68,719,476,735 frames. If it was recorded at a 48kHz sample rate, there are 48,000 frames per second, meaning a FLAC file at 48kHz can (in theory) be about 1.43 million seconds long, or roughly 16.6 days long.
So if Unicode ever needs to encode 68.7 billion characters, well, extended seven-byte UTF-8 will be ready and waiting. :-D
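If you're curious what that extended encoding looks like in code, here's a rough sketch (my own, in TypeScript; the function name is made up, and the byte layout follows the original unrestricted UTF-8 pattern extended to a seventh byte, as RFC 9639 describes):

    // Encode a non-negative integer below 2^36 as a UTF-8-style coded number:
    // 1-byte values are 0xxxxxxx; longer values get a lead byte with n one-bits
    // then a zero, followed by n-1 continuation bytes of the form 10xxxxxx.
    // For n = 7 the lead byte is 0xFE and carries no payload bits.
    function encodeCodedNumber(value: number): Uint8Array {
      const limits = [2 ** 7, 2 ** 11, 2 ** 16, 2 ** 21, 2 ** 26, 2 ** 31, 2 ** 36];
      const n = limits.findIndex((limit) => value < limit) + 1;
      if (n === 0) throw new RangeError("value does not fit in 36 bits");
      if (n === 1) return Uint8Array.of(value);

      const bytes = new Uint8Array(n);
      // Continuation bytes take 6 bits each, lowest bits last.
      for (let i = n - 1; i >= 1; i--) {
        bytes[i] = 0x80 | value % 64;
        value = Math.floor(value / 64); // avoid 32-bit bitwise operators on large values
      }
      bytes[0] = ((0xff << (8 - n)) & 0xff) | value; // lead byte: n ones, a zero, high bits
      return bytes;
    }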
The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copied and pasted from an issue report I filed just 3 days ago):
bash: line 1: #!/bin/bash: No such file or directory
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of the shell script starting with the `#!` character pair that tells the Linux kernel "the application after the `#!` is the one that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash ended up running it anyway, with only that one error message (the line didn't start with `#`, so it wasn't treated as a Bash comment), which didn't matter to the actual script logic.
This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file containing only codepoints at or below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).
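As a tiny illustration of what the kernel actually sees (my own sketch in TypeScript, not part of the original report): the BOM's three bytes land in front of the `#!`, so the first two bytes of the file are no longer 23 21.

    const script = "\uFEFF#!/bin/bash\necho hello\n";
    const bytes = new TextEncoder().encode(script); // TextEncoder always emits UTF-8
    console.log([...bytes.slice(0, 5)].map((b) => b.toString(16)));
    // ["ef", "bb", "bf", "23", "21"]  -- the kernel checks bytes 0-1 for "#!" and fails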
These days, every application should assume UTF-8 if it isn't told the format of a file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be reasonable.
But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
I like your answer, and the others too, but I suspect I have an even worse problem than running Windows: I am an Amiga user :D
The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.
And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.
Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
"... would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?"
What that question means is that the Unicode-unaware apps would have to become Unicode-aware, i.e. be rewritten. And that would entirely defeat the purpose of backwards-compatibility with ASCII, which is the fact that you don't have to rewrite 30-year-old apps.
With UTF-16, the byte-order mark is necessary so that you can tell whether uppercase A will be encoded 00 41 or 41 00. With UTF-8, uppercase A will always be encoded 41 (hex, or 65 decimal), so the byte-order mark serves no purpose except to signal "this is a UTF-8 file". In an environment where ISO-8859-1 was ubiquitous, such as the Web fifteen years ago, the signal "hey, this is a UTF-8 file, not ISO-8859-1" was useful, and its drawbacks (the BOM messing up certain ASCII-era software, which read it as a real character, or three characters, and gave a syntax error) cost less than the benefits. But now that more than 99% of files you'll encounter on the Web are UTF-8, that signal is useful less than 1% of the time, so the costs of the BOM now outweigh the benefits (in fact, by now they far outweigh them).
As you can see from the paragraph above, you're not reading me quite right when you say that I "seem to be saying that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important". Compatibility with 8-bit text encodings WAS important, precisely because they were ubiquitous. It IS no longer important in a Web context, for two reasons. First, because those encodings account for less than 1% of documents, and in the contexts where they do appear, there are ways (like the charset parameter of the HTTP Content-Type header, or HTML charset meta tags) to inform parsers of what the encoding is. And second, because UTF-8 is stricter than those other character sets and thus should be tried first.
Let me explain that last point, because it's important in a context like Amiga, where (as I understand you to be saying) ISO-8859-1 documents are still prevalent. If you have a document that is actually UTF-8, but you read it as ISO-8859-1, it is 100% guaranteed to parse without the parser throwing any "this encoding is not valid" errors, BUT there will be mistakes. For example, å will show up as Ã¥ instead of the å it should have been, because å (U+00E5) encodes in UTF-8 as 0xC3 0xA5. In ISO-8859-1, 0xC3 is Ã and 0xA5 is ¥. Or ç (U+00E7), which encodes in UTF-8 as 0xC3 0xA7, will show up in ISO-8859-1 as Ã§ because 0xA7 is §.
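You can reproduce that mojibake in a couple of lines (a sketch of my own; note that the WHATWG Encoding spec maps the label "iso-8859-1" to windows-1252, which agrees with ISO-8859-1 for these byte values):

    const utf8Bytes = new TextEncoder().encode("å");     // Uint8Array [0xC3, 0xA5]
    const misread = new TextDecoder("iso-8859-1").decode(utf8Bytes);
    console.log(misread);                                // "Ã¥"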
(As an aside, I've seen a lot of UTF-8 files incorrectly parsed as Latin-1 / ISO-8859-1 in my career. By now, if I see Ã followed by at least one other accented Latin letter, I immediately reach for my "decode this as Latin-1 and re-encode it as UTF-8" Python script without any further investigation of the file, because that Ã, 0xC3, is such a huge clue. It's already rare in European languages, and the chances of it being followed by ¥ or § or indeed any other accented character in any real legacy document are so vanishingly small as to be nearly non-existent. This comment, where I'm explicitly citing it as an example of misparsing, is actually the only kind of document where I would ever expect to see the sequence Ã§ as being what the author actually intended to write).
Okay, so we've established that a file that is really UTF-8, but gets incorrectly parsed as ISO-8859-1, will NOT cause the parser to throw out any error messages, but WILL produce incorrect results. But what about the other way around? What about a file that's really ISO-8859-1, but that you incorrectly try to parse as UTF-8? Well, NEARLY all of the time, the ISO-8859-1 accented characters found in that file will NOT form a correct UTF-8 sequence. In 99.99% (and I'm guessing you could end up with two or three more nines in there) of actual ISO-8859-1 files designed for human communication (as opposed to files deliberately designed to be misparsed), you won't end up with a combination of accented Latin characters that just happen to match a valid UTF-8 sequence, and it's basically impossible for ALL the accents in an ISO-8859-1 document to just so happen to be valid UTF-8 sequences. In theory it could happen, but your chances of being struck by a 10-kg meteorite while sitting at your computer are better than of that happening by chance. (Again, I'm excluding documents deliberately designed with malice aforethought, because that's not the main scenario here). Which means that if you parse that unknown file as UTF-8 and it wasn't UTF-8, your parser will throw out an error message.
So when you encounter an unknown file, that has a 90% chance of being ISO-8859-1 and a 10% chance of being UTF-8, you might think "Then I should try parsing it in ISO-8859-1 first, since that has a 90% chance of being right, and if it looks garbled then I'll reparse it". But "if it looks garbled" needs human judgment. There's a better way. Parse it in UTF-8 first, in strict mode where ANY encoding error makes the entire parse be rejected. Then if the parse is rejected, re-parse it in ISO-8859-1. If the UTF-8 parser parses it without error, then either it was an ISO-8859-1 file with no accents at all (all characters 0x7F or below, so that the UTF-8 encoding and the ISO-8859-1 encoding are identical and therefore the file was correctly parsed), or else it was actually a UTF-8 file and it was correctly parsed. If the UTF-8 parser rejects the file as having invalid byte sequences, then parse it as the 8-bit encoding that is most likely in your context (for you that would be ISO-8859-1, for the guy in Japan who commented it would likely be Shift-JIS that he should try next, and so on).
That logic is going to work nearly 100% of the time, so close to 100% that if you find a file it fails on, you had better odds of winning the lottery. And that logic does not require a byte-order mark; it just requires realizing that UTF-8 is a rather strict encoding with a high chance of failing if it's asked to parse files that are actually from a different legacy 8-bit encoding. And that is, in fact, one of UTF-8's strengths (one guy elsewhere in this discussion thought that was a weakness of UTF-8) precisely because it means it's safe to try UTF-8 decoding first if you have an unknown file where nobody has told you the encoding. (E.g., you don't have any HTTP headers, HTML meta tags, or XML preambles to help you).
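Here's roughly what that decision procedure looks like in code (a sketch of my own, not any particular library's official recipe; in JavaScript/TypeScript the "strict mode" is TextDecoder's fatal flag):

    // Try strict UTF-8 first; any invalid sequence throws, and only then do we
    // fall back to the most likely legacy 8-bit encoding for the environment.
    function decodeUnknown(bytes: Uint8Array, legacyLabel = "iso-8859-1"): string {
      try {
        return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
      } catch {
        // Not valid UTF-8, so treat it as the legacy encoding instead.
        // (The "iso-8859-1" label actually maps to windows-1252 per the Encoding spec.)
        return new TextDecoder(legacyLabel).decode(bytes);
      }
    }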
NOW. Having said ALL that, if you are dealing with legacy software that you can't change, which expects to default to ISO-8859-1 encoding in the absence of anything else, then the UTF-8 BOM is still useful in that specific context. And you, in particular, sound like that's the case for you. So go ahead and use a UTF-8 BOM; it won't hurt in most cases, and it will actually help you. But MOST of the world is not in your situation; for MOST of the world, the UTF-8 BOM causes more problems than it solves. Which is why the default for ALL new software should be to try parsing UTF-8 first if you don't know what the encoding is, and try other encodings only if the UTF-8 parse fails. And when writing a file, it should always be UTF-8 without BOM unless the user explicitly requests something else.
Even the Amiga with its 8-bit text encoding was 40 years ago. Are you saying that for some radical reason modern apps on any platform should refuse to process a BOM? Parsing (skipping) a simple BOM header isn't the same as becoming fully Unicode-aware. I did not invent the BOM for UTF-8; it's out there in the wild. We'd better be able to read it, or else we will have this religious debate (and technical issues porting and parsing texts across platforms) for the next 40 years.
That's not what I'm saying at all, I'm saying that in the absence of a BOM header a Unicode-aware app should guess UTF-8 first and then guess other likely encodings second, because the chance of false positives on the "is this UTF-8?" guess is practically indistinguishable from zero. If it isn't UTF-8, the UTF-8 parsing attempt is nearly guaranteed to fail, so it's safe to do first.
I'm also saying that apps should not create a BOM any more (for UTF-8 only, not for UTF-16, where it's required), because the costs of dealing with BOMs outweigh the benefits. Except in certain specific circumstances, like having to deal with pre-Unicode apps that default to assuming 8-bit encodings.
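For the "stripping it, in the simplest case" that was asked about above, this is all it takes on the input side (a minimal sketch of my own, with a made-up helper name):

    // Strip a leading UTF-8 BOM (EF BB BF) if present; otherwise return the bytes untouched.
    function stripUtf8Bom(bytes: Uint8Array): Uint8Array {
      const hasBom =
        bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
      return hasBom ? bytes.subarray(3) : bytes;
    }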
Makes sense, thank you. The observation that false positives for UTF-8 detection tend to zero helps me understand. So I will vote for UTF-8 without BOM from now on (while encouraging parsers to deal with the BOM, if present).
Also, some XML parsers I've used choked on UTF-8 BOMs. I'm not sure whether valid XML is allowed to have anything other than clean ASCII in the first few characters, before it declares what the encoding is?
I'm assuming you misspelled Shift-JIS on purpose because you're sick and tired of dealing with it. If that was an accidental misspelling, it was inspired. :-)