Hacker News | sparkie's comments

My luggage was missing when I landed at KIX.

But it wasn't the airport's fault - my luggage was still in Amsterdam.

Arrived <24 hours later and they delivered it to my hotel in Osaka.


In my experience, that's far and away the most common scenario. Luggage misses a connection, doesn't get on a flight that has been changed because of weather, or otherwise ends up somewhere it's not supposed to be. Many airline tracking systems are better than they used to be, but AirTags or equivalent are not a bad idea.

I once had SAS lose my luggage on a direct flight from Copenhagen to Tokyo Haneda. I was sure that such a thing was impossible, but I learned an important lesson that day.

Weird things do happen. I was once looking out of an airplane window waiting for takeoff when I saw my suitcase getting offloaded from the hold and taken back to the terminal. Spoke to the crew, who spoke to ground staff, who brought the bag back.

Not the slightest idea what it was all about... (I'm guessing some sort of mix-up in which they thought I'd failed to board?)


Defer might be better than nothing, but it's still a poor solution. An obvious example of a better, structural solution is C#'s `using` blocks.

    using (var resource = acquire()) {

    } // implicit resource.Dispose();
While we don't have the same simplicity in C, because we don't use this "disposable" pattern, we could still perhaps learn something from the syntax and use a secondary block to have scoped defers. Something like:

    using (auto resource = acquire(); free(resource)) {

    } // free(resource) call inserted here.
That's not so different from how a `for` block works:

    for (auto it = 0; it < count; it++) {

    } // it++; the `it < count` check; and the conditional branch are inserted here, after the for loop's secondary block.
A trivial "hack" for this kind of scoped defer would be to just wrap a for loop in a macro:

    #define using(var, acquire, release) \
        auto var = (acquire); \
        for (bool var##_once = true; var##_once; var##_once = false, (release))

    using (foo, malloc(szfoo), free(foo)) {
        using (bar, malloc(szbar), free(bar)) {
            ...
        } // free(bar) gets called here.
    } // free(foo) gets called here.
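
As a sanity check, the macro above compiles and behaves as described. Here it is with GCC/Clang's `__auto_type` in place of C23 `auto`, and with the allocations swapped for calls that just record the acquire/release order (all names here are illustrative):

```c
#include <stdbool.h>

/* The scoped-defer macro from above; __auto_type is the GCC/Clang
   spelling, C23 `auto` works the same. The for loop runs its body once,
   then evaluates the release expression in the third clause. */
#define using(var, acquire, release) \
    __auto_type var = (acquire); \
    for (bool var##_once = true; var##_once; var##_once = false, (release))

/* Record acquire/release order instead of calling malloc/free. */
static int events[4];
static int nevents = 0;
static int note(int e) { events[nevents++] = e; return e; }

void demo(void) {
    using (foo, note(1), note(-1)) {
        using (bar, note(2), note(-2)) {
            (void)foo; (void)bar;
        } /* note(-2) runs here */
    }     /* note(-1) runs here */
}
```

One caveat worth noting: `break` or `goto` out of the block skips the release (while `continue` still triggers it), so the macro is a sketch of the idea rather than a robust implementation.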

That is a different approach, but I don't think you've demonstrated why it's better. Seems like that approach forces you to introduce a new scope for every resource, which might otherwise be unnecessary:

    using (var resource1 = acquire()) {
        using (var resource2 = acquire()) {
            using (var resource3 = acquire()) {
                // use resources here..
            }
        }
    }
Compared to:

    var resource1 = acquire();
    defer { release(resource1); }
    var resource2 = acquire();
    defer { release(resource2); }
    var resource3 = acquire();
    defer { release(resource3); }
    // use resources here
Of course if you want the extra scopes (for whatever reason), you can still do that with defer, you're just not forced to.

While the macro version doesn't permit this, if it were built-in syntax (as in C#) we could write something like:

    using (auto res1 = acquire1(); free(res1))
    using (auto res2 = acquire2(); free(res2))
    using (auto res3 = acquire3(); free(res3)) 
    {
        // use resources here
    } 
    // free(res3); free(res2); free(res1); called in that order.
The argument for this approach is it is structural. `defer` statements are not structural control flow: They're `goto` or `comefrom` in disguise.
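
That said, a macro can get close to this chaining form if the declaration is moved entirely into the `for` header, at the cost of reaching the resource through a struct field. A hedged sketch (names are illustrative; relies on GCC/Clang `__typeof__`, which does not evaluate its operand):

```c
/* Everything lives in the for-init clause, so consecutive uses chain
   like the hypothetical built-in syntax: the next `using_chain` is the
   body of the previous one. The resource is accessed as `var.v`. */
#define using_chain(var, acquire, release) \
    for (struct { __typeof__(acquire) v; int live; } var = { (acquire), 1 }; \
         var.live; var.live = 0, (release))

/* Record acquire/release order to demonstrate the nesting. */
static int ev[4];
static int nev = 0;
static int mark(int e) { ev[nev++] = e; return e; }

void demo_chain(void) {
    using_chain (res1, mark(1), mark(-1))
    using_chain (res2, mark(2), mark(-2))
    {
        (void)res1.v; (void)res2.v;
    }
    /* releases run innermost-first: mark(-2), then mark(-1) */
}
```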

---

Even if we didn't want to introduce a new scope, we could have something like F#'s `use`[1], which makes the resource available until the end of the scope in which it was introduced.

    use auto res1 = acquire1() defer { free(res1); };
    use auto res2 = acquire2() defer { free(res2); };
    use auto res3 = acquire3() defer { free(res3); };
    // use resources here

In either case (using or use-defer), the acquisition and release are coupled together in the code. With `defer` statements they're scattered as separate statements. The main argument for `defer` is to keep the acquisition and release of resources together in code, but defer statements fail at doing that.

[1]:https://learn.microsoft.com/en-us/dotnet/fsharp/language-ref...


Defer is a restricted form of COMEFROM with automatic labels. You COMEFROM the end of the next `defer` block in the same scope, or from the end of the function (before `return`) if there is no more `defer`. The order of execution of defer-blocks is backwards (bottom-to-top) rather than the typical top-to-bottom.

    puts("foo");
    defer { puts("bar"); }
    puts("baz");
    defer { puts("qux"); }
    puts("corge");
    return;
Will evaluate:

    puts("foo");
    puts("baz");
    puts("corge");
    puts("qux");
    puts("bar");
    return;
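
For reference, GCC and Clang's `cleanup` attribute can reproduce this ordering today: handlers fire in reverse declaration order at scope exit, much like `defer`. A rough sketch (names are mine), collecting the output in a buffer rather than printing:

```c
#include <string.h>

static char out[64];
static void say(const char *s) { strcat(out, s); strcat(out, " "); }

/* A cleanup handler receives a pointer to the variable going out of scope. */
static void run(void (**f)(void)) { (*f)(); }
static void say_bar(void) { say("bar"); }
static void say_qux(void) { say("qux"); }

void demo_defer(void) {
    say("foo");
    __attribute__((cleanup(run))) void (*d1)(void) = say_bar;
    say("baz");
    __attribute__((cleanup(run))) void (*d2)(void) = say_qux;
    say("corge");
}   /* d2 then d1 are cleaned up: "qux" before "bar" */
```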

That is the most cursed description I have seen on how defer works. Ever.

This is how it would look with explicit labels and comefrom:

    puts("foo");
    before_defer0:
    comefrom after_defer1;
    puts("bar");
    after_defer0:
    comefrom before_defer0;
    puts("baz");
    before_defer1:
    comefrom before_ret;
    puts("qux");
    after_defer1:
    comefrom before_defer1;
    puts("corge");
    before_ret:
    comefrom after_defer0;
    return;
---

`defer` is obviously not implemented in this way; the compiler will re-order the code to flow top-to-bottom and have fewer branches, but the control flow is effectively the same thing.

In theory a compiler could implement `comefrom` by re-ordering the basic blocks like `defer` does, so that the actual runtime evaluation of code is still top-to-bottom.


Compile using `-fkeep-inline-functions`.


Doesn't help. The point is to avoid having to invoke a C compiler when working in X language. But it would certainly be nice if the distros enabled that.


`inline` is a hint, but he declares `static_inline` in the preprocessor to include `__attribute__((__always_inline__))`, which is more than just a hint. However, even `always_inline` may be troublesome across translation units. We can still inline things across translation units with `-flto`, though I believe there are occasional bugs. For libraries we'd also want to use `-ffat-lto-objects`.


A subset of C could still use existing C compilers and get the optimizations. The front-end would just restrict what can be expressed in it.


What's the benefit of this?

Say I am writing a transpiler to C, and I have to choose whether I will target C89, or some arbitrary subset of C23. When would I ever choose the latter?

The only benefit I could think of is where you're also planning to write a new C compiler, and this is simplified by the C being restricted in some way. But if you're doing this, you're just writing a frontend and backend, with an awkward and unnecessary middle-end coupling to some arbitrary subset of C. What's the benefit of C being involved at all in this scenario?

And say you realise this, and opt to replace C with some kind of more abstract, C-like IR. Aren't you now just writing an LLVM clone, with all the work that entails? When the original point of targeting C was to get its portability and backends for free?


`uintptr_t` and `intptr_t` are integer types large enough to hold a pointer. They're not pointer types (They're also optional in the standard).

In the first `my_func`, there is the possibility that `a` and `b` are equal if their struct layouts are equivalent (or one has a proper subset of the other's fields in the same order). To tell the compiler they don't overlap we would use `(strong_type1 *restrict a, strong_type2 *restrict b)`.

There's also the possibility that the pointers could point to the same address but compare non-equal - e.g. if LAM/UAI/TBI are enabled, a simple pointer equality comparison is not sufficient because the high bits may not be equal. Or on platforms where memory access is always aligned, the low bits may not be equal. These bits are sometimes used to tag pointers with additional information.
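
A sketch of what a tag-insensitive comparison might look like, assuming an ARM-style TBI scheme where the top byte of a 64-bit pointer is ignored on access (the mask and names are purely illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: with TBI, bits 56..63 of a pointer are ignored on
   access, so two pointers to the same object may differ in those bits.
   Strip the assumed tag bits before comparing the address bits. */
#define TAG_BITS 0xFF00000000000000ull

static bool same_address(const void *a, const void *b) {
    return ((uintptr_t)a & ~(uintptr_t)TAG_BITS)
        == ((uintptr_t)b & ~(uintptr_t)TAG_BITS);
}
```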


I was disappointed when MS discontinued Axum, which I found pleasant to use; I thought the language-based approach was nicer than a library-based solution like Orleans.

The Axum language had `domain` types, which could contain one or more `agent` and some state. Agents could have multiple functions and could share domain state, but not access state in other domains directly. The programming model was passing messages between agents over a typed `channel` using directional infix operators, which could also be used to build process pipelines. The channels could contain `schema` types and a state-machine like protocol spec for message ordering.

It didn't have "classes", but Axum files could live in the same projects as regular C# files and call into them. The C# compiler that came with it was modified to introduce an `isolated` keyword for classes, which prevented them from accessing `static` fields, which was key to ensuring state didn't escape the domain.

The software and most of the information was scrubbed from MS's own website, but you can find an archived copy of the manual[1]. I still have a copy of the software installer somewhere, but I doubt it would work on any recent Windows.

Sadly this project was axed before MS had embraced open source. It would've been nice if they had released the source when they decided to discontinue working on it.

[1]:https://web.archive.org/web/20110629202213/http://download.m...


> I would use std::uint64_t which guarantees a type of that size, provided it is supported.

The comment on the typedef points out that the signature of the intrinsics uses `unsigned long long`, though he incorrectly states that `uint64_t` is `unsigned long` - which isn't true in general, as `long` is only guaranteed to be at least 32-bits and at least as large as `int`. In ILP32 and LLP64, for example, `long` is only 32-bits.

I don't think this really matters anyway. `long long` is 64-bits on pretty much everything that matters, and he is using architecture-specific intrinsics in the code so it is not going to be portable anyway.

If some future arch had 128-bit hardware integers and a data model where `long long` is 128-bits, we wouldn't need this code at all, as we would just use the hardware support for 128-bits.

But I agree that `uint64_t` is the correct type to use for the definition of `u128`, if we wanted to guarantee it occupies the same storage. The width-specific intrinsics should also use this type.
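
For illustration, a minimal `u128` along those lines, with addition detecting the carry via wraparound of the low word (the names are my own, not the article's):

```c
#include <stdint.h>

/* Two uint64_t limbs guarantee the storage layout regardless of the
   platform's `long`/`long long` widths. */
typedef struct { uint64_t lo, hi; } u128;

static u128 u128_add(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* unsigned wraparound => carry */
    return r;
}
```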

> I would be interested to see how all these operations fair against compiler-specific implementations

There's a godbolt link at the top of the article which has the comparison. The resulting assembly is basically equivalent to the built-in support.


> though he incorrectly states that `uint64_t` is `unsigned long`

It probably is; he's probably just using macOS, where both `long` and `long long` are 64-bit. https://www.intel.com/content/www/us/en/developer/articles/t...

(that's the best linkable reference I could find, unfortunately).

I've run into a similar problem where an overload resolution for uint64_t was not being used when calling with a size_t because one was unsigned long and the other was unsigned long long, which are both 64 bit uints, but according to the compiler, they're different types.

This was a while ago so the details may be off, but the silly shape of the issue is correct.


> It probably is

This was my point. It may be `unsigned long` on his machine (or any that use LP64), but that isn't what `uint64_t` means. `uint64_t` means a type that is exactly 64-bits, whereas `unsigned long` is simply a type that is at least as large as `unsigned int` and at least 32-bits, and `unsigned long long` is a type that is at least as large as `unsigned long` and at least 64-bits.

I was not aware of compilers rejecting the equivalence of `long` and `long long` on LP64. GCC on Linux certainly doesn't. On windows it would be the case because it uses LLP64 where `long` is 32-bits and `long long` is 64-bits.

An intrinsic like `_addcarry_u64` should be using the `uint64_t` type, since its behavior depends on it being precisely 64-bits, which neither `long` nor `long long` guarantee. Intel's intrinsics spec defines it as using the type `unsigned __int64`, but since `__int64` is not a standard type, it is probably implemented as a typedef or `#define __int64 long long` by the compiler or `<immintrin.h>` he is using.


Sorry I'm a bit late to the party.

long and long long are convertible, that's not the issue. They are distinct types though, so long* and long long* are NOT implicitly convertible. And uint64_t is not consistently the correct type.

See: https://godbolt.org/z/bYb7a38dG

I'd prefer if the intrinsics use the same uint64_t but they don't.


Cryptography would be one application. Many crypto libraries use an arbitrary size `bigint` type, but the algorithms typically use modular arithmetic on some fixed width types (128-bit, 256-bit, 512-bit, or some in-between like 384-bits).

They're typically implemented with arrays of 64-bit or 32-bit unsigned integers, but if 128-bits were available in hardware, we could get a performance boost. Any arbitrary precision integer library would benefit from 128-bit hardware integers.
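
A sketch of the serial carry chain such libraries run over their limbs - the loop-carried dependency here is exactly what `addcarry`-style instructions accelerate (the function name and interface are illustrative):

```c
#include <stdint.h>

/* Schoolbook bignum add over 64-bit limbs, least-significant limb first;
   returns the final carry out. Each comparison detects unsigned
   wraparound: since carry is 0 or 1, at most one of the two partial
   additions can overflow. */
static uint64_t limbs_add(uint64_t *r, const uint64_t *a,
                          const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;      /* carry from adding the old carry */
        r[i] = s + b[i];
        carry = c1 + (r[i] < s);      /* carry from adding b[i] */
    }
    return carry;
}
```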


I suppose that makes sense -- though SIMD seems more useful for accelerating a lot of crypto?


SIMD is for performing parallel operations on many smaller types. It can help with some cryptography, but it doesn't necessarily help when performing single arithmetic operations on larger types, though it does help when performing logic and shift operations on them.

If we were performing 128-bit arithmetic in parallel over many values, then a SIMD implementation may help, but without a SIMD equivalent of `addcarry`, there's a limit to how much it can help.

Something like this could potentially be added to AVX-512 for example by utilizing the `k` mask registers for the carries.

The best we have currently is `adcx` and `adox` which let us use two interleaved addcarry chains, where one utilizes the carry flag and the other utilizes the overflow flag, which improves ILP. These instructions are quite niche but are used in bigint libraries to improve performance.


> but It doesn't necessarily help when performing single arithmetic operations on larger types.

For the curious, AFAIU the problem is the dependency chains. For example, for simple bignum addition you can't just naively perform all the adds on each limb in parallel and then apply the carries in parallel; the addition of each limb depends on the carries from the previous limbs. Working around these issues with masking and other tricks typically ends up adding too many additional operations, resulting in lower throughput than non-SIMD approaches.

There are quite a few papers on using SIMD to accelerate bignum arithmetic for single operations, but they all seem quite complicated and heavily qualified. The threshold for eking out any gain is quite high, e.g. minimum 512-bit numbers or much greater, depending. And they tend to target complex or specialized operations (not straight addition, multiplication, etc.) where clever algebraic rearrangements can profitably reorder dependency chains for SIMD specifically.

