Hacker News | delifue's comments

Its main benefit is to avoid having extra data structure (like hash map) to find the missing or duplicate, using O(n) time and O(1) space.
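
For context, here is a minimal sketch of the XOR trick for the missing-number case (my own illustration, assuming A holds the values 1..n with exactly one missing):

    # minimal sketch, assuming A holds 1..n with exactly one value missing
    def find_missing(A, n):
        result = 0
        for value in range(1, n + 1):
            result ^= value
        for value in A:
            result ^= value
        return result  # every present value cancels itself out, leaving the missing one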

No, again, that's not my point. The code from the article is O(2n) when it could be O(n). I know we're not supposed to care about constant factors, but I've lived in a world where not micro optimizing the ever loving shit out of my software could potentially make people throw up, so this sort of stuff kind of stands out to me.

The code in the article is written in Python, whose only for loop is for-each. It is 2N XOR operations, regardless of whether you use one or two loops.

I probably would have written it with a single loop, using the `enumerate` iterator adapter. But in Python, two loops is almost certainly more efficient.


You can loop over the range and then do result ^= i ^ A[i]. If adapters are slow you don't need them here.
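
Something like this, as a sketch, assuming A has n elements so the index-based access lines up:

    result = 0
    for i in range(1, n + 1):
        result ^= i ^ A[i - 1]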

Having only one loop gives you a better memory access pattern, because it's 2 XOR operations in between each memory access. Two loops is the same number of instructions, but it spends one loop ignoring memory and then another loop doing rapid-fire memory accesses. In python there's enough overhead that it's unlikely to matter. But in a faster language running on a busy machine that could make a real difference.


for i in range(n - 1):

Is pretty much a standard for loop. Between that and

for n in numbers:

You can do pretty much the same things as in a more conventional language.

You could also solve it pretty simply like this:

    expected_sum = n * (n + 1) // 2   # integer division; plain / would give a float
    missing_num = expected_sum - sum(numbers)

This only iterates the list once. This would probably be my solution if I was given this task in an interview.
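
A quick sanity check of that idea with made-up values (assuming numbers holds 1..n with one value missing):

    numbers = [1, 2, 4, 5]                     # 3 is missing from 1..5
    n = 5
    expected_sum = n * (n + 1) // 2
    missing_num = expected_sum - sum(numbers)
    assert missing_num == 3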


Real-world performance will depend on how much of that N fits in cache, and in which cache it fits (L1, L2, L3). Once loaded, you may not pay much cost to access each value a second time.

Doing 2 loops over the data means you have to do a full pass over the data twice. If your N doesn’t fit in L3 then you’re going to load N twice instead of once. Loading twice, even out of L1, is still slower than never loading twice at all.

Exactly. And there's also the fact that sometimes the data stream we're processing is unbounded and ephemeral. For example, reading values from a physical sensor. It may not match up to this specific example, but the point remains that a single "loop" over a data set might be all you get, so pack as much into that loop as you can.

O(2n) doesn't exist. The whole point of big O is that you ignore such "trivial" things as what factor comes before the n.

Did I not say that?

>we end up performing two loops instead of one, all because sticking three operations in one statement is "yucky"

You seem to believe that "O(2n)"

  for value in range(1, n + 1):
    result ^= value
  for value in A:
    result ^= value
is slower than "O(n)"

  for value in range(1, n + 1):
    result ^= value
    result ^= A[value-1]
simply because the latter has one fewer "for" loop. Am I misunderstanding you? If not, why would this matter for speed?

Unless both loops get unrolled, it's ever so slightly slower due to checking the loop-termination condition twice as often, plus potentially a cache miss at the start of the second loop.

None of this is as straightforward as it seems.

A "for" loop in Python isn't particularly cheap. It compiles to some static overhead to set up the iterator, then each loop iteration compiles to the "FOR_ITER" opcode, a "STORE_FAST" opcode to assign the iteration value to a variable, and the body of the loop.

"FOR_ITER" calls the "__next__()" method of the iterator (which is on top of the interpreter object stack), catches the StopIteration exception to know when to terminate the loop (by jumping past the loop body), and stores the iterator value to the top of the stack. What "__next__()" does is totally opaque - we don't know what kind of object A is - but since we've added the overhead of a function call already it wouldn't matter if it was a super tight bit of machine code, we're already paying a (relatively) hefty runtime cost.

A particularly bad implementation of "__next__()" for some custom iterable collection might be so stupid as to walk through the collection until it reaches the current index's item and return that, so "for value in A" could in fact be O(n^2).

Plus, "result ^= A[value-1]" is substantially more work than "result ^= value", so even just on the loop bodies the two examples aren't very similar at all. Evaluating "A[value-1]" may wind up calling a "__getitem__()" method on A.

If A is, say, a linked list or binary tree, iterating it is very cheap but indexing it is O(n), so the second loop might be O(n^2) where the first is O(n).
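
To make that concrete, here's a toy singly linked list (purely illustrative, names are mine) where iteration is cheap but __getitem__ walks from the head each time:

    class ToyLinkedList:
        # illustrative only: iterating is O(1) per step, indexing is O(n)
        class Node:
            def __init__(self, value, next=None):
                self.value, self.next = value, next

        def __init__(self, values):
            self.head = None
            for v in reversed(list(values)):
                self.head = self.Node(v, self.head)

        def __iter__(self):
            node = self.head
            while node is not None:
                yield node.value
                node = node.next

        def __getitem__(self, index):
            node = self.head
            for _ in range(index):   # walks from the head on every call
                node = node.next
            return node.value

With something like this, "for value in A" is linear overall, but "result ^= A[value - 1]" inside a loop makes the whole thing quadratic.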

So maybe we should be a bit more Pythonic, and do:

    for i, value in enumerate(A, 1):   # start at 1 to match range(1, n + 1)
        result ^= i
        result ^= value
One loop, no indexing of A! But we've not actually saved anything: the __next__() method of enumerate's iterator will increment its index then call the __next__() method of A's iterator, (approximately) the same work as if we'd done two FOR_ITER, one for an index and one for A.

Why would this matter for speed? I don't know. Unless 'n' is pretty big, a human won't even notice the execution time of any of this code.


Even assuming Python's foreach loop in these cases gets optimized down to a very bare for loop, the operations being performed are dominated by the looping logic itself, because the loop body is so simple.

Each iteration of a for loop performs one index update and one termination comparison. For a simple body that is just an XOR, that's the difference between performing 5 operations per element (update, exit check, read array, XOR with value, XOR with index) in the one-loop case versus 7 (update, exit, read array, XOR with value, then update, exit, XOR with index) in the two-loop case. So we're looking at roughly a 29% savings in operations.

It gets worse if the looping construct does not optimize down to a raw, basic for loop and instead builds some kind of lazy iterator generalized for all the kinds of collections it could iterate over.

The smaller the loop body, the higher the gains from optimizing the looping construct itself.


You said the code from the article is O(2n) when it could be O(n), but those are the same thing.

You did, but it might not be an effective strategy to mention asymptotic complexity to help forward your argument that one linear implementation is faster than another.

Whether it's a win in Python to use one or two loops isn't so clear, as a lot is hidden behind complex opcodes and opaque iterator implementations. Empirical testing might help, but a new interpreter version could change your results.
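
If someone does want to measure it, a rough timeit sketch along these lines would do (sizes and names are made up; results will vary by interpreter version):

    import timeit

    def two_loops(A, n):
        result = 0
        for value in range(1, n + 1):
            result ^= value
        for value in A:
            result ^= value
        return result

    def one_loop(A, n):
        result = 0
        for value in range(1, n + 1):
            result ^= value
            result ^= A[value - 1]
        return result

    n = 1_000_000
    A = list(range(1, n + 1))
    print(timeit.timeit(lambda: two_loops(A, n), number=10))
    print(timeit.timeit(lambda: one_loop(A, n), number=10))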

In any case, if we want to nitpick over performance we should be insisting on a parallel implementation to take advantage of the gobs of cores CPUs now have, but now we're on a micro-optimisation crusade and are ignoring the whole point of the article.


Go keeps declining to add syntax sugar for if err != nil: https://go.dev/blog/error-syntax

But a language problem can often be partially solved by the IDE. IDEs can already generate if err != nil branches, and GoLand can fold them. https://github.com/golang/vscode-go/issues/2311


The hardware's (GPU's) architectural limitations may slow research more than PyTorch does. See "The Hardware Lottery": https://hardwarelottery.github.io/

I've been trying to get BitGrid [1] a hardware lottery ticket for a decade now.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


"Software 2.0" reminds me of "web 3.0".


Luckily it’s nothing like that


Vibe coding is easy. Vibe debugging is hard.


The fifth point is the same as in 'Ask for no, don’t ask for yes'

https://www.mooreds.com/wordpress/archives/3518

https://news.ycombinator.com/item?id=43144611


If you use Box to refer to the parent, then the parent cannot own the child (unless you use something like Arc<Mutex<>>).


The "if I then go and code in Python, Java or C#, pretty much all objects have the overhead of an Arc" is not accurate. Rust Arc involves atomic operation and its preformance can greatly degrade when the reference count is being mutated by many threads. See https://pkolaczk.github.io/server-slower-than-a-laptop/

Java, C# and Go don't use atomic reference counting and don't have such overhead.


Not allowing taking references avoids interior pointers. Interior pointers make memory safety harder and also make GC harder.


If I keep removing one element from the front and adding one element at the back, a normal ring-buffer deque involves no copying, but this will keep copying elements into the empty space, so its performance could be much worse than a deque's if the queue is large.

