Programming Algorithms in Lisp: Data Structures (lisp-univ-etc.blogspot.com)
124 points by nafizh on Aug 7, 2019 | 32 comments



Everything about this is bizarre. It spends a lot of time explaining what a pair is, but then just assumes that computational complexity and O-notation are known.

Union-find is a strange choice as a first data structure, but OK, maybe it could be made to work. However, none of the variants presented is like union-find in actual implementations. There is no path compression. The second variant of uf-add looks like it tries to do something path-compression-like, but it's the union or (usually) the find operation where it needs to be.

And then, the uf-union operation: No practical union-find implementation accesses the list of all entries in this operation. A linear traversal over all entries defeats the purpose of having all operations be amortized almost-constant time. (Also, uf-add seems to allow adding new entries to the data structure but does not add them to the points list. There are some bugs waiting to happen.)

I won't try to judge the Lisp part of this book, but the algorithms and data structures part is deeply flawed, and nobody should try to use this as a first or primary learning resource.


> but then just assumes that computational complexity and O-notation are known.

At the bottom, you can find a reference to the previously published chapters. Complexity & big-O were addressed in the initial one. (This is part of a book; all the organizational details are explained in part 1.)

> There is no path compression.

The second variant is path compression. Yes, it may also be implemented in find, but it will make the code more complicated, in my view.

> And then, the uf-union operation: No practical union-find implementation accesses the list of all entries in this operation.

You are correct here. I'll make a change, thanks.

> Also, uf-add seems to allow adding new entries to the data structure but does not add them to the points list

This isn't needed here, as it is supposed that we already have the list of points; it's just not arranged for an efficient disjoint test. I, actually, had the reference to it initially but removed it as I considered it redundant. I'll think of adding it back.


> At the bottom, you can find a reference to the previously published chapters. Complexity & big-O was addressed in the initial.

I stand corrected, thanks. I knew this was part of a book and did go to the sidebar to check the titles of older chapters, but there the first part is called '"Programming Algorithms" Book', not 'Introduction & Complexity' like at the bottom. Going only by the title, I didn't think there would be complexity there. My fault.

> The second variant is path compression.

Yes, but only on the "add" operation. It's more important to have it elsewhere. Imagine a set of elements {a, b, c, d, e} where you do a sequence of pairwise union operations on (a, b), (b, c), etc., and end up with the data structure degenerating into a linked list a -> b -> c -> d -> e. If you now repeatedly check whether a and b are in the same class, you end up searching the entire list twice, every time you do this search.

You could try adding path compression in union, but I don't think a complete, efficient solution is possible: You would always be able to construct small trees that you would have to link to produce bigger, deeper trees.

That's why path compression always goes where it matters and can make a useful difference: in find. In the example above, depending on the details, you would have one or two full linear searches, and then every query on a and b would be constant time.

> Yes, it may also be implemented in find, but it will make the code more complicated, in my view.

That's one of the reasons why union-find might not be the best introductory data structure. As for your view, to some extent, when you teach standard stuff, your view is not as important as actually teaching what you claim to teach. If you claim to teach union-find, you don't get to pick something somewhat similar but with radically different complexity.

> This isn't needed here [...]. I ll think of adding it back.

No, don't! You're right that a full list of nodes is never needed by the data structure itself. A user of the data structure might need one, but also they might not. I have in the past used union-find in settings where it was more convenient not to have a complete list of entries.


I examined Union-Find in more detail and found the critical flaw in my description that you were pointing to. Unfortunately, I had forgotten about it and had a simplified view of this data structure that focused on improving find but neglected union. The funny thing is that it was discussed, at times, at group job interviews in which I participated as one of the interviewers, and we didn't go this deep to uncover that flaw :) (or, maybe, it was long enough ago for me to forget).

So, I've corrected the Union-Find example to be in line with proper presentation (https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf) removing uf-add in the process. Thanks for the very valuable input.

Surely, that made the example not so bright (as we couldn't reduce everything to a constant-time version with very simple code), but I don't think that the example is inappropriate. It still remains quite simple from the code standpoint. Another reason I picked it was that it didn't require an explanation of any additional data structure, not even an array, and all the information could be efficiently represented using structs. This still holds.


In the first installment of this series, the author uses := for assignment instead of setf. This is at least relying on undefined behaviour, as keywords are meant to be constant symbols, not themselves assignable. SBCL allows assignment to the function slot of a keyword, but LispWorks doesn't. Using undefined behaviour almost certainly guarantees incompatibilities between implementations and should always be avoided. Here, it is a central feature. This makes the book untrustworthy, whatever good features it might have.

I must point out that setf is extensible (defsetf and so on) and therefore much better anyway.

(I must say though that one of the poor choices made in the C-family of languages is the choice of = as assignment, when Algol already had used := for this.)


This was already discussed in great detail (greater than such a minor thing deserves) in the comments to the first and next chapters. And, by the way, LispWorks also allows it (see here: https://github.com/vseloved/rutils/pull/39 - a person even bothered to provide a fix to make it work there). So, get over it: this book is not about nitpicking but about trying to explain things in an intelligible and accessible manner. A luxury I didn't always have with these topics...


I'm not sure pointing out undefined behaviour is nitpicking. It could in some cases be crucial. At the very least, it ties your code to only some Common Lisp implementations.

I know you can assign to keywords in LispWorks (using a restart), but that should be done in your rutils library, wrapping the definition of :=, :*, :/. And then you should do the same for all other implementations, which may differ in what condition is signalled, and whether there really is a restart for this, and of course whether it actually is allowed in that implementation.


Sorry, getting back to the initial point: can you point to a statement in the standard that forbids assigning to the symbol-function of keywords or deems it unspecified behavior? I see only this: http://www.lispworks.com/documentation/HyperSpec/Body/t_kwd.... which doesn't state anything like that. Beyond these constraints, keyword symbols are normal symbols, so they should abide by the same rules.


Following your link: "3. It causes the symbol to become a constant variable." and following the link to constant variable: "constant variable n. a variable, the value of which can never change; that is, a keyword[1] or a named constant. ``The symbols t, nil, :direction, and most-positive-fixnum are constant variables.''"

This doesn't mention the function slot of the symbol, so one could perhaps argue that the function slot is therefore not meant to be constant, but one could equally well argue the opposite. The various implementations obviously differ on this. That alone means that this is either undefined behaviour, or that some of them have had this as a very long-standing bug, at least since their inception.


> This doesn't mention the function slot of the symbol, so one could perhaps argue that therefore the function slot is not meant to be a constant variable

Exactly

> but one could equally well argue the opposite

No, as variables and functions are different parts and there's a more general rule about functions applied to keywords.

Your argument seems to be based on a notion that keywords are something unique and special in CL. They are, but only to the extent that is specified in this section. Otherwise, they are the same as other symbols.

To me, Lisp is not about a "forbidden all that's not explicitly allowed" mentality, but about "allowed all that's not explicitly forbidden". And it's, actually, an important trait of the language that makes me value it more than others. So, sorry, I value your opinion, but I think that such things as := need to exist if only to broaden the horizons of people with such opinions :)


> ...I value your opinion...

And I value your work on this book.

I apologize for having said at the beginning that "This makes the book untrustworthy, whatever good features it might have." It was unwarranted, and not substantiated, and much too harsh.

I do however stand by my claim that it is undefined behaviour. (You could help by making the suggested changes to rutils.)


> To implement an arbitrary data structure, these languages provide a special mechanism called records, structs, objects, etc. The proper name for it would be "tuple". It is the data structure that consists of a number of fields each one holding either a primitive value, another tuple or a pointer to another tuple of any type. This way a tuple can represent any structure, including nested and recursive ones. In the context of type theory, such structures are called sum types

Wait, what? Tuples aren't sum types, they're product types. Like, they're the "cartesian product" of some other types. A sum type would be like a tagged union (or Rust enum) where something can be either one type or the other. I haven't studied a lot of algebraic type theory, but Wikipedia agrees with me on that point [0]

Seems like a fairly elementary mistake in an article of this kind...

EDIT: also, this, a few paragraphs down:

> Moreover, in C++ the pass-by-reference has two flavors: pass the pointer (a common approach) and pass by what's actually called a reference, which is a syntax sugar over pointers that allows accessing the argument with non-pointer syntax (dot instead of arrow)

Pass-by-reference in C++ is most certainly not just syntactic sugar for passing in a pointer. A compiler may implement it that way (or it may not), but it's totally wrong to say that passing by reference is just syntactic sugar so you don't have to use arrow notation. The most obvious differences are that a reference can never be null, it can't be changed to point to some other object, you can't do pointer arithmetic on it, and you always dereference it when you use it. Also: in idiomatic C++, "passing by reference" is far more common than "passing by pointer".

I'm going to stop reading this article now. I don't have a lot of faith that this person really knows what they're talking about.

[0]: https://en.wikipedia.org/wiki/Algebraic_data_type


> Wait, what? Tuples aren't sum types, they're product types.

Thanks for the correction. As I wrote in the introduction, there will be some errors in these beta-version chapters that are published on the blog (as I, obviously, don't have correctors and reviewers), so I'm thankful to all who notice and point those.

> it's totally wrong to say that passing by reference is just syntactic sugar

OK, I'd say that it's still syntactic sugar with some benefits :) But I don't agree that it's totally wrong. Some of the things you mentioned are just side effects (that you can't do pointer arithmetic or pass a null pointer. Actually, it seems that you can with some jiggling: https://stackoverflow.com/questions/8571078/pass-by-pointer-...). The point I wanted to make is that pass-by-reference is mostly the same thing. It's, actually, quite a confusing topic (as are many C++ solutions) that was not properly presented to me when I studied it in school, and this book doesn't try to present it either, but I had to at least mention it here as I was talking about different passing styles.


It's not mostly the same thing; the difference is pretty significant both in practice and in a "type-theoretic" sense. I get what you're saying, that "pass by reference" and "pass by pointer" are both methods to pass an object into a function without copying, but they are very different. In an article like this, where the whole point is to talk about these things at a more fundamental level, it is bad to claim that they are the same.


The point of the article is not to talk about the low-level specifics of some language, apart from, maybe, Lisp sometimes. But, surely, not C++ (although, if you could provide a link with a good and accessible explanation of the differences between the two styles, I'd add it as a reference for further reading). Still, I don't see that fundamental difference that you mention. If the differences originate from the points that you listed in the original comment, can you, please, make the link more explicit? I understand pass-by-reference as being a slightly higher-level and safer C++ alternative to C's direct pointer-based style. But, surely, I may be missing something. But it seems that the people who answered the SO question I linked are also missing it. For example, here's a quote: "Basically, references and pointers are not very different from an implementation point of view, the main (and very important) difference is in the philosophy: a reference is the object itself, just with a different name."


Here is pass by reference and pass by pointer in Pascal.

    type
        Point = record
            x, y : integer
        end;
        PPoint = ^Point;

    procedure ByPointer(p : PPoint);
    begin
        p^.x := 12
    end;

    procedure ByReference(var p : Point);
    begin
        p.x := 12;
        p.y := 33
    end;

    procedure example;
    var
        p : Point;
    begin
        p.x := 23;
        p.y := 28;
        WriteLn('The point is ', p.x, ' ', p.y);

        ByPointer(@p);
        WriteLn('The point is ', p.x, ' ', p.y);

        p.x := 23;
        p.y := 28;
        WriteLn('The point is ', p.x, ' ', p.y);

        ByReference(p);
        WriteLn('The point is ', p.x, ' ', p.y)
    end;

Actually, the concept was already present in Algol, a couple of years before C was even an idea.


Good point, so, here, it is exactly syntax sugar as, unlike in C++, nothing else except the syntax differs from the pointer-based version. Not to mention that, in the article, I was discussing the general concept, which has one (and only one) variant in C and Java, and two in Pascal or C++.


Actually it is more than syntax sugar.

I should have provided a better example; this is impossible to do in languages like Java, and in C it requires explicit double indirection to fake how reference types work.

    type
        Point = record
            x, y : integer
        end;
        PPoint = ^Point;

    procedure ByPointer(p : PPoint);
    begin
        New(p);  (* reassigns only the local copy of the pointer *)
        p^.x := 12
    end;

    procedure ByReference(var p : PPoint);
    begin
        if p <> nil then
            Dispose(p);
        New(p);  (* reassigns the caller's pointer *)
        p^.x := 12;
        p^.y := 33
    end;

    procedure example;
    var
        p : PPoint;
    begin
        New(p);
        p^.x := 23;
        p^.y := 28;
        WriteLn('The point is ', p^.x, ' ', p^.y);

        ByPointer(p);
        WriteLn('The point is ', p^.x, ' ', p^.y);

        p^.x := 23;
        p^.y := 28;
        WriteLn('The point is ', p^.x, ' ', p^.y);

        ByReference(p);
        WriteLn('The point is ', p^.x, ' ', p^.y);  (* here p points to the record allocated in ByReference *)
        Dispose(p)
    end;


> Actually it is more than syntax sugar.

Where? I see only syntax sugar. The catchy phrase from the Google C++ style guide is stuck in my head: references have value syntax but pointer semantics. And that's it (and that's why I think references are mostly a bad idea).

> C requires explicit double indirection

Your 2nd example is no different, it's still just "indirection" (value syntax but pointer semantics!). If what you are indirecting is itself a pointer type (as in your 2nd example), then yes, there technically is a "double indirection", but that's beside the point.

> impossible to do in languages like Java

Actually, "in every memory-safe language there is". You can do this in Pascal because Pascal is really just C-style systems programming, with a few features like references and ARC that pretend otherwise but quickly fall flat.


Ok I might have to give in with the syntax sugar part, but safety?

Here is my example in C#. So now C# is unsafe?

    class Point
    {
        public int x, y;
    }


    static class CSharpIsUnsafe
    {
        static void ByPointer(Point p)
        {
            p = new Point();  // reassigns only the local parameter
            p.x = 12;
        }

        static void ByReference(ref Point p)
        {
            p = new Point();  // reassigns the caller's variable
            p.x = 12;
            p.y = 33;
        }

        public static void Main(string[] args)
        {
            var p = new Point();
            p.x = 23;
            p.y = 28;
            Console.WriteLine($"The point is {p.x} {p.y}");

            ByPointer(p);
            Console.WriteLine($"The point is {p.x} {p.y}");

            p.x = 23;
            p.y = 28;
            Console.WriteLine($"The point is {p.x} {p.y}");

            ByReference(ref p);
            Console.WriteLine($"The point is {p.x} {p.y}");
        }
    }


I guess it might very well be safe. I don't know C# very well, but I'd assume that ref is sort of a proxy that

1) under the hood stores an (object) reference to the lowest structural ancestor that is GC'ed, as well as an offset from that object to the ref'ed property.

2) is itself GC'ed

Escape analysis might in some cases optimize out the GC'ing of the ref, but that of course doesn't work if the ref does escape.

Something like that, but in any case it's something very different than the Pascal/C++ examples we've talked about.


The generated code is similar to how a C++ compiler would do it.

Easy to check with VS Assembly visualizer or WinDbg alongside SOS.dll.

D, Ada, CLU, Mesa, Mesa/Cedar, Algol, PL/I, PL/S, PL.8, Delphi, Dylan, .NET, Swift, Oberon, Oberon-2, Component Pascal, Active Oberon, QuickBasic, Turbo Basic are just a few examples of languages allowing for both pass by value and pass by reference, some of them also quite safe.

And then there is also call by name, introduced with Algol.


And here's another quote from this very much upvoted answer (https://stackoverflow.com/a/596750/977052): "A reference can be thought of as a constant pointer (not to be confused with a pointer to a constant value!) with automatic indirection, ie the compiler will apply the * operator for you."


The word "continuous" here is used (incorrectly) to mean "contiguous".

"Continuous" means ongoing in time (as in "continuing"). "Contiguous" means "touching on a boundary". So a "contiguous data structure" is one that "occupies a single chunk of memory and its contents are stored in adjacent memory blocks."

Continuity would only really mean anything for time-series data.


At least in general usage, outside of any domain-specific conventions, "continuous" doesn't have to have the temporal sense. The first entry in the OED for "continuous":

> a. Characterized by continuity; extending in space without interruption of substance; having no interstices or breaks; having its parts in immediate connection; connected, unbroken.

> 1673 N. Grew Idea Phytol. Hist. ii. iii. 65 It is compounded of two Bodies. The one parenchymous; continuous throughout; yet somewhat pliable without a solution of its continuity.

> 1704 I. Newton Opticks ii. ii. 41 The dark intervals must be diminished, until the neighbouring Rings become continuous, and are blended.

> 1796 R. Southey Joan of Arc vii. 6 Round the city stretch'd Their line continuous, massy as the wall Erst by the fearful Roman..raised.

> 1859 C. Darwin Origin of Species xi. 352 In most cases the area inhabited by a species is continuous.

> 1879 J. N. Lockyer Elem. Lessons Astron. vi. 228 If we light a match and observe its spectrum, we find that it is continuous—that is, from red through the whole gamut of colour to the visible limit of the violet.

> 1881 J. C. Maxwell Treat. Electr. & Magnetism (ed. 2) I. 6 Without describing a continuous line in space.


OK, well, "contiguous" is more specifically applicable here at least. If you search "continuous data structure" on DDG, it auto-includes results for "contiguous data structure".


thanks for pointing this out


I'm enjoying this series. A while back I was playing around with binary arithmetic coding and statistical modelling. For the modelling I was using a simple Markov model, and I hit performance issues. I explored all kinds of trees but eventually settled on hash tables. Eventually I had to use my own specialized bit vectors and hash tables, but it was my first brush with a problem-space where data structures and algorithms are critical.

The algorithm worked amazingly well for small strings, outperforming the PAQ versions I tried, but it lost compression power as the input got longer. I didn't have the mathematical expertise to figure out why that was the case.

Still didn't get it to the kind of performance I liked, but maybe someone with greater insight could take a look at it: https://github.com/shrdlu68/ac-experiment


Common Lisp's multiple values took some getting used to. In particular, having gotten used to tuples and lists, it is somewhat mind-bending to realize you can return multiple things without either.


The key feature of multiple values is that each value gets passed back to the caller on the stack, not the heap. So they don't cons garbage the way returning a list or struct would.

This is a natural thing to do in assembly language. Common Lisp just makes it available in a high level language.


Even better, the first few multiple values are typically returned in CPU registers and not on the stack.


This is great, thanks for creating this! I love the first few chapters, and I'm really liking the code explanations in this one.



