Designing a Programming Language

toolslive · on July 17, 2015

"A program written in C can be deployed on virtually any operating system". Let me kill that illusion: While it is possible to write C that can be deployed on virtually any operating system, doing so is quite hard.

Operating systems have different system calls, and even if they are the same family (Linux vs Solaris fe) the semantics can differ. Another thing is that the resolution of the basic primitive types can vary: 'long' is not equally wide on all platforms, heck, even 'char' can be 16 bits on some exotic systems. Anyway, writing portable C is an extra level of complexity. Just write up a little socket server in C and try to get it running on Windows, Linux and Solaris, maybe mix in 32 and 64 bit and see if you still have the same opinion afterwards.

ectoplasm · on July 17, 2015

Love it or hate it, that's what Autoconf and friends are for. I get to use the same CLI on every computer I have thanks to Autoconf. Also, how many of the compilers and VMs for the languages you love were written in something other than C/C++?

toolslive · on July 17, 2015

most of them end up self-hosting, but plenty of them are initially written in language that's well geared to building compilers. Rust fe was bootstrapped using OCaml to write the first compilers. Smart choice.

jwdunne · on July 17, 2015

I believe many lisps do this too. This is where I learned about self-hosting compilers before realising they're pretty much everywhere. I remember looking at the source of a CL impl and thinking "what? It's all in Common Lisp?!"

Despite my ignorance, it's pretty common. C compilers tend to be self-hosting from my understanding. It seems like madness to maintain it all in x86 without some justification.

jwdunne · on July 17, 2015

Taking another look at SBCL, even the assembler and its code is written in Lisp:

https://github.com/sbcl/sbcl/blob/master/src/assembly/x86/ar...

Very interesting stuff.

In the same repo, you can also see that the compiler is also written in SBCL.

gmfawcett · on July 17, 2015

It's written in ANSI Common Lisp, not in an SBCL dialect: you can bootstrap SBCL with an ANSI CL implementation other than SBCL. IIRC, that was one of the key reasons for the original SBCL/CMUCL split.

raindev · on July 17, 2015

Some interesting thoughts on Autoconf could be found in https://queue.acm.org/detail.cfm?id=2349257

fizixer · on July 17, 2015

I just read this article for the first time. And even though it raised some interesting points that I completely agree with, he builds his argument based on some inaccuracies. Most importantly, his writing gives the impression that autoconf was written by some dot-com wunderkinder.

Autoconf's wikipedia page says it first appeared in 1991, very likely to solve configure problems of the mid-to-late 1980s. That's as far from dot-com as you can get.

He also bashes the m4 macro language. I have never learnt the language but it's possible the obscurity of the langauge is due to not a lot of people spending time to learn it. And should they? I can't say for sure, but I do know the motivation for why m4 was created. It was meant to be a feature-rich and consistent alternative to the C preprocessor which is full of warts and is not turing complete (which m4 is). There are C programs out there written without a single line of '#' magic, using m4 preprocessor instead. You could say m4 syntax ugly, but you cannot easily defend that there is no need for a turing complete, preferably general purpose, macro language. C++ templates try to provide another alternative, and unsurprisingly, are known to be one of the syntactically complex aspects of the language.

And today we don't have Cray and DEC and all those obscure architectures, but we do have ARM, and 32/64, and GPUs, and so on. So the architecture proliferation doesn't look like dying anytime soon. And we have the autoconf/configure alternative LLVM, very loosely speaking (an attempt to let one source code work with multiple architectures), which is by many standards an order of magnitude more complex than the former.

Just trying to put things in perspective.

mannykannot · on July 17, 2015

To further the interests of accuracy, Poul-Henning Kamp's bashing of m4 is only with regard to its use to implement autoconf, though my guess is that he doesn't have any use for it in any other circumstance - and neither has the rest of the world, were it not for the unfortunate accident of it being chosen to implement autoconf.

I imagine it would be fairly easy to "defend that there is no need for a turing complete, preferably general purpose, macro language" empirically on the basis of the number of things that get done without one, and more formally on the basis of Turing equivalence.

ectoplasm · on July 17, 2015

There's no need for a systems language besides C?

mannykannot · on July 18, 2015

Need? no. Could it be improved on? I am certain it could (and Rust looks like an interesting way to do so), but that improvement would probably not be via a Turing-equivalent macro language that could, to an arbitrary degree, subvert the semantics of the language in which the programs are actually written. The trend has been to move away from macro-processing code, rather than towards strengthening the macro processor's power.

One might argue that C++'s template sub/co-language is a counter-example, but, putting aside the question of whether it is actually progress (on balance, I think it is), it has a syntax which discourages its use in a completely arbitrary way.

ectoplasm · on July 18, 2015

Sorry, I was being cryptic, my fault. I was generalizing your claim about macro languages to other languages.

I mean, doesn't your original argument imply that there is no need for a systems language besides C empirically on the basis of the number of things that get done without one, and more formally on the basis of Turing equivalence?

mannykannot · on July 18, 2015

You are right about the timing. I believe the fragmentation started at Berkeley, and was greatly expanded by the mini-computer manufacturers. Ironically, far from this being a consequence of AT&T attempting to commercialize Unix, it followed from AT&T not being allowed to do so (there is another irony in Kamp being a BSD user.)

And while I thoroughly dislike having anything to do with autoconf and its relatives, I have to admit the need for something like it, on account of decisions beyond the control of the FOSS community and the contingencies of history. Even the choice of m4 may have seemed more reasonable at the time, given fewer alternatives (though there was Perl, and perhaps TCL, to choose from.)

While Kamp's description of the current situation seems accurate, his attribution of blame does not, and his lament that things could have been so much better is influenced by his somewhat inaccurate hindsight.

xigency · on July 17, 2015

Well in this case, the program hardly needs an operating system at all. The only library calls required by default are printf and fgets. Assuming there's a way to get text and print it, it can run. There also needs to be about 50-100k of RAM when the interpreter is loaded into memory.

toolslive · on July 17, 2015

well, language features tend to need OS/system features. for example, trying to support dynamically loaded libraries, or leveraging multiple cores. It might not be relevant for a little toy language, but when you go to the next level, you will hit this and by then, it might be too late and you're stuck with a sub awesome choice.

vidarh · on July 17, 2015

Once you need to do those things, C is where you'll find the most pre-existing tooling pretty much regardless what platform you're on.

(note that personally I strongly believe in self-hosting compilers, but not because it's the easy approach, but because it's not: it forces you to make sure you can support all of the hard bits yourself; if I wanted easy, I'd pick C)

lmm · on July 17, 2015

You need the ability to make system calls, and you probably want the ability to call the standard library, which means supporting the "C" ABI (kind of a misnomer - system libraries might be written in e.g. Pascal and you wouldn't notice the difference, they just have to support the same calling convention. But you do need to be interoperable with C)

jrapdx3 · on July 17, 2015

The article starts out with a simple, incremental approach, pretty easy to grasp the concepts all the way through the discussion of the lexer and digesting the program input.

Then all of a sudden the density of the material accelerates dramatically. The discussion of the parser, and parsing as it applies to the language, becomes very opaque to readers not already familiar with the technologies being described.

It's perfectly OK to proceed that way for a sophisticated audience, but the first part of the article is written at a much more introductory level. An unsuspecting beginner may very likely feel overwhelmed about half-way through, and probably give up at that point.

When I got to the explanation of the AST, no surprise, it was exactly representable as an sexpr, which anyone with a few hours experience with Lisp or Scheme would recognize. Not sure, maybe it would have made more sense to begin with the parsed goal (the AST) and work backwards to explicate how parsing of the original syntax was done.

I guess parsing is a really hard subject to teach, but it's the core idea that the reader needs to grasp to be able to understand PL construction as the author laid it out. A still gentler introduction could be possible, if it doesn't sound so easy to do.

oskarth · on July 17, 2015

I noticed this phenomena in my own writing and I think I know where it comes from. In Polya's list of heuristic (from How to Solve It) he writes about the the two rules of teaching:

> The first rule of teaching is to know what you are supposed to teach. The second rule of teaching is to know a little more than what you are supposed to teach.

When we learn something and then write about it to teach others, oftentimes we are excited about the things we just learned, so we write about that, even if we don't fully understand it yet. Then we add an introduction to the "meat" (i.e. the limits of our current understanding) which is very clear, precisely because we are "overqualified" to teach it. Another way to state Poyla's rule for teaching is as follows:

Being overqualified in knowledge is a prerequisite for being qualified as a teacher of that knowledge.

One way I like to picture knowledge is as a dark room. You might step into one but that doesn't mean you know every corner of it. For example, imagine a proof or a technique. Unless you can know by heart why each assumption is there, or when the technique fails, you don't fully understand it. If you are guiding someone through a dark house, as you go from room to room, you are only qualified to give someone directions about the room you already been in, not the one you are in right now - even if that's what is on your mind.

Note that this isn't unique to people writing technical blogs, it happens in academia a lot too. There's a reason Feynman's lecture on "undergraduate physics" attracted the attention of graduate students.

agumonkey · on July 17, 2015

Your comment resonates very weirdly with my experience.

College teachers are guilty of the exponential wall. Easy and lengthy intro then steep but short core material (often cut short because the bell is ringing). Impossibly annoying.

The dark room exhaustive search is exactly how I ended up learning about learning. Through music though, as a self taught I had to visit every possible spots, even if most of them were holes. And most of the time, most things only 'make sense' because the other things just don't work. Now if you teach a topic to someone by listing what 'make sense', unless that person is very easy on remembering or extremely creative, it will be a meaningless burden to rote learn. On the other hand, knowing all the wrong parts, makes you very good at teaching someone since as the student will ask why his intuition is failing, you know how it feels and you know what to do about it, how to try other ideas, patterns, revisit things that he might have overseen.

Douglas Crockford said that once you understand monads, you suffer the curse of not being able to explain it anymore. I really wonder how often this happen, how many students learned shortened and polished stuff in books without the whole story, the whole experience of the actual researcher ? This tiny offset in understanding, repeated generation after generation .. might lead to large divergence. Saying this after watching (too many?) A. Kay talks about how people ignore the past, they ignore it because it's not taught that is all.

xigency · on July 17, 2015

Good points. I'll take it with a grain of salt.

I started writing this piece because a friend of mine recommended I write an article about Duck and post it on Reddit. I don't do a lot of writing and I started thinking about it and it seemed like it would really be a huge task.

HN crowd is a little bit more "sophisticated" so maybe that helps, but writing it I felt like it was going to be harder to get to the end when I started, so I really skipped over all of the details at the end. I'd like to give a better explanation of both the parsing step (to an understandable degree) and the interpreter itself, so that it is easier to follow along and implement the steps.

When I developed the project, it was sort of a stop-and-go process. I wrote an LR parser because I wanted to brush up on topics from a compilers course and it seemed like the most exhaustive way to do that. With adjustments, I can use the tables generated with any language to work on my next programming languages project.

After finishing the parser/parser generator, I was mainly throwing context-free grammars at the wall to see what was working or wasn't, without really analyzing them to see if the fit in the LR(1) category. I used a mini Java CFG I had from the compilers course and played around with that for a while, and I started working on a C grammar but that became annoyingly complex. Duck represented a kind of pseudocode that I could use.

Eventually I came back to it, maybe about a year or two later, and started to realize how easy it would be to get this thing able to run programs. So in about the course of a week I had the initial runtime functional.

I appreciate any advice for ways to explain it that might be better. I think this piece could use some editing. I would like to expand parts, I'm just concerned that it would be too long to read then. I think the initial suggestion was supposed to be a brief explanation, but I thought the only way to write this would be to go over everything completely.

I'll be sure to go back and improve this as it progresses. I've also been meaning to make improvements to the programming language itself, so I need to find a way to balance my spare time.

jrapdx3 · on July 17, 2015

> ... so I really skipped over all of the details at the end. I'd like to give a better explanation of both the parsing step (to an understandable degree) and the interpreter itself, so that it is easier to follow along and implement the steps.

That pretty well describes the problem and the way to make it more digestible. Giving a step-by-step explanation of the parser will be very useful. Illustrating what the parser is doing with clear examples would be very helpful.

Reading it again, I think the "sticky" parts begin with discussion of the grammar. A couple of examples, then the reader encounters a blizzard of phrases, "<terminal>", "<non-terminal>", "<S>", etc., strewn through the explanatory paragraphs which rapidly become confusing. Readers may be wondering "what 'production' are we talking about now?" and so on. Once losing the thread it's impossible to follow the text all the way down.

True, HN attracts a more knowledgeable crowd. Even so I'm sure there are many here who are not that familiar with the intricacies of parsing but want to gain a better grasp of it. Condensing the article's preliminaries, while expanding the parsing (and subsequent) sections and slowing down the pace, would likely make it better for both author and the reader.

Edit: typos

kwhitefoot · on July 17, 2015

I've only skimmed the beginning of the article but it seems to me that this is not about designing a language but about implementing a language. If you are going to design a language you have to consider a lot more than static versus dynamic.

I would say that the two main camps were typed versus untyped not static versus dynamic.

What about functional versus imperative versus actor.

So, in my opinion it dives much too deep into implementation much to quickly without any meaningful discussion of the design of the language or the process of designing a language.

Or, perhaps it's just the title that is wrong.

Patient0 · on July 17, 2015

I agree - all he needed to do was implement a recursive descent parser. You don't need to understand all of that LR stuff to implement a parser.

pearle · on July 17, 2015

Jonathon Blow has a great series [1] on YouTube where he's designing and implementing an alternative to C/C++ specifically aimed at meeting the needs of modern game programmers.

It's an interesting watch.

[1] https://www.youtube.com/watch?v=TH9VCN6UkyQ&list=PLmV5I2fxai...

xigency · on July 17, 2015

This is similar to a talk that Tim Sweeney (Epic Games) gave about creating a new language for games, that I thought was interesting. In particular, he talks about the tradeoffs of using C# over C++ for game development, and then talks about ways that ideas from Haskell or purely functional languages can be brought over that might improve performance and productivity for developers.

"The Next Mainstream Programming Language"

(https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced...)

shadowfox · on July 17, 2015

I remember seeing this long back. Did he (or anyone else) make progress on this?

xigency · on July 17, 2015

Um, I sort of doubt it. Unreal Engine 4, while I haven't dug around with it, is presumably still C++ and low-level stuff.

The way the game industry is moving is still along the same path. With a lot of amature and professional developers moving towards platforms like Unity, it's sort of a backslide in my opinion. Then again, it's almost as if the target platforms have become too powerful for people to really care what they're developing on.

(My opinion, not trying to be controversial.)

kctess5 · on July 17, 2015

Great read - concise yet decently in-depth. Something cool about parsers is how many practical uses there are, and how easy they are to play with.

I decided to learn Go recently, so I first wrote a little parser combinator library (which also returns concrete syntax trees). Then I implemented a little grammar for a shorthand to define parsers, and did the cst -> parser step to make it spit out working parsers. Example here [1] I got it to work for recursively defined context free grammars with some fun [2] deferred function reference trickery.

I haven't decided what to do next, but I've been kicking around the idea of writing a notation to describe cst -> ast conversion. That way one could easily define parsers that generate abstract syntax trees (which can be much easier to perform high level graph manipulations on then the underlying cst.) This, paired with a lexer, could be used for quite a few practical applications, including writing special-purpose languages/compilers.

I'm also interested in investigating the efficiency/time complexity of the library I've made and seeing what can be done to speed things up. I'm fairly sure a few tricks could speed things up significantly. Could be interesting to do some profiling to see how Go is doing with the functional code.

I did this all as a learning exercise/toy project, so I know that there are similar things already out there and that this is a solved problem. I'll might write something about all this once I do something more interesting with it. Eventually I might try to write an optimizing compiler...

[1] https://github.com/kctess5/Go-lexer-parser/blob/master/main.... [2] https://github.com/kctess5/Go-lexer-parser/blob/master/parse...

keedot · on July 17, 2015

I accept that some people don't know how, or don't want to dedicate the time to making a responsive site. I get that, I have sites that don't make sense in mobile, so we don't bother with a mobile design. But don't take away the ability to zoom on mobile if you don't. You lose a good portion of your audience.

xigency · on July 17, 2015

Try refreshing the page.

The site is hosted on GitHub pages and I used a template. I found this tag in the HTML

And changed it to

keedot · on July 17, 2015

Much better, thank you!

sunilkumarc · on July 17, 2015

The page is not responsive. Hosting on Github pages doesn't mean the web page will be responsive. It depends on the template you have used which is not responsive apparently.

xigency · on July 17, 2015

Yeah, I'm not a UX guy, I'm a programmer. It's plaintext rendered with HTML. The <meta> tag above prevented scaling on WebKit phones. I don't really believe in using templates or overlays because I'm afraid of them interfering with usability, disabling the ability of a viewer to read the text, or being distracting.

If you are interested in the content I've posted, you would be best off printing it out and reading it. It's probably easier on the eyes.

acqq · on July 17, 2015

You still have widths expressed in pixels which still limits the chance of your text being readable on some smaller devices. At least I can report that your page is fully readable using Safari on iPhones in the "reader mode" (which avoids all the original formatting of the page).

posnet · on July 17, 2015

He was responding to the fact that user zooming was disabled and that he changed it so it works now.

girvo · on July 17, 2015

xigency didn't say it was responsive, merely that the template they were using is disabling user scrolling by default, which they have said is now fixed.

arundelo · on July 17, 2015

"Additionally, there are certain types of circular references that may never be freed under this scheme, although that's not a huge concern to look at from the start."

Early versions of PHP were released in the late 90s, with PHP 4, the first Zend Engine version, released in 2000. The first PHP that didn't leak cycles (PHP 5.3.0) was released in 2009. Beware of how long "not a huge concern to look at from the start" can last.

http://php.net/manual/en/features.gc.collecting-cycles.php

http://php.net/ChangeLog-5.php#5.3.0

xigency · on July 17, 2015

Yes, that is a good point. I had an adviser in college who specialized in garbage collection algorithms and I unfortunately didn't have the chance to take his course or learn from him, but there are definitely tradeoffs to all of the methods out there.

charriu · on July 17, 2015

I think the first section confuses a couple of things.

First, a program can be compiled (static) or interpreted (dynamic) as stated in the article. However, that does not mean that you can't have a dynamic type system in a compiled language, or vice versa.

Also, if you add type inference, the examples given for variables in dynamic languages are perfectly valid examples for variables in a language with a static type system (the type would just be defined on first assignment).

oxymoron · on July 17, 2015

Static and dynamic are theoretically unrelated to compiled or interpreted. The reason why it seems like these two are the same is that it's much harder to implement a compiled version of a dynamic language than an interpreted dynamic language, although I'm sure it's been done. There are plenty of interpreted static languages though.

I'm missing a discussion of weak vs strong in the article though. Visual Basic might be statically typed but it does have implicit type coercion which gives it a very different feel from — say — C#.

vidarh · on July 17, 2015

It's harder to compile, period. But's it's in many ways easier to compile a dynamic language than a static language, as you can implement a single mechanis for defining classes and calling methods. That mechanism can often be very similar to how it would be in an interpreter.

What is harder is to achieve the same performance jump from interpreted to compiled. Exactly because the "naive" approach to compiling dynamically typed languages gives you similar machinery to a well written interpreter, while for static language the "naive" approach often gets you much more efficient results.

qznc · on July 17, 2015

> it's much harder to implement a compiled version of a dynamic language

No it is not. All major dynamic languages can be compiled usually JIT. Every major browser has a JIT compiler for Javascript. Python, Ruby, and Lua have JIT-compiled forks/reimplementations.

The difference of JIT vs ahead-of-time compilation is irrelevant for the high level of the article.

sklogic · on July 17, 2015

> All major dynamic languages can be compiled usually JIT.

JITs are profoundly different from standalone compilers. And are many times more complicated.

> The difference of JIT vs ahead-of-time compilation is irrelevant for the high level of the article.

No it's not irrelevant. JITs are starting to kick in at a very different point than the static compilers: they've got a path trace, runtime type information and profiling information, while the static compiler only got an AST.

JITs can do gradual refinement, and for this they may implement multiple different compilation techniques. Static compiler must do all it can at once, there is no other chance to improve.

KC8ZKF · on July 17, 2015

Common Lisp is an example of a compiled dynamic language.

xigency · on July 17, 2015

And a rather opaque one at that. I've never heard of anyone discussing the implementation details of a Scheme or Lisp compiler, leading me to believe that they must have really complicated internal logic.

I might be better served applying my time towards development on Chicken (Scheme).

xigency · on July 22, 2015

In response to being downvoted — I would assume that tracking all of the records in function closures (first-class functions) would be a real headache, though slightly aided by the fact that garbage collection is available, and the question of whether to support eval raises a number of issues.

In any case, compiled Common Lisp is not something that resembles other compiled languages... and the purpose of my article is to discuss either designing an interpreter or a compiler, not a dynamic compiler, recompiler, or compiled code that requires some kind of runtime interpreter as a side, all of which are extraneous use cases for hackers not people learning how to program. They also make for poor pedagogical tools.

_19qg · on July 22, 2015

There are books about Lisp implementations and papers about various Lisp compilers. Actually there is a whole bunch of literature about it.

Personally I find that the article needs a bunch of improvements:

static vs. dynamic vs. statically typed and dynamically typed

These are different things. Either we are talking about runtime behavior (a dynamic language can change itself or the program at runtime) or we are talking about typing.

For example Java is a statically typed language, but it allows the use of class loaders which can reload classes at runtime - which provides some dynamic features. On the other end of the spectrum are many compiled Lisp implementations which allow various changes to the programming language at runtime, for example via an extensive meta-object protocol.

> Attempting to access a value that has not been named in a relevant scope leads to a syntax error issued at compile time.

A syntax error? Really? Syntax?

> So called dynamically typed languages are sometimes referred to as duck-typed or duck languages.

Definitely not. Duck typing is a only a special form of dynamic typing, usually in an object-oriented language. Non object-oriented dynamically typed languages don't provide duck typing, because they lack OOP objects.

Does compiled Lisp look different?

Maybe... Let's look at SBCL on a 64bit Intel processor:

    * (defun calc (a b c) (+ (* 2 a) (* 3 b) (* 4 c)))

    CALC

    * (disassemble #'calc)

    ; disassembly for CALC
    ; Size: 118 bytes. Origin: #x1002EC2599
    ; 599:       48895DE8         MOV [RBP-24], RBX               ; no-arg-parsing entry point
    ; 59D:       BF04000000       MOV EDI, 4
    ; 5A2:       488BD1           MOV RDX, RCX
    ; 5A5:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
    ; 5AB:       41FFD3           CALL R11
    ; 5AE:       488BF2           MOV RSI, RDX
    ; 5B1:       488B5DE8         MOV RBX, [RBP-24]
    ; 5B5:       488975F0         MOV [RBP-16], RSI
    ; 5B9:       BF06000000       MOV EDI, 6
    ; 5BE:       488BD3           MOV RDX, RBX
    ; 5C1:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
    ; 5C7:       41FFD3           CALL R11
    ; 5CA:       488BFA           MOV RDI, RDX
    ; 5CD:       488B75F0         MOV RSI, [RBP-16]
    ; 5D1:       488BD6           MOV RDX, RSI
    ; 5D4:       41BBD0010020     MOV R11D, 536871376             ; GENERIC-+
    ; 5DA:       41FFD3           CALL R11
    ; 5DD:       488BDA           MOV RBX, RDX
    ; 5E0:       48895DF0         MOV [RBP-16], RBX
    ; 5E4:       488B55F8         MOV RDX, [RBP-8]
    ; 5E8:       BF08000000       MOV EDI, 8
    ; 5ED:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
    ; 5F3:       41FFD3           CALL R11
    ; 5F6:       488BFA           MOV RDI, RDX
    ; 5F9:       488B5DF0         MOV RBX, [RBP-16]
    ; 5FD:       488BD3           MOV RDX, RBX
    ; 600:       41BBD0010020     MOV R11D, 536871376             ; GENERIC-+
    ; 606:       41FFD3           CALL R11
    ; 609:       488BE5           MOV RSP, RBP
    ; 60C:       F8               CLC
    ; 60D:       5D               POP RBP
    ; 60E:       C3               RET
    NIL

Now we can even define the types in Common Lisp:

    * (defun calc (a b c)
      (declare (fixnum a b c))
      (the fixnum
           (+ (the fixnum (* 2 a))
              (the fixnum (* 3 b))
              (the fixnum (* 4 c)))))
    STYLE-WARNING: redefining COMMON-LISP-USER::CALC in DEFUN

    CALC
    * (disassemble #'calc)

    ; disassembly for CALC
    ; Size: 38 bytes. Origin: #x1002F4B802
    ; 02:       48D1E1           SHL RCX, 1                       ; no-arg-parsing entry point
    ; 05:       488D047F         LEA RAX, [RDI+RDI*2]
    ; 09:       48D1F9           SAR RCX, 1
    ; 0C:       48D1F8           SAR RAX, 1
    ; 0F:       4801C1           ADD RCX, RAX
    ; 12:       48C1E602         SHL RSI, 2
    ; 16:       48D1FE           SAR RSI, 1
    ; 19:       4801F1           ADD RCX, RSI
    ; 1C:       48D1E1           SHL RCX, 1
    ; 1F:       488BD1           MOV RDX, RCX
    ; 22:       488BE5           MOV RSP, RBP
    ; 25:       F8               CLC
    ; 26:       5D               POP RBP
    ; 27:       C3               RET
    NIL
    *

xigency · on July 27, 2015

Overkill.

This article isn't a dictionary entry, an encyclopedia, or a research paper. It describes what's necessary to build a programming language for someone who has the necessary skills, which is someone who programs.

The entire premise of this programming language isn't based on as rigorous ideas as what you're criticizing, and that's been the reaction since the first time I posted about it.

As for what I was talking about, I'm only interested in dynamically typed languages that are compiled and provide support for 'eval in this case, because I don't believe that this is possible without providing a complete library for an interpreter with a compiled executable. For every other case, who really cares? Being able to replace every variable declaration with the keyword `auto' does not a new (or useful) programming language make. Additionally, all of these other gray areas in-between are not something I am interested in.

For this purpose, when speaking of the Duck programming language, dynamicism refers to being able to literally manipulate types and data in any way imaginable, both in terms of runtime behavior AND typing. In any case, it is designed to be the most dynamic language, as the union of all of these features, and as such that invalidates a huge number of complaints.

>> Attempting to access a value that has not been named in a relevant scope leads to a syntax error issued at compile time.

> A syntax error? Really? Syntax?

This is a very minor complaint and mirrors 99% of the criticism I've received.

I wish I could get more interesting feedback for the content of my writing rather than the semantics.

bigtunacan · on July 17, 2015

As is JavaScript

sklogic · on July 17, 2015

No, they're very far away from each other on the dynamism scale. There is absolutely no point in trying to compile JavaScript statically, but yet it works very well for Common Lisp. The opposite is also true, smart tracing JITs won't do much for Common Lisp, but shine for JavaScript.

bigtunacan · on July 17, 2015

I was just adding on that JavaScript is another example of a dynamic language that is compiled (using JIT in this case) rather than strictly interpreted such as the MRI implementation of Ruby.

bigtunacan · on July 17, 2015

Since both this and my parent post on JavaScript being compiled were downvoted without comment I can only make the assumption this is because someone came along who believes this is not the case.

So first I will link out to this StackExchange doc that does a pretty good job of describing this.

http://programmers.stackexchange.com/questions/138521/is-jav...

The JavaScript specification says nothing about the language being interpreted or compiled. Today; most current versions of major browsers use a just in time compiler to handle JavaScript. V8 (Chrome & Nodejs), SpiderMonkey (Firefox), Chakra (InternetExplorer), Carakan (Opera), JavaScriptCore(Webkit/Safari)

For more information take a look at the Wiki entry for EcmaScript engines.

https://en.wikipedia.org/wiki/List_of_ECMAScript_engines

nemoniac · on July 17, 2015

Could you elaborate on this point?

sklogic · on July 18, 2015

The biggest gain of the tracing JITs is elimination of dynamic dispatch. CL does not use it to an extend comparable to JavaScript, Python or Smalltalk.

_19qg · on July 22, 2015

That depends largely on the implementation and libraries. Some make extensive use of CLOS in I/O for example.

xigency · on July 17, 2015

The only difference for me is how often I need to cast types so I guess that's kind of lost on me.

oxymoron · on July 17, 2015

Sure, that's what it comes down to. Languages with implicit type coercion do have a set of bugs not found in strongly typed languages though, but the additional safety of a strong language is -- like you say -- gained at the expense of writing casts. That does seem like a trade-off worth discussing in an article about programming language design.

hvidgaard · on July 17, 2015

C# allows you to do implicit casts when you do not lose information, e.g. int to long. I cannot remember the last time I had to cast to something where I would potentially lose information.

However, I don't think manual casting is a cost at all - it's more of a guide telling you something in the design is wrong, or a reminder to make a comment of why you do it. And for any long maintanence projects, you want to catch all the bugs you can as early as you can.

xigency · on July 17, 2015

Yes, but those are all tasks which the compiler has to do when compiling source. So after a programmer has written code but before it can even run.

Really any run time could completely disregard types, it just might lead to a huge number of failure cases. These generalizations are there for us to start to get things to work.

When programming languages add features like these, especially 10 or 20 years after they were developed, they definitely feel tacked-on and if you inspect how they work, it's not pretty. If it's an interesting quirk then it might be an example to study.

bigtunacan · on July 17, 2015

Designing a programming language is something I've found very interesting for the past few years now, but this is the one area of programming that really still eludes me and seems unapproachable. I've been through Understanding Computation, but at a certain point I just no longer understood what the hell was going on I was just typing in the examples.

I have the Principles of Compiler Design book and I've made a run at that a couple of times, but it seems to theoretical for me to do anything useful with it.

I've been thinking about taking a go at creating something with ANTLR alone with the books The Definitive ANTLR 4 Reference and Language Implementation Patterns, but I'm not sure if I'll have much more luck there.

If anyone knows of a better path to get from here (can't design my own language) to there (clarity and understanding) I would love to hear your ideas.

I'm not trying to create the next Rust, JavaScript, etc... I just believe that being able to implement my own language would give me a deeper understanding. In fact I'm very much interested in the idea of reimplementing an interpreter or compiler for languages I already work with as a way of learning this rather than trying to create something new.

sklogic · on July 17, 2015

A word of advise - forget about parsing for now. It's unimportant anyway and obscures the whole thing beyond recognition.

Start by designing embedded DSLs on top of sufficiently expressive languages (i.e., any language with compile-time macros), it will give you enough of understanding of how compilers work. From there you can move to standalone compilers and, finally, if you ever need to (most often it's not necessary), to the syntax front-ends.

rspivak · on July 17, 2015

I started a series of articles called "Let’s Build A Simple Interpreter" which might be what you're looking for:

- http://ruslanspivak.com/lsbasi-part1/

- http://ruslanspivak.com/lsbasi-part2/

xigency · on July 17, 2015

falcolas · on July 17, 2015

I bring this up every time, because it's a great approach to writing a compiler, and for showing off Lisp. Even better, if you get really stuck, there are already examples out there to learn from, in multiple languages.

Make A Lisp (MAL)

https://github.com/kanaka/mal

xigency · on July 17, 2015

Lisp is great because it's easy to parse. There's really no need to have either section on lexer or parser when working with s-expr. This would be a great example of where a recursive-descent parser excels, and not only is it easy to build, it's easy to understand. I think writing an interpreter or compiler for Scheme or Lisp would be a much better place to start for learning how to implement programming languages, it's just not what I chose to do here.

kctess5 · on July 17, 2015

I just made a comment here, I might suggest that you do something like what I did. A solid understanding of parsers, lexers, and syntax trees would be an excellent place to start if you're interested in writing your own language.

Start with the basics, get a firm grasp of that, then move on. The 'language design' part of things seems to be basically just figuring out how to combine basic ideas (parsing, lexing, ref counting, etc...) in new, fancy ways (or not so fancy, depending on your goals.) If you're interested in compilers/languages then you're probably going to get mighty familiar with manipulating graphs, so I'd try to read up on some graph theory. Language design/implementation gives us the opportunity to mathematically prove correctness of various features, which can be fun, if you're into that kind of thing.

I think another great way to learn about language design is to actually just look at other languages and play with them. Once you know know the basic concepts it's usually easy enough to guesstimate how various language features work behind the scenes just by using them. A lot of very smart people put a lot of time into making languages like Rust/Go/etc, so learn from them, and learn their weaknesses.

p.s. I'm not sure how much you've already researched, so I apologize if this is above/below where you're at.

tel · on July 17, 2015

So as you're probably aware, languages arise from the composition of a few simpler steps:

1. Strings to ASTs (parsing and, for efficiency and ergonomics, lexing)

2. AST to "outputs" (interpretation)

3. AST to "better" AST (optimization)

4. AST to "analysis outputs" (verification, maybe type checking)

5. AST to some other language, maybe assembly (compilation/transpilation, sometimes called codegen, depending on how you look at it)

You should pick the ones of these you are most interested in! Most likely, to be honest, this will not be parsing or lexing, but if they are you're in luck because there's a lot of research here.

Instead, just get enough parsing so that you can stop worrying about it. Parse a lisp-like language (very easy) or use an embedded DSL (harder, more flexible).

From here, focus on step (2) as it's a significant design challenge and will get your feet very wet in what PLs are. You may never have to focus on the others depending on your proclivities and the styles of languages you want to build.

Steps (3, 4, 5) are all deep rabbit holes themselves. Steps (3) and (5) mostly are required if you're interested in knowing how to make languages fast—although (5) is also necessary if you want to target a particular platform—e.g. write something to run on the Erlang VM. (4) is REALLY interesting, in my opinion, but optional if you want it to be.

lmm · on July 17, 2015

> If anyone knows of a better path to get from here (can't design my own language) to there (clarity and understanding) I would love to hear your ideas.

http://www.hokstad.com/compiler was what made it make sense to me. I found the dragon book approach (where you start with this 4-stage design and you're supposed to implement a parser and lexer and all this because I said so) hopelessly confusing and unmotivated. Hokstad takes you through writing a compiler in a way that makes sense, the same way you'd write any other program.

aodh · on July 17, 2015

I strongly recommend following the Kaleidoscope LLVM tutorial: http://llvm.org/docs/tutorial/

It should take maybe a day to work through. Even if you don't plan on ultimately using LLVM as a back-end compiler, this touches on a number of other useful concepts that will make your journey much easier.

I wanted a more flexible parser for the language I'm working on, so I'm using a variant of an Earley parser that I wrote five years ago. ANTLR uses LL parsing.

dan00 · on July 17, 2015

It might help to start with something more low level, like writing a compiler and emulator for an assembly language. A quite good candidate for this are artificial, simplified cpus, like the dcpu16.

qznc · on July 17, 2015

http://prog21.dadgum.com/30.html

https://programmers.stackexchange.com/questions/165543/how-t...

Or do you want to write an interpreter instead? Well, the frontend is the same anyways. Instead of transforming the AST, you evaluate it instead.

bigtunacan · on July 17, 2015

More likely an interpreter than a compiler in the beginning. Ideally I would like to reimplement something like Ruby, R, or JavaScript since these are the languages I work with on a more day to day basis.

vidarh · on July 17, 2015

I would recommend against Ruby and Javascript (R I don't know enough about) for a first attempt.

I'm working on a Ruby compiler (http://hokstad.com/compiler). An interpreter is far easier, but still a huge amount of work. I'm sure it'd be fun, but I'd suggest trying a smaller/simpler project first and then revisit Ruby or JS.

Ruby is an absolute nightmare to parse. Both Ruby and Javascript have complex execution models that seem deceptively simple at first glance. They're quagmire's.

If you do want to do something Ruby-like, define a cut-down version. E.g. write down a grammar that covers the most important subset of Ruby that you use, and similarly define a small set of the core functionality. Parsing 80% of Ruby is easy. Chances are you've never used or seen most of the remaining 20% outside of rants about the complexity of the Ruby grammar. E.g. the plethora of ridiculous supported mechanisms for quoting text that allows you to do criminally unreadable stuff like "% x " (which, excluding the double-quotes parses as the string "x", as "%" introduces a quoted sequence started and terminated by the character after "%", except if the character is one of a set of exceptions of various kinds; this of course only when it occurs in a context when it can't be parsed as the infix modulo operator....)

You can write a full compiler for a simpler language in the space it takes to just parse Ruby.

You could probably define a suitably small cut-down version of Ruby that'd still be fun/usable relatively easily, especially if you're not concerned about actually being compatible.

sea6ear · on July 17, 2015

Lua might be a decent language to write an interpreter or compiler for.

It's a fairly small language and I've seen enough re-implementations of it that there some is evidence that writing a new version is reasonably doable.

amelius · on July 17, 2015

Anybody know the difference between the "green" and the "red" dragon book?

pdpi · on July 17, 2015

green dragon book is https://en.wikipedia.org/wiki/Principles_of_Compiler_Design

red dragon book is the first edition of https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...

dummy7953 · on July 17, 2015

I really wish more help from human-computer interaction & cognitive psychologists would be used to design a programming language.

The programming language is (to me) the most expensive part of a system, because it determines how the coder works & interacts with the system. And the human resources involved in building the system are the most expensive (besides those used to maintain and run it daily).

There must be symbols used to write code that are more easy to cognitively process than others. There must be keywords that are easier to memorize than others. There must be ways to write a statement that are less prone to error than others. Let's start working on this important and neglected problem.

tormeh · on July 17, 2015

Those are the "language by committee" that people complain about. Ada was made that way, if I remember correctly.

sklogic · on July 17, 2015

Nice. But, as usual, a bit too much emphasis on syntax and parsing - the least important part of the language. Semantics and intermediate representations are much more important. Also, not quite convinced about the choice of C vs. LLVM IR.

qznc · on July 17, 2015

Why intermediate representation? Build a straightforward AST then convert to LLVM.

Semantics is important. Before you start writing the frontend, I would advise to write some programs in your language like a tutorial for newbies.

sklogic · on July 17, 2015

> Why intermediate representation?

Because you'd need not just one, but many of them, in order to keep things simple and manageable - for the philosophical background on this read about the Nanopass approach: http://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf

> Build a straightforward AST then convert to LLVM.

It's only possible for the very low level languages without any interesting features.

Firstly, you'd need some type system, even if it's mostly dynamic. Secondly, your frontend may provide quite a bit of syntax sugar that will need further lowering before you can easily turn it into IR.

Even before typing, you'd have to handle the lexical scoping, which may also involve an additional intermediate representation (one distinct IR per each pass is normal).

With this approach you can keep your entire compiler pipeline as a sequence of trivial, fully declarative, pattern-based tree rewrites. With the less fine-grained approaches you'd end up with a barely maintainable mess, where multiple passes functionality is combined into a single bloated, twisted pass.

> Semantics is important.

Of course it's the only thing that is important. Unlike syntax.

davorb · on July 17, 2015

> one distinct IR per each pass is normal

Source?

sklogic · on July 18, 2015

> Source?

1) Personal experience

2) The link I provided above (and all the other relevant Nanopass papers too)

chrisseaton · on July 17, 2015

> Build a straightforward AST then convert to LLVM.

How well do you think LLVM is going to optimise that code? You can't just give LLVM dumb IR directly generated from your AST and expect it to somehow work magic.

LLVM is low level. You need to lower your high level operations, such as method dispatch and metaprogramming, a long long way before LLVM will do anything sensible with optimising it.

For example Rubinius generates LLVM IR directly from bytecode for Ruby (so it's even a level more sophisticated than you suggest), and for every benchmark I run it's basically no faster than MRI!

For another historical example see Unladen Swallow.