
I found the code surprisingly readable.

Most variable names are what I expected them to mean despite their shortness: pc, sp, bp are registers, a is the accumulator, fd is a file descriptor (of the input file, what else?), tk is for the token, t is temporary, etc... For the less obvious ones, it is usually not that hard to infer their meaning from either the code or comments.

Because yes, there are comments, not many, but they are helpful. For example, the VM has unusual instructions (for me) like LEV and ADJ, and they are commented. The "obvious" ones like MUL and SHR are not.

The variable names are not "needlessly cryptic". I've seen (and written, not proud of it) a lot of needlessly cryptic variable names, and believe me, these are crystal clear by comparison. Here, there is a clear influence from assembly mnemonics that really helps understanding.

Now on the why. This is minimalism, and minimizing the number of comments and variable name length is part of it. It is actually a very interesting exercise. The golden rule in making understandable code is making it as short as possible. There is a limited amount of space on your screen and in your mind, and the shorter your code is, the more you can see/understand at once. Of course, too much is too much, you don't want to do things IOCCC style, and striking a balance is difficult. So once in a while, reading or writing very compact code can help you understand where shaving off characters is fine and where it really hurts understanding.




> The golden rule in making understandable code is making it as short as possible.

Absolutely not true, in any way.

And I fail to see a single reasonable argument for why writing "tk" should be better than writing "token". Just name things using real words. Cryptic abbreviation does not help anyone.


You will mentally read "token" every time you see "tk" anyway.

The same way you "grep", "ls" or "cd" --- it easily becomes as natural as any other language, unless you consciously try to stop yourself from learning.


No, you won't. You'll read "tk", and then do a little mental substitution to "token". It's tiny, but it adds to your mental load.

You could just not do that, and be happier. Why add the extra indirection?


I do agree that `tk` is an unusual abbreviation, but `tok` for `token` is pretty common in language implementations. `ty` for `type` is also usual.

The names are not only read but also manipulated in your mind. You will see tons of `token`s and `type`s throughout the code, and while you can read them fine, you will have a hard time dealing with them in your mind. For uncommon names that are usually read out of context, longer names are preferred. For common names or local enough names, short mnemonics that evoke the original name really help. When that is not possible, people usually develop a specialized terminology for them.


I don't have any problem holding properly named variables in my mind, so I don't see what the problem is here?

The only issue is that if you are writing gigantic statements with lots of variables, it gets too long to see properly. And that is a sign that you are writing too large expressions, and need to start cutting them into smaller parts anyway.


You don't always replace X and Y (as in coordinates) with something longer. I know, they can be (and probably should be) replaced with left/right and top/bottom when we are talking about boxes, but not all X and Y can be replaced.


You replace them with abscissa and ordinate actually. And Z will be applicate, it seems, although I have also heard azimuth used for this one. I'm not aware of any dedicated terms for higher dimensions, so your beautiful (w,x,y,z) quaternion tuple is a big deal, isn't it.

Anyway, you can always use an array `coordinates` with whatever dimensions you want.


Sure, but those are commonly used names and well established, so they act as actual, proper words in this kind of discussion. There are some widely accepted short names that can still be descriptive.

But arbitrary abbreviations like "tk" don't fall into this category.


You need some basic understanding of 2D graphics to understand what X and Y are. In fact, they are not good names otherwise because it is pretty easy to mix them up---one more reason to prefer left/right/top/bottom whenever possible. I've seen some code using Column and Row instead of X and Y for that very reason. (Not to say that I like it.)

If you are okay with X and Y, you should probably accept that the definition of "commonly used", "well established" and "actual, proper" words is subjective and that different areas and projects have different notions of them. I'm okay with `tk` if it is used consistently and doesn't interfere otherwise.


Yes they do. tk, pc, sp, a, etc all fall into the well established category in this context.


The first time you'll read "HN", and then do a little mental substitution to "Hacker News".

Why do humans abbreviate?


Unfortunately, it seems that most of the time it's because they are lazy sheep who have no clue what 90% of the abbreviations they use mean, but don't want to look like the only ignorant person in the room by asking, and won't take the time to look them up later, even for the 10% they encounter most often.


Bad example. ls and friends are meant to be typed constantly and therefore need to be short.


I read "grep", "ls" and "cd" as "grep", "el ess" and "cee dee". I don't even know what grep and ls stand for.


ls is a shortening of "list", i.e. list directory contents, and grep is apparently Global Regular Expression Print.


grep comes from an amalgamation of commands for a global search applying a regular expression and printing the results, i.e. “g/re/p”.

ls just simply stands for list


No, we won't; "token" is a function name, not a local variable name. If it was written "token", we'd have to substitute it back to "tok" every time we read it.


Once you know “tk” means token, you don’t need it to be “token”.


Absolutely true. C4 doesn't go far enough.

The source of my favorite C compiler, on the other hand:

http://www.kparc.com/b/


This source would look beautiful as pattern on a t-shirt or socks. :-)


Why not just write out ProgramCounter, StackPointer, Accumulator, InputFileDescriptor etc? It doesn't take significantly longer to type. It's faster to read because you can recognize the shape of those words and don't have to mentally substitute the actual words. Code is for reading, not for writing.


Is it faster to read? Maybe in terms of bytes/s, but not in lexemes/s nor in terms of getting an overall idea of how everything works.

Compare with this, which is probably more in the style you're thinking of:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/C...

There's so much "noise" that it's hard to see the "big picture", and the repetition of VeryLongIdentifiers causes https://en.wikipedia.org/wiki/Semantic_satiation to occur quickly.


If you're talking about a small, quickly-written, one-off piece of code, then I think truncated variable names are OK.

If the code is anything that anyone else (including future you) will have to read, or a part of a larger system, then descriptive variable names are best.

I can't count the number of times I've dropped into some source code with variable names that didn't mean anything and with no comments describing what they mean.


If one understands anything at all about a cpu, then pc, sp, a mean something instantly.


Agree on those, but t and tk can mean anything. We can symbolize everything and, as someone else argued, once you know tk is token you can just read it, but replacing variable names with the shortest possible names is still called obfuscation for a reason.

I personally also hate the Java (and to a lesser extent, C#) custom to write MemoryLocationRepresentation when you can say pointer, but there is certainly a middle ground. Token is 5 characters, not 30.


Consider non-native English speakers. For them, the abbreviation makes it much harder to read.


I am not a native English speaker. Reasonably skilled, but nowhere near native.

Abbreviations don't make it harder because of it. If anything, it is less of a problem. Because using the proper English word doesn't help more than using an abbreviation if you don't know the meaning of the English word in the first place.

On a side note, I have more trouble understanding code written in French (my native language) than in English. Simply because when we learn programming, we learn it with the English terms. For example, we know what a "token" is in the context of a "parser", that's what we call it. The French translation would be "symbole" and "analyseur syntaxique" respectively, but you will be better understood if you use the English words.


>Because using the proper English word doesn't help more than using an abbreviation if you don't know the meaning of the English word in the first place.

If you don't know the meaning of an English word, you can use a dictionary. If you don't know the meaning of some ad hoc abbreviation, then unless you can waste even more human time by asking people who are already in on the secret, you are left on your own.

> On a side note, I have more trouble understanding code written in French (my native language) than in English.

US soft power is strong, that's all. It's people's duty to take care to better master their own languages if they don't want to see them become ineffective for their daily linguistic needs.

People know what a token is in the context of a parser only after they have learned it. When this is not in the learner's native language, they will most likely learn it without having a clue how it fits into the semantic network of English. If a French speaker is first introduced to this notion using the term "lexie" (which also exists in English by the way, as a borrowing from French, this time in linguistics), chances are far greater that it will evoke something meaningful to this person, as it's lexically close to the term lexic. Using French morphemes, one could also easily produce terms like métataxeur[1], or even distaxeur and transtaxeur.

>but you will be better understood if you use the English words.

Chances are greater that they will see what you are referring to, as they have come across the term more often. It doesn't necessarily imply that they will better understand what it means. When a notion is well assimilated, it's recognized in any language one has mastered, even when it's expressed through a brand-new metaphor.

[1] see https://fr.wiktionary.org/wiki/m%C3%A9tataxe and https://fr.wiktionary.org/wiki/-eur


Was there a period in the 1960s or 1970s where French speakers used native terms instead of English for computing terminology?

I'm wondering about this because a Brazilian friend is doing a computer history project and he noticed that 1970s documentation used literal Portuguese translations of English technical terms, and the translations are no longer transparently comprehensible to present-day Brazilians because of the subsequent switch to using the English terminology. For example, the documentation refers to a "montador", and he had to translate that into English for his Brazilian audience ("assembler").


If they don't speak English, it matters even less...

(I've read code written by Chinese --- variables named dzhq, xljn, etc. are not uncommon. If anything, they like to abbreviate even more.)


If they're not fluent in the same abbreviations but have decent English-as-a-second-language skills, they can read Roslyn style code but not 2-letter abbreviations.

Heck, I can't even read my own 2-letter abbreviations a year later sometimes.

When I write the code, I'm likely coming off reading a paper or datasheet that used certain abbreviations. I might have seen the word "token" so many times that week that, in the moment, I can't imagine what else 'tk' might mean. But when I come back a year later, fresh off a heat stake project that used K-type thermocouples, seeing 'token' is much clearer.

If those Chinese variables were named DaanZenghQian (sorry, I know my Mandarin sucks) instead of dzhq you might have a chance to translate that into "result of the upper thousands" for whatever that means in your context.

Pretend you're someone who doesn't have exactly the state of mind and background knowledge you have right now. That might be a Chinese person with limited English, it might be your coworker who was working in Delphi instead of assembler in the 90s, it might be yourself with a bit of time elapsed. That's the person who you need to be writing for, not for you in the moment of writing it.


I have read a lot of Korean code, and while there are lots of Latin transliterations, abbreviations were rare.


I, as a non-native English speaker, disagree.


Because e.g. pc and sp are exactly the abbreviations used in assembler for some decades?


We should probably mention that "e.g." is an abbreviation for the Latin "exempli gratia", and means "for example." ;-)


You don't have to 'mentally substitute' the actual words. PC, SP, A, etc. are the words themselves. StackPointer is a pointless formalism.


Could you also provide the meaning for the other (pointless) abbreviations? :)


Because it makes the lines longer, and long lines are bad. If it results in a horizontal scrollbar, it is terrible, but even without it, there is a reason papers are often printed in column format and most coding rules specify a maximum line length (often 80, though 120 is becoming popular these days, with big widescreen monitors and all that).

So long lines need to be split. Which is difficult to do properly and results in more lines, and more lines mean less of the code is visible at once and that makes it harder to see the big picture.

But to each his own I guess. Anyway, you can try it out yourself. Just take the code, do the replacements and see for yourself.


Started here: https://github.com/psychoslave/c4

But help would be welcome to recover the intended meaning of the many variable names that were turned into nonsense, be it a comment here, an issue on the repository, a pull request or anything else.


There are often ways of breaking a line that is too long that do not require renaming things. For example, a long list of conditions in an if statement can be broken into one condition per line. Results of comparisons can be put into their own variables. Logic flow can be adjusted and produce the same result. And so on.
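
As a rough illustration of the first two suggestions (one condition per line, and comparison results in their own variables), here is a minimal sketch in C. The function `is_ident_char` and its rules are hypothetical, made up for this example and not taken from c4:

```c
#include <stdbool.h>

/* Hypothetical helper: can c appear inside an identifier? */
bool is_ident_char(int c)
{
    /* Each comparison gets its own named variable
       instead of living inside one long expression. */
    bool is_lower      = c >= 'a' && c <= 'z';
    bool is_upper      = c >= 'A' && c <= 'Z';
    bool is_digit      = c >= '0' && c <= '9';
    bool is_underscore = c == '_';

    /* One condition per line keeps every line short
       without shortening any name. */
    return is_lower
        || is_upper
        || is_digit
        || is_underscore;
}
```

Calling `is_ident_char('_')` returns true and `is_ident_char('-')` returns false; no line comes close to 80 columns.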


Are you reading this on an Apple Watch? I still generally use 80 characters out of habit, but given how monitors have grown, 120 or even 140 should be the new norm.


Adding an extra column for code|docs|other context is so much more useful than allowing longer lines for obese identifiers that rarely serve to make a point more clear.

I'll take my four or five columns of 80 chars over two columns of 120-140 chars any day.


It takes longer to type and read. All the little seconds fiddling with the mouse, popup menus, and hand-eye coordination waste your time and prevent muscle memory. It's hard to reach max throughput with long variable names.


Typing speed is absolutely not the limiting factor for programming productivity. If you are actually limited by typing speed, you are doing something very, very wrong.


Besides, this issue has been solved for many years now. Auto-completion in modern IDEs has gotten really good.


Humans don't read words letter by letter; you recognize the whole word pattern. Abbreviations actually slow you down on this point, at least the first few times you encounter each new one. A longer but more familiar term will take you less time to read.

Autocompletion will rarely ask you to type more than four keystrokes to select any arbitrarily long term.

Meaningful terms in context often happen to be far easier to grep.

Except for sounding far more impenetrable to the lay man, there is not much left to these H4x0r turns. Of course jargon curse is not a prerogative of CS, this is a common spontaneous social behaviour.


Most of these abbreviations are well established. 'pc', 'sp', and 'a' are the names of those registers in many assembly languages.


To clarify, you don't usually see just "a" for an accumulator, as there is usually more than one accumulator-style register in a CPU, and in many cases they are split along byte (possibly word) boundaries.

So you end up with accumulators called "A" and "B", but are composed of registers "AX" and "AY", and "BX" and "BY", with each being one byte (or word) wide; X and Y being high and low bytes/words of the register (and dependent on "endian-ess" too).

Sometimes you even get cases where multiple registers can be referenced by a single name - "D" is a popular choice, and may be made up of "A" and "B" (being the high/low "registers" of the larger word). IIRC, the 6809 was like this (?) - A and B were 8-bit registers, but could be referenced as a 16-bit word "D" (or maybe I am thinking of the 68k or some other architecture - it's been a long while).
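
To make the register-pair idea concrete, here is a small sketch in C that models it. It assumes the 6809 layout, where A and B are 8-bit accumulators and D is the 16-bit pair with A as the high byte and B as the low byte; the struct and function names are invented for the example:

```c
#include <stdint.h>

/* Two 8-bit accumulators that can also be viewed as one 16-bit register. */
typedef struct {
    uint8_t a; /* accumulator A: high byte of D */
    uint8_t b; /* accumulator B: low byte of D  */
} Accumulators;

/* Read the combined 16-bit D register (A:B). */
uint16_t read_d(const Accumulators *r)
{
    return (uint16_t)((r->a << 8) | r->b);
}

/* Write D, which updates both A and B. */
void write_d(Accumulators *r, uint16_t d)
{
    r->a = (uint8_t)(d >> 8);
    r->b = (uint8_t)(d & 0xFF);
}
```

With A = 0x12 and B = 0x34, `read_d` gives 0x1234; writing 0xBEEF through `write_d` leaves A = 0xBE and B = 0xEF.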

The only other time I have ever seen singular letters used for registers in assembly was for very old pre-microcomputer systems (beasts like the Univac and System/360 - though I think the PDP-8 had similar style). Also some of the very early "microcontrollers" (which were more like glorified sequencers with some extra memory and rudimentary branching, if any) had similar "registers" (Radio Shack once sold, as a part of their "Science Fair" electronic kits, a "Microcomputer Trainer" that was something like a very small 4-bit microcontroller with 128 bytes of memory or something like that - to teach assembler and a bit of hardware interfacing - it had "small" registers like that referred to in single letters).


The 6502 is still in production, and has single-character register names (A, X, Y, P, S).


The 8080 had A, B, C, D, E, H and L. These mostly carried over to the 8085. Newer chips have ax/al/ah, eax, rax type names that grew out of the original names. The Zilog Z80 and Sharp LR35902 were mostly 8080 compatible.

The MOS 6502 has, as gmfawcett said, single-letter names. These in turn carried over to Western Design Center (WDC)'s 65C816. There are actually separate instructions for loading and storing in A, X, Y and Z at least on the '816. LDX, STX, and so on. This means the Ricoh 2A03, Ricoh 5A22, Hitachi 6309, MOS 8501, MOS 8502, and the later MOS 65xx series and the CSG chips. A fun fact is that the 6502 had especially fast access to its zero page memory and special instructions for some functions on that page, the first 256 bytes of RAM. Language implementers sometimes made up for the dearth of registers by treating certain addresses in the zero page as additional registers.
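
The zero-page trick can be sketched in C as a simulation. This is only an illustration under stated assumptions: the `memory` array, the slot addresses `ZP_R0` and `ZP_R1`, and the helper names are all hypothetical, and real implementations did this in 6502 assembly rather than C:

```c
#include <stdint.h>

/* Simulated 64 KiB address space; the zero page is addresses 0x00-0xFF. */
static uint8_t memory[65536];

/* Hypothetical zero-page slots used as extra 16-bit "registers". */
enum { ZP_R0 = 0x02, ZP_R1 = 0x04 };

/* Store a 16-bit value little-endian at a zero-page address.
   The uint8_t cast makes the second byte wrap within the zero page. */
void zp_store16(uint8_t addr, uint16_t value)
{
    memory[addr] = (uint8_t)(value & 0xFF);
    memory[(uint8_t)(addr + 1)] = (uint8_t)(value >> 8);
}

/* Load a 16-bit little-endian value from a zero-page address. */
uint16_t zp_load16(uint8_t addr)
{
    return (uint16_t)(memory[addr] | (memory[(uint8_t)(addr + 1)] << 8));
}
```

Because the address parameter is a single byte, every access stays within the first 256 bytes, which mirrors why zero-page addressing is shorter and faster on the real chip.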

The Motorola 6800 had two accumulators, A and B. The stack pointer was merely S. X is the index register. It also treats the zero page specially. The 68000 series broke with this, having eight address registers a0-a7 and eight data registers d0-d7.

All of the above used A as an accumulator at least by convention in the materials.

SP is the literal name of the stack pointer on x86 in 16-bit mode. It's also used as an alias for R13 in at least some Arm (AArch32 on v7 and earlier for example). SP and PC are the stack pointer and program counter on the PDP-11. It's aliased to r1 on the Intel 80960 (i960) since that is the stack pointer on that platform.

The PDP-8 used similar zero-page tricks to the MOS 6502, only given that it had one (1 !!!) register, that was necessary.

All of these processors were CPUs for commercially successful systems. They might "only" be microcontrollers today.

The MOS 6502 / 6510 and its variant the WDC 65C816 was in the Commodore 64, Commodore PET, the Vic-20, the Apple II, the Atari 2600, the Atari 400/800/600XL/800XL/1200XL/800XE/65XE/130XE, Nintendo Famicom, SuperFamicom, the NES, the SuperNES, BBC Micro, Ohio Scientific Challenger 4, Atari Lynx, Apple III, Apple IIgs, Acorn Atom, Acorn Electron, Franklin Ace, and loads of clones.

The Z80 was in most Amstrad models, in the original TRS-80, the MSX standard, VTech Laser, Intercompex Hobbit, Mattel Aquarius, the Microbee, the NEC PC-6000 & PC-8800 series, Sinclair ZX line & Timex Sinclair, Coleco Adam, and again a bunch of clones.

The Motorola 6809 was in the Tandy Color Computer, while the smaller CoCo MC-10 used the 6803. A few other companies built around this chip family, too.

The Commodore 128 featured both a 6500 series processor and a Z80.

Several of these processors still have versions produced in 2020, although they're not for your main desktop or your phone. Several of them are targets for emulation or new hobbyist software due to the popularity of their platforms. And yes, some of them are used as microcontrollers. Microcontrollers need code written for them, too.


> The golden rule in making understandable code is making it as short as possible.

If that were the case people wouldn't even bother with assembler mnemonics, comments and white spacing. People would write their Javascript / CSS minified from the outset and code golfing would be a best practice rather than a niche activity that some developers do for fun.

I do actually get the point you're trying to make in your post and you do raise some valid points but that sentence is massively overreaching and thus works against you.


Since it all seems so clear to you, would you mind giving a translation of each non-plain-English word used, or a glossary explaining each?

Including:
p -> position, pointer?
lp -> location pointer?
bss ???
e -> expression? emitted code?
le -> location of emitted code?

Num -> number? (why 128? I guess it's related to ASCII ending at 127)
Fun -> function?
Sys ???
Glo ???
Loc -> location?
Id -> identifier?

[reserved keywords are mostly complete words] Char (charset, sign) Else Enum (enumeration, roll) If Int Return Sizeof (size of, heft) While

Assign (assignation, peg) Cond (condition, ply) Lor (logical? or, ere) Lan (logical? and, also) Or Xor (exclusive or, otherwise) And Eq (equals, dows) Ne (not equals, jars) Lt (lower than) Gt (greater than) Le (lower equal) Ge (greater equal) Shl (shift left, haw) Shr (shift right, gee) Add Sub (subtract, take) Mul (multiply, time) Div (divide, rive) Mod (modulo, lap) Inc (increment, amp/eke/pip) Dec (decrement, ebb/dip) Brak (break, blow)

Between parentheses I provided a guess, and a real English word that could carry the same meaning, generally in less than four letters.

I didn't go further in the code so far.


> Most variable names are what I expected them to mean despite their shortness: pc, sp, bp are registers, a is the accumulator, fd is a file descriptor (of the input file, what else?), tk is for the token, t is temporary, etc... For the less obvious ones, it is usually not that hard to infer their meaning from either the code or comments.

Maybe you have the right background to channel the author's particular form of abbreviation, but I have no idea what bp is supposed to stand for, despite having read and understood how it's being used in the program. If that name was supposed to communicate something, I am really not sure what it was. Yes, I understand it's probably a reference to some register in some architecture, but that's only effective communication to an audience with a background in that architecture. Even if you know pc = program counter, what exactly does that communicate to someone who doesn't know assembly language?

Of course, using "pc" allows a person with a background in assembly to quickly grok what that variable is, so there are upsides. A best-of-both-worlds approach might be to name it pc and teach people who aren't familiar with assembly language what that means with a comment, like:

    int *pc; // program counter, points to the current instruction

Ask yourself, even if the person knows a = accumulator, what is the accumulator accumulating? When you realize that that isn't even a really sensible question in the context because "accumulating" isn't even really what that variable does, then I have to wonder why you think that's a good name for that variable.
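
To show what such commented short names could look like in practice, and what a variable called `a` actually does in an accumulator-style VM, here is a minimal sketch in C. The opcode set and the `run` function are illustrative assumptions written for this comment, not c4's actual code:

```c
/* Tiny accumulator-style VM; opcodes are illustrative only. */
enum { IMM, PSH, ADD, MUL, EXIT };

/* Runs the program in code and returns the accumulator at EXIT
   (or -1 if an unknown opcode is met). */
int run(int *code)
{
    int stack[64];
    int *pc = code;       /* program counter: points to the next instruction */
    int *sp = stack + 64; /* stack pointer: stack grows downward */
    int a = 0;            /* accumulator: holds the result of the latest operation */

    for (;;) {
        int op = *pc++;
        if      (op == IMM)  a = *pc++;     /* load the immediate operand into a */
        else if (op == PSH)  *--sp = a;     /* push a onto the stack */
        else if (op == ADD)  a = *sp++ + a; /* pop the top of stack and add it to a */
        else if (op == MUL)  a = *sp++ * a; /* pop the top of stack and multiply by a */
        else if (op == EXIT) return a;
        else return -1;                     /* unknown opcode */
    }
}
```

Running the program `{ IMM, 2, PSH, IMM, 3, ADD, PSH, IMM, 4, MUL, EXIT }` computes (2 + 3) * 4 and returns 20. Note that `a` never "accumulates" a running total here; it just holds whatever the last instruction produced, which is arguably the point being made above.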

> Because yes, there are comments, not many, but they are helpful. For example, the VM has unusual instructions (for me) like LEV and ADJ, and they are commented. The "obvious" ones like MUL and SHR are not.

> The variable names are not "needlessly cryptic". I've seen (and written, not proud of it) a lot of needlessly cryptic variable names, and believe me, these are crystal clear by comparison. Here, there is a clear influence from assembly mnemonics that really helps understanding.

That's helpful if you know the specific assembly language the author is referencing. But if you don't know assembly language, or if you know a different assembly language, it's not helpful. That SHR instruction you said is obvious doesn't exist in MIPS[1], which is what a lot of assembly beginners will get introduced to. Oh and by the way, which assembly is being imitated isn't documented, so you can't even look that up easily.

The "unusual instructions (for me)" aside is telling: not everyone is you. If your variable names only communicate to you, they don't communicate (an activity that famously involves just one person[2]).

> The golden rule in making understandable code is making it as short as possible.

That's total nonsense. Code becomes understandable when you see it as communication, which starts with understanding who your audience is, and catering your communication to their vocabulary. If your audience has a strong background in x86 assembly, they probably have the vocabulary they need to understand this program. But that is actually a quite narrow audience.

Just to be clear, I don't think this is a bad program. There's a lot to be said for choosing a goal and following through with it, and some of this code is downright brilliant. But effective communication, it is not.

[1] http://inst.eecs.berkeley.edu/~cs61c/resources/MIPS_Green_Sh...

[2] https://www.xkcd.com/1984/



