Lowercase letters save data (endtimes.dev)
217 points by zdw on Jan 20, 2024 | 148 comments



The SVG path example shown is broken, which in itself suggests why this would be a dodgy approach to apply here.

Shown are these:

  <path d="M 1 1 L 1 -1 Z" />
  <path d="m 1 1 L 2 0 z" />
But these are inconsistent. These show what I think was intended (a line from (1,1) to (2,0)):

  <path d="M 1 1 L 2 0 Z" />
  <path d="m 1 1 l 1 -1 z" />
When you look at realistic illustrations, “lowercase everything” is fairly obviously not a good approach in path data: sometimes absolute coordinates will be shorter, other times relative coordinates will be shorter; and the difference, raw or compressed, will normally be more than the case difference.

No, if you’re trying to compress, you should be taking the holistic approach, adjusting not just case, but the entire path data. For example, one of these two solutions, dropping the superfluous whitespace and removing the line-to opcode which is implied after an opening move-to:

  <path d="M1 1 2 0z"/>
  <path d="m1 1 1-1z"/>
You can also contemplate whether you actually want to close the path. In this trivial case, it’ll only make a difference if you have exactly one of stroke-linejoin and stroke-linecap set to round.
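
If you want to check this yourself, here is a quick sketch (Python, with zlib standing in for gzip's DEFLATE) that prints the raw and compressed sizes of the variants above:

  import zlib

  paths = [
      '<path d="M 1 1 L 2 0 Z" />',   # absolute, verbose
      '<path d="m 1 1 l 1 -1 z" />',  # relative, verbose
      '<path d="M1 1 2 0z"/>',        # absolute, minified
      '<path d="m1 1 1-1z"/>',        # relative, minified
  ]
  for p in paths:
      raw = p.encode()
      print(len(raw), len(zlib.compress(raw, 9)), p)
(On snippets this tiny the compression header dominates, which is itself a reminder that byte-level tweaks like this only matter in aggregate.)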


It's not that lowercase saves data. It's that normalization saves data. You can normalize to uppercase and get the same effect.

In fact, the argument to use uppercase for normalization is that lowercase does not round trip.

You are assuming ASCII content (or UTF-8 content that happens to only be ASCII). Throw Unicode into the mix and you get problems to deal with.

Separately, after normalization, to decompress the data you are assuming every sentence must be capitalized and anything that appears to be a title or proper name must be title cased. I imagine that for some pages this isn't 100% accurate, so after decompression you will probably end up with a letter being uppercase where it was originally not. Thus, your technique must be classified as a lossy algorithm.


> In fact, the argument to use uppercase for normalization is that lowercase does not round trip.

I believe neither upper nor lower case round trips properly, which is why Unicode invented fold case.

Whenever this topic comes up I like to refer to one of my favourite talks on the topic: A Million Billion Squiggly Characters by Ricardo Signes: <https://www.youtube.com/watch?v=TmTeXcEixEg>. Whilst nominally about Unicode handling in Perl, it's general enough that anyone using any language can learn something from it.


> You can normalize to uppercase and get the same effect.

No, you can't (unless you do it very carefully). Not every lower case letter can be transformed to an upper case ascii letter and back to the same lower case ascii letter. Allow me to introduce you to the letter i (in Turkey).

If you naively uppercase('i') in a standard alphabet, you'll get "I", which isn't even the same letter if you are reading it in Turkey.

http://www.i18nguy.com/unicode/turkish-i18n.html
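
You can see the round-trip failure with Python's built-in, locale-independent case mappings alone (just an illustration; a proper Turkish-locale mapping would behave differently again):

  s = "ı"                      # U+0131 LATIN SMALL LETTER DOTLESS I
  print(s.upper())             # 'I' (U+0049)
  print(s.upper().lower())     # 'i' (U+0069), not the letter we started with
  print("İ".lower())           # 'i' + U+0307 combining dot above: two code points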


You're saying it's not the "same effect" but don't we have an equivalent issue with lowercasing?


No, because the Unicode committee screwed it up and rejected fixes for it. So, it lowercases correctly, but does not uppercase correctly.


Which character lowercases but not uppercases correctly? In which locales?

My understanding is that I will either lowercase to i or to ı depending on locale. Just like i will uppercase to I or to İ depending on locale.

In both cases, you have a problem if you want to support multiple locales. In both cases, if you set a fixed locale then it's fine.


Uppercase Eszett lowercases to lowercase eszett, which then uppercases to SS, which then lowercases to ss.

It’s not incorrect but does show that you can get four different strings from one with just case operations.
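
For anyone who wants to reproduce the chain, a quick check in Python (str.casefold() is the "fold case" mentioned upthread):

  s = "ẞ"                            # U+1E9E LATIN CAPITAL LETTER SHARP S
  print(s.lower())                   # 'ß'
  print(s.lower().upper())           # 'SS'
  print(s.lower().upper().lower())   # 'ss'
  print("ß".casefold())              # 'ss', case folding goes straight there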


Yeah, I know that one. I wasn't trying to ask for any character that has a problem in one direction, I want to know what character they were talking about because it sounds like they're confused about how i, ı, İ, and I work.


It rather means that case operations are locale-specific. This is indeed unfortunate, as Unicode tried hard to split characters that look the same or similar but have different behaviors like that, but inevitable because Turkish (among others) otherwise uses normal Latin characters. In other words, Turkish speakers are unconsciously mixing the Turkish `i` and the normal Latin `i` in their texts and there is no way to separate them after the fact [1].

[1] See https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML009... and https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML009... (and especially note the transcoding issues)


>You can normalize to uppercase and get the same effect.

This isn't quite true. Most HTML, JS, and user content can't be uppercased.

>Thus, your technique must be classified as a lossy algorithm.

I think you might be misunderstanding some parts of this. The article is just suggesting changing the case to sentence case everywhere, whether transferred or displayed. The compression comes from separate lossless algorithms.


Converting everything to lowercase for the purposes of saving space when zipping is technically a lossy compression algorithm. It's an algorithm for saving space, and it doesn't recreate a perfect copy of the original data.


I used this effect in my chess trainer (the link is in my profile). I have a file with 800 compressed chess positions, and I used a lot of tricks to get the file as small as I could. One trick was to invert the color of every piece on White's side of the board. In a typical position outside of the endgame, White's pieces will tend to be on his side of the board, and Black's will tend to be on his side. By inverting the piece colors on one side, I ended up with positions where almost all of the pieces are black, and here and there you'll have a white piece. It's similar to how most of the letters in text are lowercase, and occasionally you'll see an uppercase letter. I think I shrank the file by 5% or so this way.


Here is how I would do it.

The board is 8x8, so a square index is 6 bits. There are 4 piece types (rook, queen, knight, bishop; I will represent the others, king and pawn, in a special way), which is 2 bits. So 8 bits in total per piece.

I would not use a bit to represent piece color, but instead use a separator to separate one player's pieces from the other's. Repeating the same position as the last piece can act as a separator, since two pieces can't share the same square.

Kings are special in that they must exist, so you don't have to specify their type.

Pawns are special: most likely you will have 1 pawn (of each color) in each column, and it can be on rows 2 to 7, or it can be dead, or it can be "irregular", which means the pawn moved to another column and there are now two pawns in that column. This data is 3 bits per pawn, so 48 bits in total. After that data you can put the positions of all irregular pawns (an additional 6 bits for each irregular pawn). If a pawn promotes, it is represented as a regular piece (and can be considered dead in this 48-bit pawn data).

So a chess state is

   Pawn data (48 bits)

   Irregular pawns (6 bit each but I can't tell how many there can be at max. 8 maybe?)

   White king pos (6 bits)

   White pieces ((6 bits for pos + 2 bits for type) * number of pieces)

   Separator (6 bits)

   Black King pos (6 bits)

   Black pieces (same as white)

   Separator (6 bits, represents end of data)
I think the worst case scenario is when there are 8 promotions, which would make it add up to 248 bits.

If the number of pawns on the board is small, it might be worth representing them using their full positions, and using a bit to decide which representation you picked.
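
If anyone wants to experiment with a layout like this, a minimal bit-packing helper is enough. This is my own sketch, not tied to the exact field widths above, and the field values are made up for illustration:

  class BitWriter:
      def __init__(self):
          self.bits = []

      def write(self, value, width):
          # append `value` as `width` bits, most significant bit first
          for i in reversed(range(width)):
              self.bits.append((value >> i) & 1)

      def to_bytes(self):
          out = bytearray()
          for i in range(0, len(self.bits), 8):
              chunk = self.bits[i:i + 8]
              chunk += [0] * (8 - len(chunk))          # pad the last byte
              out.append(int("".join(map(str, chunk)), 2))
          return bytes(out)

  w = BitWriter()
  w.write(0b101, 3)    # e.g. one file's pawn data (3 bits)
  w.write(12, 6)       # e.g. white king on square 12 (6 bits)
  w.write(0b01, 2)     # e.g. a piece type (2 bits)
  print(len(w.bits), "bits,", len(w.to_bytes()), "bytes")   # 11 bits, 2 bytes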


You also need to account for:

- Whose turn is it

- If the king is in its initial position, whether the king has moved previously (that player can't castle anymore).

- For the pawns, you may need the previous position/move to allow en-passant capturing [1]

The standard specification used for this is the Forsyth–Edwards Notation (FEN). [2] It was invented way earlier than computers, so it was targeting humans / isn't optimized for digital compression, but it at least specifies _almost_ everything that you need to track.

The _almost_ is because to account for the repetition rule [3] you need to track all previous positions. At this point, the problem changes from "how to efficiently store the positions of the pieces on the board" to "how to efficiently store (partial) games", and you are probably better off listing the moves than the board state.

[1] https://en.wikipedia.org/wiki/En_passant

[2] https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notati...

[3] https://en.wikipedia.org/wiki/Threefold_repetition
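
For reference, a FEN record is just six whitespace-separated fields, and fields two to five carry exactly the extra state discussed here (side to move, castling rights, en passant square, halfmove clock for the 50-move rule):

  # Standard starting position, split into FEN's six fields:
  fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
  placement, side, castling, en_passant, halfmove, fullmove = fen.split()
  print(side)        # 'w'    : whose turn it is
  print(castling)    # 'KQkq' : remaining castling rights
  print(en_passant)  # '-'    : en passant target square, if any
  print(halfmove)    # '0'    : halfmove clock for the 50-move rule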


I love bit-bashing to squeeze something down to the smallest representation possible. There was a related discussion here on HN recently, where I came up with this:

A good representation needs to keep the full game history due to the no-repeat rule, and also because this is generally useful. There are only 32 pieces so a 5-bit number can select one uniquely. Each piece has a maximum of about 32 positions it can move to on its turn, for another 5 bits. Then the encoding is just 10 bits per ply. Typical games are 40 moves (80 ply) and hence require just 800 bits or 100 bytes for the whole game history. Some additional data is required to track promotions, but that's it.

Then you could get clever with Huffman coding or the like, since some moves are more common than others. E.g.: on the first turn, only the pawns or the knights can move. For quite a few turns, pieces either can't move or are very restricted in where they can move to. Pawns always have so few moves available that their next move could be represented with just 2 bits instead of the full 5. Etc...

In practice, it ought to be possible to get the typical standard game representation down to 8 bits per move or less, especially for short games.

To represent games starting from arbitrary configurations, you also need to include 32 6-bit numbers to encode where each of the unique pieces is on the 64-square board. That adds just 24 bytes for this initial "setup".
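
Restating that arithmetic as a sketch (the bit widths are the ones assumed above):

  piece_bits = 5                   # 32 pieces -> 5 bits to name one
  target_bits = 5                  # assume at most ~32 destinations per piece
  ply = 80                         # a typical 40-move game

  bits_per_ply = piece_bits + target_bits
  print(bits_per_ply)                      # 10
  print(bits_per_ply * ply // 8, "bytes")  # 100 bytes for the move list
  print(32 * 6 // 8, "bytes")              # 24 bytes for an arbitrary initial setup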


You could generate a list of legal moves with a specific algorithm, and use the necessary number of bits to indicate which one happened.


Just encoding algebraic chess notation gets you to 9 bits without much skill: 3 bits for the 6 pieces, 3 bits for rank and 3 bits for file, with the edge cases shoved in the extra encoding space given that you have 3 bits to encode 6 pieces.

On average, chess has a branching factor of 30-40, so you need a touch more than 5 bits on average even with clever Huffman coding. Just Huffman coding regular algebraic chess notation for movement would probably get you down to 6 bits on average.

Of course, the other issue is that the tighter you compress the data, the harder it is to actually manipulate that data, and it's not clear that the space saving for denser encodings is worth it.


That’s a lot of extra wasted bits! There are only 32 pieces in total, and only one of 16 can move each turn. So a mere 4 bits is sufficient to select the moved piece.

The assumption here is that this is an encoding “at rest” and isn’t used directly, except perhaps for equality comparisons. The decoder would have to track the game state and expand this into a much larger format for programmatic manipulation.


>To represent games starting from arbitrary configurations, there needs to also be included 32 6-bit numbers to encode where each of the unique pieces is

In that case, you can either omit pieces not in play or omit pieces in their standard starting position to shave off a few more bits.


I remembered the current turn later, but I could not think of your 2nd and 3rd points; you are right.

I guess you can't just take a look at the board to see the current game state; I was too focused on what the game looks like, while there is additional off-the-board info.


https://news.ycombinator.com/item?id=39066345

You could probably modify this solution, I am already using 3 bits for piece type

0-3 rook queen bishop knight

4 Non-moved king (can castle)

5 moved king

6 pawn that moved 2 squares last turn (can be captured using en passant)

7 other pawns

And then a single extra bit to store whose turn it is


instead of encoding "moved king", encode "moved rook" on each rook. Since a pawn can never be on its home rank, overlap rook (not moved) & pawn (en passant) codes. Now you have a code for "king (to move)" and "king (not to move)".

If you use a variable length encoding you can also use different static probabilities for the different ranks, since "rook (not moved)" and "pawn (en passant)" can only appear on certain ranks and only depending on the piece color (well, pawns can't appear on ranks 1 or 8 at all)


You’d need the same information for rooks to handle castling.


That's better reasoning in general (see my other comment for the GP's scheme), but you can do much better.

First, you should encode the number of total pieces (and their general composition) at the beginning, which can greatly shrink all subsequent fields. For example, you are using a repeating square index to denote the end of the piece list, but you have at most 15 such pieces unless there is a chess variant where a new piece is created out of nothing, so you can just encode that as 4 bits instead of a 6-bit separator.

Second, use a fractional number of bits for each field. Each piece occupies one square and prohibits some more squares, and those squares should not be available for encoding. So you need arithmetic coding or an equivalent for maximal efficiency. A careful ordering can then eliminate even more squares from further encoding. I think you've essentially done this for pawns, but you can do much more in general.


Using 3 bits for piece type (including empty) and 1 bit for color makes a simple matrix encoding of the board just 8x8x4=256 bits. Add run-length encoding for empty squares and it will be much smaller.

4 bits per square has room in it for RLE flags and even counts, without increasing the worst case size.
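
A sketch of the plain 4-bits-per-square matrix, before any run-length tricks (the piece codes here are my own choice for the sketch, not from the parent):

  # 4 bits per square: 3-bit piece type (including empty) + 1-bit colour.
  EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING = range(7)

  def encode(board):                     # board: 64 (type, colour) pairs, a1..h8
      out = 0
      for piece_type, colour in board:
          out = (out << 4) | (piece_type << 1) | colour
      return out.to_bytes(32, "big")     # 8 * 8 * 4 bits = 32 bytes

  back_rank = [ROOK, KNIGHT, BISHOP, QUEEN, KING, BISHOP, KNIGHT, ROOK]
  board = ([(t, 0) for t in back_rank] + [(PAWN, 0)] * 8 +
           [(EMPTY, 0)] * 32 +
           [(PAWN, 1)] * 8 + [(t, 1) for t in back_rank])
  print(len(encode(board)), "bytes")     # 32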


248 is the max in my case (actually, later I noticed I could shave another 4 bits); if there are fewer pieces, there will be less data.

But probably with encoding added, your solution would be better.


But there can probably be a middle ground.

1 bit for each square to represent whether there is a piece at that position, so 64 bits.

then for each side, 4 bits to represent the number of pieces and 3 bits per piece to represent its type

At max 64 + 2x(4 + 3 x 16)=168 bits for 16 pieces each side.

edit: nvm this wouldn't work since you wouldn't know the colors of the pieces on board. You need to store 4 bits for each piece, 1 color + 3 type so it is 64 + 2 * 4 * 16 = 196. I am not sure if this is better anymore


Actually it's 192 bits, not 196.

To beat it you must encode each piece with a different number of bits. E.g.

  Q c00
  N c01
  P c10
  B c110
  R c1110
  K c1111
where c is 0 for white or 1 for black. Since there are 2xK, 2xQ, 4xB, 4xN, 4xR, 16xP, you will need 2x5 + 2x3 + 4x4 + 4x3 + 4x5 + 16x3 = 112 bits. Adding the 64 bits for the bitmap, the grand total would be 176 bits.

The issue is that each time a P is promoted to R or B (a stupid thing to do, since a Q would be better) you need one or two bits more. On the other hand, you spare 3 to 5 bits for each captured piece. In normal play, captures outnumber promotions, and promotions are either to Q or N, never R or B, but theoretically...
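
Quick check of those totals (just restating the arithmetic, with the colour bit included in each per-piece cost):

  code_bits = {"Q": 2, "N": 2, "P": 2, "B": 3, "R": 4, "K": 4}   # bits after the colour bit
  counts    = {"Q": 2, "N": 4, "P": 16, "B": 4, "R": 4, "K": 2}  # both colours together
  piece_bits = sum((1 + code_bits[p]) * counts[p] for p in counts)
  print(piece_bits)         # 112
  print(64 + piece_bits)    # 176 with the occupancy bitmap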


https://news.ycombinator.com/item?id=37525348

As far as max size goes, 64 bits plus data per piece seems to be the right path. Especially if you huffman the pawns to be 3 bits each while other pieces are 4 bits.

The grandparent method gets pretty bloated when you start promoting.


Even something very simple like listing piece positions in a given order should result in less. Let's say King, Queen, Rooks, Bishops, Knights, pawns. Give a position for each for both players in 6×32 (192) bits.

If a piece is taken, repeat the position of the king. If a pawn is promoted (and there are more than the starting number of the piece it was promoted into), repeat the position of the piece it was promoted into.

1 bit for whose move it is. (193b)

3 bits for the en-passant column. (196b)

50 move rule (I suppose 7 bits, as you need it in ply?) 203b. Then the data for the extra pieces, in the order encountered if any. ?×6b (how many are possible? There are 16 pawns, but I suppose it isn't possible to promote all)


My objective was to reduce the average size; the example calculation I made is for the max size.


My objective was to show that it's quite complicated for dubious gain. What do you think the average would be?


  What do you think the average would be?
No idea, I don't know what kind of data OP stores (mostly few pieces? or lots of pieces? etc). You can probably tailor your format if you have more insight. But no matter what, you can always have some gain compared to a naive approach. I would say storing a pos for each piece might be naive, since there are a lot of rules that allow you to exploit things.

My 248 bits was the absolute worst case, when you have 8 promotions on the board and have lost none of the (non-pawn) pieces. For example, the initial board would be about 190 bits as well, and most positions would be lower as you capture pieces. I think that would be a worst case for you as well, but I don't get this part:

  If a pawn is promoted (and there are more than than the starting number of the piece it was promoted into) repeat the position of the piece it was promoted into
You mean you first need to give the position of the piece it is promoted into and then another position for the pawn (now promoted)? Then you have a pretty bad worst case as well.

But you are right, I made an overcomplicated solution. Here is a simpler one that should perform better https://news.ycombinator.com/item?id=39066557


>Then you have a pretty bad worst case as well

It would be really rare though; virtually all positions would be 192 bits for the position as such.

I don't think you need en passant info for each pawn; there can only be one, so you only need to mark the column with 3 bits, and there must always be a column where it can't happen, which you can use when there isn't any. Or maybe it needs more, IDK.


How would you enforce the three repetitions rule this way?


I guess you can't do that without the full PGN. I guess the 50 move rule might be extraneous as well.


You don’t always need the entire move list.

Whenever any piece is taken (permanently decreasing the number of occupied squares), a pawn is moved, castling rights change, or an en passant opportunity isn’t taken, there won’t be any more repetitions of the board matching past board positions, so you can forget all earlier board positions.


After some more thinking and feedback:

  2 bits per square
  00 empty
  01 white pawn
  10 black pawn
  11 other piece
  =128 bits
4 bit index of the white king on the board (it is one of the "other piece" squares)

4 bit index of black king

for each "other piece" on board that is not a king: 1 bit color + 2 bit type

1 bit for current player

2 bits for each king if they moved (optional, only exists if the king is at its initial square; otherwise it has moved)

1 bit for each pawn if it can be captured en passant (optional; these bits only exist for pawns on the 4th row for white or the 5th row for black)

(And i will ignore the draw rule)


> =128 bits

Notice that this alone is more than a lot of positions with my "naive" implementation - 36 bits per side for the best case, with only the king, and at most one queen: 6b for the king, 6 bits for the queen, or its absence, 6 bits each for the absent rooks, bishops, knights and pawns.


I thought of combining the two:

1 bit per square (empty/non empty) 64 bit.

Then we need at worst 5 bits (32 non empty squares or less) per piece to indicate the positions of the big pieces. (80 bits, typical worst case)

Then we only need 1 bit per pawn, to indicate color. (16 bit at most)

The typical worst case: 160 bits.


Now this is some good nerd sniping content.

A few quick thoughts.

You could have a move counter, and pick representations based on total moves (early on, you could probably use lots less space simply storing the movement from original position and piece type).

Bishops can only be in one of 32 spots but you need to know which one (ordering?).

Trying to not shave off bits but have more zeros so that it can be compressed more effectively - having positions based on the delta from their most likely position could help.


  Bishops can only be in one of 32 spots but you need to know which one (ordering?).
I was thinking of that, but piece type is already 2 bits for 4 piece types, so you can't have separate bishop types for each square color without adding additional bits. Ordering wouldn't work since you can have only one bishop, or an extra one or two on the same square color, etc.


   Separator (6 bits)

   Black King pos (6 bits)
Maybe we can even skip the separator?

Since each player has to have a king on the board. So the first piece in the list is the white king, and everything after it is white, until the second king in the list, and then that one and everything after it is black.


The number of chess pieces is dynamic, so you wouldn't know whether the white pieces are over and the next piece is the black king (kings don't have type bits).

But now that I think about it, instead of a separator I could simply use 4 bits to represent how many (non-pawn) chess pieces there are for each player. It is at most 15 (7 initial + 8 promotions), so only 4 bits instead of a 6-bit separator; that shaves another 4 bits!


Don't forget to track if the rooks or kings have moved, for castling eligibility. And what the previous move was for en passant. And the full game history back to the most recent capture for draw by repetition.


  And the full game history back to the most recent capture for draw by repetition.
Ouch, I will leave this as an exercise to the reader.


That part is easy.

1. store the oldest full board you care about

2. for each turn, record either:

2A. two 6-bit coordinates

2B. 1-8 bits to say which move they picked out of the full list of legal options


You could actually run a small chess engine a few moves ahead to sort the full list of legal options into the order of obviousness.

Then you can use a variable length encoding (Huffman coding?) to encode the moves, with more obvious moves taking fewer bits. You could even encode multiple moves into a single code point, and it might be possible to encode an entire sequence of obvious moves (like a complex exchange) into a single code point, which might be encoded as just a single bit.


You would definitely use arithmetic encoding at that point. Then you can directly use fractional bits to encode moves.


How do you account for promoted pieces?


The pawn is dead and the piece it got promoted into is just another regular piece.


I guess your code is not publicly available, though I can easily see what you've done. And I'm not sure it took a lot of tricks. In my understanding, your format is essentially as follows:

    Prepare an empty chess board.
      Empty square is represented as 0.
      White piece is represented as +1 to +6 in the order of RNBQKP.
      Black piece is represented as -1 to -6 in the order of RNBQKP.
    
    Also prepare the reference chess board, which is same to initial positions except:
      3rd and 4th ranks are filled with black pawns, like 2nd.
      5th and 6th ranks are filled with white pawns, like 7th.
      Somehow, f1 and f8 are filled with rooks, not bishops. (Is this a bug?)
    
    Read the initial board. Start at a1 and continue until all squares are filled:
      Read one byte as bit fields AAAABBBB.
      Skip A squares. Wrap to the first file of the next rank on the last file.
      If -5 <= B-7 <= 6,
        The current square is set to B-7.
        If the current square is on 1st--4th ranks, invert the piece's color (if any).
        Skip to the next square.
      Otherwise, let C be 1 if B-7 = -6 (normally a white pawn), or B-5 otherwise.
      Repeat the following C times:
        The current piece is taken from the same square in the reference board.
        Skip to the next square.
    
    Read one byte as bit fields AAAABBBB.
    A indicates the current run. White if A = 1, black otherwise.
    I don't know B, but only one half-move will be read unless B = 1.
    
    Read one half-move or as many half-moves as possible:
      Read two bytes A and B, which are interpreted as square indices 0..63 (a1..h8).
      If A is on the 7th rank and a white pawn is at A,
        B's rank is reinterpreted as the promoted piece index (0 = no promotion, 1..4 = RNBQ).
        B's rank is then always reset to the 8th.
      If A is on the 2nd rank and a black pawn is at B, do the similar adjustment.
      Put a move from A to B, with the promotion indicated if any. But do NOT alter the board.
This format uses 1/3 to 1 byte to encode a single piece, unless there are 4 pieces or fewer, in which case there may be some more overhead (since you can skip at most 15 squares at once). This is not particularly efficient; even a simple Huffman coding will be as efficient as that [1]. You should also make use of the fact that there is at most a specific number of pieces per type; once you've seen a black king, you won't see another, so the corresponding representation is wasted. (Promotions make other pieces more complicated though.) I guess 10 bytes might be possible without a very complicated scheme.

This format also uses two bytes to encode a single turn. This again is hardly efficient, especially because there is a much smaller number of legal (half-)moves given a particular position. (218 is believed to be the maximum, and the upper bound is not much larger than that [2].) So you can just enumerate legal moves and encode the index as a single byte. If the number of legal moves is far smaller, say 16, you can pack multiple moves into a single byte as well. A more involved scheme would then assign a smaller number of bits to a more likely move, using heuristics and neural networks and whatever else [3].

[1] https://stackoverflow.com/a/66345772

[2] https://old.reddit.com/r/chess/comments/o4ajnn/whats_the_mos...

[3] https://triplehappy.wordpress.com/2015/10/26/chess-move-comp...


Ah, I hadn't seen Adler's comment before. I did most of my work on this before he posted that comment. His scheme is probably more efficient than mine (he is Mark Adler, after all), but I'd have to run it on the same set of positions to know for sure.

You figured out how I'm representing the position correctly, but I'm only storing one half-move per position, specifically the current player's best move. I use that for the "Show hint" feature.


> I'm only storing one half-move per position, specifically the current player's best move. I use that for the "Show hint" feature.

I think you are describing the B = 1 case. When is the B != 1 case used then?

I also wonder why the reference board seems to have an obvious mistake (rooks in place of bishops). Again, I hardly know any chess, so that might be my oversight and that does improve the compression for this particular data set.


> This format uses 1/3 to 1 byte to encode a single piece... even a simple Huffman coding will be as efficient as that.

Not when you consider that his file was compressed after encoding (per the article).


On <!DOCTYPE html> versus <!doctype html>: it’s actually not so simple. If you’re using gzip, you might manage to save as much as a few bytes (though typically only 1–2), but if you’re using Brotli, it’ll actually compress something like 2–4 bytes worse, because Brotli includes a static dictionary based on a popularity contest of largely-not-optimised content.

This is a thing I dislike about Brotli: it encourages you to follow the herd, rather than doing what would otherwise be more optimal. (Sure, the encouragement is just a byte here, a byte there, but it’s the principle of the thing.)

Here are sizes based on running `‹brotli or gzip› --stdout <<<'‹doctype›‹charset›' | wc -c`:

  ┌─────────────────┬────────────────────────┬──────────────────────┐
  │     BROTLI      │ <meta charset="utf-8"> │ <meta charset=utf-8> │
  ├─────────────────┼────────────────────────┼──────────────────────┤
  │ <!DOCTYPE html> │           19           │          27          │
  │ <!doctype html> │           23           │          29          │
  └─────────────────┴────────────────────────┴──────────────────────┘
  
  ┌─────────────────┬────────────────────────┬──────────────────────┐
  │      GZIP       │ <meta charset="utf-8"> │ <meta charset=utf-8> │
  ├─────────────────┼────────────────────────┼──────────────────────┤
  │ <!DOCTYPE html> │           58           │          56          │
  │ <!doctype html> │           58           │          56          │
  └─────────────────┴────────────────────────┴──────────────────────┘
In practice, when you add more content, which is predominantly lowercase, doctype tends to save a byte or two compared to DOCTYPE. So gzip rewards removing unnecessary characters and matching the case of the surrounding content where case-insensitivity allows it. But Brotli was made by popularity contest, and so the inferior representation actually ends up compressing better than the superior one, because most people did things the inferior way!


But the static dictionary is not used as is; it is transformed with a lot of options. For example both `<!DOCTYPE html>` and `<!doctype html>` appear in the dictionary, and transform 43 (FermentAll) will turn them into `<!DOCTYPE HTML>`. (No transform would generate `<!doctype HTML>` though.) Therefore Brotli is not only following the herd, so to speak. And the static dictionary only improves a very short input---any long-term pattern will be captured regardless of the static dictionary.


I believe that when I investigated this a few years ago I tried it on a few things and found it consistent.

Right now, I take my HN front page and add <!DOCTYPE html> and <!doctype html>. Uppercase becomes 5678 bytes; lowercase 5675 bytes, three bytes smaller.


While Brotli tries to use all those transforms, I guess it is still not exhaustive enough to fully make use of them, given that Brotli (and many others) is asymmetric in design and compressors can be heavily tweaked without affecting decompression too much.


If you ever happen to study computer science, you may come across a subject called "coding theory". It introduces you to compression as well as many other topics such as error correction (Reed-Solomon, used in RAID5, RAID6 and ECC-RAM), line coding (sometimes you need to flip bits to keep the clock on both sides in sync, no clock-signal for a long time may cause clock loss) and a lot of wonderful but weird stuff.

Let us go into detail on compression: there is a representation. US-ASCII uses 8 bits per Latin letter, UTF-32 uses 32 bits per Latin letter. It is just a temporary representation for the machine -- usually in memory only; it carries the same amount of information either way, and you can save it more efficiently to disk. You would not want to save either format to disk as-is, it is a waste of space.

Information content (I hope my translation is correct, scan Wikipedia for details) cannot be compressed. But it can be calculated. The more seldom a letter is, the more information its occurrence carries. As soon as the letters are not all equally frequent (compare space and "q"), the information density drops. The calculation is quite simple: count the occurrences of each letter, count the characters used (if there is no "q" in the text, you get to save one letter and its encoding) and apply some math

https://en.wikipedia.org/wiki/Entropy_(information_theory)

For some easy examples, think of morse code and Huffman coding -- not every letter needs to be encoded using the same amount of bits.
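
As a small illustration (my own sketch), the per-symbol entropy of a string is a few lines of Python, and lowercasing it merges symbols, which can only keep the entropy the same or lower it:

  from collections import Counter
  from math import log2

  def entropy_per_char(text):
      counts = Counter(text)
      n = len(text)
      return -sum(c / n * log2(c / n) for c in counts.values())

  s = "Lowercase Letters Save Data"
  print(round(entropy_per_char(s), 3))           # mixed case: more distinct symbols
  print(round(entropy_per_char(s.lower()), 3))   # lowercased: same or lower entropy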

> How much data can lowercase save?

Nothing. Either there is (almost) no information in it in the first place, in which case compression will take care of it for you. There could only be information in it if uppercase letters were as likely as lowercase letters.

> How much data can lowercase save?

Why do you even stop at letters? You could build a dictionary of words and compress references to it. The compression efficiency would only depend on the number of words, regardless of case and regardless of character set. That is why entropy depends on "symbols" instead of "letters".


In most cases uppercase vs lowercase doesn't matter; compression works equally well either way. An interesting exception is QR codes. They are substantially smaller if you only use uppercase characters, because there is a more efficient encoding for them.


I've only dabbled with QR codes but they are interesting. Do you have any links that go into more detail about uppercase making them smaller?



In addition to mitsuhiko's link, there is even a dedicated binary-to-text encoding suitable for the alphanumeric mode of QR codes: https://datatracker.ietf.org/doc/html/rfc9285


Yes but they greatly enhance readability and are carriers of information. Don't throw the baby out with the bath water. Just because you can doesn't mean you should. There are so many cliches that apply here.


I'm surprised by the strength of your comment. The article doesn't argue for lowercasing everything, and cites several cases (ah!) where it shouldn't hurt. Including case insensitive languages where case doesn't carry any information by definition and where it's already common practice and uncontroversial to lowercase everything, and English title case. With this in mind, my next question is:

> they greatly enhance readability

Do you have any evidence for this? My native language is not English and we use the Latin alphabet. The case for titles is the same as for regular sentences. We are doing fine. I would even say that the English title case feels somewhat odd and confusing. Titles are also virtually always set apart / rendered differently as normal text so I fail to see how normal case for titles loses information and how title case helps in any way. Actually, I even believe that title case sometimes loses information. Capital letters mark (Mark, btw :-)) proper names, this marking is lost with title case and this may sometimes bring ambiguity.


> Do you have any evidence for this?

US street and highway signage went from all caps to mixed case in the past 30 years. I believe the US Department of Transportation did studies and found mixed case was easier to read. Although they use a non-standard font.


Interesting, though one could as well conclude from this that all caps is not very readable. We need to know the actual motivation behind this change, not a supposition.

My question would be from title case to normal case for titles. We already know that normal case is readable, that's what we use for normal text.

We also know (?) that people speaking another language with a Latin alphabet don't have trouble reading titles.


Heyo, just curious what language(s) use the latin alphabet without case?


We have case, but we don't have title case. We write titles using normal case, with an upper case letter at the start and for proper names but no other words.

(French, but I think that's also true for Spanish, Italian and many other languages. I don't know if German has specific casing for titles like English does, but it has an upper case letter for common nouns).


Correct. Sentence case in Spanish: "El ingenioso hidalgo don Quijote de la Mancha" and Italian: "Il nome della rosa". Sentence case also in German, with the usual rule that all nouns are capitalized: "Kritik der reinen Vernunft". Same for Russian (non-latin of course): "Война и мир".

I suspect English is the only language to use special, though not standardized, rules for titles.


this is because we recognize most words by shape. capital case is all single-height while lowercase is mixed height (it has letters that ascend and descend). for situations like road signs, being able to scan and see the “shape” of a word quickly is very helpful.

so that is an argument in favor of using lowercase.


This is just a myth.


While the word shape meme is a myth [1] there are some half truths to it.

It is known that mixed casing does disrupt reading [2], which gives some weight to the idea that the shapes of words and letters does affect reading.

For myself I know that my reading speed and comprehension suffers when reading title cased sentences. This leads me to always "fix" [3] documents I update or otherwise contribute to at work when others do it for headings, subheadings, headers and footers.

[1] https://www.mrc-cbu.cam.ac.uk/people/

[2] https://pubmed.ncbi.nlm.nih.gov/9293635/

[3] I say "fix" because I know it's a generally acceptable formatting method when writing, I just don't like it.


Link 1 is wrong


Thank you for pointing that out! Correct link:

https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/


> In this paper, Pelli and colleagues show that when reading words that have been distorted by presenting each letter in visual noise (like an out of tune television), readers do not perform as well as an 'ideal observer' who can recognise words based on their shape alone

I said being able to see the word shape is very helpful. I did not say that you can read based on the word shape alone.


You may have said it, but you should be aware that it isn't true.

https://learn.microsoft.com/en-us/typography/develop/word-re...

> The weakest evidence in support of word shape is that lowercase text is read faster than uppercase text. This is entirely a practice effect. Most readers spend the bulk of their time reading lowercase text and are therefore more proficient at it. When readers are forced to read large quantities of uppercase text, their reading speed will eventually increase to the rate of lowercase text.

> Haber & Schindler (1981) found that readers were twice as likely to fail to notice a misspelling in a proofreading task when the misspelling was consistent with word shape (tesf, 13% missed) than when it is inconsistent with word shape (tesc, 7% missed). This is seemingly a convincing result until you realize that word shape and letter shape are confounded. The study compared errors that were consistent both in word and letter shape to errors that are inconsistent both in word and letter shape. Paap, Newsome, & Noel (1984) determined the relative contribution of word shape and letter shape and found that the entire effect is driven by letter shape.

Word shape is not relevant to anything.


At least in titles I think "falsifying" the case of random words actually hurts readability. It can get confusing sometimes, especially when it includes a person's name that also happens to be a noun.


I agree, but it has me wondering about a semantic-based compression approach. If we know a period is often followed by a space and a capital, can we compress based on "which periods to reapply this rule to"?
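
A toy version of that idea (my own sketch with made-up helper names): lowercase the letter after every ". ", and keep an exception list marking the spots where the rule must not be re-applied on decompression:

  import re

  def fold(text):
      exceptions = []
      out = list(text)
      for m in re.finditer(r"(?<=\. )[A-Za-z]", text):
          i = m.start()
          if out[i].islower():
              exceptions.append(i)      # rule would wrongly uppercase this spot
          out[i] = out[i].lower()
      return "".join(out), exceptions

  def unfold(folded, exceptions):
      out = list(folded)
      for m in re.finditer(r"(?<=\. )[A-Za-z]", folded):
          i = m.start()
          if i not in exceptions:
              out[i] = out[i].upper()
      return "".join(out)

  text = "Hello. world stays lower. Bob gets his capital back."
  folded, exc = fold(text)
  assert unfold(folded, exc) == text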


i don't agree that capital letters greatly enhance readability

if they do at all, the benefit is not worth having an entire second alphabet to learn, and deal with, imho (i.m.h.o. i-m-h-o, etc)

there are many other, and better, ways to improve readability


Readability is a complex matter in that it depends a lot on the reader, their expectations, and how well they handle some cases vs others.

On title case in particular, I wonder how you would feel if it was applied to everything in the text, and not just the title. Would it be more readable ? For me personally it absolutely wouldn't and I'd be infuriated after three lines.

Then if most people were on the opposite side, I'd think we'd have at least a few prominent use cases where all the text would be in title case, and I can't think of any.

Is title case's point just to make it clear it's a title? If so, we have so many better ways to convey that (bold, underlining, spacing, positioning, bullets and markers etc.). On the baby and the bath water saying, the more I think of it the more I see title case as just a relic of some old practice I'm not sure why we cling to in this day and age... what's the baby, what am I missing?


> Yes but they greatly enhance readability

No they don't. It is true that modern readers read lowercase text more quickly than they read capitalized text.

And it's also true that, after a small amount of practice reading capitalized text, the difference in reading speed becomes zero. There is no benefit of any kind to the lowercase letters.


The whole premise of the article is infuriating. The author claims using only lowercase _enhances_ legibility; maybe for them, but this is subjective at best, and incorrect for the general population of literate adults. Also, lowercase compresses just as well as USING ONLY UPPER CASE would. You would get a better compression result yet, f y lft t ll th vwls frm yr sntncs.


But the premise isn’t that using only lowercase increases legibility. The Premise is that Not Using Title Case Improves Compression and Also Increases Legibility for Author of Article.


Case-folded names in academic journals happen to be a pet peeve of mine. When the author is listed only as "VINCENT VAN GOGH", is the name rightly spelled with "van" or "Van"? Often, there's no real way to know, short of searching for the author's other works!

Some online metadata lists the name with normal capitalization, but sometimes the spelling was entered by someone whose guess is just as good as yours.


Names are awful identifiers, considering how they are misshapen by stylistic rules (case folding, reordering, abbreviation, transliteration, etc.) and how many collisions there are. I really hope some generalised system (ORCID?) finds as wide of adoption as DOI did.


why would you spell it as van? the English-language and coding convention is Van


Because Vincent van Gogh was Dutch. In the Netherlands (and Flanders, I presume) "van" would be lowercased when placed between given name and surname (because it's not a name, "van" just means "of"), but capitalized when it stands alone with the surname:

    Vincent van Gogh
    Van Gogh
And to pre-emptively answer the next question: we don't drop the "van" because some surnames come in varieties with and without it, so you might end up confusing a hypothetical "Jan van Gent" with "Jan Gent".

But of course, you can't tell from an article if someone is Dutch or just has Dutch ancestry but long-since switched to English spelling rules. Which in some cases means the family chose to always lowercase the "van".

Source: look at my username ;)


We don't do that in Flanders. My husband's name is "Roel Van Gils". People (and computer systems) from the Netherlands do automatically change this to "Roel van Gils" but they are wrong.

Also, in Flanders, when you do write the "Van" or the "De" in your lastname with a lowercase letter, we assume you are nobility


Thank you for educating me! :) It also immediately highlights the original problem even better.

Funny how the French do the same thing as us (Dutch people) again though, I wonder if there is a quirky consequence of some historical event "encoded" there?


It took me a while to realize that "Van Morrison" wasn't the musician's surname.


It's the same in French ('de' is lowercase) though many people get that rule wrong.


You’d do it lowercase in van Gogh’s name because it’s the correct way to write it for him. See https://en.wikipedia.org/wiki/Tussenvoegsel


Of course, for someone so well-known, it's easy just to look it up; I was just using that name as an example. But in general, particles in names like "van", "von", "de", etc. might be properly capitalized one way or the other, and it can be impossible to tell which without a close understanding of the author's nationality (which in turn can very difficult to determine).


Falsehoods programmers believe about names number 30:

"There exists an algorithm which transforms names and can be reversed losslessly."

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


Counterpoint: utf8 <-> utf32


Yeah, surely the author can't be Dutch or Flemish? Where the van prefix originates?



No it’s not


And this is another case where our world is culturally biased towards "the West". It doesn't stop at lowercase letters saving data when compressed and being at an advantage. Take UTF-8, for example, and think about the difference between Latin and other scripts. And then there is the whole Pandora's box of Han unification...


For others like me who might be curious about the Han Unification controversy:

https://en.wikipedia.org/wiki/Han_unification#Rationale_and_...

> In 1993, the Japan Electronic Industries Development Association (JEIDA) published a pamphlet titled "未来の文字コード体系に私達は不安をもっています" (We are feeling anxious for the future character encoding system JPNO 20985671), summarizing major criticism against the Han Unification approach adopted by Unicode.


And yet the same criticism is now directed at the Japanese government, because Japan is trying to define a unified character set for name characters, and that necessarily demands some sort of unification [1].

[1] https://mainichi.jp/articles/20231231/k00/00m/040/131000c (Japanese, also paywalled but at least the first portion is visible)


> Also I don't like how they force all the titles into title case (edit: actually I think I'm wrong about this.)

In many cases (not sure of the rules), it fiddles with the case, as it does various other title manglings. If it does mangle things, the submitter can edit them back the way they were and it won’t mangle again.

I strongly dislike the title mangler and would fain see it removed. I feel like I disagree with its modifications most of the time, and even the ones that have some justification like removing numbers at the start as a declickbaiting measure (“400 ways to do X” → “Ways to do X”) very often harm the sense, and I dislike these cases more than I dislike its improvements.

(But maybe if the mangler were disabled I would realise that people are doing a terrible job of entering titles and it is actually fixing a lot of stuff.)


On the other hand, lowercase letters tend to be more complex using more bezier anchor points. So if your priority is not bandwidth but work memory you should maybe stick to uppercase? :D


Just use an 8x8 raster font. 8 bytes per character are hard to beat, still perfectly readable, and who needs fancy smooth outlines anyway ;)


That is gorgeous click bait! Love it. Using a somewhat surprising effect of compression to explain it is brilliant.

Obviously, no one should consider a couple bytes of compression when actually writing text, and I know the article doesn't advocate for that.


If it's about text compression, can't you say the same in favor of upper case?


I'd think so, yes. But at least for me, reading all-caps is really annoying. For me, it feels like yelling. I also find it harder to read. Uppercase letters tend to use more space, physically. So it makes a page feel really crowded if everything is in uppercase.


But all lowercase gets annoying as well; at least CSS can be used to kind of straighten it?


Not for me, personally. As long as punctuation rules are in place.


About 15 years ago, I got frustrated that the National Weather Service issued its alerts in all caps. So I built a service, IntelliCAPS, that attempts to convert all caps to mixed case using both a custom dictionary and user-submitted feedback. It worked pretty well! I could convert most of the NWS bulletins to a more readable format with about 98% accuracy. I bet LLMs could probably do way better these days.


A very vaguely related thing that might be useful to someone: in QR codes, using ALL-CAPS actually saves you data, making the QR code less "dense" and so easier to read at smaller sizes.

The trick is in the "alphanumeric" encoding mode of the QR standard, which lets you encode 2 characters per 11 bits (as opposed to 1 character per 8 bits) if you limit yourself to just numbers, uppercase letters and a few common URL symbols [0].

Besides making the code less dense, you can also add more error correction at the same density, making the code more resilient to wear, damage and reflections.

For more QR code trivia, see this excellent demo: https://www.nayuki.io/page/creating-a-qr-code-step-by-step

[0] https://www.thonky.com/qr-code-tutorial/alphanumeric-table
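
The arithmetic of the alphanumeric mode is easy to sketch (data bits only, ignoring the mode indicator and length field):

  ALPHANUM = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"   # the 45 allowed chars

  def alphanumeric_data_bits(s):
      assert all(c in ALPHANUM for c in s)
      pairs, leftover = divmod(len(s), 2)
      return pairs * 11 + leftover * 6     # 11 bits per pair, 6 for a lone char

  url = "HTTPS://EXAMPLE.COM/ABC"
  print(alphanumeric_data_bits(url))   # 127 bits
  print(len(url) * 8)                  # 184 bits in byte mode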


i would also like to say that latin capital letters are for the most part a set of totally different, but related, shapes. so our alphabet is really more like 52 letters. and for what?


For readability. Compare to Cyrillic, where lowercase letters are basically half-height uppercase letters, so everything looks like it's rendered in small caps (this is more true for handprinted lettering, but mostly true for typeset stuff too).


Probably true for the Russian Cyrillic tradition, but not so much for Bulgarian and other Cyrillic traditions, where the lowercase letters differ from the uppercase for almost every letter, with few exceptions, whether handwritten, cursive, italic or upright.


It's why God invented cursive.


Have you seen Cyrillic cursive?


Helps you scan for sentence beginnings and proper nouns I suppose


This is only true because DEFLATE is very old---it handles duplicates well but ignores everything else---it doesn't even handle simple sequences like `00 01 02 ... ff`. Modern browsers support Brotli as an additional content encoding, and it can actually detect and handle such cases better. Of course such modelling can be done as a preprocessing step as well, PNG notably uses filtering in addition to DEFLATE.


What I find interesting here is how the effect of using lowercase would add up and reduce carbon production significantly. It reminds me of Gerry McGovern's articles on the use of images and the environmental impact that that has. https://gerrymcgovern.com/books/world-wide-waste/


That page lowered my IQ 10 points.


And yet in the past, when space was at an even greater premium, code was written in all uppercase. The title seems to imply that uppercase is worse than lowercase in terms of data usage. Truthfully, it is "same case" that allows for less data usage. The case could be all lower, or all upper.


The article is arguing against title case, not all uppercase.


Thanks to Endtimes for this interesting article.

I hope that the Collapse (Tech) Subculture in general is more interested in improving compression algorithms than they are in making less legible content. (But if they think the legibility is fine then sure, do the latter.)


BY THAT LOGIC SWITCHING TO ALL CAPS IS JUST AS EFFICIENT. I PROMISE I AM NOT YELLING.


CONVERTING EVERYTHING TO UPPERCASE SHOULD HAVE THE SAME EFFECT ON COMPRESSION STOP IT EVEN MAKES YOUR WEBSITE LOOK LIKE A TELEGRAM STOP


Cool, but just think about how much data (and CO2) you may save by skipping vowels!


Wild shot, but does anyone have data sets on how different languages compress in a similar manner? It would be an interesting mix of letters/phonemes used, word length etc.


Isn't the same true for uppercase letters, too? What about binary encoded text? Probably true if you translate the page to ones and zeros, as well.


TL;DR: dropping irrelevant information from a message saves data. Slightly misleading title and article.

Nothing about lowercase saves data. Not having two (or more) different letter cases saves data; if you had ALL UPPERCASE DATA the effect would be the same. And the fewer other chars you use, the better - stick to lowercase or uppercase ASCII only, and your compression will be better.

Having "A" versus "a" conveys information; if you drop that information, your message can be naturally smaller. What if you actually needed to differentiate jack from Jack? Only using lowercase letter you may be forced to do something like _j_ack, which would probably take _more space_.

(Information theory discusses those topics, especially the concept of information entropy.)


> What if you actually needed to differentiate jack from Jack?

Especially if Jack's your uncle and he's on a horse...


I disagree that it's not intuitive (reason comes to mind pretty easily, no?), but it is still a great and interesting thing to write about!


Not using title case also removes ambiguity in some cases and is generally much easier to read. Not sure how or why it is so popular around here.


Wouldn't a perfect compression algorithm know that certain spaces are followed by an uppercase letter and take that into account?


The “saving X bytes saves Y emitted carbon” trope is a pet peeve of mine.

Nobody knows how much carbon a request’s full internet transmission path, storage, and possible 24/7 availability of a server consume. Stacks are too different, equipment widely varies, international internet traffic likely consumes more.

I’m typing this using wind energy that I pay extra for. But who knows what my ISP’s equipment and onward are using.

If you carbon-optimize your content so much that I have to make multiple requests or paste it into ChatGPT to understand it, you did not reduce your footprint.

There’s no direct byte to carbon conversion.


Saying you use wind energy you pay extra for is a pet peeve of mine :)

You're not really using wind energy. The electrons don't know who is pushing them.

You just increase the share of non-wind that others get, but as long as there are fewer wind subscribers than there is wind energy, it doesn't change anything. The premium goes entirely to your energy provider, their profits and marketing department.

Happy to be proved wrong in your case if you have a link to your plan.


This is an oversimplification. I assume you're both talking about REGOs. Electrons are fungible so you're sort of right in that you aren't using the exact electrons that come from a wind turbine. But you're wrong that the extra money just goes straight to profits and has zero effect on green energy generation.

Imagine two scenarios:

1. Everyone believes your comment so nobody pays extra for green energy. Therefore the market value of REGOs is 0. This means green energy generators get paid exactly the same market price per kWh as brown (?) generators.

2. Some people (not necessarily everyone) pay for 100% green energy. This increases the market value of REGOs, so now green suppliers get paid a bit more. Green energy is now worth more than brown energy, which changes the supply/demand curve so you get more green energy overall.

The REGO market is a very elegant market solution to selling the "greenness" of green energy. The fact that it isn't the dumb 1:1 connection that people naively expect doesn't mean it is a scam.


THERE MIGHT BE ANOTHER WAY ABOUT THIS.


uppercase characters seem to be almost entirely superfluous to me, or at least, the benefits do not justify having an entire second alphabet of characters which everyone has to learn, machines have to process, etc.


I thought so too, until I needed to help my uncle Jack off a horse.


They add one bit of information, hardly superfluous.


tEchniCalLy, oNe biT PeR cHaRAcTER, bUT iN reALITy, nOT MuCH mORe aCTuAL iNForMaTIon


Better title - mixed case text takes up more space when compressed.


I use smaller font sizes for the same reason


ok, that's my new excuse for why i am writing in all lowercase ;-)


tl;dr:

1. It doesn't save data, it loses data.

2. If you drop the uppercase, you can compress the remaining text further.



