Stringly Typed (stefanjudis.com)
40 points by todsacerdoti 1 day ago | 65 comments





The idea of "type safety over the network" is a fiction.

When it comes down to it, what is being sent over the network is 1s and 0s. At some point, some layer (probably multiple layers) is going to have to interpret that sequence of 1s and 0s into a particular shape. There are two main ways of doing that:

- Know what format the data is in, but not the content (self-describing formats) - in which case the data you end up with might be of an arbitrary shape, and casting/interpreting the data is left to the application

- Know what format the data is in, AND know the content (non-self-describing formats) - in which case the runtime will parse out the data for you from the raw bits and you get to use it in its structured form.

The fact that this conversion happens doesn't depend on the language; JS is no more unsafe than any other language in this regard, and JSON is no better or worse a data serialisation format for it. The boundary already exists, and someone has to handle what happens when the data is the wrong shape. Where that boundary ends up influences the shape of the application, but what is preferable will depend on the application and developer.
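
To make the boundary concrete, here is a minimal TypeScript sketch (the User shape and parseUser are hypothetical): the wire hands you an unknown value, and some layer has to check its shape before the rest of the program can rely on it.

    interface User { id: number; name: string }

    // The one place the "wrong shape" case is handled; everything
    // downstream can rely on the User type.
    function parseUser(raw: unknown): User {
      if (
        typeof raw === "object" && raw !== null &&
        typeof (raw as any).id === "number" &&
        typeof (raw as any).name === "string"
      ) {
        return raw as User;
      }
      throw new Error("payload is not a User");
    }

    const user = parseUser(JSON.parse('{"id": 1, "name": "Ada"}'));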


> The idea of "type safety over the network" is a fiction

> When it comes down to it, what is being sent over the network is 1s and 0s

?

When it comes down to it, all of computing is 1s and 0s. This is not some feature that's particular to the wire.


The difference is that a client can be malicious, while e.g. a local file is assumed to behave with the same intent as any other. Programs that run on one computer can always be statically verified, while the task is harder for client-server applications — the client could always be an untrusted impersonator!

This happens with local files also, and was originally called “DLL hell”. The mismatch isn’t malicious, but the effect is the same.

What does that have to do with type safety though? If anything, type safety improves whichever piece of the puzzle you do have control over by reducing the likelihood of you accepting malformed data.

A local file can be "newly local" and have recently been saved from the network or via a usb drive, etc.

And assuming a file is going to behave with good intent, or even the same intent as another file of the same format, is bad. It's how we get jpeg/png/etc parsing errors. It's how we end up with PDFs that are also valid executables, and 1000 more issues.


> The difference being that a client can be malicious, while e.g. a local file is assumed to behave with the same intent as another.

I'm not sure what it means to assume something about the behavior of a file, presumably thought of as a static piece of data, but I'd certainly disagree that a modern computing system is entitled to assume that all local apps behave with the same intent as one another (except to the extent that it assumes that all local apps behave maliciously).


> The idea of "type safety over the network" is a fiction.

Not really, it just has to be enforced at run time rather than compile or link time.


In many statically-typed languages, types do not exist at runtime - they are erased once the program is known to be internally consistent. What is left is not type safety, it is parsing and validation of unstructured binary blobs (or arbitrary strings, depending on the protocol) into structured data. Structure and types are not the same thing, and in many languages they barely even overlap.

Any input data has the same problem. Type safety exists after validation, and its guarantees hold only if your original validation was upheld.

Files, databases, user input, network protocols, etc. I don’t know why the network would be in any way special. You parse/validate unstructured binary blobs into structured data, and what’s left is type safety. It’s not in the runtime only because, if the compiler has done its job correctly, it is typesafe by construction.

In other words, how many times are you going to check your data structure is correct before you start assuming it’s correct? Once — at parsing and validation — after that, you’re working with structured data, and your types are just recording the fact.
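
A minimal TypeScript sketch of that "check once, then assume" idea, using a hypothetical branded Email type so the type system records that validation already happened:

    // Plain strings lack the brand; only parseEmail can attach it.
    type Email = string & { readonly __validated: true };

    function parseEmail(s: string): Email {
      if (!s.includes("@")) throw new Error("not an email");
      return s as Email; // the single place the assumption is made
    }

    function send(to: Email) { /* no re-checking needed here */ }

    send(parseEmail("ada@example.com")); // ok
    // send("not-an-email");             // compile error: unvalidated string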


While it is all 1s and 0s, what those bits mean can easily be encoded. When you say in detail what those bits mean (which we need to do anyway: what code page is that string?), we can define what is valid, and in turn reject messages that, while they are only 1s and 0s, are still wrong. Also, by assigning meaning we can get closer to what we want: the string "12345" and the number 12345 can both mean the same thing, but we can put the number into 2 bytes if we want, while the string is at least 5 bytes. Not to mention that a number is easier to parse; turning a string into a number is not always trivial (depending, of course, on which code page is active).
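
A quick TypeScript illustration of that size difference (assuming a big-endian uint16 for the number and UTF-8 for the string):

    // 12345 fits in 2 bytes as a uint16; the string "12345" takes 5.
    const asNumber = new Uint8Array(2);
    new DataView(asNumber.buffer).setUint16(0, 12345); // bytes 0x30 0x39

    const asString = new TextEncoder().encode("12345");
    console.log(asNumber.byteLength, asString.byteLength); // 2 5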

> At some point, some layer (probably multiple layers) are going to have to interpret that sequence of 1s and 0s into a particular shape

You do it once, at the application border. Doing it multiple times, in multiple layers, is a path to madness.


My point was more that the layers below the application will also have to parse the data into a particular format - in the case of networked applications, into TCP/IP packets, then anything particular to the message protocol, before hitting the application. And then the application will, at runtime (regardless of whether you are using a type safe language or not) have to parse and validate the shape of the data before it can be used.

Not to agree, but this is the same for files on disk or data in a database, where the malicious or misbehaving peer might just be an older version of the same program

Not trying to be mean, but there's not much content here. It's a definition of the term "stringly typed" (from another blog) followed by the idea of using appropriate types.

I guess the author is "one of today's 10,000", as they say. Wiktionary attests the term from 2019 but I'm sure I've been hearing it much longer than that.

The post is a true web-log. Someone logged something they learned and put it on the web.

I first heard of it from Jeff Atwood in 2012; loads of fun concepts here that I reference often. My favorite must be "shrug report".

https://blog.codinghorror.com/new-programming-jargon/


I was working with the Torque Game Engine in like 2008 which had a scripting language where almost all data was strings. Vectors? String of three numbers with spaces in between. Looking back I think it was kind of TCL inspired. But I definitely heard it called "stringly typed".

xkcd has a relevant take on this: https://xkcd.com/1053/

TLDR, we should totally be celebrating learning in public


That's not just a take, that's the origin of the phrase OP used :)

In a beautiful meta moment, you are one of today's 10k about the origin of 10k :D


I'm witnessing...something!

> JavaScript is a weakly typed language without much type safety

This is a little inaccurate. JavaScript is dynamically typed. Values carry strong, unforgeable type information (tags), though tags for numbers are extremely cheap and optimized away whenever possible. It is not possible that JavaScript "forgets" that 1.0 is a number and allows a program to use it as a "pointer".


Strong and weak are ill-defined, but JS is generally considered weakly typed. It's about how the language treats the results of expressions on mixed types and whether it permits implicit conversions or not. Contrast with Python, which is also dynamically typed but is strongly typed (by default, you could add in your own operations to weaken these runtime type checks).

JavaScript: https://tio.run/##bdBBCoMwEIXhfU4xuNFG01C6K3iYF2tLixhREXr6NJ...

Python: https://tio.run/##K6gsycjPM/7/P1HBVkHdUJ0rCUgbc3EVFGXmlWgkKm...

Few languages are truly strongly typed (zero implicit conversions; even numeric operations commonly allow them without requiring explicit casts), so it's really more of a spectrum: how much does a language allow in comparison to other languages?
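
A self-contained illustration (TypeScript, with the values deliberately typed any so the checker lets JavaScript's runtime coercions through):

    const a: any = "1";
    const b: any = 1;
    console.log(a + b);   // "11"   (+ prefers string concatenation)
    console.log(a - b);   // 0      (- coerces both sides to numbers)
    console.log(a == b);  // true   (loose equality coerces)
    console.log(a === b); // false  (strict equality does not)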


Most of the well-thought-out writeups on types that I've seen put them on two different scales:

- Strong vs Weak - how does the language convert between types

- Static vs Dynamic - Does type information apply "early" or "late"? (Not sure what the right term is here, but generally "when it's compiled" vs "when it's run".) I've also seen this framed as whether the type information is attached to the variable (the name of/thing that points at the value) or to the value itself, which tends to work out to the same thing.

Which is, I think, more or less what you're saying here. Just with more words.


Re: conversion between types

Most languages can be placed somewhere on a strong-weak spectrum, but JavaScript is an outlier. Two "equal" values can implicitly convert to opposite boolean values, or different numeric values, or different string values. Not just weak... maybe pathological is a better term.
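
A small sketch of that pathology (TypeScript, values typed any so the coercions compile):

    const zero: any = 0, zeroStr: any = "0";
    console.log(zero == zeroStr);                 // true
    console.log(Boolean(zero), Boolean(zeroStr)); // false, true  (opposite booleans)

    const arr: any = [0], f: any = false;
    console.log(arr == f);               // true
    console.log(String(arr), String(f)); // "0", "false"  (different strings)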


In static languages, variables have types. In dynamic languages, values have types.

I think this is referring to JS's unfortunate habit of doing nonsensical things with types unless the user takes special precautions.

For years I thought I needed explicit static typing, until I tried Python, which is also dynamically typed, and found that I had none of the problems I had in JS. This is because Python is strongly typed.

Indeed Python has the opposite problem of being a bit too pedantic with the type conversions. I thought it was interesting that C#, which is also strongly typed, lets you do string + number. (IIRC it has something to do with how they both descend from Object...)

Worth mentioning that I do think static typing is a very good idea in any nontrivial program, and I wish more languages forced the programmer to be explicit here — TypeScript and even Rust have both bitten me in the ass with the type system making incorrect assumptions instead of just asking me (i.e. forcing me to actually specify the program and eliminate guesswork).


> IIRC it has something to do with how they both descend from Object...

It's far less interesting than that. The truth is that the symbol + means a wide variety of different things in the language, and it uses compile-time type information to select exactly which of those things it means. In the case where either operand is a string, it means string concatenation, with string conversion for a non-string operand if one exists.

That specific conversion part is the only thing that cares that stuff descends from Object. The conversion algorithm is "call ToString() and let dynamic dispatch sort it out."

(For completeness, the other meanings are various forms of addition, but adding two doubles is a different machine instruction than adding two floats. It really does still need to use type information to figure out which of those operations it is.)


Static typing provides different benefits to strong typing. Both are useful in their own ways. And both can get in the way (and not be worth it) in some cases.

I have such an addiction to types that the first thing I do to anything as it comes in and out of systems I own (even across systems I own) is put it back into types and error all the things if it’s invalid.

I assumed this was the regular action because it seems so much safer to me.


I think that JSON is overused (DER is better), but even without JSON, string data is also overused, in cases where numbers or other types would do better. (Unicode string types are also overused, but if a type other than strings is better anyway, then Unicode is not the main issue here.)

> If you're using a strongly typed language like TypeScript, receiving the user object as any or unknown type is unfortunate. You'll lose all the type safety and you can only regain it with manual type checking.

This is not specific to "stringly typed" stuff or to JSON, but is just the case when you transfer data that may use multiple types. In strongly typed programming languages, your program can parse it as the data that it expects and use an error handler when it is not what is expected. (If you do expect that it may have any type, then you might be able to pass the unparsed value along if appropriate; for example, I have an ASN1_Value structure in some of my C programs for this purpose.)


> I think that JSON is overused (DER is better)

When I was younger and more enthusiastic about some aspects of network communications, I set out to understand ASN.1 by reading the specifications. That was when ISO and ITU-T were even more stingy with access to their standards/recommendations, so it wasn't easy to get them as someone with no connection to standards bodies, and also without a few hundred CHF to spare. Reading those specs is an art in itself, but one gets a hang of it after a while. It went pretty well until this part of X.208:

The resulting type and value of an instance of use of the new value notation is determined by the value (and the type of the value) finally assigned to the distinguished local reference identified by the keyword VALUE, according to the processing of the macrodefinition for the new type notation followed by that for the new value notation.

That's where I burst out laughing and finally deeply understood why the ISO networking stack crashed and burned despite having some solid ideas.

All this is to point out that yes, DER is not bad at all, but the whole infrastructure it rests on is simply too alien to people outside of telecom space and those who have to deal with it by necessity because of its use in various security protocols.


OpenBSD went with a stringly typed system for their pledge API.

I think it was for simplicity of use. But I find it a very strange interface from a bunch of C programmers.

https://man.openbsd.org/pledge

update: Found the commit message

    Move to next tame() API.  The flags are now passed as a very simple string,
    which results in tame() code placements being much more recognizeable.
    tame() can be moved to unistd.h and does not need cpp symbols to turn the
    bits on and off.  The resulting API is a bit unexpected, but simplifies the
    mapping to enabling bits in the kernel substantially.

I know JSON is the standard now, but are there “better” serialization formats out there? Especially since JSON doesn’t know what an integer is in the spec

I guess it depends on how you define "better".

JSON does a couple things really well, and most other things terribly.

But the things it does well are pretty valuable. So in the "strengths" category I'd put the two following points:

1. JSON is very easy to read and understand as a human

2. JSON stuck to the basics. No comments, no references, no clever tricks, and not much space to let folks try to hammer in cleverness (see - no comments).

Neither of those are all that much related to JSON itself as a format - the semantics are basically an accident of timing around JS syntax from the 2000s.

But it's very, very useful to be able to get the raw text for a network message and know exactly what's getting sent without having to have a whole specialized tool framework to parse and understand the message.

It's also useful to not let the spec get so complex that I never want to do that, even if I could (see: xml).

So with JSON - I can easily read the actual network request and understand it, even with essentially zero additional tooling AND I have a very good chance of literally being able to open a text editor and create a new message with valid syntax without any other tools or references.

Further - this holds true even if I'm not an industry expert with 20 years of experience. Most random people off the street can do it with only a couple minutes of coaching.

Not many other serialization formats can do that.

Imagine taking your 8-year-old, sitting them down in front of the computer, and legitimately saying "JSON doesn't know what an integer is in the spec"!

It's true... but it's absolutely not the point. For normal people, "number" is complex enough. And if you need an int and not a float... you can do that processing just fine after getting a JSON payload if you'd like. It won't be as fast as a specialized format (e.g. Protobuf), or as flexible as other formats (e.g. XML) - but that's a far distant concern to "Can I hold the hammer".

JSON is really easy. "Easy" as a strength is wildly discounted, but man is it a winner when you get it. I also think it's surprisingly hard to do.


> No comments

There are a large number of people that consider that _not_ a benefit.


Yeah, and if people would solely use them as comments for humans to read... I'm with you.

But they won't. A big part of the reason comments weren't included in JSON is that people tried to get clever with them.

Directly quoting Crockford:

> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability.

And while I'd also love to occasionally throw a comment in a json file, I don't want to have to deal with any of the headaches they would have created in the ecosystem.

And to be fair to Crockford here - it's not like he wasn't aware this was a downside. He even released a tool as a preprocessor for JSON if you wanted to put comments in: https://www.crockford.com/jsmin.html

JSON intentionally chose to stay as simple and compatible as possible, and personally - I think that constraint was the right call.

If I'm writing files I want to throw a lot of comments in... It usually means I should move to something like YAML instead.

Again - JSON is terrible at a lot of things, but really hammered on simple and easy as focus points. If you give devs a place to store data outside the structure of the protocol... they will use it for all sorts of complicated craziness... which devolves to either multiple protocols, or a really complicated protocol.


I know his reasoning for it, I just disagree with him. People added JSON parsers that allow comments and can _still_ get tricky with them. The only thing the standard not adding them did was make sure we can't rely on them being there. And, for ANY file format that is used for config (and similar) that is supposed to be human readable, being able to add comments is pretty much table stakes imo.

Sure, but you'd see folks using them to add metadata and extend JSON in horribly incompatible ways.

Different formats are good for different things, but I think DER is much better. No character escaping is necessary, Unicode is not required (although it can be used if you want to), arbitrary binary data can be stored, integers can be arbitrarily big (although implementations might only support integers as big as the specific application requires), you can skip past any block without needing to know how to interpret it, and there are many other advantages.

(However, I made up a variant with a few additional types, such as key/value list, BCD string, TRON string, etc. This makes it strictly a superset of the types of data which can be stored in JSON (if the types you use are: sequence, key/value list, real number, null, boolean, and UTF-8 string).) I use DER in some of my programs, because I think it is generally much better than JSON. Also, DER is a binary format, although I did make up a text format (called TER) which can be converted to DER (but TER is not really meant for other uses, since it is more complicated to handle).
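
For a flavor of how the skipping works, here is a minimal TypeScript sketch of DER's tag-length-value layout for a small non-negative INTEGER (illustrative only, not a real encoder):

    // DER INTEGER: tag 0x02, a length octet (short form, < 128 bytes),
    // then big-endian two's-complement content. A reader can skip the
    // whole block using only the length, without interpreting it.
    function derInteger(n: number): Uint8Array {
      if (n < 0 || !Number.isInteger(n)) throw new Error("sketch: non-negative integers only");
      const bytes: number[] = [];
      let v = n;
      do { bytes.unshift(v % 256); v = Math.floor(v / 256); } while (v > 0);
      if (bytes[0] & 0x80) bytes.unshift(0x00); // keep the sign bit clear
      return Uint8Array.from([0x02, bytes.length, ...bytes]);
    }

    derInteger(12345); // Uint8Array [0x02, 0x02, 0x30, 0x39]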

Define better.

As the other poster said, you could use XML, which is more powerful but as a result a lot more complex. For most tasks I'd prefer JSON because, while it is lacking, all the real-world parsers I've seen are much easier to work with, and I rarely need more complexity. If someone did a JSON++ (I have no doubt many people have, but I'm not aware of them!) that added things like integers without the complexity of XML, that might be even better. In the real world, if something should be an integer it isn't hard to check that and error out - you need to support parse errors in any data format anyway.
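
For instance, a one-off integer check after parsing might look like this in TypeScript (the payload here is hypothetical):

    // JSON only has "number", but rejecting non-integers is a one-liner.
    const payload = JSON.parse('{"count": 3.5}');
    if (!Number.isInteger(payload.count)) {
      throw new Error("count must be an integer");
    }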

Protobuf is sometimes better for data serialization. It isn't human readable, but you rarely need that, and saving data bytes is often useful even today. Protobuf does have the integer type that you are missing, but it has other limitations that might or might not apply to you. (I don't use Protobuf enough myself to know what they are.)

SQLite has more than once suggested that their database file is a great serialization format. You get a lot of power here, and for complex things a database is often easier to work with than an XML file. There are various NoSQL databases as well that can sometimes work for this.

I've handwritten my own serialization format in the past. The only hard part is designing in enough flexibility to accommodate whatever the future needs are. (Note that I've never had to read my serialization on a different CPU family; things like little- vs big-endian I'm told can be a pain.)

There might be something else I didn't cover... Everything has pros and cons.


Protobuf does support JSON encoding[0], which I like, as the .proto definition is quite readable, and you can then encode/decode either human-readably or efficiently. It's even quite easy to have your consumer support both, since the two are pretty easy to tell apart; if you know it's either one or the other, you can just fail over from trying one to the other, possibly at some small cost. The guide also points out some significant downsides to relying on the JSON version, but it can be useful in development and/or debugging, especially if you control both the sending and receiving sides and can just toggle it on temporarily when you want.

[0] https://protobuf.dev/programming-guides/json/


> It isn't human readable

This is a tooling problem. Wireshark can decode protobuf for you when you're inspecting gRPC traffic.


Needing that tooling is a format problem.

JSON is bad at everything except being simple and easy. Turns out simple and easy is a real winner.


JSON has one glaring flaw: nested JSON encoded in strings becomes awful to read. I encounter it too often in reality, where individual layers use JSON but want to support arbitrary strings in their API. Encodings which use a length prefix don't suffer from this, which ironically even includes most binary formats.
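
A quick TypeScript demonstration of the escaping pile-up:

    // Each layer re-escapes the quotes (and backslashes) of the one below.
    const inner = JSON.stringify({ msg: 'he said "hi"' });
    const outer = JSON.stringify({ payload: inner });
    console.log(outer);
    // {"payload":"{\"msg\":\"he said \\\"hi\\\"\"}"}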

Back to my main point though: normally I don't need the complexity that things like nested JSON bring. When you do, though, JSON is a bad format. (Actually, I would go so far as to say you never need something that complex - but the problems you are trying to solve with nested JSON are still complex enough that you should use a more powerful/complex framework, and better design of your data store would avoid the need for nested JSON.)

If you have the correct version available, that is. All too often when debugging problems, the person in the field doesn't have the correct tools, or doesn't know how to use them (in this case you may not want to share the proto config with that person...). As such, the fewer tools needed to understand something, the better.

It does, and it predates JSON

https://www.apple.com/DTDs/PropertyList-1.0.dtd

As elegant as anything on json.org


If you care a lot you can use Protobuf. Downside is now everything has to speak Protobuf, plus you can't read it in your network tab. Upside is (mostly) smaller payloads and a lot more type safety.

There's CBOR, but it is not nearly as compact as the C in its name implies.

I'm having a vision of XML reading your comment and going "well well well, look who's decided to come crawling back".

I can't imagine why. XML is still fundamentally, well, a markup language, not a serialization format designed as such. But the "extensible" part isn't so accurate - attributes aren't extensible. GP complains that JSON doesn't know what an integer is (as distinct from a generic number), but at least it does know more than just strings. And needing to repeat a tag name when closing it just adds useless complexity.

It’s not any more useless than a closing } or ]; since it has the tag name in it, when I’m reading a highly nested object I’m not stuck in my text editor looking at a bunch of }’s at random indentation levels that I have to scroll all the way back up from to regain any context. Tags are text, which is visual structure I can choose to read, or choose to gloss over and use as bulk to shape the data in my head.

This is one of my favorite things about the Flavour framework: strongly-typed web service calls:

https://frequal.com/Flavour/book.html#org44d6b49

Your single-page app code calls your backend API using strong types. Your code is clean, and the framework handles marshalling and unmarshalling JSON.


If you have control over all the consumers of your API, run whatever makes you fast and keeps you safe.

If you want others to play with your API: JSON here we come.


When Anders Hejlsberg did a lot of those talks to sell TypeScript, he described a lot of JavaScript as "stringly typed", which is very obvious with all the addEventListener("click", ...) calls and so on depending on certain strings. The term itself is not a compliment; if you describe someone else's API as such, it's not well taken.

JS is also "stringly typed" in the sense that you can access an object's properties by just string names; foo.bar and foo["bar"] are the same thing.

It led to a really nice TypeScript feature, where you can declare something like type FooProp = "bar" | "baz", and the typechecker is smart enough to only allow these literal strings where you use values of the FooProp type (e.g. when accessing properties by name, like above). This collapsed the whole crowd of strings, enums, and symbolic constants to just strings, without any loss of type safety, which I find a cognitive win.
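
A minimal sketch of that feature (the FooProp and getProp names are just illustrative):

    type FooProp = "bar" | "baz";

    function getProp(obj: { bar: number; baz: string }, prop: FooProp) {
      return obj[prop]; // typed as number | string
    }

    getProp({ bar: 1, baz: "x" }, "bar");    // ok
    // getProp({ bar: 1, baz: "x" }, "qux"); // compile error: not a FooProp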


OP might like Structurae's Binary Protocol, type safe from door to door[0]. There are a lot more interesting use cases there!

[0]: https://github.com/zandaqo/structurae


An example of "stringly typed" in C is when you have to pass "r+" to fopen().

Yes, that is another example, and I had thought using numbers would make more sense (and you can use enum or #define to give names to those numbers). It is not only the fopen function in C that does that; I have seen similar things in other C libraries, as well as in other programming languages.
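
Sketching the same suggestion in TypeScript terms (the open function here is hypothetical):

    enum OpenMode { Read, Write, ReadWrite }

    declare function open(path: string, mode: OpenMode): number; // hypothetical

    open("/tmp/x", OpenMode.ReadWrite); // a typo like "r+w" can't happen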

An API that uses JSON isn't "stringly typed". An API that lacks any validation on the JSON you pass to it is. Under their definition, nearly everything is stringly typed if it passes a system boundary, because serialization transforms everything into a string - sometimes a byte string, sure, but you end up with a transport-neutral single object whose interpretation is understood by metadata, and that's a good thing, because you don't need to waste time interpreting it at every layer it passes through.

The modern advice to "use a serialization library" is actually encoding several hard pieces of learning into one. There was a time when save files for most games were just memory dumps of large sections of memory. You dumped raw C objects, including pointers to other objects, directly. You ended up with a tangled mess of references, but it was simple to write code for, cheap to write to disk, cheap to read from disk, and easy to break. Basically every update to a game broke all of the save files, because the most minor of tweaks could change the object layout generated by the compiler. The first change was to put magic strings at the beginning to indicate the version - so at least you displayed an error message rather than executing some unexpected part of the save file as code.

This lesson was learned the hard way as we entered the networked age, where you couldn't trust that incoming messages weren't malicious - and you certainly couldn't trust, with all of the terribly behaving middleware, that they were well-formed. Writing serialization/deserialization code is not hard, but it's annoyingly rote, and you would need it for dozens upon dozens of classes. So instead we switched to standardized libraries for serialization and deserialization.

Java and Python both had serialization libraries where whole objects could be serialized - along with everything they referenced. This led to massive security holes, because it was easy for them to take a huge chunk of working memory with them: circular references to root objects allowed them to grab parts of other operations, or even application secrets. Python was worse, as the pickle library allowed serializing whole bytecode, meaning every load was arbitrary code execution.

Modern serialization libraries have come to a compromise. They serialize data only in primitives. You have to rebuild the tangled web of cross object references yourself. This often sucks, but it's far better than the alternatives we've found.
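
A minimal TypeScript sketch of that compromise (the Node type is hypothetical): references become primitive ids on the wire, and the reader rebuilds the links itself.

    interface Node { id: number; next?: Node }

    // Serialize only primitives: store nextId, not the object reference,
    // so cycles and shared references survive JSON.
    function toWire(nodes: Node[]): string {
      return JSON.stringify(nodes.map(n => ({ id: n.id, nextId: n.next?.id ?? null })));
    }

    // Rebuild the tangled web of references yourself.
    function fromWire(wire: string): Node[] {
      const flat: { id: number; nextId: number | null }[] = JSON.parse(wire);
      const byId = new Map<number, Node>(flat.map(f => [f.id, { id: f.id }] as [number, Node]));
      for (const f of flat) {
        if (f.nextId !== null) byId.get(f.id)!.next = byId.get(f.nextId);
      }
      return [...byId.values()];
    }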

GraphQL is popular for precisely this reason. You can avoid most of the serialization and deserialization steps and query what you want directly, allowing you to access deeply-linked properties of deeply-linked objects without the expensive round trips and security barriers being checked and rechecked; but the expressiveness comes at a distinct cost in terms of getting those barriers on the server side really right, because the default-allow permissions make it easy to leak.


Unfortunately you can't escape stringly typed code (and other messes) in a language with a structural type system.

Yes, once you start noticing the (mis)use of strings, it's everywhere. I set my IDE to make strings a bright orange color so they're very noticeable in the code.

With that said, you can overdo it. For example, if you're constructing a URL in an internal method that will never be seen by the caller, it's okay to just use a bare "https" without turning it into an enum like Scheme.HTTPSecure.


my nemesis


