
> The fundamental advice on Unicode is decode and encode on system boundaries. That is, you should never be working on non-unicode strings within your business logic. The same should apply to JSON. Decode it into business logic objects on entry into system, rejecting invalid data. Instead of relying on key errors and membership lookups, leave the orthogonal business of type validity to object instantiation.

This right here is the correct approach. Serialisation formats should be serialisation formats, whether they be JSON, S-expressions, protobufs, XML, Thrift or what-have-you; application data should be application data. There are cases where it makes sense to operate on the serialised data directly, for performance or because it makes sense in context, but in the general case operate on typed application values.
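
To make that concrete, here's a minimal sketch of the boundary pattern in Java with Jackson (SignupRequest and its fields are made up for illustration): decode into a typed object at the edge and reject data that doesn't meet the rules right there, rather than discovering it later via key errors.

    import com.fasterxml.jackson.databind.ObjectMapper;

    // Hypothetical payload type: plain fields, no behaviour.
    class SignupRequest {
        public String email;
        public int age;
    }

    class Boundary {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Decode and validate at the system boundary; the rest of the code
        // only ever sees a SignupRequest that has passed these checks.
        static SignupRequest decode(String json) throws java.io.IOException {
            SignupRequest r = MAPPER.readValue(json, SignupRequest.class);
            if (r.email == null || !r.email.contains("@"))
                throw new IllegalArgumentException("email is required");
            if (r.age < 0)
                throw new IllegalArgumentException("age must be non-negative");
            return r;
        }
    }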



Part of the problem is that JSON and Sexprs AREN'T really serialization formats. They've been pressed into service as such, but they are actually notations for data structures: in Python it may not be idiomatic to crawl dicts like this, but in JS, those aren't dicts, they're objects. If they've been de-serialized to some degree, they may even have their own methods.

By the same token, in Lisp, Sexprs aren't a serialization format. They're a notation for the linked cons cells that Lisp data is made of. In Lisp, that Sexpr will be crawled for data, or maybe even executed.

So while in Python, both may seem to be serialization formats, they aren't.

Either way, if the application programmer has any sense, they'll abstract away the format of their data. In a Lisp app, you won't be cdring down a sexpr; you'll be calling a function to grab the necessary data for you, usually from a set of functions that abstract away the underlying sexpr implementation and treat whatever it is as a separate datatype.

Of course, the sexpr might have been fed to an object constructor. Heck, it might be an object constructor, or a struct constructor. All of those types typically provide O(1) access, and autogenerated access functions, so it's the same story.
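
A rough Java analogue of that "call accessors, don't crawl the raw structure" idea (the Config name and keys are invented; the map stands in for whatever the parser produced):

    import java.util.Map;

    final class Config {
        private final Map<String, Object> raw;  // whatever the reader/parser handed back

        Config(Map<String, Object> raw) { this.raw = raw; }

        // Callers never touch the raw structure; if the representation changes,
        // only these accessors need to change.
        String host() { return (String) raw.get("host"); }
        int port()    { return ((Number) raw.get("port")).intValue(); }
    }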


A notation for a data structure is a serialization format.

> In Lisp, that Sexpr will be crawled for data, or maybe even executed.

An s-expression cannot be executed; it's just text.

The object which it denotes can be walked or executed via eval.

Before that happens, the s-expression must be converted to that object.

In other words, deserialized by the reader.


Good point, kaz. I meant that the notation signifies a specific set of general data structures, unlike XML, which doesn't specify the data structure used in memory AFAIK, only the actual tree structure of the data itself.


In that s-expressions are a notation for computation, they can be executed by an interpreter. This is what we refer to as execution. Even assembly is like this; there's no other reasonable way for it to work right now.

This feels like you're splitting hairs to no obvious benefit other than increasing confusion. I could as easily say "mathematical notation isn't math it's just text, you can't evaluate '1 + 2' without a human being because otherwise those are just marks on the page." This is true(ish) (with the correct escaping), but it's difficult for me to see how it's relevant to the discussion. We could imagine a situation where I've created the Texpr, which has slightly different notation but the same properties. I don't know that we would necessarily classify it differently or treat it differently.

This leads to the alternate conclusion that maybe the s-expression is the underlying set of objects in the interpreter/compiler. (S-expressions are a special type of linked list, in that case.) This rings true(er) to me, because of the way that we talk about s-expression manipulation in Lisps. We most certainly are not using string operations to generate them. In which case, the thing with the many parentheses is merely standard Lisp syntax, not s-expressions. This is further justified by the existence of different syntax in Dylan or Clojure, and by the availability of reader macro manipulation as its own entity.

TL;DR: it most certainly is not text! Nor can it be executed.


> mathematical notation isn't math it's just text

That is correct. It's just text which talks about concepts that don't have text, like transcendental numbers, infinities, infinitesimals and so on.

There are functions in mathematics that can't be written down in symbols at all, like the integrals of certain functions (which themselves can be written down).

Math text has some useful properties in that certain transformations you can think of as typographical (manipulations of the text) actually preserve semantic properties in a useful way. So for instance addition commutes, semantically; and in the text, this lets us swap the left piece of text for the right one, around the plus sign.

There can be a very close correspondence between typography and semantics (like in Douglas Hofstadter's "TNT": typographical number theory, which he uses to explain Gödel's incompleteness theorem).

> This leads to the alternate conclusion that maybe the s-expression is the underlying set of objects in the interpreter/compiler.

I assure you that it isn't; not in any main-stream Lisp interpreter or compiler.

(Where by "main-stream Lisp interpreter or compiler", I intend to rule out cute hacks like this:

https://news.ycombinator.com/item?id=7956246 )


No, kaz has a point. Do not confuse the shadow for that which cast it: Sexprs are a serialization format/notation for sets of conses. Most of the time, saying so is splitting hairs, but it bears mentioning here, as we're discussing serialization formats.


JSON is a serialization format that was based on the data structure notation for Javascript. It is, however, a serialization format. Javascript objects are a superset of JSON, as they can contain arbitrary objects and functions, which JSON can not, and "true" Javascript object notation can elide quote marks or use apostrophes for keys, whereas JSON strictly specifies double-quotes around keys.

The problem that arises in Python and other dynamically-typed languages is that there exists a default deserialization that is so good it very strongly tempts the programmer to use it exclusively. However, as good as that default deserialization may be, it's also quite dangerous, for the reasons you mention and more. In strongly-typed languages there's a stronger focus on parsing the JSON instead, which has the advantage of producing objects with stronger guarantees (which isn't quite the same as pointing out that there's stronger typing here; you could theoretically get the same guarantees in Python with some sort of JSON schema library or something), but has the disadvantage of generally being more challenging, since it's hard to beat the conciseness of "json.loads(s)" in Python. There generally is a default deserialization in strongly-typed languages too, but it's far more likely to become inconvenient if you need anything beyond simple numbers and strings, and people generally learn to prefer true deserialization in my experience, unless they, alas, start their program out from day 1 inputting and outputting JSON, accidentally structure their entire program around the default JSON structures, and end up with the exact same problems as you'd get in Python. But as long as JSON isn't the very first thing to go in, you're generally in better shape.
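
As a side-by-side sketch (Jackson again, with a made-up SearchResult type): both routes are one-liners, but everything downstream of the first one deals in maps of maps.

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.IOException;
    import java.util.Map;

    class SearchResult {           // made-up DTO for illustration
        public String query;
        public int hits;
    }

    class Loading {
        static final ObjectMapper MAPPER = new ObjectMapper();

        // The tempting default: convenient, but untyped all the way down.
        static Map<String, Object> loadLoose(String json) throws IOException {
            return MAPPER.readValue(json, new TypeReference<Map<String, Object>>() {});
        }

        // "True" deserialization: downstream code gets a typed object.
        static SearchResult loadTyped(String json) throws IOException {
            return MAPPER.readValue(json, SearchResult.class);
        }
    }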

(I have witnessed a Java program primarily written as a map of string to map of string to map of string to string. It was unsalvageable. And for all the cynicism I may occasionally muster, I don't say that often, because refactoring can be pretty powerful in Java, but this was beyond help. It actually had no JSON in sight, but the same fundamental forces were in play.)

Personally, despite preferring the more strongly typed approach in most ways, I must confess that when I'm working in Perl I am generally unable to resist the temptation to just JSON::XS::decode_json, and cover over the differences with unit testing rather than dealing with "true" deserialization. I make myself feel better by also telling myself that if I do anything else, I will confuse my fellow programmers who don't generally expect to see fancy deserialization routines when dealing with JSON, which is true enough, but in my heart I still know guilt.


> In strongly-typed languages there's a stronger focus on parsing the JSON instead

I think you mean _statically_ typed languages here. Python is a strongly typed language.


The definition of "strongly-typed" that Python conforms to is a useless one, because very few weakly-typed languages exist anymore, and even those are only very, very partially "weakly typed" by having operators that are defined to do automatic coercion, generally only between strings and numbers, which isn't even the way the original "weakly typed" was meant. The only truly "weakly typed" language I know of that is still extant is assembler/machine language, which can never really go away, where all you ever have are numbers, and thus absolutely nothing stops you from adding a string pointer to the first element of some structure. (Modulo some distinctions that still may exist between floats and integers and such... even assembler isn't as weak as it used to be, though it is still by no means a strongly-typed language.)

So I don't use the useless definition. By any useful definition of strong vs. weak typing, that is, one that actually creates two or more non-empty, non-trivial sets of members in the universe of discourse, Python is a weakly-typed language.


You can use whatever definition you want in your own head, but you cannot expect anyone else to accept it.

Furthermore, one of the most popular languages in use today is weakly typed: JavaScript!


Here's a plausible definition: strongly-typed languages prevent programs from escaping the type system (e.g. putting an integer into memory, then treating it as a float, or vice versa, as in the famous Quake fast-inverse-sqrt hack). Oh, look, Python is strongly-typed and C is weakly-typed.
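
For reference, the hack in question reinterprets a float's bit pattern as an integer. In C that's done by punning a pointer; in a language like Java you can only get there through an explicit, checked API, which is roughly the distinction being drawn. A sketch (not the original Quake code):

    // Treat the float's bits as an int, fiddle with them, and treat the
    // result as a float again - but only via explicit conversion calls.
    static float fastInverseSqrt(float x) {
        int i = Float.floatToRawIntBits(x);     // read the float's bits as an int
        i = 0x5f3759df - (i >> 1);              // the famous magic-constant step
        float y = Float.intBitsToFloat(i);      // reinterpret those bits as a float
        return y * (1.5f - 0.5f * x * y * y);   // one Newton-Raphson refinement
    }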

Note that I do not endorse using "strongly-typed" to mean this definition, or any other. There are no useful definitions of this phrase; don't use it, except when correcting people.


I guess you need a term for "typing of moderate strength"


I'm not sure I prefer the strongly typed approach: I come from Lisp, where the approach is "read it, validate it, and then wrap it in functions to hide the implementation in case we change it."

This works well in Lisp, where the line where objects end and structs, lists and functions begin is hazy at best. Besides, with a bit of wrangling, you could probably just pass your validated serialized data to the object constructor as the arguments. Or you could just write a struct, which is simpler than an object, provides O(1) access, and you can still probably easily pass your data structure, or something close to it, into the constructor as the arglist.


What are serialization formats, then? What makes them different from notations for data structures?


Not the GP, but one difference is that you can have a notation that can't capture some state: think of the Date class in JavaScript. JSON can't serialize this without resorting to string encoding, whereas something like a protobuf or pickle could.


What's really being described there, I think, is the notion that you can serialize and unserialize some data and get, to some extent, the "same" data back.

Whilst that's more possible with protocol buffers or pickling (or whatever your language calls it), I can't think of any languages offhand which can round-trip any data. It's generally not possible to serialize objects denoting external resources - such as open file handles or network sockets. It's also often not possible to serialize closures, weak references (without dereferencing them), and not necessarily possible to serialize self-referencing objects: e.g. a list which contains itself - pickle can handle it, but I don't believe protocol buffers can do it.
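
A small Java illustration of the external-resources point (the file path is arbitrary): the built-in object serialization simply refuses an open file handle, because FileInputStream isn't Serializable.

    import java.io.*;

    class RoundTripDemo {
        public static void main(String[] args) throws IOException {
            try (FileInputStream handle = new FileInputStream("/etc/hostname");
                 ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(handle); // throws java.io.NotSerializableException
            }
        }
    }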


Some extended sexpr notations (particularly those used in Common Lisp, although many Schemes also support it, as do other lisps) support self-referential data structures.

fds and sockets CAN, IIRC, to some degree, be sent to other processes on the same machine, but it's fairly limited.

As for serializing closures, CHICKEN Scheme's s11n egg is the most prominent, although not the only, example. It's fairly limited, once again, to avoid sending the forest with the banana, as Joe Armstrong would put it.

This has nothing to do with our discussion, I just thought it was cool.


That means JSON is missing support for some kinds of data, that doesn't mean that it's not a serialization format at all. Just one with less descriptive power.


That's fair, I guess it's not really a different class but a different shade of the same thing.


First off, make no mistake, JSON and sexprs ARE serialization formats. I didn't establish this well in my original comment. However, they're designed to be a notation for specific structures: dicts and arrays in JSON's case, and cons cells, specifically Lisp code, in the case of sexprs. This is unlike, say, XML, which defines a tree hierarchy but NOT what the underlying structures are. That's the parser's job.


From the official source: "JSON (JavaScript Object Notation) is a lightweight data-interchange format. "

http://json.org/

There's nothing there about pressing it into service. This is the authoritative source.


From the official source: "Democratic People's Republic of Korea"

http://www.korea-dpr.com/

It's democratic, and for the people. It says nothing about being a totalitarian dictatorship.

Not everything is/does what it says on the tin. You don't have to agree with official or otherwise authoritative sources without question.

(not that I wish in any way to compare Mr Crockford or whoever runs json.org with the DPRK or its leadership - I'm just using a deliberately extreme example to highlight that what is written may not be what is, at least not from absolutely everyone's point of view)


I hope you feel clever, because in context, this is far too inapplicable to the discussed reality of JSON to be an actual point.


> I hope you feel clever

I generally do, thanks, though I'm not any sort of genius by any measure.

> because in context, this is far too inapplicable to the discussed reality of JSON to be an actual point.

You seem to be missing a bit of an intentional context switch. The comment was more about the logic of the GP's response to "<something> isn't really X" which was "yes <something> is X, it says so on <something>'s home page", than it was about <something> or X in particular.

So it was relevant in the context of the discussion and the facts being used for reference but not, as you call out, in the context of the subject of the discussion (hence my somewhat defensive clarification of intent in the last sentence)


"lightweight data interchange format" is not the same as "serialization formats".

For one, serialization usually handles binary data as well. And types, lots of types. Serialization includes class definitions, etc., which are used to create a (in this case) Python object. JSON is literally JavaScript Object Notation, and has nothing to do with pickling Python.


Yes it is.

Serialization doesn't have to handle a particular type system in its entirety to be serialization.

A serialization scheme can dictate its own type system; which can be smaller than that of the programming languages which support that serialization scheme.

JSON has a simple type system; it serializes that system.

(Might you be confusing serialization for other concepts like object store databases, or image saving?)


That's the source that pressed it into service: JSON is exactly what it says it is: JavaScript Object Notation. Specifically, it's a subset of JavaScript's, well, object notation.

The point is, the syntax behind JSON was originally designed for a specific language, as a textual representation of that language's objects. It just happened to make a convenient serialization format.


Just to be technically correct, because that's the best way to be correct...

JSON isn't a strict subset of JS. JSON strings can contain literal line terminators (the U+2028 and U+2029 line separators); JS string literals cannot.


When using parsers like Jackson or Gson for Java, this process is completely transparent and does not require any active thought from the developer - well, maybe when there are very specific formats that don't map 1:1 with the class that should be instantiated or generated from the JSON object.

It's a bit more tricky in JS, both client-side and in Node: there's no target class to map onto, so you parse the JSON string and then work directly with the resulting object. They're not really OOP languages in that sense. I wouldn't want to work with too much untyped/unstructured JSON in back-end land myself, to be fair.


I've never had good experiences with automated serialisation -- even though it sounds like other people do it with success. What's the secret?

To give you a flavour of the kind of problem: in C# (or rather .NET), Json.NET reads JSON and calls setters on the target class.

That means the setters have to be public, and you don't know what order they will be called in, and you have no real signal about when it is all done. The constructor is no longer enough to guarantee the object's invariants are met.

Most awkward.


It sounds like you're parsing the JSON straight into your business objects, which is the source of the problem. You need an intermediate class which represents a strongly-typed version of the JSON message. So Json.NET goes from a JSON string into this message object, then you write your own code (or, if it works for you, use a tool like AutoMapper) to go from that into your business classes.
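
In Java terms the same split looks roughly like this (names invented; the C#/Json.NET version is analogous): the message class is a dumb mirror of the JSON, and the business class keeps its invariants.

    // Deserialization target: mirrors the JSON, no behaviour.
    class CustomerMessage {
        public String name;
        public String email;
    }

    // Business object: constructed from the message, enforcing invariants.
    final class Customer {
        private final String name;
        private final String email;

        Customer(String name, String email) {
            if (name == null || name.isEmpty())
                throw new IllegalArgumentException("name is required");
            if (email == null || !email.contains("@"))
                throw new IllegalArgumentException("email looks invalid");
            this.name = name;
            this.email = email;
        }

        static Customer fromMessage(CustomerMessage m) {
            return new Customer(m.name, m.email);
        }
    }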


This is what I settled on -- at least in the hard cases. And if I understand his acronyms, it's also what @mythz is recommending.

Perhaps I should have done it for the easy cases as well (where the business objects are struct-like enough that it doesn't matter) and just lived with the boilerplate.

But I see little advantage in this over just having a dictionary that I can inspect to initialise my real business object. True, that is not strongly typed, but the stage between the message object and the business object can have validation errors anyway, so why not treat typechecking as part of that?


In Java land, with Jackson/Gson, they can use the getters/setters or reflection to find the private fields. The only time it is not completely automatic is when some JSON key is mixed-case, e.g. myField1 vs my_field1. Even then, just adding an annotation fixes it. For any special formats, for example ISO 8601 dates, you can quickly define a serializer/deserializer and be done.
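
For instance (field names and the date format are invented), the mixed-case and date cases look roughly like this with Jackson:

    import com.fasterxml.jackson.annotation.JsonFormat;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import java.util.Date;

    class Payload {
        @JsonProperty("my_field1")   // maps the snake_case JSON key onto the camelCase field
        public String myField1;

        // Parsed from / written as an ISO 8601-style string instead of a numeric timestamp.
        @JsonFormat(shape = JsonFormat.Shape.STRING,
                    pattern = "yyyy-MM-dd'T'HH:mm:ss'Z'",
                    timezone = "UTC")
        public Date createdAt;
    }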

Is it really that hard in C#? It is not something I ever think about in Java.


Even beyond that, Jackson can use a private constructor if you use the @JsonCreator annotation on the constructor and @JsonProperty annotations on each parameter.
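
A sketch of that (the Point type is made up); with required creator properties, a missing field fails the deserialization instead of silently becoming null or zero:

    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;

    final class Point {
        private final double x;
        private final double y;

        @JsonCreator  // Jackson will use this even though it's private
        private Point(@JsonProperty(value = "x", required = true) double x,
                      @JsonProperty(value = "y", required = true) double y) {
            this.x = x;
            this.y = y;
        }
    }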


That's because you should be serializing purpose-specific DTOs or clean POCOs, not business objects with behavior.


JSON.NET can use private setters. They just have to exist.

You do have to use thin constructors, but in JSON.NET there's a way to call a method post-deserialization.


Yes, automatic serialization is not a solution to the most pressing problems presented by the article -- it's just the first of the things that have to be done at the boundary.

You have some DTO class that is your system's typed idea of the structure of the JSON -- this class is quite useful as implicit documentation, but it really has to stay internal to the boundary. You use an auto-deserializer to populate such a class, and then you continue by constructing, from the deserialized data, the real object that can be presented to the rest of the application. During that construction you can validate state and return errors.

This step can be eased by some validating attributes on the boundary DTO properties, but there is always some custom logic that describes what is acceptable and what is not.


Automated serialization has gotten much better than it was in the bad old days of RPC and COM!


I have nothing good to say about COM, but I'm seriously thinking about gRPC [0] to get away from the sloppy JSON endpoints we code around at work today. Before I dive in, I would love to hear what it is that makes that architecture a bad one.

[0] http://www.grpc.io/


Automated serialization is the devil. Gson and Jackson require you to write EJB-style objects - default constructors, with getters and setters for each field - to get automatic serialization.

The problem with this approach is that you've completely abdicated the power of the type system to ensure that your objects are valid. What happens if a field is missing from the JSON? Well, that field just becomes null. So now you have one of two options:

1) Write highly defensive code with null-checks everywhere. This is a pain to write, a pain to read, and almost impossible to get right and actually prevent null pointer exceptions. This is a nightmare. Switching to a null-safe language like Kotlin doesn't really help you beyond making sure that you actually code in all the null checks - the code is still ugly and a pain to maintain.

2) Call a (potentially) expensive verification method at the beginning of each method call for your object. This is less error prone than having null checks everywhere, but it's not much of an improvement. Because verification happens not at object creation time but rather when it's used, you'll find yourself with a verification exception at the entrance to some business logic where the JSON was passed to your system a week ago, immediately stored in a schema-less ORM, retrieved now, so you kind of have an idea that you have some client which didn't populate the field, but you have no idea which of the many, myriad versions of the client is responsible. So now you're fucked, and you're doubly fucked if you're losing data because of it.

Or you could just take advantage of type safety and write immutable object factories which refuse to instantiate invalid objects. Then you can write clean code using objects which you know must be valid because of type system guarantees. Libraries like immutables.github.io make this a piece of cake.
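
A hand-rolled sketch of that last option (the DateRange type and its rule are invented; libraries like Immutables generate the equivalent for you):

    import java.time.LocalDate;

    // Once you hold a DateRange it is valid and stays valid, so downstream
    // code needs no null checks or verify() calls.
    final class DateRange {
        private final LocalDate start;
        private final LocalDate end;

        private DateRange(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        static DateRange of(LocalDate start, LocalDate end) {
            if (start == null || end == null || end.isBefore(start))
                throw new IllegalArgumentException("need start <= end, both non-null");
            return new DateRange(start, end);
        }
    }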


> Gson and Jackson require you to write EJB-style objects to get automatic serialization

Not the case. I successfully used Jackson combined with Lombok to achieve some really nice DRY class definitions that Just Worked with Jackson. It took a little figuring out and a couple bugfixes to Jackson but it worked. That said, part of the hassle was that I insisted on being able to do this with @Wither so we could have the objects be immutable too.

Then you can write stuff roughly like

    @Value
    public final class Thing {
        String name;
        int age;
        boolean boiling;
    }

Although IIRC I had to use some other random set of Lombok annotations instead of @Value to get it to work right with Jackson (this was a while ago, I don't recall the details).

> The problem with this approach is that you've completely abdicated the power of the type system to ensure that your objects are valid.

Yeah, this was a big problem with my approach. The other choice would have been to use @Builders instead of @Withers. Then you get a little more boilerplate (having to type .build()), but you can guarantee the built objects meet consistency requirements. (In retrospect, I doubt I chose the right tradeoff there)


Automatic serialization is not the devil; overly forgiving automatic serialization is the devil. I use JSON serialization libs all the time in Scala which properly support optional vs required fields. For Java devs, Gson has bad required/optional support [0] but Jackson does have it for creator properties [1]. It is important to qualify statements like your initial one to include the specific situation in which it is bad, instead of painting with a broad brush.

[0] https://github.com/google/gson/issues/61

[1] http://static.javadoc.io/com.fasterxml.jackson.core/jackson-...


It's a language/framework issue. C# supports first-class properties with get/set semantics. In your controller method (action) you would write something like this:

    public List<CustomerModel> Get(SearchRequest request)
    {
        .....
        return customers;
    }

Somewhere else in the (configurable) pipeline, the framework can decide how to deserialize the SearchRequest and how to serialize the List<CustomerModel> based on the Accept header.

(CustomerModel/Request would not be business objects. They would only be used on the API layer.)

As for validation: you could just put attributes on the properties of the request, like [Required], and they would automatically be validated before your Get method is called. Of course, if the types don't match, the framework would send the appropriate error.


> Automated serialization is the devil. Gson and Jackson require you to write EJB-style objects - default constructors, with getters and setters for each field - to get automatic serialization.

Gson does not require this. The following class will serialize and deserialize fine with Gson:

    class Example {
      private final int foo;
      private final String bar;
    
      private Example(final int foo, final String bar) {
        this.foo = foo;
        this.bar = bar;
      }
    }
More complicated cases will require custom serializers and deserializers, but any class that defines only basic data types (including collections) works just fine.


> Automated serialization is the devil. Gson and Jackson require you to write EJB-style objects - default constructors, with getters and setters for each field - to get automatic serialization.

I've used Gson fine with Scala case classes.

Although now I just use Scala JSON libraries, which do not suffer from the two problems you list at all.


Jackson allows you to create immutable objects without any problems, with final fields initialized in the constructor. It also supports numerous annotations that will throw an exception when a field is missing, etc. You just have to know your tool and use it properly, that's all.


If you're just using Python for simple scripting and a random failure now and again isn't going to ruin your day, it's fine to just use json.loads, IMO. I've written quite a few scripts where the time it would take to do it 'right' wouldn't be worth the effort.


I think this might be where tools like Flow or TypeScript can become useful, since you can have typed JavaScript objects.

On the other hand, the fact that there's no runtime typecheck renders static analysis somewhat impotent when it comes to the result of a network call.


...unless your application is doing an in-place edit.

For instance, if your image compression application throws out my EXIF data that it doesn't understand, I'm going to be pissed. (Unless you give me an option to preserve it.)


> This right here is the correct approach. Serialisation formats should be serialisation formats, ... application data should be application data.

True, although the OP seems to be advocating having your app pretty much ignore serialization altogether in favor of object-oriented design. In particular, the author objects to the use of dictionaries and lists instead of objects.

It is true that if you're designing an application with a json api in mind, you're likely to stick with the data structures that are easiest to serialize.

Personally, I started writing programs that way before JSON became so common. I did it simply to take full advantage of the native data structures and to avoid prematurely confining myself to an object hierarchy that wasn't a good fit for the problem domain. It also winds up making code more generic and easier to rewrite in a different language if necessary (for example, moving server-side code to client-side JavaScript).


That's the approach I took with my DNS library (https://github.com/spc476/SPCDNS)---extract the DNS packet into a C structure that's easier to deal with (for instance, the A RR structure: https://github.com/spc476/SPCDNS/blob/master/src/dns.h#L270).



