Speeding up Go's built-in JSON encoder for large arrays of objects (multiprocess.io)
110 points by eatonphil on March 3, 2022 | 47 comments



Whenever I know I'm dealing with a large array of objects (and still want them to be user readable/editable) I typically choose to represent them in CSV format, sometimes with a special separation character.

Is there an advantage besides convenience that would lead one to use JSON? JSON requires far more bytes per row to represent large arrays of objects.


No. CSV plays nicely in big data. JSON doesn't.

Tabular data allows you to easily optimize performance and costs. For example, Google decided it was a good idea to export some of its billing data columns as JSON. As a consequence, filtering by a low-cardinality value means parsing a large amount of data because it's in a JSON column, something that would otherwise be cheap with classic columnar compression, where the value is stored once along with the number of subsequent rows that share it. BigQuery bills according to the amount of data parsed.
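
(For illustration, a minimal run-length sketch in Go; the column values are made up and this isn't BigQuery's actual storage format:)

    package main

    import "fmt"

    // run stores a column value once, plus the number of consecutive rows
    // that repeat it, which is what makes low-cardinality filters cheap.
    type run struct {
        Value string
        Count int
    }

    func runLength(column []string) []run {
        var runs []run
        for _, v := range column {
            if n := len(runs); n > 0 && runs[n-1].Value == v {
                runs[n-1].Count++
                continue
            }
            runs = append(runs, run{Value: v, Count: 1})
        }
        return runs
    }

    func main() {
        col := []string{"us-east1", "us-east1", "us-east1", "eu-west1", "eu-west1"}
        fmt.Println(runLength(col)) // [{us-east1 3} {eu-west1 2}]
    }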

The above anecdote doesn't cost much, fortunately, because the billing data isn't large. But in terms of data pipeline development costs, I don't use JSON in the context of big data.


It's not as cut and dried as you claim.

CSV only plays nicely with big data when your CSV encoder and decoder both agree on how to read and write CSV data. The problem is that CSV is not a formal standard; sure, there are documents on how CSV should work, but they've never been enshrined as a standard. This means CSV parsers often differ (like how some JSON parsers break spec by supporting comments, except that where JSON parsers differ it's non-destructive to data integrity, whereas where CSV parsers differ it's hugely destructive).

Common issues that break CSV:

* headings or no headings

* multi line comments not being handled the same

* delimiters differing

* differing support for quotation marks

* how to escape characters, particularly control characters such as quotation marks, new lines and delimiters

* parsing of numbers (a lot of popular CSV editors actually mangle numbers)

* parsing of non-numeric data that superficially appears as numbers (like credit card data, dates, basically anything where zero padding needs to be preserved).

But the worst offence with most CSV parsers is that they'll often silently fail (or silently do the wrong thing), thus garbling your data in ways that can be almost impossible to spot on really large data sets.

You can see a practical example of this problem just by saving a CSV file in Excel :)

JSON might have its warts but it is a better format for cases where you care about preserving the integrity of the data between two independent systems. However, if you're reliant on tabulated data then I might recommend jsonlines https://jsonlines.org/examples/ (ndjson is a very similar spec too). While they have not been formally standardised, they do at least extend JSON in a complementary way, and they solve your complaints about JSON without creating the same problems as CSV (well, aside from the headings problem).
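
For illustration, a jsonlines file is just one self-contained JSON value per line (made-up records):

    {"name": "Gilbert", "score": 24}
    {"name": "Alexa", "score": 29}
    {"name": "May", "score": 14}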

I appreciate that if you're working with large enterprise databases then your hands are likely tied as to which file format you can use. But if you're writing your own routines then jsonlines is definitely a better format for data integrity than CSV.


> how to escape characters, particularly control characters such as quotation marks, new lines and delimiters

I didn't know CSV injection was a thing until it got flagged in a pen test. Then you look around and realize it's a widespread problem and most serializers don't even have an option to escape them.


That’s because technically you can’t escape them:

* CSV only supports one data type: string. Thus formulas are just strings processed as code by some applications based on the content of that string

* CSV doesn't support character escaping. Everything is supposed to be read unescaped. Even new lines are literal new lines; there is no support for C-style escaping. If you need control characters then you wrap your string in quotation marks (and the fact that quotation marks are optional leads to another class of bugs). If you need quotation marks inside a quoted field then you double them up, i.e. to print " inside a quoted field your string would look like

  "Bob said ""hello"""
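
For what it's worth, Go's encoding/csv applies exactly that RFC 4180 doubling when writing, e.g.:

    package main

    import (
        "encoding/csv"
        "os"
    )

    func main() {
        w := csv.NewWriter(os.Stdout)
        // The field is wrapped in quotes and the embedded quotes are doubled,
        // producing: 1,"Bob said ""hello"""
        w.Write([]string{"1", `Bob said "hello"`})
        w.Flush()
    }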


While this might be technically true, I think it's worth pointing out the typical solution: prepend a character (usually either \t or ') to anything that looks like a formula (i.e. anything that begins with @, +, - or =).

The character will be rendered/visible, but that's better than letting Excel execute some arbitrary code.
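
A sketch of that guard in Go (the prefix and the set of trigger characters here are just the commonly suggested ones, not a formal spec):

    package main

    import (
        "fmt"
        "strings"
    )

    // sanitizeCell prefixes anything a spreadsheet might treat as a formula
    // so it is displayed as plain text instead of being evaluated.
    func sanitizeCell(s string) string {
        if strings.HasPrefix(s, "=") || strings.HasPrefix(s, "+") ||
            strings.HasPrefix(s, "-") || strings.HasPrefix(s, "@") {
            return "'" + s
        }
        return s
    }

    func main() {
        fmt.Println(sanitizeCell(`=HYPERLINK("http://example.com","click")`))
        // Output: '=HYPERLINK("http://example.com","click")
    }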


One popular alternative (useful if your data has nested dicts or arrays) is to use newline separated json. This avoids having to encode or parse the whole dataset as a single object, but rather line-by-line, which works nicely with large datasets.
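
A minimal sketch of reading it in Go, decoding one record per line (the records are made up):

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        input := "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n"
        sc := bufio.NewScanner(strings.NewReader(input))
        // Note: bufio.Scanner has a default max line size; call sc.Buffer
        // with a larger buffer if individual records can be very long.
        for sc.Scan() {
            var row map[string]interface{}
            if err := json.Unmarshal(sc.Bytes(), &row); err != nil {
                panic(err)
            }
            fmt.Println(row["id"]) // each line is decoded independently
        }
    }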


I never really understood why anyone thinks this is a good idea. Popular, maybe. Smart? Not so sure.

Instead of just modifying your parser, you change your entire input format to be incompatible with what most of the rest of the world uses (standard JSON).

I get it, it requires slightly more technical skill / effort / etc to implement a proper streaming JSON parser (which BTW will perform WAY better in terms of speed and memory) instead of just writing "my_data.split('\n')". But there are pre-existing libraries that already do that, and a one-time tiny bit of extra work to open up a capability you can reuse anywhere in your stack seems to be a much better option than a lifetime of incompatibility with standard JSON.
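
For example, encoding/json can stream a standard top-level array one element at a time:

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        dec := json.NewDecoder(strings.NewReader(`[{"id":1},{"id":2},{"id":3}]`))
        if _, err := dec.Token(); err != nil { // consume the opening '['
            panic(err)
        }
        for dec.More() {
            var row map[string]interface{}
            if err := dec.Decode(&row); err != nil { // one element at a time
                panic(err)
            }
            fmt.Println(row["id"])
        }
        if _, err := dec.Token(); err != nil { // consume the closing ']'
            panic(err)
        }
    }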


This helps but repeating the keys on each line is still a pain.


Arrays are valid json values.


But that would be a very irregular way to use JSON.


Not even remotely.


It’s not irregular to represent a list of typed objects in JSON as an array of arrays in which the first array is the array of keys in the same order as the corresponding values? Is there a JSON serialization library that actually uses this pattern? I’m not asking to be argumentative, but I genuinely like the idea and would be pleasantly surprised to see it.
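
For concreteness, here's the pattern hand-rolled (I don't know of a library that emits it directly; the field names are made up):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type user struct {
        ID   int
        Name string
    }

    // encodeHeaderFirst writes the keys once as the first inner array;
    // every following array holds the values in the same order.
    func encodeHeaderFirst(users []user) ([]byte, error) {
        rows := [][]interface{}{{"id", "name"}}
        for _, u := range users {
            rows = append(rows, []interface{}{u.ID, u.Name})
        }
        return json.Marshal(rows)
    }

    func main() {
        out, _ := encodeHeaderFirst([]user{{1, "John"}, {2, "Daisy"}})
        fmt.Println(string(out)) // [["id","name"],[1,"John"],[2,"Daisy"]]
    }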


My take, based on principles and reason (but not enough experience, so take it with a grain of salt), is that a JSON document is a message. If it doesn't fit in a message (whatever that means for your use case), you send multiple messages over a stream, or append them to a log file. Encoding-wise, a newline character is a conventional and simple way to separate human-readable messages. I believe this is called JSONL. Golang's json encoder does this by default.
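
For example, each Encode call on a json.Encoder writes one value followed by a newline, which is exactly the JSONL framing:

    package main

    import (
        "encoding/json"
        "os"
    )

    func main() {
        enc := json.NewEncoder(os.Stdout)
        enc.Encode(map[string]int{"id": 1}) // writes {"id":1}\n
        enc.Encode(map[string]int{"id": 2}) // writes {"id":2}\n
    }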

It follows that json is not a database, and the fact that it's tempting to use it as such is no fault of json.

I'm not ready to say json is better than csv, but I dislike that csv has a rigid table structure but no types. It feels like combining sweat pants with a tie. I also dislike that csv has different conventions around quoting and such. I don't feel comfortable editing csvs by hand. It's possible I'm wrong and I just never learnt it correctly, but experience tells me to trust that feeling.


Maybe you want to include metadata without ad hoc parsing rules like "skip the first five lines".


JSON supports nested data and other data types, and your data will be read the same by any system (with CSV it might depend on the parser).

MessagePack might be worth a look as a more efficient json if you're okay with a non-human-readable file.


> sometimes with a special separation character.

My experience is that only leads to problems. Have never seen a good justification for not using RFC 4180 for csv files.


Well CSV only has one datatype: string. JSON has a few more.


Some people will curse me, but I have used JSON-over-CSV a few times, it's a good compromise if your data is indeed mostly tabular.

To be clear, I mean encoding values as JSON, but the overall object structure in CSV, so something like:

    id,name,cool
    1,"John Doe",false
    2,"Daisy \"Obliterator of Worlds\" Wilson", true


I think this would be fine, as long as the CSV layer is still parsable per RFC 4180; then you could use a normal CSV parser for the CSV layer and a normal JSON parser for the JSON layer. My worry with your example is that it is neither format, so it will need custom serialisation and deserialisation logic, as it is essentially a brand new format.

https://datatracker.ietf.org/doc/html/rfc4180
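
A sketch of that layering with the standard library, where encoding/csv keeps the CSV layer RFC 4180-compliant and each cell is a standalone JSON value (columns borrowed from the example above):

    package main

    import (
        "encoding/csv"
        "encoding/json"
        "os"
    )

    func main() {
        w := csv.NewWriter(os.Stdout)
        w.Write([]string{"id", "name", "cool"})

        rows := [][]interface{}{
            {1, `Daisy "Obliterator of Worlds" Wilson`, true},
        }
        for _, row := range rows {
            record := make([]string, len(row))
            for i, v := range row {
                cell, _ := json.Marshal(v) // each cell is valid JSON on its own
                record[i] = string(cell)
            }
            w.Write(record) // encoding/csv re-quotes per RFC 4180 as needed
        }
        w.Flush()
    }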

If you’re looking for line-oriented JSON, another option would be ndjson: http://ndjson.org/


Yeah you're right, I'm not respecting the quote-quoting rule in this example, though IIRC I did in at least one of the actual uses.

ndjson has the disadvantage of repeating column names for every record, which, granted, is basically fine if you are using compression, but sometimes for whatever reason you can't.


This is super interesting. I'd imagine there's a pretty good reason that you picked this over something like an array of arrays of values or something like { "row1": [...], "row2": [...] }, but the only reason I can think of is some kind of external constraint that made CSV necessary. Was it that simple, or was there something else?


Embedded system with limited bandwidth. The data was mostly simple values that did not really need escaping 99% of the time (mostly no unicode or control characters in strings, but still _possible_), and we needed the streamability of array-of-structs instead of struct-of-arrays.

Why not some binary format, then? Well, it's easier to debug and share the data with non-technical people (just open in excel or equivalent) without having to run a conversion step.


You’d be better off using jsonlines than creating a bespoke format that is neither CSV nor JSON.

https://jsonlines.org/examples/


True, I guess that could be a viable alternative!


I am building a web api for voidtools Everything in Go and this post came at the perfect time. The built-in JSON encoder isn't terrible but is noticeably slow. I am dealing with thousands to millions of objects for perspective.


I've been using jsoniter (another 'fastest' json lib) for ages. It is much faster than the built in, so 55% faster than slow isn't really meaningful.

Surprised a more comprehensive benchmark evaluation was not done since this tends to be a pretty sensitive topic.


Unlike other implementations, as far as I can tell, this one composes encoders. Under the hood it can use encoding/json.Marshal or goccy/go-json.Marshal (another existing very fast library) or any other library that implements the json.Marshal call.

This implementation is competitive (sometimes faster, sometimes slower) with good implementations like goccy/go-json on its own, and beats goccy/go-json when composing this library with goccy/go-json.

This composition and goccy/go-json performance is included in the post.
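
A simplified sketch of the composition idea (not the library's exact code; the names here are made up): write the array framing yourself and delegate each element to whichever Marshal you're handed.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    // MarshalFunc is any encoding/json-compatible Marshal, e.g. json.Marshal
    // or goccy/go-json's Marshal.
    type MarshalFunc func(interface{}) ([]byte, error)

    // encodeArray writes the array framing itself and delegates each element
    // to the supplied marshaler.
    func encodeArray(buf *bytes.Buffer, rows []interface{}, marshal MarshalFunc) error {
        buf.WriteByte('[')
        for i, row := range rows {
            if i > 0 {
                buf.WriteByte(',')
            }
            b, err := marshal(row)
            if err != nil {
                return err
            }
            buf.Write(b)
        }
        buf.WriteByte(']')
        return nil
    }

    func main() {
        var buf bytes.Buffer
        rows := []interface{}{map[string]int{"a": 1}, map[string]int{"a": 2}}
        encodeArray(&buf, rows, json.Marshal) // swap in another library's Marshal
        fmt.Println(buf.String())             // [{"a":1},{"a":2}]
    }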

Maybe in a followup post I'll do more benchmarks against other fast libraries. But for this one I wanted to show the process and then just pick one fast library for comparison.

Edit: also, I forgot to mention in the post but some libraries speed up encoding by requiring a fixed schema. DataStation/dsq is extremely dynamic and I'll never know the schema up front. Just another reason why I couldn't use some existing faster libraries.


All I am saying is that the 'fastest' json libs all do comparisons against other 'fastest' json libs.

I'd expect any other json lib trying to be faster to do those same comparisons and not produce arguably clickbait "55% faster" titles since your library isn't really that much faster than goccy.

Pick the ones that match (dynamic) to compare against.


If you're not a fan of the blog post that's cool. But I posted the blog post (and wrote the post in the first place) rather than a Show HN link to the project itself because I thought the process was worth showing as much as the result.


I've known for some time that the JSON process of marshaling is expensive in Go; this post illuminates why and taught me how to test these kinds of things myself. It's excellent and I appreciate you sharing your methodology.


I enjoyed reading it, as someone who writes a lot of go but hasn't done any profiling of code.


> I thought the process was worth showing as much as the result

it was, thanks for sharing.


Would love to see results from incorporating https://github.com/segmentio/encoding/tree/master/json!


To the author: heads up, the first code block in the "Infinite buffer" section doesn't match the text, the text is talking about using bytes.Buffer but the code is for the quoted columns caching. Neat blog post, btw :)


Ooops, fixed. Thank you!


i wonder what % of cpu cycles on AWS are spent encoding and decoding JSON.


Actually, this is horrifying to think about. I'll concede that JSON is the answer for most public HTTP APIs, but it's also used for tons of internal IPC. The energy that could be saved with more efficient alternatives is likely substantial.


While sending data between C++/Go/Java processes as JSON is a waste, when one part is a scripting language, JSON can be more efficient than alternatives.

Recently I benchmarked the extension API in Chromium. Extensions can send arbitrary data from the C++ main browser process to the renderer process running extension JS code. Internally the API uses a binary format, plus it verifies that the data matches the API schema.

It turned out that for complex data, writing the data to JSON on the C++ side, sending the string using the same API and decoding that in JS was faster. The code that reads binary data and converts it to JS, plus the schema verification, was not optimized in Chromium, while the JSON decoder was.


> is the answer

i suspect you mean pragmatically for humans, since there is support for it in most libraries. but for concise equivalent marshalling, i wonder if YAML or TOML etc would be faster to parse. Or Amazon's Ion superset of JSON.


I wonder what % are just idle.


Or running JVM GC. Or on pointer chasing. etc. Most code is just not well optimized.


(De)Serialization costs definitely add up. I've seen measurements where it was 20% of CPU cycles. Even worse, lots of systems out there could get an immediate boost just by swapping libraries.


I think a lot of the older APIs return XML and the SDKs abstract all that away. I imagine that uses even more


XML is an interesting one; on the one hand it's as inefficient as JSON due to encoding data to a text format and back again, but on the other, because it's more formalized, it allows transfer methods like EXI [0] to turn it into a highly efficient binary interchange protocol. In theory. This technology probably came along too little, too late though, since by then JSON was starting to take over.

[0] https://www.w3.org/TR/exi/


What is "large"?


> And generated two datasets: one with 20 columns and 1M rows, and one with 1K columns and 10K rows.

And if that's not big enough for you, the observed effects grow as I increased columns and rows.



