JVM GC development has seen an upswing in the last few years. One of the most interesting new GCs is ZGC, which shows impressive numbers: it delivers pause times of less than 10 ms for any heap size, often less than 1 ms for many applications, while maintaining impressive throughput. Another interesting adaptation to the containerized world we live in is that GCs have recently started to return unused memory to the OS.
> Another interesting adaptation to the containerized world we live in is that GCs have recently started to return unused memory to the OS.
I'm wondering if there are any security implications here. Like, a container could figure out the workload of another container probabilistically. Thinking out loud here, but sometimes these side channels are leaky.
They promised low latency in G1, then they promised low latency in Shenandoah, and now they promise it in ZGC.
Okay.
What the JVM really, really needs for performance is value types: a lot of GC problems will simply not exist with value types.
Project Valhalla started five (!) years ago, and they are still working on it.
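To make that concrete (a hedged illustration, not Valhalla's actual design; the class and field names are made up):

// Why value types would help the GC, illustrated with today's workaround.
public class Flattening {
    record Point(double x, double y) {}

    // Today: each element is its own heap object, i.e. a header, a pointer
    // slot in the array, and one more node for the GC to trace.
    static Point[] boxed = new Point[1_000_000];

    // The manual flattening that value types would automate: plain
    // primitive arrays, zero per-point objects, nothing for the GC to trace.
    static double[] xs = new double[1_000_000];
    static double[] ys = new double[1_000_000];
}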
The JVM could also use multiple heaps per process (so multiple GCs could run independently, without affecting other heaps' execution and GC). There was talk about it years ago, but nobody is working on it AFAIU.
In C# I dislike the extra mental load of having to work out which rules apply to the lexical thing I see being manipulated in the code. If I see references being assigned in Java, I can tell that that's the case just by looking at the code. In C# I have to see the definitions of the objects to tell what semantics are being expressed by the syntax. Likewise, with properties in C#, a straightforward field assignment can be pretty much anything, which means, again, that innocent-looking code can be misleading, and the only way to know is to read the definitions.
Java is simpler to read and I value that.
Similarly, the cooling solution for some missiles' control systems is simply 'thermal inertia', i.e. they just race against overheating, as program termination can be relied upon to occur first.
Ya, if you allocate/deallocate on a stack, the most recently freed memory is more likely to be hot in the cache for the next allocation, reducing overall latency.
It'd be interesting to see a version of that graph with just Epsilon/Shenandoah. It's hard to tell, but it looks like Epsilon may actually have lower average latency while Shenandoah may have lower jitter & max latency.
Speaking of garbage collectors, I had a thought the other day, wondering if performance could see a huge linear speed-up if the graph of object pointers was stored compactly in memory, separated from the rest of an application's memory.
The object graph could be stored in a succinct compressed format to reduce its size as much as possible, compared to actual 64-bit pointers.
Then the GC algorithm could crawl the graph much, much faster by virtue of needing to access far fewer memory pages, with an increased likelihood that those pages will be in a fast CPU cache most of the time.
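For what it's worth, here's a minimal sketch of what I have in mind (entirely hypothetical, with 32-bit object indices standing in for 64-bit pointers):

// Hypothetical side table: the reachability graph as compact 32-bit
// indices, so a mark pass touches far fewer cache lines than chasing
// 64-bit pointers scattered through the heap.
public class CompactGraph {
    final int[][] edges;    // edges[i] = indices of objects referenced by object i
    final boolean[] marked;

    CompactGraph(int objectCount) {
        edges = new int[objectCount][];
        marked = new boolean[objectCount];
    }

    // Iterative mark phase that only ever reads the side table.
    void mark(int root) {
        java.util.ArrayDeque<Integer> stack = new java.util.ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            int obj = stack.pop();
            if (marked[obj]) continue;
            marked[obj] = true;
            if (edges[obj] != null) {
                for (int child : edges[obj]) stack.push(child);
            }
        }
    }
}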
The latest version of LuaJIT is still using the old GC. As far as I know the new GC design is in limbo, with no plan to implement it in the near future.
One thing you can do, if you know that all your objects will live in one arena, and the maximum size of that arena is <= 2^N objects for some N, is to store "pointers" as N-bit rather than 64-bit values. Java does this for N = 32: https://wiki.openjdk.java.net/display/HotSpot/CompressedOops
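Roughly, the trick amounts to this arithmetic (a simplified sketch of the idea, not HotSpot's actual code; the heap base is made up, and with 8-byte alignment a 32-bit value covers 2^32 * 8 bytes = 32 GB):

// Sketch of compressed references: a 64-bit address inside a <= 32 GB
// heap is stored as a 32-bit, shift-compressed offset from the heap base.
public class CompressedRef {
    static final long HEAP_BASE = 0x0000_7000_0000_0000L; // hypothetical
    static final int ALIGN_SHIFT = 3; // 8-byte object alignment

    static int compress(long address) {
        return (int) ((address - HEAP_BASE) >>> ALIGN_SHIFT);
    }

    static long decompress(int compressed) {
        // zero-extend the 32-bit value, then undo the shift and rebase
        return HEAP_BASE + ((compressed & 0xFFFF_FFFFL) << ALIGN_SHIFT);
    }
}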
Yes, storing them as smaller integers is the main thing. Frequently referenced pointers could also be given an even shorter encoding, kind of like entropy coding.
I thought about tackling this while visiting Recurse Center last year. My angle was that this could probably be most practical for a language like Erlang, which comes closest to never mutating any runtime data structure. (You might think of languages like Haskell first, but lazy evaluation cashes out to mutations in the implementation.) I seem to have misplaced my notes, or I'd link to them. Haven't gotten around to trying this yet.
Someone at RC suggested using succinct data structures and I spent a little while studying them, but they seemed to me like they'd have more niche-y properties than a more obvious approach would.
Just spitballing, but I suspect the issue would be that you now lose the objects' locality when reading/following/writing references in conjunction with data. The GC algorithm will, hopefully, be crawling much less than your program will, so it might be a net loss.
Yes, I agree. Perhaps the pointers should be stored twice in memory, with updates to the compact graph done asynchronously, buffered by a shared lock-free queue? That's a mouthful and starting to sound like an obnoxious design.
Presumably the compact representation would be hot in the cache all the time due to frequent access but it would take up otherwise available cache space.
Read it. It's still garbage collecting, but doing it through reference counting. There was an extra GC for collecting circular references; that one was disabled, as was freeing objects before termination.
Now, I'm no Java expert, far from it, so I would appreciate any answers to this. I'm interacting with a bunch of CLIs that are either in Java or otherwise using the JVM (Clojure mostly); how much of the startup time for these things can be attributed to the GC? It's mentioned in the article that short-running programs (almost all the CLIs I use) could use Epsilon, since the heap is cleared on exit anyway. But I'm wondering how much time the typical program actually spends on, what I guess is, initializing the GC?
Application startup time is usually dominated by class loading, JIT compilation, and actual application initialization code. GC overhead should be fairly small unless the heap is badly sized.
You might be able to shave off a few milliseconds by tuning the GC, but the lion's share is somewhere else.
You could try OpenJ9 for CLI tools, which claims to offer faster startup times out of the box compared to OpenJDK. Tweaking the JIT behavior (number of compiler threads, compilation thresholds for the tiers, etc.) or using CDS can help too.
If you have access to the source code, you can try to recompile it with GraalVM Native Image and make it, well, a native executable. It will shave off the load times considerably. But if you have a dozen such tools, each of them will have its own runtime embedded, so you'll waste disk space.
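Something like this, if memory serves (mytool.jar and the output name are placeholders):

$ native-image -jar mytool.jar mytool
$ ./mytool    # starts in milliseconds, no JVM to boot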
If you have assigned the JVM too little memory for a short-running task, GC can take a significant amount of time, but if the amount of memory is set correctly, the initial GC setup is a fraction of the time spent JITing.
Although you might gain some performance, I guess this "GC" is going to be used mostly by people running financial (or other) low-latency analysis/streaming code, where it has been common for years to tune the JVM to never even attempt a GC, to avoid latency.
The code in these cases is written to reuse most memory, and when the unreclaimable part grows too big, that cluster node stops taking requests and is then restarted.
There's some precedent for chopping out the GC for short-lived applications. DMD, the (self-hosting) D compiler, does this, effectively using a push-only stack for the heap, never doing anything akin to freeing memory. [0]
In modern GCs, allocation is already as fast as it can be (pointer-bump allocation), so I imagine the only win in chopping out the GC is that you don't need to initialize it (otherwise it's roughly equivalent to simply terminating before the GC needs to be invoked).
Perhaps the DMD example isn't quite the same, though, as it's possible its GC has slower allocation than pointer-bump.
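For reference, pointer-bump allocation is essentially the following (a toy sketch; real GCs bump a pointer inside a thread-local allocation buffer):

// Toy bump allocator: one bounds check plus one addition per allocation.
// A push-only heap like DMD's is this, with no reclamation at all.
public class BumpAllocator {
    private final byte[] arena;
    private int top = 0; // next free offset

    BumpAllocator(int size) { arena = new byte[size]; }

    // Returns the offset of a fresh block, or -1 when the arena is full
    // (which is where a real runtime would trigger a GC or an OOM).
    int allocate(int bytes) {
        if (top + bytes > arena.length) return -1;
        int result = top;
        top += bytes;
        return result;
    }
}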
The published AppCDS + Clojure results show a much smaller speedup and require a higher degree of customization in the build: roughly 1.5s -> 0.5s for AppCDS+AOT vs 1.5s -> 0.005s for Graal. And you can just use the clj or Leiningen native-image plugins/templates. The minuses of Graal include some compatibility snags and being an Oracle product.
One interesting thing for you may be the Class-Data Sharing feature, which keeps already-parsed class data across restarts and can reuse it with other JVM instances running the same code. It also allows the JVM to share this data in memory with other JVMs running on the same host, so in some use cases it can both speed up startup time and save memory.
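For the application-level variant (AppCDS), the JDK 11 flow is roughly this sketch (flag names as documented for JDK 11; app.jar and Main are placeholders):

$ java -XX:DumpLoadedClassList=classes.lst -cp app.jar Main    # 1. record which classes load
$ java -Xshare:dump -XX:SharedClassListFile=classes.lst \
       -XX:SharedArchiveFile=app.jsa -cp app.jar               # 2. dump the shared archive
$ java -Xshare:on -XX:SharedArchiveFile=app.jsa -cp app.jar Main    # 3. run from the archive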
I was working with Clojure a lot some years back, and I'm sure there's been a lot of progress in the ecosystem since then. However, what I learned then was that Clojure had an inherent startup overhead, because it had to get the language itself ready. The reason you don't see this with, for example, Scala, which also runs on the JVM, is that Clojure is very dynamic, as far as JVM languages go. Compare it to Java, for which the JVM was designed. These dynamic qualities come in part at the cost of startup time.
I was especially frustrated by this, because I had spent some time writing a Clojure program that had to be able to cold-start fast. Decompiling the program's JAR and pruning out unnecessary classes gave a considerable speed-up, but it was not enough.
There's a use case in twelve-factor apps where GC pauses would be unacceptable but high availability would allow downtime of an individual stateless app instance. So instead of spending any time GCing, just eat memory, throw it all away, and start over fresh as necessary. With various tricks, an instance can be swapped quickly (start a new instance just before killing the old one)... probably want some sort of user-space "OOM killer" to handle it. ulimits lower than the JVM option limits would work too, but wouldn't give fast restarts without some magic.
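A sketch of that setup with stock HotSpot flags (the restart logic itself is left to the orchestrator):

$ java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
       -Xmx4g -XX:+ExitOnOutOfMemoryError -jar app.jar
# Never collects; once the 4g heap is exhausted the JVM exits
# immediately and the orchestrator starts a fresh instance.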
You might be replying to the wrong comment, or I'm not making my question clear enough. I'm wondering how much of the startup time the GC currently takes, and if using Epsilon will make startup faster.
Setting aside GC, nailgun (JDK <= 8?) and drip already solve(d) the short-running-VM problem. This is often how CLI tools like JRuby, ant, mvn, sbt, etc. are sped up.
The post misleads readers into thinking that the JVM runs the GC before exit. It does not.
When I was writing the Epsilon JEP, I meant that it might be futile to have a hundreds-of-ms-long GC cycle when the program exits very soon anyway and the heap would be abandoned wholesale. The important bit of trivia is that the GC might be invoked long before 'the whole memory' is exhausted. There are several reasons to do this: learning the application profile to size up generations or collection triggers, minimizing the startup footprint, etc. The GC cycle can then be seen as an upfront cost that pays off in the future. With an extremely short-lived job, that future never comes.
Contrived example:
$ cat AL.java
import java.util.*;

public class AL {
    public static void main(String... args) throws Throwable {
        List<Object> l = new ArrayList<>();
        for (int c = 0; c < 100_000_000; c++) {
            l.add(new Object());
        }
        System.out.println(l.size());
    }
}
$ javac AL.java
Ooof, 12.5 seconds to run, and about 2 cpu-minutes taken with Parallel:
$ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseParallelGC -Xlog:gc AL
[0.015s][info][gc] Using Parallel
[0.988s][info][gc] GC(0) Pause Young (Allocation Failure) 768M->469M(2944M) 550.699ms
...
[12.281s][info][gc] GC(3) Pause Full (Ergonomics) 1795M->1615M(2944M) 7660.045ms
100000000
real 0m12.464s
user 1m53.618s
sys 0m1.087s
Much better with G1, but we still took 11 cycles that accrued enough pauses to affect the end-to-end timing. Plus GC threads took some of our precious CPU.
$ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseG1GC -Xlog:gc AL
[0.031s][info][gc] Using G1
[0.452s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 316M->314M(3072M) 124.119ms
...
[2.518s][info][gc] GC(11) Pause Young (Normal) (G1 Evacuation Pause) 2321M->2324M(3072M) 79.496ms
100000000
real 0m2.953s
user 0m16.880s
sys 0m0.872s
Now Epsilon, whoosh, 1.5s end-to-end, and less than 1s of user time, which is probably the only running Java thread itself, plus some OS memory management on allocation path.
$ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseEpsilonGC -Xlog:gc AL
[0.004s][info][gc] Using Epsilon
...
[1.387s][info][gc] Heap: 3072M reserved, 3072M (100.00%) committed, 2731M (88.93%) used
real 0m1.480s
user 0m0.830s
sys 0m0.699s
You might think fully concurrent GCs would solve this, and they partially do, by avoiding large pauses. But they still eat CPUs. For example, while Shenandoah is close to Epsilon in doing the whole thing in about 1.7s of wall-clock time, it still takes quite significant CPU time. That benefit is therefore only there because the machine has spare CPUs to offload the work to.
$ time jdk11-shenandoah/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseShenandoahGC -Xlog:gc AL
[0.009s][info][gc] Using Shenandoah
...
[0.913s][info][gc] Trigger: Learning 3 of 5. Free (1651M) is below initial threshold (2150M)
[0.913s][info][gc] GC(2) Concurrent reset 1265M->1267M(3072M) 0.689ms
[0.914s][info][gc] GC(2) Pause Init Mark 0.111ms
[1.276s][info][gc] GC(2) Concurrent marking 1267M->1925M(3072M) 361.985ms
[1.306s][info][gc] GC(2) Pause Final Mark 0.465ms
[1.306s][info][gc] GC(2) Concurrent cleanup 1924M->1748M(3072M) 0.171ms
real 0m1.761s
user 0m5.688s
sys 0m0.633s
Perhaps there may be objects that depend on the finalizer callback for correctness. I have seen people use finalizers to do things like close file handles, and presumably not calling close means data may not be persisted.
It's not an issue? It's one of the cases where it does make sense to use Epsilon, as the heap is cleared anyway on program exit.
From the post:
> There is a strong temptation to use Epsilon on deployed programs, rather than to confine it to performance tuning work. As a rule, the Java team discourages this use, with two exceptions. Short-running programs, like all programs, invoke the garbage collector at the end of their run. However, as JEP 318 explains, “accepting the garbage collection cycle to futilely clean up the heap is a waste of time, because the heap would be freed on exit anyway.”
Memory might need to be cleaned up if the program was being run embedded in something else (it's not unheard of to embed JVMs inside e.g. C++ applications, and it's very common in scripting languages to do this).
Additionally, global destructors, while not guaranteed, can be very helpful if you let them run rather than just exiting and letting the system clean up file descriptors: for example, a clean disconnect from a database is often faster overall (on the database side, e.g. freeing up a connection slot) than a dirty "client hasn't phoned in for a while/received unexpected FIN" disconnect via hard exit.
> Memory might need to be cleaned up if the program was being run embedded in something else
Just unmap the heap pages. Don't run the GC!
> global destructors, while not guaranteed, can be very helpful if you let them run
If you want them to run on exit then you want Runtime.runFinalizersOnExit, not the GC. Finalizers are non-deterministic, asynchronous, and would take an indefinite number of GC cycles to run them for all objects.
I think the concern is those resources might be external and not cleaning up correctly leaves them in an inconsistent state. Not saying this is best practice but I've seen it done.
Finalisers are not guaranteed to be called by the GC in theory, and in practice they run asynchronously even when they are going to be called, so they aren't likely to run if you GC and then exit.
I agree with you about it not being reliable. I just remember the Rust community going through this same kerfuffle with their Drop trait not being guaranteed, not too long ago.
Peter Lawrey said he had HFT clients who wrote Java and managed to get their object allocation so low that they never had a GC pause, and just waited until market close to restart the JVM.
A number of games written in managed runtimes will pre-allocate a large number of objects at the start of each level/zone, and hope they don't run out before the next level (where a GC can run).
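That pattern is essentially an object pool; a hedged sketch of the idea (names made up):

// Pre-allocate everything up front and recycle; steady-state gameplay
// then allocates nothing, so the GC has nothing to do until the pool
// is torn down between levels.
public class BulletPool {
    static final class Bullet { double x, y, vx, vy; boolean live; }

    private final Bullet[] pool;

    BulletPool(int capacity) {
        pool = new Bullet[capacity];
        for (int i = 0; i < capacity; i++) pool[i] = new Bullet();
    }

    // Reuses a dead bullet; returns null when the pool is exhausted
    // (i.e. we "ran out before the next level").
    Bullet acquire() {
        for (Bullet b : pool) {
            if (!b.live) { b.live = true; return b; }
        }
        return null;
    }

    void release(Bullet b) { b.live = false; }
}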
If you want to read more about these new GCs, this is a great post: https://blog.plan99.net/modern-garbage-collection-part-2-1c8...