Epsilon: The JDK’s Do-Nothing Garbage Collector (oracle.com)
141 points by based2 on Dec 8, 2019 | 75 comments



JVM GC development has seen an upswing in the last few years. One of the most interesting new GCs is ZGC, which shows impressive numbers: it delivers pause times under 10 ms for any heap size, often under 1 ms for many applications, while still having impressive throughput. Another interesting adaptation to the containerized world we live in is that GCs have recently started to return unused memory to the OS.

If you want to read more about these new GCs, this is a great post: https://blog.plan99.net/modern-garbage-collection-part-2-1c8...


> Another interesting adaptation to the containerized world we live in is that GCs have recently started to return unused memory to the OS.

I'm wondering if there are any security implications here. Like, a container could figure out the workload of another container probabilistically. Thinking out loud here, but sometimes these side channels are leaky.


Don't forget Shenandoah headed by G1's lead Christine Flood! I don't know enough to compare it to ZGC but it's an exciting time in Java's GC lineup.


They promised low latency in G1, then they promised low latency in Shenandoah, and now they promise it in ZGC.

Okay.

What JVM really really needs for performance is value types: a lot of GC problems will simply not exist with value types.

Project Valhalla started five (!) years ago, and they are still working on it.

The JVM could also use multiple heaps per process (so multiple GCs could run independently, without affecting execution or GC on the other heaps). There were talks about it years ago, but nobody is working on it AFAIU.
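
To make the value-types point concrete, here's a rough sketch (mine, and since Valhalla's syntax isn't final it just fakes the flattened layout by hand): an array of small objects is N separate heap objects for the GC to track, while the flat layout value types would give you is effectively just primitive arrays with nothing for the GC to chase.

  // Sketch: the flat layout value types would give, faked by hand.
  class Point {
      final int x, y;
      Point(int x, int y) { this.x = x; this.y = y; }
  }

  public class ValueTypesSketch {
      public static void main(String[] args) {
          int n = 1_000_000;

          // 1M small objects: 1M allocations for the GC to track and trace.
          Point[] boxed = new Point[n];
          for (int i = 0; i < n; i++) boxed[i] = new Point(i, -i);

          // What a flattened Point[] would look like under the hood:
          // plain primitive storage, no per-element headers or pointers.
          int[] xs = new int[n];
          int[] ys = new int[n];
          for (int i = 0; i < n; i++) { xs[i] = i; ys[i] = -i; }

          System.out.println(boxed.length + " " + (xs.length + ys.length));
      }
  }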


I always had good experiences with G1 generally.

ZGC and Shenandoah are Oracle's and Red Hat's respective projects with similar goals. I like all the options and activity around GC, personally.


G1 has low latency. And as someone who programs c# for a living: I DO NOT WANT value types.


Interesting, could you expand a bit? From what I have seen, the proposal for Java seems useful.


In c# I dislike the extra mental load of having to work out which rules apply to the lexical thing I see being manipulated in the code. If I see references being assigned in Java, I can tell that's the case just by looking at the code. In c# I have to see the definition of the objects to tell what semantics are being expressed by the syntax. Likewise, with properties in c#, a straightforward-looking field assignment can be pretty much anything, which means, again, innocent-looking code can be misleading and the only way to know is to read the definitions. Java is simpler to read and I value that.


See also Raymond Chen: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98... - who quotes the story of a piece of embedded software which accepted leaking since it would be used in the runtime of a missile.


Similarly, the cooling solution for some missiles' control systems is simply 'thermal inertia', i.e. they just race against overheating, as program termination can be relied upon to occur first.


Sounds like the maintenance shutdowns that Windows would need from time to time.


If the amount of memory needed can be calculated ahead of time, shouldn't it also be possible to allocate it statically?


Related: I ran some benchmarks for all GCs in OpenJDK 12 some time ago: https://github.com/ixy-languages/ixy-languages/blob/master/J...

Epsilon tied on speed but lost to Shenandoah on latency because never freeing memory isn't ideal either, even if you never run out of memory.


Ya if you allocate/deallocate onto a stack, the most recently freed memory is more likely to be hot in the cache for the next allocation, reducing overall latency.

It’s something I do all the time in C++.


It'd be interesting to see a version of that graph with just Epsilon/Shenandoah. It's hard to tell, but it looks like Epsilon may actually have lower average latency, while Shenandoah may have lower jitter & max latency.


Speaking of garbage collectors, I had a thought the other day, wondering if performance could have a huge linear speed up if the graph of object pointers was stored compactly in memory, separated from the rest of an application's memory.

The object graph could be stored in a succinct compressed format to reduce its size as much as possible, compared to actual 64-bit pointers.

Then the GC algorithm could crawl the graph much much faster by virtue of needing to access far fewer memory pages, and increased likelihood that those pages will be in a fast cpu cache most of the time.
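
A minimal sketch of what I mean, assuming objects are identified by small integer indices and the reference graph is kept in a CSR-style side structure (keeping that structure in sync with mutations is the hard part, and isn't shown):

  import java.util.ArrayDeque;
  import java.util.BitSet;

  // Hypothetical sketch: the mark phase walks only these dense int arrays,
  // never touching the objects themselves.
  public class CompactMarkSketch {
      final int[] edgeStart;  // edgeStart[i]..edgeStart[i+1] = outgoing refs of object i
      final int[] edges;      // flat list of referenced object indices

      CompactMarkSketch(int[] edgeStart, int[] edges) {
          this.edgeStart = edgeStart;
          this.edges = edges;
      }

      // Mark everything reachable from the given roots using only the compact graph.
      BitSet mark(int[] roots) {
          BitSet marked = new BitSet(edgeStart.length - 1);
          ArrayDeque<Integer> stack = new ArrayDeque<>();
          for (int r : roots) { marked.set(r); stack.push(r); }
          while (!stack.isEmpty()) {
              int obj = stack.pop();
              for (int e = edgeStart[obj]; e < edgeStart[obj + 1]; e++) {
                  int target = edges[e];
                  if (!marked.get(target)) { marked.set(target); stack.push(target); }
              }
          }
          return marked;
      }

      public static void main(String[] args) {
          // 4 objects: 0 -> {1, 2}, 1 -> {3}, 2 -> {}, 3 -> {}
          CompactMarkSketch g = new CompactMarkSketch(new int[]{0, 2, 3, 3, 3}, new int[]{1, 2, 3});
          System.out.println(g.mark(new int[]{0})); // {0, 1, 2, 3}
      }
  }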


LuaJIT does something vaguely like this: http://wiki.luajit.org/New-Garbage-Collector#arenas_block-ma...

It doesn't need to access the actual data when sweeping.


The latest version of LuaJIT is still using the old GC. As far as I know the new GC design is in limbo, with no plan to implement it in the near future.


IIUC that only moves the mark bits out of the objects. It still has to access the objects while marking.

I think the parent proposed moving pointers out of the objects, so you wouldn't even have to touch the user's objects while marking.


How do you envision compressing the graph?

One thing you can do, if you know that all your objects will live in one arena, and the maximum size of that arena is <= 2^N objects for some N, is to store "pointers" as N-bit rather than 64-bit values. Java does this for N = 32: https://wiki.openjdk.java.net/display/HotSpot/CompressedOops
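
Roughly how the decoding works (a sketch with an arbitrary example base; HotSpot picks the actual base and shift at startup, and can even use a zero base or zero shift for small heaps):

  // 32-bit "compressed oop" -> 64-bit address: objects are 8-byte aligned,
  // so the low 3 bits are always zero and can be dropped, letting 32 bits
  // address up to 32 GB of heap above a fixed base.
  public class CompressedOopSketch {
      static final long HEAP_BASE = 0x0000_0008_0000_0000L; // example base, picked arbitrarily
      static final int SHIFT = 3;                            // log2(8-byte object alignment)

      static long decode(int compressed) {
          return HEAP_BASE + ((compressed & 0xFFFF_FFFFL) << SHIFT);
      }

      static int encode(long address) {
          return (int) ((address - HEAP_BASE) >>> SHIFT);
      }

      public static void main(String[] args) {
          long addr = HEAP_BASE + 0x1_0000_0000L; // an address 4 GB into the heap
          int oop = encode(addr);
          System.out.println(decode(oop) == addr); // true
      }
  }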


Yes storing them as just smaller integers is a main thing. Frequently referenced pointers could also be given an even shorter coding, kind of like entropy coding.


I thought about tackling this while visiting Recurse Center last year. My angle was that this could probably be most practical for a language like Erlang which comes closest to never mutating any runtime datastructure. (You might think of languages like Haskell first, but lazy evaluation cashes out to mutations in the implementation.) I seem to have misplaced my notes, or I'd link to them. Haven't gotten to trying to do this yet.

Someone at RC suggested using succinct data structures and I spent a little while studying them, but they seemed to me like they'd have more niche-y properties than a more obvious approach would.


This is a really cool idea.

Just spitballing, but I suspect the issue would be that you now lose the objects' locality when reading/following/writing references in conjunction with data. The GC algorithm will hopefully be crawling the graph much less often than your program does, so it might be a net loss.

Just speculation!


Yes, I agree. Perhaps the pointers should be stored twice in memory? With the updates to the compact graph being done asynchronously, buffered by a shared lock-free queue? This is a mouthful and starting to sound like an obnoxious design.


Presumably the compact representation would be hot in the cache all the time due to frequent access but it would take up otherwise available cache space.


I think this article about Instagram turning off Python's garbage collecting is somewhat relevant and interesting:

https://instagram-engineering.com/dismissing-python-garbage-...


Read it. It's still garbage collecting, but doing it through reference counting. There was an extra GC for collecting circular references. That one was disabled, as was freeing objects right before termination.


Now, I'm no Java expert, far from it, so I'd appreciate any answers to this. I'm interacting with a bunch of CLIs that are either written in Java or otherwise use the JVM (Clojure mostly); how much of the startup time for these things can be attributed to the GC? It's mentioned in the article that short-running programs (almost all CLIs I use) could use Epsilon since the heap is cleared on exit anyway. But I'm wondering how much time a typical program actually spends on what I guess is initializing the GC?


Application startup time is usually dominated by class loading, JIT compilation and actual application initialization code. GC overhead should be fairly small unless the heap is badly sized. You might be able to shave off a few milliseconds by tuning the GC, but the lion's share is somewhere else.

You could try OpenJ9 for CLI tools, which claims to offer faster startup times out of the box compared to OpenJDK. Tweaking the JIT behavior (number of compiler threads, compilation thresholds for the tiers, etc.) or using CDS can help too.
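
For example, a couple of HotSpot knobs along those lines (the flag names are real; the jar, main class and values are just illustrative):

  # Stop tiered compilation at C1 to avoid spending startup time in C2
  $ java -XX:TieredStopAtLevel=1 -cp app.jar some.cli.Main

  # Limit the number of JIT compiler threads on small containers
  $ java -XX:CICompilerCount=2 -cp app.jar some.cli.Main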


If you have access to the source code, you can try to recompile it with GraalVM Native Image and make it, well, a native executable. It will shave off the load times considerably. But if you have a dozen of them, each will have its own runtime embedded, so you'll waste disk space.


If you have assigned the JVM too little memory for the short-running task, GC might take a significant amount of time, but if the amount of memory is set correctly, the initial GC setup is a fraction of the time compared to the time spent JITting.

Although you might gain some performance, I guess this "GC" is going to be used mostly by people running financial - or other - low-latency analysis/streaming code, where it has been common for years to try to tune the JVM to never even attempt to GC - to avoid latency.

The code in these cases is written to reuse most memory, and when the unreclaimable part grows too big, that cluster node stops taking requests - and is then restarted.
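
A minimal sketch of that reuse style (my example, not anyone's production code): preallocate a pool of mutable objects up front and recycle them, so the steady state allocates nothing and the GC has nothing new to collect.

  import java.util.ArrayDeque;

  public class OrderPoolSketch {
      static final class Order {
          long id;
          double price;
          int quantity;
          void reset() { id = 0; price = 0.0; quantity = 0; }
      }

      private final ArrayDeque<Order> free = new ArrayDeque<>();

      OrderPoolSketch(int capacity) {
          for (int i = 0; i < capacity; i++) free.push(new Order()); // all allocation happens up front
      }

      Order acquire() {
          Order o = free.poll();
          if (o == null) throw new IllegalStateException("pool exhausted - time to drain and restart");
          return o;
      }

      void release(Order o) {
          o.reset();
          free.push(o); // LIFO: the most recently used object is likely still in cache
      }

      public static void main(String[] args) {
          OrderPoolSketch pool = new OrderPoolSketch(1024);
          for (int i = 0; i < 1_000_000; i++) {   // steady state: zero allocations
              Order o = pool.acquire();
              o.id = i; o.price = 100.0 + i; o.quantity = 1;
              pool.release(o);
          }
          System.out.println("done");
      }
  }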


I think the overhead of GC is negligible for many small programs. Claes Redestad @ Oracle gives a good overview here. https://cl4es.github.io/2019/11/20/OpenJDK-Startup-Update.ht...

I would look into GraalVM native images if you want fast startup.


There's some precedent for chopping out the GC for short-lived applications. DMD, the (self-hosting) D compiler, does this, effectively using a push-only stack for the heap, never doing anything akin to freeing memory. [0]

In modern GCs, allocation is already as fast as can be (pointer-bump allocation), so I imagine the only win in chopping out the GC is that you don't need to initialize the GC (it's otherwise roughly equivalent to simply terminating before the GC needs to be invoked).

Perhaps the DMD example isn't quite the same, though, as it's possible its GC has slower allocation than pointer-bump.

[0] https://www.drdobbs.com/cpp/increasing-compiler-speed-by-ove...
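
For illustration, a toy bump allocator along those lines (a sketch of the technique, not DMD's actual code): allocation is just bumping an index into one big preallocated buffer, individual frees never happen, and the whole region is discarded at once or on process exit.

  public class BumpAllocatorSketch {
      private final byte[] heap;
      private int top = 0; // the "bump pointer"

      BumpAllocatorSketch(int size) { heap = new byte[size]; }

      // Returns the offset of a freshly "allocated" block, 8-byte aligned.
      int allocate(int bytes) {
          int aligned = (bytes + 7) & ~7;
          if (top + aligned > heap.length) throw new OutOfMemoryError("arena exhausted");
          int offset = top;
          top += aligned;       // allocation is just an add and a bounds check
          return offset;
      }

      void resetAll() { top = 0; } // "free" everything in O(1); no per-object bookkeeping

      public static void main(String[] args) {
          BumpAllocatorSketch arena = new BumpAllocatorSketch(1 << 20);
          int a = arena.allocate(100);
          int b = arena.allocate(24);
          System.out.println(a + " " + b); // 0 104
          arena.resetAll();
      }
  }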


GC is not the bottleneck in Clojure startup time. There has been a fair bit written about it.

You can use GraalVM to build fast startup Clojure binaries.


CDS and AppCDS, along with jlink, are also great for this.


The published AppCDS + Clojure results show a much smaller speedup and require a higher degree of customization in the build. Like 1.5s -> 0.5s for AppCDS+AOT vs 1.5s -> 0.005s for Graal. And you can just use the clj or leiningen native-image plugins/templates. The minuses of Graal include some compatibility snags and being an Oracle product.


One interesting thing for you may be the Class-Data Sharing feature, which keeps already-parsed class data across restarts and can reuse it with other JVM instances running the same code. It also allows the JVM to share this data in memory with other JVMs running on the same host, so in some use cases it can both speed up startup time and save memory.
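
Roughly what using it looks like (the flags are real; app.jar and Main are placeholders; the dynamic-archive flow needs JDK 13+):

  # Base CDS archive for the JDK's own classes
  $ java -Xshare:dump

  # JDK 13+: record the application's classes into an archive on exit...
  $ java -XX:ArchiveClassesAtExit=app.jsa -cp app.jar Main

  # ...then start subsequent runs from that archive
  $ java -XX:SharedArchiveFile=app.jsa -cp app.jar Main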


I was working with Clojure a lot some years back, and I'm sure there's been a lot of progress in the ecosystem since then. However, what I learned then was that Clojure had an inherent startup overhead, because it had to get the language itself ready. The reason why you don't see this with for example Scala, which also runs on JVM, is because Clojure is very dynamic, as far as JVM languages go. Compare it to Java, for which the JVM was designed. These dynamic qualities come in part at the cost of the startup time.

I was especially frustrated by this, because I had spent some time writing a Clojure program that had to be able to cold-start fast. Decompiling the program's JAR and pruning out unnecessary classes gave a considerable speed-up, but it was not enough.


These days you can cut out most of that with jlink to trim down the jvm, and ClassDataSharing (CDS) and AppCDS. This post goes into it: https://blog.gilliard.lol/2017/10/04/AppCDS-and-Clojure.html


There's a use case in twelve-factor apps where GC pauses would be unacceptable but high availability would allow downtime of an individual stateless app instance. So instead of spending any time GCing, just eat memory, throw it all away, and start over fresh as necessary. With various tricks, an instance can be swapped quickly (start a new instance just before killing the old one)... probably want some sort of user-space "OOM killer" to handle it. ulimits lower than the JVM option limits would work too, but wouldn't have fast restarts without some magic.


You might be replying to the wrong comment, or I'm not making my question clear enough. I'm wondering how much of the startup time the GC currently takes, and if using Epsilon will make startup faster.


Might want to check out https://github.com/facebook/nailgun


To address the short-running program issue, can't the regular GCs just not clean up on program exit?


Setting aside GC, nailgun (JDK <= 8?) and drip already solve(d) the short-running VM problem. This is often how CLI tools like JRuby, ant, mvn, sbt, etc. are sped up.

Also these help reduce load times:

- Class Data Sharing (CDS; JDK 5+) https://docs.oracle.com/en/java/javase/11/vm/class-data-shar...

- Application Class Data Sharing (AppCDS; JDK 10+) https://openjdk.java.net/jeps/310

- Ahead-Of-Time compilation (AOT; jaotc; JDK 9+ Linux-x86_64 only): http://openjdk.java.net/jeps/295

- JVM runtime trimmer (jlink; JDK 9+): http://openjdk.java.net/jeps/282

---

Drip: https://github.com/ninjudd/drip

Nailgun:

https://github.com/facebook/nailgun

http://www.martiansoftware.com/nailgun


I do remember some Perl scripts where the last instruction was to send a SIGKILL to the script's own process. Cut the total runtime of the script almost in half...


Heh.

That script could probably have used POSIX::_exit() to get the same speedup without the calling process thinking it crashed.

uWSGI, a web application container for Python, Perl and other languages has an option for that:

  --skip-atexit-teardown
The teardown time delay comes from:

- Unreferencing objects recursively, therefore tracing all objects.

- Calling destructor functions.

- Potentially doubling idle memory use due to copy-on-write as all the objects are written to.

When a web application server restarts all its child processes, the third item in particular can result in a large spike in memory use.


The post misleads readers into thinking that JVM runs the GC before exit. It does not.

When I was writing the Epsilon JEP, I meant that it might be futile to have a hundreds-of-ms-long GC cycle when the program exits very soon anyway and the heap would be abandoned wholesale. The important bit of trivia is that GC might be invoked long before 'the whole memory' is exhausted. There are several reasons to do this: learning the application profile to size up generations or collection triggers, minimizing the startup footprint, etc. A GC cycle can then be seen as an upfront cost that pays off in the future. With an extremely short-lived job, that future never comes.

Contrived example:

  $ cat AL.java
  import java.util.*;

  public class AL {
       public static void main(String... args) throws Throwable {
          List<Object> l = new ArrayList<>();
          for (int c = 0; c < 100_000_000; c++) {
              l.add(new Object());
          }
          System.out.println(l.size());
      }
  }

  $ javac AL.java

Ooof, 12.5 seconds to run, and about 2 cpu-minutes taken with Parallel:

  $ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseParallelGC -Xlog:gc AL
  [0.015s][info][gc] Using Parallel
  [0.988s][info][gc] GC(0) Pause Young (Allocation Failure) 768M->469M(2944M) 550.699ms
  ...
  [12.281s][info][gc] GC(3) Pause Full (Ergonomics) 1795M->1615M(2944M) 7660.045ms
  100000000

  real 0m12.464s 
  user 1m53.618s
  sys 0m1.087s
Much better with G1, but we still took 11 cycles that accrued enough pauses to affect the end-to-end timing. Plus GC threads took some of our precious CPU.

  $ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseG1GC -Xlog:gc AL
  [0.031s][info][gc] Using G1
  [0.452s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 316M->314M(3072M) 124.119ms
  ...
  [2.518s][info][gc] GC(11) Pause Young (Normal) (G1 Evacuation Pause) 2321M->2324M(3072M) 79.496ms
  100000000

  real 0m2.953s 
  user 0m16.880s
  sys 0m0.872s
Now Epsilon, whoosh, 1.5s end-to-end, and less than 1s of user time, which is probably the only running Java thread itself, plus some OS memory management on allocation path.

  $ time jdk11.0.5/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseEpsilonGC -Xlog:gc AL
  [0.004s][info][gc] Using Epsilon
  ...
  [1.387s][info][gc] Heap: 3072M reserved, 3072M (100.00%) committed, 2731M (88.93%) used

  real 0m1.480s
  user 0m0.830s
  sys 0m0.699s
You might think fully concurrent GCs would solve this, and they partially do, by avoiding large pauses. But they still eat CPUs. For example, while Shenandoah is close to Epsilon in doing the whole thing in about 1.7s of wall-clock time, it still takes quite significant CPU time. That benefit is only there because the machine has spare CPUs to offload the work to.

  $ time jdk11-shenandoah/bin/java -XX:+UnlockExperimentalVMOptions -Xms3g -Xmx3g -XX:+UseShenandoahGC -Xlog:gc AL
  [0.009s][info][gc] Using Shenandoah
  ...
  [0.913s][info][gc] Trigger: Learning 3 of 5. Free (1651M) is below initial threshold (2150M)
  [0.913s][info][gc] GC(2) Concurrent reset 1265M->1267M(3072M) 0.689ms
  [0.914s][info][gc] GC(2) Pause Init Mark 0.111ms
  [1.276s][info][gc] GC(2) Concurrent marking 1267M->1925M(3072M) 361.985ms
  [1.306s][info][gc] GC(2) Pause Final Mark 0.465ms
  [1.306s][info][gc] GC(2) Concurrent cleanup 1924M->1748M(3072M) 0.171ms

  real 0m1.761s 
  user 0m5.688s 
  sys 0m0.633s


> The post misleads readers into thinking that JVM runs the GC before exit. It does not.

Yes, the article is incorrect about this. We’ll make sure it gets fixed.

These numbers are quite interesting. Thanks for doing this analysis!


Perhaps there may be objects that depend on the finalizer callback for correctness. I have seen people use finalizers to do things like close file handles, and presumably not calling close may mean data is not guaranteed to be persisted.


At least on unix systems, process termination implicitly calls close() on every file descriptor anyway. There should be no need to call it explicitly.

(You won't get the chance to log any write errors reported by close() or react to the errors, though.)


Finalizers were deprecated two years ago, and might be removed altogether in the not-too-distant-future.


Who would do that? Effective Java specifically says not to rely on finalizers.


It's not an issue? It's one of the cases where it does make sense to use Epsilon, as the heap is cleared anyway on program exit.

From the post:

> There is a strong temptation to use Epsilon on deployed programs, rather than to confine it to performance tuning work. As a rule, the Java team discourages this use, with two exceptions. Short-running programs, like all programs, invoke the garbage collector at the end of their run. However, as JEP 318 explains, “accepting the garbage collection cycle to futilely clean up the heap is a waste of time, because the heap would be freed on exit anyway.”


Yes, my point is, why do you have to explicitly select the no-op GC for this purpose, when the default GC could already behave this way?


The default GC's do already behave this way. They don't run GC on exit. Why on earth would they?


I believe you are correct, but it's worth pointing out that the post above is from an Oracle blog and seems to suggest they do run on exit.


Why do you need to clean up if your program is about to exit?


Memory might need to be cleaned up if the program was being run embedded in something else (it's not unheard of to embed JVMs inside e.g. C++ applications, and it's very common in scripting languages to do this).

Additionally, global destructors, while not guaranteed, can be very helpful if you let them run rather than just exiting and letting the system clean up file descriptors: for example, a clean disconnect from a database is often faster overall (on the database side, e.g. freeing up a connection slot) than a dirty "client hasn't phoned in for awhile/received unexpected FIN" disconnect via hard-exit.


> Memory might need to be cleaned up if the program was being run embedded in something else

Just unmap the heap pages. Don't run the GC!

> global destructors, while not guaranteed, can be very helpful if you let them run

If you want them to run on exit then you want Runtime.runFinalizersOnExit, not the GC. Finalizers are non-deterministic, asynchronous, and would take an indefinite number of GC cycles to run them for all objects.


Hence my question.


I still don't understand your question, sorry:

> can't the regular GCs just not clean up on program exit

Why do you need to clean up your memory - at all, GC or otherwise - on program exit? The process and all its resources will be gone.


I think the concern is those resources might be external and not cleaning up correctly leaves them in an inconsistent state. Not saying this is best practice but I've seen it done.


Finalisers are not guaranteed to be called by GC in theory, and in practice they run asynchronously even if they are going to be called, so aren't likely to be called if you GC and then exit.

So how does calling GC help with anything?


I agree with you about how it's not reliable. I just remember the Rust community going through this same kerfuffle with their Drop trait not being guaranteed, not too long ago.


That was four and a half years ago; time flies!


But regular GCs DO run at the end of the program, so my question was, why don't they just skip the final cleanup.


Unless I'm missing something, the only reason you'd run a GC cycle on termination is to invoke finalizers.

If the platform doesn't feature finalizers, or if it is known that they've not been used, there'd be no point at all.


> But regular GCs DO run at the end of the program

Which GCs do that? I'm not aware of any.


The article (apparently falsely) claims that default JVM GC does that:

"Short-running programs, like all programs, invoke the garbage collector at the end of their run."


The question was rhetorical; you and OP are saying the same thing.


That's an interesting concept but unfortunately, garbage collection is not as simple as that.


It'd be nice to have a per-thread mode for this.


Probably good for running Java on your cruise missile.


Peter Lawrey said he had HFT clients who wrote Java and managed to get their object allocation so low that they never had a GC pause, and just waited until market close to restart the JVM.


A number of games written in managed runtimes will pre-allocate a large number of objects at the start of each level/zone, and hope they don't run out before the next level (where a GC can run).



