
DL4J contributor here. I spoke with our team, and the differences in performance are likely explained by array ordering. ND4J, for good reason, requires F ordering because of limitations in cuBLAS. While I haven't had the opportunity to closely examine the Neanderthal comparison (I'm also not a Clojure user), the likely explanation is that an implicit ordering conversion is affecting the results.

Deeplearning4j is written around F ordering and ND4J supports this. Admittedly, our ordering API is not obvious to the average user.

Here's an example test you can run yourself that demonstrates ordering: https://gist.github.com/raver119/92b615704ca1bf169aa23a6a6e7...

  o.n.i.TensorFlowImportTest - Orders: CCC; Time: 11532 ns;
  o.n.i.TensorFlowImportTest - Orders: CCF; Time: 2101 ns;
  o.n.i.TensorFlowImportTest - Orders: CFC; Time: 10202 ns;
  o.n.i.TensorFlowImportTest - Orders: CFF; Time: 1960 ns;
  o.n.i.TensorFlowImportTest - Orders: FCC; Time: 10744 ns;
  o.n.i.TensorFlowImportTest - Orders: FCF; Time: 1717 ns;
  o.n.i.TensorFlowImportTest - Orders: FFC; Time: 10097 ns;
  o.n.i.TensorFlowImportTest - Orders: FFF; Time: 1716 ns;
We also profiled the above test and confirmed that converting from F to C ordering adds significant overhead. I can share screenshots if anyone is interested.
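The ordering effect above is not specific to ND4J. A rough NumPy analogy (illustrative only, not ND4J code) shows why requesting C ordering from an F-ordered result costs a full-array copy, while staying in one ordering is essentially free:

```python
import time
import numpy as np

n = 1024
# F-ordered result, as a column-major BLAS/cuBLAS call would produce it.
result_f = np.asfortranarray(np.random.rand(n, n))
# One-time C-ordered copy, for comparison.
result_c = np.ascontiguousarray(result_f)

def bench(fn, reps=20):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

# Requesting C ordering from an F-ordered array forces a full copy;
# requesting it from an already-C array just returns the array as-is.
t_copy = bench(lambda: np.ascontiguousarray(result_f))
t_noop = bench(lambda: np.ascontiguousarray(result_c))
print(f"F->C copy: {t_copy:.2e}s, already-C: {t_noop:.2e}s")
```

The same logic explains the C-vs-F timing gaps in the gist: whenever the requested output ordering differs from what the underlying BLAS call produced, a conversion copy lands on the hot path.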


While this is a good explanation, keep in mind that:

1. This is the benchmark that you provided, so while this might not be obvious to the average user, it also seems not to be obvious to the above-average user who wrote the benchmark. I was just using what you proposed, assuming that you used the right thing in your library.

2. It still does not explain how you got better performance with ND4J using the same non-optimal call, which is what started the discussion and inspired this post.

3. Neanderthal supports both row- and column-oriented ordering with cuBLAS at the same performance, and won't have the problems you mention for ND4J.

I'm, of course, interested in following up on this. Please decide which cases you'd like to compare, post the (optimal) code and the ND4J and Neanderthal numbers you get, and I'll respond with my comments.


It looks like, while converting from my benchmarking code, you dropped the 'f' when creating the result array.

https://github.com/treo/benchmarking_nd4j/blob/master/src/ma...

The difference is rather huge with the newer versions of nd4j.

The following gists don't include the measurements I took for Neanderthal, but they do contain the numbers I got for ND4J.

Without f ordering: https://gist.github.com/treo/1fab39f213da26255cf4f75e383ff90...

With f ordering: https://gist.github.com/treo/94fe92c9417b5c8b24baa12924a35b0...

As you can see, something happened between the 0.4 release (which I took as the comparison point, since that was the last time I ran my own benchmarks) and the 0.9.1 release that introduced additional overhead.

Originally I planned to write this up myself, but I first wanted to find out what happened there.

Given that ND4J is mainly used inside DL4J, and that the matrix sizes it is used with are usually rather large, the overhead I've observed for tiny multiplications isn't necessarily that bad, as the newer version performs much better on larger matrices.


You're right. In that particular case, ND4J approaches Neanderthal's speed. But only in that particular case, and even then ND4J is still not faster than Neanderthal. My initial quest was to find out whether ND4J can be faster than Neanderthal, and I still couldn't find a case where it is.

Although, in my defense, the option in question is very poorly documented. I found the ND4J tutorial page where it's mentioned, and even after re-reading the sentence multiple times, I still can't connect its description to what it (seems to) actually do. It also does not mention that the option affects computation speed.

Anyway, I'm looking forward to reading your detailed analysis, and especially seeing your Neanderthal numbers.


Do you have any pointers on how you profiled Neanderthal during development?

When I originally set out to compare ND4J and Neanderthal, I ran into the issue that I quickly bottomed out: they both basically call MKL (or OpenBLAS) for BLAS operations.


Fair point, and one we are fixing now: https://github.com/deeplearning4j/deeplearning4j-docs/issues...

We will be sending out a doc for this by next week with these updates. Thanks a lot for playing ball here.

Beyond that, can you clarify what you mean? Do you mean just the gemm op?

If so, that's the only case that mattered for us. We will be documenting the what/how/why of this in our docs.

Beyond that, I'm not convinced the libraries are directly comparable, given the difference in sheer scope between them.

You're treating ND4J as a gemm library rather than a fully fledged NumPy/TensorFlow equivalent with hundreds of ops and support for things you would likely have no interest in building.

A big reason I built ND4J was to solve the general use case of building a tensor library for deep learning, not just a gemm library.

Beyond that - I'll give you props for what you built. There are always lessons to learn when comparing libraries and making sure the numbers match.

Our target isn't you, though; it's the likes of Google, Facebook, and co., and tackling the scope of tasks they face.

That being said - could we spend some time on docs? Heck yeah we should. At most we have Javadoc and examples. We tend to help people as much as we can when profiling.

Could we manage it better? Yes for sure. That's partially why we moved dl4j to the eclipse foundation to get more 3rd party contributions and build a better governance setup. Will it take time for all of this to evolve? Oh yeah most definitely.

No project is perfect and always has things it could improve on.

Anyways - let's be clear here. You're a one-man shop who built an amazingly fast library that scratches your own itch for a very specific set of use cases. We're a company and community tackling a wider breadth of tasks, trying to focus more on serving customers and adding things like different kinds of serialization, Spark interop, etc.

We benefit from doing these comparisons, and it forces us to better document things we normally don't pay attention to. This little exercise is good for us. As mentioned, we will document the limitations a bit better, and we'll make sure to cover other topics like allocation as well as the BLAS interface.

Positive change has come out of this, and I'd like to thank you for the work you put in. We will make sure to re-run some of the comparisons on our side.


Sure. I agree. You as a company have to look at your bottom line above all. Nothing wrong with that.

Please note that Neanderthal also has hundreds of operations. The set of use cases where it scratches itches might be wider and more general than you think.

The reasons I'm showcasing matrix multiplications are:

1. That's what you used in the comparison.

2. It is a good proxy for overall performance. If matrix multiplication is poor, other operations tend to be even poorer :)

Anyway, as I said, I'll be glad to compare other operations that ND4J excels at, or that anyone thinks are important.

I would also like to see ND4J compared with TensorFlow, NumPy, PyTorch, or JVM-based MXNet.


Yeah, we definitely need to spend some more time on benchmarks after all is said and done.

That being said, while gemm is one op, there's a lot more to it than the JNI back-and-forth. What matters here are also things like convolutions, pairwise distance calculations, element-wise ops, etc.

There's nuance there.

There are multiple layers here to consider:

1. The JNI interop managed via JavaCPP (relevant to this discussion)

2. Every op has allocation vs. in-place trade-offs to consider

3. For our Python interface, we have yet another layer to benchmark (we use Pyjnius for Jumpy, the Python interface for ND4J)

4. Op implementations for the CUDA kernels and the custom CPU ops we wrote (that's where our AVX-512 and AVX2 jars matter, for example)

For the subset we are comparing, it's basically a matter of making sure we wrap the BLAS calls properly. That's definitely something we should be doing.

We've profiled that and chose the pattern you're seeing above, with F ordering.
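For background on why a wrapper has to commit to an ordering at all: cuBLAS and classic BLAS are column-major, so a row-major library either converts its arrays or exploits the identity (AB)^T = B^T A^T to avoid copies. A minimal NumPy sketch of that identity (an illustration, not the actual ND4J wrapper code):

```python
import numpy as np

# Row-major inputs, as a C-ordered library would hold them.
a = np.random.rand(3, 4)
b = np.random.rand(4, 2)

# A column-major BLAS sees a C-ordered buffer as the transpose of the
# logical matrix, so computing B^T A^T in column-major terms yields
# (AB)^T -- i.e. AB laid out row-major -- with no conversion copies.
c_via_transpose = (b.T @ a.T).T

assert np.allclose(c_via_transpose, a @ b)
print("transpose trick matches direct product")
```

Whether a wrapper uses this trick or keeps everything F-ordered end to end, the point is the same: the ordering decision has to be made once, deliberately, in the gemm wrapper.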

That is where we are fast and what we chose to optimize for. You are faster in those other cases and have laid that out very well.

Again, there's still a lot that was learned here and I will post the doc when we get it out there to make that less painful next time.

You made a great post here and really laid out the trade offs.

I wish we had more time to run benchmarks beyond timing our own use cases; if we had a smaller scope, we would definitely focus on every case you're mentioning here. We will likely revisit this at some point if we find it worthwhile.

In general, our communications and docs can always be improved (especially for our internals, like memory allocation).

Re: your last point, we do do this kind of benchmarking with TensorFlow. For example: https://www.slideshare.net/agibsonccc/deploying-signature-ve... (see slide 3, and the broader deck for an idea of how we profile deep learning apps on the JVM)

We need to do a better job of maintaining these things, though. We don't keep them up to date and don't profile as much as we should. It has diminishing returns after a certain point versus building other features.

I'm hoping a CI build to generate these numbers is something we get done this year, so we can both prevent performance regressions and have consistent figures to publish in the docs.

Once the Python interface is done, that will be easier to do and justify, since most of our "competition" is in Python.


Here's our updated benchmarks. Thanks to Dragan for the cooperation!

Benchmarking ND4J and Neanderthal (Scientific Computing in Java and Clojure)

https://www.dubs.tech/blog/benchmarking-nd4j-and-neanderthal...


When will you just admit that the initial claim about ND4J being faster than Neanderthal was bogus?


The amount of work we've done this year on Deeplearning4j performance has been much higher than in previous years. We brought DL4J up to par with community standards while maintaining the advantages of Java. I think what a lot of people don't realize is that a ton of effort has gone into ETL and integration tooling.

It's very difficult to train on multiple GPUs while maintaining performance of ETL. ETL is a scary hidden bottleneck.

I'm very interested to see how Eclipse can continue to push development. I think the people who will especially benefit from this are devops/production teams operationalizing data science.


As someone who works mostly in data warehousing, ETL has a very specific meaning to me (Extract, Transform, Load). Is this the same thing you're talking about?


Yes. One of the core libraries within the DL4J project is DataVec, which is ETL-focused. One key problem that we discovered - and fixed - was that reading and transforming data for training could bottleneck a multi-GPU process. You spend a lot of $$$ on a deep learning computer, but making the library performant enough that you could load data at the same rate the GPUs consume it was challenging. This scales to about 4+ GPUs (depending on data type), and we're building a DataVec server so it can scale much larger. There are still good returns if you clean and transform your data and presave it to disk, which helps with large machines such as a DGX. However, other bottlenecks still apply (which we are solving right now).

I hope that answers your question. I consider the process of extracting records, transforming them, and loading them for training to be "ETL". I understand ETL also applies to other data consumption.

*I should also note that if you want to use DataVec for ETL without training a deep learning model, it is quite useful for columnar data.


Why mix ETL and training?

I am using TF, and in my workflow I first do all the ETL in a separate process, dump all training/validation data into a TFRecord file, and then my training program consumes it. Clear separation of concerns without any performance penalty.

And I can iterate over training logic with various parameters as many times as I want without touching ETL.
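The two-step workflow described above can be sketched generically. This is a minimal stand-in using pickle in place of TFRecord, with hypothetical names; it is not TensorFlow code:

```python
import os
import pickle
import random
import tempfile

def etl(raw_records):
    # Stand-in transform: raw record -> (feature vector, label).
    for rec in raw_records:
        yield ([float(x) for x in rec["values"]], rec["label"])

def presave(raw_records, path):
    # Step 1 (a separate process in practice): run ETL once, dump to disk.
    with open(path, "wb") as f:
        for example in etl(raw_records):
            pickle.dump(example, f)

def load_examples(path):
    # Step 2: the training program streams the presaved examples,
    # never touching the raw data or the transform logic.
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

raw = [{"values": [random.random() for _ in range(4)], "label": i % 2}
       for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "train.bin")
presave(raw, path)
print(sum(1 for _ in load_examples(path)), "examples ready for training")
```

The separation means any number of training runs can iterate over hyperparameters against the same presaved file, at the cost of re-running step 1 whenever the feature engineering changes.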


Integrations and deployment. We are targeting data engineers with this, not data scientists.

A lot of our user base are people who want to take what the Python folks did to create a feature vector from raw data, import a model (note: this is all in Java), and run the training pipeline and test pipeline in the same place.

The goal is to provide an opinionated set of tools for doing this.

The upside here: it's obvious how to both pre-process data and create ndarrays from it. Tighter integration also lets us make some optimizations under the hood in how memory allocation is handled, and lets us target different data types like images and sound, as well as databases, in the same place.

What I'm guessing here is: you're a data scientist focused on building the models. Someone has to take that code and put it in production. If you work at a startup, production might be "TF Serving". If you're a Fortune 500, you're likely not deploying that. The people we've worked with are usually constrained in some way (especially by the JVM).

You aren't our target audience; your colleagues are.

We'll work backwards from that by adding our Python interface and the like, but largely we want to solve a cross-team concern that arises after models are built.


But what if your training logic needs a change in ETL? Then you have to iterate on the ETL too, so doing it inline makes a lot of sense. Microsoft has done something like this with its SQL Server + R offering. Personally, I find the MS approach quite appalling: you have to load your model in a T-SQL script, so the ETL plus running the model is seamless, but imagine debugging it, and forget about multi-GPU support.


The part of the ETL you'd want to "crystallize" is just the transformation from raw data to feature vector.

Beyond that, you already change the ETL when you experiment. I'm not sure how this changes anything.

Out of nowhere you've sprinkled in "multi-GPU support", which I'm not sure is relevant here. Do you mean as part of training?

We handle the GPU bits for you. All you do is define your transform logic; it runs on one of our backends, like Spark, and then when you go to allocate a tensor: boom, GPU.

There's no special compilation or process needed to make this happen.

DL4J supports multi-GPU training out of the box. All you need to do is use our ParallelWrapper module.

We can also do distributed training with GPUs on Spark (yes, this includes cuDNN).

You've also, for some reason, decided to compare an open-ended coding library, where you can do whatever you want, to a database?

ML baked into database servers is already notoriously bad. The whole point of what we're doing is to provide a middle ground.

Data engineers have to do this anyways.


Adam, sorry, I guess I was not clear. What I said was that the MS SQL Server + R strategy of baking ML into SQL is appalling, and that doing something like what you are doing with DL4J is perhaps the right approach. The multi-GPU rant was about SQL Server + R, not DL4J.

"You've also for some reason decided to attach an open ended coding library where you can do whatever you want to a database?" I didn't catch this part.


Right, so I think we agree that the way database servers bake in ML is a bit weird and not the way to go.

The "open-ended coding library" here is DataVec. Comparing them is only semi-valid; the processes there are definitely brittle.


I think he agreed with you..!


Yeah, sorry about that. I was a bit confused about how to read it; there was a lot mixed in there. I think we're clear now :D


Because you might want, as I definitely did, to iterate over the feature engineering as well, not only the network and training parameters. Thus, doing ETL "inline" is really convenient and speeds up your iteration.

It is also easy once you have a proper language with multi-threading support. Since you run it at full speed, with full utilization of the GPU, there is no disadvantage to doing the ETL while training.
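A minimal sketch of that overlap, using a bounded queue so an ETL thread produces batches while the consumer "trains" (the names and the stand-in transform are illustrative, not any particular library's API):

```python
import queue
import threading

def etl_producer(q, n_batches):
    # ETL thread: transform raw data into batches while training runs.
    for i in range(n_batches):
        batch = [x * 0.5 for x in range(8)]   # stand-in transform
        q.put((i, batch))
    q.put(None)                               # sentinel: no more data

def train(q):
    # Consumer: pulls batches as soon as they are ready.
    steps = 0
    while True:
        item = q.get()
        if item is None:
            break
        steps += 1                            # stand-in training step
    return steps

# A bounded queue applies backpressure: ETL never races far ahead
# of what training can consume.
q = queue.Queue(maxsize=4)
t = threading.Thread(target=etl_producer, args=(q, 20))
t.start()
print("trained on", train(q), "batches")
t.join()
```

With a real GPU consumer, the same structure keeps the device fed as long as the producer side can sustain the consumption rate, which is exactly the "inline ETL at full speed" point above.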


FWIW, here are the links for our ETL tool, DataVec (it vectorizes data, or tensorizes it if you prefer): https://github.com/deeplearning4j/datavec

https://deeplearning4j.org/datavec

The thing to remember is that this is ETL focused on machine learning. It's not any old set of transforms; it's transforms that help us normalize, standardize, and finally tensorize various data types, be they images, video, text, or time series.

