Researchers unveil a pruning algorithm to shrink deep learning models (news.mit.edu)
208 points by headalgorithm on May 8, 2020 | 59 comments


This technique figures out how to find a subset of neural network edge weights that can replicate most of the full model's performance, and is indeed quite simple. The catch is that this sparse subset of NN edge weights has no structure, so it doesn't get efficient execution on a GPU. If you're on a CPU with no specialized matrix math, this does in fact cut down on execution costs, but if you have an embedded ML chip, this doesn't really help.
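
Concretely, the kind of unstructured magnitude pruning being described looks roughly like this (just a sketch to illustrate the point, not the paper's actual code) - the surviving weights end up scattered at arbitrary positions, which is exactly why dense GPU kernels can't skip the zeroed-out work:

    // Zero out the fraction `p` of weights with the smallest magnitude.
    // The surviving non-zeros land at arbitrary positions, so a dense
    // matrix-multiply kernel still has to touch every entry.
    function magnitudePrune(weights: number[][], p: number): number[][] {
      const flat = weights.flat().map(Math.abs).sort((a, b) => a - b);
      const threshold = flat[Math.floor(p * flat.length)] ?? Infinity;
      return weights.map(row =>
        row.map(w => (Math.abs(w) < threshold ? 0 : w))
      );
    }

    // Example: prune 50% of a tiny weight matrix.
    console.log(magnitudePrune([[0.9, -0.01], [0.05, -0.7]], 0.5));
    // [[0.9, 0], [0, -0.7]]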

Distillation is another shrinkage technique that is also pretty foolproof, but it allows you to pick an arbitrary architecture - this way, you can make full use of whatever hardware you're deploying to.
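
Rough sketch of the distillation idea, in case it helps (the names and the temperature value here are illustrative, not from any particular library) - the small student is trained to match the teacher's softened output distribution, so you're free to give the student whatever architecture your hardware likes:

    // Softmax with temperature T > 1, which softens the teacher's outputs.
    const softmaxT = (logits: number[], T: number): number[] => {
      const exps = logits.map(z => Math.exp(z / T));
      const sum = exps.reduce((a, b) => a + b, 0);
      return exps.map(e => e / sum);
    };

    // Distillation loss: cross-entropy between the softened teacher and
    // student distributions. (The usual recipe also mixes in the hard-label
    // loss; omitted here for brevity.)
    function distillationLoss(
      teacherLogits: number[],
      studentLogits: number[],
      T = 2.0
    ): number {
      const teacher = softmaxT(teacherLogits, T);
      const student = softmaxT(studentLogits, T);
      return -teacher.reduce(
        (acc, t, i) => acc + t * Math.log(student[i] + 1e-12),
        0
      );
    }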


> The catch is that this sparse subset of NN edge weights has no structure, so it doesn't get efficient execution...

I would love to be able to model each artificial neuron as a separate object in code, making it trivial to group/ungroup neurons into irregularly shaped and/or overlapping groups and easily shrink/expand/alter such groups on the fly without incurring a performance penalty. Easy pruning would be just one application.

As you point out, right now, those things are not practical with frameworks like PyTorch and TensorFlow if we want decent performance. Currently we have little choice but to pigeonhole all our designs explicitly into layers that operate on grid-like structures of fixed size and resort to things like masking so we can have fast multiply-sum operations.

By coincidence I was just talking about this on an unrelated thread: https://news.ycombinator.com/item?id=23118348


> separate object in code

I hope you don't mean object in the OOP sense, but in a more general way, because that would be the slowest thing ever, no matter how smart your compiler/framework.

You'd probably be better off just making each group a copy of the original one.

> not practical with frameworks like PyTorch and TensorFlow

The problem is not the frameworks, the problem is hardware, and physics.

We've hit the wall with clock speeds, we've hit the wall with pipeline depth, we've hit the wall with branch prediction.

The only thing we have left is specialised hardware, parallel hardware, and bigger caches. But all of these only work with SIMD, because it makes data fetches predictable, it makes the cores simpler and it makes adding more compute cheaper (no separate fetch decode silicon required).

For the memory hierarchy to work, you want stuff packed like in ECS systems for games, or matrices for ML.

Even with specialised hardware, changing your neural architecture on the fly would be ridiculously expensive because of all the communication overhead.


The problem is more complex than "just" hardware. If I may paraphrase a recent paper out of Google Brain: the current systems (hardware and software) we use for numerical computing are "stuck in a local basin of performance and programmability."[a]

[a] https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=...


Using a sparse adjacency matrix in combination with regular matrix operations solves the problem of fitting any graph topology into matrices (used in graph neural nets). You can put multiple graphs together in the same adjacency matrix and batch them up for efficient computation.
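
For instance, the batching could look something like this (illustrative sketch, not any particular GNN library) - stacking the individual adjacency matrices along the diagonal gives one big matrix in which the graphs can't interact, so a single (sparse) matrix multiply handles the whole batch:

    // Combine several adjacency matrices into one block-diagonal matrix,
    // so a whole batch of graphs can go through one matrix multiply.
    function blockDiagonal(graphs: number[][][]): number[][] {
      const n = graphs.reduce((acc, g) => acc + g.length, 0);
      const out = Array.from({ length: n }, () => new Array<number>(n).fill(0));
      let offset = 0;
      for (const g of graphs) {
        for (let i = 0; i < g.length; i++) {
          for (let j = 0; j < g.length; j++) {
            out[offset + i][offset + j] = g[i][j];
          }
        }
        offset += g.length;
      }
      return out;
    }

    // Two small graphs (2 and 3 nodes) batched into one 5x5 adjacency matrix.
    const batched = blockDiagonal([
      [[0, 1], [1, 0]],
      [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    ]);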


You would think so, but the problem is more complex than it appears to be: https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=...

There's a big "impedance mismatch" between (1) "programmability" (by which I mean, being able to write high-level code that under-the-hood requires dynamic modification, reshaping, and/or combination-optimization of those adjacency matrices you mention, without worrying about performance) and (2) existing infrastructure (frameworks that rely on highly optimized computation "kernels" that cleverly exploit the memory layout and other features of accelerated hardware, i.e., GPUs/TPUs).


I have thought about that as well. I think it should be pretty straightforward. You need an algorithm that can divide your graph (weights and activations) into batches where no member of a batch depends on values that have not yet been calculated. I think, but am not sure, that this will result in alternating batches of weights and activations.

Then you can calculate the results of applying the weights in the first batch to every example in a minibatch in parallel. Then you move on to calculate the results of applying the activations in the second batch to the results of the first batch, and so on until you have finished all the batches.

The hard part is making the batches, although it feels like the kind of classical algorithms 101 problem that has some great solutions out there already. Maybe it is an application of graph coloring or something.
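
Something like the following sketch is what I have in mind (my own illustrative code) - it's basically a level-by-level topological sort, grouping together every node whose inputs are already available:

    // Nodes are identified by index; `deps[i]` lists the nodes that node i
    // depends on. Returns batches such that every node's dependencies appear
    // in an earlier batch, so each batch can be evaluated in parallel.
    function levelize(deps: number[][]): number[][] {
      const n = deps.length;
      const done = new Array<boolean>(n).fill(false);
      const batches: number[][] = [];
      let remaining = n;
      while (remaining > 0) {
        const batch: number[] = [];
        for (let i = 0; i < n; i++) {
          if (!done[i] && deps[i].every(d => done[d])) batch.push(i);
        }
        if (batch.length === 0) throw new Error("cycle in dependency graph");
        for (const i of batch) done[i] = true;
        remaining -= batch.length;
        batches.push(batch);
      }
      return batches;
    }

    // Example: node 2 depends on 0 and 1, node 3 depends on 2.
    console.log(levelize([[], [], [0, 1], [2]])); // [[0, 1], [2], [3]]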

I have not really pursued this because I'm not sure if there is a benefit to breaking the neurons out of their orderly layers. It would be interesting to just try training a few randomly generated networks, but I haven't had the time so far.

Email me at hi at jehan dot email if you want to talk about this.


I'm not so sure.

Say we want to build a CNN that accepts batches of images of different size without having to resize and/or pad them to the same size. Or say we want to build a convolutional layer in which kernels need not be rectangular and of the same shape and size but can be of different shapes and sizes (e.g., circular kernels that vary in size as we move away from the center of attention). Or say we want to remove connections with low weights (as proposed in this paper) and randomly add new connections between arbitrary individual neurons in different layers between backprop updates.

These are just a few examples off the top of my head of things that in principle would be easy to code in any modern high-level language (e.g., Python, Swift) but look very hard to get to run efficiently using today's software + hardware infrastructure (PyTorch/TensorFlow + GPUs/TPUs).


Just need to find a way to put together batches of operations that can happen in parallel, right?


No. Unfortunately, the problem is more complex than that: https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=...


I’m not super familiar with the technique, but something like NEAT or HyperNEAT might be what you are looking for?


Why can't you model each neuron as an individual object with a unique UID in code?

I am fortunate enough to have a friend who's fairly accomplished in ML and a really nice guy. He spent a video chat + code session with me where he helped me build an NN from scratch in TypeScript; we replicated the network in the paper here:

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagat...

Here's the CodeSandbox link; if you look at the console output, it (approximately) matches the figures from the paper above:

https://codesandbox.io/s/node-typescript-vqszi?file=/example...

I don't really understand much about ML or neural networks, but from a code perspective, if you have these data structures:

    // Incrementing ID generators (assumed here so the snippet stands alone).
    function* makeUidGen(): Generator<number, void> {
      let next = 0;
      while (true) yield next++;
    }
    const neuronUidGen = makeUidGen();
    const connectionUidGen = makeUidGen();

    class Neuron {
      id: number | void = neuronUidGen.next().value; // unique id per neuron
      incoming: Connection[]; // connections feeding into this neuron
      outgoing: Connection[]; // connections leaving this neuron
      rate: number;           // learning rate used during weight updates
      bias: number;
      output: number;         // most recent activation value
      error: number;          // backpropagated error term
    }

    class Connection {
      id: number | void = connectionUidGen.next().value;
      from: Neuron;
      to: Neuron;
      weight: number;
    }
Why couldn't you manipulate these however you see fit? There's a Layer implementation in there (commented out because it doesn't function right) that could roughly accomplish something like this, theoretically, right?


You could. But that's not the hard part. The hard part, for the foreseeable future, is that if you write your code like that, you will find it very difficult to get decent performance out of existing frameworks (PyTorch/TensorFlow) and accelerated hardware (GPUs/TPUs) which operate on neurons arranged as arrays of the same shape. (Oversimplifying a bit.)


But the Neuron and Connection instances are just metadata objects, like pointers that hold arrays; I don't understand.

Surely in PyTorch/TF, the ND-arrays of NNs must also be associated with an object reference?

> existing frameworks (PyTorch/TensorFlow) and accelerated hardware (GPUs/TPUs) which operate on neurons arranged as arrays of the same shape.

    // "squash" is the activation function; the tutorial above uses the
    // logistic sigmoid (assumed here).
    const squash = (x: number): number => 1 / (1 + Math.exp(-x));

    // Dot product of the incoming outputs with their connection weights.
    const dot = (inputs: number[], weights: number[]) => {
      let total = 0;
      const length = Math.min(inputs.length, weights.length);
      for (let i = 0; i < length; i++) {
        total += inputs[i] * weights[i];
      }
      return total;
    };

    // Method on Neuron: input neurons pass their value straight through;
    // other neurons squash the weighted sum of incoming outputs plus bias.
    activate(input?: number): number {
      if (!this.incoming.length) {
        this.output = input ?? 0;
      } else {
        const inputs = this.incoming.map(connection => connection.from.output);
        const weights = this.incoming.map(connection => connection.weight);
        this.output = squash(dot(inputs, weights) + this.bias);
      }
      return this.output;
    }
This is in principle the same concept, right?


Unfortunately, the problem is more complex than that: https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=... (I oversimplified things in my comment above).


Could there be a transpiling step that converts the objects to arrays, targeting PyTorch or TensorFlow?


You first need to demonstrate that the path you wish to take is promising, e.g. by running simulations that take weeks but give better results than our current, fast approaches.

Only then would it make sense to consider a different computational approach.


I'm just speculating (and haven't read the paper yet), but it may be possible to achieve similar speedups on GPUs by pruning the smallest 20% of blocks of size ≥K×K to produce block-sparse weights[0], rather than pruning the smallest 20% of weights.

[0] https://openai.com/blog/block-sparse-gpu-kernels/
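
Roughly what I mean, as a sketch (illustrative code, not OpenAI's actual kernels) - score each KxK block by its total magnitude and zero out the lowest-scoring fraction, so the surviving non-zeros stay in GPU-friendly dense tiles:

    // Prune the fraction `p` of KxK blocks with the smallest L1 norm.
    // Assumes the matrix dimensions are multiples of K, for simplicity.
    function pruneBlocks(w: number[][], K: number, p: number): number[][] {
      const rows = w.length / K, cols = w[0].length / K;
      const scores: { r: number; c: number; norm: number }[] = [];
      for (let r = 0; r < rows; r++) {
        for (let c = 0; c < cols; c++) {
          let norm = 0;
          for (let i = 0; i < K; i++)
            for (let j = 0; j < K; j++)
              norm += Math.abs(w[r * K + i][c * K + j]);
          scores.push({ r, c, norm });
        }
      }
      scores.sort((a, b) => a.norm - b.norm);
      const out = w.map(row => row.slice());
      for (const { r, c } of scores.slice(0, Math.floor(p * scores.length))) {
        for (let i = 0; i < K; i++)
          for (let j = 0; j < K; j++)
            out[r * K + i][c * K + j] = 0;
      }
      return out;
    }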


A month or so ago, a paper about a deep learning model on a CPU beating the best (roughly $4,000) Nvidia GPU was on HN. The trick was a fast hashing technique, similar to nearest neighbour, so the algorithm keeps only the most active neurons. Based on some ballpark estimates in the HN comments, the CPU setup cost about 60% of the GPU's price.

Edited: This is the paper: https://news.ycombinator.com/item?id=22502131

Perhaps both techniques can be used together, since that paper takes advantage of sparsity to beat the GPU when computations are on a sparse subset.


Does a GPU run a pruned network slower than an unpruned one, or just not as much faster as the amount of shrinkage would suggest?


A naive GPU implementation on a non-block sparse network will be just as slow as the full un-pruned network.


Another approach would be to lower precision. By going from 32 bits to 16 or 8 you can get massive speedups in memory operations and computation.
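
As a toy example of the idea (symmetric per-tensor quantization; the names here are illustrative), going to int8 stores each weight in a quarter of the memory of float32:

    // Symmetric per-tensor int8 quantization: scale so the largest
    // magnitude maps to 127, round, then dequantize with the same scale.
    function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
      const maxAbs = Math.max(...weights.map(Math.abs), 1e-12);
      const scale = maxAbs / 127;
      const q = Int8Array.from(weights, w => Math.round(w / scale));
      return { q, scale };
    }

    const { q, scale } = quantizeInt8([0.8, -0.4, 0.05]);
    const approx = Array.from(q, v => v * scale); // roughly [0.8, -0.4, 0.05]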


Nvidia's apex.amp does just this - attempts half-precision training where possible, and adjusts scaling factors during training. I achieved a very significant speedup using this on my RTX 2080 Super, though it doesn't support Windows that well.


Yes, that'll work. But it's orthogonal to this paper.


The trick is to combine these 2 approaches. Quantizing to bytes helps a lot, but forces you to use much larger pruning group size to see any speed up from pruning.


> The catch is that this sparse subset of NN edge weights has no structure, so it doesn't get efficient execution

If you parallelize your computation along the training data dimension, the problem is embarrassingly parallel and the structure of your neural network does not matter one bit.


Having seen a bunch of people fail to make distillation work, in the sense of training a better model than by training the smaller model from scratch, I don't think it's really foolproof...


> The catch is that this sparse subset of NN edge weights has no structure, so it doesn't get efficient execution on a GPU

Could you elaborate a little on this aspect?


I could be off-base here, but I think they’re saying that a sparse matrix (for holding the edge weights) is ultimately something like a list-of-lists format, rather than a dense n x m array (for which GPUs have optimized routines). Except that there are libraries doing sparse matrix operations on GPU (https://developer.nvidia.com/cusparse).
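
For what it's worth, the usual GPU-side layout is CSR (compressed sparse row) rather than list-of-lists. A minimal sketch of the idea (illustrative, not cuSPARSE's actual API) - note how the matrix-vector product has to chase data-dependent indices, which is what makes coalesced memory access hard:

    // Compressed Sparse Row: store only the non-zero values, their column
    // indices, and one row pointer per row into those arrays.
    interface CSR {
      values: number[];
      colIndices: number[];
      rowPtr: number[]; // length = number of rows + 1
    }

    function toCSR(dense: number[][]): CSR {
      const values: number[] = [];
      const colIndices: number[] = [];
      const rowPtr: number[] = [0];
      for (const row of dense) {
        row.forEach((v, j) => {
          if (v !== 0) { values.push(v); colIndices.push(j); }
        });
        rowPtr.push(values.length);
      }
      return { values, colIndices, rowPtr };
    }

    // Sparse matrix-vector product over the CSR layout.
    function spmv(m: CSR, x: number[]): number[] {
      const y: number[] = [];
      for (let i = 0; i < m.rowPtr.length - 1; i++) {
        let sum = 0;
        for (let k = m.rowPtr[i]; k < m.rowPtr[i + 1]; k++) {
          sum += m.values[k] * x[m.colIndices[k]];
        }
        y.push(sum);
      }
      return y;
    }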


How optimized are current neural networks to specific GPU architectures?


If they are simply masking out low-value weights after doing full training, they’ve basically recreated what I worked on for my graduate thesis. In contrast to this work, what we worked on starts from iteration 0 and constantly adapts the lottery mask.

You can reset weights back to their initial value, or set them back to their initial value tied to a decay term. This will slowly zero the initial values to get computational sparsity if needed.

https://arxiv.org/abs/1806.06949

https://open.library.ubc.ca/cIRcle/collections/ubctheses/24/...


Comparing Rewinding and Fine-tuning in Neural Network Pruning

https://arxiv.org/abs/2003.02389


Thank you. Can't believe the article didn't have a link to the paper.


Modern day journalism (even at MIT) doesn't care about citing the sources relevant to their story. It only cares about clickbait headlines.


I thought something as simple as this would have been tried out long ago.

Here are some related ideas. Instead of pruning channels, one should try to prune connections. Say with a 64x64x3x3 conv layer, it might be that the matrix (or tensor) can be brought to "block diagonal" form with a "change of basis", similar to singular value decomposition. Except here we have 9 matrices of size 64x64 that we want to diagonalize simultaneously. This process itself might be formulated as an optimization problem that one solves by gradient descent.


Check out our interview on Machine Learning Street Talk with Jonathan Frankle explaining Rewinding and why you can only reset the Learning Rate rather than the weights! https://www.youtube.com/watch?v=SfjJoevBbjU&t=1177s


Thanks! I really enjoyed the group conversation format and I am looking forward to seeing some more of your content.


Interesting, he thinks "there has not been much progress in pruning NNs in the last 10 years". Maybe he meant that it's often possible to train a non-pruned network of the same size as the pruned network, and achieve a similar accuracy.


As someone not familiar with AI - I'm wondering if this is really as simple and revolutionary as the article states? MIT is known for highly optimistic press releases that sometimes oversell, which makes it hard to know what's actually a real breakthrough.


Yeah, the original work this paper follows up on (by the same group: https://arxiv.org/abs/1803.03635) received a lot of attention in 2018 when it was uploaded to arXiv and spawned a lot of follow-up work. Even though it's been demonstrated that these lottery tickets (sparse trainable subnetworks) exist and can approximate the complete NN arbitrarily well, their properties are still not really understood. What is understood is that these subnetworks depend heavily on the initialization of the network, but that training the entire network together is necessary for generalization. These findings generally advocate for two-stage pruning approaches as opposed to continuous regularization/sparsification throughout training. The question is how best to find these lottery tickets, and how to encourage them from the get-go.

A lot of this work is also related to training adversarially robust networks. A composition of ReLU layers corresponds to a piecewise linear function, where the number of 'pieces' is roughly exponential in the number of neurons. It's well known that standard training of networks results in a highly non-linear piecewise linear function that is easily fooled by adversarial examples. The robustness of a neural network against adversarial examples is typically characterized by its smoothness. One question is how to train the network or prune neurons to encourage smoothness and reduce the complexity (i.e. the number of linear pieces) of the NN.


It is, unfortunately, not revolutionary at all. The Lottery Ticket Hypothesis won a Best Paper Award because the idea is very intuitive and elegant. However, the compression rates (9x) of this follow-up work on ImageNet are far from impressive. Take for example this paper: https://arxiv.org/abs/1810.07378 that reaches 30x compression rate on ImageNet. (Disclaimer: I know the original author and co-published a paper with him - not this one however)


It's like 99% hyperbole, the gain is like 20% versus fine tuning. An interesting idea, but incremental in terms of improvements.


There is a lot of pie-in-the-sky PR filler in the article, and it's very light on details. It likely is as simple as the article states, but probably not very revolutionary. At some point the learned model will cease to be useful.


Sparse stuff is not efficient to compute unless it's really, really sparse, or at least partially dense in large enough chunks that your compute can efficiently do its job. If an L1 hit is counting to three or four, depending on the arch, a full cache miss is counting to 200+. If you miss your cache all the time (which with sparse stuff you will), things get really slow. And that's _before_ you consider that GPU programs can't really do different branches across threads, that non-coalesced memory access absolutely crushes their memory throughput, and that CPUs have to blow out their pipeline on branch misprediction, so you want very predictable branches. It all looks good on paper, but most researchers do not have the engineering chops to validate these ideas in practice properly.


I think these pruning methods do work if you put some effort into engineering them properly.

But I came here to agree about most researchers having no interest in actual inference-time performance. I just tried a library that was meant to be a "drop in replacement" for an embedding table that was meant to use a lot less memory, and after a little bit of fiddling, yes, it was a good drop-in replacement for an embedding table if all you wanted to do was write papers and compute compression ratios using theoretical bits needed. In practice, only the training/eval version of the code was written and nobody had actually written the theoretically possible efficient inference path in the 3 implementations I looked at, and all the numbers in the paper were on theoretical memory savings, so my memory usage actually went up...


I don't see how they would work, TBH. I wrote a good amount of low level, accelerated code for deep learning kernels, and I've yet to see a case where sparse stuff is faster than dense. Moreover, based on my knowledge of low level details, I don't see how typical "academic" pruning can be made fast if you aren't using model-specialized hardware. The way you make things fast on CPU/GPU/TPU is by loading as wide a vector as possible, having as few branches as possible, and helping your prefetcher as much as possible. Sparsity gets in the way of all three, _especially_ on the GPU.


I find the technique very similar to the one presented in [1] from 2018, and I can't seem to find a ref to this paper in theirs.

Another interesting and very related paper [2] that they fail to mention also shows improvement in training time (epochs) with dynamic pruning

And for those wondering how far back this goes, I suggest reading this one [3]

[1] https://www.nature.com/articles/s41467-018-04316-3

[2] https://arxiv.org/abs/1608.04493

[3] https://papers.nips.cc/paper/323-generalization-by-weight-el...


From a biomimicry angle, this may be what happens in the brain during sleep: neural network connections are maintained or pruned, and memories are strengthened.


Why not train a net to prune another net?


PruneGAN


What are the real world tradeoffs here compared to quantization of models? Quantization is super easy but I take it this has some advantages?


Quantization is orthogonal and complementary.


How so? You still aren't saying what the benefits are. OK, to be fair I said tradeoffs, but I'm really wondering what the point of this work is. I mean my models are ridiculously tiny already, even when considered for use on mobile devices, so clearly that's not the win here. What is?


Your models might be ridiculously tiny, but a lot of people's models are not. Take a look at any research paper in vision, speech or language. The models are gigantic.


But quantization solves the problem of models being gigantic. So the question is still unanswered. But hey, you got in your downvote, thanks for that.


> ...and repeat, until the model is as tiny as you want.

Cool! So if I repeat long enough I can get any network down to a single neuron (as long as I really want to)? That is awesome!


Not quite. The Lottery Ticket Hypothesis paper showed that models could shrink to around 10% of their original size without a loss of accuracy [0]. So around a million neurons instead of 10 million.

[0] https://arxiv.org/abs/1903.01611v1


Read the original papers on the perceptron for similar claims.


I mean, yeah... but the performance will suffer accordingly.



