More

slashcom · on May 1, 2019

That is the actual plot of the SheWork episode in Broad City.

nickthegreek · on May 1, 2019

Clip of the SheWork concept.

slashcom · on April 30, 2019

Yann LeCun did not, otherwise he’d be a coauthor. As it is, this was a collaboration between NYU and Facebook AI Research, with multiple authors working at both institutions.

drb91 · on May 1, 2019

My understanding is that academic authorship credit is political: authors don’t always contribute, and contributors don’t always get credit. Is this not the case?

nl · on May 1, 2019

Not really. Usually the politics goes the other way - people getting an author slot because they are the head of the department or something.

drb91 · on May 1, 2019

I believe I covered that—authors who don’t contribute.

Also, I know too many angry grad students to believe that contributors always get credit.

slashcom · on April 24, 2019

I don’t know man, 20% of your annual income would be seen as a sizeable fine. That’s 20% of their yearly profits, and it wipes out most of their earnings for Q1.

xvector · on April 24, 2019

On the long-term, it's still nothing at all for Facebook. Their stock will bounce back.

paulcole · on April 24, 2019

20% of 100,000 is much more significant than 20% of 100 billion dollars.

slashcom · on March 27, 2019

And replaced by residual connections in transformers, which are absolutely dominating LSTMs now.

stochastic_monk · on March 27, 2019

Transformer-XL uses recurrence, and most NLP SOTA is still with LSTMs. I’m not sure I’d expect attention mechanisms to fully replace recurrence.

slashcom · on March 4, 2019

Interestingly, we have 3 in North America: East, West and Texas.

https://en.m.wikipedia.org/wiki/Texas_Interconnection

igravious · on March 4, 2019

The article you linked to says: “The two major and three minor NERC Interconnections, and the nine NERC Regional Reliability Councils.” and it also says, “The Texas Interconnection is one of the three minor grids in the continental U.S. power transmission grid. The other two minor interconnections are the Quebec Interconnection and the Alaska Interconnection.”

snowwindwaves · on March 4, 2019

And quebec

slashcom · on Feb 15, 2019

There’s a natural way to parallelize these models so that using 128 GPUs is the same as a 128x batch size. You can similarly simulate 128x batch size by accumulated gradients before backpropping. So you can test on just one or a few GPUs before you run the full thing.

By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning.

There’s been enough research leading up to this paper to suspect that just scaling larger would play out.

paraschopra · on Feb 15, 2019

Thanks.

>By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning.

This can't be true in all cases, right? I'm assuming that for many initially promising results on less-compute when they scale it, the results aren't impressive. I'm very curious to know what is the trials-to-success rate of publishable results when big-compute is thrown in the mix.

slashcom · on Feb 15, 2019

It’s indeed a very high trials to success ratio. Again though, there’s enough papers preceding this one that you could have good confidence in the effort. Another thing that helps is orgs like OpenAI have their own servers, rather than renting ec2 instances.

You also don’t just launch that many things and them ignore it. You monitor it to make sure nothing is going terribly wrong.

But yeah there’s also the fact that if you’re Google, throwing $2m worth of compute at something becomes worth it for some reason (eg Starcraft)

riku_iki · on Feb 15, 2019

I doubt 1.5B params will fit any single GPU. I think they spread parts of models between GPUs/TPUs similarly to mesh-tensorflow: https://arxiv.org/abs/1811.02084

slashcom · on Feb 9, 2019

Top of the document says use a ballpoint pen, which doesn't smear and can't be erased like a pencil.

slashcom · on Dec 31, 2018

An infinitely sized 2 layer NN is universal in the same way a Turing machine is universal — sure you can write any program; God help you if you try.

slashcom · on Dec 27, 2018

So Russia ays they were forced to develop it after Trump withdrew from the nuclear treaty. 2 months is some impressive R&D time, huh?

badrabbit · on Dec 27, 2018

> "Putin says Russia was forced to develop the Avangard after the U.S. withdrew from the Anti-Ballistic Missile Treaty in 2002,"

2002 - 16 yrs

slashcom · on Sept 2, 2018

An easier way to understand it is in the context morphology: word prefixes and suffixes mean things, and words have common roots.

For example, polymorphism could be decomposed into poly-morph-ism. Antidisestablishmentarianism, which is unlikely to appear much in the corpus, becomes anti-dis-establish-ment-arian-ism. Now the system can learn how to reuse "anti-" or "establish" from other examples more easily than trying to learn the full word's meaning from the one or two examples it might see in the corpus.

BPE is a clever way to induce these sort of decompositions automatically without any linguistic annotation, making them useful in multilingual settings. Other languages are much more morphologically rich than English, and there it really benefits.

jstandard · on Sept 2, 2018

Thanks, this makes much more sense than the author's strange example.