Fwiw this is what Soumith has said: "Internally at Facebook, we have a unified strategy. We say PyTorch is used for all of research and Caffe 2 is used for all of production."
Totally. And I think it's a solid strategy. However, there's certainly an interest (internally and externally) in providing a better interoperability story between the two. I imagine something like using PyTorch for model creation (and possibly training) and running (either on mobile or in the cloud) on some Caffe2 deployment.
It's a good strategy, but it's no silver bullet either. If you're exporting to a "static graph" platform, you're losing a major benefit of PyTorch. If you mostly just care about shipping to production, a case can be made for just using TF/Caffe2/MXNet etc. from the start.
While PyTorch is extremely cool, the fanboyism is out of hand: thinking that what's good for their corner of the universe must be awesome for every use case, and that therefore TF is an overcomplex turd. It's not like the people designing these systems are stupid.
I agree. PyTorch's dynamism is fantastic. However, I have no idea how you'd manage to compile PyTorch code to Caffe2 in a satisfying way. If something is released, I suspect it'd be limited to a subset of PyTorch features (I'd also bet that subset doesn't include the features that make PyTorch compelling).
If you mean that Wolfram advocates programmatic generation of structures, then that's true; the approach is very different, though. These appear to come from a continuous optimisation process, i.e. starting with a "bad" design and iteratively tweaking it. In contrast, Wolfram tends to focus on discrete systems (e.g. cellular automata) and to perform the search exhaustively, more like a form of superoptimisation than numerical optimisation.
We formulate SGNS word2vec as a distributed graph problem, where the nodes are all unique tokens (the dictionary) in the corpus and the edges are defined by skipgrams: for skipgram (w_in, w_out), there is an edge from w_in to w_out.
Tokens are randomly distributed over a set of workers. Each worker iterates over its edges in parallel with all other workers and performs the appropriate computation.
Drawing negative samples is done in two steps. We first draw a worker W from a suitable distribution over the workers and then draw a word from W. The overall word sampling distribution is the same as in the reference implementation (i.e., the unigram distribution raised to the 3/4 power).
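A minimal sketch of what that two-step draw could look like (purely illustrative; the shard layout and names below are my own assumptions, not the paper's code):

```python
import random

# Toy setup: each worker owns a shard of the vocabulary along with corpus counts.
worker_words = {
    0: [("the", 100), ("cat", 10)],
    1: [("sat", 8), ("mat", 5)],
}

# Per-worker total mass under the unigram^(3/4) distribution.
worker_mass = {w: sum(c ** 0.75 for _, c in words)
               for w, words in worker_words.items()}

def draw_negative():
    # Step 1: draw a worker with probability proportional to its total mass.
    workers = list(worker_mass)
    w = random.choices(workers, weights=[worker_mass[x] for x in workers])[0]
    # Step 2: draw a word held by that worker, proportional to count^(3/4).
    words, counts = zip(*worker_words[w])
    return random.choices(words, weights=[c ** 0.75 for c in counts])[0]

# Composed, the two steps sample from the global unigram^(3/4) distribution.
print([draw_negative() for _ in range(5)])
```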
This work will soon be made public [1].
[1] Stergios Stergiou, Zygimantas Straznickas, Rolina Wu and Kostas Tsioutsiouliklis, ``Distributed Negative Sampling for Word Embeddings''. AAAI 2017.
This is a very dense talk, one of Rich's best ever IMHO.
The first point is about how we talk about 'change' in software: it should center around what things 'provide' and 'require'.
Breaking changes are changes that cause code to require more or provide less. Never do that, never need to do that. Good changes are in the realm of providing more or requiring less.
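A toy illustration of that framing (my own example, not from the talk):

```python
# Version 1: requires a width and a height, provides an area.
def area(width, height):
    return width * height

# Breaking change: requiring more -- every existing two-argument caller breaks.
def area_with_units_required(width, height, units):
    return f"{width * height} {units}"

# Good change: requiring less (the new argument is optional) and providing
# more (a formatted result when you ask for it); old callers are untouched.
def area_with_units_optional(width, height, units=None):
    value = width * height
    return f"{value} {units}" if units is not None else value
```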
There is a detailed discussion of the different 'levels' - from functions to packages, artifacts, and runtimes - which he views as multiple instances of the same problem. Even though we now have spec, there's a lot of work left to leverage it across all those different layers to make specific statements about what is provided and required.
I found value in dissecting the different levels of change. For the sake of sanity, though, we should do breaking changes. Breaking changes exist because we have limited capacity, as individuals and as an industry, to maintain software. This is especially true for infrastructure that is supported by (limited) corporate sponsorship and volunteers. Breaking changes limit our window of focus to two or three snapshots of code, instead of having that window grow without bound. Our limited capacity can then still be effective as a library changes over time.
The most important point of this talk is here: "You cannot ignore [compatibility] and have something that is going to endure, and people are going to value" [0]. Breaking changes benefit library developers, but the result is usually damage done to end users. As consumers, we should weigh the cost of keeping up with breaking changes against the quality of a tool and the extra capacity its developers are likely to have.
Agreed. Breaking changes can alienate the user base, but I think there's a danger in lulling people into expecting that kind of constancy in software. It creates a dependency of another kind. Maybe the trick is to vary features at some rate, getting users used to change and bringing them along.
In retail it used to be the case that you could go to the same store a month later and see the same shirt for sale. The Sears catalog [1] presented that sort of constancy for consumers. Today there's a lot of flux, some of it actually engineered to prevent people from delaying purchasing decisions. In software we can and do introduce breaking changes for ease of maintenance, and that can be OK as long as people are used to it. It's making the choice to have a living ecosystem.
Additionally, there are safer, usually reasonable, ways to deal with what would otherwise be breaking changes: give the changed functionality a different name, create a new namespace/module without the removed functionality, or create a new library if you have introduced something fundamentally different (e.g., with respect to how you interact with it). That way your users can choose to refactor their code to use the change, rather than discovering that their expectations no longer match reality when they upgrade.
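For instance, the "changed behaviour gets a new name" pattern might look like this (function names are made up for illustration):

```python
import json

def parse_config(path):
    # Original behaviour: unknown keys are silently dropped.
    with open(path) as f:
        raw = json.load(f)
    return {k: raw[k] for k in ("host", "port") if k in raw}

def parse_config_strict(path):
    # Stricter behaviour lives under a *new* name, so existing callers of
    # parse_config keep getting exactly what they always got.
    with open(path) as f:
        raw = json.load(f)
    unknown = set(raw) - {"host", "port"}
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return raw
```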
Who says you have to maintain the old code? We're talking about simply not deleting it and establishing a discrete semantic for the new version, because, truthfully, a new version is new content, which demands a new name to describe it accurately and precisely. If it didn't, it would be like saying different content doesn't produce a different hash.
You're right, there is no obligation to maintain it. I think that misses the point, though. The value in keeping the code is to allow end users to continue to enjoy improvements in the parts of the library that don't have breaking changes without upgrading the parts that do. You could continue to have security patches installed, for example. That value is much less if you don't do basic maintenance like bug fixes and security patches.
Unless I'm missing something… the answer to that problem is to (a) factor the code sufficiently that you can then (b) create an abstraction (interface) that backs the concrete implementation out to the specifically desired version/functionality.
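Roughly what I read that as, sketched in code (class and function names are mine, purely illustrative):

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """The abstraction callers depend on, rather than any concrete version."""
    @abstractmethod
    def tokenize(self, text):
        ...

class TokenizerV1(Tokenizer):
    def tokenize(self, text):
        return text.split()

class TokenizerV2(Tokenizer):
    # The behaviour change is confined to a new concrete implementation.
    def tokenize(self, text):
        return text.lower().split()

def make_tokenizer(version=2):
    # Callers pin the specific version they want behind the same interface.
    return TokenizerV1() if version == 1 else TokenizerV2()
```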
Hickey's other talk says hard is relative, and I happen to agree, especially when it comes to naming. The question is to what degree of exactness you can confirm what exists (in problems). That is a function of your degree of truthfulness. So it's "hard" only in the sense it's hard to approach 100% truthfulness. However, I have observed that one doesn't need 100%, one needs to be beyond a certain threshold of effective sufficiency. And according to human history, special, rare individuals are born who do exceed that threshold.
It's an interesting question. One reason is Wolfram is extremely talented at language design, which is necessary to build an artifact of this size without self-immolating. Another is that it is a commercial company following a plan. A third is that few people have learned the lessons of Mathematica enough to apply them.
> One reason is Wolfram is extremely talented at language design
It's always a matter of taste when it comes to language design, but I'd have to disagree with this assessment ;-)
> which is necessary to build an artifact of this size without self-immolating
Well, that's certainly not the case. Plenty of huge software artifacts of very impressive quality have been built by non-language-designers.
> Another is that it is a commercial company following a plan
This is certainly true. Or rather, several plans, all of which intersect at common mathematical sub-questions. So then the entire company can leverage effort that's been poured into those components.
> A third is that few people have learned the lessons of Mathematica enough to apply them
Nah. I think the third reason is that Wolfram hires excellent hackers who are also excellent mathematicians. He hires a lot of them. And he puts them to work on the intersectional capabilities I mentioned above.
(Disclaimer: pure conjecture. I've never worked at Wolfram)
While it's useful to have this kind of info, IMHO it's still far from 'infrastructure for deep learning'. What about model versioning? What about deployment environments? We need to address the whole lifecycle, not just the 'training' bit. This is a huge and underserved part of the problem because people tend to be satisfied with having one model that's good enough to publish.
Indeed, deployment is a whole set of interesting issues. We haven't deployed any learned models in production yet at OpenAI, so it's not at the top of our list.
If the data and models were small and training was quick (on the order of compilation time), I'd just keep the training data in git and train the model from scratch every time I run make. But the data is huge, training requires clusters of machines and can take days, so you need a pipeline.
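A crude way to keep something make-like once training gets slow is to key artifacts on a hash of the data and the training code; everything below (paths, script names) is hypothetical:

```python
import hashlib
import pathlib
import subprocess

def digest(*paths):
    # Hash the training inputs so the artifact name identifies them exactly.
    h = hashlib.sha256()
    for p in paths:
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()[:16]

def ensure_model(data="corpus.txt", trainer="train.py"):
    # Retrain only when the data or the code actually changed -- a poor
    # man's build cache for training runs that take days.
    out = pathlib.Path("models") / f"model-{digest(data, trainer)}.bin"
    if not out.exists():
        out.parent.mkdir(exist_ok=True)
        subprocess.run(["python", trainer, data, str(out)], check=True)
    return out
```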
CTO of Algorithmia here. We've spent a lot of time thinking about the issues of deploying deep learning models. There are a whole set of challenges that crop up when trying to scale these kinds of deployments (not least of which is trying to manage GPU memory).
It would be interesting to compare notes since we have deployed a number of models in production, and seem to focus on a related but different set of challenges. kenny at company dot com.
Have you tried Sacred[1]? It definitely doesn't answer the "infrastructure for deep learning" challenge, but it is helpful for understanding what experiments have been run and where a given model came from (including which version of the code/parameters produced it).
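For anyone who hasn't seen it, a Sacred experiment is roughly this small (the hyperparameters and observer directory here are just placeholders):

```python
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("train_model")
# Records config, source, and results for every run.
ex.observers.append(FileStorageObserver("runs"))

@ex.config
def config():
    learning_rate = 0.01  # picked up automatically as part of the run's config
    epochs = 10

@ex.automain
def main(learning_rate, epochs):
    # Real training would go here; the return value is stored with the run.
    return {"final_loss": 0.123}
```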
So true. I've been doodling some tools to somehow manage all of it. So far I only have git-like approaches to models and Chef-like approaches to infrastructure. I hope to somehow bring it all together into a Docker-like package that can be deployed without much hassle.
You might want to check out Pachyderm -- that is essentially what they are trying to do (analytics infrastructure support; it isn't specific to machine learning):
In terms of deploying trained models, you can probably get away with using TensorFlow Serving and letting Kubernetes handle the orchestration and scaling part of the job. I do agree that there is certainly a need for a tool that glues all these different bits and pieces together to improve the process of taking a model from development to production.
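For example, once a model is exported and TensorFlow Serving is running (its REST predict endpoint defaults to port 8501), querying it is a plain HTTP call; the model name and input shape below are placeholders:

```python
import requests

# Assumes a SavedModel exported under the name "my_model" is being served locally.
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}  # one input row; shape depends on the model

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["predictions"])
```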
Agreed. A very interesting and thoughtful post, but I think you are right that OpenAI's primary use cases seem to be (unsurprisingly) academic research and rapid prototyping of new ideas. These emphasize a very different set of problems than, say, deploying something in production or as a service.
Thus, this post seems immensely useful to someone like me (a PhD student, also primarily concerned with exploring new ideas and getting my next conference paper), but I can see how others doing machine learning in the wild or in production might see a lot of questions left unanswered. I, for one, work primarily with health care data from hospital EHRs, and I spend a lot more time on data prep pipelines than folks working with, say, MNIST.
Yes, though of course here Stein is referring to the Wolfram quote that's on slide 28 (roughly: certain kinds of development can't be done in academia) and not the condescending rejection of inquiry about Mathematica's internals from earlier in the presentation.
DevCards is great! Bruce has put a lot of work into making it a really smooth experience, and advocating the benefits of building your components outside of the application first.
Dan Abramov's DevTools[0] with hot reloading and "time travel" (historical debugging) is basically the same thing too, though it's tied to Redux pretty heavily IIRC. So yeah, nothing new. ("TimeWarp OS"[1] was a project developed in the late 80s that did the same thing at the OS level, primarily for physics simulations: something would break, you'd go back to state foo, change parameters mu, delta, sigma to yield foo', and continue the run.)
[1] http://www.cs.nyu.edu/srg/talks/timewarp.pdf -- Brian Beckman, formerly of Microsoft, is a second author. All those Rx features, I'd imagine, were developed in large part by him and De Smet.
I'm missing something, I think. Walk with me through a hypothetical example. I load component todo-list. Dan used integers to mark each modification of the virtual DOM, so let's define this as revision 42, saved-state-0, labelled with the reference "Populated todo-list", after {"quuz" "bar" "baz"} have been added as elements, all boolean, all incomplete.
OK, now you can save a ref to this state, i.e., component 'todo-list' is now rev 51 (on the vDOM), saved-state-1 (with a reference to rev 51), with a rendering label of "Bar complete".
Check off the 2 remaining elements of component todo-list; we now have saved-state-2 (a reference to rev 53) with the label "Tasks completed".
I'm not saying that what you built isn't useful (I'm 100% certain it is!) but I don't see how it's any different from taking an append-only journal and adding bookmarks to save state, though I really could be missing something since I don't work front-end.
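The journal-plus-bookmarks idea, as I understand it, fits in a few lines (sketched here in Python just to be concrete; the names are mine):

```python
class Journal:
    """Append-only log of state snapshots, with named bookmarks into it."""

    def __init__(self, initial_state):
        self.revisions = [initial_state]  # rev 0, rev 1, ...
        self.bookmarks = {}               # label -> revision index

    def commit(self, new_state):
        self.revisions.append(new_state)
        return len(self.revisions) - 1    # the new revision number

    def bookmark(self, label, rev=None):
        self.bookmarks[label] = len(self.revisions) - 1 if rev is None else rev

    def checkout(self, label):
        return self.revisions[self.bookmarks[label]]

j = Journal({"todos": []})
rev = j.commit({"todos": ["quuz", "bar", "baz"]})
j.bookmark("Populated todo-list", rev)
```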
It looks like React Storybook uses a similar set of principles to what you're describing but organizes them for a different use case than Redux DevTools. You definitely could build something like React Storybook using Redux DevTools, but from what I understand React Storybook provides a pre-made standalone app wrapper + server that consumes components and applies state specs ("stories") in a 'standardized' way for browsing - you would have to write your own app to leverage Redux DevTools (or even just plain React since you don't have to keep undo/redo info) for the same purpose.
The fancy part is assembling stories that show all the different states. This is even more powerful if you show them all on screen at once.
Imagine having all the examples below shown on screen, and then editing your component definition and having the hot reloading update them all at once so you can see the effects.
Todo item normal:
[ ] A thing to do
Todo item checked:
[x] -A-thing-to-do-
Todo item editing:
[ A thing to do ]
Todo item hovering:
[ ] A thing to do [del]
Todo list show all:
[ ] ABC
[x] DEF
[x] GHI
3/3 items
Todo list show incomplete:
[ ] ABC
1/3 items
Todo list show complete
[x] DEF
[x] GHI
2/3 items
I think the selling point for devcards/React Storybook et al. is that of a live/visual styleguide of UI components. IMO its raison d'être is that it promotes a UI-component-centric methodology for developing web apps, whereby designers and developers can develop UI components in isolation, away from the cognitive noise of how those components come together in a single monolithic app. That's the innovation here, if you can call it that; the state-travelling stuff is an implementation detail.
To me it's not even non-technical users, or juniors: this is a way of documenting components and their important substates, the same way a CSS styleguide does. It can help keep consistency of use, document features and states that may be overlooked by a consumer, or act as a framework for design QA.
It's not even just documentation - you can use it like visual unit tests, so you can see what the effects of a change are to all the different states of a component.
DevCards and Figwheel are amazing. As much as I love React and its hot reloading plugins, `lein new figwheel` has always worked better than cloning a boilerplate or setting stuff up.
There are about 160 people in the Om Slack channel. There are more ways to contribute than sending patches, and there's quite a lot of community effort behind Om Next these days.
Even FB doesn't use PyTorch in production, and instead uses Caffe2.