The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.
For example, complaining that R is slow and then writing an iterative solution instead of using vectorization. When I saw the example the author gave, my first thought was "sapply/lapply". lapply is essential to using R, and it is taught early on in every book/course on R I've ever seen.
"In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. "
It's been a few years since I really looked at R, but I don't think the problems with R are simply that people don't learn the language. Some languages are simply not as good as others. We can all learn more about the tools we use when programming, I know that I certainly could. But this doesn't make it our fault that a language is tricky or hard to debug or hard to understand. If we worked at it, I suppose we could all write more efficient programs by using assembler, but that doesn't mean that assembler is the best possible programming language for, say, statistical programming.
Someone who knows a thing or two about R, Ross Ihaka, wrote a short post 6 years ago and said "simply start over and build something better". Take a look:
From your link, hilariously relevant to the blog post at hand:
"First, scalar computations in R are very slow. This in part because the R interpreter is very slow, but also because there are a no scalar types. By introducing scalars and using compilation it looks like its possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler."
> The problem is people using R without trying to learn about the language itself
It's not the user's fault.
Like, congratulations on being better at R than the author of TFA. Maybe you're smarter than him, maybe you've put in more time learning, maybe you've just spent your time more intelligently, maybe you lucked out and bought better books...who knows.
But this line of reasoning completely misses the author's point, which is that despite having used the language for years, he still finds it inscrutable. "It would be easier if you were better at R" is a tautology, and unhelpful. The issue is that the author finds it hard to become better at R.
We can disagree as to whether or not it's objectively hard to become better at R, but this is a perfectly valid criticism to make. It's not the user's fault.
It's a 4-year-old article and R has changed a TON with new things. BUT... R has also grown a ton in users as well as features.
R really is a functional programming language that people don't take advantage of. All languages have strengths and weaknesses, and YET the complaint is that R has too many ways to do any one thing; that flexibility is exactly what allows us to have data.table, dplyr, ggplot2, and magrittr (piping with %>%). [EDIT: RStudio and RServer are also a big example of R's growth in features and quality]
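For the uninitiated, the piped style looks something like this (a small sketch assuming dplyr is installed; mtcars ships with R):

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg))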
As I learned R my code changed dramatically, and I think R has one of the largest gaps between the code you start with and the code you write once you're proficient. My starting R code is really embarrassing.
I think the person you are responding to is simply saying that some things are more complex than others and require more understanding and experience. R apparently falls into that category. If the author wants to gain that experience, the time spent on this blog post might have been better spent reading a book on R. Fault might be a strong word, but the author has certainly made a decision about what to spend their time on, and the results are as expected.
Some things require more understanding and experience for no obvious benefit... in which case they're anti-patterns, or at least not best practice in language design.

There are a lot of things in R that are good, but it's an old language and there's a lot of cruft.

And believe it or not, there are things that require looping through a data frame. When I had to do that a few years ago it was unbelievably slow: going multithreaded was non-trivial, writing that section in C was non-trivial... I ended up rewriting the whole thing in python and was a lot happier.
The data frame has 2 columns: 20 years of portfolio returns, and 20 years of % withdrawn.

Using a starting portfolio value, calculate the 20 ending portfolio values for each year and the dollar amount withdrawn.

It worked OK looping through the data frame, but was unreasonably slow.

I never figured out how to use a vectorized method that could go through the frame building each new element from the one previously calculated.

Maybe I missed something obvious?

(The parallel part came in because I was doing this on a lot of portfolios, so to speed it up I just launched the same slow function on several lists of them in parallel. Writing that one function above in C probably would have been OK. I think I got it to work, but then I couldn't get the right version of the compiler to work with the right version of R which supported the other libraries I was using. It was a few years ago so maybe things weren't as stable. I never said I was very good :)
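For what it's worth, that kind of running recurrence can at least be spelled without an explicit loop using Reduce(accumulate = TRUE). It won't be C-fast, since the computation is inherently sequential, but here's a minimal sketch with made-up numbers, assuming withdrawals come out after each year's return:

returns  <- c(0.05, -0.02, 0.07)  # made-up yearly returns
withdraw <- c(0.04, 0.04, 0.04)   # made-up yearly withdrawal rates
start    <- 1000

# Thread the running portfolio value through each year, keeping
# every intermediate result; drop the starting value at the end.
values <- Reduce(
  function(v, i) v * (1 + returns[i]) * (1 - withdraw[i]),
  seq_along(returns),
  start,
  accumulate = TRUE
)[-1]

# Dollar amounts withdrawn (taken after each year's return is applied).
withdrawn <- values / (1 - withdraw) * withdraw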
It's kind of amazing to see someone admit to spending hundreds or thousands of hours using R, yet refuse to spend a couple hours learning the language a little better. Whining that your tools are hard without investing any effort in them is just dumb.
The R help even comes with code samples that you can run!
R is, I think, an interesting language because it's heavily used by people who would not otherwise learn a programming language. If you compare R not with other programming languages, but with other ways of working with statistical data, this makes far more sense. I don't actually "know" SAS in the way I know a programming language - I know the commands I invoke to do what I want it to do.
Similarly, I encounter lots of people using R who don't actually know what a function is, just that lm(x~y) gets them what they want.
I see this as a failure of our educational system.
Speaking as an academic in CS, it's our job to teach people skills that they need for dealing with computers in the course of their career. The Math department does this for basic calculus and probability; the English department does this for literature and composition.
Why don't more CS departments offer the service courses that scientists and engineers need to really learn how to manipulate their data and make sense of it? At least part of the problem is probably that the other departments won't require their students to take such a course...
My personal experience is similar because I know quite a few people in social sciences.
Conceptually, this is similar to rats pulling a lever or monkeys being reinforced to type the right characters. It also explains p-hacking and many other problems of interpretation.
Now one question I always have is - if you consider R just a tool - what is the difference between things I should fully understand (R?) and things I should only know how to use (e.g., my cell phone)?
How can I justify saying that people should understand R while I myself don't understand quite a few aspects of my cell phone?
Also, many people use it only intermittently, maybe once every six months or so when they have some data to look at. Rather than try to relearn the language and its quirks yet again, it's much easier to take what you did last time and tweak it until you get what you need.
This too. Between collecting data, writing grant proposals, writing papers, etc. I don't spend time day-in, day-out using R.
Often the first hour of that is thinking "Shit, how do I do that again? Has Hadley written a package to do this better by now? What did I do last time - why did I do that last time?"
> The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.
I think you just explained Perl in a nutshell (err, the idiom, not the book). It seems whenever a language supports enough idioms of the usual C-like languages people will gravitate towards those, likely due to the high population of people that know those idioms and can fall-back on them without having to think too hard. I doubt Lisp has as much of a problem of people trying to write C in Lisp.
I sympathize with the OP and also feel frustrated with R (and I say that as a regular R "practitioner").
Part of the problem, I think, is the built-in documentation. The typical R user is a domain-expert just trying to get some work done. Occasionally, they'll get stuck and try something like "?sapply". What appears is usually a terse, confusing mess that takes a VERY LONG TIME to digest and is the LAST THING you want to read when you're trying to make a living solving a problem other than understanding R documentation.
Below is the "Description" for the apply family (which is what you get when you try ?sapply). Does it _really_ explain the essentials of what you need to use "apply"?
"...
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
..."
"lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X"
== lapply is a map() construct that takes a list and a function
"sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f)."
== "sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f)"
It could be A LOT clearer. The first sentence is, of course, obvious to everyone and probably already known by people searching for help on sapply.

If you have a family of functions that "sort of" do similar things, the most critical thing to communicate is some clear sense of when to use one or another of the functions. This doc degenerates into unhelpful gibberish instead.

Perhaps a close reading of it would have helped the OP, but there is an unnecessarily high cost in frustration.
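For comparison, the distinction the doc is groping at fits in a few lines of toy code:

xs <- list(1:3, 4:6)

lapply(xs, sum)              # always a list: list(6, 15)
sapply(xs, sum)              # simplified when possible: c(6, 15)
vapply(xs, sum, numeric(1))  # like sapply, but errors unless each
                             # result is exactly one number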
>> But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?
R is interpreted and dynamically typed, so when you declare a variable, the interpreter has to do some bookkeeping to figure out the type of the variable, allocate memory for it and so on.

If you write a loop by hand, the interpreter has to do this bookkeeping once for each iteration.

If you write your code in vectorised form, the interpreter can sort out the bookkeeping once and then hand over to the lower-level code (C or Fortran) that the vectorised functions are implemented in.
This can also be further optimised to take advantage of processor vector instructions, parallel processing etc.
So I'm afraid we can't have our cake and eat it. If we want an interpreted language with somewhat intuitive notation, then it has to have crappy slow loops. If we want a language with fast loops, we have to rely on C or Fortran and forget about vectorised notation.
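To make that concrete, a quick base R illustration (exact timings will vary by machine):

x <- runif(1e6)

# Loop form: per-element dispatch and bookkeeping in the interpreter.
slow_sqrt <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- sqrt(x[i])
  out
}

system.time(slow_sqrt(x))  # orders of magnitude slower...
system.time(sqrt(x))       # ...than one vectorised call into C code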
Why can't a JIT solve this? It shouldn't need to do the bookkeeping for every iteration if it has JIT compiled it. A JIT should be able to take advantage of processor vector instructions etc.
However, the R core committers are not only essentially volunteers, but they're all (afaik) academic statisticians. One of the people who has made strides in this direction is primarily an agricultural statistician at Iowa (Luke Tierney / compiler package). Building a high-performance runtime/JIT is wildly out of their scope of expertise.
In retrospect, and I think many of them would agree, building and maintaining their own runtime was a giant mistake. Yet here we are.
Serious compiler people (Jan Vitek, others) have made strides towards a faster implementation (his is in Java / FastR IIRC), but it suffers from the same problem as CPython: there are millions of lines of C code in packages or internal functions that have the details of the R interpreter / C interface deeply embedded in them. In fact, there's probably far more "R" code written in C than in R. Undoing this mess is not easy, and probably not possible.
Oh, reading Evaluating the Design of the R Language [1] will shed some more light on why it's hard to make R run fast.
I think, and I'm pretty sure most of R core would agree, that building and maintaining their own runtime _was_ the right thing to do. Otherwise R would have been at mercy of maintainers who were interested in problems other than creating an expressive language for data analysis.
I don't think calling Luke an "agricultural statistician" is at all reflective of his work. Not everything in Iowa is corn, and Luke has been working in computationally intensive statistical methodology and statistical software development for decades.
"While R and Lisp are internally very similar, in places where they differ the design choices of Lisp are in many cases superior. The difficulty of predicting performance and hence writing code that is guaranteed to be efficient in problems with larger data sets is an issue that R will need to come to grips with, and it is not likely that this can happen without some significant design changes."
R does actually ship with the ability to byte-compile functions these days, and as that functionality matures it may become the default behavior. It's still better to actually learn the language; it's far easier to optimize something like:
apply(X, 1, function(x) {
  # do stuff to the row of X
})

than:

for (i in 1:nrow(X)) {
  # do stuff to X[i,], and store it somewhere
}
As far as I know byte-compiling won’t actually alleviate the repeated name lookup (or does it?). Unless the R byte compiler is fiendishly clever, every single name lookup in the loop body will still incur what essentially amounts to a `get(name, environment(), inherits = TRUE)` call.
Probably not, but I'll admit to not having dug into it too deeply. In my initial experiments, I found only modest speed gains when byte compiling. Then again, I'm already using C functions wherever possible.
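For anyone curious, trying it is a one-liner with the compiler package that ships with R; in my experience the gains are real but modest, as noted above:

library(compiler)

f <- function(n) {
  s <- 0
  for (i in 1:n) s <- s + i
  s
}
fc <- cmpfun(f)  # byte-compiled version of the same function

system.time(f(1e7))
system.time(fc(1e7))  # usually noticeably faster, but nowhere near C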
Keep in mind just how old the language is, as it started as S in 1976. It was intended to be a glue language for Fortran and C.
Keep in mind also that it's easy to rewrite the bottlenecks (which are only a small part of most programs) in C, C++, Fortran and other languages including D. That may not be what any particular person is looking for, but that's traditionally the way things have been done.
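One popular route these days is Rcpp; a small sketch, assuming the package is installed:

library(Rcpp)

# Compile a tiny C++ function and expose it to R.
cppFunction('
  double sum_sq(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
    return total;
  }
')

sum_sq(c(1, 2, 3))  # 14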
Yeah, I feel like sapply, lapply, and mapply will cover most of what people need to do. Hell, I've personally only really ever needed sapply, as I don't work much with lists.
In my experience a lot of the claims that R is slow are greatly exaggerated and made by people who don't actually use it. Kind of an echo chamber. Every time I see someone say chose Python instead because of speed, I roll my eyes.
My usual take on this: "Between R and Python, the faster language is likely whichever library author actually wrote most of their code in C or FORTRAN."
It's funny you bring up python; I say this not as a comment on your thesis, but as a related point. I often hear the "python is slow" trope, but that's only half true: you can typically write python that is plenty "fast enough" (speaking as a day-jobbing data pipeline engineer) if you're implementing with an understanding of what will drive you into the mud. This goes beyond just understanding the tool you're using; fundamentally, writing something O(N^N) is going to hurt even if you're in C#/C. I've seen that plenty, frankly, more than I've seen "legitimately slow python".
Anyway, this was just a thought rolling around in my head given the discussion.
I've translated plenty of numerical code from (pure-ish) python to c and c++, and usually get about a 100x speedup, sometimes as high as 800x, implementing the same algorithms.
At the risk of being overly pragmatic, note that I said "fast enough" and not "as fast as possible."
My comment was more on the perception that python is unworkably slow in many situations, where I can count the number of times on my hands that I've NEEDED to C-ify some hot paths.
If you're writing a plasma fluid simulation to run on an HPC cluster, yes, you probably damn well want some straight C/C++. Outside of similarly exceedingly high throughput situations, CPUs are normally more than fast enough, especially if the application in any way brushes up against people and thus falls into "human time" scales, in which case you'd typically be hard pressed to make things slow enough for someone to notice. (Yet somehow we find a way...)
To a sister post re: where the python->C speedup can occur (killing two birds with one stone): I imagine there's a lot of low-hanging fruit. To take one obvious example, anything the compiler can optimize away: memory read/address optimization, vectorization, potentially better support for branch prediction. I can handwave at more, but I am so far from a compilers type that I'd probably make a fool of myself.
> Outside of similarly exceedingly high throughput situations, CPUs are normally more than fast enough, especially if the application in any way brushes up against people and thus falls into "human time" scales, in which case you'd typically be hard pressed to make things slow enough for someone to notice.
This has simply not been my experience. (In a previous job I had reasonably optimized numerical python code sitting on the back end of an api and it was incredibly easy to go over our time budget).
For what it's worth, I believe you; I'd be curious what the workload was / what the time window was, if you're able to say?
I could certainly see myself as having been spoiled with respect to beefy hardware and feasible workload/SLA ratios, but it's led me to a prior where I take the age-old advice against premature optimization pretty aggressively. (Starting projects in python, naive brute-force implementations for a first pass, readability over a better O(N), etc.)
Nit, but throughput is not the only performance constraint that could rule out Python. The last substantial amount of C I wrote was low throughput but needed to reliably receive, process and respond to packets in single-digit microseconds.
I've had good experience with Cython, which compiles python to C and gets almost all of the speedup of rewriting in C entirely. And in fact, most of that speedup just comes from declaring variable types...
Any idea where the speedups came from? Is it that the problems weren't algorithmically limited in the first place (lots of io for example), reduction of overhead etc. (what kind of python was the code running on before?), or just that the speedup on low-level operations added up cumulatively and came to dominate the other timing factors?
Also, did you change the data structures or use the same ones as in python? Was any of the speed boost data structure related?
Python and similar dynamic languages suffer from the fact that every name access (variable, function, etc.) incurs a dynamic lookup of that name in a (nested) dictionary. Statically compiled languages don't have this. There are fairly recent, clever optimisations that can avoid many of these lookups, but they are not implemented in any of the common implementations of Python, R, etc. (JavaScript has them though). And even with these optimisations in place we cannot get rid of such lookups altogether, and they kill cache locality and branch prediction.
There are other reasons for slowdown (automatically managed garbage collection is a big one, and so is any kind of indirection, e.g. callbacks). But usually the big one is name lookup.
As a compiler writer, I can tell you that in JS, local variable lookups do not incur any kind of dynamic overhead. The performance of modern JS engines is much closer to C than you might think. Dynamic language optimization is also not so recent. Most of the techniques implemented by modern JS engines were invented for the Smalltalk and Self projects. See this paper from 1991, for example: http://bibliography.selflanguage.org/_static/implementation....
Python is just inexcusably non-optimized. It's a bytecode interpreter, with each instruction requiring dynamic dispatch. Integers are represented using actual objects, with pointer indirection. The most naive, non-optimizing JIT implementation might get you a 10x speedup over CPython. I think that eventually, as better-optimised dynamic languages gain popularity, people will come to accept that there is no excuse for dynamic language implementations to perform this poorly.
I haven’t followed recent development of JavaScript all that closely so my knowledge is somewhat outdated. However, the optimisations that make JS performance close to C in some cases are really recent. Some of the tricks are old, such as the paper you cited. But these tricks only go so far, and in particular even modern GCs simply work badly in memory-constrained environments, which puts a hard upper limit on the amount of memory that JavaScript can handle efficiently. One of the better articles on this subject is [1].
That said, my comment already mentioned that local variable lookup isn’t a problem in JavaScript. It is in R, however; see my example in [2]. Beyond that, both R and Python execution have obvious optimisation potential, which is made hard by the fact that existing libraries rely extensively on implementation details of the current interpreters.
The lookup thing only happens during compilation to byte code or intermediate code, I believe. Once in byte code, there are no variable names, only addresses.
No, unfortunately that is not the case. Lookup happens at execution of the byte code, because variables cannot be looked up at byte compilation. Consider the following case:
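Something along these lines (a minimal sketch of the kind of case meant, with hypothetical names):

x <- "global"
f <- function(user_input) {
  x <- "local"
  rm(list = user_input)  # remove whatever variable the user names
  x                      # which x this finds is unknowable at compile time
}

f("x")  # "global": the local x was removed, so the lookup escapes
        # to the enclosing scope and finds a different variable
f("y")  # "local" (with a warning): the local x is untouched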
If `user_input` is "x", the lookup of `x` no longer finds the local variable but a different one, in a different scope. Hence this lookup needs to take place every time this piece of code is executed.
I’m not sure if Python suffers from similar problems.
All from (3). Definitely not io bound, and using standard python 2.7 (if numpy had been applicable, I would have used it...)
My data structures for numerics are generally really simple, and generally I'm able to go from python list/dict/sets to c++ vector/map/set pretty directly.
I for one write all my statistical code in baremetal assembly. I manage about 5 a year, but they all run very quickly. There is no such thing as premature optimization.