My thoughts (a bit of a rant) on R, as the lead engineer at a data-science-focused company: R is a great statistical language but a poor programming language. By "programming language" I mean a language that is versatile enough for a variety of needs (web apps, command-line apps), such as Python, Ruby, etc. R can be used as a programming language the way a climbing rope can be used as a belt. It can, but it shouldn't be, for the reasons below.
It is great for exploratory analysis, as it is forgiving and easy to use in the console for testing things; but once it needs to be put into practice, it has issues. For a non-programmer, grasping R isn't too hard thanks to some great developers in the community.
There is a lot of good in the R community, but people are focused on making it something it isn't. Just look at deploying R into production; that can be a nightmare. I've spent days looking over code to figure out where an error in production lies. One of the errors came from a dependency of a dependency that had been updated for the first time in years. That package depended on another package, which my package used through a function that in turn called the first one; basically it was a mess of dependencies. There are also some misconceptions. While doing the engineering work in R and learning the language, I learned not to use for loops. Then one day I timed it, and the for loop was 10x+ faster than any apply/plyr function, including versions using a GPU.
What separates a programming language from a statistical language is that a programming language has more than one of these:
* Good dependency management
* Easy deployment into production environment
* A clear way to set up the environment (e.g. naming, folder conventions)
* Ability to do most of the things you want with the base packages
* Good documentation about the above.
Basically, I believe a good data scientist is someone who can use R (or something else) to explore data and then implement the algorithm in a compiled language to put it in production. And for someone who just needs to produce analysis for research or a paper, R is perfect. R is an excellent language for its use cases; just don't think about using it for general programming. It has cost a lot of extra dev hours spent working on issues with it.
Little plug: we wrote a piece on hiring data scientists.[0]
To contrast this - I'm the lead data scientist at the same company, and head over heels in love with R.
It is the only language in which I can quickly and efficiently jump from algebraic topology for novel pre-processing straight into model building and validation - with just about every potential variation of every major algorithm freely available and packaged on a well-curated package repository (CRAN) - and then ensemble the models.
I _agree_ that it's a bit difficult to use in production, and that dependency management needs work (Packrat is trying to do that), and that blindly trusting packages on CRAN can cause errors - but 98% of the time - it just works. Graphics, models, crazy niche things that are currently only used by one post-doc locked away in a top secret research lab... it all just works.
Of course, take this with a grain of salt: this is coming from a guy who's built web servers (HTTP responses and all) in R.
[0]: https://gastrograph.com/blogs/gastronexus/interviewing-data-...