> Every MATLAB using scientist I know ends up with a code base that is too complex (~1000-5000 lines of code) for the language, but is perfect for Python.
My experience is that it's not necessarily the language that makes it too complex, but the lack of training in proper software development. Most of my colleagues were trained as economists, actuaries or mathematicians and use Python, R, MATLAB or Excel/VBA as the language of their choice, and the result is always the same.
Coming from physics myself, I understand that beautiful notation can make things much easier and can often be the way to find great solutions, but proper organization will most often be sufficient.
Base R has some pretty big problems compared to Python or VBA. I'll single out the increasingly desperate attempts to implement a map function in the *apply() family; these are among a number of key functions that, from a user's perspective, return data in a seemingly arbitrary data structure. It is unreasonably difficult to maintain control over what the type of your object is.
Take m[1:2,] where m is a matrix: that returns a matrix. But m[1,] returns something that is not a matrix. That sort of type soup breaks all sorts of things in an unhelpful manner. typeof(m[var,]) and class(m[var,]) won't tell you up front what you are going to get, either; you need to explicitly test with is.matrix whether you still have a matrix after accessing a sub-matrix of a matrix in the obvious fashion. That is an important operation. Good luck figuring that out if you don't already know what is going on; the design is awful.
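To make the behaviour concrete, here's a minimal sketch in base R, including the escape hatch (drop = FALSE) that base R itself provides for exactly this:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

is.matrix(m[1:2, ])  # TRUE  -- two rows selected: still a matrix
is.matrix(m[1, ])    # FALSE -- one row selected: silently demoted to a vector
class(m[1, ])        # "integer" -- no hint that it was ever part of a matrix

# drop = FALSE keeps the dimensions, whatever the index length
is.matrix(m[1, , drop = FALSE])  # TRUE -- a 1x2 matrix
```

The inconsistency is that the *shape* of the result depends on the *length* of the index, so the same expression can yield different types at runtime.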
The short story is to go install tidyverse and use that instead.
Yes, there are problems of this nature in R, but I don't think the specific case you cite is really a problem. The ultimate source of the difficulty is the R (really S) decision to not have scalars - only vectors that happen to have length one. But given that, writing m[1,] is normally intended to get a simple vector, not a matrix.
The problem really arises when you write m[1:n,] with the intent of getting a matrix with n rows, and n happens to be 1 at the moment, so you get a simple vector instead.
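A sketch of how this bites in practice (the helper names here are illustrative, not from the thread):

```r
m <- matrix(1:12, nrow = 4, ncol = 3)

# A helper meant to return the first n rows of a matrix
first_rows <- function(m, n) m[1:n, ]

dim(first_rows(m, 2))  # 2 3  -- a matrix, as intended
dim(first_rows(m, 1))  # NULL -- silently a plain vector when n happens to be 1

# The defensive base-R version spells out drop = FALSE
first_rows_safe <- function(m, n) m[1:n, , drop = FALSE]
dim(first_rows_safe(m, 1))  # 1 3 -- still a matrix
```

The failure is silent: downstream code that assumes dim(), nrow() or matrix arithmetic only breaks on the particular inputs where n == 1.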
This is a problem that I have addressed in my pqR version of R, available at pqR-project.org.
In pqR, there is a new sequence operator, .., which produces a 1D array, not a simple vector. And in pqR, when a 1D array is used as an index, the dimension is not dropped, even if the array happens to be of length 1. So m[1..n,] produces a matrix even if n is one.
Well, most of the time. There's also the problem that m might have only one column, so the result will get dropped down to a simple vector for that reason. To solve this, pqR has a new way of indicating a missing argument, with _, which also indicates that you don't want the dimension dropped. So you can now get exactly the behaviour desired by writing m[1..n,_].
This is all backwards compatible, except that it's necessary to disallow use of .. in the middle of an identifier, so that a..b won't be taken as the name of a variable.
Ah, I've been intrigued by pqR. Lately I've wondered if there couldn't be a version of dplyr implemented as transducers, if only R... wasn't R. How feasible might it be for some future R runtime to be truly multithreaded, even if it breaks some existing functionality?
Well, pqR already uses multiple threads automatically to parallelize some numerical operations - e.g., for long vectors a and b, (a * b + a / b) might be computed with three threads, one computing a*b, one computing a/b, and one adding the results of these as they become available, or exp(a) might be computed with two threads each handling part of a.
But if you mean threads programmed explicitly in R, with fine-grained, low-overhead communication using shared memory, I think it would be quite challenging to modify the current implementation to support a language extension to do this. But maybe not impossible, for some sorts of extensions.
I think this point has great merit. From time to time I sit on the industrial relations board of our local university's CS department, and I begged them to include at least one basic software-engineering best-practices module in their new data science degree, with no luck. Many universities have Research Software Engineering departments now who can help with this stuff, but on the whole code produced by non-CS academics is dross (but honestly, god bless you if you're offended by this characterisation and it doesn't apply to you).
> My experience with this is that it is not the necessarily the language that makes it too complex
It's not that the language makes the project complex. It's that the projects are a bit more complex than the language is cut out for.
> but the lack of training in proper software development...
Assuming training scientists etc. in proper software development is a good use of their time (it might be), MATLAB is a blocker because it makes it hard and ugly to implement "proper" techniques.
Again, Python fits the bill here, because it's a pretty good language for novices to hack around in naively, while scaling smoothly to projects 5-10 times more complex. So the naïf has some headroom.
I use Clojure to manipulate data and construct the system, treat R as a library for data analysis and visualization, and call into R through a DSL, taking advantage of R's professional strengths while avoiding its shortcomings.
Let each language do what it does best, so the strengths of the various languages complement each other.
Dunno if you're aware of Bayadera[1], but if you happen to like a Bayesian workflow and don't rely overmuch on ggplot2 niceness, you can actually go a whole day without touching R.
R is actually a lot more concise than Python for a number of operations. Python wins when you consider the total ecosystem, but doing a broad comparison like that is unfair: it depends completely on what you are trying to achieve.
R has better data manipulation, algorithm support (outside of neural networks) and visualisation. Speaking as someone who has to maintain environments for a team of data scientists: getting R and RStudio Server installed with Intel MKL support etc. is trivially easy, while getting JupyterHub, Python, Anaconda and your own compile of TensorFlow all up and running is _excruciating_. So I don't really buy that one ecosystem is provably better than another (even if I think R is a horrible language in comparison).
The best I can explain it is that R is to data science as PHP is to web sites. It's an incredibly productive and accessible DSL, but you rarely want to inherit an R codebase.
Exactly! And I feel R has a similar problem.