My least favourite thing about R is its desire to keep on running when it should have errored about 50 lines earlier, happily spitting out some nonsense result - maybe with a warning, often not.
One of my previous jobs basically turned into being an in-house R consultant for a department in a pharmaceutical company, and I caught so many bugs while investigating some other issue that meant the results people were reporting were completely wrong. A really common one is multiplying two vectors of unequal length where broadcasting shouldn't be possible: R just recycles the shorter vector - but hey, it ran without error and there's an output, so many researchers don't notice.
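For anyone who hasn't hit this: when the longer length is a multiple of the shorter one, the recycling is completely silent (the vectors and values below are just illustrative):

  x <- c(1, 2, 3, 4, 5, 6)
  y <- c(10, 100, 1000)   # length 3 divides length 6, so no warning at all
  x * y
  # [1]   10  200 3000   40  500 6000   <- y silently recycled twice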
Not to mention that handling errors is pretty miserable: if you want to catch a specific error you have to match against the error message string, and unfortunately the error message changes depending on the locale the R session is running in.
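A rough sketch of what that string matching tends to look like in practice (some_call() is just a placeholder, and the message being matched only holds in an English locale):

  result <- tryCatch(
    some_call(),   # placeholder for whatever might fail
    error = function(e) {
      # no error class to dispatch on, so match the message text
      if (grepl("subscript out of bounds", conditionMessage(e))) {
        NA          # handle this specific failure
      } else {
        stop(e)     # re-raise anything else
      }
    }
  )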
I can't recommend the book "R for Data Science" (https://r4ds.had.co.nz) enough; it's written by Hadley Wickham, one of the creators of the tidyverse. This opinion might get challenged here, but if you're going to use R primarily for data science/analysis and not for programming, I think it's a better idea to start learning it with the tidyverse than with base R (beyond the basics, of course, which are also covered in the book).
I use R professionally for biostatistics and I can't remember the last time I had to use the base syntax because something couldn't be done with the tidyverse approach.
Would be interesting if you could expand.
I've used R (data.table) extensively in recent years for biostatistics in a research organization. I was able to get away with not learning the tidyverse and sticking to data.table.
The main reason for choosing data.table was speed - I'm working with tens to hundreds of GB of data at once.
What's worked for me is reading Hadley Wickham's "Tidy Data" paper[0] and then applying the concepts with data.table. The speed is nice, but I really love what's possible with data.table syntax and how many packages work with it. That's opposed to what many people have decided "tidy" means, with non-standard evaluation and functions that take whole tables and symbols of column names instead of vectors.
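As a rough sketch of what I mean (made-up table and column names), tidy-style filtering, grouping, and aggregation in plain data.table syntax:

  library(data.table)
  dt <- data.table(patient = c(1, 1, 2, 2),
                   visit   = c(1, 2, 1, 2),
                   value   = c(5.1, 4.8, 6.2, 5.9))
  # filter, aggregate, and group in one bracket: dt[i, j, by]
  dt[visit > 1, .(mean_value = mean(value)), by = patient]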
Compared to data.table, tidyverse offers significantly better readability and ergonomics in exchange for worse computational and memory efficiency, with the size of the performance gap ranging from negligible to catastrophic depending on the operation and your data volume. At that data volume, you're probably doing some things that would OOM or hang for days if you translated your data.table code to the corresponding tidyverse code.
Agreed. IMO Tidyverse is a fantastic suite of R packages and worth learning after understanding how to use base R/with minimal dependencies. I personally started with base R and evolved to use tidyverse. Now I use base R when writing R packages and use tidyverse for data analysis/modeling workflows.
I’ll second this, though with some hesitation. If you just want to get stuff done, start with tidyverse. But if and when it’s time to start writing classes and packages, you may have to go back and gather some of the fundamentals.
I'm a base R purist personally, but that's mostly because of how long ago I picked it up; I don't get much improvement in development speed from dplyr verbs, with a few exceptions. But I disagree with this take for beginners, especially non-programmers: with the advent of the tidyverse it is incredible how fast newcomers pick up enough fluency to handle basic data massaging, analysis and visualisation.
I think exceptions where base-R is necessary can be taught as they arise.
There are several comments below that suggest not using tidyverse because "base R" is the foundation for everything.
I think it is important to use tidyverse because of the many quirks, surprises, and inconsistencies in base R. It would be helpful if others share their reasoning, or at least point to their favorite blog explanation, so that beginners can understand the problems they will face.
Unfortunately 5 minutes of Googling failed to produce a reference for me --- the start of some advanced R book that begins by asking "do you need to read this?" and showing examples whose results are predicted incorrectly by most people. Perhaps another user can provide the info.
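In the meantime, one classic example of the kind of surprise I think is meant here (my own example, not necessarily from that book): single-bracket subsetting of a data.frame silently changes the type of thing you get back.

  df <- data.frame(x = 1:3, y = 4:6)
  class(df[, c("x", "y")])        # "data.frame"
  class(df[, "x"])                # "integer" - one column silently drops to a vector
  class(df[, "x", drop = FALSE])  # "data.frame" again, if you remember drop = FALSE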
This depends on what you are using R for. Tidyverse is focused on handling data.frame objects and everything that comes with them. Even ggplot2 uses a data.frame as its default input. And the tidyverse has a competitor, data.table, which can be used instead (given that you are familiar with base R).
However, some data are better suited to be represented in the form of matrices. Putting matrix-like data in a data.frame is silly, since performance will suffer and you would have to convert it back and forth for many matrix-friendly operations like PCA, tSNE, etc. The creator of data.table shares this opinion [1]. And similar opinions are generally given by people who are familiar with problems that fall outside the data.frame model [2].
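For example (purely illustrative numbers), the matrix-friendly tools take a matrix directly, so there's no reason to force the data through a data.frame first:

  mat <- matrix(rnorm(100 * 10), nrow = 100)  # 100 samples x 10 features
  pca <- prcomp(mat, scale. = TRUE)           # PCA straight on the matrix
  head(pca$x)                                 # principal component scores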
Is this really unique to R, or do all programming languages have some foibles? For example, I spent an hour recently debugging C++ because I forgot that it loves to do integer division despite the fact that the result is going into an explicitly typed double. No error, no warning. You just have to know, and I highly doubt it's the desired behavior in most cases.
Most researchers are not programmers and don't care about programming. It's a tool to get the job done and I think you'd run into similar problems with other languages.
If you divide two integers, you get an integer. You can then cast it to whatever you want. Or, if you want some other type, you need to cast it before the operation is done.
Okay. But I'm storing it in a variable explicitly declared to be a double. That should be enough. If I divide two integers in Python or R or Julia or a dollar store calculator I don't get an integer, and I don't even have to explicitly type the variable. You have to know that C++ will do that. It's not common sense, just like R recycling shorter vectors.
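(R's behaviour for reference, since it came up - ordinary division always gives a double, and integer division is opt-in:)

  5L / 2L    # 2.5  - "/" returns a double even for two integers
  5L %/% 2L  # 2    - integer division has its own operator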
I agree with your point that all languages have their quirks. This is a very poor example, however. If it automatically converted to float, what would you do if you wanted integer division? I think automatic casting tends to get messy/be pretty evil in general, but of course there are exceptions.
You could always do something like:
int divRes = intA / intB;          // integer division, explicit and intentional
double something = divRes * 5.342; // promoted to double here
At the very least it could warn me. I just tried it in Rust, and that will error out if you try to divide two ints and store the result in a float, which is fine by me.
Hi, would it be possible to contact you to ask some career questions related to the pharmaceutical industry and data science? I'm a biostatistician who uses R for everything and lately I've been thinking about doing a career change, but I'm a bit lost with all the available options.
My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
On a more serious note, I agree that R being too charitable in interpreting things (seemingly without warning) seems to be a problem. You'll have to do some debugging to make sure it actually does what you intended it to do. I've only dabbled in it a bit though.
> My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
In the real world we start counting from 1. CS people cannot stop complaining about it but it makes sense in languages used for mathematics and statistics. Zero-indexing is not very relevant if you don’t care about memory layout.
> It's a bit of a joke, like arguing over tabs vs. spaces though.
It is taken very seriously, though. This “issue” comes up very often when some people come and lecture others about how stupid the language they use is.
> May I recommend you this fabulous short essay by Dijkstra
That essay is not fabulous, it is obnoxious. I know you either love or hate Dijkstra, and he enjoyed being a contrarian, but he's unconvincing. The only point that surfaces during arguments on 0-indexing is iterating over 0..N-1 instead of 1..N. That's basically what he wrote himself. This could have been solved with just a bit of syntax if it were really a problem, and it remains largely because C did it that way to simplify pointer arithmetic. It does not change the fact that for the vast majority of people, the first element in a list is, well, first.
The proper way of handling this is to allow for arbitrary indices, because you will always find contexts where a different scheme makes sense (e.g. iterating from -10 to 10 is sometimes natural, and would otherwise require some index gymnastics). Insisting that one narrow view is the correct one is just annoying.
I dunno, it seems you misunderstood me. I clearly said that it is completely arbitrary to choose one over the other, and expressing a preference for either one is just a way of poking fun at people who are anal about choosing a specific one. So there isn't really any disagreement, though I'm always amazed at the lengths people go to to express what they think, when they're really just arguing about the definition of some thing.
> It is taken very seriously, though.
And those who do take it terribly seriously deserve being poked at ;)
Honestly, indices starting from 1 fit really nicely in most situations. 1-based indexing together with ranges and inclusive range-based indexing makes loop and subsetting code really readable IMO.
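A small illustration of what I mean (toy vector):

  x <- c("a", "b", "c", "d", "e")
  x[1]          # "a" - the first element really is element 1
  x[2:4]        # "b" "c" "d" - inclusive range, no off-by-one mental math
  x[length(x)]  # "e" - the last element is element length(x)
  for (i in seq_along(x)) print(x[i])   # loops read the same way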