My least favourite thing about R is its desire to keep on running when it should have errored about 50 lines earlier, happily spitting out some nonsense result - maybe with a warning, often not.
One of my previous jobs basically turned into being an in-house R consultant for a department in a pharmaceutical company, and I caught so many bugs while investigating some other issue that meant the results people were reporting were completely wrong. A really common one is multiplying two vectors of unequal length where broadcasting shouldn't be possible: R just recycles the shorter vector - but hey, it ran without error and there's an output, so many researchers don't notice.
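For anyone who hasn't hit this: when the longer length is a multiple of the shorter one, the recycling is completely silent (the vectors and values below are just illustrative):

  x <- c(1, 2, 3, 4, 5, 6)
  y <- c(10, 100, 1000)   # length 3 divides length 6, so no warning at all
  x * y
  # [1]   10  200 3000   40  500 6000   <- y silently recycled twice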
Not to mention that handling errors is pretty miserable: if you want to catch a specific error you have to match against the error message string, and unfortunately the error message changes depending on the locale the R session is running in.
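A rough sketch of what that string matching tends to look like in practice (some_call() is just a placeholder, and the message being matched only holds in an English locale):

  result <- tryCatch(
    some_call(),   # placeholder for whatever might fail
    error = function(e) {
      # no error class to dispatch on, so match the message text
      if (grepl("subscript out of bounds", conditionMessage(e))) {
        NA          # handle this specific failure
      } else {
        stop(e)     # re-raise anything else
      }
    }
  )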
I can't recommend the book "R for Data Science" (https://r4ds.had.co.nz) enough; it's written by Hadley Wickham, one of the creators of the tidyverse. This opinion might get challenged here, but if you're going to use R primarily for data science/analysis and not for programming, I think it's a better idea to start learning it with the tidyverse than with base R (beyond the basics, of course, which are also covered in the book).
I use R professionally for biostatistics and I can't remember the last time I had to use the base syntax because something couldn't be done with the tidyverse approach.
Would be interesting if you could expand.
I've used R (data.table) extensively in recent years for biostatistics in a research organization. I was able to get away with not learning the tidyverse and sticking to data.table.
The main reason for choosing data.table was speed - I'm working with tens to hundreds of GB of data at once.
What's worked for me is reading Hadley Wickham's "Tidy Data" paper[0] and then applying the concepts with data.table. The speed is nice, but I really love what's possible with data.table syntax and how many packages work with it. That's opposed to what many people have decided "tidy" means, with non-standard evaluation and functions that take whole tables and symbols of column names instead of vectors.
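As a rough sketch of what I mean (made-up table and column names), tidy-style filtering, grouping, and aggregation in plain data.table syntax:

  library(data.table)
  dt <- data.table(patient = c(1, 1, 2, 2),
                   visit   = c(1, 2, 1, 2),
                   value   = c(5.1, 4.8, 6.2, 5.9))
  # filter, aggregate, and group in one bracket: dt[i, j, by]
  dt[visit > 1, .(mean_value = mean(value)), by = patient]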
Compared to data.table, tidyverse offers significantly better readability and ergonomics in exchange for worse computational and memory efficiency, with the size of the performance gap ranging from negligible to catastrophic depending on the operation and your data volume. At that data volume, you're probably doing some things that would OOM or hang for days if you translated your data.table code to the corresponding tidyverse code.
Agreed. IMO Tidyverse is a fantastic suite of R packages and worth learning after understanding how to use base R/with minimal dependencies. I personally started with base R and evolved to use tidyverse. Now I use base R when writing R packages and use tidyverse for data analysis/modeling workflows.
I’ll second this, though with some hesitation. If you just want to get stuff done, start with tidyverse. But if and when it’s time to start writing classes and packages, you may have to go back and gather some of the fundamentals.
I'm a base R purist personally, but that's mostly because of how long ago I picked it up; I don't get much improvement in development speed from dplyr verbs, with a few exceptions. But I disagree with this take for beginners, especially non-programmers: with the advent of the tidyverse it is incredible how fast newcomers pick up enough fluency to handle basic data massaging, analysis and visualisation.
I think exceptions where base-R is necessary can be taught as they arise.
There are several comments below that suggest not using tidyverse because "base R" is the foundation for everything.
I think it is important to use tidyverse because of the many quirks, surprises, and inconsistencies in base R. It would be helpful if others share their reasoning, or at least point to their favorite blog explanation, so that beginners can understand the problems they will face.
Unfortunately 5 minutes of Googling failed to produce a reference for me --- the start of some advanced R book that begins by asking "do you need to read this?" and showing examples whose results are predicted incorrectly by most people. Perhaps another user can provide the info.
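In the meantime, one classic example of the kind of surprise I think is meant here (my own example, not necessarily from that book): single-bracket subsetting of a data.frame silently changes the type of thing you get back.

  df <- data.frame(x = 1:3, y = 4:6)
  class(df[, c("x", "y")])        # "data.frame"
  class(df[, "x"])                # "integer" - one column silently drops to a vector
  class(df[, "x", drop = FALSE])  # "data.frame" again, if you remember drop = FALSE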
This depends on what you are using R for. Tidyverse is focused on handling data.frame objects and everything that comes with them. Even ggplot2 uses a data.frame as its default input. And the tidyverse has a competitor, data.table, which can be used instead (given that you are familiar with base R).
However, some data are better suited to be represented in the form of matrices. Putting matrix-like data in a data.frame is silly, since performance will suffer and you would have to convert it back and forth for many matrix-friendly operations like PCA, tSNE, etc. The creator of data.table shares this opinion [1]. And similar opinions are generally given by people who are familiar with problems that fall outside the data.frame model [2].
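For example (purely illustrative numbers), the matrix-friendly tools take a matrix directly, so there's no reason to force the data through a data.frame first:

  mat <- matrix(rnorm(100 * 10), nrow = 100)  # 100 samples x 10 features
  pca <- prcomp(mat, scale. = TRUE)           # PCA straight on the matrix
  head(pca$x)                                 # principal component scores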
Is this really unique to R, or do all programming languages have some foibles? For example, I spent an hour recently debugging C++ because I forgot that it loves to do integer division despite the fact that the result is going into an explicitly typed double. No error, no warning. You just have to know, and I highly doubt it's the desired behavior in most cases.
Most researchers are not programmers and don't care about programming. It's a tool to get the job done and I think you'd run into similar problems with other languages.
If you divide two integers, you get an integer. You can then cast it to whatever you want. Or, if you want some other type, you need to cast it before the operation is done.
Okay. But I'm storing it in a variable explicitly declared to be a double. That should be enough. If I divide two integers in Python or R or Julia or a dollar store calculator I don't get an integer, and I don't even have to explicitly type the variable. You have to know that C++ will do that. It's not common sense, just like R recycling shorter vectors.
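(R's behaviour for reference, since it came up - ordinary division always gives a double, and integer division is opt-in:)

  5L / 2L    # 2.5  - "/" returns a double even for two integers
  5L %/% 2L  # 2    - integer division has its own operator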
I agree with your point that all languages have their quirks. This is a very poor example, however. If it automatically converted to float, what would you do if you wanted integer division? I think automatic casting tends to get messy/be pretty evil in general, but of course there are exceptions.
You could always do something like:
int divRes = intA / intB;          // integer division, explicit and intentional
double something = divRes * 5.342; // promoted to double here
At the very least it could warn me. I just tried it in Rust, and that will error out if you try to divide two ints and store the result in a float, which is fine by me.
Hi, would it be possible to contact you to ask some career questions related to the pharmaceutical industry and data science? I'm a biostatistician who uses R for everything and lately I've been thinking about doing a career change, but I'm a bit lost with all the available options.
My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
On a more serious note, I agree that R being too charitable in interpreting things (seemingly without warning) seems to be a problem. You'll have to do some debugging to make sure it actually does what you intended it to do. I've only dabbled in it a bit though.
> My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
In the real world we start counting from 1. CS people cannot stop complaining about it but it makes sense in languages used for mathematics and statistics. Zero-indexing is not very relevant if you don’t care about memory layout.
> It's a bit of a joke, like arguing over tabs vs. spaces though.
It is taken very seriously, though. This “issue” comes up very often when some people come and lecture others about how stupid the language they use is.
> May I recommend you this fabulous short essay by Dijkstra
That essay is not fabulous, it is obnoxious. I know you either love or hate Dijkstra, and he enjoyed being a contrarian, but he's unconvincing. The only point that surfaces during arguments on 0-indexing is iterating over 0..N-1 instead of 1..N. That's basically what he wrote himself. This could have been solved with just a bit of syntax if it were really a problem, and it remains largely because C did it that way to simplify pointer arithmetic. It does not change the fact that for the vast majority of people, the first element in a list is, well, first.
The proper way of handling this is to allow for arbitrary indices, because you will always find contexts where a different scheme makes sense (e.g. iterating from -10 to 10 is sometimes natural, and would otherwise require some index gymnastics). Insisting that one narrow view is the correct one is just annoying.
I dunno, it seems you misunderstood me. I clearly said that it is completely arbitrary to choose one over the other, and expressing a preference for either one is just a way of poking fun at people who are anal about choosing a specific one. So there isn't really any disagreement, though I'm always amazed at the lengths people go to to express what they think, when they're really just arguing about the definition of some thing.
> It is taken very seriously, though.
And those who do take it terribly seriously deserve being poked at ;)
Honestly, indices starting from 1 fit really nicely in most situations. 1-based indexing together with ranges and inclusive range-based indexing makes loop and subsetting code really readable IMO.
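A small illustration of what I mean (toy vector):

  x <- c("a", "b", "c", "d", "e")
  x[1]          # "a" - the first element really is element 1
  x[2:4]        # "b" "c" "d" - inclusive range, no off-by-one mental math
  x[length(x)]  # "e" - the last element is element length(x)
  for (i in seq_along(x)) print(x[i])   # loops read the same way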