
Factors are one of the worst things in the R world. I don't recall ever needing factors, yet they creep in through many functions (read.csv, cut).

Btw there's a nice readr package (from the Hadleyverse) with a read_csv function that does away with factors by default.



You should use factors for data cleaning and verification.

So you have "sex" on the questionnaire, and factor will very quickly identify contamination such as "often", "not yet", various mis-spellings, etc.
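
A minimal sketch of that kind of check, assuming a made-up vector of questionnaire answers (the `sex` values below are hypothetical, not real data):

```r
# Hypothetical questionnaire responses, including contaminated entries.
sex <- c("male", "female", "Female", "often", "male", "not yet", "femal")

# Converting to a factor and tabulating shows every distinct value at once,
# so stray answers and misspellings stand out.
f <- factor(sex)
print(levels(f))
print(table(f))

# Declaring the allowed levels up front turns anything unexpected into NA,
# which makes the offending entries easy to pull out.
clean <- factor(sex, levels = c("female", "male"))
bad <- sex[is.na(clean)]
print(bad)
```

Whether the stray values should be recoded or dropped is a separate decision; the factor only makes them visible.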


How would you represent categorical data then? R's primary use case isn't text processing. And HW isn't always right.


As character vectors, for instance. They can do everything factors can do when used in conjunction with `unique`, and ordered factors can be represented as a combination of character and numeric vectors. Factors work better, but only barely. In particular, they are nowadays not any more efficient than using character (!). They used to be, which is why they are used so liberally throughout R's base libraries.
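
A rough sketch of what that could look like, assuming small toy vectors (`x`, `sizes`, and `size_rank` are hypothetical names used only for illustration):

```r
x <- c("b", "a", "c", "a", "b")

# sort(unique(x)) plays the role of levels(factor(x)) ...
lv <- sort(unique(x))
# ... and match() recovers the integer codes a factor would store.
codes <- match(x, lv)

f <- factor(x)
print(identical(lv, levels(f)))
print(identical(codes, as.integer(f)))

# An ordered category as a character vector plus an explicit numeric ranking.
sizes <- c("small", "large", "medium")
size_rank <- c(small = 1, medium = 2, large = 3)
sorted_sizes <- sizes[order(size_rank[sizes])]
print(sorted_sizes)
```

The bookkeeping (`lv`, `size_rank`) has to be carried around by hand here, which is arguably the "only barely" part of the trade-off.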


"In particular, they are nowadays not any more efficient than using character"

How could a comparison of two strings of unknown size be as efficient as comparing two integers? I'm curious to learn something new.


R uses a global string cache so any string comparison is just comparing two pointers.


You will (inevitably?) run into factors when importing data from SPSS files... sure, you can discard them upon reading... but are you sure you don't want access to the value labels in the future?


Factors seem weird because no other language has anything quite like them, but they are actually quite a clever way to group data. It just takes a while to get used to them.


I actually use factors a fair amount, and having factor-like data shoved into numeric values gets you to some bad places statistically.
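
One way this can go wrong, sketched with invented data (the `region` codes and `y` values are hypothetical): when category codes are left as numbers, `lm` fits a single slope across them as if they were quantities.

```r
# Region codes 1, 2, 3 are arbitrary labels here, not measurements.
d <- data.frame(
  region = rep(c(1, 2, 3), each = 4),
  y = c(5.1, 4.9, 5.0, 5.2,  9.8, 10.1, 10.0, 9.9,  2.1, 1.9, 2.0, 2.2)
)

# Numeric coding: intercept plus one slope, i.e. a straight line through the codes.
fit_num <- lm(y ~ region, data = d)

# Factor coding: intercept plus one coefficient per non-baseline region.
fit_fac <- lm(y ~ factor(region), data = d)

print(coef(fit_num))
print(coef(fit_fac))
```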


You must not do a lot of regression with categorical data, then. I use commands like `lm(y ~ (x1 + x2) * factor_variable, data = d)` and `xyplot(y ~ x1 | factor_1, groups = factor_2, data = d)` all the time.


Those also work just fine with strings.


Via an implicit call to factor, right?
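
A small sketch to check this, using simulated data (the data frame `d` and its columns are made up for illustration): fitting with a character column and with an explicit factor should give the same coefficients, since the character column is turned into a factor when the design matrix is built.

```r
set.seed(1)
d <- data.frame(
  x1 = rnorm(30),
  g  = rep(c("a", "b", "c"), each = 10),
  stringsAsFactors = FALSE
)
d$y <- 1 + 2 * d$x1 + ifelse(d$g == "b", 0.5, 0) + rnorm(30, sd = 0.1)

# Character grouping variable passed straight into the formula.
fit_chr <- lm(y ~ x1 * g, data = d)

# Same model with the conversion done explicitly.
d$g_fac <- factor(d$g)
fit_fac <- lm(y ~ x1 * g_fac, data = d)

# Coefficient names differ (g vs g_fac), so compare the values only.
same <- all.equal(unname(coef(fit_chr)), unname(coef(fit_fac)))
print(same)
```

The implicit conversion uses the default (alphabetical) level order, so an explicit `factor()` call is still needed to pick a different baseline level.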


Factors are great, and surprisingly powerful even outside of statistical computing. That said, I prefer to create them on purpose rather than have read.csv attempt to be helpful.
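
A sketch of that workflow, assuming a tiny inline CSV (the `csv` string and its columns are hypothetical). `stringsAsFactors = FALSE` is passed explicitly so the result does not depend on the R version; it only became the default in R 4.0.0.

```r
csv <- "id,size\n1,small\n2,large\n3,medium\n"

# Read text columns as plain character vectors.
d <- read.csv(text = csv, stringsAsFactors = FALSE)
print(class(d$size))

# Create the factor deliberately, with the level order that makes sense
# for the data rather than the alphabetical default.
d$size <- factor(d$size, levels = c("small", "medium", "large"), ordered = TRUE)

# Ordered factors support comparisons against a level name.
below_large <- d$size < "large"
print(below_large)
```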



