Formatted spreadsheets can still work in R (luisdva.github.io)
69 points by sebg 11 months ago | 19 comments



I know a few champions of marking data with colors instead of using a new column. But this one tops it all :-)

> Using text color (and nothing else) to indicate units. (Example: black in pounds, green in kilograms).
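For what it's worth, that scheme is at least machine-recoverable: a package like tidyxl reads cell formatting alongside the values. A rough sketch, where the file name, sheet name, and the exact green RGB code are my assumptions:

    library(tidyxl)

    cells   <- xlsx_cells("weights.xlsx", sheets = "data")
    formats <- xlsx_formats("weights.xlsx")

    # font colour of each cell, looked up via its local format id
    font_rgb <- formats$local$font$color$rgb[cells$local_format_id]

    # black text = pounds, green text = kilograms
    cells$unit <- ifelse(!is.na(font_rgb) & font_rgb == "FF008000", "kg", "lb")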


What kind of cursed date is 32/1/2017?

If you're starting with that kind of data quality, what confidence can you have in the rest of it, even if the rest "appears" more normal?

To expand my thoughts further:

Data quality is like water quality. If you find something bad in your water source, you're not going to think you'll be okay just picking it out and carrying on with the rest. You need a proper process to treat it and, more importantly, you need to investigate the upstream source of the bad data.


The date you noticed was put there deliberately in this example Excel sheet by the original author.

> I then use the function pointblank::test_col_vals_in_set from the pointblank package to detect if there are any issues by setting no_issue to be either TRUE or FALSE.

> If there are any issues, I will isolate and display them.

> I will then correct the invalid dates accordingly and proceed. We just assume that the date was supposed to be 31-01-2017.

https://jeremy-selva.netlify.app/blog/2024-02-15-tackling-fo...
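In code, that step looks roughly like this (a sketch, not the post's exact code; the column name and the set are my assumptions):

    library(pointblank)
    library(dplyr)

    # TRUE only if every day-of-month value falls in 1..31
    no_issue <- test_col_vals_in_set(dates_df, columns = vars(day), set = 1:31)

    if (!no_issue) {
      # isolate and display the offending rows (e.g. the 32/1/2017 entry)
      dates_df |> filter(!day %in% 1:31) |> print()
    }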


Yes, but the point I'm trying to make is that I think "isolation" is the wrong approach for data quality issues. That's like fishing out a turd. You need to quarantine the lot until you fix the upstream.

Automating that process (invalidation followed by filtering) is even worse: it will merely mask data quality issues when you want the opposite, to grind everything to a halt until you get workable, realistic data.
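If you must automate, at least make the automation halt loudly. A minimal sketch with pointblank (the column name and bounds are my assumptions), whose col_vals_* checks error out by default when run directly on a data frame rather than silently filtering:

    library(pointblank)

    # stops the pipeline with an error if any day falls outside 1..31,
    # instead of quietly dropping the bad rows
    checked <- col_vals_between(raw_df, columns = vars(day), left = 1, right = 31)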

Perhaps I'm not "realistic" enough about real-world data to work in data analytics. The frustration of ignoring bad data was a large part of why I quit data analysis for software development around 15 years ago.


For sure, you'd generally want to address the source of the issue rather than band-aid it, but that isn't always possible (or at least not immediately), and you just have to work with what you've got. Most of the time your boss isn't going to let you "grind everything to a halt".

But, more importantly, I think the blog is focused more on the practical "here's how to do x with some code", and less about the theory of data science.


Post author here. Thanks to both of these commenters for pointing out the issues with the intentionally bad date. I'll make a note of this and update the post.


When you run a survey, the surveyor might record some data incorrectly by mistake; it happens. If a value is recorded wrong, basic statistics like the median, mean, and standard deviation won't change much. After all, it is just a survey and you are already sampling from the population, so even if every value in the sample were correct you still wouldn't recover the exact population figures.

But of course, if an inconsistency is found in the data, it is analyzed further to determine whether it was a small mistake or intentional wrongdoing.
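To make the first point concrete, a toy sketch with made-up numbers:

    # one day-of-month typo (31 entered as 32) in a sample of 100
    set.seed(1)
    days <- c(31, sample(1:31, 99, replace = TRUE))
    days_typo <- days
    days_typo[1] <- 32                 # the data-entry error
    mean(days_typo) - mean(days)       # exactly 0.01
    sd(days_typo) - sd(days)           # a similarly tiny shift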


Great to see people trying to ingest such hand-crafted horror data.

However, whenever I encounter this with a customer, I'll take the challenge: show how much of a burden it is to manage, and how much more it will cost to maintain in the long run if even a tiny bit of that metadata changes inadvertently.

Usually, the customer is then receptive to some guidance on how (not) to format spreadsheets and how to use them as clean db tables.

Then you've got clean, easily ingestible data, and the customer knows a thing or two about data management. Win-win for everyone.


When you can exercise some influence over customer behavior, I agree this is the best way forward.


I think a lot of these bad Excel sheets are created by Crystal Reports or some similar product. Often I want to download data and just use it, but the only options are a PDF or an Excel sheet that has been formatted to look like the PDF. They frequently concatenate things that would be better as individual fields, merge columns in various ways, and organize things in pretty but not columnar layouts. Frustrating; just give me the machine-readable version.


Since this is about R, can anyone please tell me how to get the following package - https://docs.tibco.com/pub/enterprise-runtime-for-R/6.1.1/do...


My guess is you buy it from Tibco. It sounds like it is part of their enterprise R distribution.

Do you have reason to believe it is open-source/free?


I recognize the majority of those datasets as being included in the base `datasets` package. Are there differences you’re looking for?
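You can check what's bundled with:

    # list everything shipped with base R's datasets package
    data(package = "datasets")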


Yeah, where is `lottery`?


There's a copy of the dataset at http://www.math.ntu.edu.tw/~hchen/Prediction/timeseries1.htm...

There are lottery.number, lottery.payoff, lottery2, and lottery3. Further info is provided in the 1988 book The New S Language: A Programming Environment for Data Analysis and Graphics by Becker, Chambers & Wilks.


Yeah, I started to read that book, then looked into the R installation on my machine, but was disappointed not to find those datasets.


I even installed S-PLUS 8.0 in order to look for this, but Sdatasets doesn't seem to have been shipped in the base install. If anyone knows...


Yeah... I believe this is for running R embedded in Tibco Spotfire (like Tableau or other BI software). We used to use it to pull and sanitize data from our data warehouse into data frames, and then create our Spotfire templates to run via the web player.


I hope people one day cease torturing themselves with spreadsheets...



