The further I get into my career as a data scientist (formerly a software engineer) the more I think I see what Kay was getting at in this thread.
I spend a huge chunk of my time swimming in a sea of data that people have carelessly amassed on the assumption that data is inherently valuable. And frequently I come to the conclusion that this data has negative value. The people who collected it failed to record enough information about the data's provenance. So I don't know what kind of processes produced it, how and why it was collected, any transformations that might have happened on the way to storing it, etc. Without that information, I simply cannot know for sure what any of it really means with enough precision to be able to draw valid conclusions from it.
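Purely as illustration of what I mean by provenance (the field names here are my own invention, not any standard), the bare-minimum record I wish accompanied every table:

    from dataclasses import dataclass, field
    from datetime import date

    # Hypothetical sketch: the minimum context I'd want stored alongside a dataset.
    @dataclass
    class ProvenanceRecord:
        source_system: str       # what process produced the raw records
        collection_reason: str   # why it was collected in the first place
        collected_from: date
        collected_to: date
        transformations: list[str] = field(default_factory=list)  # every step between source and storage
        known_caveats: list[str] = field(default_factory=list)    # sampling quirks, outages, schema changes

    record = ProvenanceRecord(
        source_system="checkout-service event stream",
        collection_reason="debugging payment failures, not analytics",
        collected_from=date(2019, 1, 1),
        collected_to=date(2020, 6, 30),
        transformations=["deduplicated by session id", "PII columns dropped"],
        known_caveats=["mobile clients not instrumented until 2019-09"],
    )

Without something like this, every downstream analysis starts with guesswork.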
The best I can do is advise caution and say that this data can help us form hypotheses, whose only real use is shaping a plan to collect new data that can actually answer the question at hand. What I'm typically asked to do instead is make a best guess and call it good.
The former option is just the scientific method. The latter option is the very essence of pseudoscience.
That second word in my job title gives me some anxiety. I'm keenly aware that most people are more fond of the word "science" than they are of the intellectual practice we call science. The people who like to amass data are no exception.
I was the data and analytics part of a global team at HomeAway as we were struggling to finally release a "free listing / pay per booking" model to catch up with Airbnb. I wired up tracking and a whole bunch of behavioral-analysis instrumentation, including our GA implementation at the time.
Before launch we kept seeing a step in the onboarding flow with massive drop-off, and I kept red-flagging it. Eventually the product engineering team responsible for that step came back with a bunch of Splunk logs saying they couldn't see the drop-off and our analytics must be wrong "because it's JS", which was just an objectively weird take.
For "silo" reasons this splunk logging was used by product and no one else trusted it or did anything actionable with it as far as I could tell other than internally measuring app response times.
I would not unflag the step, and one PM in particular got very upset about this, insisting our implementation was wrong and roping in a couple of senior engineers for support.
I personally started regression testing that page based on our data and almost immediately caught that any image upload over ~1MB failed, and mobile Safari didn't work at all. It turned out they had left their MVP code in place, and it used Flash or something equally stupid, so it broke on image size and some browsers simply wouldn't work.
The step was fixed a couple of weeks before launch, and the go-live went as well as could be expected.
To this day I have no clue how that particular team had so misconfigured their server-side logging that it HID the problem, but you see it all the time: if you don't know what you're doing and don't know how to validate things, your data will actively sabotage you.
You've accidentally described 100% of my experience with Splunk at every org I've worked at: it's so expensive that no one is given access to it, it's hard to get logs into it (because of the expense), and so your experience of it is that the anointed "Splunk team" wants something, but you never see how they're using it or really the results at all, except when they have an edict they want to hand down because "Splunk says it".
It was odder than just missing errors: they weren't seeing any funnel drop-off at all.
It wasn't worth investigating and fixing for them. At the time I figured they were excluding traffic incorrectly or didn't know how to properly query for "session" data... it could have been any number of things, though.
A funny pattern I've seen several times is someone querying some data, getting results that don't match their mental model/intuition, then applying a bunch of filters to "reduce noise" until they see the results they expected.
Of course this can easily hide important things.
Made-up example: the funnel metrics record three states: in progress, completed, or abandoned. If a user clicks "cancel" or visits another page, the state is set to abandoned; otherwise it stays in progress until it's completed. Someone notices that a huge percentage of the sessions are in progress, thinks there can't be that many things in progress and we only care about completed or abandoned anyway, and then accidentally filters out everyone who just closed the page in frustration.
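A toy version of how that hides the problem (made-up numbers, pandas assumed):

    import pandas as pd

    # Made-up funnel data: most users just closed the tab, so their state never left "in_progress".
    sessions = pd.DataFrame({
        "session_id": range(10),
        "state": ["completed"] * 2 + ["abandoned"] * 1 + ["in_progress"] * 7,
    })

    # The "noise-reduced" query: keep only the states we think we care about...
    filtered = sessions[sessions["state"].isin(["completed", "abandoned"])]
    print(filtered["state"].value_counts(normalize=True))   # completed ~0.67 -- looks healthy

    # ...while the unfiltered view tells the real story: 70% never finished at all.
    print(sessions["state"].value_counts(normalize=True))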
Real example: doing the data analysis for one of the "beyond the Standard Model" physics experiments. For example, there is one where they basically shoot a big laser at a wall and see if anything goes through. Spoiler: it won't.
Such an experiment will usually see nothing and claim an upper bound on the size of some hypothetical effect (thus essentially ruling it out). Such a publication would be reviewed and scrutinized rather haphazardly. Regardless, the results are highly publishable and the scientists working on it are well respected.
Alternatively, the experiment might see something and produce a publication that would shatter modern understanding of physics, which means it would be strongly reviewed and scrutinized and reproduction attempts would happen.
Since the a priori probability of such an experiment finding something is absurdly low, the second case would almost always end with an error being found and the scientists involved being shamed. Therefore, when you do data analysis for such an experiment, especially if you want your career to move on to a different field or to industry, you quickly find ways to explain away and filter out any observation as noise.
There's intent that can't be stored: I can write down the word potato on a shopping list, and again on a recipe, and they represent totally different ideas with exactly the same characters. I interpret the first by going to the market, the other in the kitchen.
I'm sure there are many people eagerly trying to solve this problem by just storing more metadata so we can work out which interpreter we want, but then we need increasingly many layers of interpreters, and eventually you're asking for a machine that simulates the universe. I find myself agreeing that just processing "data" forever has limits, and our continued refusal to recognize that is going to be very costly.
> I can write down the word potato on a shopping list, and again on a recipe, and they represent totally different ideas with exactly the same characters.
Yes. In cybersecurity we already say data is a toxic asset. It can be 'wrong' or cause harm in so many more ways than its narrow band of intended good. This thread touches a concurrent topic from LessWrong about pianos and quality. Reality is infinitely nuanced, and the finer the detail, the more it matters to the person who "cares" (Pirsig said quality and care were flip sides of the same thing, and Quine had a similar thought about how all data takes its meaning from a spectrum of context).

"Data" today is collected without care for its use, quality, or effects. The horror is that we are training machines on that very low quality data and expecting high quality results.
Hickey seems to have taken the definition he cites of "data" as "a thing given" as axiomatic, requiring no further thought about the implicit follow-up questions like "given by whom?" and "by what means?". This severely limits the scope of his analysis versus Kay's, and I think it's what had them talking past one another.
In industry, the incentives seem very rarely to line up such that questions like those are welcome.
Yeah. As a general rule of thumb, dictionaries are simplistic and extremely lagging signals of everything a word can mean. No offense to dictionaries, since their goal is to be a succinct, useful, and universal summary of words, but it's usually a mistake to trot them out in an argument.
Would you take a dictionary's definition as the final word on a complex philosophical topic, like epistemology? Or as its starting point?
It gets even worse in the realm of something like politics, where different groups have contended over, and actively fought to redefine, the meanings of words over time.
I wasn't there for the beginning, but I got dropped into a corp that had amassed a "data lake" with 20k tables of almost worthless data. One senior data scientist lost his pet project to what turned out to be contaminated data that leaked outcomes into his model features; he basically checked out mentally and eventually quit. It was a hopeless environment: engineers in one country were building products, completely siloed away from the people who were supposed to use their data.
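For anyone who hasn't been burned by it yet, "leaking outcomes into features" usually looks something like this (made-up columns, scikit-learn assumed):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Made-up churn data. "has_close_date" is derived from a field that only gets populated
    # AFTER a customer churns, so the feature quietly encodes the answer.
    df = pd.DataFrame({
        "tenure_months":  [3, 24, 12, 36, 6, 48, 9, 30],
        "has_close_date": [1, 0, 1, 0, 1, 0, 1, 0],
        "churned":        [1, 0, 1, 0, 1, 0, 1, 0],
    })

    X_train, X_test, y_train, y_test = train_test_split(
        df[["tenure_months", "has_close_date"]], df["churned"],
        test_size=0.25, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # near-perfect offline, worthless in production

The model looks great right up until it meets data where the outcome hasn't happened yet.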
> The people who collected it failed to record enough information about the data's provenance.
This feels a bit like a debate that I (a generalist SDE) keep having with Product folks who propose some sort of magic system that collates and displays "Applicants" across a bunch of unauthenticated form submits and third-party customer databases, "because we already have the data".
Yeah, but most of it is fundamentally untrustworthy and/or must be aggressively siloed to prevent cross-customer data poisoning. We could try to build an in-house identity-graph system, but we'd at least need to record something about levels of confidence for the different steps, relationships, and assumptions (rough sketch after the examples below).
For example, it would be very bad for privacy if a visitor could put my public e-mail address into a form/wizard, and then the next step "helpfully" autofills or asks to confirm data like the associated (real) name or (real) home address.
Alternately, someone could submit data with a correct phone number or e-mail address, but named "Turdy McPooperson" at "123 Ignore This Application Drive." Now the real user comes by, gets pissed when the system "greets" them with an insult, and anything they submit gets thrown in the trash by users who see it displayed under a combined profile named Turdy McPooperson.
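To make the "levels of confidence" idea concrete, a rough sketch (the names and threshold are mine, not anything we actually built):

    from dataclasses import dataclass

    # Hypothetical: an identity-graph edge that carries how it was made and how much we trust it.
    @dataclass
    class IdentityLink:
        left_record: str    # e.g. "form_submit:8472"
        right_record: str   # e.g. "crm_contact:1139"
        matched_on: str     # which field(s) drove the match
        verified: bool      # did the person prove control of the identifier?
        confidence: float   # 0.0 - 1.0, however the org decides to score it

    def safe_to_autofill(link: IdentityLink) -> bool:
        # Never surface another record's personal details off an unverified, self-reported match.
        return link.verified and link.confidence >= 0.95

    link = IdentityLink("form_submit:8472", "crm_contact:1139",
                        matched_on="email", verified=False, confidence=0.6)
    assert not safe_to_autofill(link)  # a typed-in email alone should never trigger autofill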
Personally it was because I was staring down a career of making endless CRUD apps and became disillusioned. I read all those cool data science articles in the 2010s and thought it was way more interesting. Joke's on me though, now I'm still disillusioned and a data scientist.
Shannon and Weaver distinguished between information and meaning in their book The Mathematical Theory of Communication.
"Frequently the messages have meaning; that is they refer
to or are correlated according to some system with certain physical or conceptual entities. These semantic
aspects of communication are irrelevant to the engineering problem." - Shannon
"In particular information must not be confused with meaning" - Weaver
It seems to me that Kay sees "data" in the context of semiotics, where there is a signifier, a signified, and an interpreter, while Hickey is in the camp of physics, where things "are what they are" and can't be used to lie (from "semiotics studies everything that can be used to lie").
I'd be interested in reading where Kay references semiotics.
As a designer that is “graphically oriented” by nature, and also “CLI oriented” from necessity, I can easily see why Kay would lean into semiotics to iron out how humans should best interact with machines.
It’s possible all interfaces will eventually be seen as relics of a low bandwidth input era. We should understand the semiotics (and physics) of everything well before we hand off full autonomy to technology.
He doesn't directly reference semiotics; it's just the line of argument that adds an interpreter to the equation. This implies that data is just a signifier, which can then be resolved to a signified with the help of an interpreter, hence you also need to send an interpreter along with it.
In what form the interpreter is sent, though, remains an open question (because if the answer is "data", wouldn't that make the argument recursive?).
Anything less than being a convincing prophet or an exhaustive orator won't suffice. There is likely no definitive answer to anything—only varying degrees of certainty, based on conceptual frameworks that are ultimately rooted in philosophy.
Don't the frame and qualification problems discredit the latter?
FWIW, most physics professionals I know who aren't just popular personalities are not in the scientific-realism camp.
They realize that all models are wrong and that some are useful.
I do think that the limits of induction and deduction are often ignored in the CS world, and abduction, being practical only in local cases, is also ignored.
But the quants have always been pseudoscientific.
We are restricted to induction, deduction, and Laplacian determinism not because they are ideals, but because they make problems practical with computers.
There are lots of problems that we can find solutions for, many more that we can approximate, but we are still producing models.
More and more data is an attempt to get around the frame and qualification problems.
Same problem that John McCarthy is trying to get around in this 1986 paper.
If you look at the poster's profile: https://news.ycombinator.com/user?id=wdanilo, you'll see they're the founder of Enso. Seems like they pivoted at some point. (I'm a fan of this move personally, as I loathe the usage of singular common words to name a product)
> Luna is now Enso. Following a couple of years of going by Luna, we were facing issues that were making it difficult for us, and people looking for Luna. Luna is a popular term, and in programming-language land, is also very close to the popular language Lua, an endless source of confusion.
I really didn't like Kay's approach to the discussion. I don't want hints and "you fill in the blank". Tell me what you think; don't "vaguepost".
I get it that he wants me to do the thinking for myself and figure it out on my own. But he's Alan Kay, and I'm not. I may never figure out what he's figured out, even with hints. And even if I can, maybe I don't have that kind of time.
A lot of people think of programming as performing operations on “data” which one must know how to correctly interpret. If two different pieces of code don’t agree about the meaning of your data, it will lead to subtle bugs.
The idea of OOP is to eliminate "data" as much as you can, and to express your logic in terms of objects interacting with each other through their interfaces. If done properly, you no longer need to deal with data that every piece of code has to interpret on its own, but with a system that already understands the meaning and semantics of its state.
Of course most people just create objects with data accessors instead.
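A minimal contrast of the two styles (a toy example of my own, not anything from the thread):

    # Style 1: "data" plus accessors -- every caller has to know what the fields mean.
    class AccountRecord:
        def __init__(self, balance_cents: int, currency: str):
            self.balance_cents = balance_cents
            self.currency = currency
    # ...and callers everywhere do: record.balance_cents / 100 if record.currency == "USD" else ...

    # Style 2: the object owns the meaning and exposes behavior instead of raw state.
    class Account:
        def __init__(self, balance_cents: int, currency: str):
            self._balance_cents = balance_cents
            self._currency = currency

        def withdraw(self, amount_cents: int) -> None:
            if amount_cents > self._balance_cents:
                raise ValueError("insufficient funds")
            self._balance_cents -= amount_cents

        def balance_display(self) -> str:
            return f"{self._balance_cents / 100:.2f} {self._currency}"

In the second style, no other piece of code ever has to re-interpret what balance_cents means.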
Alan Kay has agreed to do an AMA today - https://news.ycombinator.com/item?id=11939851 - June 2016 (893 comments)