The problem with data comes in two parts. First, data is inherently abstractive and abstractions are leaky. And second, we have culturally elevated data to a truth beyond what we can see with our own eyes.
Or, to paraphrase Douglas Adams: "The data is definitive. Reality is frequently inaccurate."
Because abstractions are leaky, it's easy to create false apparent patterns in data. And when the data is trusted more than the reality it is supposed to describe, those false apparent patterns become unfalsifiable, because the only way to falsify them is the data that created them in the first place.
The goal of avoiding human bias and compensating for human fallibility is a good one. So I don't think that "data" is a bad idea. What I think is that data is a tool, to be used alongside intuition and experiment, as a means of understanding the world. If your beliefs are wildly inconsistent with the data, it is worth asking why and lowering your confidence in those beliefs (but you need not immediately abandon them). If your beliefs are consistent with the data, it's worth asking what other beliefs might be as well. If you did an experiment and predicted one set of observations but observed another, that's proof that at a minimum your prediction failed (even if your theory was not a bad one), and that deserves investigation. These are legitimate uses of data.
> Or, to paraphrase Douglas Adams: "The data is definitive. Reality is frequently inaccurate."
Having worked with public sector data in Denmark, this part is particularly hilarious to encounter in the wild. Even something as “simple” as an organisational chart is something with multiple realities depending on who you ask. Often the people working within the context of the different realities will be quite fanatical about their own reality.
The place I worked had an employee registry which became the foundation for more and more purposes as the digital services grew, typically being the foundation for access rights to the 300+ different IT systems. It was based on the payment system, which was sort of natural when it was built, because that is the one place every employee is registered. Of course this became an issue. For one, teams can only have one manager in basically every Danish HR system; I’m not sure why that is, because a lot of teams have multiple managers performing different roles. Sometimes some of the manager roles were delegated, sometimes the responsibilities were simply split. In any case, because there was no data on this hierarchy, it was hilariously hard to do things like set sensible defaults for who would have the right to approve vacation, run audits and so on. Then you had healthcare, which works three shifts with a different number of people on each shift. The night shift especially was a challenge, because they needed access to the whole house and every patient. Which might’ve been easy if there was a regular night shift team, but healthcare personnel rotate shifts. Even the specifically designed patient registry, which was built solely for patient care, couldn’t handle this because nobody had thought about it before they built it (or about data laws like GDPR).
Anyway, there were a billion different things where data didn’t represent a single reality. I can’t get into the stuff involving citizens, but let’s just say that it will be horrible when different departments use the data with AI as though their own reality is the only reality.
It's not just software. Law and policy do the same thing. So does science - even an ostensibly fundamental concept such as "temperature" is really just a simplifying stochastic model of a complex physical system. This is what natural language does, too.
As another commenter pointed out, "The map is not the territory."
The full Korzybski quote is perhaps more insightful, if less pithy: "A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness."
Right, "the map is not the territory" is just half of the quote, and the worse half at that. It's like saying "well, you never know" to everything. Okay, thanks for your help.
> The full Korzybski quote is perhaps more insightful, if less pithy: "A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness."
Right, nobody is expecting a map to actually be the territory. The only question is whether it's useful. We do have a pithier quote for that; one of my favorite quotes of all time:
"All models are wrong, but some models are useful."
Yes. That was the point of my original post: abstractions generate problems.
But abstractions are also useful. You can't just not abstract anything at all.
I took a look at my company's metrics this morning. Approximately 30% of the candidates we send to clients (and who have not ended up out of the process for reasons outside of quality, e.g. the company hired someone else) have ended up getting an offer. That's an important piece of information: it tells me that my company does not have a problem with failing to screen out weak candidates.
Is that leaving out some important details? Yeah, of course! One of our candidates failed an interview because he was too aggressive in questioning his interviewer about their company's prospects. That's a useful piece of information, too; it was (along with a couple other anecdotes) a clue that we should try to do more basic coaching for candidates before interviews.
The data tells me how common the problem is, and suggests which problems are most critical to solve first. The anecdotes can tell me in detail about the nature of the problems, and suggest to me possible interventions. Both of those things matter.
"At the end of every seven years you shall grant a deletion of the data. And this is the form of the delete: Every data collector who has recorded anything on his neighbor shall delete it; he shall not maintain any of it about his neighbor or his brother, because it is called the Lord's delete" [1]
What is the connection you are drawing between those two things?
A census was used to assess a nation's fitness for battle and to set a taxing expectation.
Revelation is discussing the state of affairs under the rule of Nero (the man whose name has the number 666, or 616 depending on the source of your translation). Presumably, it became difficult to do business if you were not known to be loyal to the emperor.
(Side note: You are probably meaning to reference 2 Samuel 24 / 1 Chron 21 when you claim that unsanctioned censuses were prohibited, though that's never actually stated. Just, everybody in the story knows that what they're doing is wrong. But plenty of commentary has been written about why David's census was problematic.)
I think this is actually hitting much closer to the intention of Levitical law than people realize. Data and debt are very strongly connected, and I think probably always have been.
It's entertaining to game out the idea that anything that happened digitally more than 7 years ago just disappears.
A rolling Nothing that just eats anything that's 7 years in the past. A Great Oubliette into which we just toss anything from 7 years ago.
I think this would make an interesting dystopian novel.
Why stop at digital data? All memories roll into nothingness at 7 years, to be replaced by whatever the brain is forced to put in their place to resolve any apparent dissonance. A select few have discovered the secret and used it to build unimaginable power (e.g., building armies by promising riches and wealth after an 8yr term, using the Great Oubliette to convince recruits they only have 1 yr left for the entirety of their lifetime of service).
- as a representation of something else, it can be incorrect (meaning error)
- as a domain of decision input, it can be misleading (sampling error)
- for questions of any significant complexity, it's the only way to scale decision-making capacity
- in an economy where actors differ in scale and information asymmetry can be leveraged to financial advantage, data gathering is incentivized even or especially when it contributes to coercive transactions, violating transaction invariants and reducing the competitive parity that disciplines the market
- it gives agents the illusion that they understand, leading to overconfident actions
How does knowledge differ from data in these respects?
- Knowledge is validated by sharing. Facts known only to one or few are not considered known.
- Knowledge can only be shared after it's embedded into the overall meaning of a culture
- Knowledge can only scale to the well-understood and well-remembered past events simple enough to be comparable to other such events
Most people make personal decisions based on knowledge. A few people can make assessments/decisions based on data (though many actually use knowledge and then justify it using data). Organizations have to reduce knowledge to data to distribute authority and avoid bureaucratic capture.
People are more valuable when knowledge is more valuable, but knowledge really only has an operational advantage when value lies more in conserving states or staying small than in producing new ones or going big.
The "problems of data" are not really problems with data, I feel that's what Rich Hickey was alluding to in that discussion (and no, I didn't feel that him & Alan Kay were talking past each other)
> as a representation of something else, it can be incorrect (meaning error)
- So here, you're saying that "knowledge" may be incorrect. "Sun observed at this position in the sky during various times of day" is data, whereas "Sun moves around the Earth" is (wrong) knowledge. Yes, data can contain errors (e.g. incorrect measurements). But Rich Hickey was saying that the fact that data doesn't also contain the "interpretation" is a feature, not a bug!
> as a domain of decision input, it can be misleading (sampling error)
- Right. But at least it gives you the tools to validate the decision process and identify errors, or potential weaknesses. If you include the interpreter with the data and give direct access to the decision, any error in the interpreter will automatically invalidate all the data (and really it will make it hard to tell whether it's a sampling error, an interpretation error, or simply an error in the original measurements).
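To make that concrete, here is a minimal Python sketch (the observation values are invented): the records stay plain, and the competing interpretations live in separate functions, so a broken interpreter can be replaced without invalidating or re-collecting the measurements.

```python
# Raw observations stay as plain records; no interpretation is baked in.
observations = [
    {"time": "08:00", "sun_altitude_deg": 12.0},
    {"time": "12:00", "sun_altitude_deg": 58.0},
    {"time": "16:00", "sun_altitude_deg": 20.0},
]

# Interpretations live separately, as functions over the same records.
def geocentric_reading(obs):
    return "the Sun circles the Earth"              # wrong model, same data

def heliocentric_reading(obs):
    return "the Earth rotates relative to the Sun"  # better model, same data

# Because the data was never fused with an interpreter, a broken model can be
# swapped out while the measurements remain usable.
for interpret in (geocentric_reading, heliocentric_reading):
    print(interpret(observations))
```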
> it gives agents the illusion that they understand, leading to overconfident actions
I like this rebuttal. It disentangles data from interpretation and knowledge. This distinction helps us to solve problems associated with data and is a core tenet of science and problem solving.
Increasing the amount of generated data and not jumping to conclusions at the same time is how we avoid getting stuck in misconceptions or plain ignorance.
The further I get into my career as a data scientist (formerly a software engineer) the more I think I see what Kay was getting at in this thread.
I spend a huge chunk of my time swimming in a sea of data that people have carelessly amassed on the assumption that data is inherently valuable. And frequently I come to the conclusion that this data has negative value. The people who collected it failed to record enough information about the data's provenance. So I don't know what kind of processes produced it, how and why it was collected, any transformations that might have happened on the way to storing it, etc. Without that information, I simply cannot know for sure what any of it really means with enough precision to be able to draw valid conclusions from it.
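Purely as an illustration (the field names are invented, not taken from any particular system), even a minimal provenance record alongside each value would answer most of those questions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenancedRecord:
    value: float
    source_system: str            # which process produced it
    collected_at: datetime        # when it was collected, and under what clock
    collection_method: str        # survey, sensor, scrape, manual entry...
    transformations: list[str] = field(default_factory=list)  # every step applied on the way to storage

reading = ProvenancedRecord(
    value=42.7,
    source_system="legacy-crm-export",
    collected_at=datetime(2021, 3, 14, 9, 26, tzinfo=timezone.utc),
    collection_method="nightly batch scrape",
    transformations=["deduplicated by email", "currency converted to EUR"],
)
```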
The best I can do is advise caution, say that this data can help us form hypotheses whose only true use is helping us form a plan to collect new data that will help us answer the question we have. What I'm typically asked to do instead is make a best guess and call it good.
The former option is just the scientific method. The latter option is the very essence of pseudoscience.
That second word in my job title gives me some anxiety. I'm keenly aware that most people are more fond of the word "science" than they are of the intellectual practice we call science. The people who like to amass data are no exception.
I was the data and analytics part of a global team at HomeAway as we were struggling to finally release a "free listing / pay per booking" model to catch up with Airbnb. I wired up tracking and a whole bunch of stuff for behavioral analysis, including our GA implementation at the time.
Before launch we kept seeing a step in the onboarding flow with massive drop-off, and I kept red-flagging it. Eventually the product engineering team responsible for that step came back with a bunch of Splunk logs saying they couldn't see the drop-off and our analytics must be wrong "because it's JS", which was just an objectively weird take.
For "silo" reasons this splunk logging was used by product and no one else trusted it or did anything actionable with it as far as I could tell other than internally measuring app response times.
I would not unflag the step, and one PM in particular started getting very upset about this, saying our implementation was wrong, and he roped a couple of senior engineers in for support.
I personally started regression testing that page based on our data and almost immediately caught that any image upload over ~1 MB was not working, and neither was mobile Safari. It turned out they had left their MVP code in place, and it used Flash or something stupid, so it would break on image size and some browsers just wouldn't work at all.
It was updated a couple weeks before launch and the go live was as good as could be expected.
To this day I have no clue how this particular team had so misconfigured their server-side logging that it HID the problem, but you see it all the time: if you don't know what you're doing and don't know how to validate things, your data will actually sabotage you.
You've accidentally described 100% of my experience with Splunk at every org I've worked at: it's so expensive no one is given access to it. It's hard to get logs into it (because of expense). And so your experience of it is that the anointed "Splunk team" wants something, but you never see how they're using it or really the results at all, except when they have an edict they want to hand down because "Splunk says it".
It was odder than no errors: they weren't seeing any funnel drop-off at all.
It wasn't worth investigating and fixing for them. At the time I figured they were excluding traffic incorrectly or didn't know how to properly query for "session" data... could have been any number of things though.
A funny pattern I've seen several times is someone querying some data, getting results that don't match their mental model/intuition, then applying a bunch of filters to "reduce noise" until they see the results they expected.
Of course this can easily hide important things.
Made-up example. The funnel metrics record three states: in progress, completed, or abandoned. If a user clicks "cancel" or visits another page the state will be set to abandoned, otherwise it will be in progress until it's completed. Someone notices that a huge percentage of the sessions are in progress, thinks there can't be that many things in progress and we only care about completed or abandoned anyway, and then accidentally filters out everyone who just closed the page in frustration.
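A toy Python version of that made-up example (all numbers invented) showing how the "reduce noise" filter makes the funnel look healthy:

```python
from collections import Counter

# Hypothetical funnel sessions, mirroring the made-up example above.
sessions = (
    ["completed"] * 120
    + ["abandoned"] * 80          # clicked cancel or navigated to another page
    + ["in_progress"] * 600       # mostly people who just closed the tab in frustration
)

# Naive view: "we only care about completed vs abandoned anyway"
filtered = Counter(s for s in sessions if s != "in_progress")
print(filtered)           # Counter({'completed': 120, 'abandoned': 80}) -> looks like 60% completion

# Honest view: most sessions never reached either terminal state
print(Counter(sessions))  # 600 of 800 sessions silently stalled out
```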
Real example: when you're working on the data analysis of one of the "beyond the standard model" physics experiments. For example, there is one where they basically shoot a big laser against a wall and see if anything goes through. Spoiler: it won't.
Such an experiment will usually see nothing and claim an upper bound on the size of some hypothetical effect (thus essentially ruling it out). Such a publication would be reviewed and scrutinized rather haphazardly. Regardless, the results are highly publishable and the scientists working on it are well respected.
Alternatively, the experiment might see something and produce a publication that would shatter modern understanding of physics, which means it would be strongly reviewed and scrutinized and reproduction attempts would happen.
Since the a priori probability of such an experiment finding something is absurdly low, the second case would almost always lead to an error being found and the scientists involved being shamed. Therefore, when you do data analysis for such an experiment, especially if you want your career to move on to a different field or to industry, you always quickly find ways to explain away and filter out any observation as noise.
There's intent that can't be stored: I can write down the word potato on a shopping list, and again on a recipe, and they represent totally different ideas with exactly the same characters. I interpret the first by going to the market, the other in the kitchen.
I'm sure there are many people eagerly trying to solve this problem by just storing more metadata so we can work out which interpreter we want, but now we need increasingly more layers of interpreters, and now you're asking for a machine that simulates the universe. I find myself agreeing that just processing "data" forever has limits, and our continued refusal to recognize that is going to be very costly.
> I can write down the word potato on a shopping list, and again on a recipe, and they represent totally different ideas with exactly the same characters.
Yes. In cybersecurity we already say data is a toxic asset. It can be 'wrong' or cause harm in so many more ways than its narrow band of intended good. This thread touches a concurrent topic from LessWrong about pianos and quality. Reality is infinitely nuanced, and the finer the detail the more it matters to the person who "cares" (Pirsig said quality and care were flip sides of the same thing, and Quine had a similar thought about how all data only has meaning along a spectrum of context.)

"Data" today is collected without care for its use, quality or effects. The horror is that we are training machines on that very low quality data and expecting high quality results.
Hickey seems to have taken his citation of "data"'s definition as "a thing given" as axiomatic, requiring no further thought on the implicit follow-up questions like "given by whom?" and "by what means?". This severely limits the scope of his analysis versus Kay's, which I think is what had them talking past one another.
In industry, the incentives seem very rarely to line up such that questions like those are welcome.
Yeah. As a general rule of thumb, dictionaries are simplistic, and extremely lagging, signals about everything a word can be about. No offense to dictionaries, since their goal is to be a succinct, useful, and universal summary of words, but it's usually a mistake to trot them out in an argument.
Would you take a dictionary's definition as the final matter on a complex philosophical topic, like epistemology? Or its starting point?
It gets even worse in the realm of something like politics, where different groups have contended over, and actively fought to redefine, the meanings of words over time.
I wasn't there for the beginning, but I got dropped into a corp that had amassed a "data lake" with 20k tables of almost worthless data. One senior data scientist lost their pet project to what turned out to be contaminated data that leaked outcomes into their model features. They basically checked out mentally and eventually quit. It was a hopeless environment: engineers in one country were building products, completely siloed away from the people who were supposed to use their data.
> The people who collected it failed to record enough information about the data's provenance.
This feels a bit like a debate that I (a generalist SDE) keep having with Product folks who propose some sort of magic system that collates and displays "Applicants" across a bunch of unauthenticated form submits and third-party customer databases, "because we already have the data".
Yeah, but most of it is fundamentally untrustworthy and/or must be aggressively siloed to prevent cross-customer data poisoning. We could try to build an in-house identity-graph system, but we'd at least need to record something about levels of confidence for the different steps or relationships or assumptions.
For example, it would be very bad for privacy if a visitor could put my public e-mail address into a form/wizard, and then the next step "helpfully" autofills or asks to confirm data like the associated (real) name or (real) home address.
Alternately, someone could submit data with a correct phone number or e-mail address, but named "Turdy McPooperson" at "123 Ignore This Application Drive." Now the real user comes by, gets pissed when the system "greets" them with an insult, and anything they do submit gets thrown in the trash by users who see it displayed under a combined profile named Turdy McPooperson.
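Purely as a sketch of what "recording levels of confidence" could look like (field names and the threshold are made up, not an existing system), the assumption behind each identity link gets stored alongside the link itself:

```python
from dataclasses import dataclass

@dataclass
class IdentityLink:
    left_record: str       # e.g. an unauthenticated form submission ID
    right_record: str      # e.g. a row in a customer's own database
    basis: str             # what the link was inferred from
    confidence: float      # 0.0-1.0; the assumption being made, made explicit

links = [
    IdentityLink("form-7841", "custA-112", "same verified email after login", 0.95),
    IdentityLink("form-9013", "custB-440", "same typed-in phone number", 0.40),
]

# Anything below a threshold is never surfaced across customers, never used to
# autofill or "helpfully" confirm personal details for an unauthenticated visitor.
display_candidates = [l for l in links if l.confidence >= 0.9]
```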
Personally it was because I was staring down a career of making endless CRUD apps and became disillusioned. I read all those cool data science articles in the 2010s and thought it was way more interesting. Joke's on me though, now I'm still disillusioned and a data scientist.
Shannon and Weaver distinguished between information and meaning in their book The Mathematical Theory of Communication.
"Frequently the messages have meaning; that is they refer
to or are correlated according to some system with certain physical or conceptual entities. These semantic
aspects of communication are irrelevant to the engineering problem." - Shannon
"In particular information must not be confused with meaning" - Weaver
It seems to me that Kay sees "data" in the context of semiotics, where there is a signifier, a signified and an interpreter, while Hickey is in the camp of physics where things "are what they are" and can't be lied with (from "semiotics studies everything that can be lied with").
I am interested in reading where Kay references semiotics?
As a designer that is “graphically oriented” by nature, and also “CLI oriented” from necessity, I can easily see why Kay would lean into semiotics to iron out how humans should best interact with machines.
It’s possible all interfaces will eventually be seen as relics of a low bandwidth input era. We should understand the semiotics (and physics) of everything well before we hand off full autonomy to technology.
He doesn't directly reference semiotics, it's just the line of argument that adds an interpreter to the equation. This implies that data is just a signifier, which can then be resolved to a signified by the help of an interpreter, hence you also need to send an interpreter alongside it.
In what form the interpreter is sent, though, remains an open question (because if the answer is "data", wouldn't that mean a recursion in the argument?).
Anything less than being a convincing prophet or an exhaustive orator won't suffice. There is likely no definitive answer to anything—only varying degrees of certainty, based on conceptual frameworks that are ultimately rooted in philosophy.
Doesn't the frame and qualification problem discredit the latter?
FWIW, most physics professionals I know who aren't just popular personalities are not in the scientific realism camp.
They realize that all models are wrong and that some are useful.
I do think that the limits of induction and deduction are often ignored in the CS world, and abduction, as it is only practical in local cases, is also ignored.
But the quants have always been pseudoscientific.
We are restricted to induction, deduction, and Laplacian determination not because they are ideals, but because they make problems practical with computers.
There are lots of problems that we can find solutions for, many more that we can approximate, but we are still producing models.
More and more data is an attempt to get around the frame and qualification problems.
Same problem that John McCarthy is trying to get around in this 1986 paper.
If you look at the poster's profile: https://news.ycombinator.com/user?id=wdanilo, you'll see they're the founder of Enso. Seems like they pivoted at some point. (I'm a fan of this move personally, as I loathe the usage of singular common words to name a product)
> Luna is now Enso. Following a couple of years of going by Luna, we were facing issues that were making it difficult for us, and people looking for Luna. Luna is a popular term, and in programming-language land, is also very close to the popular language Lua, an endless source of confusion.
I really didn't like Kay's approach to the discussion. I don't want hints and "you fill in the blank". Tell me what you think; don't "vaguepost".
I get it that he wants me to do the thinking for myself and figure it out on my own. But he's Alan Kay, and I'm not. I may never figure out what he's figured out, even with hints. And even if I can, maybe I don't have that kind of time.
A lot of people think of programming as performing operations on “data” which one must know how to correctly interpret. If two different pieces of code don’t agree about the meaning of your data, it will lead to subtle bugs.
The idea of OOP is to eliminate “data” as much as you can, and express your logic in terms of objects interacting with each other through their interfaces. If done properly, you no longer need to deal with data that every piece of code needs to interpret in its own, but with a system that already understands the meaning and semantics of its state.
Of course most people just create objects with data accessors instead.
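A toy Python contrast (the Money class is invented for illustration, not anyone's actual design): the first form is "data with accessors" that every caller must reinterpret; the second owns its semantics behind an interface.

```python
# A bare "data bag": every caller must know that amounts are in cents,
# which currency they are in, and how to combine them safely.
payment = {"amount": 1999, "currency": "USD"}

# An object that owns the interpretation: callers interact through its
# interface and never re-derive the meaning of the raw fields.
class Money:
    def __init__(self, cents: int, currency: str):
        self._cents = cents
        self._currency = currency

    def add(self, other: "Money") -> "Money":
        if self._currency != other._currency:
            raise ValueError("cannot add different currencies")
        return Money(self._cents + other._cents, self._currency)

    def __repr__(self) -> str:
        return f"{self._cents / 100:.2f} {self._currency}"

print(Money(1999, "USD").add(Money(500, "USD")))  # 24.99 USD
```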
I'm going to be honest: it seems like a lot of the "data" obsession that has been all the rage among mid-level managers is really just a modern, dressed-up version of augury, or reading pig entrails to predict the future.
These enterprises spend a large amount of time and effort trying to collect data, often the wrong data, to address a problem they don't understand, and then hope that if they do enough "data science" on it, it will magically tell them what to do. All without any understanding or reasoning behind it, or any real connection to reality; just "the data says X".
This results in ideas like "We did A/B testing and it turns out people stay on the page 38% longer if we use design B", ignoring the fact that the reason that happened was that design B involved the exit button randomly dancing around the page.
That is of course limiting ourselves to situations where people are actually trying to use data to get answers, when much more often it is "I have already made a decision; make the data say it was the right one." Which is a whole other can of worms.
Even the best systems marketed as “AI” nowadays can’t reason, by design.
The whole promise of targeted advertising, as well as all those cyberpunk tropes about all-knowing machines (and corporations and governments running them), is based on the fundamental requirement of the machine being capable of logic and reason, not just generating statistically probable statements. That simply wasn’t a thing when this Big Data meme started, and it still isn’t a thing even today. So the best they can do is play statistics until it indicates the Holy Grail of modern corporate existence - sacred Growth. And, yes, it does work, but without any reason or logic to it, just blindly, like evolution. And the thing about evolution is that it ends up with weird solutions, like our own retinas.
I think too many managers have consumed way too much sci-fi. Which is not a bad thing, but one has to keep understanding that fiction is fiction until all its underlying assumptions are actually satisfied (and it’s the magic of fiction to bring about a possible future by just hand-waving and suspending disbelief).
And because this has grown way too big, there is no stopping it. The idea will sustain itself, with corporations preaching it as hard as they can to survive, since all their valuation lies in the promise of Big Data making Big Money.
This is the obvious result of letting people who have zero training or education in how you "do science", do science. Science is a process with many pitfalls and ways to fail by accident, even if you genuinely wanted to do it right. Why do we expect people with zero prior experience to get it right?
Product people don't want to do real science with their "data" anyway, because then they might not get the answer they want!
I think this is a big misinterpretation of that Alan Kay quote, which was in response to https://news.ycombinator.com/item?id=11941656 - the author of that comment aims to create a programming language with a focus on "data processing".
The entire discussion that this Alan Kay quote is from has always been about the prominence of "data" as a central concept in programming, not about other aspects like privacy or "big data".
>"Big data" is a way that a lot of people are trying to make money today, and it's a favorite of marketing people because it's in the wind. ... But in fact, the interesting future is not about data at all, but about meaning, and Stephen[ Wolfram]'s demos showed you a thought which most people in the computing world haven't had, which is "What if my programming language actually knew something". And, in fact, what if my user interface actually knew something? Not like Siri, which "knows" things, but what if it actually knew about me, and what if it actually knew about the contexts in which I'm trying to do things? That's an example of a leap. That set of ideas is actually old, and it was funded back when a lot of leap ideas were funded, and when the funding went away many of those ideas that weren't realized by about 1980 just haven't been worked on since, and that's something that'd be interesting to talk about.
Not that I necessarily agree with the article's conclusions, but if the thesis is supposed to be that Kay disagrees with how we use big data today as a jumping-off point for reexamination, then this and the reference to Licklider's communicating-with-aliens problem work just fine for me.
I've read the article and the original thread, and I don't see at all how the author is "misinterpreting" Kay.
Maybe there is some confusion in terms of, which of the ideas in the article are the author's and which are Kay's. But the author does appear to understand that Kay's original discussion had a very different context, and does make statements of this nature:
> Kay was likely gesturing to a different reason data might be a bad idea. I’ll address that in a moment.
And overall I'm struggling to see anywhere I think Kay's original meaning is being misinterpreted or misrepresented. Can you point to a passage?
There was a good provocative keynote on a similar but slight different theme from an O'Reilly big data conference a number of years back by Maciej Ceglowski, "Haunted by Data" which I remember as the "Data is nuclear waste" talk: https://idlewords.com/talks/haunted_by_data.htm
Whenever I rewatch one of many of Maciej's talks (including “Haunted by Data”, “The Website Obesity Crisis”, “What Happens Next Will Amaze You”, “Superintelligence”, etc.), he always strikes me a bit as a digitally-relevant, modern reincarnation of Cassandra...
I'm so reminded of Seeing Like a State (James Scott) where the author describes how much of society as we observe it is a function of designing it for data collection. I feel like there's a whole pedagogy on the philosophy and practice of 'data', and I wasn't aware of it.
> I feel like there's a whole pedagogy on the philosophy and practice of 'data', and I wasn't aware of it.
I think this is an important insight (or two.)
I've got 'Seeing Like a State' and Kent's 'Data and Reality' on my to-read list, but I'm wondering what the appropriate bibliography here looks like. Anyone who sees this and has a suggestion, please add it!
A model's need for data is a sort of reciprocal of its inductive bias strength. The more permissive your model is (it can learn anything/fit noise perfectly), the more data you need to tune it to a useful state. Conversely, the more restrictive your model is (e.g. y = ax + b), the less data you need (e.g. two points).
People needed a lot of data to predict the movement of planets (entire books of numeric tables), until laws of gravity were figured out, at which point it was reduced to a couple of parameters. This same principle applies to modern AI too, the more you restrict your inductive bias to the sort of structures and dynamics you expect to capture in the wild, the less volumes of data you need to tune.
So is "data a bad idea"? Only as bad as your world model is good. Perfect model of the world requires zero data, weak model of the world requires lots of data.
I do wonder sometimes how all the data collected about me is actually being used. If anyone is buying it for targeted advertising or to observe trends for advertising purposes, they are wasting their money. I (and my entire family) aggressively avoid advertising. Everything is adblocked, and if it isn't adblocked I avoid it. I even try to avert my gaze from billboards and have bugged my state representatives to be more like Vermont and Maine and just ban the eyesores.
Or, as someone once described it in another post, "a vast rube goldberg machine of privacy violations all working together to deliver the most precisely targeted ads straight into my adblocker"
Also a great tool for the powers that be to engage in parallel construction, should you suddenly need to eliminate some political rival. If you collect enough data, eventually everyone is guilty of something.
No. Advertising is aggregating it. It will also be passed to the governance structure. In the future, AI will go through what's already been collected to profile you and nudge you.
In addition to these forms of commercial discrimination, police and government agencies are free to buy and use data for their purposes, which we should not be fooled into believing is good for society/security on average.
I am channeling Alan Kay and putting on my fantasy/futurist hat. I believe the essence of what Alan is thinking of is “big meaning”, or the interpreter/ambassador who has to relay not only the message but also the cultural context behind the message. He was envisioning the most concise way we can send a message, plus something like Lincos, the lingua cosmica, so that the message’s meaning can be understood.
Suppose in the future, instead of dedicated communication channels, messages are just blasted everywhere similar to how short wave radio messages can be listened to by everyone. The message is not encrypted but is very short. Something like “<128 digit hexadecimal number>: I love you”.
Most people will not know the context of the message. But my robot assistant does, because when the assistant runs the 128 digit key through a mathematical function it reveals this message is part of a text conversation between my wife and me at about a certain date and time.
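One way to flesh that fantasy out, purely as a hedged sketch (the shared secret, the key derivation, and the archive layout are all invented here): derive the 128-hex-digit key from a secret shared between the two assistants and use it to index a private context archive.

```python
import hashlib
import hmac

# Shared only between the two parties (and their assistants).
shared_secret = b"example-secret-known-to-both-assistants"

def context_key(conversation_id: str) -> str:
    # 512-bit HMAC -> 128 hex digits, matching the "<128 digit hexadecimal number>" above.
    return hmac.new(shared_secret, conversation_id.encode(), hashlib.sha512).hexdigest()

# The assistant's private archive, indexed by the derived key.
local_archive = {
    context_key("wife/2024-06-01T21:14"): "text thread from that evening",
}

def interpret(broadcast: str) -> str:
    key, _, message = broadcast.partition(": ")
    context = local_archive.get(key)
    return f"{message!r} in context: {context}" if context else "not for me"

print(interpret(context_key("wife/2024-06-01T21:14") + ": I love you"))
```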
> Rather than chase after this hazy idea of an ambassador, I’ll spend the rest of this post exploring concrete ways to expand our notion of data.
That is a shame actually. I think LLMs are interesting to explore here.
For example, if I had a bunch of data and I wanted to combine it with data in some other form, I might have to hope that some transformation between formats already exists. In some near-term future, I might instead just expect an LLM to inspect the two formats and either do the transformation directly or even intelligently write the code to perform it.
As a totally trivial example - imagine I had my CV/resume details in some database and I wanted to apply to a lot of jobs. Many job boards have their own weird formats for inputting your education, experience, cover letter, etc. It feels reasonable to believe that soon an LLM could take my resume details and intelligently fill out the form. Extending this to any form or any API seems reasonable.
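Roughly what that could look like, as a sketch only - `ask_llm` is a stand-in for whatever model client you have access to, not a real API, and the resume/schema fields are invented:

```python
import json

def ask_llm(prompt: str) -> str:
    # Hypothetical: plug in your actual model client here.
    raise NotImplementedError("replace with a real LLM call")

def fill_job_form(resume: dict, form_schema: dict) -> dict:
    prompt = (
        "Map the following resume onto the given form schema. "
        "Return JSON keyed by the schema's field names.\n"
        f"Resume: {json.dumps(resume)}\n"
        f"Schema: {json.dumps(form_schema)}"
    )
    return json.loads(ask_llm(prompt))

resume = {"education": [{"school": "X University", "degree": "BSc"}]}
schema = {"edu_institution": "string", "edu_level": "string"}
# fill_job_form(resume, schema) -> {"edu_institution": "X University", "edu_level": "BSc"}, ideally
```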
The question is whether the concept of data is essential to how we structure computation.
Computation is a physical process and any model we use to build or describe this process is imposed by us. Whether this model should include the concept of data (and their counterparts "functions") is really the question here. While I don't think the data/function concept is essential to modeling computation, I also have a hard time diverging too far from these ideas because that is all I have seen for decades. I believe Kay is challenging us to explore the space of other concepts that can model computation.
IMHO the article is about "data" as in "personal information", but let's indulge in your generalization.
In logic, you have definitions that are extensional (list of objects of the defined type) or intensional (conditions on the object of the type). Perhaps you can think of first representation as data, and the other representation as model or program.
But it's not trivial to convert between the two representations. From extensional to intensional is machine learning, and the other way you face a constraint satisfaction problem.
If we could somehow do both efficiently enough, then perhaps we could represent everything intensionally, as generative programs, and get rid of "data". But we don't know how to do this or whether it is possible.
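A trivial Python illustration of the two representations, and of why the conversions mentioned above are the hard part:

```python
# Extensional: the definition is literally a listing of the objects.
evens_extensional = [0, 2, 4, 6, 8]

# Intensional: the definition is a condition the objects must satisfy.
def evens_intensional(n: int) -> bool:
    return n % 2 == 0

# Intensional -> extensional is easy here (enumerate and test)...
recovered = [n for n in range(10) if evens_intensional(n)]
assert recovered == evens_extensional

# ...but in general, going from a pile of examples to a compact condition is
# machine learning, and going from a condition to satisfying objects is
# constraint satisfaction; neither is cheap for interesting definitions.
```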
> IMHO the article is about "data" as in "personal information", but let's indulge in your generalization.
That is weird on its own, because I don't see how Kay's quote could be in any way about it. I appreciate the article on its own, but in context of the quote and discussion it's taken from, it feels like it's only related because it uses the word "data".
I know the article is about concepts, but to really untangle the mess we've already created we should also figure out the necessary legal changes to make. Was the EU's cookie popup a good idea? Was it effective? What else should be done? We need to figure this out.
We need more forces acting in opposition to the inevitable conglomeration and misuse of personally identifiable information.
- At the individual level it needs to be clearer precisely when your data is going to be shared and with whom, and every instance of this should require its own permission. Clauses buried in lengthy ToS documents are insufficient; the terms need to be simple and digestible by individuals of all capacities.
- Selling the data of individuals should be outlawed (including digital fingerprints), as well as use of illegally-acquired data, and the line for data anonymity strictly defined.
We should not live in a world where an individual with no criminal record is subject to evaluation based on global public data they gave no consent to distribute.
To me the issue is what people read into data. There are many suffering from numberitis: they trust ANY number or "data", completely ignoring how it was established, its errors and so on. Some even read the summary of an article stating that its conclusion came from a certain dataset and do not even try to see that dataset. It's not a "data" issue per se of course, but it's still an issue coming from the present data-bound way of thinking, where many seem to have lost the ability to reason with their own minds.
Data is great, only if you always think critically about it. The problem is, for some, data has been elevated to the status of a new religion. "If the data says it, it must be true." They refuse to see how biased any data could be, and as such how careful we have to be when extracting knowledge from data.
A bit tangential, but I had this revelation one day that data itself is really quite cumbersome to work with - the more of it you have, the more cumbersome it gets. It is also mostly just a serialization of knowledge, and an inherently very inefficient one, since you often need to process a lot of it to reach a conclusion or find some information you are looking for.
It could in fact be stored much more succinctly and coherently in another representation that is much easier to work with: in models.
Why? Because models, at least models like LLMs, and other agent like ones, allow you to ask your questions directly, and let the model produce a serialized answer (a very small amount of data) on demand, instead of you processing through endless amounts of it, trying to find your answer.
Imagine processing petabytes at a cost of millions to try to determine (often incorrectly) demographics and interests, then completely ignoring the more directly provided feedback of “not relevant for me”.
Some day it will be obvious that geo targeting for stuff like elections was the only effective usage that we ever found, and that was of course pretty unethical. Hopefully in retrospect we’ll say that it was a sordid affair but the ends justified the means in terms of general advancement of computing, which after all we do still need for curing cancer and fixing climate change, but only time will tell.
Rather than dwelling on why data is a bad idea, we should focus on how to make it work for us. Let's think about how we can create systems that prioritize individual autonomy and consent over data aggregation. In the end, it's not about data itself being bad, but how we choose to use it that matters.
This article feels like an academic expressive-dance performance of a pseudo-intellectual discussion about privacy. It mentions several concepts without explaining them and doesn't advocate for any solution.
The article posits that data is bad and therefore we must vastly multiply it. Like I can no longer have a uint64, I need a uint64 with DRM, a certificate from Brussels that says I'm allowed to have a uint64, etc.
> Orchestrated by Dutch resistance members (...) their goal was to inhibit the Nazi’s ability to track and deport Jews and other targets of terror. The operation managed to destroy over 15% of the records. Many of the participants were later captured and executed by the Nazis.
There's a story (whose veracity I have not verified) that some of the stranger Dutch surnames exist because when some older occupying power (the Spanish? Napoleon?) came in and immediately took a census, farmers gave joke names, little thinking their descendants would still be known, and catalogued, by them.
It doesn't answer your question, but it relates another instance in which the Dutch (at least supposedly) actively illegibilised themselves under the shadow of external data-gathering.
"It continues to be sucked into the private warehouses of powerful organizations, further entrenching their power. This centrifugal force is political in nature."
Shouldn't that be "centripetal force"? [Edit] I guess it depends on whether he is talking from the view of the database or the user.
> Most people find these popups annoying. We don’t want to negotiate every time we encounter a new website. We’re used to social structures where consent is provided implicitly. A look of the eye and unspoken social contracts are the norm. But data is too brittle to capture this kind of nuance.
Hmm no, I disagree hugely.
You don't need approval to do basic things. You need it because opening a single news article triggers a demand to share something about me with over 1500 different companies. Not an exaggeration. That requires consent, and rightly so, because it's wildly outside of normal social contracts.
I disagree. These things are so annoying most people just use a browser plugin that removes them entirely. I don't need a warning label on every website, and I don't trust that pushing a button on a site does anything except send more data.
I don't think we are disagreeing here. They need that consent, and rightly so, in order to do those things with your data. Removing the popups is just not consenting.
You implicitly consent for some basic things - if you order a widget then you don't need to sign a disclaimer saying they can use your address for posting it to you.
You need explicit consent to go outside of that, just like regular social interactions. I don't ask permission to remember your name and address if you've asked me to pick you up, I should ask permission before signing you up for a mailing list using that info.
The popups are because they're trying to step hugely outside of normal interactions.
> I don't trust pushing a button on a site does anything except send more data
Most implementations either load the tracking scripts only after you click the button or hold back certain actions like cookies and network requests until you consent. Enforcement isn't as strict as it could be, but it's good enough that it mostly works.
If they use the consent modal to gather data about you when you try to "opt out", they're already deliberately violating the law more than they would by simply not having that banner.
We're not talking about something that will take your foot off or something dangerous with safeties. But if you want gruesome analogies this is more like hiding razorblades underneath a safety seal warning about the package having sharp edges.
The problem with DNT was that there was no established legal basis governing its meaning and some browsers just sent it by default so corporations started arguing it's meaningless because there's no way to tell if it indicates a genuine request or is merely an artefact of the user's browser choice (which may be meaningless as well if they didn't get to choose their browser).
As the English version of that page says, it's been superseded by GPC, which has more widespread industry support and is trying to get legal adoption, though I'm seeing conflicting statements about whether it has any legal meaning at the moment, especially outside the US - the described effects in the EU seem redundant given what the GDPR and ePrivacy directive establish as the default behavior: https://privacycg.github.io/gpc-spec/explainer
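For what it's worth, both signals are just request headers, so honoring them server-side is trivial; a minimal sketch (header names per the DNT and GPC proposals, everything else invented, and this says nothing about their legal weight in any given jurisdiction):

```python
# Minimal sketch of a server honoring the two header-based opt-out signals.
def tracking_allowed(headers: dict[str, str]) -> bool:
    if headers.get("Sec-GPC") == "1":  # Global Privacy Control opt-out
        return False
    if headers.get("DNT") == "1":      # legacy Do Not Track signal
        return False
    return True                        # still subject to consent requirements

print(tracking_allowed({"Sec-GPC": "1"}))  # False
```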
It's not a warning label, it's a request for consent which means they are legally required to ask you for permission and allow you to refuse them. Most implementations are actually violating the law by making it more difficult to refuse than accept or at least not giving the refuse option equal visual weight when they're not outright hiding it behind a bunch of extra steps.
This is different from the old "cookie banners" that were just informing you and leaving you no option but to dismiss the "warning". The GDPR and ePrivacy directive require companies to justify their use of your data and the only justification mechanism applicable for most of the data they want to collect is consent which must by definition be voluntary - the limitations of which are defined fairly explicitly in those laws.
Some sites try to work around this because they need ads for monetization by offering a paid subscription or requiring you to accept (behavioral) ads - but they've also been dinged for trying to bundle all the stuff not related to showing you ads with the "accept ads" option (or not letting you buy a subscription without first having to agree to share all your data like before).
I'm always surprised how many people in technical spaces like HN seem to misunderstand the legal situation and why these "warnings" look the way they do, and blame the laws rather than the companies desperately trying to trick users into giving up their data in ways that barely pass as an attempt to comply with the laws they're spending so much energy on deliberately violating. But it shouldn't be surprising - these companies put a lot of energy into making the process unpleasant for users (often in ways that blatantly violate the laws) while framing themselves as the victims.
Yes, the social norm is that if I have a friend's phone number and I am going to share it with someone, I ask that friend first whether he wants his phone number shared with the person asking.
Crooked companies overstepped social norms because they could get away with it - which is the clear definition of an asshole: someone doing something shitty just because he knows he can get away with it.