A canonical definition of data?
Why dictionary definitions of data miss the point, and what the Latin roots reveal about the concept's true essence.
Good Monday Morning!
Seriously, how the hell do we define “data”?
If you're in an early timezone and this affects you - apologies for getting this Monday's chronicle out a few hours late. This was a tricky one to get right, and I had to rewrite it from my original idea. 🦦
I got into a debate this week around defining data. Even though it’s been about a year now since I started this channel, I realized that I did not really know of a good definition of the term.
Data is one of those words that I have used my entire adult life and more (and you too, I suppose), yet I don't really know what it is. You might get more out of this chronicle if you put your reader down now, write down what you think is a good definition, then check a dictionary or two and ask a couple of different LLMs. Anyway, I did, and the definitions are surprisingly inconsistent.
I was initially going to list the different definitions here, find commonalities, and compare them to terms like Experience, Observation, Information, Fact, Truth, and Explanation, but it all turned rather messy, to be honest. After going down that route, it felt like I was just planting more trees to hide the forest.
Can we use the dictionary definition?
I will quote one definition though, the one that set me off on that path:
“the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.” (Oxford Languages)
It dawned on me that this is a great dictionary definition, but at the same time a horrible description of the essence of the concept.
The above definition is almost a caricature of a colloquial and parochial (see wikipedia/parochial) definition. It is very superficial as a description, and I find it telling that it completely disregards that “computer” used to be a common human job title, in the era before Alan Turing and Enigma. This definition fails to capture anything about the essence of the concept.
I realized that I have, throughout my life, used dictionaries assuming that they (or at least strive to) describe concepts correctly. However, when you think about it, that is naturally not what they optimize for!
Dictionaries don't strive to describe concepts, dictionaries strive to describe meaning, and meanings are pointers to concepts, and these meanings are often broadened or altered in language over time.
For example, the word cool was (one speculates) initially only for describing the temperature of the weather, but over time evolved to also refer to the “je ne sais quoi” of Fonzie (wikipedia/Fonzie).
I.e. dictionaries (as they should) strive to be correct in the sense of
“How is the word most broadly used”
... as opposed to...
“What is the most useful and congruent definition of the essential concept?”.
One Datum, Many Data
As I bounced around LLMs, I realized that they also have the same problem. They end up on popular branches of definitions, and unless you add in a certain thinker (e.g., Hume, Popper) to narrow things for an LLM, it just goes for broad meaning.
Either way, Oxford's dictionary offered an interesting clue with the origins of the word: mid 17th century (as a term in philosophy): from Latin, plural of “datum.”
The word datum means
“something given” or
“a thing granted” in Latin.
Its plural, data, literally means
“things given” or
“things granted”.
It’s quite fun that I didn’t know the singular of data until today. 🤷♂️
Data is what we take for granted
In philosophy, a “given” is something assumed to be true without requiring proof in the immediate context. For example, when constructing a logical argument, certain premises (data) are often “granted” as the basis for further reasoning.
This aligns with the idea that data is accepted without question or further justification, at least. Early philosophers viewed data as foundational “givens” to be worked with, analyzed, or interpreted.
Over time, data evolved to mean recorded observations or measurements—the “givens” collected through experience, experiments, or computation.
Today, data is often treated as raw material for analysis, but I think its Latin roots in “things given” remind us that it originates from something granted or observed, not inherently interpreted.
What subject do you trust to measure reality objectively?
There are quite a few modern definitions (that also float around in LLMs) that suggest that data is “neutral”. I think this is a problematic and even downright incorrect way of thinking about data.
Because we have lots of automatic data-gathering devices these days (smart watches, thermometers, Google Analytics, etc.), it is easy to build a collective delusion that data can be gathered completely objectively.
However, perfectly neutral data is an impossibility, as data must be generated by observing an experience, thus making it inherently subjective.
I should make clear here that it is perfectly possible for data to be more or less subjective, it just cannot be objective.
We can also compensate for subjectiveness to some degree, by gathering data from many subjects and creating aggregates, or by doing meta-analysis of studies, but that doesn’t make the data gathered neutral or objective - such efforts are working around the fact that data is subjective.
I think the evolution of the word data to also include an expectation of recording is very important and useful.
It would be (at least colloquially) acceptable to argue that a non-recorded observation is data, but I don’t see the usefulness of a definition that broad, as it is very close to just being a “memory” or a “story”.
The magic moment of recording
I think what makes data magical and distinct from memories and stories is really what happens at the recording step - and here I am talking about record more in the sense of a “database record” rather than “a record of history” or “voice recording”. To be more specific, I think that what makes data truly distinct is when we observe reality and record it according to a schema.
As I’ve talked about previously, the word plot comes from the Old English plot, which originally meant small piece of land, but evolved to mean “plan or scheme”.
And scheme, in turn: “a supposed or apparent overall system, within which everything has a place and in relation to which individual details are ultimately to be assessed”
Data is, ironically, itself not a discrete/digital concept. It is quite a gray area. A story would be dubious to argue to be a piece of data, but a structured Captain's Log… maybe a bit more so, and a mood log would be “more data” than a plain-text journal. I suspect that scheme and schema play a large role here: how strictly the recording of observations adheres to the schema affects how much “data” it is, compared to being a story or a memory, and is an important quality vector of data.
The argument that I got into was with a person that argued that lived experience can be considered data. I said that I think that equating experiences to data would be like equating a cow to beef stock, or the cryptographic hash of a disk image to the disk image itself. Experiences are not data, because to transform experiences into data, we need to do observational work.
Data is, to me, an aggressive reduction/simplification of an experience according to a schema. It intentionally sacrifices away the overwhelming majority of the experience to record specific aspects of the experience, which now gives us new powers in that we can compare datum to datum.
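To make that “reduction according to a schema” idea concrete, here is a toy sketch in Python. Every field name and value here is invented for illustration, not anything from a real system:

```python
# An "experience" is rich and messy; the real thing contains vastly
# more than we could ever enumerate.
experience = {
    "narrative": "Woke up groggy, skipped breakfast, took a long walk, "
                 "argued about the definition of data, felt energized after.",
    "weather": "overcast",
    # ...and endlessly more
}

# A schema names the few aspects we choose to keep.
MOOD_SCHEMA = ("date", "mood_1_to_5", "hours_slept")

def record(date, mood_1_to_5, hours_slept):
    """Aggressively reduce an experience to a datum: keep only the
    fields the schema names, and discard everything else."""
    assert 1 <= mood_1_to_5 <= 5, "mood must adhere to the schema"
    return dict(zip(MOOD_SCHEMA, (date, mood_1_to_5, hours_slept)))

monday = record("2025-01-06", 4, 6.5)
tuesday = record("2025-01-07", 2, 5.0)

# The sacrifice buys us a new power: we can compare datum to datum.
print(monday["mood_1_to_5"] > tuesday["mood_1_to_5"])  # True
```

Almost everything about those two days is gone, but the little that remains is comparable.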
Does that compute?
In addition, the definition of data has also grown to include that the recordings are qualitative or quantitative. Think about a mood tracker vs a mood journal vs general daily journal. Do you agree that one feels “more data”?
I would personally say that what we are actually gunning for isn’t inherently that records are qualitative or quantitative, but that we seek for the records to be computable.
This has gotten a bit muddled in the last few years, because computing power has gotten to a point where we can treat any recording, as long as it is done in a language, as data. In a data set for an LLM, English or Spanish etc. becomes a very broad schema that we can use for computation.
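A small sketch of that “computable” point (the entries are made up): records from a strict schema can be computed on directly and cheaply, while free-text records need something like an LLM, treating the language itself as the schema, before any computation can happen.

```python
# Records from a strict schema: one mood score per day, 1-5.
mood_tracker = [3, 4, 2, 5, 4]

# Records from a very forgiving schema: free-text journal entries.
mood_journal = [
    "Felt pretty okay today, I guess?",
    "Great day - long walk, good coffee, got lots done!",
]

# With the strict schema, computation is trivial and cheap:
average_mood = sum(mood_tracker) / len(mood_tracker)
print(average_mood)  # 3.6

# With the journal, there is no cheap equivalent: extracting a mood
# score from prose means paying a language model (or a human) to read it.
```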
Forgiving Schemas: The curling parent of data
The problem with a forgiving schema is that it becomes ludicrously energy-inefficient.
LLMs can be extremely impressive, but anyone that has tried to make something cost-effective with an LLM becomes painfully aware that there are enormous efficiency sacrifices one accepts when operating a model on a dataset whose recording schema is very forgiving.
Running a Python script that computes 1234 * 5678 is easily thousands of times more energy-efficient than asking an LLM to do the same. (How Much Energy Do LLMs Consume? Unveiling the Power Behind AI)
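For scale, the deterministic version of that computation is a one-liner that runs in microseconds on any machine (the exact energy ratio depends on model and hardware, so treat the “thousands of times” figure as an order-of-magnitude claim):

```python
# The same arithmetic an LLM would burn a full inference pass on:
print(1234 * 5678)  # 7006652
```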
I am the last person to take financial advice from, but if you want a sure way of making money on AI investments, put your money into renewable energy, because LLMs are an inherently inefficient way of computing.
OK ADD ALL THE LOGS SO THAT WE HAVE THE BIG DATA IN LAKEHOUSECLOUD
In addition, when we record data according to a schema, that schema also often has quality criteria connected to it.
We can remove this and “just ingest books and also anything that looks vaguely like a book”, which adds effectiveness limitations on top of the efficiency limitations: (Google’s AI Recommended Adding Glue To Pizza And Other Misinformation—What Caused The Viral Blunders?)
This kind of willy-nilly vacuuming of all available data reminds me of how logging was done at Spotify early in our data journey. Back then, the idea was to LOG EVERYTHING on every single feature you rolled out, and then we would learn stuff from that data.
It’s quite an absurd thing to do a GDPR request for your Spotify data - you’ll get mousemove events since the dawn of time, it is an absolutely staggering amount of data.
I shamefully admit that I did not see the problems with this and thought it was a good idea at the time, but over the years I (and the organization) learned painfully clearly that…
A) logging “everything” is a pipe dream - reality is complex, and even though we added events on every single thing, leading to gigabytes of logs per user over the years, in hindsight you almost always found yourself missing an important event when doing analytics.
B) the reason we needed to log everything was that we were not honestly interested in falsifying the efficacy of our features. We looked to validate the features we rolled out, not to disprove them (because nobody has ever gotten promoted for removing a feature they built because it was ineffective).
I understand now that what we were doing was justificationism, and as data “scientists” we were completely ignorant of the work of Karl Popper on critical rationalism (wikipedia/popper). Spotify was, at the time, actually hundreds of years behind the gold standard of doing science - we were performing it at the level that was the norm when chopping off heads with the guillotine was in vogue.
Data is not information
Another interesting thing I learned recently is that Data is not information.
We’ve established above that data, while it cannot be objective, does purport to be collected within the bounds of its stated schema, and intentionally tries to be void of presupposed conclusions.
This is because the purpose of data is to serve as the material for data processing (typically an analysis), which in turn outputs information. Put another way, information is data that has been processed and contextualized - data that acquires meaning through structured interpretation, revealing patterns, relationships, or potential significance within a specific conceptual framework.
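As a toy illustration of that distinction (the numbers are invented): a list of daily signup counts is data; it only becomes information once it is processed against a question in context, such as “did anything unusual happen this week?”

```python
# Raw data: daily signup counts for one week (invented numbers).
daily_signups = [12, 15, 11, 14, 40, 13, 12]

# Processing + context turn it into information: flag days that
# are far above the weekly mean.
mean = sum(daily_signups) / len(daily_signups)
outliers = [(day, n) for day, n in enumerate(daily_signups) if n > 2 * mean]
print(outliers)  # [(4, 40)]
```

The list itself says nothing; “day 4 was a wild outlier, go find out why” is information.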
Analysts turn data into information
It is often quite subtle what an analyst does in a company. As a decision maker, you constantly feel starved of information, and having to go through analytics often feels painfully slow and limited.
This has (in my maybe somewhat controversial opinion) unfortunately led to products like Snowflake gaining popularity, to a large extent because they offer decision makers easy querying of data without having to go through an analyst.
While that access isn't in itself a bad thing, I think it will lead to bad results if the decision maker conflates data with information.
In reality, there is lots of work left to do on data in order to turn it into meaningful actionable information that isn’t misleading. And then further work to make it widely digestible (good visualizations, summaries etc).
Your crazy hoarder uncle that would do well in Big Data
To make matters worse, data in organizations has often been collected with the Bigger Data = Gooder Data fallacy that I described above at Spotify, where the data is not all that more structured than the “enormous pile of internet” that we shove into gargantuan LLMs.
Often this is vulgar amounts of “tracking” data of users, vast amounts of clicks and likes and whatnot. It is technically data, but only very barely so.
The data in question has often been hoarded with the assumption that some magical data thingamabob down the line will be able to make sense of it. Hoarding data in a company is not really better than your crazy hermit uncle's hoarding tendencies.
Big Data is expensive to store and index, and legally risky if it is personally identifiable information - which the majority of data is, unless you take (yet again expensive) measures to anonymize it.
The hoarder postpones the decision of what is important to observe. Observing and discerning are among the hard things in life. The mammal is inherently lazy (energy conservation is good!) and wants to avoid hard things.
A broad schema of storing everything - be it hoarding clothes or microwaves or old DVDs, or hoarding petabytes of arbitrary track events - pushes the work of keen observation onto “someone” else in the future.
Unfortunately, since we have recorded our “observation” with a very broad, speculative, and hopeful schema, that future “someone” is unlikely to be able to use the data we have collected, much like how all those clothes you have “inherited” from the “estate” were not picked with your size in mind, let alone your taste.
Video that inspired me most last week: Hank Green on the history and mechanics of populism
I’ve been creating for the internet longer than most, and started following (and still follow today) Hank and John Green, who are just some of the best people period.
There are few people who inspire pure hope for humanity in me like these two. To make sense of the current state of things with a long, looong perspective, I really must recommend this one; it made me happy just because they are so good:
Onwards to the week!
I hope all of that made you much more confused than you were at the beginning of this chronicle, it certainly did for me!
😉 Oh yeah and in case you scrolled down to the bottom for an easy answer, the Wikipedia definition of data is not half bad: “Data” - Wikipedia 🤭
🧬 On the note of how there is no neutral observation, it seems like there is no neutral presentation either: Nature: Adding a personal backstory could boost your scientific credibility with the public
As always, stay curious 🧐🐒
Mattias Petter Johansson