What exactly is data?

Published in BGL Tech · 10 min read · Dec 21, 2020

Data Scientist Robert Frecknall looks past the trendy buzzwords of 2020 (big data anyone?) and takes us back to basics:

As you can tell from the number of posts on our Medium page, BGL Tech has put a lot of effort into blogging in 2020 and it’s been great to see so many of my colleagues contributing.

I’ve been more than happy to take part in this endeavour because it gives me a good excuse to write something which doesn’t begin with SELECT, library(RODBC) or import pandas as pd. My first entry on this site was a practical run-through of my work on fraud detection and then, more recently, things got a little more abstract when I pondered what it meant to be a data scientist.

This time around I am going to continue in a philosophical direction to give you a holiday gift that will, hopefully, be more useful in 2021 than a hideous Christmas jumper! Today we are going to be tearing the wrapping paper off a very fundamental but often overlooked question in the tech world. Namely, what exactly is data?

Should I return this gift for store credit?

You might think this is a decidedly dull gift to receive. Why should I waste the festive season with hypotheticals like this? Data science is supposed to be about sexy stuff like eXtreme Gradient Boosting and Hadoop. It’s a toy like Hadoop that every business wants to find under its Christmas tree in 2020 even if it has no idea how it’s going to play with it and, secretly, it fears that the whole thing might be relegated to the cupboard under the stairs before December is out.

My impetus to write on this topic stems from questions I’ve seen posted in non-technical and semi-technical parts of the internet. Across the web I observe a haphazard tendency to link the world of data science with every other fashionable concept in business and tech. Put simply, this is cringe-inducing for someone who has day-to-day involvement with technologies like gradient boosting and knows how they can truly have a business impact. I wonder if the folks throwing around the data buzzwords really understand the context of the things they write about.

It is this quest for context which led me to today’s topic. I firmly believe you can understand the contemporary world of data analytics better if you grasp the kind of data involved, where the data comes from and what you can do with it. You’d never attempt to make a Christmas cake without knowledge of the ingredients involved and yet, somehow, people who have never viewed a SQL table are discussing data concepts that are equivalent to an entire Christmas dinner!

01010010 01101111 01110110 01100101 01110010

To begin my discussion of data it’s best to consider the data that I work with the most, the data on the millions of car insurance policies that we sell or renew during a typical year of trading. BGL’s car data is structured into tables that represent the processes within our business, eg keeping records of payments, customer contact details and the type of car each customer owns. Although these tables are very large, they’re roughly the thing that most people will think of if they’re presented with the term ‘data’. And, naturally, there’s absolutely nothing wrong with us doing things this way. Our core systems make a lot of sense because they were designed to make a lot of sense! Since our tables have a consistent structure I can ask obscure questions like ‘What proportion of policies that we sold in 1992 were for Rover cars?’ and get an answer with relative ease. It doesn’t matter that there’s so much counting to be done, or that some of my BGL colleagues weren’t even alive in 1992; structured data gives us the scope to explore these questions.
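To make that concrete, here’s a minimal pandas sketch of the Rover question. The table, the column names (sale_year, manufacturer) and the values are entirely illustrative, not BGL’s actual schema:

```python
import pandas as pd

# Hypothetical policies table; column names and values are
# illustrative, not BGL's actual schema.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3, 4, 5],
    "sale_year": [1992, 1992, 1992, 1993, 1992],
    "manufacturer": ["Rover", "Ford", "Rover", "Rover", "Vauxhall"],
})

# Because the structure is consistent, the 'obscure' question is one line.
sold_1992 = policies[policies["sale_year"] == 1992]
rover_share = (sold_1992["manufacturer"] == "Rover").mean()
print(f"Proportion of 1992 policies for Rovers: {rover_share:.0%}")  # 50%
```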

If we probe even deeper into this data we find that, just like anything stored on a computer, there’s even more structure to be found. Computers store everything in binary as 0s and 1s and so, even if a human has an analytical interest in a car being a Rover, to the computer it’s just numbers which would be nonsensical if you looked at them in their raw form. A computer can store pretty much anything as 0s and 1s but only humans give it meaning and structure. Only a human has a business sense of why a 1992 Rover is different to insure than a 1992 Ferrari.
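In fact, the heading of this section is simply ‘Rover’ written in binary. A short Python snippet shows the round trip between the characters we care about and the bits the computer actually stores:

```python
# The section heading above is 'Rover' in binary: each character
# maps to its ASCII code point, written as eight bits.
word = "Rover"
bits = " ".join(f"{ord(ch):08b}" for ch in word)
print(bits)  # 01010010 01101111 01110110 01100101 01110010

# Decoding is a purely human convention; the computer just holds the bits.
decoded = "".join(chr(int(b, 2)) for b in bits.split())
print(decoded)  # Rover
```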

A tale of two cities

I’m not old enough to remember BGL making its first policy sales in 1992, but I can say with absolute confidence that the computing world has changed a great deal since then. When I say this I’m not just talking about the nerdy stuff either, the whole culture of computing devices and connectivity has evolved to the extent that even people who’ve never heard of binary/ASCII conversions are very much dependent on their computers (phones, tablets) to mediate their everyday existence.

To bring this to life, consider two friends who went shopping on a wet day in Peterborough in December 1992. In an effort to dodge the showers they phoned each other’s homes before they left to ensure they could meet somewhere dry, and they arrived at the department store exactly on time to avoid getting lost in the crowds. Their shopping took hours because they had to buy everything on that day; those slow mail-order services wouldn’t get them their gifts in time for Christmas! They spent a lot of time queuing as well, as progress at the cash register was slow because everyone was fumbling with coins and cheques. After a while they retreated to an unknown café for a rest but found the place to be mediocre; they wouldn’t go back there again! They traipsed home at the end of the day, without the energy or enthusiasm to check out the city’s festive lights.

Now consider two different friends who went shopping in a snowy Peterborough in December 2020. To avoid the cold, they used WhatsApp to arrange to meet in a café at the very last minute. Neither of them had been to the café before but, since the reviews on TripAdvisor and Google Maps looked decent, they could be sure of a nice coffee before the shopping began. Of course, the shopping itself wasn’t especially stressful as most items had already been reserved online and the transactions had been handled by PayPal. The impulse purchases were done using Apple Pay, which meant that the process at the cash register was very slick. Once the shopping bags had been filled, the friends posed for a photo with the city’s Christmas lights in the background. The digital image was, of course, geotagged by the phone’s GPS and uploaded to Instagram within a matter of seconds.

Just as the friends left footprints in the snow as they walked around the city in 2020, almost everything they did created data footprints on the systems they were using. Some of their data footprints were explicit (eg taking the time to review the café) and some were much more implicit (eg an app pulling coordinates from their phones’ GPS).

Since our 2020 lives are computerised and the lifeblood of computing is data, we should consider data to be a staggeringly comprehensive representation of our everyday lives.

Seeking structure

You’ve probably identified a tension in what I’ve said so far. When I was counting Rovers I was very much in favour of structure and yet, moving into the 21st century, I signal the dawn of highly unstructured data coming from all sorts of sources. This unstructured stuff is data which does not take the form of the huge tables of cars that we initially discussed.

At the simple end we’ve got transactional data, such as records of who has bought what items via a payment service. It’s more like a log than something designed for analysis. Even more abstract would be something like the café review data. Although written reviews make good sense to a human, trying to get a computer to interpret text (as opposed to numbers) is hugely complex.
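To illustrate the contrast, here’s a small sketch with made-up data: a transaction log that tabulates neatly, next to a review that the computer sees only as a string of characters:

```python
import pandas as pd

# A transaction log tabulates neatly, even though it was written for
# record-keeping rather than analysis. All values are invented.
transactions = pd.DataFrame({
    "timestamp": ["2020-12-19 10:02", "2020-12-19 10:05", "2020-12-19 10:09"],
    "customer_id": [101, 102, 101],
    "amount_gbp": [24.99, 7.50, 12.00],
})
print(transactions.groupby("customer_id")["amount_gbp"].sum())

# A review, by contrast, is just a string: the computer sees characters,
# not the opinion they express.
review = "Lovely coffee, but the queue at the till was painfully slow."
print(len(review), "characters, no built-in meaning")
```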

When working with data like this considerable abstractions must be made because traditional analytical methods will struggle with the scale and detail. Traditional statistical methods originate in experimental disciplines like biology and psychology where collecting data was an expensive and bespoke process. This is why languages like R refer to rows within data tables as observations. It harks back to an experimental biologist making a specific observation within a specifically designed experiment. We can’t really call the geotag of an Instagram post an observation.

Thankfully modern methods in data science allow us to crunch data without much obvious structure and at enormous scale. For example, we use Latent Dirichlet allocation (LDA) to rapidly identify themes in written customer feedback. Working with text is very flexible because it allows customers to mention things we didn’t even know they were interested in, whereas with a traditional experimental design like a survey, a customer will only answer the questions which we perceive to be important.
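As a sketch of the idea (using scikit-learn’s LDA implementation and invented feedback snippets, not our production pipeline), a handful of lines is enough to pull latent themes out of free text:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative feedback snippets, not real BGL data.
feedback = [
    "renewal price went up even though I had no claims",
    "the claims process after my accident was quick and painless",
    "renewal quote was far higher than the price for new customers",
    "great service when I reported the accident, claim settled fast",
]

# Turn free text into word counts, then ask LDA for two latent themes.
vectoriser = CountVectorizer(stop_words="english")
counts = vectoriser.fit_transform(feedback)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Show the most heavily weighted words in each theme.
vocab = vectoriser.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[-4:]]
    print(f"theme {i}: {top_words}")
```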

2020’s data science also has methods to summarise data without any specific target. Supervised methods in data science let us predict something of interest (eg the risk of a customer committing fraud) but unsupervised methods allow us to summarise data without a particular response in mind. An example we use at BGL is clustering, which can be thought of as a compression process that allows us to look for instances of data which are like other instances: for example, can we find a group of customers who are similar to each other?
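Here’s a minimal sketch of that idea using k-means from scikit-learn on two invented features (age and annual premium); the feature names and values are assumptions for illustration, not our actual clustering setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: age and annual premium in pounds.
customers = np.array([
    [22, 950], [24, 1020], [23, 980],   # younger, pricier policies
    [55, 310], [60, 280], [58, 295],    # older, cheaper policies
])

# Scale the features so age and premium contribute comparably, then
# look for two groups with no target variable in sight.
scaled = StandardScaler().fit_transform(customers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)  # eg [0 0 0 1 1 1]: two groups of similar customers
```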

Here be dragons

The challenge of unsupervised data science or abstract methods like text mining is interpretation within the business. Unlike supervised methods they do not yield a simple result, for example ‘we should price this policy at £400’ or ‘this person has a cancellation risk of 2.1%’. The outputs of these analyses are simpler than the raw data by definition, but they can still take a lot of interpretation to really grasp.

These kinds of interpretation exercises are rarely discussed by data science evangelists but they’re certainly important. Making simple graphs or tools in Excel and putting them into PowerPoint may not be glamorous, but they help those in less technical roles understand the purpose and value of what data science delivers.

In a meeting last Monday, I was doing exactly this kind of translation exercise regarding a clustering of car insurance customers. During the session, I encouraged the participants to think of the project like drawing a treasure map. The output of the clustering should be a schematic of the customer base which tells the company where they should navigate for optimal results. Although it will ultimately be the data scientist who draws the map, they will need guidance from the business on what to put in the legend. The clustering should show where the gold is buried (ie lucrative segments of the market) but it’s equally important to plot the dragons, sea monsters and suchlike because these represent areas you’d rather avoid (potential losses). There must be a give and take between the cartographer and the navigator.
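A simple way to start drawing that legend is to profile each cluster in business terms. The sketch below uses hypothetical cluster labels and features; the point is the per-cluster summary, not the numbers:

```python
import pandas as pd

# Hypothetical output of a customer clustering: the raw labels mean
# nothing on their own, so we profile each cluster in business terms.
customers = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "avg_premium_gbp": [950, 1020, 980, 310, 280, 295],
    "claims_per_year": [0.40, 0.35, 0.45, 0.05, 0.10, 0.08],
})

# The per-cluster summary is the 'legend' on the treasure map: it lets
# the business label each segment as gold, dragon, or somewhere between.
print(customers.groupby("cluster").agg(["mean", "count"]))
```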

Wrapping up (Christmas pun absolutely intended)

Although we’ve taken some festive, nautical and historical detours, the essence of this article has been about the contemporary definition of data. I have been arguing that although analytics is rightfully fashionable right now, people usually overlook the inherent linkage between the analytics being done and the rich range of data that is being analysed.

The buzzword is all about the bigness of data, but I hope this article has made it clear that size isn’t the whole issue here. We also must consider the speed with which data is generated, the diversity of its structure and how we can make it useful for our businesses. These things mean that the nature of modern data is an ever-evolving challenge, one that will never be entirely overcome. However, for every development in the creation and nature of data there are matched developments in analytical and storage methods.

BGL’s priority is not to get tied up in the buzzwords. The data team is here to deliver value from the data, whatever that data may be. Given my scepticism regarding buzzwords you might want to put me on Santa’s naughty list for mentioning things like gradient boosting and LDA, however I feel comfortable discussing these things because we use them to deliver benefit to the business and, crucially, to our customers.

Having a healthy appreciation of data is a wonderful way for any organisation to understand how its customers are behaving, to learn what they think and to predict how they are likely to act in the future. Almost everything written on this Medium portal refers to a project or concept which has data at its foundation. Although data science itself is the activity that is heralded with all manner of nerdy glamour, I implore everyone building a business in 2021 to also be mindful of the data they are creating and how it represents the real nuts and bolts of their organisation.

BGL Tech

The tech team behind BGL Group’s Insurance, Distribution and Outsourcing Division and Group functions such as Information Security and IT Operations.