Big data is cool. Being able to use it right and make it all actually make sense is even cooler. The problem is, there are a lot of pitfalls to working with large data sets and each minor misstep can make the next step in the process slightly off or even wreck the whole thing. People lose sight of the art in the science.
I’ve worked with data that took clusters with dozens of units and weeks to get a tiny step in the right direction. I’ve delved into analytics for GIS, oil and gas, and managed agents (among many other side projects). Turning data into something useful has been my forte and has afforded me many jobs, but the problem is that data is as fragile as it is exciting.
Let’s dive into the 5 things almost everyone I’ve met throughout my career seems to get wrong at some level about big data and analytics. We aren’t going to delve too much into algorithms and science, but more the philosophies and methodologies that can poison the whole well before the math even matters.
Trash In, Trash Out
Bad data gets you bad results. It’s obvious when put that way, but many people cling to bad data hoping that it gets “rounded out” in their data processing or if they can “finesse” the algorithm just right. I’ve seen people cling to fundamentally flawed data trying to make sense of it when there really isn’t a sane way to do so.
When I worked in the energy sector, we would rather throw out a data set if even a few points were fundamentally wrong. The problem with a little doubt in the data is: if these pieces are incorrect, what else that can’t be easily checked is wrong? All of the data in a set builds off of certain fundamental pieces of information, so each minor skew compounds as it carries through the rest of the process.
These points poisoned the next steps, which made certain high-profit areas look questionable and certain low-profit areas look unusually promising. Sometimes, you can salvage bits of the set, but once it feels compromised, it can be cheaper and less frustrating to just dump the trash and move on.
Small outliers in our data set, just a few points in a map with literally millions of features and points, could skew specific algorithms into giving us horribly wrong results in those areas. The outliers got rounded off at scale, but that doesn’t help when you’re looking for needles in a massive haystack.
When you have bad data, you pollute the results. When you use these results, you then further pollute the process. Bad data leads to bad results, which lead to bad choices, which lead to bad actions, which lead to a bad process.
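To make that concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (a map of one million points, one cell of 100 points, three corrupted readings) and the plain average stands in for whatever algorithm is actually being run; none of it comes from a real data set.

```python
# All values below are hypothetical and chosen only to illustrate the effect.
GOOD_VALUE = 10.0        # a "correct" reading
BAD_VALUE = 10_000.0     # a corrupted reading
TOTAL_POINTS = 1_000_000
CELL_SIZE = 100          # points in the one map cell we care about
BAD_POINTS = 3           # corrupted readings, all inside that cell

# The cell of interest holds the three bad points plus 97 good ones.
cell = [BAD_VALUE] * BAD_POINTS + [GOOD_VALUE] * (CELL_SIZE - BAD_POINTS)
rest_of_map = [GOOD_VALUE] * (TOTAL_POINTS - CELL_SIZE)


def average(values):
    """Plain arithmetic mean, standing in for a real algorithm."""
    return sum(values) / len(values)


global_mean = average(cell + rest_of_map)
cell_mean = average(cell)

print(f"whole map: {global_mean:.2f}")
print(f"one cell:  {cell_mean:.2f}")
```

Across the whole map the three bad points barely register (about 10.03 against a true value of 10), but inside the one cell that contains them the average jumps to about 309.7. The same three points are harmless or ruinous depending on how closely you zoom in.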
Data Bias
We talked about how bad data gets bad results, but even a small error can skew trends and complicate the process. If you have 4 100% points and a mistaken 90% point, you get 98% for the mean, 100% for the median, 100% for the mode, and a range of 10 percentage points stretching from 90% to 100%. While the math is simple, the problem is in determining just how the data is wrong.
Knowing that we have 4 100% points and a single, mistaken 90% point makes it obvious to consider that the 90% point probably should have been a 100% point or similar, but what if it’s lower? Without knowing what is wrong, we then have a bias in analyzing an arbitrary set.
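The arithmetic for that five-point example is easy to check with Python's standard `statistics` module. The `mean_if_corrected` helper is a hypothetical name added here just to show how much the answer depends on what the mistaken point should have been.

```python
from statistics import mean, median, mode

# Four correct readings and one mistaken reading, in percent.
readings = [100, 100, 100, 100, 90]

summary = {
    "mean": mean(readings),
    "median": median(readings),
    "mode": mode(readings),
    "range": max(readings) - min(readings),  # in percentage points
}
print(summary)


def mean_if_corrected(values, bad_index, corrected_value):
    """Hypothetical helper: the mean after swapping one suspect point."""
    fixed = list(values)
    fixed[bad_index] = corrected_value
    return mean(fixed)


# We do not know the true value, so try a couple of guesses.
for guess in (100, 50):
    print(guess, mean_if_corrected(readings, 4, guess))
```

If the bad point should have been 100%, the mean is 100%. If it should have been 50%, the mean is 90%. The median and mode do not move in either case, which is exactly why an unknown error is hard to spot from summary numbers alone.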
Data bias isn’t just a misinterpretation of the data itself. It can be a misinterpretation of the methodology used to collect data, a misinterpretation of the importance of data, or a misinterpretation of the analysis of said data. You’re far worse off when the process used to collect data is flawed than when you just have a few bad data points.
Data bias on KPIs (Key Performance Indicators) is something you see abused in call centers all the time. If agents are measured on customer satisfaction, they’ll keep the customer happy at the expense of the company, but if they’re measured on closes alone, they’ll anger customers. I’ve watched many help desks, call centers, etc. all game the system, and the tighter the metrics, the easier it was.
The balance is going to depend on strengths and weaknesses of other parts of the company as well as interpretations of other data. Each minor bias creates a cascading effect that then further impacts the whole process unless someone looks away from the Excel report long enough to see the real problems.
Scaling Data
Data is not always easy to scale. You can’t just throw set after set of data into a system and expect magic. Wrong data is obviously always a problem, but different biases in different sets can get you wildly differing results for data that is otherwise related. Sets also come with differing levels of accuracy, which can drag down the accuracy of the whole.
As I munged my way through different sets of data, I found that sometimes, while more data was nice, it just got in the way. Your data is only as accurate as the least accurate sets in the bunch. As you scale the amount of data you throw into the number crunching, you need to keep watch on where you’re rounding off accuracy.
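One way to keep watch is to carry each source's resolution along with it and report merged results at the coarsest one. The sketch below is only an illustration: the `DataSet` type, the `resolution_m` field, and the sample sources are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str
    resolution_m: float  # smallest distance the source reliably resolves, in meters


def combined_resolution(datasets):
    """Return the coarsest (largest) resolution among the merged sets."""
    if not datasets:
        raise ValueError("need at least one data set")
    return max(d.resolution_m for d in datasets)


def round_to_resolution(value_m, resolution_m):
    """Round a value to the nearest multiple of the given resolution."""
    return round(value_m / resolution_m) * resolution_m


# Hypothetical sources with very different resolutions.
sources = [
    DataSet("survey", 0.1),
    DataSet("legacy_logs", 5.0),
    DataSet("vendor_feed", 1.0),
]

coarsest = combined_resolution(sources)
print(coarsest)
print(round_to_resolution(1234.567, coarsest))
```

Here the merged set can only honestly be reported to the nearest 5 meters, no matter how precise the survey data was on its own. Seeing that number up front makes it easier to decide whether the coarse set is worth adding at all.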
There’s also a lot of work in bringing all of this data into line with the rest of the set so that you can actually use it. Dealing with oil and gas data meant working with one of many coordinate systems, multiple standards for measuring points (e.g. measuring from the ground level or the rig), different log types, completely different metrics for what was important, etc. It took more and more time and effort to sort out the data as we added more.
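As a small illustration of that reconciliation work, here is a sketch of normalizing depths that were measured from two different reference points. The `DepthReading` type, the `"ground"` and `"rig"` labels, and the sample numbers are assumptions made for this example, not a real industry schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthReading:
    depth_m: float             # depth as recorded in the source
    reference: str             # "ground" or "rig" (hypothetical labels)
    rig_height_m: float = 0.0  # height of the rig reference above ground level


def depth_below_ground(reading):
    """Convert a reading to depth below ground level."""
    if reading.reference == "ground":
        return reading.depth_m
    if reading.reference == "rig":
        # A depth measured from the rig includes the rig's height above ground.
        return reading.depth_m - reading.rig_height_m
    raise ValueError(f"unknown reference: {reading.reference!r}")


# Two hypothetical readings of the same point from different sources.
from_ground = DepthReading(1500.0, "ground")
from_rig = DepthReading(1510.0, "rig", rig_height_m=10.0)

print(depth_below_ground(from_ground))
print(depth_below_ground(from_rig))
```

Both readings come out at 1500 meters once they share a reference. Refusing to guess on an unknown reference is a deliberate choice: a loud failure is cheaper than a silently shifted set.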
More data meant more problems reconciling it with the rest of the database we built. That would have been fine if more data meant better results, but not all data is created equal. Sometimes, it was more efficient to throw out large sets because it was hard to act on new information in a realistic time frame. As we scaled our data, we scaled the time it took to process it. Sometimes we got lucky and it was linear; more often than not, it wasn’t.
What Versus Why
Many fledgling data scientists see analytics as a hammer and everything else as a piece of data to nail into their model. On the off chance this is actually done right, it can make data extremely actionable. Let’s be real though, it’s usually done at the expense of actual progress. Data analysis is powerful, but there needs to be a reason for doing it or you just turn arbitrary numbers into different, overvalued numbers with a superiority complex. Does the problem you’re addressing even make sense to address this way? Are you even solving the right problems?
What you have is one thing, why you have it is another. Scientific data at the expense of respect for the people it’s derived from is a problem, as is data without a purpose. What do you have and why does it matter? Why did you get it and why will it benefit society? What do you have and how did you get it? Science has ethics, as should data.
It’s easy to get lost in the quest for metrics. Metadata is powerful and easily available, but what lines do you cross throwing data into your model? These lines are either functional (data for the sake of data) or ethical (data that infringes on rights and privacy). They are effectively two sides of the same coin.
Security Concerns
This also leads into the security concerns behind large data sets. The more data you have about certain topics, the more attractive it can be to threat actors. What does your data tell about the subjects of said data and how easily abused is it?
While you may have no privacy concerns with something like GIS data, there are still other issues if this data gets out. It can cost a lot to collect and maintain a data set and this can give your business or employer a competitive advantage which you rely on. More and more businesses need to combine anonymized or user agnostic data with more specific user data to make sane business decisions.
People have learned about pregnancies in their family from Target before any announcement, thanks to an arguable misapplication of data, but what happens when this gets a bit more malicious? We live in an era of ransomware and data leaks. From EA to RMM providers and beyond, hackers are trying to squeeze money out, and the more data you have, the bigger a target you are.
Conclusion
I love data and analysis, but I’ve also had to learn to make peace with the fact that once you start to plumb the depths, it loses its magic. Each incision into the beast makes it more and more mechanistic in nature, but you mustn’t lose sight of the magic in the unknown… or the reverence you should have for it as an individual. I love what data and analytics can do, but I am also aware of the dangers they pose when let loose with no restrictions.
Consider the life cycle of your data and what it means to the individuals it’s borne from. Where did it come from? What is it contributing to? How is that benefiting people and the world without harming them at the same time? If you truly care about data, big data, and analysis of said data, you need to consider the source, the impact on the source, and what your data means to the world at large.
These are 5 of the most common issues I’ve seen people interested in data lose sight of. The models, algorithms, math, and even baser sciences which constitute data may change, but the philosophy behind it and the purpose really don’t. Data will always be meaningful for making better decisions, and how that data is obtained will always matter, no matter what actual principles of application change. How can you make your data work for you without falling into the most common pitfalls?