When Data Science Turns Into Homoeopathy

WARNING: this post is not about real “homoeopathy”. I use the term as a metaphor for situations where people believe that being a little bit data-driven will fix a flawed workflow.

From the very beginning of my career, I’ve worked with data and with people who use it in one way or another. This post is based on my own and my colleagues’ experience in companies that actively use data. I’ll argue for developing and sustaining strong Data Literacy and a strong Data Culture, and explain why they are crucial for everyone who touches data. There are a few things I realised too late and want to share with the world.

Where does the pain come from?

Let’s start with a simple, deliberately exaggerated example. Imagine a company which develops mobile apps.

Random person: We have a marketing campaign with 25 users from a new ad network. Could you please predict their LTV for us?
Data scientist: It seems you don’t have enough data yet, so the variance will be too high. Buy more users and come back.
Random person: Variance, eh? We don’t have time for this, we need the results ASAP. Just give us your predictions and we’ll decide what to do next.
Data scientist (thinks): it sounds like homoeopathy to me.
Data scientist: Alright, you know better.

So who is wrong in this situation? Now I would say - both. Unfortunately, that wasn’t always my answer.

First, it’s obvious that the “random person” wasn’t data-savvy enough. In most cases, it’s simply impossible to say anything reliable about a traffic source with only 25 users, and he should have known that sample size is one of the most important things to keep in mind when working with data. But the data scientist should also have explained that and suggested a way to solve the problem. Unfortunately, people sometimes cannot accept an alternative solution, and that creates problems for both sides.
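
To make the sample-size point concrete, here is a rough Python sketch. Everything in it is an assumption made up for illustration: the toy LTV model (5% of users pay, with an average spend of 20), the function names and the campaign sizes. It simulates many campaigns of the same size and measures how much the estimated average LTV jumps around from one campaign to the next.

```python
import random
import statistics

def simulate_mean_ltv(n_users, rng, payer_share=0.05, mean_spend=20.0):
    """Average LTV of one simulated campaign (hypothetical toy model)."""
    revenue = [
        # A small share of users pay an exponentially distributed amount,
        # everyone else pays nothing.
        rng.expovariate(1.0 / mean_spend) if rng.random() < payer_share else 0.0
        for _ in range(n_users)
    ]
    return statistics.mean(revenue)

def spread_of_estimates(n_users, n_campaigns=500, seed=42):
    """Standard deviation of the average LTV across repeated campaigns."""
    rng = random.Random(seed)
    estimates = [simulate_mean_ltv(n_users, rng) for _ in range(n_campaigns)]
    return statistics.stdev(estimates)

if __name__ == "__main__":
    for n in (25, 100, 400, 1600):
        print(f"{n:>5} users per campaign: spread of mean LTV = "
              f"{spread_of_estimates(n):.2f}")
```

In this toy model the true average LTV is 1.0 (5% of 20), and the spread shrinks roughly in proportion to the square root of the number of users. With 25 users the noise is about as large as the value being estimated, which is why “just give us your predictions” does not help anyone.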

Giving people data sometimes makes them feel like they already know what to do with it and have some kind of superpowers. But usually, they don’t, at least not everyone.

Creating a Data-Driven Organization

Imagine your team has a model in production which makes predictions on a daily basis. You read a ton of papers, articles and books, and spent a few weeks or months developing, validating and finally deploying it. You are sure that the model is 90% accurate. You deliver the results via a simple dashboard or some other tool. What happens next?

The answer is “it depends”, really. If it’s not a fully automated decision-making system and somebody acts on these predictions once they’re ready, a very likely outcome is that the person for whom the data was intended uses it wrongly. The company eventually loses money and blames the data scientists, claiming their model is wrong.

It happens because working with data is a skill. Every skill can be learned, and no one is born with it. I think it is the responsibility of the Data Scientist to make sure the clients are capable of working with data, and to teach them if they aren’t… or to keep them away from touching data.

No data-based product, from ad-hoc analyses to advanced ML model predictions, should be limited to delivering results to a “client”. The last step or, more likely, even the first one, is to make sure that the people who will work with the outcome of the data analysis know how to do that properly. So the simple rule I follow now is “know your audience”.

A bit about problems of democratized data

Some people say that in a perfect world, everybody should be a Data Scientist. I would disagree with this statement, because in that case I would lose my job. Seriously, though, in a more realistic “perfect” world, everybody should at least be data literate. It’s important for people in many professions, not only data people.

As I mentioned, working with data is a skill, and it’s required for asking the right questions and finding correct answers. I’m not talking about wrangling huge datasets, not at all. I mean simple everyday tasks like checking KPIs in the morning email, generating product ideas or requesting a data analysis. In teams which have a data scientist, everyone else is often a data user as well. They make requests, wait for the analysis and make decisions. They may also use data from dashboards or other BI systems. It’s a really great approach. However, sometimes it runs into serious problems.

A lack of data-savviness, and even of basic knowledge of statistics or math, gets in the way of developing a healthy Data Culture and makes it pointless to “democratise the data”. By this, I mean making data accessible to anyone in the team or company. The culture code of many modern companies (especially tech or digital ones) is that everyone should make decisions based on data. This can happen in either a data-driven or a data-informed way, depending on how developed the Data Culture is. It also raises questions about transparency, communication and many other things, which are interesting topics but not a part of this post.

If data is democratised in a company with a strong data culture and strong data literacy, it becomes a real advantage for the business and for every team member. People are capable of generating the right hypotheses, asking the right questions and making the right decisions based on data. As a consequence, the business moves forward and everyone is happy.

If data is “democratised” in a company with weak data literacy among team members, it ends up in a mess and in suffering for the data scientists. They are bombarded with stupid questions, they dig through data because somebody asked out of curiosity, and they spend their time explaining their work to people and constantly proving that the data is not lying. This can be solved by the right task and project management in the data team. However, the pressure on the data team, and the requirements and expectations placed on it, are very high in this case.

My colleagues and I have been in both situations from time to time, depending on the team or project we were working on. There were situations when we were asked to explain what a median is and why we use it, or how to read a scatter plot. I remember working with people who sent an Excel report back to me with a request to add one column, which could be calculated from two others on the same sheet, and to make a simple line chart from it. If these people had had basic skills in working with data, the situations above would never have arisen. It would have saved a lot of time for both parties and improved the team’s overall performance.

Data trustworthiness and “experts”

Things become more complicated when upper management also has no idea how to work with data or what they want to get from it. Expectations of data scientists are also very high in these companies. Managers ask the DS to “find something interesting” without pointing to a problem they want to solve. Sometimes they even hire a data scientist only because it’s a buzzword. Data scientists are not the “magic unicorns” people often take them for: they also need time and help to understand the problem and think about the right questions, especially in an unknown domain. Hiring a DS and telling him or her “SOLVE MY PROBLEMS” is not a good strategy.

Many managers are sceptical about data, as are many creative people. They may have a lot of product or design ideas (it’s their job, after all). And the most dangerous situation is when your analysis contradicts what they think.

For example, a manager tries to make a point and asks for an analysis, and it ends up with a result opposite to what he or she expected. It’s normal, and it happens very often: that’s how hypotheses work. You formulate one, run an analysis and decide whether you were right or not. But some people don’t trust data enough and may refuse to accept the result you give them.

Sometimes it happens with A/B tests. It can be hard to explain to people what statistical significance is, and why we cannot roll out a new feature when it performed “well” in the test variant but you say “the results are not significant”. Or, in the worst case, it performed significantly worse when they expected it to be great. They are experts, they have a unique vision, it’s impossible that the feature is not working, eh? It happens, and it is devastating when your scientifically backed result is overruled by expertise for no real reason.
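
One way I would try to show the “good but not significant” case is with a small calculation. The sketch below uses a standard two-proportion z-test built only on the Python standard library. The conversion numbers and the function name are hypothetical, picked just to illustrate the point, and a real test setup may well call for a different method.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: 10.0% vs 11.5% conversion in both cases,
# only the amount of traffic differs.
_, p_small = two_proportion_z_test(100, 1000, 115, 1000)
_, p_large = two_proportion_z_test(1000, 10000, 1150, 10000)
print(f"1,000 users per variant:  p-value = {p_small:.3f}")
print(f"10,000 users per variant: p-value = {p_large:.4f}")
```

With 1,000 users per variant the lift is not significant at the usual 0.05 level, while the same rates with ten times the traffic are. The variant “looks good” in both cases, but only the second one gives you evidence that the difference is more than noise.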

The problem intensifies when you make a mistake. It’s hard to earn trust, and if you fail once, your data will be under question for a long time, especially if it contradicts an expert’s opinion. But people make mistakes, and it’s impossible to avoid them completely. The best way to deal with this is to admit your mistakes, understand what went wrong and explain it. Well, anyway, biases are in the air, and they are waiting for you around the corner.

Summary

There are a few key things I would emphasise:

  • Know your “audience”. Make sure that your analysis/model/data matches people’s ability to understand and use it.
  • Deliver+Control. It’s always good not only to publish a report/analysis/model but also to make sure that your audience is using it correctly.
  • Democratising data is an opportunity. However, it works only if everyone has at least basic data literacy and statistical/math knowledge, and understands what is and is not possible to do with their data.
  • Make every effort to earn trust in your data, especially from upper management, and always admit your mistakes.
  • Learn and teach data storytelling.

Someone told me that a good blog post must contain a quote. So I picked one but changed it a bit.

“A company shouldn’t be afraid of its data. Data should be afraid of its company.” - Anonymous Data Scientist, original quote by Alan Moore, V for Vendetta

Reading

If you’re interested in learning more about Data Culture and Data Literacy, I would definitely recommend this book: