
How to prioritise everything in Data Science

Simple Prioritisation

Photo by Hans-Peter Gauster on Unsplash

Data Science is a vague term. Companies define it very differently, and the definition depends on the field and the size of the data team. Data Science can be a product in itself, play a crucial role in product development, or sit in a purely supportive position. This leads to different types of data culture and communication between product and data teams, and it changes the way you prioritise things. I’m not going to focus on specific cases of prioritisation but give you a general framework I use both in my job and in personal projects.

Prioritisation isn’t easy. There is no single axis that helps you triage your backlog, but there are three that can be helpful. Here they are (I call them the 3 Ds):

  1. Dependency. Is it required for other things?
  2. Duration. How long will it benefit us?
  3. Damage. What does it cost?
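The three axes can be turned into a rough scoring sketch. Everything below is a made-up illustration, not a fixed formula: the 1–5 scales, the division by cost, and the example backlog items are all my assumptions.

```python
# A minimal sketch of ranking a backlog along the 3 Ds.
# Scales, formula, and items are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    dependency: int  # 1-5: how much do other things depend on this?
    duration: int    # 1-5: how long will the effect last?
    damage: int      # 1-5: how much will it cost?

    def score(self) -> float:
        # Higher dependency and duration raise priority; higher cost lowers it.
        return (self.dependency + self.duration) / self.damage


backlog = [
    BacklogItem("Reliable ETL flow", dependency=5, duration=5, damage=3),
    BacklogItem("Trendy framework spike", dependency=1, duration=2, damage=2),
    BacklogItem("Ad-hoc request", dependency=1, duration=1, damage=1),
]

for item in sorted(backlog, key=BacklogItem.score, reverse=True):
    print(f"{item.name}: {item.score():.2f}")
```

Even this toy version shows the point: a foundational project with a lasting effect outranks a cheap one-off, despite costing more.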

I’m not including “Benefit” or “Impact” here because it’s obvious: if something isn’t beneficial, why spend time and resources on it? Moreover, some things don’t benefit you directly, or don’t have an instant effect, but pay off in a year or more. Some things are required by other, much bigger projects, and identifying them is an important skill. This leads us to the first D.

Dependency is what helps you build the groundwork for the future. It has two subcategories: projects or tools, and knowledge. Projects and tools are the bricks of your Data Science tech stack. If something will help you create something much bigger than you can build now, it should have a higher priority.

I like the concept of The AI Hierarchy of Needs; you can use it to identify priorities inside the Dependency axis. All infrastructure is a dependency. You cannot do your magic if your database is down or users don’t send you data. You also cannot do your magic if your data is raw and full of anomalies or broken JSONs. And to be sure your data is always good, you need Analytics: metrics, dashboards, monitoring, all of these. Still, consider the other axes and don’t get too deep into the infrastructure woods. Improvements have no limit, and it’s important to know when to slow down. What you need is a reliable ETL flow and basic Analytics to track what’s going on with your data. In short, you need good data to do your magic, and everything that leads you there comes first.

The second subcategory is Knowledge and Learning. It isn’t something with an instant effect; it pays off in the long term. The more valuable skills you and your team have, the more opportunities you have to apply them. Sometimes knowledge is a requirement, sometimes the opportunity isn’t clear yet. But it shouldn’t be underestimated.

There’s a set of skills necessary for every Data Scientist and anyone who works with data. Investing in profound knowledge of the fundamentals will pay off anyway, but spending weeks on a trendy framework could be a waste of time. A lack of fundamentals leaves you unable to explain important concepts to non-technical people, which in reality is very important. You make more mistakes in trivial things and simply slow down the process.

The second D is the Duration of the effect. Focus on things which won’t be obsolete soon and will keep paying off for months or years. Both projects and knowledge fit here. Good examples: infrastructure, a fast and efficient database, in-house frameworks, libraries and tools, a knowledge base, and other evergreen things. Bad examples: patches and curiosity-driven ad-hoc requests.

Treat everything you do as a data product. The key metric for a product is LTV, which depends on revenue (benefit) and retention (duration). So your goal is to maximize the compound impact of your work over a long period of time. This approach might not suit everyone, because it can be too slow for fast-paced environments like startups, but even there, building a strong foundation for the future is a very good thing. It’s important to see the trade-off between what you need now and what you’ll need in a few months.

The last D is Damage, or Costs, or Risks. Call it whatever you want, but it means one thing: the resources you need to spend on the project or task, whether human resources, money, or CPU time. Each of these has a money equivalent, and spending working hours or money always has an opportunity cost: whenever you do something, you’re not doing something else. That’s why we prioritise, isn’t it? If something is too costly, it may have low ROI for that reason alone, and the project will not be profitable. The same applies to data products, including internal ones.
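The opportunity-cost argument can be made concrete with a back-of-the-envelope ROI check. All the numbers below are made-up placeholders for illustration, not real project figures.

```python
# Back-of-the-envelope ROI: benefit per unit of cost over a fixed horizon.
# Every number here is an invented placeholder.
def roi(annual_benefit: float, build_cost: float,
        yearly_support_cost: float, years: int = 2) -> float:
    total_benefit = annual_benefit * years
    total_cost = build_cost + yearly_support_cost * years
    return total_benefit / total_cost


# An impressive model that is expensive to build and support...
print(roi(annual_benefit=100_000, build_cost=150_000,
          yearly_support_cost=80_000))  # below 1.0: not profitable

# ...can lose to a cheap heuristic with a smaller but almost free effect.
print(roi(annual_benefit=40_000, build_cost=5_000,
          yearly_support_cost=1_000))  # well above 1.0
```

The point isn’t the exact formula; it’s that ongoing support cost sits in the denominator, so an “advanced” project can be a net loss before you even consider the alternatives you gave up.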

You can come up with an advanced AI model that increases conversion by 20%, but it will never reach production if you need a year to build it and a couple of engineers to support it afterwards. That’s why you need to start with things you can easily validate. If you can build a prototype within a reasonable amount of time and then test it, half of the job is done. You already know whether it will pay off, and that makes it much easier to convince your team, directors, or C-level to select it for development and give you all the resources you need.

The last rule, which wasn’t on the list, is to make more with less. Favour things which scale easily to deliver more benefit while requiring almost no extra effort or resources. AI is overrated, heuristics are underrated: if one if/else does 80% of AI’s job, you don’t need AI. Always put high-leverage things first. I see this as the main beauty of Data Science: making value from almost nothing. Understanding it makes prioritisation much easier (and sadder at the same time). When you realise that your job is not about doing fit/predict all day, you’ll look at it from a different angle.
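To make the if/else point concrete, here is a hypothetical churn example: before reaching for a model, measure how far a one-line heuristic gets you. The data, the 30-day threshold, and the resulting accuracy are all invented for illustration.

```python
# Hypothetical churn data: did the user churn, and how long were they inactive?
users = [
    {"days_inactive": 40, "churned": True},
    {"days_inactive": 2,  "churned": False},
    {"days_inactive": 35, "churned": True},
    {"days_inactive": 10, "churned": False},
    {"days_inactive": 28, "churned": True},
]


def heuristic(user: dict) -> bool:
    # One if/else in spirit: predict churn for anyone inactive 30+ days.
    return user["days_inactive"] >= 30


accuracy = sum(heuristic(u) == u["churned"] for u in users) / len(users)
print(f"Heuristic baseline accuracy: {accuracy:.0%}")  # 80% on this toy data
```

If a baseline like this already captures most of the signal, a model has to beat it by enough to justify its build and support cost, which ties back to the Damage axis.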