The role of “data scientist,” billed as the sexiest job of the 21st century by Harvard Business Review in 2012, has also been named the top job by Glassdoor three years in a row starting in 2016. The massive amount of attention paid to data scientists has only highlighted the uncertainty around the precise responsibilities the job entails. Separate job roles have been grouped under the data scientist umbrella, causing unnecessary confusion during the hiring process and sometimes preventing companies from hiring the candidate they actually need.
In my opinion, data-related jobs should fall into one of four different categories: Data Analyst, Data Engineer, Data Scientist Type A (Analytical), and Data Scientist Type B (Builder, also known as a machine learning engineer). The two types of data scientists originate from a presentation by Michael Hochster, director of data science at Stitch Fix, and were further elaborated on by Robert Chang, data scientist at Airbnb. Essentially, Type A data scientists are more analytical in their work and don’t necessarily deal with production code, while Type B data scientists focus on building data-driven products. Let’s now look at the primary differences between all four of these roles.
Data Analyst
Primary role: Answering ad-hoc questions from the business team and preparing dashboards to visualize how the business is performing across a variety of metrics.
Main tools: SQL, Tableau, Excel, SAS
Example task: Informing the business's decision makers which product lines are performing well on a daily basis.
Data Engineer
Primary role: Making data accessible to data scientists and analysts in databases, potentially including streaming data. They may also work with Type B data scientists to integrate models into production.
Main tools: Python/Java/Scala, SQL, Spark, Airflow/Jenkins, cloud computing, Docker
Example task: Transforming logs that capture how a company’s products are being used into accessible databases.
Data Scientist Type A (Analytical)
Primary role: Answering why something in the company is happening and how to improve it. Data analysts, by contrast, focus more on what is happening so that problem areas can be identified.
Main tools: R/Python, SQL
Example task: Explaining to the business how to improve a particular product line using a variety of statistical and machine learning techniques.
Data Scientist Type B (Builder/Machine Learning Engineer)
Primary role: Building data-driven products utilizing machine learning that integrate into the company’s production system.
Main tools: Python/Java/Scala, SQL, cloud computing, Docker
Example task: Creating a new product that recommends items to customers.
The Most Valuable Skill Sets By Role
To carry out their primary responsibilities, each of the four types of data professionals I've identified needs a different skill set. An article by Sam Nelson at Udacity summarizes the necessary skills well and also provides a helpful diagram of the most important skills for each role. Below is my version of the diagram with some changes I made based on my own knowledge and experience.
This diagram can be helpful when determining which skills are going to be the most needed for different roles. Data analysts will need very good communication and data visualization skills. Type A data scientists need strong statistics, data wrangling, and machine learning knowledge. Type B data scientists focus more on software engineering, machine learning, and linear algebra/differential equations. That mathematical background is helpful for better understanding performance requirements when scaling algorithms in recommender systems or deep learning applications. Data engineers focus mainly on data wrangling and software engineering, since they write a lot of extract, transform, and load (ETL) jobs that need to run reliably and potentially on high-volume or high-velocity data.
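To make the ETL work mentioned above a bit more concrete, here is a minimal sketch in Python of what such a job might look like at its very simplest. Everything in it is an assumption for illustration: the JSON log format, the field names (user_id, action, ts), the usage_events table, and the use of an in-memory SQLite database in place of a real warehouse. A production job would add scheduling, monitoring, and proper handling of bad records.

```python
import json
import sqlite3


def extract(lines):
    """Parse JSON log lines, skipping blank or malformed ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a real job would log or quarantine bad records
        if isinstance(record, dict):
            yield record


def transform(events):
    """Keep only complete events and normalize their field values."""
    required = ("user_id", "action", "ts")  # hypothetical schema
    for event in events:
        if not all(key in event for key in required):
            continue
        yield (str(event["user_id"]), str(event["action"]).lower(), event["ts"])


def load(rows, conn):
    """Write rows into a queryable table and return the number inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage_events "
        "(user_id TEXT, action TEXT, ts TEXT)"
    )
    rows = list(rows)
    conn.executemany("INSERT INTO usage_events VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def run_etl(lines, conn):
    """Chain the three stages: extract, transform, load."""
    return load(transform(extract(lines)), conn)


# Made-up sample data: two valid events, one malformed line,
# and one event missing its timestamp.
sample_logs = [
    '{"user_id": 1, "action": "View", "ts": "2019-01-01T10:00:00"}',
    '{"user_id": 2, "action": "PURCHASE", "ts": "2019-01-01T10:05:00"}',
    "not valid json",
    '{"user_id": 3, "action": "view"}',
]

conn = sqlite3.connect(":memory:")
inserted = run_etl(sample_logs, conn)

# Once loaded, analysts and data scientists can query the data with SQL.
counts = conn.execute(
    "SELECT action, COUNT(*) FROM usage_events GROUP BY action ORDER BY action"
).fetchall()
print(inserted, counts)
```

The point of the sketch is the division of labor: once the raw logs land in a clean table, the other three roles can work with plain SQL instead of parsing log files themselves.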
When to Hire Which Roles
The order in which you may need to hire for these various roles will depend on your company's size and situation. I have included a few example scenarios below to help you better understand your own organization’s needs.
New Companies
If you are a brand-new company just starting out, I would personally recommend hiring data engineers before any other data roles. This is because any data you need to collect is probably not clean or easily accessible yet. It will be essential to have good logging records about how customers are using your core product. If you wish to create a product based primarily on machine learning, you will then need some Type B data scientists to build it for you. You could also hire some Type A data scientists to help improve your product after analyzing its use. Data analysts can be hired last to make sure any product changes are working correctly and to monitor the company’s growth.
Established Data-Focused Companies
If data essentially is your product, you need to maintain growth. Type B data scientists can fuel that growth with new product directions while Type A data scientists can help optimize your existing products and make them perform better. Data engineers can help if you want to integrate new sources of data or more fine-grained logging, while data analysts can assist with better strategic decisions.
Established Non-Data Companies
If data isn’t your primary product, you will have more decisions to make. I would recommend you first make sure your existing sources of data (especially on your customers and their behavior) are reasonably clean, well-documented, and built on modern technologies. If not, hire some data engineers to help better organize the data and speed up query performance. Next, I would hire data analysts if you don’t currently have them so that business opportunities and current weaknesses in your product offerings can be better identified. Once you have the data readily available and analysts are present to monitor your business, you can bring on data scientists to give your company an edge over your competitors. As with data-focused companies, your non-data company can leverage Type A data scientists to make your existing products perform better and Type B data scientists to build new ones.
If you are in doubt, I strongly recommend reading “The AI Hierarchy of Needs” by Monica Rogati, a data science and artificial intelligence consultant. The main takeaway is that, first and foremost, before you can incorporate machine learning into your product offerings or hire data scientists, you need a robust and reliable data infrastructure. Otherwise, you may waste time and money on substandard models that don’t perform effectively or take too long to build. If you hire the right people for the right role at the right time, however, you can give your company a great advantage.