This article is republished, with the author's permission, from Medium's Towards Data Science blog.
The ability to write production-level code is one of the most sought-after skills in a data scientist role, even if it's not explicitly stated. For a software engineer-turned-data scientist, this may not sound like a challenging task — you might have already perfected the process of developing production-level code and deploying it.
This article is for those who are new to writing production-level code and interested in learning how, such as fresh university graduates or any professionals who have made the transition into data science (or are planning to do so). For them, writing production-level code might seem like a formidable task.
Below, I will provide tips on how to practice writing production-level code. You don’t necessarily have to be in a data science role to learn this skill.
1. Keep It Modular
This is a software design technique recommended for any software engineer. The idea is to break a large piece of code into small, independent sections (functions) based on their functionality. There are two steps:
- Break the code into smaller pieces, each intended to perform a specific task (which may include subtasks).
- Group these functions into modules (or Python files) based on how they are used. This also keeps the code organized and makes it easier to maintain.
The first step is to decompose large code into many simple functions with specific inputs (and input formats) and outputs (and output formats). As noted above, each function should perform a single task, such as cleaning up outliers in the data, replacing erroneous values, scoring a model, or calculating the root-mean-squared error (RMSE). Try to break each of those functions down further into subtasks, and continue until none of the functions can be broken down any further. Here are the three categories of functions you might end up with:
Low-level functions: These are the most basic functions that cannot be further decomposed; for example, computing the RMSE or Z-score of the data. Some of these functions are widely used in training and implementing any algorithm or machine learning model.
Medium-level functions: A function that uses one or more of the low-level functions and/or other medium-level functions to perform its task. For instance, the "clean up outliers" function uses the "compute Z-score" function to remove outliers by retaining only the data within certain bounds. Another example is an error function that uses the "compute RMSE" function to get RMSE values.
High-level functions: A function that uses one or more medium-level functions and/or low-level functions to perform its task. For example, a model training function that calls other functions to draw a random sample of the data, score the model, and compute a metric.
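The three levels might look like the following minimal sketch. It assumes the data arrives as plain Python lists, and every function name here (compute_z_scores, compute_rmse, remove_outliers, score_mean_baseline) is a hypothetical choice for illustration, as is the trivial "predict the mean" model.

```python
import math
import statistics


def compute_z_scores(values):
    """Low-level: return the Z-score of each value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so none of them deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def compute_rmse(actual, predicted):
    """Low-level: root-mean-squared error of two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def remove_outliers(values, max_abs_z=3.0):
    """Medium-level: keep only values whose |Z-score| is within the bound."""
    z_scores = compute_z_scores(values)
    return [v for v, z in zip(values, z_scores) if abs(z) <= max_abs_z]


def score_mean_baseline(values, max_abs_z=3.0):
    """High-level: clean the data, 'fit' a predict-the-mean model, score it."""
    cleaned = remove_outliers(values, max_abs_z)
    prediction = statistics.fmean(cleaned)
    return compute_rmse(cleaned, [prediction] * len(cleaned))
```

Each function can be tested and reused on its own, and the high-level function reads almost like a description of the workflow.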
Finally, you can group all the low-level and medium-level functions that will be useful for more than one algorithm into one Python file (which can be imported as a module), and all other low-level and medium-level functions that are useful only for the algorithm under consideration into another Python file. All the high-level functions should also reside in a separate Python file. This last file dictates each step in model development, from combining data from different sources to deploying the final machine learning model.
There is no hard-and-fast rule to follow regarding the above steps, but I strongly suggest that you start with these and then work to develop your own style.
2. Logging and Instrumentation
Logging and instrumentation (LI) are analogous to the black box in an aircraft, which records everything that happens in the cockpit. The main purpose of LI is to record useful information from the code during its execution, both to help the programmer debug it if anything goes awry and to improve the performance of the code (such as by reducing execution time).
What is the difference between logging and instrumentation?
Logging: Records only actionable information (such as critical failures) during run time, or structured data such as intermediate results that will later be used by the code itself. Multiple log levels such as debug, info, warn, and error are acceptable during the development and testing phases, but avoid them at all costs in production.
Note: Logging should be minimal, containing only the information that requires human attention and immediate handling.
Instrumentation: Records all other information left out of logging that would help us validate code execution steps and work on performance improvements, if necessary. It is always better to have more data, so instrument as much information as possible.
To validate code execution steps: We should record information such as task name, intermediate results, steps completed, etc. This will help us to validate the results and also to confirm that the algorithm has followed the intended steps. Invalid results or a strangely performing algorithm may not raise a critical error that would be caught in logging. Hence, recording this information is imperative.
To improve performance: We should record the time taken for each task/subtask and the memory used by each variable. This will help us improve our code by making the changes needed to run faster and limit memory consumption (or to identify memory leaks, which can happen even in Python).
Note: Instrumentation should record all other information left out in logging that will help us to validate code execution steps and work on performance improvements. It is better to have more data than less.
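One possible way to keep the two streams apart is sketched below with Python's standard logging module. The logger names, the instrumented decorator, and the replace_missing_values step are all hypothetical; the sketch only assumes that critical failures go to one logger and timing and step information go to another.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

# Hypothetical loggers: one for actionable problems, one for instrumentation.
log = logging.getLogger("pipeline.log")
instrumentation = logging.getLogger("pipeline.instrumentation")


def instrumented(func):
    """Record the task name and wall-clock time of every call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # A failure needs human attention, so it goes to the log.
            log.critical("task %s failed", func.__name__, exc_info=True)
            raise
        finally:
            # Timing is recorded whether the task succeeded or failed.
            elapsed = time.perf_counter() - start
            instrumentation.info("task=%s seconds=%.6f", func.__name__, elapsed)

    return wrapper


@instrumented
def replace_missing_values(values, fill_value=0.0):
    """Hypothetical pipeline step: replace None with fill_value."""
    missing = sum(v is None for v in values)
    # Intermediate result, useful later to validate the execution steps.
    instrumentation.info("task=replace_missing_values missing=%d", missing)
    return [fill_value if v is None else v for v in values]
```

Because the two loggers have different names, each can be routed to its own destination and given its own level without touching the pipeline code.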
3. Code Optimization
Code optimization implies both reduced time complexity (run time) and reduced space complexity (memory usage). Time and space complexity are commonly written as O(x), known as Big-O notation, where x is the dominant term in the time or space polynomial. Together, they are the metrics for measuring algorithm efficiency.
For example, say we have a nested for loop, with n iterations at each level, whose body takes about 2 seconds per run, followed by a single for loop of n iterations whose body takes 4 seconds per run. The time consumption can then be written as:
Time taken ~ 2n² + 4n = O(n² + n) = O(n²)
For Big-O notation, we drop the non-dominant terms (they become negligible as n tends to infinity) as well as the coefficients. The coefficients, or scaling factors, are ignored because they are where we have the least room to optimize. Note that the coefficients in the absolute time taken (2 and 4) are the product of the number of for loops of each kind and the time taken for each run, whereas the implicit coefficients in O(n² + n) simply count the loops (one nested for loop and one single for loop). Dropping the lower-order term then gives O(n²) for the process above.
Now, our goal is to replace the least efficient part of the code with an alternative of lower time complexity. For example, O(n) is better than O(n²). The most common performance killers are for loops; less common, but worse, are recursive functions (O(branch^depth)). Try to replace as many for loops as possible with Python modules or built-in functions, which are usually heavily optimized and often run compiled C code rather than Python, to achieve a shorter run time.
I highly recommend reading the section on Big-O in Cracking the Coding Interview by Gayle Laakmann McDowell. In fact, try to read the entire book to improve your coding skills.
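To illustrate moving from O(n²) to O(n), here is a small sketch for one hypothetical task: checking whether a list contains duplicates. The function names are made up for this example, and the linear version assumes the items are hashable.

```python
import timeit


def has_duplicates_quadratic(items):
    """O(n²): compare every pair of items with a nested for loop."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """O(n) on average: a set gives constant-time membership checks."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


if __name__ == "__main__":
    # Worst case for both versions: no duplicates, so every item is visited.
    data = list(range(1000))
    slow = timeit.timeit(lambda: has_duplicates_quadratic(data), number=1)
    fast = timeit.timeit(lambda: has_duplicates_linear(data), number=1)
    print(f"quadratic: {slow:.4f}s, linear: {fast:.4f}s")
```

The absolute timings depend on the machine, but the gap between the two versions widens quickly as the list grows, which is exactly what the Big-O comparison predicts.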
4. Unit Testing
Unit testing: Automates code testing in terms of functionality.
Your code has to clear multiple stages of testing and debugging before it is put into production. Usually there are three levels: development, staging, and production. In some companies, there will be a level before production that mimics the exact environment of a production system. The code should be free from any obvious issues and able to handle potential exceptions when it reaches production.
To identify the different issues that may arise, we need to test our code against different scenarios, data sets, edge and corner cases, and so on. It is inefficient to carry out this process manually every time we want to test the code, for example after a major change. Hence, we should opt for unit testing, which bundles a set of test cases that can be executed whenever we want to test the code.
We have to add different test cases and their expected results to the unit testing module to test our code. The unit testing module goes through each test case, one-by-one, and compares the output of the code with the expected value. If the expected results are not achieved, the test fails — it is an early indicator that your code would fail if deployed into production. We need to debug the code and then repeat the process until all test cases are cleared.
To make our lives easier, Python ships with a module called unittest for implementing unit tests.
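A minimal sketch of what such a test module could look like is shown here. The function under test, compute_rmse, is a hypothetical helper written for this example, and the test cases are illustrative rather than exhaustive.

```python
import math
import unittest


def compute_rmse(actual, predicted):
    """Root-mean-squared error of two equal-length, non-empty sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


class ComputeRmseTest(unittest.TestCase):
    """Each method is one test case with its expected result."""

    def test_perfect_prediction_gives_zero(self):
        self.assertEqual(compute_rmse([1, 2, 3], [1, 2, 3]), 0.0)

    def test_known_value(self):
        # Errors are 3 and 4, so RMSE = sqrt((9 + 16) / 2).
        self.assertAlmostEqual(compute_rmse([0, 0], [3, 4]), math.sqrt(12.5))

    def test_length_mismatch_raises(self):
        with self.assertRaises(ValueError):
            compute_rmse([1, 2], [1])

    def test_empty_input_raises(self):
        # Edge case: there is nothing to average over.
        with self.assertRaises(ValueError):
            compute_rmse([], [])
```

Assuming the code is saved as test_rmse.py (a made-up file name), the whole suite can be run with python -m unittest test_rmse whenever the code changes.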
5. Compatibility With The Ecosystem
Most likely, your code is not going to be a standalone function or module. It will have to be integrated into your company’s code ecosystem. Your code will have to run synchronously with other parts of the ecosystem without any flaws/failures.
For instance, let's say that you have developed a model to give recommendations. The process flow usually consists of getting recent data from the database, updating or generating recommendations, and storing those in a database that will be read by front-end frameworks such as webpages (using APIs) in order to display the recommended items to the user. Simple! It is like a chain: the new link should lock in with the previous one and the next one, otherwise the process fails. Similarly, each process has to run as expected.
To ensure that each process performs as intended, they should all have well-defined input and output requirements, expected response times, and more. If and when it receives a request from other modules (such as a webpage) for updated recommendations, your code should return the expected values in the desired format in an acceptable amount of time. If the results are unexpected values (suggestions to buy milk when we are shopping for electronics), arrive in an undesired format (suggestions in the form of text rather than pictures), or take an unacceptable amount of time (no one waits minutes for recommendations, at least these days), the code is not in sync with the system.
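One way to make such requirements explicit is to check them in code at the boundary between modules. The sketch below is only an illustration: the Recommendation fields, the limit of 10 items, the score range, and the 0.5-second budget are all assumptions, since the real contract has to come from the teams that own the neighboring systems.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical output format agreed with the front-end team."""

    item_id: str
    score: float


def validate_recommendations(recommendations, max_items=10):
    """Raise ValueError if the output breaks the (assumed) contract."""
    if len(recommendations) > max_items:
        raise ValueError(f"expected at most {max_items} items")
    for rec in recommendations:
        if not isinstance(rec, Recommendation):
            raise ValueError(f"unexpected output format: {rec!r}")
        if not 0.0 <= rec.score <= 1.0:
            raise ValueError(f"score out of range: {rec.score}")
    return recommendations


def recommend_within_budget(recommend, user_id, max_seconds=0.5):
    """Call recommend(user_id), then check its format and response time."""
    start = time.perf_counter()
    recommendations = validate_recommendations(recommend(user_id))
    elapsed = time.perf_counter() - start
    if elapsed > max_seconds:
        raise TimeoutError(f"took {elapsed:.3f}s, budget is {max_seconds}s")
    return recommendations
```

Here recommend stands for whatever function produces the recommendations; wrapping it this way turns a silent mismatch with the rest of the system into an explicit error.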
The best way to avoid such a scenario is to discuss the requirements with the relevant team before you begin the development process. If the team is not available, go through the code documentation and the code itself, if necessary, to understand the requirements.
6. Version Control
Git, a version control system, is one of the best things to happen to source code management in recent times. It tracks the changes made to the code. There are many other version control/tracking systems, but Git is more widely used than any of them.
The process, in simple terms, is "modify and commit." Within that, there are many steps, such as creating a branch for development, committing changes locally, pulling files from a remote, pushing files to a remote branch, and much more, which I am going to leave you to explore on your own.
Every time we make a change to the code, instead of saving the file with a different name, we commit the changes. This records a new snapshot of the code, identified by a unique key (the commit hash), while all earlier versions stay in the history. We usually write a short message describing the change every time we commit. Let's say you don't like the changes that were made in the last commit and want to revert to the previous version. That can be done easily using the commit's reference key. This is what makes Git so powerful and useful for code development and maintenance.
You likely already understand why this is important for production systems and why it is mandatory to learn Git. We must always have the flexibility to go back to an older version that is stable just in case the new version fails unexpectedly.
7. Readability
The code you write should be easily digestible to others as well, or at least to your teammates. Moreover, it will be challenging for you to understand your own code in a few months if proper naming conventions are not followed. To maintain readability, you should use:
Appropriate variable and function names. The variable and function names should be self-explanatory. When someone reads your code, it should be easy for them to find what each variable contains and what each function does, at least to some extent.
It's perfectly okay to have a long name that clearly states its functionality or role, rather than vague short names such as x, y, or z. Try not to exceed 30 characters for variable names and 50–60 for function names.
Previously, the standard code width was 80 characters, based on IBM standards that are totally outdated. Now, going by GitHub, it is around 120. Limiting variable names to a quarter of the page width gives 30 characters, which is long enough yet doesn't fill the page. Function names can be a little longer, but again, they shouldn't fill the entire page. Limiting them to half the page width gives 60.
For instance, the variable for the average age of Asian men in our sample data can be written as mean_age_men_Asia rather than age or x. A similar argument applies to function names as well.
Docstrings and comments: In addition to appropriate variable and function names, it is essential to have comments and notes wherever necessary to help the reader understand the code.
Docstring: Specific to a function, class, or module. This is the first few lines of text inside the function definition that describe the role of the function, along with its inputs and outputs. The text should be placed between a pair of triple double quotes:
    def function_name(<inputs>):
        """<docstring>"""
        return <output>
Comments: These can be placed anywhere in the code to inform the reader about the action or role of a particular section or line. The need for comments will be considerably reduced if we give appropriate names to variables and functions, because the code will be, for the most part, self-explanatory.
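Put together, descriptive names, a docstring, and a comment might look like this small sketch. The function compute_mean_age and the sample ages are invented for illustration; the docstring layout (Args, Returns, Raises) is one common convention, not the only acceptable one.

```python
def compute_mean_age(ages_in_years):
    """Return the arithmetic mean of a collection of ages.

    Args:
        ages_in_years: Non-empty sequence of ages, in years.

    Returns:
        The mean age as a float.

    Raises:
        ValueError: If ages_in_years is empty.
    """
    # Fail early with a clear message instead of a ZeroDivisionError below.
    if not ages_in_years:
        raise ValueError("ages_in_years must not be empty")
    return sum(ages_in_years) / len(ages_in_years)


# Hypothetical sample data; each name states what the value holds.
ages_men_Asia = [34, 41, 29, 52]
mean_age_men_Asia = compute_mean_age(ages_men_Asia)
```

The docstring is also what help(compute_mean_age) prints, so it doubles as documentation for anyone who imports the function.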
8. Code Review
Although it is not a direct step in writing production-quality code, code review by your peers will help you improve your coding skills.
No one writes flawless computer code, unless they have more than 10 years of experience (and even then, it's not guaranteed). There will always be room for improvement. I have seen professionals with several years of experience writing awful code, and also interns still pursuing their bachelor's degrees with outstanding coding skills. You can always find someone who is better than you. It all depends on how many hours you invest in learning, practicing, and, most importantly, improving that particular skill.
People who code better than you always exist, but you will not always find them on the team with which you can share your code. Perhaps you are the best on your team. In that case, it is still worth asking others to test your code and give feedback. They might catch something that escaped your eyes.
Code review is especially important when you are in the early stages of your career. Here are the steps I'd recommend for successfully getting your code reviewed:
- Ask your peers to perform a code review only after you have written the code and completed all the development, testing, and debugging. Make sure you don't leave in any silly mistakes.
- Send your teammates a link to your code. Don't ask them to review several scripts at one time; ask them to look at one after the other. The comments they give on the first script may well apply to the other scripts too. Make sure you apply those changes to the other scripts, where applicable, before sending out the second script for review.
- Give them a week or two to read and test your code for each iteration. Also provide all necessary information to test your code, such as sample inputs, limitations, and so on.
- Meet with each one of them and get their suggestions. Remember, you don’t have to include all their suggestions in your code; select the ones that you think will improve the code at your own discretion.
- Repeat until you and your team are satisfied. Try to fix or improve your code within the first few iterations (3–4 at most); otherwise, you might make a bad impression.
I hope this article is helpful and that you enjoyed reading it.