Structure, readability and efficiency in code development
A common behaviour among data scientists is to learn to develop on Jupyter/Databricks notebooks. However, over time, Notebooks can become long and unwieldy, with hundreds of cells running in a chaotic order, no clear code structure, and library compatibility issues (especially if your fellow developers are using different versions of the same libraries).
If you have experienced any of these problems, this article is for you.
At Capitole we have presence in many different industries. Many of us are in data processing projects, in Data Science/Development/Devops positions and work both on physical servers and on cloud machines in AWS, Azure or other cloud services. For us it is very important to work efficiently and follow good practices in development, leaving a good image of our company wherever we go. This allows us to perform our job the best we can and makes things easier for the end customers of the developed product.
In this article, we share some of the reflections that we have acquired over time, as tips to organise the code.
They are simple tricks that can save a lot of time and misunderstandings in the day-to-day work of the team of developers.
From Jupyter/Databricks notebooks to scripts
Many of us begin coding in Jupyter notebooks, and I get it—it’s simple, allows you to quickly test new code, experiment with syntax, and easily visualize plots. However, as you become more proficient in Python, it’s important to transition to writing scripts.
Why make the switch? There are many good reasons, but the most important one is that it encourages better code structure. In a script, there are no cells—everything runs sequentially. If you need additional functions, you can write separate scripts and use them as modules (a module is simply a .py file that contains functions and classes for reuse).
So, what is a script?
A script is simply a .py file designed to execute a specific task or set of tasks. Let me show you the basic structure of a script with an example: analyze.py.
* In short, if __name__ == « __main__ »: allows you to execute code when the file runs as a script, but not when it’s imported as a module. To run it as a script simply type python analyze.py in your terminal. To use it as a module, in a new .py file, write import analyze, and you’ll have access to the 3 functions defined without running the code inside the if statement.
Readable Code
Imagine I start writing the following: “goodcodingpractices ARE oneofTHEmost imp
ortant skillssssss you Will develop as a DataScientist.”
You probably understood what I meant, but you had to do an effort to do so. Bad coding practices are the equivalent of what I just showed in the previous sentence for code, but even worse. I remember at the beginning of my career writing code, so poorly I couldn’t understand it myself. In this section you’ll learn how to write code properly. The main idea here is that your code should be easy to read by anybody (other people and your future self). In my experience, this trait is what differences a beginner from a pro.
Variable Names
“There are only two hard things in Computer Science: cache invalidation and naming things.”
– Phil Karlton
Check the following video to learn how to name variables properly. (These tips also apply to function names).
Functions
I assume you know what a function is and its syntax in Python. The important thing here is how to use them effectively and name them correctly. Functions should be used to structure your code correctly. If your functions are more than 100 lines long, there is probably something wrong. Break them into smaller functions that make sense.
Tips for naming your functions:
- Descriptive names: The name should describe what the function does in a clear and concise way.
- Action verbs: Function names should use verbs to indicate what the function does.
- Use the snake_case naming convention.
- Avoid abbreviations: Abbreviations can make function names difficult to understand.
Indentation
If you need more than 3 levels of indentation, you should fix your program. You can read the preferred Linux kernel coding style and take it as a reference. This video shows the importance of this point.
Comments
In an ideal world, you wouldn’t need comments. If your variable and function names are concise and self-explanatory, and your program is designed in a way that breaks down into logical functions that are easy to follow, your code should be easily readable, and no comments would be needed.
However, we live in an imperfect world where the best decisions are not always obvious, and where we sometimes have to sacrifice readability for performance. For these reasons, I recommend writing comments. I suggest breaking functions (or code) into small chunks, each accompanied by a comment at the top explaining what you are doing and why.
Virtual Environments
If you’ve been involved in several projects simultaneously without using virtual environments, you know the struggle. Every time you need a library, you simply type pip install <new_library>, and if you now run pip list, you’ll see a huge list of libraries that you don’t even remember installing. The pain is even greater if you work in a team where nobody uses virtual environments or if you are involved in several projects simultaneously: encountering code crashes for no apparent reason, difficulty running code for new team members, and so on.
The solution to these problems is a virtual environment. For Python, I recommend virtualenv. It creates an environment in which you can install libraries completely independent of the rest of your system. To install it, simply run pip install virtualenv, and to learn how to use it, type tldr virtualenv. To remove a virtual environment, simply delete the folder you initially created. Note that the process of activating a virtual environment is slightly different for Windows and Linux.
Remember that you can have as many virtual environments as you want. Don’t be afraid to create and delete environments as needed.
I usually create two for each project: one for development and one for production.
Conclusion
In summary, good coding structure and practices not only improve development efficiency, but also facilitate collaboration and long-term code maintenance. Migrating from notebooks to well-organised scripts, writing clear and concise functions, using descriptive names, and taking advantage of tools such as virtual environments are essential habits for any development team. The key is to write code that is easily understandable, reproducible, and adaptable, which will benefit you, your teammates, and your customers. Efficiency in code is ultimately efficiency in results.