How can I effectively organize my SQL query code to further reuse? In their everyday life, many people use SQL to construct queries that they are going to reuse at some point. But how do these queries get organized? How can one categorize and document them in order to allow easier use in the future?
Personally, I have experienced many times desperation when staring at a query that I didn’t remember how it works or why I wrote it or even worse looking for an older query version that was lost forever. For this, I have a deep appreciation for any tool or practice that can help me avoid these feelings.
At this post, we are going to explore many different practices that one can apply – or is used to apply. Some of them are quite old-fashioned while others take advantage of some very popular software products or services.
This is part #4 of our helpful SQL short guides:
- Table and Data Partitioning
- SQL type casting
- SQL pattern matching
- Organizing SQL Queries (this part)
- Handling Datatypes for IP addresses in SQL
- SQL DATETIME explained
- SQL Window Functions
- A guide for Advanced Grouping
The old way
Saving SQL queries that are not used very often in local text files, organized in folders and subfolders, definitely does not belong to the past. Instead, it is a method extensively used in large companies nowadays when it comes to code organization. Nevertheless, the lack of documentation, the inability to cooperate with others and the absence of versioning are the biggest challenges one may face while using the above method. Furthermore, the same queries are written over and over again multiple times from many different people, wasting much time and effort.
An improved version of this approach involves departmental wikis where the content can be collaboratively developed and the queries can be also documented easily. Yet the need for versioning is not satisfied and that’s quite a downside since accidentally lost versions cannot be retrieved.
At that era, it seems that what every professional who writes SQL needs is a set of tools and software application that will allow him to minimize the duplicative work and eliminate the above-mentioned pitfalls. At this post, we are going to have a look at the most important tools that fall under this category.
The (not so) new way
Developers, Data Scientists, BI analysts: all of them use SQL on different platforms and all of them want to maintain their code well-organized but not all of them are keen on the same software applications.
If you ask a developer or a technical-oriented analyst how would he organize his SQL code he would probably tell you that for him code lives in Git. Git has been around for a decade or more and has successfully been used in a large number of corporate projects for source code management. The reasons why it became so famous include:
- Distributed model: You can share with others only what is necessary. You can also continue working offline.
- Branching & Merging: Each member of the project can continue working on his own branch without messing with others’ work. At the end, the work is combined.
- History: Easy recovery of previous versions of your code
- Cost: It’s free!
On the dark side, Git definitely requires some technical knowledge. The available commands and their corresponding options are many and not that intuitive at some point, making the learning curve quite steep. This may give nightmares to some semi-technical professionals who just need a consistent way of organizing their SQL code.
The data scientists who come from computer science will probably be familiar with Apache Spark and Zeppelin, a web-based notebook that enables data-driven, interactive analytics and collaborative documents with SQL, Scala and more. This analytics application offers a pluggable storage mechanism with multiple implementations, one of which is a local Git repository. This means that when using Zeppelin one can enjoy all the benefits that Git encompasses while performing analytics.
For the advocates of python, a commonly used application is Jupyter Notebook. Jupyter Notebook is a server-client application that allows editing and running of python code via a web browser combining python code, SQL, equations, text, and visualizations. It also offers syncing with GitHub repositories.
More specifically, Jupyter Notebook will be rendered by GitHub directly on your repo page. This means that one can enjoy all the benefits that Git offers regarding version control, branching, merging and collaborative development when using Jupyter Notebook.
Visual Studio developed by Microsoft is a widely used IDE, used to develop computer programs as well as websites, web apps or mobile apps. Visual Studio can also be used as a powerful development environment for SQL Server and Azure SQL Database. When it comes to query organization, from within Visual Studio one can use Git for distributed version control. Git commands are offered as built-in so that one can review diffs, stage files, make commits and revert changes to a GitHub repository.
Since data science is by definition the intersection of computer science, statistics, and business, some of those that have a non-technical or semi-technical background and work as analysts may not want to get into too many technicalities. For example, the majority of Business Intelligence Analysts would prefer to have a BI tool that would also take care of this. Fortunately, the majority of the well-known BI tools offers relevant convenient features. Although collaborative development and versioning is not their main concern, in some cases they offer both as supplementary features to the main product.
If one wants to categorize the existing business intelligence tools, the two most prominent categories would be those which are more suitable for marketers and business people in general and those which are often chosen by data scientists who prefer writing SQL.
In first category fall tools such as Tableau, Looker, and Chartio. In all these tools visualizations can be created using visual composer instead of plain SQL. These visualizations can then be organized into dashboards for easier reference. Apart from this, Tableau since version 9.3 supports versioning. According to the documentation, this means that for every workbook you can start tracking its publishing history. Overwriting a workbook will create a copy of the previous version able to be recalled later if necessary. Previous versions are stored and made available for download via the web interface.
Similarly, when using Looker you can connect each project to a separate repository on GitHub and push or pull any changes very easily from within the Looker’s interface. Furthermore, with you can still access any previous versions of your code from the history.
Regarding Chartio, although it does not support version control in the traditional sense, it maintains a log file with all previous query run. From there, with some effort, its users can retrieve queries’ previous versions.
The advocated of SQL would probably prefer a BI tool of the second category. Such examples are Mode and Periscope. In both of them, executed queries can be saved as visualizations, parts of a dashboard or as standalone entities. In addition, version control is also offered. Mode supports storage under version control in one centralized place while in Periscope one can revert to the previous state of the code, if needed via the history tab. From the same tab, one can also monitor what changes have been made, who made each one of them and at what time.
Overall, it seems that we have recently moved past the point when SQL code was saved into text files. Version Control Systems along with other collaborative tools seem to be the key to efficient and effortless SQL code organization. In addition, there’s also a new generation of business intelligence tools build on top of the idea of version control, that leverages the SQL power in order to build reports and visualizations easier than ever before.