How I write SQL code

Marton Trencseni - Fri 24 January 2020 - Data

Introduction

This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.

I’ve always cared a lot about writing readable code. Readability has several components:

directory layout
breaking code into files (eg. what to put in headers, how many)
code layed out in files (eg. ordering of includes, templates, classes, functions)
naming of files, classes, functions and variables
indentation, line width
comments in code files
the modular structure of the code itself (eg. design patterns)
high-level documentation that explains design choices

Readable code should be relatively easy to explore and read. I say "relatively", because most programs have a lot of innate complexity that you have to wrap your head around, even if it’s well-written.

A famous example of high quality and readable (C++) code is the Quake3 and Doom3 code by John Carmack. When I was working on ScalienDB many years ago, I tried to write C++ code in a similar style. A good C++ book on the above points is Large Scale Software Design by John Lakos.

Writing good SQL code is much simpler than writing good C++ code. SQL code has no templates, classes, and in my experience analytical "data mining" SQL code also has no functions (=stored procedures). It’s really just a lot of standalone SELECTs, sometimes chained with WITH, or by dropping results into tables and SELECTing out of those tables in the next step. Having said that, I still see a lot of data people writing hard to read SQL code.

Without further ado, my rules for writing SQL code follow below.

Use all upper-case for keywords like `SELECT`

Also use upper-case for built-in types and functions like BIGINT and COUNT(). Use lower-case for column names and table names.

Keep `SELECT .. FROM .. WHERE .. GROUP BY` unindented

But indent the rest. An example so far:

SELECT
customer_name,
customer_street
FROM
company_metrics
WHERE
customer_name='Marton Trencseni'

Line up `WHERE` conditions

God is merciful because AND⎵ is 4 characters, a good tab width, so WHERE conditions are to be lined up like (same for JOIN conditions):

SELECT
...
FROM
deliveries
WHERE
country='UAE'
ANDday>=DATE('2019年07月01日')
ANDDAY_OF_WEEK(day)!=5
ANDscheduled_accuracy_meters<=10*1000

Write `GROUP BY 1, 2`

Always put the grouped columns first in the column list, and write GROUP BY 1, 2 .. N:

SELECT
region_fleet,
CASEWHENstatus='Delivered'THEN'Delivered'ELSE'Not Delivered'ENDASstatus,
DATE_TRUNC('week',day)ASweek,
COUNT(DISTINCTROW(day,so_number))ASnum_orders,
COUNT(DISTINCTCASEWHENscheduled_accuracy_meters<=500THENROW(day,so_number)ELSENULLEND)ASnum_accurate,
AVG(scheduled_accuracy_meters)ASscheduled_accuracy_meters
FROM
deliveries
WHERE
...
GROUPBY
1,2,3

Indent `WHEN` relative to `CASE`

SELECT
...,
CASE
WHENschedule_channelIN('Ticker')THEN'Ticker'
WHENschedule_channelIN('CallCenter','CSA')THEN'Callcenter'
WHENschedule_channelIN('CallCenterBlind','CSA-BD')THEN'Blind'
WHENschedule_channelIN('AutoReschedule','AI-AutoReschedul','Auto Schedule','DriverReschedule')THEN'Rest'
WHENschedule_channelIN('BulkSchedule')THEN'Bulk'
WHENschedule_channelIN('CallCenterSelfPickupPoint','CallCenterSelfPickupWarehouse')THEN'Self-pickup'
WHENschedule_channelIN('AI-AutoSchedulin','AI_Rango','AI-Rango')THEN'AI'
WHENschedule_channelIN('MWeb','MWEB','mobile-ios','mobile-android')THEN'Mweb/apps (various)'
WHENschedule_channelIN('Whatsapp','Callcenter','Blind','Driver (various)','Bulk','AI')THENschedule_channel
ELSE'Rest'
ENDASschedule_channel
FROM
deliveries

Use `WITH` instead of nested sub-`SELECT`s

Sub-SELECTs with indenting are hard to read. Instead, create aliases with WITH, and chain them. Put the WITH on a separate line, and then write the aliases. If I have no better idea, I call the aliases step1, step2 ...

For example, suppose table deliveries has columns scheduled_coordinates and actual_coordinates as lat, lon string, and you want to compute the meter distance with the Haversine formula:

WITH
step1AS
(
SELECT
*,
TRIM(SPLIT_PART(actual_coordinates,' ',1))ASact_lat_str,
TRIM(SPLIT_PART(actual_coordinates,' ',2))ASact_lon_str,
TRIM(SPLIT_PART(scheduled_coordinates,' ',1))ASsch_lat_str,
TRIM(SPLIT_PART(scheduled_coordinates,' ',2))ASsch_lon_str
FROM
deliveries
),
step2AS
(
SELECT
*,
CASE
WHENREGEXP_LIKE(act_lat_str,'^[0-9]+\.[0-9]+$')THENCAST(act_lat_strASDOUBLE)
ELSENULL
ENDASactual_lat,
CASE
WHENREGEXP_LIKE(act_lon_str,'^[0-9]+\.[0-9]+$')THENCAST(act_lon_strASDOUBLE)
ELSENULL
ENDASactual_lon,
CASE
WHENREGEXP_LIKE(sch_lat_str,'^[0-9]+\.[0-9]+$')THENCAST(sch_lat_strASDOUBLE)
ELSENULL
ENDASscheduled_lat,
CASE
WHENREGEXP_LIKE(sch_lon_str,'^[0-9]+\.[0-9]+$')THENCAST(sch_lon_strASDOUBLE)
ELSENULL
ENDASscheduled_lon
FROM
step1
),
step3AS
(
SELECT
*,
6371*1000*2*ASIN(SQRT(
POW(
SIN(RADIANS(scheduled_lat-actual_lat)/2),
2
)
+
COS(RADIANS(actual_lat))
*COS(RADIANS(scheduled_lat))
*POW(
SIN(RADIANS(scheduled_lon-actual_lon)/2),
2
)
))ASscheduled_accuracy_meters
FROM
step2
)

Long lines are okay

As you can see, long lines are okay in my opinion. We all have widescreen aspect ratio screens (Macbooks), let's use that space. Shorter lines are more readable, but breaking everything into shorter lines, re-breaking when making changes, it's too much work in my opinion, especially when dealing with lenghty "business logic" in SQL.

Break the rules for readability and flow of code

Rules are made to be broken. If doing it in another way leads to better readability, break the rules.

For example, sometimes we SELECT out a horizontal/vertical part of a table in a quick sub-SELECT to help the query optimizer. In cases like this I don't use WITH and keep it in one line, like:

SELECT
...
FROM
deliveriesdls
LEFTJOIN
(SELECTid,nameFROMdriversWHEREcountry='UAE')drv
ON
dls.driver_id=drv.id

Follow the same rules when writing interactive SQL code

I follow these same rules when writing one-off queries in an interactive console. That way it’s easier to stick to them when writing long-lived code that goes into ETL and the repo. Also, one-off code oftens ends up being ETL’d.

Alternative ways to write SQL

Finally, some other, more comprehensive guides:

data programming sql

Introduction

Use all upper-case for keywords like SELECT

Keep SELECT .. FROM .. WHERE .. GROUP BY unindented

Line up WHERE conditions

Write GROUP BY 1, 2

Indent WHEN relative to CASE

Use WITH instead of nested sub-SELECTs