Can we imagine something better than embedded SQL strings

Date [ 12 Dec 2024 ]

SQL is lingua franca of the data world. Every tool supports it or has some version of it. I’ve been exposed to Postres, Trino, Spark-SQL, Flink-SQL, MySql, MS-SQL and they all work the same way for basic things. Of course the real challenge is when things get more complicated, certain functions have different names, or don’t exist, and some query patterns that work well in one engine don’t in another. (And I don’t just mean between OLTP and OLAP engines) While these are minor annoyances, it is pretty amazing to have a declarative language that is mostly familiar across tools and environments so that everyone from a product manager to a backend engineer can query terabytes of data in a few seconds.

Sadly what isn’t amazing is all the extra tooling that has been built on top of this in the data engineering world. Data Engineering can’t just rely on SQL, we need to orchestrate queries and parameterize them in various different ways. This is often done in Python (e.g. Airflow, Dagster, Prefect) using some degree of templating, whether that is using jinja2, f-strings, or .format(). Doing this has drastic negative effects on developer experience becuase we loose even basic things like syntax highlighting, or more advanded features of static analysis.

Some IDEs have extensions that detect embedded SQL and apply better linting rules. (Shout out to Pycharm which does this and lets you select a wide range of SQL dialects.) But if you don’t have that you’re stuck looking at a very long string and trying to understand why, when it’s all stitched together, where the syntax error is.

SQL itself is not a language that is easily modularized. It is difficult to create chunks or “functions” as part of the query. I’ve seen SQL embedding in python as f-strings where whole parts of the query are generated by calling functions in the string parameters (those functions themselves taking parameters from the main python code). It’s impossible to read but it’s done because the original author was struggling to make something that was composible.

There are other APIs, ORMs wrap up SQL in more native looking code and some tools like Spark have a data frame API which provides python native access. I’ve found getting out of SQL and using the data frame API quite a productivity boost but…

Because SQL is lingua franca it is difficult to on board users onto other ways of developing. Any open source Data or BI tool will accept SQL but ORM wrappers or data frame APIs aren’t supported. I try to work as much in Spark as possible and use the Dataframe API as much as I can due to it’s composibility, but it is hard to iterate quickly when Trino is just faster at some things and the only way to use it is via SQL through the company sanctioned web app BI tool.

Then again, maybe this doesn’t matter anymore: I can ask github copilot to take some SQL and convert it into pyspark Dataframe API, and it does a pretty good job of it