Database-abstraction APIs should not exist

Database-abstraction APIs where you write a query using the host programming language should not exist. Or more precisely, should not have to exist. For example, in Django you can query the database using the following Python code:

Entry.objects.filter(is_draft=True)

Which Django translates (roughly) into the following SQL:

SELECT *
  FROM blog_entry
  WHERE is_draft = true;

But why we cannot write SQL query directly as an SQL query, while retaining all other features Django offers through its database-abstraction API (database agnostic code, inputs to queries and outputs from queries being Python objects, etc.)? I claim there is no reason anymore for that.

Babel

I have confidence in this claim because of the success of Babel JavaScript transpiler and the ecosystem around it. Babel can take a newer version of JavaScript as input and converts it so that it can run in older JavaScript runtimes. It maps new features to old ones. It allows one to use syntax of new or even just proposed features of the JavaScript language. It allows experimentation around the language.

The approach it takes for this is that Babel first parses the input code into abstract syntax tree (AST), an internal representation of the code, then converts the tree so that it represents the code which can run in older JavaScript runtimes, and then renders AST back to textual representation of the code.

Babel is successful because it has an ecosystem of plugins that make it even more powerful, allowing integration with various other libraries and programming language patterns, so that you can tailor programing in JavaScript to your needs and preferences.

Why we have database-abstraction APIs?

Such high-level database-abstraction APIs exist in almost all programming languages and are often paired with object-relation mapping (ORM) capabilities to present data from the database as objects in the host language.

What features do we currently get (or are at least promised) from these database-abstraction APIs?1

  1. Database engine independence: the Django query above can be converted to SQL dialects for many popular database engines without having to change Python code.
  2. Queries are part of the rest of the code, which allows tight integration with the host language. For example, type checking of the code can type check queries as well, which is not possible with opaque SQL queries represented as strings in the host language. Moreover, other types of query validation can be done before the query hits the database engine.
  3. It is easier to organize and compose such code. E.g., often there is support for various hooks to register code to be called before and after queries. Such features allow better modularity of code based on concerns (a module for logging all queries, a module for checking permissions, etc.). Such hooks are also written in the host language itself and not some database engine specific language.
  4. Support for custom data types which are transparently converted to and from the underlying data types supported by a database engine.
  5. Can have a more user friendly and familiar syntax than SQL, especially if you do not know SQL.

But in practice there are many limits:

  1. Database engine independence means that one can use only common/shared features between different database engines and if one wants to use a particular special feature, they have to use a “raw query as string” escape hatch. Which of course makes the code depend on a particular database engine and you loose many other features (like query validation/type checking). Database-abstraction API can add high-level support for such special feature and map it to different database engines, but wouldn’t it be better if you could just write the raw SQL query without loosing other features?
  2. Because queries can be very different and inputs and outputs to queries even more so, high-level database-abstraction API functions are generally very broadly typed, allowing inputs with types which do not even match the query, failing only at runtime (e.g., *args, **kwargs in Python).
  3. Hooks in the host language are nice, but because they are run outside of the database engine an issue of recovering from failures arises. What if a database transaction completes but a hook fails? What if a database transaction fails but a hook succeeds? In general synchronizing between such external hooks and a database engine can be tricky.2

Parsing SQL into AST

If you look at the Python code example above, it uses Python to generate AST and then renders this AST to SQL. But why not use SQL itself to generate this AST? And then use plugins to transform AST to implement various features we want?

First, it is important to note that there are two types of SQL statements: data definition statements and queries. Parsing data definitions statements (e.g., CREATE TABLE) allows us to extract typing information about data. We can use this typing information to type check queries as well. Moreover, we can use that information to know how to convert query results to host language objects. Depending on the integration with the host language, type checking and validation of both queries and argument values might be possible at the compile time, or at least at runtime before even hitting the database engine.

Examples

Having plugins which can operate on AST allows developer to write simple queries which are then transformed further. This achieves decoupling of concerns where original query deals with what data is needed at that point in the program while how is that data obtained, queried, structured in the database, or if query is even allowed, can all be delegated to plugins.

Permission checks

If original simple query looks like:

SELECT *
  FROM blog_entry
  WHERE is_draft = true;

But only the owner of a blog entry should be able to see it while it is draft. A plugin for row-level permission checks could transform this query automatically into:

SELECT *
  FROM blog_entry
  WHERE is_draft = true
    AND owner_id = current_user_id();

Furthermore, the plugin could change definition of blog_entry table to include necessary owner_id column.

Doing such checks manually is tedious and error prone, especially when you have nested queries or joins and you have to make sure you add check to everywhere.

We can expand this example further if we introduce groups of users and allow multiple users to own a blog entry:

SELECT DISTINCT id, title, body, pub_date, is_draft, internal_comment
  FROM blog_entry
    LEFT JOIN blog_entry_owner_users
      ON blog_entry.id = blog_entry_owner_users.entry_id
    LEFT JOIN blog_entry_owner_groups
      ON blog_entry.id = blog_entry_owner_groups.entry_id
    LEFT JOIN groups_to_users
      ON blog_entry_owner_groups.group_id = groups_to_users.group_id
  WHERE is_draft = true
    AND (
      blog_entry_owner_users.user_id = current_user_id()
      OR groups_to_users.user_id = current_user_id()
    );

The plugin has information about blog_entry table and can enumerate all columns when necessary. Here it makes sure that returned columns match the original query.

Enjoy writing this again and again. This is why people prefer doing permission checks in application code.

Plugin could also support permissions on columns. For example, internal_comment field should be visible only to blog entry owners.

SELECT *
  FROM blog_entry 
  WHERE id = 42;

Could get transformed into:

SELECT DISTINCT id, title, body, pub_date, is_draft,
    CASE blog_entry_owner_users.user_id = current_user_id()
        OR groups_to_users.user_id = current_user_id()
      WHEN true THEN internal_comment
    END AS internal_comment
  FROM blog_entry
    LEFT JOIN blog_entry_owner_users
      ON blog_entry.id = blog_entry_owner_users.entry_id
    LEFT JOIN blog_entry_owner_groups
      ON blog_entry.id = blog_entry_owner_groups.entry_id
    LEFT JOIN groups_to_users
      ON blog_entry_owner_groups.group_id = groups_to_users.group_id
  WHERE id = 42
    AND (
      is_draft = false
      OR (
        is_draft = true
        AND (
          blog_entry_owner_users.user_id = current_user_id()
          OR groups_to_users.user_id = current_user_id()
        )
      )
    );

Keeping historical versions

If you want to keep all historical versions of all data, you can setup a database view which modifies UPDATE queries into INSERT queries and for SELECT queries only returns the latest version. But managing views and their migration can add additional complexity, so we can instead use a plugin which transforms queries in this way before they hit the database engine.

The plugin could automatically add version field to all tables and then a query like:

UPDATE blog_entry
  SET title = "Hello world"
  WHERE id = 42;

Gets transformed into:

INSERT INTO blog_entry
  SELECT b1.id, b1.version + 1, "Hello world",
      b1.body, b1.pub_date, b1.is_draft, b1.internal_comment
    FROM blog_entry b1
    WHERE b1.id = 42
      AND b1.version = (
        SELECT MAX(version) FROM blog_entry b2 WHERE b2.id = b1.id
      );

The query:

SELECT *
  FROM blog_entry 
  WHERE id = 42;

Gets transformed into:

SELECT *
  FROM blog_entry b1
  WHERE b1.id = 42
    AND b1.version = (
      SELECT MAX(version) FROM blog_entry b2 WHERE b2.id = b1.id
    );

Such plugin can also help with database engines which do not support views with INSTEAD OF triggers.

Do note that storing all history in the same table can be inefficient if only the latest version is almost always queried.

More ideas

A wide range of plugins could be made:

  • SQL is case insensitive. A plugin could automatically quote all identifiers to make queries case sensitive instead.
  • Generating query documentation from query placeholders and their expected types.
  • Reordering table fields to optimize the table.
  • Pretty printing of queries.
  • Static analyzer. Instead of rewriting the query, just validate all queries for some properties (e.g., that they do correct permission checks, or type check them). Or perform some other linting checks.

Queries can be made from the client side (e.g., browser) or on the server side as prepared queries. The latter could even be exposed as JSON RPC endpoints automatically.

Such approach (and general enough AST) could work also for parsing various SQL dialects into the same AST. But we do not even have to use SQL to get this AST. Other query languages like MongoDB queries and GraphQL could be parsed into AST as well?3 Or SQL-like query language PartiQL, too. One could even see REST as just a simple query language you can map to AST and then have a chain of plugins to add permission checks and everything else before querying the database, getting quickly a REST server implementation. JSON Schema could also be used to describe types. Many JSON query languages could be mapped to this AST as well.

Moreover, this AST does not have to be converted to just SQL at the end, but could in fact be converted into any query language for other database engines, or even just to query static files like JSON or CSV. You could run same query on the client side against an in-memory state and on the server side against an SQL-based database engine.

In a way all those technologies exist just to describe data structure and queries in different ways, but underlying principles are the same. Why then have those technologies so separated? Can we simply translate between them automatically as needed?


  1. Have I missed any major feature? Leave a comment if there is an important feature missing. [return]
  2. These days many database engines support triggers and writing them in a wide range of languages, e.g., PL/Python, PLV8. [return]
  3. I would even claim that their popularity is because they do not require parsing and are already in JSON-compatible representation. [return]
If you have any relevant link, project, idea, comment, feedback, critique,
language fix, or any other information, please share it bellow. Thanks.
Subscribe
Recent Tweets @mitar_m