Database-abstraction APIs where you write a query using the host programming language should not exist. Or more precisely, should not have to exist. For example, in Django you can query the database using the following Python code:
Entry.objects.filter(is_draft=True)
Which Django translates (roughly) into the following SQL:
SELECT *
FROM blog_entry
WHERE is_draft = true;
But why we cannot write SQL query directly as an SQL query, while retaining all other features Django offers through its database-abstraction API (database agnostic code, inputs to queries and outputs from queries being Python objects, etc.)? I claim there is no reason anymore for that.
I have confidence in this claim because of the success of Babel JavaScript transpiler and the ecosystem around it. Babel can take a newer version of JavaScript as input and converts it so that it can run in older JavaScript runtimes. It maps new features to old ones. It allows one to use syntax of new or even just proposed features of the JavaScript language. It allows experimentation around the language.
The approach it takes for this is that Babel first parses the input code into abstract syntax tree (AST), an internal representation of the code, then converts the tree so that it represents the code which can run in older JavaScript runtimes, and then renders AST back to textual representation of the code.
Babel is successful because it has an ecosystem of plugins that make it even more powerful, allowing integration with various other libraries and programming language patterns, so that you can tailor programing in JavaScript to your needs and preferences.
Such high-level database-abstraction APIs exist in almost all programming languages and are often paired with object-relation mapping (ORM) capabilities to present data from the database as objects in the host language.
What features do we currently get (or are at least promised) from these database-abstraction APIs?1
But in practice there are many limits:
*args, **kwargs
in Python).If you look at the Python code example above, it uses Python to generate AST and then renders this AST to SQL. But why not use SQL itself to generate this AST? And then use plugins to transform AST to implement various features we want?
First, it is important to note that there are two types of SQL
statements: data definition statements
and queries. Parsing data definitions statements (e.g., CREATE TABLE
)
allows us to extract typing information about data. We can use this typing information
to type check queries as well. Moreover, we can use that information
to know how to convert query results to host language objects.
Depending on the integration with the host language, type checking and validation
of both queries and argument values might be possible at the compile time,
or at least at runtime before even hitting the database engine.
Having plugins which can operate on AST allows developer to write simple queries which are then transformed further. This achieves decoupling of concerns where original query deals with what data is needed at that point in the program while how is that data obtained, queried, structured in the database, or if query is even allowed, can all be delegated to plugins.
If original simple query looks like:
SELECT *
FROM blog_entry
WHERE is_draft = true;
But only the owner of a blog entry should be able to see it while it is draft. A plugin for row-level permission checks could transform this query automatically into:
SELECT *
FROM blog_entry
WHERE is_draft = true
AND owner_id = current_user_id();
Furthermore, the plugin could change definition of blog_entry
table
to include necessary owner_id
column.
Doing such checks manually is tedious and error prone, especially when you have nested queries or joins and you have to make sure you add check to everywhere.
We can expand this example further if we introduce groups of users and allow multiple users to own a blog entry:
SELECT DISTINCT id, title, body, pub_date, is_draft, internal_comment
FROM blog_entry
LEFT JOIN blog_entry_owner_users
ON blog_entry.id = blog_entry_owner_users.entry_id
LEFT JOIN blog_entry_owner_groups
ON blog_entry.id = blog_entry_owner_groups.entry_id
LEFT JOIN groups_to_users
ON blog_entry_owner_groups.group_id = groups_to_users.group_id
WHERE is_draft = true
AND (
blog_entry_owner_users.user_id = current_user_id()
OR groups_to_users.user_id = current_user_id()
);
The plugin has information about blog_entry
table and can enumerate all columns
when necessary. Here it makes sure that returned columns match the
original query.
Enjoy writing this again and again. This is why people prefer doing permission checks in application code.
Plugin could also support permissions on columns. For example,
internal_comment
field should be visible only to blog entry owners.
SELECT *
FROM blog_entry
WHERE id = 42;
Could get transformed into:
SELECT DISTINCT id, title, body, pub_date, is_draft,
CASE blog_entry_owner_users.user_id = current_user_id()
OR groups_to_users.user_id = current_user_id()
WHEN true THEN internal_comment
END AS internal_comment
FROM blog_entry
LEFT JOIN blog_entry_owner_users
ON blog_entry.id = blog_entry_owner_users.entry_id
LEFT JOIN blog_entry_owner_groups
ON blog_entry.id = blog_entry_owner_groups.entry_id
LEFT JOIN groups_to_users
ON blog_entry_owner_groups.group_id = groups_to_users.group_id
WHERE id = 42
AND (
is_draft = false
OR (
is_draft = true
AND (
blog_entry_owner_users.user_id = current_user_id()
OR groups_to_users.user_id = current_user_id()
)
)
);
If you want to keep all historical versions of all data, you can
setup a database view which modifies UPDATE
queries into INSERT
queries and for SELECT
queries only returns the latest version.
But managing views and their migration can add additional complexity,
so we can instead use a plugin which transforms queries in this
way before they hit the database engine.
The plugin could automatically add version
field to all tables and
then a query like:
UPDATE blog_entry
SET title = "Hello world"
WHERE id = 42;
Gets transformed into:
INSERT INTO blog_entry
SELECT b1.id, b1.version + 1, "Hello world",
b1.body, b1.pub_date, b1.is_draft, b1.internal_comment
FROM blog_entry b1
WHERE b1.id = 42
AND b1.version = (
SELECT MAX(version) FROM blog_entry b2 WHERE b2.id = b1.id
);
The query:
SELECT *
FROM blog_entry
WHERE id = 42;
Gets transformed into:
SELECT *
FROM blog_entry b1
WHERE b1.id = 42
AND b1.version = (
SELECT MAX(version) FROM blog_entry b2 WHERE b2.id = b1.id
);
Such plugin can also help with database engines which do not support
views with INSTEAD OF
triggers.
Do note that storing all history in the same table can be inefficient if only the latest version is almost always queried.
A wide range of plugins could be made:
Queries can be made from the client side (e.g., browser) or on the server side as prepared queries. The latter could even be exposed as JSON RPC endpoints automatically.
Such approach (and general enough AST) could work also for parsing various SQL dialects into the same AST. But we do not even have to use SQL to get this AST. Other query languages like MongoDB queries and GraphQL could be parsed into AST as well?3 Or SQL-like query language PartiQL, too. One could even see REST as just a simple query language you can map to AST and then have a chain of plugins to add permission checks and everything else before querying the database, getting quickly a REST server implementation. JSON Schema could also be used to describe types. Many JSON query languages could be mapped to this AST as well.
Moreover, this AST does not have to be converted to just SQL at the end, but could in fact be converted into any query language for other database engines, or even just to query static files like JSON or CSV. You could run same query on the client side against an in-memory state and on the server side against an SQL-based database engine.
In a way all those technologies exist just to describe data structure and queries in different ways, but underlying principles are the same. Why then have those technologies so separated? Can we simply translate between them automatically as needed?