Motivation:
I am building a webserver that is essentially a dashboard app. Users make requests and get analytics data and visualizations back. Multiple users can be in the same "workspace," where they use the same underlying datasource to power dashboards.
In this sense, N users would make requests that use the same instance of an underlying database component, and a different M users might use another instance of that same db component.
Typically, webservers serve individual requests from scratch, but in this case, it seems sensible to have a collection of initialized db instances per-workspace, such that a request from any user in the same workspace would always get that instance (creating db components from scratch for any given request seems heavyweight when I could manage a collection of already-active pools, one pool per workspace). My question is, what kinds of patterns exist for this in webserver backends?
In an attempt to answer my own question, what I've (potentially) landed on is:
```python
# app.py (pseudocode)
app = FastAPI()

@app.post('/{workspace_id}/dashboard-plz')
def fetch_dashboard(...):
    # auth the user, etc. then,
    data = data_service.get_data(workspace_id, specific_data_request)
    return dashboardify(data)
```
```python
# data_service.py
import threading

def get_data(workspace_id, specific_data_request):
    sql = turn_into_sql(specific_data_request)
    workspace_datasource = _get_or_init_workspace_datasource(workspace_id)
    data = workspace_datasource.query(sql)
    return data

_cached_workspace_datasources: dict[WorkspaceId, Datasource] = {}
_lock = threading.Lock()

def _get_or_init_workspace_datasource(workspace_id):
    with _lock:
        if workspace_id not in _cached_workspace_datasources:
            ds = Datasource.from_workspace_id(workspace_id)
            ds.init()
            _cached_workspace_datasources[workspace_id] = ds
        return _cached_workspace_datasources[workspace_id]

class Datasource:
    def __init__(self, db_url, **engine_kwargs):
        self._engine = create_engine(db_url, **engine_kwargs)

    @classmethod
    def from_workspace_id(cls, workspace_id):
        # get db_url via workspace_id
        return cls(db_url)

    def query(self, sql):
        # ...
```
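To sanity-check the get-or-create behavior in isolation, here is a minimal runnable sketch of the same lookup; `FakeDatasource` and `get_or_init` are illustrative stand-ins for the real engine wrapper and `_get_or_init_workspace_datasource`:

```python
import threading

class FakeDatasource:
    """Stand-in for the real engine wrapper; counts initializations."""
    init_count = 0

    def __init__(self, workspace_id):
        self.workspace_id = workspace_id

    def init(self):
        FakeDatasource.init_count += 1

_cache: dict[str, FakeDatasource] = {}
_lock = threading.Lock()

def get_or_init(workspace_id: str) -> FakeDatasource:
    with _lock:
        if workspace_id not in _cache:
            ds = FakeDatasource(workspace_id)
            ds.init()
            _cache[workspace_id] = ds
        return _cache[workspace_id]

a = get_or_init("ws-1")
b = get_or_init("ws-1")
c = get_or_init("ws-2")
assert a is b                           # same workspace -> same instance
assert a is not c                       # different workspace -> different instance
assert FakeDatasource.init_count == 2   # init ran once per workspace
```

The property being tested is the one the question cares about: within one process, all requests for a workspace share one already-initialized instance.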
Essentially, I'm applying a lazy-load/GetOrAdd-style lookup to a wrapper around workspace-level database engines, because requests from multiple users in the same workspace shouldn't have to recreate such a thing. The lock matters in production setups where multiple threads serve requests, because the `workspace_id in dict` membership check is not atomic with the subsequent dict assignment. I do think this all works, at the cost of process-level duplication of entries in `_cached_workspace_datasources` when the prod config runs multiple worker processes and requests get routed to any of them. However...
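One refinement worth considering (a sketch with illustrative names, not part of the original design): replace the single global lock with per-workspace locks, so a slow `Datasource.init()` for one workspace doesn't serialize cache misses for every other workspace. The global lock is then held only long enough to fetch the per-workspace lock:

```python
import threading
from collections import defaultdict

_cache: dict[str, object] = {}
_meta_lock = threading.Lock()  # guards the lock table only
_workspace_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

def get_or_init(workspace_id, factory):
    ds = _cache.get(workspace_id)  # fast path: instance already exists
    if ds is not None:
        return ds
    with _meta_lock:  # briefly held: fetch/create this workspace's lock
        ws_lock = _workspace_locks[workspace_id]
    with ws_lock:  # a slow factory call blocks only this workspace
        if workspace_id not in _cache:
            _cache[workspace_id] = factory(workspace_id)
        return _cache[workspace_id]

# Demonstration: the factory runs once per workspace.
created = []
def _factory(ws):
    created.append(ws)
    return object()

first = get_or_init("ws-1", _factory)
second = get_or_init("ws-1", _factory)
```

Note this doesn't address the multi-process duplication mentioned above; each worker process still builds its own cache.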
Again, my question is, is this kosher? Is there a standard way to do this kind of thing, where some (typically heavyweight) state can be shared across a relevant (subset) of requests?
-
Frame challenge: is the database connection really so heavyweight? Can it properly handle requests from multiple threads, also when a second request comes in while you are working on a database transaction of another user? – Bart van Ingen Schenau, Apr 30, 2024 at 6:31
-
An excellent challenge. A connection itself is not horribly heavyweight, no; it would be more precise for me to discuss workspace-level connection pools, which I would argue are heavyweight (and in this context I know to be thread-safe). That said, your point is worth considering for the general case. If I missed your point entirely, i.e. why bother with maintaining all of these pools at all: it certainly seems like I'll end up reimplementing pooling in a roundabout way, but perhaps I'm mistaken. – StarFox, Apr 30, 2024 at 14:22
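For context on the pooling point: SQLAlchemy's `create_engine` already builds a thread-safe connection pool per engine, so the cache in the question is really caching one pool per workspace rather than single connections. A sketch of the pool-related knobs (URL and values illustrative, not from the question):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@host/db",  # illustrative URL
    pool_size=5,         # steady-state connections kept open
    max_overflow=10,     # extra connections allowed under bursts
    pool_timeout=30,     # seconds to wait for a free connection
    pool_pre_ping=True,  # validate connections before handing them out
)
```

Given this, the per-workspace cache isn't reimplementing pooling; it is deciding how many pools exist and which requests share one, which the pool itself can't do.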