I have a database with the following hierarchy.
- A dataset can have multiple scans (
foreign key scan.id_dataset -> dataset.id
) - A scan can have multiple labels (
foreign key label.id_scan -> scan.id
) - A label has a labeltype (
foreign key label.id_labeltype-> labeltype.id
) - The dataset, scan and labeltype tables all have a
name
column
I want to delete a label based on the names of the dataset, scan and labeltype. Here is my working code using the python mysql-connector:
def delete_by_labeltype(
self, dataset_name: str, scan_name, labeltype_name: str
) -> int:
sql = (
"DELETE l.* FROM label l "
"INNER JOIN labeltype lt "
"ON l.id_labeltype = lt.id "
"WHERE lt.name = %s AND l.id_scan = ("
" SELECT scan.id FROM scan "
" INNER JOIN dataset "
" ON dataset.id = scan.id_dataset "
" WHERE scan.name = %s AND dataset.name = %s"
");"
)
with self.connection.cursor() as cursor:
cursor.execute(
sql,
(
labeltype_name,
scan_name,
dataset_name,
),
)
self.connection.commit()
return cursor.rowcount
The function is within a class handling access to the label table so connection is already established (and later closed correctly). I am simply wondering if the query is the best way to do this, since I am not working with SQL too often.
1 Answer 1
nit: For consistency with scan.id
I would
choose the column name scan_id
instead of id_scan
, meh, whatever.
Kudos on correctly using bind variables,
keeping Little Bobby Tables out of your hair.
The %s
, %s
, %s
in the query don't seem
especially convenient, for matching up
against {label, scan, ds} names in a tuple
.
Prefer to use three distinct names,
and pass in a dict
.
If you do that habitually,
then when a query grows to mention half a dozen
parameters, and some newly hired maintenance engineer
adds a seventh one, you won't be sorry.
And I like the explicit COMMIT.
best way to do this?
Looks good to me. I didn't hear you complaining about "too slow!" or reliability / flakiness. You didn't say that EXPLAIN PLAN showed some index was being ignored.
Including an ERD entity-relationship diagram would have been helpful, or at least the CREATE TABLE DDL including FK definitions. Let me try to organize your prose. Here's what I heard.
- label --> labeltype
- label --> scan --> dataset
It's not obvious to me why scan
and dataset
should be in a subquery rather than at same level -- it
just arbitrarily came out that way, no biggie.
The l.id_scan = (
fragment suggests to me
that there's enough UNIQUE indexes on {scan, ds} name
to ensure we get no more than one result row from the subquery,
else you would have mentioned l.id_scan IN ( ...
.
It's perfectly fine the way it is.
Maybe JOIN label against labeltype and also against scan?
(And then scan against ds.)
If I was chicken, I would do all the JOINing in SELECT statements,
obtain an id
that I could do a sanity check on,
and then send a simple DELETE single row command.
But you're braver than me, and it works, so that's cool.
When reasoning about complex JOINs I often find it convenient
to bury some of the complexity behind a VIEW.
Minimally you would want to define a scan_dataset_v
view
so you never have to type the name of the "dataset" table again.
CREATE VIEW scan_dataset_v AS
SELECT s.*, ds.*
FROM scan s
JOIN dataset ds ON ds.id = s.id_dataset;
Now label
has just two relationships to worry about.
(And, the *
stars probably have at least
one conflicting column name -- just write them all out.
Absent the DDL, I don't know what all the columns are.)
You might possibly find it convenient to produce
a view that JOINs them both to label
,
and another that JOINs labeltype
to label
.
It just saves some typing during interactive queries.
There's no efficiency hit -- think of it as the
backend doing a macro expansion before issuing the query.
Summary: Any improvements? Nope, not really, LGTM.