Unlike other inventory management systems, where items are fungible, in this system they are not supposed to be. Each item is serialized and tracked from the printing press onward. We are trying to move away from keeping track of it manually. Due to the volume (more than 4 million documents per year) and the intrinsic characteristics of the serialization (incrementing integers), we came up with the following model:
class Documents(models.Model):
    document_type = models.IntegerField(choices=DocumentsStatusHistory.DocumentTypes.choices)
    series = models.PositiveIntegerField()
    serial_range = IntegerRangeField()
    located_at = models.ForeignKey(Warehouse, on_delete=models.PROTECT)  # on_delete is required; PROTECT assumed
where the document type describes an arbitrary, predefined type, the series changes to indicate the printing cycle, and serial_range is the "compressed" serial number. Due to a compliance requirement for full tracking records of the movement of that inventory per individual document, we adopted the system that was already in use and that seemed most familiar to our would-be users. Since it is not expected that the entire inventory moves in bulk, but rather a subset of it at each specific warehouse, for either use or distribution, we need to slice and splice the serial_range value, which would "pop out" that range. This is what we came up with:
document_record_to_splice = Documents.objects.get(id=1)
# For this example, I'm presuming that id=1 has the correct series and document
# type, and that the new range is contained by the object being sliced/spliced.
if document_record_to_splice.serial_range == new_serial_range:
    # We return early if the range is equal to the entire record
    # (i.e. new_range == existing_range).
    return document_record_to_splice
if Documents.objects.filter(serial_range__overlaps=new_serial_range).count() > 1:
    # We also raise an exception if the range straddles two separate records.
    raise Exception
# Here is where everything starts
orig_bin = document_record_to_splice
# We keep a copy of all of the fields of the original
o_serial_range = orig_bin.serial_range
document_type = orig_bin.document_type
series = orig_bin.series
located_at = orig_bin.located_at
o_document_status = orig_bin.document_status
# We create the new range objects that are not part of the intersection of the
# ranges. Something to improve here is that we presume that the intersection
# doesn't match any of the bounds of the existing object.
from psycopg.types.range import Range

# int4range values come back canonicalized to '[)' bounds, so the two
# remainders can reuse the bounds of the spliced-out range directly.
lower_serial_range = Range(o_serial_range.lower, new_serial_range.lower)
upper_serial_range = Range(new_serial_range.upper, o_serial_range.upper)
# We delete the original object, since we need to make sure that all records
# have a continuous set of adjacent ranges for any document type or series,
# and then save the pieces that are not part of the transferred range.
orig_bin.delete()
for n_serial_range in [lower_serial_range, upper_serial_range]:
    orig_bin.pk = None
    orig_bin.serial_range = n_serial_range
    orig_bin.full_clean()
    orig_bin.save()
# We create our new record, with a copy of all the information that we
# cached about it before.
new_bin = Documents(
    document_type=document_type,
    series=series,
    serial_range=new_serial_range,
    document_status=o_document_status,
    located_at=located_at,
)
new_bin.full_clean()
new_bin.save()
# The record of the transaction was saved in the function above.
This is how we would do it with raw SQL, using little of the Django ORM beyond the .raw() helper:
with old_documents as (
    delete from manager_documents
    where document_type = %(document_type)s
      and series = %(series)s
      and serial_range @> %(new_range)s::int4range
    returning document_type, document_status, created_at, series,
        multirange(serial_range) - multirange(%(new_range)s::int4range) as kept_ranges,
        parent_transaction_id, located_at_id
)
insert into manager_documents
    (document_type, document_status, created_at, series, serial_range,
     parent_transaction_id, located_at_id)
select document_type, document_status, created_at, series, unnest(kept_ranges),
       parent_transaction_id, located_at_id
from old_documents;
That last statement would replace all the dancing around copying the existing object.
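From Django it could be executed roughly like this (a sketch: SPLICE_SQL is a stand-in name for the statement above, and a plain cursor is used because the final INSERT returns no model rows, which .raw() would expect):

from django.db import connection

# Sketch only: SPLICE_SQL holds the CTE statement above; psycopg can adapt a
# Range value for the %(new_range)s placeholder.
with connection.cursor() as cursor:
    cursor.execute(
        SPLICE_SQL,
        {
            "document_type": document_record_to_splice.document_type,
            "series": document_record_to_splice.series,
            "new_range": new_serial_range,
        },
    )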
Now my concerns are:
- Is the SQL unavoidable if I don't want that logical complexity in Python?
- Are there any concerns with race conditions under either solution? (Technically, any code calling the splicing function should be wrapped in transaction.atomic(); see the sketch below.)
- Am I missing a standard Django ORM API that would avoid writing raw SQL?
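For reference, a call site is expected to look roughly like this (a sketch; splice_documents is a stand-in name for the splice logic above):

from django.db import transaction

with transaction.atomic():
    # splice_documents is a stand-in name for the splice logic shown above.
    spliced = splice_documents(document_record_to_splice, new_serial_range)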
1 Answer
complex DELETE
I confess I don't quite understand the motivation behind that SQL statement. It seems to be about taking a set difference to produce kept_ranges and then un-nesting it?
I predict that when you recruit additional engineers to this project, they will be reluctant to maintain that statement, as it looks "dangerous": it looks like it would be easy to accidentally discard important rows in the course of testing it.
Consider adding an is_old column to the base relation, or perhaps to a VIEW that ties a two-column table to the base relation. We update it to show which rows are "old", INSERT corresponding rows, and DELETE the original rows. Notice that an interactive tester can perform just the first step and verify that it seems to be going as expected.
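A minimal sketch of that two-step flow, assuming an is_old = models.BooleanField(default=False) is added to Documents (all names here are illustrative, not the OP's code):

from django.db import transaction
from psycopg.types.range import Range

def splice_with_flag(document_type, series, new_serial_range):
    with transaction.atomic():
        # Step 1: only flag the rows about to be replaced; an interactive
        # tester can stop here and inspect the selection before anything
        # is destroyed.
        Documents.objects.filter(
            document_type=document_type,
            series=series,
            serial_range__contains=new_serial_range,
        ).update(is_old=True)
        # Step 2: insert the remainder rows, then delete the flagged
        # originals. As in the OP's code, this presumes the spliced range
        # doesn't touch the bin's own bounds.
        for old in Documents.objects.filter(is_old=True):
            for piece in (
                Range(old.serial_range.lower, new_serial_range.lower),
                Range(new_serial_range.upper, old.serial_range.upper),
            ):
                Documents.objects.create(
                    document_type=old.document_type,
                    series=old.series,
                    serial_range=piece,
                    located_at=old.located_at,
                )
        Documents.objects.filter(is_old=True).delete()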
automated tests
The OP would be more easily understood if it were accompanied by a small test suite. Tests have documentation value.
Including the output of SHOW CREATE TABLE ... would also make it easier to understand the OP code.
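For example, a first test could look something like this (hypothetical names throughout; splice_documents stands in for the OP's splice logic, and '[)' range bounds are assumed):

from django.test import TestCase
from psycopg.types.range import Range

class SpliceTests(TestCase):
    def setUp(self):
        self.warehouse = Warehouse.objects.create(name="main")  # fields assumed

    def test_splice_pops_out_the_middle_of_a_range(self):
        Documents.objects.create(
            document_type=1,
            series=1,
            serial_range=Range(1, 1001),  # serials 1..1000 with '[)' bounds
            located_at=self.warehouse,
        )
        splice_documents(Documents.objects.get(), Range(401, 601))
        bounds = sorted(
            (d.serial_range.lower, d.serial_range.upper)
            for d in Documents.objects.all()
        )
        # Two remainders plus the spliced-out range: contiguous, no overlaps.
        self.assertEqual(bounds, [(1, 401), (401, 601), (601, 1001)])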
diagnostic error message
This is terrible:
raise Exception
Maybe it "never" happens? We need a better description of what went south, so a maintenance engineer has some hope of diagnosing the root cause and fixing the bug.
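Something along these lines already helps (a sketch; the exception class name is made up):

class SpliceRangeError(Exception):
    """A requested range cannot be spliced out of a single record."""

overlapping = Documents.objects.filter(serial_range__overlaps=new_serial_range)
if overlapping.count() > 1:
    raise SpliceRangeError(
        f"range {new_serial_range} overlaps {overlapping.count()} records; "
        f"expected exactly one, so it probably straddles two adjacent bins"
    )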
informative identifier names
for n_serial_range in ... :
I think that means for new_serial_range? Saving the two characters "ew" hardly seems worth it. If you want a shorter identifier, elide the word _serial instead.
simplicity
I don't know how big your ranges are, nor what memory constraints you work within. But consider finding the set difference with something like
range2 = range(lo, hi)
range3 = set(range1) - set(range2)
It's horribly inefficient, but there are no special cases. Possibly you'll want a helper which inspects the set(), verifies "no gaps!", and turns it back into a range().
Or write a proper SerialRange class that uses arithmetic to accomplish the same thing. I think your code avoids MultiRanges?
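A sketch of that helper, which refuses to re-pack a set that has gaps (illustrative only, not the OP's code):

def set_to_range(serials: set[int]) -> range:
    """Turn a set of serial numbers back into one contiguous range."""
    lo, hi = min(serials), max(serials)
    if len(serials) != hi - lo + 1:
        raise ValueError(f"serial numbers between {lo} and {hi} have gaps")
    return range(lo, hi + 1)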
- I will go backwards commenting: IntegerRangeField is a special range type of postgres (int4range), which would probably be familiar if you were implementing a scheduling application. It's not a Python range, because it has bounds. I did mention how we could manage 4 million records in a single operation; that means range(1, 4000000). Last time I did that, my system froze for a while trying to do it for a form, which is why the "compact" option was preferred. – Braiam, Jan 7, 2025 at 21:51
- There are automated tests, but the help pages gave no indication that they were being looked for. I tried to describe the system as best as I could. The purpose of the SQL query is to do the whole range operation and the deletion of the preexisting record so it can be recorded. I could pass a PK to identify the record, but I was trying my best to keep everything explicit. Something like target_record = Documents.objects.filter(...) followed by Documents.objects.raw(..., params={"id": target_record.pk, "new_range": new_range}). – Braiam, Jan 7, 2025 at 21:58
- There's an obvious drawback, which is that the SQL must have new columns spelled out if the model ever changes (which I believe it should not, knock on wood), but unless postgres changes something it shouldn't need to be touched. Also, the function is in a hot path, because every transaction has the potential of hitting it. – Braiam, Jan 7, 2025 at 22:00
- Note that you will need a def ... around something containing a return statement. You don't have to verbosely copy model attributes to variables unless you have a very weird custom delete: all instance fields (except for pk? not sure) remain available after calling .delete().