We have a file-processing service that extracts device configuration information from daily files and stores it in PostgreSQL tables.
We have two layers of tables:
- The loading tables that contain records for every single day
- The compiled tables that store aggregated records based on a hash and date_from to show a timeline of configuration changes. The hash includes all the fields except the primary key.
Following is a sample of the tables:
CREATE TABLE public.logger_main_config_loading (
id text NOT NULL,
hash text NOT NULL,
logger_oem_id text NOT NULL,
logger_model_id text,
logger_serial_number text NOT NULL,
logger_id text,
logger_name text,
date_from timestamp without time zone NOT NULL,
date_to timestamp without time zone,
encryption_pin_or_key text,
enclosure_lock_details text,
data_transfer_details text,
offset_from_utc_hrs numeric,
sampling_rate_sec integer,
averaging_period_minutes integer,
clock_is_auto_synced boolean,
notes text,
update_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
updated_by uuid,
measurement_location_uuid uuid NOT NULL,
logger_firmware_version text
);
This table has the following Primary Key:
ALTER TABLE ONLY public.logger_main_config_loading
ADD CONSTRAINT logger_main_config_loading_pkey
PRIMARY KEY (measurement_location_uuid, id, hash, date_from);
The table has the following indexes:
CREATE INDEX idx_logger_main_config_loading
ON public.logger_main_config_loading
USING btree (measurement_location_uuid, id, hash, date_from);
CREATE INDEX idx_main_config_loading_measurement_location_uuid ON public.logger_main_config_loading USING btree (measurement_location_uuid);
The logger_main_config_compiled table has exactly the same structure, but just stores records aggregated by id, hash and date_from.
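To make the two layers concrete, here is a small hypothetical example (values invented purely for illustration): daily loading rows with an unchanged configuration share the same hash and collapse into one compiled row, while a hash change starts a new compiled row.
-- logger_main_config_loading (hypothetical rows):
--   id=LOG1  hash=abc  date_from=2024年01月01日  date_to=2024年01月01日
--   id=LOG1  hash=abc  date_from=2024年01月02日  date_to=2024年01月02日
--   id=LOG1  hash=def  date_from=2024年01月03日  date_to=2024年01月03日   (a configuration field changed)
-- logger_main_config_compiled (one row per id/hash, spanning min(date_from)..max(date_to)):
--   id=LOG1  hash=abc  date_from=2024年01月01日  date_to=2024年01月02日
--   id=LOG1  hash=def  date_from=2024年01月03日  date_to=2024年01月03日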
Here is a sample query that we use to get aggregated records from the loading table and insert them in the compiled table:
SELECT
log_main_load_1.*,
log_main_load_2.date_to AS compiled_date_to
FROM
logger_main_config_loading AS log_main_load_1
INNER JOIN (
SELECT
id,
hash,
measurement_location_uuid,
min(date_from) AS date_from,
max(date_to) AS date_to
FROM
logger_main_config_loading
WHERE measurement_location_uuid = %(measurement_location_uuid)s
GROUP BY
id,
hash,
measurement_location_uuid
) AS log_main_load_2
ON
log_main_load_1.id = log_main_load_2.id
AND log_main_load_1.hash = log_main_load_2.hash
AND log_main_load_1.measurement_location_uuid = log_main_load_2.measurement_location_uuid
AND log_main_load_1.date_from = log_main_load_2.date_from
WHERE
log_main_load_1.measurement_location_uuid = %(measurement_location_uuid)s;
We are planning to do away with these queries, perform the aggregation purely in Python code, and insert the resulting records into the compiled table, because some scenarios are not handled. For example, suppose the loading table already contains records from a particular date_from, and a file dated before that is processed (a new date_from earlier than the current min(date_from) in the table). The compiled table then ends up with a new record, because the different date_from gives a different primary key; what we actually need is for the existing compiled record's date_from to be updated when the hash has not changed.
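To illustrate that scenario with hypothetical values:
-- Compiled row already present for a given logger configuration:
--   id=LOG1  hash=abc  date_from=2024年01月05日  date_to=2024年01月10日
-- A file for 2024年01月02日 with the same hash is processed afterwards. The aggregation query now
-- returns min(date_from) = 2024年01月02日, which is a different primary key, so the insert
-- produces a second compiled row instead of moving the existing row's date_from back:
--   id=LOG1  hash=abc  date_from=2024年01月02日  date_to=2024年01月10日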
Hence, I have the following questions:
- What is the cleanest approach to achieve this? Is it recommended practice to delete the compiled table records each time and re-insert them after aggregating them in code for each measurement_location_uuid? I do understand that indexes have to be maintained each time records are deleted and re-inserted; however, this table may not even have a million records.
- Based on the table structure, are there any other performance considerations, or indexes that need to be added or removed?
- We expect this implementation to be idempotent: even if a file is processed again, there should be no effect on the compiled table. In that case, delete-and-rewrite may be unnecessary.
I would really appreciate suggestions on how this can be implemented in a simpler and more efficient manner.
Thank you in advance.
1 Answer
Assumptions (so my answer may not be a complete match; I will update it once complete definitions are provided, if required):
- Some missing table definitions are assumed not to be relevant.
- A solution is provided for one issue only, as only one issue was mentioned.
- date_from is assumed to be the date on which the values were collected on the device.
- There is assumed to be a separate query that pulls data from logger_main_config_loading and populates logger_main_config_compiled; providing that query as well would help in giving the right answer, and the answer would be updated accordingly.
- date_to is assumed to be an arbitrary column that equals date_from in logger_main_config_loading and is only really used in the compiled version.
Answer: PostgreSQL's INSERT supports updating rows that conflict with a given constraint while inserting the non-conflicting rows. Using that model, you can update the already-created rows for a given date with the files you are processing (which may be older than the current date). That works like a MERGE in other RDBMSs. Alternatively, if you are using a newer version of PostgreSQL, MERGE itself is supported.
INSERT
INTO logger_main_config_compiled AS lmcc
    ( column comma list )
VALUES
    ( value comma list )
ON CONFLICT
    ON CONSTRAINT logger_main_config_compiled_pkey  -- the compiled table's PK constraint (name assumed), not the loading table's
DO UPDATE
    SET date_from = {required_date_from}
      , ...... [other_columns_to_update = other_col_values]
;
Using the above model you insert rows that are new, update the existing compiled rows that conflict with incoming values, and leave all other existing rows untouched.
Alternatively, you can specify the columns of the key/constraint directly: ON CONFLICT (measurement_location_uuid, id, hash, date_from) DO UPDATE ......
PostgreSQL documentation: https://www.postgresql.org/docs/13/sql-insert.html
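For illustration, a minimal sketch of that column-list form against the compiled table (only the NOT NULL columns plus date_to are shown, and the %(name)s parameter names are assumptions following the style of the query in the question):
INSERT INTO logger_main_config_compiled AS lmcc
    (measurement_location_uuid, id, hash, date_from, date_to,
     logger_oem_id, logger_serial_number)
VALUES
    (%(measurement_location_uuid)s, %(id)s, %(hash)s, %(date_from)s, %(date_to)s,
     %(logger_oem_id)s, %(logger_serial_number)s)
ON CONFLICT (measurement_location_uuid, id, hash, date_from)
DO UPDATE
    SET date_to = EXCLUDED.date_to;   -- refresh the end of the period when the same row is re-processed
Note that because date_from is part of the conflict target, a record arriving with an earlier date_from does not conflict at all, which is the limitation raised in the comments below.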
Other suggestions for performance:
- I would not use the text type for columns that could hold fixed-length data.
- I would put all the variable-length columns last, after the fixed-length columns.
- I would use a (BIG)INT primary key for the table and turn the current PK into a unique index or constraint (see the sketch after this list). The current PK is very wide to work with; if the design calls for it, I would want to look at the design. FYI, index entries in PostgreSQL have a limited data length.
- I would not use Python to do work the database can already do efficiently.
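For illustration, a rough sketch of that surrogate-key suggestion (the column name surrogate_id and the constraint name are hypothetical, not part of the existing schema):
ALTER TABLE public.logger_main_config_loading
    ADD COLUMN surrogate_id bigint GENERATED ALWAYS AS IDENTITY;
ALTER TABLE public.logger_main_config_loading
    DROP CONSTRAINT logger_main_config_loading_pkey,
    ADD PRIMARY KEY (surrogate_id),
    ADD CONSTRAINT logger_main_config_loading_natural_key
        UNIQUE (measurement_location_uuid, id, hash, date_from);
The unique constraint keeps the original combination enforceable (and usable as an ON CONFLICT target) while the narrow bigint key keeps the primary-key index small.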
More optimization suggestions can be found in my LinkedIn post: https://www.linkedin.com/feed/update/urn:li:activity:7202221110774382593/?lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BePyg2kUaT4%2BozGBOYskkgw%3D%3D
- Thank you for your response. ON CONFLICT does not serve the purpose here, as a new record would always have a different date_from. – ShwetaJ, Aug 13, 2024 at 9:16
- I was expecting more details in the response, as my answer assumed some details that were not specified. Without more details it is hard to understand the issue and make suggestions. ON CONFLICT would work in the situation explained, where the date need not be updated when the hash matches. Additionally, if you would like to delete rows that need to be replaced, I see two options: 1. MERGE (postgresql.org/docs/current/sql-merge.html); 2. WITH delete_list AS (SELECT ...), insert_update AS (INSERT ... ON CONFLICT) DELETE FROM ... ; – Raja, Aug 27, 2024 at 20:45