
We have a file processing service that extracts some device configuration information from daily files and stores it into PostgreSQL tables.

We have two layers of tables:

  • The loading tables that contain records for every single day
  • The compiled tables that store aggregated records based on a hash and date_from to show a timeline of configuration changes. The hash includes all the fields except the primary key.

Following is a sample of the tables:

CREATE TABLE public.logger_main_config_loading (
 id text NOT NULL,
 hash text NOT NULL,
 logger_oem_id text NOT NULL,
 logger_model_id text,
 logger_serial_number text NOT NULL,
 logger_id text,
 logger_name text,
 date_from timestamp without time zone NOT NULL,
 date_to timestamp without time zone,
 encryption_pin_or_key text,
 enclosure_lock_details text,
 data_transfer_details text,
 offset_from_utc_hrs numeric,
 sampling_rate_sec integer,
 averaging_period_minutes integer,
 clock_is_auto_synced boolean,
 notes text,
 update_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
 updated_by uuid,
 measurement_location_uuid uuid NOT NULL,
 logger_firmware_version text
);

This table has the following Primary Key:

ALTER TABLE ONLY public.logger_main_config_loading
 ADD CONSTRAINT logger_main_config_loading_pkey
PRIMARY KEY (measurement_location_uuid, id, hash, date_from);

The table has the following indexes:

CREATE INDEX idx_logger_main_config_loading
 ON public.logger_main_config_loading
 USING btree (measurement_location_uuid, id, hash, date_from);
CREATE INDEX idx_main_config_loading_measurement_location_uuid ON public.logger_main_config_loading USING btree (measurement_location_uuid);

The logger_main_config_compiled table has exactly the same structure, but just stores records aggregated by id, hash and date_from.
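For reference, a minimal sketch of the compiled table's primary key, assuming it simply mirrors the loading table (the compiled table's actual constraint name is not shown in the question and is assumed here):

-- Assumed definition: same columns as logger_main_config_loading, with the
-- primary key constraint renamed for the compiled table.
ALTER TABLE ONLY public.logger_main_config_compiled
 ADD CONSTRAINT logger_main_config_compiled_pkey
 PRIMARY KEY (measurement_location_uuid, id, hash, date_from);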

Here is a sample query that we use to get aggregated records from the loading table and insert them in the compiled table:

SELECT
  log_main_load_1.*,
  log_main_load_2.date_to AS compiled_date_to
FROM
  logger_main_config_loading AS log_main_load_1
  INNER JOIN (
    SELECT
      id,
      hash,
      measurement_location_uuid,
      min(date_from) AS date_from,
      max(date_to) AS date_to
    FROM
      logger_main_config_loading
    WHERE measurement_location_uuid = %(measurement_location_uuid)s
    GROUP BY
      id,
      hash,
      measurement_location_uuid
  ) AS log_main_load_2
    ON log_main_load_1.id = log_main_load_2.id
    AND log_main_load_1.hash = log_main_load_2.hash
    AND log_main_load_1.measurement_location_uuid = log_main_load_2.measurement_location_uuid
    AND log_main_load_1.date_from = log_main_load_2.date_from
WHERE
  log_main_load_1.measurement_location_uuid = %(measurement_location_uuid)s;

We are planning to do away with these queries, do the aggregation purely in Python code, and insert the records into the compiled table, because some scenarios are not handled. For example, if the loading table already contains records from a particular date_from and a file dated before that is then processed (a new date_from earlier than the current min(date_from)), the compiled table ends up with a new record because the date_from is different (a different primary key). Instead, the existing record's date_from should be updated when the hash has not changed.
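To illustrate the scenario above, here is a hedged SQL sketch of one possible fix, assuming the intent is to pull an existing compiled row's date_from back when an earlier file with an unchanged hash is loaded; this is not the service's actual code, and the same adjustment could equally be done in Python:

-- Sketch only: when the loading table now holds an earlier date_from for the same
-- (measurement_location_uuid, id, hash), move the compiled row's date_from back
-- instead of letting a second compiled row appear.
UPDATE logger_main_config_compiled AS c
SET date_from = l.min_date_from
FROM (
  SELECT measurement_location_uuid, id, hash, min(date_from) AS min_date_from
  FROM logger_main_config_loading
  GROUP BY measurement_location_uuid, id, hash
) AS l
WHERE c.measurement_location_uuid = l.measurement_location_uuid
  AND c.id = l.id
  AND c.hash = l.hash
  AND l.min_date_from < c.date_from;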

Hence, we have the following questions:

  • What is the cleanest approach to achieve this? Is it recommended practice to delete the compiled records for each measurement_location_uuid every time and re-insert them after aggregating through code? I understand that deleting records causes index churn, but this table may not even hold a million records.
  • Based on the table structure, are there any other performance considerations, or indexes that should be added or removed?
  • We expect this implementation to be idempotent: even if a file is processed again, there should be no effect on the compiled table. In that case, delete-and-rewrite may be unnecessary.

I would really appreciate some suggestions so that this can be implemented in a simple and efficient manner.

Thank you in advance.

asked Jul 30, 2024 at 14:07

1 Answer


Assumptions:

  • Some of the missing table definitions are assumed to be irrelevant, so my answer may not be a complete match; I will update it once complete definitions are provided, if required.
  • A solution is provided for one issue only, as only one issue was mentioned.
  • date_from is assumed to be the date on which the value collection occurred on the device.
  • There is assumed to be a separate query that pulls data from logger_main_config_loading and populates logger_main_config_compiled; providing that query as well would help in giving the right answer.
  • date_to is assumed to be just an arbitrary column that equals date_from in logger_main_config_loading and is only really used in the compiled version.

PostgreSQL's INSERT supports updating rows that conflict with a given constraint while inserting the non-conflicting rows. Using that model you can update the already created rows for a given date from the files you are processing (which may be older than the current date). That works like a MERGE in other RDBMSs. Alternatively, if you are on a newer version of PostgreSQL (15 or later), MERGE itself is supported.

INSERT INTO logger_main_config_compiled AS lmcc
 (measurement_location_uuid, id, hash, date_from, date_to,
  logger_oem_id, logger_serial_number)             -- remaining columns as needed
VALUES
 (%(measurement_location_uuid)s, %(id)s, %(hash)s, %(date_from)s, %(date_to)s,
  %(logger_oem_id)s, %(logger_serial_number)s)
ON CONFLICT
 ON CONSTRAINT logger_main_config_compiled_pkey    -- the compiled table's own PK
DO UPDATE
 SET date_from = %(required_date_from)s
 -- , other_columns_to_update = other_col_values
;

Using the above model you insert rows that are new, update rows that conflict with the constraint, and, if you use DO NOTHING instead of DO UPDATE, leave already existing rows untouched.

Alternatively, you can specify the key columns directly as the conflict target: ON CONFLICT (measurement_location_uuid, id, hash, date_from) DO UPDATE ...
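For the idempotency point raised in the question, the same statement with the column-list conflict target and DO NOTHING leaves already compiled rows untouched when a file is reprocessed. A minimal sketch, listing only the key and NOT NULL columns:

-- Sketch only: reprocessing the same file produces the same key values,
-- so the conflicting insert is silently skipped.
INSERT INTO logger_main_config_compiled
 (measurement_location_uuid, id, hash, date_from,
  logger_oem_id, logger_serial_number)
VALUES
 (%(measurement_location_uuid)s, %(id)s, %(hash)s, %(date_from)s,
  %(logger_oem_id)s, %(logger_serial_number)s)
ON CONFLICT (measurement_location_uuid, id, hash, date_from)
DO NOTHING;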

PostgreSQL documentation: https://www.postgresql.org/docs/13/sql-insert.html
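Since MERGE (PostgreSQL 15 and later) is mentioned above, here is a minimal sketch of how it could look for this case, using the aggregated loading rows as the source; only the date_from adjustment is shown, and the column lists are an assumption based on the DDL in the question:

MERGE INTO logger_main_config_compiled AS c
USING (
  SELECT measurement_location_uuid, id, hash, min(date_from) AS date_from
  FROM logger_main_config_loading
  WHERE measurement_location_uuid = %(measurement_location_uuid)s
  GROUP BY measurement_location_uuid, id, hash
) AS l
ON c.measurement_location_uuid = l.measurement_location_uuid
 AND c.id = l.id
 AND c.hash = l.hash
WHEN MATCHED THEN
 UPDATE SET date_from = LEAST(c.date_from, l.date_from)
WHEN NOT MATCHED THEN
 DO NOTHING;  -- brand-new hashes would be inserted here with the full column list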

Other suggestions for performance:

  1. I would not use the text type for columns that could hold fixed-length data.

  2. I would place the variable-length columns after the fixed-length columns.

  3. I would use a (BIG)INT primary key for the table and turn the current PK into a unique index or constraint (see the sketch after this list). The current four-column PK is very wide; if the design calls for it, I would want to look at the design. Note that index entries in PostgreSQL have a limited length.

  4. I would not use Python to do work the database can already do efficiently.

  5. More optimization suggestions can be found in my LinkedIn post: https://www.linkedin.com/feed/update/urn:li:activity:7202221110774382593/?lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BePyg2kUaT4%2BozGBOYskkgw%3D%3D
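As a hypothetical sketch of suggestion 3 above, a BIGINT surrogate key with the current composite key demoted to a unique constraint could look like this (the surrogate_id column and the constraint names are illustrative only):

-- Add an identity column, then swap the primary key over to it and keep the
-- old composite key as a unique constraint.
ALTER TABLE public.logger_main_config_loading
 ADD COLUMN surrogate_id bigint GENERATED ALWAYS AS IDENTITY;

ALTER TABLE ONLY public.logger_main_config_loading
 DROP CONSTRAINT logger_main_config_loading_pkey;

ALTER TABLE ONLY public.logger_main_config_loading
 ADD CONSTRAINT logger_main_config_loading_pkey PRIMARY KEY (surrogate_id),
 ADD CONSTRAINT logger_main_config_loading_natural_key
  UNIQUE (measurement_location_uuid, id, hash, date_from);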

answered Aug 5, 2024 at 18:01
  • Thank you for your response. ON CONFLICT does not serve the purpose here, as a new record would always have a different date_from. Commented Aug 13, 2024 at 9:16
  • I was expecting more details in response, as my answer assumed some details that were not specified. Without more detail it is hard to understand the issue and suggest a fix. ON CONFLICT would work in the situation explained, where the date need not be updated when the hash matches. Additionally, if you would like to delete rows that need to be replaced, I see two options: 1. MERGE postgresql.org/docs/current/sql-merge.html 2. a data-modifying CTE, e.g. WITH delete_list AS (SELECT ...), insert_update AS (INSERT ... ON CONFLICT) DELETE FROM ...; Commented Aug 27, 2024 at 20:45
