
I have lots of time-series data going into a Postgres DB. The time-series data arrives on a message queue, is read by a consumer script, and is inserted into the database. The data flows continuously into the system throughout the day and the load stays fairly consistent.

The table which holds the data is partitioned on the timestamp and is only ever written to by this script. It is read from by other services.

I was wondering if it would be possible/appropriate to use a long-running COPY to stream the data directly into the DB, or if it would be better to batch the results and run multiple inserts / short-lived copies.

I'm using Python and the psycopg (psycopg3) driver, which seems to have a very clean interface for copying data. I was thinking of something like the solution below:

async def packets() -> AsyncGenerator[..., None]:
    """
    Infinite async generator that yields packets from the incoming stream.
    """
    while True:
        yield ...

async with cursor.copy("COPY data FROM STDIN") as copy:
    async for packet in packets():
        await copy.write(packet)

My current solution is much closer to the following:

async def insert(packets: Sequence[...]) -> None:
    """Bulk insert packets into the DB."""
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.executemany("...", packets)

buffer = []
async for packet in packets():
    buffer.append(packet)
    if len(buffer) == 100:
        await insert(buffer)
        buffer.clear()

Both of the Python solutions above are quite naive; I know there are many things that could be done to optimise either one (especially the batching).

EDIT: I don't think I made it particularly clear in the initial question: the data is coming into the system live and doesn't stop. Whenever I've seen insert speeds discussed, COPY is almost always significantly faster, but those comparisons usually measure the time to restore a DB or some other task that runs and eventually completes. This process runs for an indefinite amount of time, continuously ingesting data.

asked Feb 10, 2023 at 11:03
  • What should happen when COPY (or INSERT) returns an error? Presumably you can't just tell the streaming process to restart its stream. Commented Feb 10, 2023 at 16:01

1 Answer


No matter how many rows you load in a single COPY statement, it will always be a single transaction, so the data won't be visible until the statement has finished. Moreover, it wouldn't be smart to keep long-running transactions open all the time.

So my advice is to use COPY, but to batch the rows loaded so that a single COPY statement doesn't take more than a couple of seconds.
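In practice that could look something like the sketch below, using psycopg 3's async API. It is only an illustration under a few assumptions: it reuses the hypothetical packets() async generator from the question, it assumes the connection is not in autocommit mode, the batch limits are placeholder values to tune to your load, and it assumes packets keep arriving (if the stream can go quiet, add a timeout so a COPY and its transaction aren't left open waiting).

import time

import psycopg

BATCH_SECONDS = 2        # keep each COPY (i.e. each transaction) short
BATCH_MAX_ROWS = 10_000  # safety cap so one batch can't grow unbounded

async def ingest(conninfo: str) -> None:
    async with await psycopg.AsyncConnection.connect(conninfo) as conn:
        stream = packets()  # the infinite async generator from the question
        while True:
            started = time.monotonic()
            count = 0
            # One short-lived COPY per iteration; the transaction commits when
            # the block exits, so new rows become visible every few seconds.
            async with conn.transaction():
                async with conn.cursor() as cur:
                    async with cur.copy("COPY data FROM STDIN") as copy:
                        async for packet in stream:
                            await copy.write(packet)
                            count += 1
                            if (count >= BATCH_MAX_ROWS
                                    or time.monotonic() - started >= BATCH_SECONDS):
                                break  # end this COPY; the next iteration resumes the stream

Each pass through the outer loop is one COPY and one commit, so ingestion stays continuous while no single transaction lives longer than a couple of seconds.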

answered Feb 13, 2023 at 7:03
