I have a lot of time-series data going into a Postgres DB. The data arrives on a message queue, is read by a consumer script, and is inserted into the database. Data flows into the system continuously throughout the day, and the load stays fairly consistent.
The table which holds the data is partitioned on the timestamp and is only ever written to by this script. It is read from by other services.
I was wondering whether it would be possible/appropriate to use a single long-running COPY to stream the data directly into the DB, or whether it would be better to batch the results and run multiple inserts / short-lived COPY statements.
I'm using Python and the psycopg (psycopg3) driver, which seems to have a very clean interface for copying data. I was thinking of a solution something like the one below:
from collections.abc import AsyncGenerator

async def packets() -> AsyncGenerator[..., None]:
    """
    Infinite async generator that yields packets from the incoming stream.
    """
    while True:
        yield ...

# One long-running COPY fed directly from the stream.
async with cursor.copy("COPY data FROM STDIN") as copy:
    async for packet in packets():
        await copy.write(packet)
The solution I currently have is much closer to the following:
from collections.abc import Sequence

async def insert(packets: Sequence[...]) -> None:
    """Bulk-insert packets into the DB."""
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.executemany("...", packets)

buffer = []
async for packet in packets():
    buffer.append(packet)
    if len(buffer) == 100:
        await insert(buffer)
        buffer.clear()
Both of the Python solutions above are quite naive; I know there are many things that could be done to optimise either one (especially the batching).
EDIT: I don't think I made it particularly clear in the initial question: the data is coming into the system live and doesn't stop. Whenever I've seen insert speeds discussed, COPY is almost always significantly faster, but those comparisons are usually about restoring a DB or some other task that runs and eventually completes. This process runs for an undefined amount of time, continuously ingesting data.
What should happen when COPY (or INSERT) returns an error? Presumably you can't just tell the streaming process to restart its stream. – jjanes, Feb 10, 2023 at 16:01
1 Answer
No matter how many rows you load in a single COPY statement, it will always be a single transaction, so the data won't be visible until the statement is finished. Moreover, it wouldn't be smart to have long-running transactions all the time.

So my advice is to use COPY, but to batch the rows loaded so that a single COPY statement doesn't take more than a couple of seconds.
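For illustration, the stream from the question could be chopped into short-lived COPY statements roughly like this. This is only a minimal sketch: pool, packets(), and the data table are taken from the question; BATCH_SIZE, FLUSH_INTERVAL, copy_batch, and ingest are made-up names and numbers to tune; and the packets are assumed to already be in a format that copy.write() accepts (otherwise copy.write_row() with tuples would be the alternative).

import time

BATCH_SIZE = 5_000       # rows per COPY; tune so one COPY takes a second or two
FLUSH_INTERVAL = 2.0     # seconds; also flush on time so rows become visible promptly

async def copy_batch(rows: list) -> None:
    # One short COPY per batch: the transaction commits when the connection
    # block exits, so readers see the rows shortly after they arrive.
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            async with cur.copy("COPY data FROM STDIN") as copy:
                for row in rows:
                    await copy.write(row)

async def ingest() -> None:
    buffer: list = []
    last_flush = time.monotonic()
    async for packet in packets():
        buffer.append(packet)
        # Flush on size or age; note the age check only runs when a packet
        # arrives, so a paused stream leaves the last partial batch buffered.
        if len(buffer) >= BATCH_SIZE or time.monotonic() - last_flush >= FLUSH_INTERVAL:
            await copy_batch(buffer)
            buffer.clear()
            last_flush = time.monotonic()

The time-based flush matters as much as the size-based one here: it bounds both the length of each transaction and the delay before the data becomes visible to the reading services.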