I am trying to dump the entire contents of a large table from the command line using psql, but am running into a problem where memory usage grows to the point where the process is killed, before any data is even dumped.
What I don't understand is: why doesn't the query start returning results immediately and complete without running out of memory?
Here is an explanation of exactly what I'm attempting:
I have a table, say:
CREATE TABLE big
(
  id integer,
  rand double precision
);
A large number of rows is inserted (50 million):
insert into big
select generate_series(1, 50000000) AS id, random();
The query plan to select every row looks like (not surprisingly):
$ psql -d big -c "explain select * from big;"
QUERY PLAN
----------------------------------------------------------------
Seq Scan on big (cost=0.00..924326.24 rows=50000124 width=12)
(1 row)
I then attempt to dump the contents to file:
$ psql -d big -c "select * from big;" > big.dump
As I said above, this command fails before any data is written: it consumes an ever-increasing amount of memory until it is killed by the OS (the "OOM killer").
Note: I understand I could use pg_dump to accomplish something similar, but in reality my query is more complex than this - specifically, I would like to encode each row as JSON when dumping.
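As a sketch of the kind of query I mean (using the built-in row_to_json function, which is available in 9.3), the JSON-encoding version would look something like:
$ psql -d big -c "select row_to_json(big) from big;" > big.json
This naive form runs into the same memory problem as the plain select.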
Some configuration details:
- postgresql version = 9.3.4
- work_mem = 1MB
- shared_buffers = 128MB
- effective_cache_size = 128MB
1 Answer
By default, the results are entirely buffered in memory, for two reasons:
1) Unless the -A option is used, output rows are aligned, so output cannot start until psql knows the maximum width of each column, which implies visiting every row (this also takes significant time, in addition to using lots of memory).
2) Unless a FETCH_COUNT is specified, psql uses the synchronous PQexec function directly on the query, which buffers the entire result set. But when FETCH_COUNT is set, it uses a cursor-based method with successive FETCH calls, freeing or reusing the client-side buffer every FETCH_COUNT rows.
So a big resultset should be fetched by a command like:
psql -A -t --variable="FETCH_COUNT=10000" \
-c "select columns from bigtable" \
> output-file
Reduce FETCH_COUNT if the rows are very large and it still eats too much memory. The -t option stands for --tuples-only and suppresses the output of headers and footers.
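The same settings can also be applied inside an interactive session with psql meta-commands (a sketch; output-file is a placeholder):
\set FETCH_COUNT 10000
\pset format unaligned
\pset tuples_only on
\o output-file
select columns from bigtable;
\o
The final \o with no argument closes the file and sends subsequent query output back to stdout.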
- Excellent explanation. I tested both 1) and 2) separately and together. Turns out that specifying -A doesn't make much difference, although my table is simple (i.e. only two number columns) - it might make more difference for a "wider" or variable-width table? Setting FETCH_COUNT does make all the difference though. – JonoB, May 15, 2015
Yes, the effect of column alignment is more pronounced when the result includes wide text columns of varying lengths. But the point is to remember to turn alignment off to extract the raw contents, rather than look at the space-padded version of them on a terminal. – Daniel Vérité, May 15, 2015
- It appears from some small experimentation that if you set FETCH_COUNT, then it only aligns header sizes based on "largest per column so far" (i.e. if you specify FETCH_COUNT, you don't have to worry much about alignment), FWIW... – rogerdpack, Nov 27, 2017
- You could also use the COPY command: psql -d big -c "copy (select * from big) to stdout" > big.dump
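For the JSON encoding mentioned in the question, this streaming approach could be combined with row_to_json (a sketch; note that COPY's text format escapes characters such as backslashes, so psql's unaligned mode may be safer for producing strict JSON):
$ psql -d big -c "copy (select row_to_json(big) from big) to stdout" > big.json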
- Which process runs out of memory - the psql process or the Postgres backend process for your connection? I guess the client (psql) buffers the result somehow (or forces the backend process to do so). When you use copy, psql just streams the rows through to the output instead of buffering the whole result set on the client side.
- It's the psql process; from syslog: Out of memory: Kill process 26465 (psql). FYI: I'm running the client on the same machine as the server.
- It doesn't matter where psql runs - it's still a "client" to the server. Does this also happen when you use the \o command to write the output to a file? In that case psql "knows" that you don't need to display the data; maybe it retrieves the data more efficiently then.