I am trying to dump the entire contents of a large table from the command line using psql, but am running into a problem where memory usage grows to the point where the process is killed, before any data is even dumped.
What I don't understand is: why doesn't the query start returning results immediately and complete without running out of memory?
Here is an explanation of exactly what I'm attempting:
I have a table, say:
CREATE TABLE big
(
  id integer,
  rand double precision
);
A large number of rows is inserted (50 million):
insert into big
select generate_series(1, 50000000) AS id, random();
The query plan to select every row looks like (not surprisingly):
$ psql -d big -c "explain select * from big;"
QUERY PLAN
----------------------------------------------------------------
Seq Scan on big (cost=0.00..924326.24 rows=50000124 width=12)
(1 row)
I then attempt to dump the contents to file:
$ psql -d big -c "select * from big;" > big.dump
As I said above, this command fails before any data is written: it consumes an ever-increasing amount of memory until it is killed by the OS (the "OOM killer").
Note: I understand I could use pg_dump to accomplish something similar, but in reality my query is more complex than this - specifically, I would like to encode each row as JSON when dumping.
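As a sketch of the kind of query I mean (using the built-in row_to_json function, which is available in 9.3), the JSON-encoding version would look something like:
$ psql -d big -c "select row_to_json(big) from big;" > big.json
This naive form runs into the same memory problem as the plain select.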
Some configuration details:
- postgresql version = 9.3.4
- work_mem = 1MB
- shared_buffers = 128MB
- effective_cache_size = 128MB
1 Answer
By default, the results are entirely buffered in memory, for two reasons:
1) Unless the -A option is used, output rows are aligned, so output cannot start until psql knows the maximum width of each column, which implies visiting every row (this also takes significant time, in addition to using lots of memory).
2) Unless a FETCH_COUNT is specified, psql uses the synchronous PQexec function directly on the query, which buffers the entire result set. But when FETCH_COUNT is set, it uses a cursor-based method with successive FETCH calls, freeing or reusing the client-side buffer every FETCH_COUNT rows.
So a big resultset should be fetched by a command like:
psql -A -t --variable="FETCH_COUNT=10000" \
-c "select columns from bigtable" \
> output-file
Reduce FETCH_COUNT if the rows are very large and it still eats too much memory. The -t option stands for --tuples-only and suppresses the output of headers and footers.
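The same settings can also be applied inside an interactive session with psql meta-commands (a sketch; output-file is a placeholder):
\set FETCH_COUNT 10000
\pset format unaligned
\pset tuples_only on
\o output-file
select columns from bigtable;
\o
The final \o with no argument closes the file and sends subsequent query output back to stdout.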
- Excellent explanation. I tested both 1) and 2) separately and together. Turns out that specifying -A doesn't make much difference, although my table is simple (i.e. only two number columns) - it might make more difference for a "wider" or variable-width table? Setting FETCH_COUNT does make all the difference though. – JonoB, May 15, 2015
Yes, the effect of column alignment is more pronounced when the result includes wide text columns of varying lengths. But the point is to remember to turn alignment off to extract the raw contents, rather than look at the space-padded version of them on a terminal. – Daniel Vérité, May 15, 2015
- It appears from some small experimentation that if you set FETCH_COUNT, then it only aligns header sizes based on "largest per column so far" (i.e. if you specify FETCH_COUNT, you don't have to worry much about alignment), FWIW... – rogerdpack, Nov 27, 2017
- You could also use the COPY command: psql -d big -c "copy (select * from big) to stdout" > big.dump
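For the JSON encoding mentioned in the question, this streaming approach could be combined with row_to_json (a sketch; note that COPY's text format escapes characters such as backslashes, so psql's unaligned mode may be safer for producing strict JSON):
$ psql -d big -c "copy (select row_to_json(big) from big) to stdout" > big.json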
- Which process runs out of memory - the psql process or the Postgres backend process for your connection? I guess the client (psql) buffers the result somehow (or forces the backend process to do so). When you use copy, psql just streams the rows through to the output instead of buffering the whole result set on the client side.
- It's the psql process; from syslog: Out of memory: Kill process 26465 (psql). FYI: I'm running the client on the same machine as the server.
- It doesn't matter where psql runs - it's still a "client" to the server. Does this also happen when you use the \o command to write the output to a file? In that case psql "knows" that you don't need to display the data; maybe it retrieves the data more efficiently then.