How do I disable escaping when using COPY FROM in PostgreSQL?

Question 1

I have a large tab delimited file that I want to read into a table in PostgreSQL 9.5. It contains double-quotes and backslashes that I want to treat as regular characters.

I think COPY FROM is the way to go, but I can't figure out how to disable escaping.

Here is a sample of the data (it's from Google's ngram dataset):

aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5

And the table:

CREATE TABLE onegram (
 id SERIAL,
 ngram character text,
 year integer NOT NULL,
 match_count integer NOT NULL,
 volume_count integer NOT NULL
);

If I try without modifiers, the backslash escapes the tab:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv';
ERROR: missing data for column "volume_count"
CONTEXT: COPY onegram, line 4: "a\ 1852 1 1"

So I switch to CSV, and then the double quote characters quote the tabs:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t';
ERROR: unterminated CSV quoted field
CONTEXT: COPY onegram, line 17: "ACTION=" 1999 11 5
"

Using CSV lets me use the DELIMITER keyword. It works if I pick a delimiter that doesn't occur in the sample (space, in this case):

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t' QUOTE E' ';
COPY 16

But I want to be able to include any character (except for tab and newline). So, how can I disable QUOTE? Or what can I use instead of COPY FROM?

Edit: For slightly arbitrary reasons, ideally I'd like an option that does not involve pre-processing the data.

Question 2

The copy command by default uses text format with tab delimiter. So only one thing you need is to escape backslashes:

copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │ ngram │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║ 1 │ aX13_X │ 2006 │ 8 │ 5 ║
║ 2 │ aX13_X │ 2007 │ 4 │ 3 ║
║ 3 │ aX13_X │ 2008 │ 2 │ 1 ║
║ 4 │ a\ │ 1852 │ 1 │ 1 ║
║ 5 │ a\ │ 1935 │ 1 │ 1 ║
║ 6 │ a\ │ 1937 │ 2 │ 2 ║
║ 7 │ ACT1V1T1ES │ 2003 │ 15 │ 11 ║
║ 8 │ ACT1V1T1ES │ 2004 │ 63 │ 6 ║
║ 9 │ ACT1V1T1ES │ 2005 │ 1 │ 1 ║
║ 10 │ ACT1V1T1ES │ 2006 │ 5 │ 4 ║
║ 11 │ ACT1V1T1ES │ 2008 │ 4 │ 3 ║
║ 12 │ ACTION=" │ 1995 │ 3 │ 3 ║
║ 13 │ ACTION=" │ 1996 │ 6 │ 5 ║
║ 14 │ ACTION=" │ 1997 │ 9 │ 7 ║
║ 15 │ ACTION=" │ 1998 │ 19 │ 11 ║
║ 16 │ ACTION=" │ 1999 │ 11 │ 5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝

Question 3

It looks like you'll need to preprocess the data. COPY can't ingest this data if you want to allow any possible input. It isn't valid CSV, and it doesn't follow the rules of postgres's TSV-like native COPY format either. There is no option to set QUOTE NONE or something like that.

I suggest something like a perl/python script that transforms the data, which you invoke using COPY ... FROM PROGRAM. Or by piping the data into psql for client-side input, or by transforming the data and directly feeding it to Postgres via the client driver like DBD::Pg or psycopg2, both of which have COPY support.

You could always submit a postgres patch so the next person has an easier job of the same problem.

Question 4

I suspect you're right. I'll look at making a patch.

Question 5

It shouldn't be too hard (famous last words?) to allow COPY to take a keyword as the quote char, like NONE, which is an unreserved keyword already and thus suitable for that job. You'd need to change gram.y to accept it. CopyStmt doesn't need changes, the List of DefElem there is fine. In copy.c you'd need to make DoCopy check for the NONE keyword in the quote option element; around 1084. Then actually handle that in data input.

Question 6

From @Craig Ringer, you should re-format the data.

Based on your data, on Linux, I used tr & seq to format CSV file before importing.

For example: suppose that a.csv file as below

aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5

Format CSV, disable quote (replace multiple spaces to one space then space & \t to ;). c.csv is result file. However, please check your input data if it is large because I tested on small data .

cat a.csv | tr -s " " > b.csv; sed -e 's/"//g' -e "s/[\t ]/;/g" b.csv > c.csv 
cat c.csv
aX13_X;2006;8;5
aX13_X;2007;4;3
aX13_X;2008;2;1
a\;1852;1;1
a\;1935;1;1
a\;1937;2;2
ACT1V1T1ES;2003;15;11
ACT1V1T1ES;2004;63;6
ACT1V1T1ES;2005;1;1
ACT1V1T1ES;2006;5;4
ACT1V1T1ES;2008;4;3
ACTION=;1995;3;3
ACTION=;1996;6;5
ACTION=;1997;9;7
ACTION=;1998;19;11
ACTION=;1999;11;5

Import to PG

psql -p 5432 -d postgres -c "copy onegram (ngram, year, match_count, volume_count) from '/opt/c.csv' DELIMITER ';' CSV"

After importing, result likes that:

id | ngram | year | match_count | volume_count
----+------------+------+-------------+--------------
 20 | aX13_X | 2006 | 8 | 5
 21 | aX13_X | 2007 | 4 | 3
 22 | aX13_X | 2008 | 2 | 1
 23 | a\ | 1852 | 1 | 1
 24 | a\ | 1935 | 1 | 1
 25 | a\ | 1937 | 2 | 2
 26 | ACT1V1T1ES | 2003 | 15 | 11
 27 | ACT1V1T1ES | 2004 | 63 | 6
 28 | ACT1V1T1ES | 2005 | 1 | 1
 29 | ACT1V1T1ES | 2006 | 5 | 4
 30 | ACT1V1T1ES | 2008 | 4 | 3
 31 | ACTION= | 1995 | 3 | 3
 32 | ACTION= | 1996 | 6 | 5
 33 | ACTION= | 1997 | 9 | 7
 34 | ACTION= | 1998 | 19 | 11
 35 | ACTION= | 1999 | 11 | 5

Question 7

This solution doesn't suit my problem for a few reasons. First, the quotes are lost. I want to keep the ACTION=" ngram distinct from ACTION= ngram. Similarly the data probably has semicolons in the ngrams. Lastly, per my edit, I'm trying to avoid preprocessing.

Question 8

I often re-format data before using copy. Please update your solution when you find it. Thanks.

Abelisto Abelisto 1,5891 gold badge10 silver badges14 bronze badges · Answer 1 · 2016-12-05 20:56:41Z

The copy command by default uses text format with tab delimiter. So only one thing you need is to escape backslashes:

copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │ ngram │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║ 1 │ aX13_X │ 2006 │ 8 │ 5 ║
║ 2 │ aX13_X │ 2007 │ 4 │ 3 ║
║ 3 │ aX13_X │ 2008 │ 2 │ 1 ║
║ 4 │ a\ │ 1852 │ 1 │ 1 ║
║ 5 │ a\ │ 1935 │ 1 │ 1 ║
║ 6 │ a\ │ 1937 │ 2 │ 2 ║
║ 7 │ ACT1V1T1ES │ 2003 │ 15 │ 11 ║
║ 8 │ ACT1V1T1ES │ 2004 │ 63 │ 6 ║
║ 9 │ ACT1V1T1ES │ 2005 │ 1 │ 1 ║
║ 10 │ ACT1V1T1ES │ 2006 │ 5 │ 4 ║
║ 11 │ ACT1V1T1ES │ 2008 │ 4 │ 3 ║
║ 12 │ ACTION=" │ 1995 │ 3 │ 3 ║
║ 13 │ ACTION=" │ 1996 │ 6 │ 5 ║
║ 14 │ ACTION=" │ 1997 │ 9 │ 7 ║
║ 15 │ ACTION=" │ 1998 │ 19 │ 11 ║
║ 16 │ ACTION=" │ 1999 │ 11 │ 5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝

Craig Ringer Craig Ringer 58k6 gold badges162 silver badges194 bronze badges · Answer 2 · 2016-12-05 14:15:29Z

It looks like you'll need to preprocess the data. COPY can't ingest this data if you want to allow any possible input. It isn't valid CSV, and it doesn't follow the rules of postgres's TSV-like native COPY format either. There is no option to set QUOTE NONE or something like that.

I suggest something like a perl/python script that transforms the data, which you invoke using COPY ... FROM PROGRAM. Or by piping the data into psql for client-side input, or by transforming the data and directly feeding it to Postgres via the client driver like DBD::Pg or psycopg2, both of which have COPY support.

You could always submit a postgres patch so the next person has an easier job of the same problem.

It shouldn't be too hard (famous last words?) to allow COPY to take a keyword as the quote char, like NONE, which is an unreserved keyword already and thus suitable for that job. You'd need to change gram.y to accept it. CopyStmt doesn't need changes, the List of DefElem there is fine. In copy.c you'd need to make DoCopy check for the NONE keyword in the quote option element; around 1084. Then actually handle that in data input.

Luan Huynh Luan Huynh 2,0107 gold badges27 silver badges37 bronze badges · Answer 3 · 2016-12-05 14:28:20Z

From @Craig Ringer, you should re-format the data.

Based on your data, on Linux, I used tr & seq to format CSV file before importing.

For example: suppose that a.csv file as below

aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5

Format CSV, disable quote (replace multiple spaces to one space then space & \t to ;). c.csv is result file. However, please check your input data if it is large because I tested on small data .

cat a.csv | tr -s " " > b.csv; sed -e 's/"//g' -e "s/[\t ]/;/g" b.csv > c.csv 
cat c.csv
aX13_X;2006;8;5
aX13_X;2007;4;3
aX13_X;2008;2;1
a\;1852;1;1
a\;1935;1;1
a\;1937;2;2
ACT1V1T1ES;2003;15;11
ACT1V1T1ES;2004;63;6
ACT1V1T1ES;2005;1;1
ACT1V1T1ES;2006;5;4
ACT1V1T1ES;2008;4;3
ACTION=;1995;3;3
ACTION=;1996;6;5
ACTION=;1997;9;7
ACTION=;1998;19;11
ACTION=;1999;11;5

Import to PG

psql -p 5432 -d postgres -c "copy onegram (ngram, year, match_count, volume_count) from '/opt/c.csv' DELIMITER ';' CSV"

After importing, result likes that:

id | ngram | year | match_count | volume_count
----+------------+------+-------------+--------------
 20 | aX13_X | 2006 | 8 | 5
 21 | aX13_X | 2007 | 4 | 3
 22 | aX13_X | 2008 | 2 | 1
 23 | a\ | 1852 | 1 | 1
 24 | a\ | 1935 | 1 | 1
 25 | a\ | 1937 | 2 | 2
 26 | ACT1V1T1ES | 2003 | 15 | 11
 27 | ACT1V1T1ES | 2004 | 63 | 6
 28 | ACT1V1T1ES | 2005 | 1 | 1
 29 | ACT1V1T1ES | 2006 | 5 | 4
 30 | ACT1V1T1ES | 2008 | 4 | 3
 31 | ACTION= | 1995 | 3 | 3
 32 | ACTION= | 1996 | 6 | 5
 33 | ACTION= | 1997 | 9 | 7
 34 | ACTION= | 1998 | 19 | 11
 35 | ACTION= | 1999 | 11 | 5

This solution doesn't suit my problem for a few reasons. First, the quotes are lost. I want to keep the ACTION=" ngram distinct from ACTION= ngram. Similarly the data probably has semicolons in the ngrams. Lastly, per my edit, I'm trying to avoid preprocessing.
I often re-format data before using copy. Please update your solution when you find it. Thanks.

Stack Exchange Network

How do I disable escaping when using COPY FROM in PostgreSQL?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How do I disable escaping when using COPY FROM in PostgreSQL?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions