I have a large tab delimited file that I want to read into a table in PostgreSQL 9.5. It contains double-quotes and backslashes that I want to treat as regular characters.
I think COPY FROM is the way to go, but I can't figure out how to disable escaping.
Here is a sample of the data (it's from Google's ngram dataset):
aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5
And the table:
CREATE TABLE onegram (
id SERIAL,
ngram character text,
year integer NOT NULL,
match_count integer NOT NULL,
volume_count integer NOT NULL
);
If I try without modifiers, the backslash escapes the tab:
ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv';
ERROR: missing data for column "volume_count"
CONTEXT: COPY onegram, line 4: "a\ 1852 1 1"
So I switch to CSV, and then the double quote characters quote the tabs:
ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t';
ERROR: unterminated CSV quoted field
CONTEXT: COPY onegram, line 17: "ACTION=" 1999 11 5
"
Using CSV lets me use the DELIMITER keyword. It works if I pick a delimiter that doesn't occur in the sample (space, in this case):
ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t' QUOTE E' ';
COPY 16
But I want to be able to include any character (except for tab and newline). So, how can I disable QUOTE? Or what can I use instead of COPY FROM?
Edit: For slightly arbitrary reasons, ideally I'd like an option that does not involve pre-processing the data.
3 Answers 3
The copy
command by default uses text
format with tab delimiter. So only one thing you need is to escape backslashes:
copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │ ngram │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║ 1 │ aX13_X │ 2006 │ 8 │ 5 ║
║ 2 │ aX13_X │ 2007 │ 4 │ 3 ║
║ 3 │ aX13_X │ 2008 │ 2 │ 1 ║
║ 4 │ a\ │ 1852 │ 1 │ 1 ║
║ 5 │ a\ │ 1935 │ 1 │ 1 ║
║ 6 │ a\ │ 1937 │ 2 │ 2 ║
║ 7 │ ACT1V1T1ES │ 2003 │ 15 │ 11 ║
║ 8 │ ACT1V1T1ES │ 2004 │ 63 │ 6 ║
║ 9 │ ACT1V1T1ES │ 2005 │ 1 │ 1 ║
║ 10 │ ACT1V1T1ES │ 2006 │ 5 │ 4 ║
║ 11 │ ACT1V1T1ES │ 2008 │ 4 │ 3 ║
║ 12 │ ACTION=" │ 1995 │ 3 │ 3 ║
║ 13 │ ACTION=" │ 1996 │ 6 │ 5 ║
║ 14 │ ACTION=" │ 1997 │ 9 │ 7 ║
║ 15 │ ACTION=" │ 1998 │ 19 │ 11 ║
║ 16 │ ACTION=" │ 1999 │ 11 │ 5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝
It looks like you'll need to preprocess the data. COPY
can't ingest this data if you want to allow any possible input. It isn't valid CSV, and it doesn't follow the rules of postgres's TSV-like native COPY format either. There is no option to set QUOTE NONE
or something like that.
I suggest something like a perl/python script that transforms the data, which you invoke using COPY ... FROM PROGRAM
. Or by piping the data into psql
for client-side input, or by transforming the data and directly feeding it to Postgres via the client driver like DBD::Pg
or psycopg2
, both of which have COPY
support.
You could always submit a postgres patch so the next person has an easier job of the same problem.
-
I suspect you're right. I'll look at making a patch.Tim Smith– Tim Smith2016年12月06日 00:55:25 +00:00Commented Dec 6, 2016 at 0:55
-
1It shouldn't be too hard (famous last words?) to allow COPY to take a keyword as the quote char, like
NONE
, which is an unreserved keyword already and thus suitable for that job. You'd need to changegram.y
to accept it.CopyStmt
doesn't need changes, theList
ofDefElem
there is fine. Incopy.c
you'd need to makeDoCopy
check for theNONE
keyword in thequote
option element; around 1084. Then actually handle that in data input.Craig Ringer– Craig Ringer2016年12月06日 01:11:01 +00:00Commented Dec 6, 2016 at 1:11
From @Craig Ringer, you should re-format the data.
Based on your data, on Linux, I used tr
& seq
to format CSV file before importing.
For example: suppose that a.csv
file as below
aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5
Format CSV, disable quote (replace multiple spaces to one space then space & \t
to ;
). c.csv
is result file. However, please check your input data if it is large because I tested on small data .
cat a.csv | tr -s " " > b.csv; sed -e 's/"//g' -e "s/[\t ]/;/g" b.csv > c.csv
cat c.csv
aX13_X;2006;8;5
aX13_X;2007;4;3
aX13_X;2008;2;1
a\;1852;1;1
a\;1935;1;1
a\;1937;2;2
ACT1V1T1ES;2003;15;11
ACT1V1T1ES;2004;63;6
ACT1V1T1ES;2005;1;1
ACT1V1T1ES;2006;5;4
ACT1V1T1ES;2008;4;3
ACTION=;1995;3;3
ACTION=;1996;6;5
ACTION=;1997;9;7
ACTION=;1998;19;11
ACTION=;1999;11;5
Import to PG
psql -p 5432 -d postgres -c "copy onegram (ngram, year, match_count, volume_count) from '/opt/c.csv' DELIMITER ';' CSV"
After importing, result likes that:
id | ngram | year | match_count | volume_count
----+------------+------+-------------+--------------
20 | aX13_X | 2006 | 8 | 5
21 | aX13_X | 2007 | 4 | 3
22 | aX13_X | 2008 | 2 | 1
23 | a\ | 1852 | 1 | 1
24 | a\ | 1935 | 1 | 1
25 | a\ | 1937 | 2 | 2
26 | ACT1V1T1ES | 2003 | 15 | 11
27 | ACT1V1T1ES | 2004 | 63 | 6
28 | ACT1V1T1ES | 2005 | 1 | 1
29 | ACT1V1T1ES | 2006 | 5 | 4
30 | ACT1V1T1ES | 2008 | 4 | 3
31 | ACTION= | 1995 | 3 | 3
32 | ACTION= | 1996 | 6 | 5
33 | ACTION= | 1997 | 9 | 7
34 | ACTION= | 1998 | 19 | 11
35 | ACTION= | 1999 | 11 | 5
-
1This solution doesn't suit my problem for a few reasons. First, the quotes are lost. I want to keep the
ACTION="
ngram distinct fromACTION=
ngram. Similarly the data probably has semicolons in the ngrams. Lastly, per my edit, I'm trying to avoid preprocessing.Tim Smith– Tim Smith2016年12月06日 01:00:06 +00:00Commented Dec 6, 2016 at 1:00 -
I often re-format data before using
copy
. Please update your solution when you find it. Thanks.Luan Huynh– Luan Huynh2016年12月06日 02:03:34 +00:00Commented Dec 6, 2016 at 2:03