7

I have a large tab delimited file that I want to read into a table in PostgreSQL 9.5. It contains double-quotes and backslashes that I want to treat as regular characters.

I think COPY FROM is the way to go, but I can't figure out how to disable escaping.

Here is a sample of the data (it's from Google's ngram dataset):

aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5

And the table:

CREATE TABLE onegram (
 id SERIAL,
 ngram character text,
 year integer NOT NULL,
 match_count integer NOT NULL,
 volume_count integer NOT NULL
);

If I try without modifiers, the backslash escapes the tab:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv';
ERROR: missing data for column "volume_count"
CONTEXT: COPY onegram, line 4: "a\ 1852 1 1"

So I switch to CSV, and then the double quote characters quote the tabs:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t';
ERROR: unterminated CSV quoted field
CONTEXT: COPY onegram, line 17: "ACTION=" 1999 11 5
"

Using CSV lets me use the DELIMITER keyword. It works if I pick a delimiter that doesn't occur in the sample (space, in this case):

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t' QUOTE E' ';
COPY 16

But I want to be able to include any character (except for tab and newline). So, how can I disable QUOTE? Or what can I use instead of COPY FROM?

Edit: For slightly arbitrary reasons, ideally I'd like an option that does not involve pre-processing the data.

asked Dec 5, 2016 at 12:53
0

3 Answers 3

5

The copy command by default uses text format with tab delimiter. So only one thing you need is to escape backslashes:

copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │ ngram │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║ 1 │ aX13_X │ 2006 │ 8 │ 5 ║
║ 2 │ aX13_X │ 2007 │ 4 │ 3 ║
║ 3 │ aX13_X │ 2008 │ 2 │ 1 ║
║ 4 │ a\ │ 1852 │ 1 │ 1 ║
║ 5 │ a\ │ 1935 │ 1 │ 1 ║
║ 6 │ a\ │ 1937 │ 2 │ 2 ║
║ 7 │ ACT1V1T1ES │ 2003 │ 15 │ 11 ║
║ 8 │ ACT1V1T1ES │ 2004 │ 63 │ 6 ║
║ 9 │ ACT1V1T1ES │ 2005 │ 1 │ 1 ║
║ 10 │ ACT1V1T1ES │ 2006 │ 5 │ 4 ║
║ 11 │ ACT1V1T1ES │ 2008 │ 4 │ 3 ║
║ 12 │ ACTION=" │ 1995 │ 3 │ 3 ║
║ 13 │ ACTION=" │ 1996 │ 6 │ 5 ║
║ 14 │ ACTION=" │ 1997 │ 9 │ 7 ║
║ 15 │ ACTION=" │ 1998 │ 19 │ 11 ║
║ 16 │ ACTION=" │ 1999 │ 11 │ 5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝
answered Dec 5, 2016 at 20:56
0
3

It looks like you'll need to preprocess the data. COPY can't ingest this data if you want to allow any possible input. It isn't valid CSV, and it doesn't follow the rules of postgres's TSV-like native COPY format either. There is no option to set QUOTE NONE or something like that.

I suggest something like a perl/python script that transforms the data, which you invoke using COPY ... FROM PROGRAM. Or by piping the data into psql for client-side input, or by transforming the data and directly feeding it to Postgres via the client driver like DBD::Pg or psycopg2, both of which have COPY support.

You could always submit a postgres patch so the next person has an easier job of the same problem.

answered Dec 5, 2016 at 14:15
2
  • I suspect you're right. I'll look at making a patch. Commented Dec 6, 2016 at 0:55
  • 1
    It shouldn't be too hard (famous last words?) to allow COPY to take a keyword as the quote char, like NONE, which is an unreserved keyword already and thus suitable for that job. You'd need to change gram.y to accept it. CopyStmt doesn't need changes, the List of DefElem there is fine. In copy.c you'd need to make DoCopy check for the NONE keyword in the quote option element; around 1084. Then actually handle that in data input. Commented Dec 6, 2016 at 1:11
0

From @Craig Ringer, you should re-format the data.

Based on your data, on Linux, I used tr & seq to format CSV file before importing.

For example: suppose that a.csv file as below

aX13_X 2006 8 5
aX13_X 2007 4 3
aX13_X 2008 2 1
a\ 1852 1 1
a\ 1935 1 1
a\ 1937 2 2
ACT1V1T1ES 2003 15 11
ACT1V1T1ES 2004 63 6
ACT1V1T1ES 2005 1 1
ACT1V1T1ES 2006 5 4
ACT1V1T1ES 2008 4 3
ACTION=" 1995 3 3
ACTION=" 1996 6 5
ACTION=" 1997 9 7
ACTION=" 1998 19 11
ACTION=" 1999 11 5

Format CSV, disable quote (replace multiple spaces to one space then space & \t to ;). c.csv is result file. However, please check your input data if it is large because I tested on small data .

cat a.csv | tr -s " " > b.csv; sed -e 's/"//g' -e "s/[\t ]/;/g" b.csv > c.csv 
cat c.csv
aX13_X;2006;8;5
aX13_X;2007;4;3
aX13_X;2008;2;1
a\;1852;1;1
a\;1935;1;1
a\;1937;2;2
ACT1V1T1ES;2003;15;11
ACT1V1T1ES;2004;63;6
ACT1V1T1ES;2005;1;1
ACT1V1T1ES;2006;5;4
ACT1V1T1ES;2008;4;3
ACTION=;1995;3;3
ACTION=;1996;6;5
ACTION=;1997;9;7
ACTION=;1998;19;11
ACTION=;1999;11;5

Import to PG

psql -p 5432 -d postgres -c "copy onegram (ngram, year, match_count, volume_count) from '/opt/c.csv' DELIMITER ';' CSV"

After importing, result likes that:

id | ngram | year | match_count | volume_count
----+------------+------+-------------+--------------
 20 | aX13_X | 2006 | 8 | 5
 21 | aX13_X | 2007 | 4 | 3
 22 | aX13_X | 2008 | 2 | 1
 23 | a\ | 1852 | 1 | 1
 24 | a\ | 1935 | 1 | 1
 25 | a\ | 1937 | 2 | 2
 26 | ACT1V1T1ES | 2003 | 15 | 11
 27 | ACT1V1T1ES | 2004 | 63 | 6
 28 | ACT1V1T1ES | 2005 | 1 | 1
 29 | ACT1V1T1ES | 2006 | 5 | 4
 30 | ACT1V1T1ES | 2008 | 4 | 3
 31 | ACTION= | 1995 | 3 | 3
 32 | ACTION= | 1996 | 6 | 5
 33 | ACTION= | 1997 | 9 | 7
 34 | ACTION= | 1998 | 19 | 11
 35 | ACTION= | 1999 | 11 | 5
answered Dec 5, 2016 at 14:28
2
  • 1
    This solution doesn't suit my problem for a few reasons. First, the quotes are lost. I want to keep the ACTION=" ngram distinct from ACTION= ngram. Similarly the data probably has semicolons in the ngrams. Lastly, per my edit, I'm trying to avoid preprocessing. Commented Dec 6, 2016 at 1:00
  • I often re-format data before using copy. Please update your solution when you find it. Thanks. Commented Dec 6, 2016 at 2:03

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.