I have an externally produced CSV full of geographic points and associated data, but I would like to run some functions on the columns before saving them. For example, I need to digest longitude/latitude columns into points, run some numerical processing on other columns, and discard a couple altogether.
What's the best practice here? Options I've considered:
- Processing the data with a shell script before importing. This can be invoked directly from the COPY command with COPY FROM PROGRAM, but it still leaves the problem of turning lon/lat into a geographic type.
- Having two tables, data and data_import, and updating data using functions on the columns of data_import.
- Having a table where the imported data sits exactly as it appears in the CSV, and working with it through a view that does the necessary processing (a rough sketch follows this list).
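For the view-based option, this is roughly what I have in mind, as a minimal sketch only: it assumes PostGIS is installed, and the column names, file path and numerical processing are placeholders for my real data.

```sql
-- Import table mirroring the CSV layout exactly (placeholder columns).
CREATE TABLE data_import (
    lon        double precision,
    lat        double precision,
    reading    double precision,
    unwanted_a text,              -- columns to be discarded
    unwanted_b text
);

-- Load the CSV as-is (or pipe it through a script via COPY ... FROM PROGRAM).
COPY data_import FROM '/path/to/points.csv' WITH (FORMAT csv, HEADER true);

-- View that does the processing: build the point, transform the numbers,
-- and simply omit the unwanted columns.
CREATE VIEW data AS
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326) AS geom,
       reading * 0.001                          AS reading_scaled
FROM   data_import;
```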
- The answer to these questions is going to come down to performance and where you want to pay the penalty. All three methods are good solutions and represent trade-offs between performance, storage and timing. I personally lean towards the second approach: having a table for the import and then transforming it into its final resting place. But that makes the database server work hard, and it may be easier to have another application do the transformation work on a different CPU. – Jonathan Fite, Mar 10, 2017 at 13:27
- The processing isn't especially complex, and the imports aren't too frequent. I've just read about materialised views; would their persistence give me the benefits of a second table without the added complexity of having multiple tables for the same dataset? – TroyHurts, Mar 10, 2017 at 13:47
- I wouldn't go that route, although it would give similar performance. Have a data_import table, transform the data into data_permanent, and then truncate data_import until the next time you have new data. I'm not familiar with PostgreSQL, but I know that in MSSQL materialized views have some caveats which make working with them less than ideal. They have their place, but they wouldn't be my first choice in the scenario you describe. – Jonathan Fite, Mar 10, 2017 at 14:31
1 Answer
The common way is to use a staging table (the table you called "data_import", which conventionally would be called "data_stg" or some similar variation).
Preferably, no "updating" would be involved (maybe it was just a poor choice of words), only a single insert into ... select ....
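As a rough illustration only (PostGIS assumed; the table layout, column names and scaling are made up for the example), the transforming insert could look something like this:

```sql
-- Final table holding the processed columns.
CREATE TABLE data (
    geom           geometry(Point, 4326),
    reading_scaled double precision
);

-- One transforming insert from the staging table; no UPDATE involved.
INSERT INTO data (geom, reading_scaled)
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326),
       reading * 0.001
FROM   data_stg;
```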
- Ah, updating/inserting was my mistake; I've not fully grokked the distinction yet. Is the expectation that the staging table would be cleared after the insert operation every time? – TroyHurts, Mar 10, 2017 at 14:02
- Generally yes, unless there is a pressing need to retain the data as it was originally imported. Even then, you could decide whether to keep the raw CSV files or the information contained in data_stg. An important consideration, especially if this is going to be automated, is what steps will be involved when the process breaks somewhere. You will need a method to either cleanly resume the process at any step or clean out the badly imported data and try again. – Jonathan Fite, Mar 10, 2017 at 16:31
- Sorry, I forgot to get back to you. My recommendation is to leave the staging table as-is after loading and truncate it as part of the next load. That way you'll be able to do some fast verification if any issue arises. I also recommend saving the source files. – David דודו Markovitz, Mar 10, 2017 at 16:39
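Putting the advice in these comments together, a single load might be sketched like this (hypothetical table and column names as above; in PostgreSQL even TRUNCATE is transactional, so a failed load can simply be rolled back and retried):

```sql
BEGIN;

-- Clear out the previous load only now, so its rows stayed available
-- for verification up to this point.
TRUNCATE data_stg;

-- Bring in the new file exactly as delivered (path is a placeholder).
COPY data_stg FROM '/path/to/new_points.csv' WITH (FORMAT csv, HEADER true);

-- Transform into the permanent table in one statement.
INSERT INTO data (geom, reading_scaled)
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326),
       reading * 0.001
FROM   data_stg;

COMMIT;  -- on any error, ROLLBACK leaves both tables untouched
```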