I have an externally produced CSV full of geographic points and associated data, but I would like to run some functions on the columns before saving them. For example, I need to digest longitude/latitude columns into points, run some numerical processing on other columns, and discard a couple altogether.
What's the best practice here? Options I've considered:
- Processing the data with a shell script before importing. This can be invoked directly from the COPY command with COPY FROM PROGRAM, but it still leaves the problem of turning lon/lat into a geographic type.
- Having two tables, data and data_import, and updating data using functions on the columns of data_import.
- Having a table where the imported data sits exactly as it appears in the CSV, and working with it through a view that does the necessary processing (a rough sketch follows this list).
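For the view-based option, this is roughly what I have in mind, as a minimal sketch only: it assumes PostGIS is installed, and the column names, file path and numerical processing are placeholders for my real data.

```sql
-- Import table mirroring the CSV layout exactly (placeholder columns).
CREATE TABLE data_import (
    lon        double precision,
    lat        double precision,
    reading    double precision,
    unwanted_a text,              -- columns to be discarded
    unwanted_b text
);

-- Load the CSV as-is (or pipe it through a script via COPY ... FROM PROGRAM).
COPY data_import FROM '/path/to/points.csv' WITH (FORMAT csv, HEADER true);

-- View that does the processing: build the point, transform the numbers,
-- and simply omit the unwanted columns.
CREATE VIEW data AS
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326) AS geom,
       reading * 0.001                          AS reading_scaled
FROM   data_import;
```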
- The answer to these questions is going to come down to performance and where you want to pay the penalty. All three methods are good solutions and represent trade-offs between performance, storage and timing. I personally lean towards the second approach: having a table for the import and then transforming it into its final resting place. But that makes the database server work hard, and it may be easier to have another application do the transformation work on a different CPU. – Jonathan Fite, Mar 10, 2017 at 13:27
- The processing isn't especially complex, and the imports aren't too frequent. I've just read about materialised views; would their persistence give me the benefits of a second table without the added complexity of having multiple tables for the same dataset? – TroyHurts, Mar 10, 2017 at 13:47
- I wouldn't go that route, although it would give similar performance. Have a data_import table, transform the data into data_permanent, and then truncate data_import until the next time you have new data. I'm not familiar with PostgreSQL, but I know that in MSSQL materialized views have some caveats which make working with them less than ideal. They have their place, but they wouldn't be my first choice in the scenario you describe. – Jonathan Fite, Mar 10, 2017 at 14:31
1 Answer
The common way is to use a staging table (the table you called "data_import", which conventionally would be called "data_stg" or some similar variation).
Preferably, no "updating" would be involved (maybe it was just a poor choice of words), only a single insert into ... select ....
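As a rough illustration only (PostGIS assumed; the table layout, column names and scaling are made up for the example), the transforming insert could look something like this:

```sql
-- Final table holding the processed columns.
CREATE TABLE data (
    geom           geometry(Point, 4326),
    reading_scaled double precision
);

-- One transforming insert from the staging table; no UPDATE involved.
INSERT INTO data (geom, reading_scaled)
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326),
       reading * 0.001
FROM   data_stg;
```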
- Ah, updating/inserting was my mistake; I've not fully grokked the distinction yet. Is the expectation that the staging table would be cleared after the insert operation every time? – TroyHurts, Mar 10, 2017 at 14:02
- Generally yes, unless there is a pressing need to retain the data as it was originally imported. Even then, you could decide whether to keep the raw CSV files or the information contained in data_stg. An important consideration, especially if this is going to be automated, is what steps will be involved when the process breaks somewhere. You will need a method to either cleanly resume the process at any step or clean out the badly imported data and try again. – Jonathan Fite, Mar 10, 2017 at 16:31
- Sorry, I forgot to get back to you. My recommendation is to leave the staging table as-is after loading and truncate it as part of the next load. That way you'll be able to do some fast verification if any issue arises. I also recommend saving the source files. – David דודו Markovitz, Mar 10, 2017 at 16:39
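Putting the advice in these comments together, a single load might be sketched like this (hypothetical table and column names as above; in PostgreSQL even TRUNCATE is transactional, so a failed load can simply be rolled back and retried):

```sql
BEGIN;

-- Clear out the previous load only now, so its rows stayed available
-- for verification up to this point.
TRUNCATE data_stg;

-- Bring in the new file exactly as delivered (path is a placeholder).
COPY data_stg FROM '/path/to/new_points.csv' WITH (FORMAT csv, HEADER true);

-- Transform into the permanent table in one statement.
INSERT INTO data (geom, reading_scaled)
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326),
       reading * 0.001
FROM   data_stg;

COMMIT;  -- on any error, ROLLBACK leaves both tables untouched
```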