I have a wide CSV file of about 350 MB and want to load it into a SQL database and model the data properly, to make it easier to use for analysis.
- I could split the data into tables with Python and then load those into SQL.
- Or I could load the file into the database as a single table and then split it using SQL.
What would be the standard approach? Or how should I choose?
- You only expect to do this once? Use whatever gets the job done fastest. – Thorbjørn Ravn Andersen, Dec 22, 2023 at 16:34
- For the current task I plan to do it only once, but I imagine that I'll have to perform similar tasks in the future. Thanks for the insight though, it's definitely a viable option. – HappilyCoding, Dec 22, 2023 at 17:32
- "Properly model the data" without even a hint of what that means leaves a door open big enough to drive a Bagger 288 through. – whatsisname, Dec 22, 2023 at 21:49
- Load it all into SQL and then work on it. The reason is that you will want to report on the lines that fail. Having the raw data in a "rawData" table makes it easy to say "couldn't parse row 123" and work out what the issue is, rather than "errored on line 12312412, col x, start again!" – Ewan, Dec 22, 2023 at 22:01
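To illustrate the "rawData" idea from the comment above, here is a minimal sketch, assuming SQLite via Python's standard `sqlite3` module and a made-up two-column file. The column names, the sample data, and the "digits and dots only" validity rule are all assumptions for illustration, not part of the original question.

```python
import csv
import io
import sqlite3

# Hypothetical sample file: the second data row has an unparseable amount.
SAMPLE_CSV = """id,amount
1,10.5
2,not-a-number
3,7
"""

def load_raw(conn, csv_text):
    """Load every row as plain text, so nothing is rejected during the load."""
    conn.execute(
        "CREATE TABLE rawData (line_no INTEGER PRIMARY KEY, id TEXT, amount TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # reader.line_num is the file line of the row just read (the header is line 1).
    conn.executemany(
        "INSERT INTO rawData (line_no, id, amount) VALUES (?, ?, ?)",
        ((reader.line_num, row["id"], row["amount"]) for row in reader),
    )
    conn.commit()

def find_bad_rows(conn):
    """Return (line_no, amount) for rows whose amount is empty or contains
    any character other than a digit or a dot (an assumed validity rule)."""
    return conn.execute(
        "SELECT line_no, amount FROM rawData "
        "WHERE amount = '' OR amount GLOB '*[^0-9.]*' "
        "ORDER BY line_no"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load_raw(conn, SAMPLE_CSV)
for line_no, amount in find_bad_rows(conn):
    print(f"couldn't parse line {line_no}: amount={amount!r}")
```

Because the raw rows stay in the database, the bad lines can be inspected and fixed with further queries instead of restarting the import.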
1 Answer
It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is:
There is no standard.
I have actually seen both kinds of approach working (and working well) for comparable tasks. ETL processes can be designed with "Extract" and "Transform" running fully inside the database server using stored procedures, or on a client, in whatever programming language one is most familiar with.
So as long as your superiors don't tell you "in our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.
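As a rough illustration that the two options from the question can end in the same place, here is a minimal sketch using only Python's standard library and an in-memory SQLite database. The table names, column names, and sample rows are hypothetical; a real 350 MB file and a real database server would need batching and proper type handling.

```python
import csv
import io
import sqlite3

# Hypothetical wide CSV: customer columns are repeated on every order row.
SAMPLE_CSV = """order_id,customer_name,customer_city,amount
1,Alice,Berlin,10.50
2,Bob,Paris,20.00
3,Alice,Berlin,5.25
"""

def create_schema(conn):
    """Create the modeled (normalized) target tables."""
    conn.executescript("""
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            city TEXT NOT NULL,
            UNIQUE (name, city)
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount REAL NOT NULL
        );
    """)

def split_in_python(conn, csv_text):
    """Option 1: split the rows on the client, then load the modeled tables."""
    customer_ids = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["customer_name"], row["customer_city"])
        if key not in customer_ids:
            cur = conn.execute(
                "INSERT INTO customers (name, city) VALUES (?, ?)", key
            )
            customer_ids[key] = cur.lastrowid
        conn.execute(
            "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
            (int(row["order_id"]), customer_ids[key], float(row["amount"])),
        )
    conn.commit()

def split_in_sql(conn, csv_text):
    """Option 2: load the file as one raw table, then split it with SQL."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer_name TEXT, customer_city TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders "
        "VALUES (:order_id, :customer_name, :customer_city, :amount)",
        csv.DictReader(io.StringIO(csv_text)),
    )
    conn.executescript("""
        INSERT INTO customers (name, city)
            SELECT DISTINCT customer_name, customer_city FROM raw_orders;
        INSERT INTO orders (id, customer_id, amount)
            SELECT CAST(r.order_id AS INTEGER), c.id, CAST(r.amount AS REAL)
            FROM raw_orders AS r
            JOIN customers AS c
              ON c.name = r.customer_name AND c.city = r.customer_city;
    """)
    conn.commit()

def run(split):
    """Apply one of the two options to a fresh database and return the result."""
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    split(conn, SAMPLE_CSV)
    return conn.execute(
        "SELECT o.id, c.name, c.city, o.amount "
        "FROM orders AS o JOIN customers AS c ON c.id = o.customer_id "
        "ORDER BY o.id"
    ).fetchall()
```

Calling `run(split_in_python)` and `run(split_in_sql)` should return the same joined rows, which is the point: the choice is mostly about where you prefer to write and debug the transformation logic.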
- Thank you, it's good to know that both approaches work well in production. Do you know if pandas would be a good tool to use for the Python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike Spark. – HappilyCoding, Dec 22, 2023 at 17:36
- @HappilyCoding: No, I have not used pandas in the past. But even if I had, I think it is never a good idea to give recommendations for or against a tool without knowing whether the tool fits the requirements. – Doc Brown, Dec 22, 2023 at 19:22
- @HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on a set of CSV files. – JimmyJames, Dec 27, 2023 at 16:27