Modeling a CSV file: What is the standard? Python or SQL?

Question 1

I have a wide CSV file of about 350 MB and want to load it into a SQL database, then model the data properly so that it is easier to use for analysis.

I could split the data into tables with Python and then load them into SQL,
or load the file into the database as a single table and then split it using SQL.

What would be the standard approach? Or how should I choose?

Question 2

You only expect to do this once? Use whatever gets the job done fastest.

Question 3

For the current task I plan to do it only once, but I imagine that I'll have to perform similar tasks in the future. Thanks for the insight though, it's definitely a viable option.

Question 4

"properly model the data" without even a hint of what that means leaves a door open big enough to drive a Bagger 288 through.

Question 5

Load it all into SQL and then work on it. The reason is that you will want to report on the rows that fail. Having the raw data in a "rawData" table makes it easy to say "couldn't parse row 123" and work out what the issue is, rather than "errored on line 12312412, col x, start again!"
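A minimal sketch of that staging idea, assuming SQLite via Python's standard `sqlite3` module. The column names, the sample data, and the helpers `load_raw` and `find_bad_rows` are hypothetical; only the "rawData" table name comes from the comment above. Every value is stored as text together with its source line number, so bad rows can be found with a query afterwards instead of aborting the load.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,carol,7.25
"""

def load_raw(conn, lines):
    """Copy every CSV row into a staging table as text, keeping its line number."""
    conn.execute(
        "CREATE TABLE rawData (line_no INTEGER, order_id TEXT, customer TEXT, amount TEXT)"
    )
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # reader.line_num is the source line of the row that was just read
    rows = ((reader.line_num, *row) for row in reader)
    conn.executemany("INSERT INTO rawData VALUES (?, ?, ?, ?)", rows)

def find_bad_rows(conn):
    """Return line numbers whose amount is empty or contains non-numeric characters."""
    cur = conn.execute(
        "SELECT line_no FROM rawData "
        "WHERE amount = '' OR amount GLOB '*[^0-9.]*' "
        "ORDER BY line_no"
    )
    return [line_no for (line_no,) in cur]

conn = sqlite3.connect(":memory:")
load_raw(conn, io.StringIO(SAMPLE_CSV))
bad_lines = find_bad_rows(conn)
print("could not parse rows:", bad_lines)
```

The sketch assumes every row has exactly three fields; rows with a different field count would need to be caught before the insert, for example by storing the whole line in a single text column.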

Question 6

It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is

There is no standard.

I have actually seen both kinds of approaches working (and working well) for comparable tasks: ETL processes can be designed with "Extract" and "Transform" fully inside the database server using stored procedures, or on a client, with whatever programming language one is most familiar with.

So as long as your superiors don't tell you "at our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.
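As an illustration of the client-side variant, here is a rough sketch that splits a wide file into two tables in Python before loading. It assumes SQLite and a made-up schema; the column names and the helpers `split_rows` and `load` are hypothetical and not taken from the question.

```python
import csv
import io
import sqlite3

# Hypothetical wide input: customer details are repeated on every order row.
WIDE_CSV = """order_id,customer_name,customer_city,amount
1,alice,Oslo,10.50
2,bob,Lima,3.00
3,alice,Oslo,7.25
"""

def split_rows(lines):
    """Split wide rows into a customers mapping and a list of order tuples."""
    customers = {}  # (name, city) -> surrogate customer_id
    orders = []
    for row in csv.DictReader(lines):
        key = (row["customer_name"], row["customer_city"])
        # Reuse the existing id for a known customer, otherwise assign the next one.
        customer_id = customers.setdefault(key, len(customers) + 1)
        orders.append((int(row["order_id"]), customer_id, float(row["amount"])))
    return customers, orders

def load(conn, customers, orders):
    """Create the two target tables and bulk-insert the split data."""
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(cid, name, city) for (name, city), cid in customers.items()],
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

conn = sqlite3.connect(":memory:")
customers, orders = split_rows(io.StringIO(WIDE_CSV))
load(conn, customers, orders)
```

The in-database variant would do the same split after loading the file into one staging table, typically with `INSERT INTO ... SELECT DISTINCT ...` statements; which of the two is easier depends mostly on what you and your team are familiar with.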

Question 7

Thank you, it's good to know that both approaches work well in production. Do you know if pandas would be a good tool to use for the Python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike Spark.

Question 8

@HappilyCoding: No, I have not used pandas in the past. But even if I had, I think it is never a good idea to give recommendations for or against a tool without knowing whether the tool fits the requirements.

Question 9

@HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on a set of CSV files.
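To give a feel for ad hoc SQL over a CSV file without a database server, here is a sketch that uses a throwaway in-memory SQLite database from the standard library. This is a stand-in technique and not the csvquery API, which is not shown here; the helper `csv_to_memory_db`, the table name `data`, and the sample data are hypothetical.

```python
import csv
import io
import sqlite3

def csv_to_memory_db(lines, table="data"):
    """Load CSV lines into a throwaway in-memory SQLite table named `table`."""
    reader = csv.reader(lines)
    header = next(reader)
    # Quote identifiers so headers with spaces still work; values stay untyped text.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn = sqlite3.connect(":memory:")
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn

SAMPLE = "city,amount\nOslo,10.5\nLima,3\nOslo,7.25\n"
conn = csv_to_memory_db(io.StringIO(SAMPLE))
totals = conn.execute(
    "SELECT city, SUM(CAST(amount AS REAL)) FROM data GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Nothing is written to disk, so the data disappears when the connection is closed. For a 350 MB file this still reads everything into memory once, which may or may not be acceptable.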

Doc Brown · Accepted Answer · 2023-12-22 17:11:00Z

It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is

There is no standard.

I have actually seen both kinds of approaches working (and working well) for comparable tasks: ETL processes can be designed with "Extract" and "Transform" fully inside the database server using stored procedures, or on a client, with whatever programming language one is most familiar with.

So as long as your superiors don't tell you "at our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.

Thank you, it's good to know that both approaches work well in production. Do you know if pandas would be a good tool to use for the Python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike Spark.
– HappilyCoding, commented Dec 22, 2023 at 17:36

@HappilyCoding: No, I have not used pandas in the past. But even if I had, I think it is never a good idea to give recommendations for or against a tool without knowing whether the tool fits the requirements.
– Doc Brown, commented Dec 22, 2023 at 19:22

@HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on a set of CSV files.
– JimmyJames, commented Dec 27, 2023 at 16:27
