-1

I have a wide CSV file of about 350mb, and want to load it into a SQL database and properly model the data to make it easier to use for analysis.

  1. I could split the data into tables with python and then loaded into sql
  2. Or load the file into the database as a table, and then split it using sql.

What would be the standard approach? Or how should I choose?

Philip Kendall
26.1k10 gold badges66 silver badges68 bronze badges
asked Dec 22, 2023 at 15:40
4
  • 2
    You only expect to do this once? Use whatever gets the job done fastest. Commented Dec 22, 2023 at 16:34
  • For the current task I plan to do it only once, but I imagine that I'll have to perform similar tasks in the future. Thanks for the insight though, its definitely a viable option. Commented Dec 22, 2023 at 17:32
  • "properly model the data" without even a hint what that means leaves a door open big enough to drive a Bagger 288 through. Commented Dec 22, 2023 at 21:49
  • 1
    load it all into sql and then work on it. The reason being, you will want to report on the errored lines. having the raw data in a "rawData" table makes it easy to say "couldn't parse row 123" and work out what the issue is, rather than errored on line 12312412, col x start again! Commented Dec 22, 2023 at 22:01

1 Answer 1

8

It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is

There is no standard.

I have actually seen both kind of approaches working (and working well) for comparable tasks - ETL processes can be designed with "Extract" and "Transform" fully inside the database server using stored procedures, or on a client, with whatever programming language one is most familiar with.

So as long as you superiors don't tell you "at our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.

answered Dec 22, 2023 at 17:11
3
  • Thank you, Its good to know that both approaches work well in production. Do you know if pandas would be a good tool to use in the python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike spark. Commented Dec 22, 2023 at 17:36
  • 2
    @HappilyCoding: No I did not use Pandas in the past. But even if I would have, I think it is never a good idea to give recommendations for or against a tool without knowing if the tool fits to the requirements. Commented Dec 22, 2023 at 19:22
  • @HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on set of CSV files. Commented Dec 27, 2023 at 16:27

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.