python or database?

Question 1

I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data. The script takes around 1 minute to process the 100 MB file. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.

Can I continue to use Python on a 2 GB file, or should I move the data into a database?

Question 2

Is your data changing at all? In other words do the old raw rows change over time?

Question 3

No, old rows do not change over time.

Question 4

Do you need to do the calculations/download the charts for all rows, or only the new entries in the log? Also do the charts change over time?

Question 5

Are your calculations intense crunching (lots of floating point stuff / simulations / model scoring) or simple sums / counts / trends / groupings?

Question 6

I'm taking logarithms for every row in the data set, and then doing some simple stuff, like multiplication.

Question 7

I don't know exactly what you are doing, but a database will just change how the data is stored. In fact, it might take longer, since most reasonable databases put constraints on columns and do additional processing for the checks. In many cases, having the whole file local and going through it doing calculations is going to be more efficient than querying the database and writing results back to it (subject to disk speeds, network and database contention, etc.). But in some cases the database may speed things up, especially because with indexing it is easy to get subsets of the data.

Anyway, you mentioned logs, so before you go database crazy I have the following ideas for you to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download charts and expect that to grow to 2 GB, or whether you eventually expect 2 GB of traffic per day/week.

ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time going through the file to find the small piece you need, this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a speedup of a factor of 30 or more. This will probably reduce the time immediately. But as data creeps up over time, some day this will slow down as well. If you have no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. But it might give you 10x the time...
You also may want to figure out the bottleneck (is it disk access or CPU time?) and, based on that, work out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers); if it is disk access, consider splitting the file among multiple machines. It really depends on your situation. But I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, just store the results. Whether you use a database or a file, this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional GETs using the If-Modified-Since request header, and then only download changed items. If you are just processing new charts, ignore this suggestion.
Oh, and if you are sequentially reading a giant log file line by line, looking for a specific place in the log, just make another file that stores the last file position you worked with and then do a seek on each run.
Before going to a full database, you may want to think about SQLite.
Finally, a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you and your boss will have moved on. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it were 6 months I'd say fix it. But for a couple of years, in most cases, I'd say just use the solution you have now, and once it gets too slow look to do something else. You could put a comment in the code with your thoughts on the issue and even send an e-mail to your boss so he knows about it as well. But as long as it works and will continue to do so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if the data grows unbounded you will need to reconsider it: adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
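
The "store the last file location and seek" idea from the list above could look roughly like this. It is only a sketch under assumptions that are not in the question: the log is append-only, lines end in a newline, and the names `read_new_lines`, `log_path` and `offset_path` are made up for illustration.

```python
import os


def read_new_lines(log_path, offset_path):
    """Return the complete lines appended to log_path since the previous run.

    offset_path is a small side file (hypothetical name) holding the byte
    offset where the previous run stopped.
    """
    # Load the offset saved by the previous run; start at 0 on the first run.
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    # If the log was archived or truncated since then, start from the top.
    if offset > os.path.getsize(log_path):
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)   # jump straight past the rows already processed
        data = log.read()  # only the newly appended bytes

    # Consume complete lines only; a half-written last line waits for next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()

    # Remember where to resume next time.
    with open(offset_path, "w") as f:
        f.write(str(offset + end))
    return lines
```

Each run then only touches the rows added since the last one, which matters more than raw speed if the old rows never change.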

Question 8

Java or C is faster than Python by 30x?!

Question 9

Today's Great Language Shootout has the fastest program beating Python by 10x. Python is quite slow quite often.

Question 10

Depending upon what you are doing it can be. Compiled languages have a big advantage for tight loops and calculations. For those types of things a 10x+ difference is not unheard of.

Question 11

shootout.alioth.debian.org/u32/…

Question 12

@Paul nathan - Wow. Actually that's where I saw 30x in speed in some test between C and Python (although not recently). Only 10x between C and Python is a huge improvement on Python's part...

Question 13

If you need to go through all lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.

Perhaps you could store the results of your calculations somehow, then a database would probably be nice. Also, databases have methods for ensuring data integrity and stuff like that, so a database is often a great place for storing large sets of data (duh! ;)).

Question 14

+1 for "store the results of your calculations". I'll point out that it's also possible for a file if you choose to add them to your file at the end of the calculation, so it's a wash.

Question 15

Yeah :) And of course a database is just some fancy algorithms and "a file" in the end. So you can reinvent the database using python if you want (it actually sounds fun...).

Question 16

Databases are usually written in compiled languages, and for a sort, compiled languages and Python are orders of magnitude apart. Also, databases can sometimes automatically parallelize things across processors/disks for you. But at the same time a database is mostly just another way to store the data. Unless there is some specific way you plan to take advantage of something it provides for speed, it's not going to magically make things faster. On a per-record basis, even scripting languages often beat SQL cursors.

Question 17

I'd only put it into a relational database if:

The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.

If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.

If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.

2 GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Accessing the data from a database won't by itself solve a paging problem.
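
One way to sidestep the paging concern, if the calculation only needs one row at a time, is to stream the file instead of building a list of lists. The sketch below assumes a made-up layout (the number to take the logarithm of is in column 1), and the function names are hypothetical.

```python
import csv
import math


def iter_log_values(csv_path, column=1):
    """Yield log(value) for one numeric column, one row at a time.

    The column position is an assumption for illustration; only the current
    row is held in memory, not the whole file.
    """
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            yield math.log(float(row[column]))


def sum_of_logs(csv_path, column=1):
    """Example aggregate that never materializes the full data set."""
    return sum(iter_log_values(csv_path, column))
```

If the real calculation needs several passes or random access to earlier rows, this does not apply directly, and that is where measuring memory use first becomes important.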

If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".

I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
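
For the CPU side, the standard library's `cProfile` is one place to start. This is a minimal sketch: `crunch` is only a stand-in for the real per-row work (a logarithm and a multiply, as described in the question), and `profile_call` is a hypothetical helper name.

```python
import cProfile
import io
import math
import pstats


def crunch(rows):
    """Stand-in for the real per-row work: a logarithm and a multiply."""
    return [math.log(x) * 2.0 for x in rows]


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    _, report = profile_call(crunch, [float(i) for i in range(1, 200001)])
    print(report)
```

Wrapping the CSV loading, the calculation, and the chart downloads in separate functions makes the report show which of the three actually dominates the one-minute run.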

Question 18

Correct me if I am wrong, but a database would be better, for example, on a huge file that requires you to sort stuff, right?

Question 19

The answer depends on the file and the schema. You're correct that databases are good at sorting, but there are other considerations: indexing, number of JOINs, etc.

Question 20

It's flat. There's no relational data.

Question 21

Databases are often really good at sorting huge amounts of data. Sorting a big ol' list in python would probably not be very efficient if the list doesn't fit in your memory for example. Also, indexing would allow you to search your data efficiently.

Question 22

@Andre - agreed, but there's no indication that the data processing has to sort or that the calculations depend on the data being in sorted order.

Question 23

I always reach for a database for larger datasets.

A database gives me some stuff for "free"; that is, I don't have to code it.

searching
sorting
indexing
language-independent connections

Something like SQLite might be the answer for you.
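
As a rough idea of what the SQLite route involves, Python's built-in `sqlite3` module is enough to try it. The sketch assumes a made-up two-column layout (a timestamp string and a number); the table, index, and function names are hypothetical.

```python
import csv
import sqlite3


def load_csv(conn, csv_path):
    """Copy (timestamp, value) rows from a CSV file into an indexed table."""
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts TEXT, value REAL)")
    # The index is what makes "only the rows after X" cheap to fetch.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_log_ts ON log (ts)")
    with open(csv_path, newline="") as f:
        rows = ((ts, float(v)) for ts, v in csv.reader(f))
        conn.executemany("INSERT INTO log (ts, value) VALUES (?, ?)", rows)
    conn.commit()


def values_since(conn, ts):
    """Return the values with timestamp >= ts, in timestamp order."""
    cur = conn.execute(
        "SELECT value FROM log WHERE ts >= ? ORDER BY ts", (ts,)
    )
    return [value for (value,) in cur]
```

Usage would be along the lines of `conn = sqlite3.connect("log.db")`, then `load_csv(conn, "data.csv")` once and `values_since(conn, "2010-08-02")` on later runs. Whether this beats a plain file depends on whether the script really needs subsets or sorted access.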

Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.

Question 24

Also, databases give you stuff you don't ask for, like concurrency, locking, constraints, etc. Mostly you want these, but coming from a text file they add extra stuff you don't want. Definitely explore optimizing your text file first, then NoSQL and SQLite solutions, and finally databases. Although I think that for just a speedup a database won't help. You could probably do faster sorting on your own. 4 GB already fits into memory, so a quicksort (even two quicksorts and a merge) would probably beat a database sort.

Question 25

Err assuming you aren't using Python to do that sort... In that case the compiled advantage may make even a database sort quicker than Python for large numbers of records....

Question 26

NoSQL is a category of database management systems. Usually they don't have relational constraints, and often they don't provide ACID guarantees.

Question 27

@Cervo: "NoSQL" == "Not Only SQL". Look at CouchDB, Voldemort, Neo4J, Hadoop, BigTable, etc. nosql-database.org

Question 28

I was thinking of some of the simpler NoSQL solutions. Generally any database comes with the whole transaction processing/locking baggage and data integrity checking, but not all NoSQL solutions come with all that. Some are more complex than others and are made for handling different aspects of transactions. I was thinking more of something super simple like BDB (well, I don't think that would apply to this problem) than something like BigTable or Cassandra.

Question 29

At 2 GB, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.

This is a matter of personal preference, but I would go with something like PostgreSQL because it integrates the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access db was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. That's not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.

Hope that helps with your decision-making!

Question 30

Thank you very much, this is very helpful. Can you give me an example of why Postgres is better?

Question 31

I wouldn't say Postgres is better than, for example MySQL or even Oracle. For me it was cost. Postgres is open source and my database is non-commercial, so I wanted to keep things as transparent and flexible as possible. I also like PostgreSQL's interface and from a usability standpoint it matched my learning curve.

Question 32

I think duffymo's explanation covers it. Relational databases are super powerful and will handle many of the tasks you are asking Python to do. However, if you are simply interested in storage and reference, with little to no use for querying/calculating, a database buys you much less. My assumption was that you would eventually be performing calculations and adding/changing data, which is why I recommended going with an RDBMS.

Cervo · Accepted Answer · 2010-08-05 23:00:09Z

