i am reading a csv file into a list of a list in python. it is around 100mb right now. in a couple of years that file will go to 2-5gigs. i am doing lots of log calculations on the data. the 100mb file is taking the script around 1 minute to do. after the script does a lot of fiddling with the data, it creates URL's that point to google charts and then downloads the charts locally.
can i continue to use python on a 2gig file or should i move the data into a database?
-
Is your data changing at all? In other words do the old raw rows change over time?mcpeterson– mcpeterson2010年08月05日 23:21:57 +00:00Commented Aug 5, 2010 at 23:21
-
no, old rows do not change over timeAlex Gordon– Alex Gordon2010年08月05日 23:23:15 +00:00Commented Aug 5, 2010 at 23:23
-
Do you need to do the calculations/download the charts for all rows, or only the new entries in the log? Also do the charts change over time?Cervo– Cervo2010年08月05日 23:26:14 +00:00Commented Aug 5, 2010 at 23:26
-
Are your calculations intense crunching (lots of floating point stuff / simulations / model scoring) or simple sums / counts / trends / groupings?mcpeterson– mcpeterson2010年08月05日 23:26:47 +00:00Commented Aug 5, 2010 at 23:26
-
im taking logarithms for every row in the data set, and then doing some simple stuff, like multiplyAlex Gordon– Alex Gordon2010年08月05日 23:29:59 +00:00Commented Aug 5, 2010 at 23:29
5 Answers 5
I don't know exactly what you are doing. But a database will just change how the data is stored. and in fact it might take longer since most reasonable databases may have constraints put on columns and additional processing for the checks. In many cases having the whole file local, going through and doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc...). But in some cases the database may speed things up, especially because if you do indexing it is easy to get subsets of the data.
Anyway you mentioned logs, so before you go database crazy I have the following ideas for you to check out. Anyway I'm not sure if you have to keep going through every log since the beginning of time to download charts and you expect it to grow to 2 GB or if eventually you are expecting 2 GB of traffic per day/week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup. This will probably reduce the time immediately. But over time as data creeps up, some day this will slow down as well. if you have no bound on the amount of data, eventually even hand optimized Assembly by the world's greatest programmer will be too slow. But it might give you 10x the time...
You also may want to think about figuring out the bottleneck (is it disk access, is it cpu time) and based on that figuring out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers), if it is disk access consider splitting the file among multiple machines...It really depends on your situation. But I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional gets using the if modified request. Then only download changed items. If you are just processing new charts then ignore this suggestion.
Oh and if you are sequentially reading a giant log file, looking for a specific place in the log line by line, just make another file storing the last file location you worked with and then do a seek each run.
Before an entire database, you may want to think of SQLite.
Finally a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on and your boss. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it. but for a couple of years, in most cases, I'd say just use the solution you have now and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue and even an e-mail to your boss so he knows it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it. Adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
7 Comments
If you need to go through all lines each time you perform the "fiddling" it wouldn't really make much difference, assuming the actual "fiddling" is whats eating your cycles.
Perhaps you could store the results of your calculations somehow, then a database would probably be nice. Also, databases have methods for ensuring data integrity and stuff like that, so a database is often a great place for storing large sets of data (duh! ;)).
3 Comments
I'd only put it into a relational database if:
- The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
- You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
- You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If neither of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Just because you access the data from a database won't solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
7 Comments
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
- searching
- sorting
- indexing
- language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
5 Comments
At 2 gigs, you may start running up against speed issues. I work with model simulations for which it calls hundreds of csv files and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostGreSql because it integrates the speed of python with the capacity of a sql-driven relational database. I encountered the same issue a couple of years ago when my Access db was corrupting itself and crashing on a daily basis. It was either MySQL or PostGres and I chose Postgres because of its python friendliness. Not to say MySQL would not work with Python, because it does, which is why I say its personal preference.
Hope that helps with your decision-making!