sorting 1172026 entries

Sun May 6 19:54:16 EDT 2012

On 06May2012 18:36, J. Mwebaze <jmwebaze at gmail.com> wrote:
| > for filename in txtfiles:
| >     temp=[]
| >     f=open(filename)
| >     for line in f.readlines():
| >         line = line.strip()
| >         line=line.split()
| >         temp.append((parser.parse(line[0]), float(line[1])))

Have you timed the different parts of your code instead of the whole
thing?
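Specifically, do you know the sort time is the large cost?
I would point out that the loop above builds the list by append(), one
item at a time. I half expected that to have a runtime cost of the square
of the list length, 1172026 * 1172026, but list.append() is amortised
constant time in CPython, so the loop as a whole should be roughly linear
in the number of items. Indeed, I've just done this: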
 [Documents/python]oscar1*> python
 Python 2.7.3 (default, May 4 2012, 16:19:02) 
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> L1 = []
 >>> for i in range(1000000): L1.append(0)
 ... 
and it only took a few seconds.
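
To check how append() scales rather than just timing one size, something
like the following rough sketch could be used. The function name
time_appends and the chosen sizes are made up for illustration; if the
cost were quadratic, doubling n would roughly quadruple the time, whereas
linear behaviour roughly doubles it.

```python
import time

def time_appends(n):
    """Return the seconds taken to append n items to an empty list."""
    start = time.time()
    items = []
    for i in range(n):
        items.append(0)
    return time.time() - start

# Hypothetical sizes: each one is double the previous.
for n in (100000, 200000, 400000):
    print("%d appends: %.3f s" % (n, time_appends(n)))
```

The absolute numbers depend on the machine; only the ratio between
successive lines matters here.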
As pointed out by others, the readlines() is also a little expensive,
conceivably similarly so (it also needs to build a huge list).
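
One way to avoid the big intermediate list from readlines() is to iterate
over the file object directly, which yields one line at a time. This is
only a sketch: read_pairs is a made-up name, the demo file is a throwaway,
and the first field is kept as a plain string where the original code
calls parser.parse() on it.

```python
import os
import tempfile

def read_pairs(filename):
    """Read 'key value' lines, iterating over the file object directly."""
    pairs = []
    with open(filename) as f:
        for line in f:              # no intermediate list of all lines
            fields = line.split()   # split() with no args ignores edge whitespace
            # Assumption: first field kept as a string, not parsed as a date.
            pairs.append((fields[0], float(fields[1])))
    return pairs

# Small demonstration with a throwaway file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("2012-05-06 1.5\n2012-05-07 2.25\n")
demo = read_pairs(path)
os.remove(path)
print(demo)
```

Whether this makes a measurable difference for a given file is exactly
the kind of thing worth timing before and after.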
Anyway, put some:
 print time.time()
at various points. Not in the inner bits of the loops, but around larger
chunks, for example:
 from time import time
 temp=[]
 f=open(filename)
 print "after open", time()
 lines = f.readlines()
 print "after readlines", time()
 for line in lines:
     line = line.strip()
     line=line.split()
     temp.append((parser.parse(line[0]), float(line[1])))
 print "after read loop", time()
and so on. At least then you will have more feel for what part of your
code is taking so long.
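
If scattering print calls gets tedious, the same idea can be wrapped in a
small helper. This is a sketch only: timed and timings are names invented
here, and the build and sort steps use made-up data standing in for the
real read loop and sort.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds, filled in as each chunk finishes

@contextmanager
def timed(label):
    """Time the enclosed block and report it under the given label."""
    start = time.time()
    try:
        yield
    finally:
        timings[label] = time.time() - start
        print("%s: %.3f s" % (label, timings[label]))

# Stand-in data; the real code would do the read loop and the sort here.
with timed("build"):
    data = [(i % 1000, float(i)) for i in range(200000)]
with timed("sort"):
    data.sort()
```

Comparing the "build" and "sort" figures shows directly whether the sort
is the expensive part or not.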
Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
The shortest path between any two truths in the real domain passes through
the complex domain. - J. Hadamard