I am learning about timing analysis in Python. I used the time module before it was pointed out that the timeit module would be better for timing analysis.
My main focus is conducting many test cases on 2 variations of the same function. In the example code below I used 2 variations of a function for reversing a number.
Can the timing analysis be done in a better way? Is the method scalable to many test cases?
def reverse_num(num):
    '''returns the reverse of an integer'''
    if num < 0:
        return -int(str(-num)[::-1])
    else:
        return int(str(num)[::-1])
def reverse_num2(num):
    rev = 0
    while num > 0:
        rev *= 10
        rev += num % 10
        num //= 10  # floor division (behaves the same on Python 2 and 3)
    return rev
if __name__ == "__main__":
    from timeit import Timer

    def test(f):
        for i in xrange(1000):
            f(i)

    print Timer(lambda: test(reverse_num)).timeit(number = 100)
    print Timer(lambda: test(reverse_num2)).timeit(number = 100)
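Since I asked about scalability: one variation I considered (just a sketch, reusing the Timer import and the test helper defined above) is to drive the comparison with a loop instead of one print per function, so adding another variant only means adding one name to the tuple:
for f in (reverse_num, reverse_num2):
    # f.__name__ labels each timing; Timer is called exactly as before
    print f.__name__, Timer(lambda: test(f)).timeit(number = 100)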
I am posting sample test runs of my usage of the timeit module for timing the following code, which I wrote for this question.
def mysort(words):
    mylist1 = sorted([i for i in words if i[:1] == "s"])
    mylist2 = sorted([i for i in words if i[:1] != "s"])
    return mylist1 + mylist2
def mysort3(words):
    ans = []
    p = ans.append
    q = words.remove
    words.sort()
    for i in words[:]:
        if i[0] == 's':
            p(i)
            q(i)
    return ans + words
def mysort4(words):
    ans1 = []
    ans2 = []
    p = ans1.append
    q = ans2.append
    for i in words:
        if i[0] == 's':
            p(i)
        else:
            q(i)
    ans1.sort()
    ans2.sort()
    return ans1 + ans2
def mysort6(words):
    return (sorted([i for i in words if i[:1] == "s"]) +
            sorted([i for i in words if i[:1] != "s"]))
if __name__ == "__main__":
    from timeit import Timer

    def test(f):
        f(['a', 'b', 'c', 'abcd', 's', 'se', 'ee', 'as'])

    print Timer(lambda: test(mysort)).timeit(number = 10000)
    print Timer(lambda: test(mysort3)).timeit(number = 10000)
    print Timer(lambda: test(mysort4)).timeit(number = 10000)
    print Timer(lambda: test(mysort6)).timeit(number = 10000)
The timing results are
>>> ================================ RESTART ================================
>>>
0.0643831414457
0.0445699517515
0.0446611241927
0.0616633662243
>>> ================================ RESTART ================================
>>>
0.0625827896417
0.045101588386
0.0443568108447
0.0607123363607
>>> ================================ RESTART ================================
>>>
0.0647336488305
0.0596154305912
0.0445711673841
0.0614101094434
>>> ================================ RESTART ================================
>>>
0.0689924148581
0.0502542495475
0.0443466805735
0.0903267660811
>>> ================================ RESTART ================================
>>>
0.0695374234506
0.0579001730656
0.0443790974415
0.0835670386907
>>> ================================ RESTART ================================
>>>
0.0675612101379
0.05079925814
0.044170413854
0.0681050030978
As you can see, the results vary from run to run. I understand that the individual times are small, but that is the whole purpose of number = 10000 in timeit: to measure consistently by repeating the timing many times and taking the average or the best. In most runs the 2nd function takes more time than the 3rd, but sometimes it doesn't. The 4th function sometimes takes more time than the 1st and sometimes less. I think my usage is correct, so why these variable results? Is this the proper way of doing timing or not? I am really confused about this.
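One mitigation I found in the timeit documentation (a sketch, reusing the test helper and mysort from above): Timer.repeat() re-runs the whole timeit() measurement several times, and the docs suggest looking at the min of the results rather than the mean, since higher values are usually caused by other processes rather than by Python itself.
from timeit import Timer

# repeat() returns a list of 5 timings; the min is the run that
# was least disturbed by other activity on the machine.
t = Timer(lambda: test(mysort))
print(min(t.repeat(repeat=5, number=10000)))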
After some discussion in the chat room and some more thinking, I came up with the following for testing functions that take integer inputs. Additional nested functions can be defined for more tests, and the number of repetitions can be increased. It may take some time, but it gives good results as far as I can see. For variations of the same function, only the corresponding function calls need to be added.
def testing():
    from timeit import Timer
    import random

    def tests(f, times):
        def test1(f):
            f(random.randint(1, 1000))

        def test2(f):
            f(random.randint(100000, 1000000))

        print(f.__name__)
        print(Timer(lambda: test1(f)).timeit(number = times))
        print(Timer(lambda: test2(f)).timeit(number = times // 10))
        print()

    tests(reverse_num, 10000)
    tests(reverse_num2, 10000)
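One caveat I noticed with this harness: random.randint is called inside test1 and test2, so its cost is included in every measured call. A variant that pre-generates the inputs (a sketch; tests_pregen is just my own name for it) keeps the random number generation out of the timing:
def tests_pregen(f, times):
    from timeit import Timer
    import random
    # Generate all inputs up front so random.randint itself is not
    # part of what the Timer measures.
    small = [random.randint(1, 1000) for _ in range(times)]
    large = [random.randint(100000, 1000000) for _ in range(times // 10)]

    def run(nums):
        for n in nums:
            f(n)

    print(f.__name__)
    print(Timer(lambda: run(small)).timeit(number = 10))
    print(Timer(lambda: run(large)).timeit(number = 10))
    print()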
Please note that I started using Python 3 while I was still discussing the question. I won't be using anything specific to Python 3, so all the code can be run on Python 2; anyone wanting to run the newer code will just need to change the print statements. Nothing else.
- "My usage of timeit module gave variable results on every run." How many tests/how long runs were they? What was the relative difference? – Lstor, Jul 17, 2013 at 11:47
- Updated the question. – Aseem Bansal, Jul 17, 2013 at 14:20
1 Answer
Software timing benchmarks are notoriously noisy because your computer isn't a very good controlled experimental platform.
- Some algorithms are nondeterministic. For instance, some quicksort implementations choose a random pivot element.
- While you're running your benchmarks, your Python process may get preempted by other running applications, operating system threads, etc.
- Even if your process got the same amount of CPU during every run, the timing may be different because a different mix of applications is running so your processor's memory caches have different hit/miss patterns.
- Some runs may get better branch prediction than others (again, due to the mix of other processes running). This will cause more pipeline stalls.
I suspect the effects above are listed in order of decreasing strength (except that I believe Python's sort algorithm is deterministic, so the first effect is not a factor here). I also suspect that I've just scratched the surface.
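You can see this noise directly by repeating an identical measurement several times and looking at the spread; a sketch using timeit.repeat (the statement being timed is an arbitrary example):
from timeit import repeat

# Time the same statement 10 times; the gap between min and max is
# the measurement noise from the effects listed above, and the min
# is the least-disturbed run.
results = repeat(stmt="int(str(12345)[::-1])", repeat=10, number=100000)
print min(results), max(results)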
EDIT:
OK, for reversing a number: suppose that your typical input will be 5 digits; then you can use this.
if __name__ == "__main__":
    from timeit import Timer
    import random

    def test(f):
        f(random.randint(10000, 99999))

    print Timer(lambda: test(reverse_num)).timeit(number = 10000000)
    print Timer(lambda: test(reverse_num2)).timeit(number = 10000000)
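And if you want to compare behavior on small vs. large inputs, the same pattern extends naturally; a sketch with arbitrary 2-digit and 10-digit ranges:
from timeit import Timer
import random

def test_small(f):
    f(random.randint(10, 99))             # 2-digit inputs

def test_large(f):
    f(random.randint(10**9, 10**10 - 1))  # 10-digit inputs

for f in (reverse_num, reverse_num2):
    print f.__name__
    print Timer(lambda: test_small(f)).timeit(number = 100000)
    print Timer(lambda: test_large(f)).timeit(number = 100000)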
- Thanks for answering. I thought it was never going to be answered by anyone. Just one thing: if I comment, could you please reply? You never replied to any of my comments on your last answer. – Aseem Bansal, Jul 18, 2013 at 6:52
- Does this mean there is no hope of any proper timing analysis for finding better implementations of an algorithm? Large differences are obviously noticeable, but I wanted to know about timing many test cases. How can I find which one is better across test cases? How do I test small functions? How does anyone do proper timing for various test cases? How do people find that this function works better for small values and that function works better for large values? I think these questions explain my confusion; if not, please reply and I'll try to elaborate. – Aseem Bansal, Jul 18, 2013 at 6:55
- I think that "no hope" is a bit strong. If your benchmarks are so close that they sometimes cross one another, you can probably conclude that neither approach is significantly more efficient than the other. – ruds, Jul 18, 2013 at 7:39
- As for small functions, it helps to run the function many times, as you do. To determine how different algorithms do on small vs. large inputs, the usual practice is to come up with a random function that produces small values and test against many different small inputs, and then come up with a random function that produces large values and test against many different large values. – ruds, Jul 18, 2013 at 7:41
- Can you come into this chat room? Otherwise the comments will be deleted due to long discussions. – Aseem Bansal, Jul 18, 2013 at 7:55