To reap the benefits of numpy, you need to use numpy arrays and functions throughout.

The simplest fix is using numpy.diff:

import numpy as np

for _ in range(10000):
    input = 100 * np.random.random(size=190)
    # Original list-based version:
    # size = len(input) - 1
    # differences = [(input[i + 1] - input[i]) for i in np.arange(0, size)]
    differences = np.diff(input)
    good_sigma = np.std(differences[20:60])
    upperbound = 3 * good_sigma
    lowerbound = -3 * good_sigma
    discard = np.where((lowerbound < differences) & (differences < upperbound))
    average_discard = np.average(input[np.min(discard):])

Swapping the two difference calls results in 1.8 s for your list version vs 0.6 s using numpy.diff (on my machine). Note that this includes the generation of the ten thousand random input vectors.
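For reference, here is a minimal sketch of how such a comparison could be timed with the timeit module; the exact setup is an assumption, not the benchmark I ran, and the numbers will vary per machine:

import timeit

setup = "import numpy as np"

list_version = """
inp = 100 * np.random.random(size=190)
differences = [inp[i + 1] - inp[i] for i in range(len(inp) - 1)]
"""

diff_version = """
inp = 100 * np.random.random(size=190)
differences = np.diff(inp)
"""

# number=10000 matches the ten thousand iterations of the loop above
print(timeit.timeit(list_version, setup=setup, number=10000))
print(timeit.timeit(diff_version, setup=setup, number=10000))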

Apart from that, here are a few more tips:

  1. Avoid shadowing built-in functions/variables. `input` is already a built-in function that reads user input.

  2. While writing 0 + 3 * sigma helps the reader realize that the mean is at zero, a simple comment is enough for that and saves a few unneeded cycles.

  3. In general, comments should explain *why* you are doing something the way you are doing it, instead of *what* you are doing. The latter should be clear from the code itself (which is not really the case with your code). This is especially important, for example, for the np.std call. While you explain that the standard deviation is a good way to measure spread, you don't explain why you only take it for `differences[20:60]`. If it is not possible to understand what the code does from looking at it, encapsulate the hard-to-understand code in a properly named function with a descriptive docstring (see the function sketch after the final code block below).

  4. Try to avoid creating copies of arrays by slicing. (Note that basic slicing of a numpy array returns a view, while slicing a Python list copies the data.)

  5. Some small improvements can be made by getting the right index of where to start discarding (this assumes that later you also discard all values after the first to-be-discarded value). For this I used the slightly faster numpy.argmax, as described [here][1]. It could be even faster if your array of floats were sorted. argmax works here because True == 1, so it returns the index of the first maximum that appears (see the short sketch after this list).

  6. The algorithm as written starts discarding whenever the difference is smaller than three sigma, whereas normally you would probably want to start discarding as soon as the difference is larger than three sigma. For this you would have to invert the comparison operators (not done in the code directly below, but shown in the function sketch at the end).
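A minimal illustration of the argmax trick from point 5 (the array here is just made-up example data):

import numpy as np

mask = np.array([False, False, True, False, True])
# True == 1, so the first True is the first maximum;
# argmax returns its index (here: 2)
print(np.argmax(mask))
# Careful: if no element is True, argmax returns 0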

Putting it all together:

import numpy as np

array_of_floats = 100 * np.random.random(size=190)
differences = np.diff(array_of_floats)
three_sigma = 3 * np.std(differences[20:60])  # choose three sigma region around mean of zero
upperbound = three_sigma
lowerbound = -three_sigma
discard = np.argmax((lowerbound < differences) & (differences < upperbound))
average_discard = np.average(array_of_floats[discard:])
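As an illustration of points 3 and 6, the logic could also be wrapped in a named function with a docstring. This is only a sketch; the name first_outlier_index, the calibration-slice parameter, and the inverted comparison (per point 6) are my assumptions, not part of the original code:

import numpy as np

def first_outlier_index(values, calibration=slice(20, 60), n_sigma=3):
    """Return the index of the first difference that falls outside
    n_sigma standard deviations of the noise.

    The differences in the calibration slice are assumed to be pure
    noise with zero mean, so their standard deviation estimates the
    noise level of the whole series.
    """
    differences = np.diff(values)
    bound = n_sigma * np.std(differences[calibration])
    # True == 1, so argmax yields the first index outside the bounds
    return np.argmax((differences < -bound) | (differences > bound))

array_of_floats = 100 * np.random.random(size=190)
average_discard = np.average(array_of_floats[first_outlier_index(array_of_floats):])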
