I have a set of samples that is sorted, but due to errors in the data sometimes unsorted values creep in. I need to detect these values and remove them. I'll show some sample datasets below.
20 30 21 22 23 24 25
30 31 21 22 23 24 25
30 21 22 23 24 25 26
20 21 22 23 18 25 26
20 15 21 22 23 24 25
In each case the bold italic numbers are the ones that should be removed. What would be an algorithm to do remove these numbers/detect the indices of these numbers?
3 Answers 3
Detecting is relatively simpler and takes less steps - you could do it in O(n)
time. Simply iterate over the array and compare each element to the next. You would be able to find (and mark indices or throw-out) out-of-sequence numbers.
However, your second case makes doing this a problem. I will assume that you always want to keep the longest increasing sub-sequence of the list of numbers (like in the 2nd case).
You can solve this problem efficiently with arrays and binary search. The algorithm performs a single binary search per sequence element, its total time can be expressed as O(n log n)
.
Process the sequence elements in order, maintain the longest increasing sub-sequence found so far. Denote the sequence values as X[0], X[1]
, etc. L
representing the length of the longest increasing sub-sequence found so far.
M[j]
stores the index k
of the smallest value X[k]
such that there is an increasing sub-sequence of length j
ending at X[k]
on the range k ≤ i
. j ≤ k ≤ i
always. P[k]
stores the index of the predecessor of X[k]
in the longest increasing sub-sequence ending at X[k]
The sequence X[M[1]], X[M[2]], ..., X[M[L]]
is always nondecreasing at all points of the algorithm.
P = array of length N
M = array of length N + 1 // Using a 1 indexed array for ease of understanding
L = 0
for i in range 0 to N-1:
// Binary search
lo = 1
hi = L
while lo ≤ hi:
mid = ceil((lo+hi)/2)
if X[M[mid]] < X[i]:
lo = mid+1
else:
hi = mid-1
newL = lo
P[i] = M[newL-1]
M[newL] = i
if newL > L:
L = newL
S = array of length L
k = M[L]
for i in range L-1 to 0:
S[i] = X[k]
k = P[k]
return S
The pseudo-code can be found on the Wikipedia article for this.
If you do want to keep the out-of-sequence elements in the list, simply sort the array using insertion sort.
-
Please give pseudo-code for detection in case of a single unsorted number. I can't iterate over and do "if(arr[k + 1] < arr[k]) throw arr[k + 1]" since that would fail in my 1st case, and I can't do "if(arr[k+1] < arr[k]) throw arr[k]" since that would fail in my 4th case.Ashik– Ashik2016年01月15日 12:22:47 +00:00Commented Jan 15, 2016 at 12:22
-
1
if(arr[k-1] < arr[k] && arr[k] > arr[k+1] || arr[k-1] > arr[k] && arr[k] < arr[k+1]) throw arr[k]
Handle corner indices separately.Madhav Datt– Madhav Datt2016年01月15日 12:46:14 +00:00Commented Jan 15, 2016 at 12:46
Detecting only
It takes at least N-1 steps to check (check each element and next), to do it.
But it is ambiguous: In list 2, what is wrong ? 30/31, or 21/.../25 ?
If bad numbers are isolated, you just remove them. But if you have, say, 2 numbers, what to do ? You must define more rules.
Detecting, and sorting :
Complexity:
If your list is perfectly sorted, it takes N-1 steps (check each element and next), to do it.
If there is one unsorted element, it takes log N to replace it at good place (if I suppose everything else is sorted, and in a structure ad hoc like binary tree).
It there are k unsorted elements, it will take k log N .
So N (check) + k log N (insert).
And if everything is messed, N log N, which is classical complexity for sorting.
Algorithm:
So, simpliest algorithm is to iterate, and insert at good place, in a balanced tree. It is a sort by insertion.
It is like smoothsort: https://en.wikipedia.org/wiki/Smoothsort
-
@HenkHolterman - yes - implicitely, I suppose Set is binary . I precise my answer.guillaume girod-vitouchkina– guillaume girod-vitouchkina2016年01月15日 11:53:55 +00:00Commented Jan 15, 2016 at 11:53
-
If there was only one unsorted element, what would be "N(check)"?Ashik– Ashik2016年01月15日 12:00:44 +00:00Commented Jan 15, 2016 at 12:00
-
@diAblo unfortunately, even with one unsorted, you need N (really N-1) iterations to check it.guillaume girod-vitouchkina– guillaume girod-vitouchkina2016年01月16日 10:07:17 +00:00Commented Jan 16, 2016 at 10:07
I think this should work for you. It finds the longest subsequence and then clears up the other elements. The implementation is in c#
public static void Main() {
int[][] dataList = {
new []{20,30,21,22,23,24,25},
new []{30,31,21,22,23,24,25},
new []{30,21,22,23,24,25,26},
new []{20,21,22,23,18,25,26},
new []{20,15,21,22,23,24,25}
};
foreach (var data in dataList)
DetectAndRemoveUnsorted(data);
}
/// <summary>
/// Assumes ascending data. You can adapt it for descending data too
/// </summary>
static void DetectAndRemoveUnsorted(IList<int> data) {
// first pass: Find the outliers; rather find the correct sequence
int startOfLongestSeq = 0, lenOfLongestSeq = 0;
int startOfCurrSeq = 0, lenOfCurrSeq = 0;
for (int i = 0; i < data.Count - 1; ++i) {
if (data[i] > data[i + 1]) { // we are breaking the ascending order, so this is another sequence
lenOfCurrSeq = i - startOfCurrSeq + 1;
if (lenOfCurrSeq > lenOfLongestSeq) {
lenOfLongestSeq = lenOfCurrSeq;
startOfLongestSeq = startOfCurrSeq;
}
startOfCurrSeq = i + 1;
}
}
lenOfCurrSeq = data.Count - startOfCurrSeq;
if (lenOfCurrSeq > lenOfLongestSeq) {
lenOfLongestSeq = lenOfCurrSeq;
startOfLongestSeq = startOfCurrSeq;
}
// second pass: cleanup outliers
// now we know which sequence is the largest
// we should get rid of the other sequences
for (int i = startOfLongestSeq - 1; i >= 0; --i)
data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
for (int i = data.Count - 1; i >= startOfLongestSeq + lenOfLongestSeq; --i)
data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
}
-
Down voters, care to explain what's wrong with the solution?Vikhram– Vikhram2016年04月18日 12:11:24 +00:00Commented Apr 18, 2016 at 12:11
O(n)
. What else might you be thinking of ?