4

I have a set of samples that is sorted, but due to errors in the data sometimes unsorted values creep in. I need to detect these values and remove them. I'll show some sample datasets below.

20 30 21 22 23 24 25

30 31 21 22 23 24 25

30 21 22 23 24 25 26

20 21 22 23 18 25 26

20 15 21 22 23 24 25

In each case the bold italic numbers are the ones that should be removed. What would be an algorithm to do remove these numbers/detect the indices of these numbers?

asked Jan 15, 2016 at 11:05
6
  • My question is what would be an algorithm to achieve the same. I can explain some approaches I tried if that helps Commented Jan 15, 2016 at 11:19
  • 2
    Read the array from low index to high index (or vice versa, it hardly matters) and throw out (or otherwise mark) the out-of-sequence values. You're not going to find a general approach with complexity lower than O(n). What else might you be thinking of ? Commented Jan 15, 2016 at 11:23
  • @HighPerformanceMark That fails for 2nd test case. You will need to do it repeatidly with this approach, and that's not O(n). Commented Jan 15, 2016 at 11:26
  • O(n) is what I am looking for. I want to know how to mark the out-of-sequence elements? Can't use arr[k + 1] < arr[k] as far as I can see. Commented Jan 15, 2016 at 11:28
  • 1
    You should be looking for the longest increasing subsequence. See also: Remove unsorted/outlier elements in nearly-sorted array Commented Jan 15, 2016 at 11:30

3 Answers 3

3

Detecting is relatively simpler and takes less steps - you could do it in O(n) time. Simply iterate over the array and compare each element to the next. You would be able to find (and mark indices or throw-out) out-of-sequence numbers.

However, your second case makes doing this a problem. I will assume that you always want to keep the longest increasing sub-sequence of the list of numbers (like in the 2nd case).

You can solve this problem efficiently with arrays and binary search. The algorithm performs a single binary search per sequence element, its total time can be expressed as O(n log n).

Process the sequence elements in order, maintain the longest increasing sub-sequence found so far. Denote the sequence values as X[0], X[1], etc. L representing the length of the longest increasing sub-sequence found so far.

M[j] stores the index k of the smallest value X[k] such that there is an increasing sub-sequence of length j ending at X[k] on the range k ≤ i. j ≤ k ≤ i always. P[k] stores the index of the predecessor of X[k] in the longest increasing sub-sequence ending at X[k]

The sequence X[M[1]], X[M[2]], ..., X[M[L]] is always nondecreasing at all points of the algorithm.

P = array of length N
M = array of length N + 1 // Using a 1 indexed array for ease of understanding
L = 0
for i in range 0 to N-1:
 // Binary search
 lo = 1
 hi = L
 while lo ≤ hi:
 mid = ceil((lo+hi)/2)
 if X[M[mid]] < X[i]:
 lo = mid+1
 else:
 hi = mid-1
newL = lo
P[i] = M[newL-1]
M[newL] = i
if newL > L:
 L = newL
S = array of length L
k = M[L]
for i in range L-1 to 0:
 S[i] = X[k]
 k = P[k]
return S

The pseudo-code can be found on the Wikipedia article for this.

If you do want to keep the out-of-sequence elements in the list, simply sort the array using insertion sort.

answered Jan 15, 2016 at 12:14
2
  • Please give pseudo-code for detection in case of a single unsorted number. I can't iterate over and do "if(arr[k + 1] < arr[k]) throw arr[k + 1]" since that would fail in my 1st case, and I can't do "if(arr[k+1] < arr[k]) throw arr[k]" since that would fail in my 4th case. Commented Jan 15, 2016 at 12:22
  • 1
    if(arr[k-1] < arr[k] && arr[k] > arr[k+1] || arr[k-1] > arr[k] && arr[k] < arr[k+1]) throw arr[k] Handle corner indices separately. Commented Jan 15, 2016 at 12:46
1

Detecting only

It takes at least N-1 steps to check (check each element and next), to do it.

But it is ambiguous: In list 2, what is wrong ? 30/31, or 21/.../25 ?

If bad numbers are isolated, you just remove them. But if you have, say, 2 numbers, what to do ? You must define more rules.

Detecting, and sorting :

Complexity:

If your list is perfectly sorted, it takes N-1 steps (check each element and next), to do it.

If there is one unsorted element, it takes log N to replace it at good place (if I suppose everything else is sorted, and in a structure ad hoc like binary tree).

It there are k unsorted elements, it will take k log N .

So N (check) + k log N (insert).

And if everything is messed, N log N, which is classical complexity for sorting.

Algorithm:

So, simpliest algorithm is to iterate, and insert at good place, in a balanced tree. It is a sort by insertion.

It is like smoothsort: https://en.wikipedia.org/wiki/Smoothsort

answered Jan 15, 2016 at 11:45
3
  • @HenkHolterman - yes - implicitely, I suppose Set is binary . I precise my answer. Commented Jan 15, 2016 at 11:53
  • If there was only one unsorted element, what would be "N(check)"? Commented Jan 15, 2016 at 12:00
  • @diAblo unfortunately, even with one unsorted, you need N (really N-1) iterations to check it. Commented Jan 16, 2016 at 10:07
-1

I think this should work for you. It finds the longest subsequence and then clears up the other elements. The implementation is in c#

public static void Main() {
 int[][] dataList = {
 new []{20,30,21,22,23,24,25},
 new []{30,31,21,22,23,24,25},
 new []{30,21,22,23,24,25,26},
 new []{20,21,22,23,18,25,26},
 new []{20,15,21,22,23,24,25}
 };
 foreach (var data in dataList)
 DetectAndRemoveUnsorted(data);
}
/// <summary>
/// Assumes ascending data. You can adapt it for descending data too
/// </summary>
static void DetectAndRemoveUnsorted(IList<int> data) {
 // first pass: Find the outliers; rather find the correct sequence
 int startOfLongestSeq = 0, lenOfLongestSeq = 0;
 int startOfCurrSeq = 0, lenOfCurrSeq = 0;
 for (int i = 0; i < data.Count - 1; ++i) {
 if (data[i] > data[i + 1]) { // we are breaking the ascending order, so this is another sequence
 lenOfCurrSeq = i - startOfCurrSeq + 1;
 if (lenOfCurrSeq > lenOfLongestSeq) {
 lenOfLongestSeq = lenOfCurrSeq;
 startOfLongestSeq = startOfCurrSeq;
 }
 startOfCurrSeq = i + 1;
 }
 }
 lenOfCurrSeq = data.Count - startOfCurrSeq;
 if (lenOfCurrSeq > lenOfLongestSeq) {
 lenOfLongestSeq = lenOfCurrSeq;
 startOfLongestSeq = startOfCurrSeq;
 }
 // second pass: cleanup outliers
 // now we know which sequence is the largest
 // we should get rid of the other sequences
 for (int i = startOfLongestSeq - 1; i >= 0; --i)
 data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
 for (int i = data.Count - 1; i >= startOfLongestSeq + lenOfLongestSeq; --i)
 data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
}
answered Jan 15, 2016 at 13:33
1
  • Down voters, care to explain what's wrong with the solution? Commented Apr 18, 2016 at 12:11

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.