Detecting unsorted elements in an almost sorted array

Question 1

I have a set of samples that is sorted, but due to errors in the data sometimes unsorted values creep in. I need to detect these values and remove them. I'll show some sample datasets below.

20 30 21 22 23 24 25

30 31 21 22 23 24 25

30 21 22 23 24 25 26

20 21 22 23 18 25 26

20 15 21 22 23 24 25

In each case the bold italic numbers are the ones that should be removed. What would be an algorithm to do remove these numbers/detect the indices of these numbers?

Question 2

My question is what would be an algorithm to achieve the same. I can explain some approaches I tried if that helps

Question 3

Read the array from low index to high index (or vice versa, it hardly matters) and throw out (or otherwise mark) the out-of-sequence values. You're not going to find a general approach with complexity lower than O(n). What else might you be thinking of ?

Question 4

@HighPerformanceMark That fails for 2nd test case. You will need to do it repeatidly with this approach, and that's not O(n).

Question 5

O(n) is what I am looking for. I want to know how to mark the out-of-sequence elements? Can't use arr[k + 1] < arr[k] as far as I can see.

Question 6

You should be looking for the longest increasing subsequence. See also: Remove unsorted/outlier elements in nearly-sorted array

Question 7

Detecting is relatively simpler and takes less steps - you could do it in O(n) time. Simply iterate over the array and compare each element to the next. You would be able to find (and mark indices or throw-out) out-of-sequence numbers.

However, your second case makes doing this a problem. I will assume that you always want to keep the longest increasing sub-sequence of the list of numbers (like in the 2nd case).

You can solve this problem efficiently with arrays and binary search. The algorithm performs a single binary search per sequence element, its total time can be expressed as O(n log n).

Process the sequence elements in order, maintain the longest increasing sub-sequence found so far. Denote the sequence values as X[0], X[1], etc. L representing the length of the longest increasing sub-sequence found so far.

M[j] stores the index k of the smallest value X[k] such that there is an increasing sub-sequence of length j ending at X[k] on the range k ≤ i. j ≤ k ≤ i always. P[k] stores the index of the predecessor of X[k] in the longest increasing sub-sequence ending at X[k]

The sequence X[M[1]], X[M[2]], ..., X[M[L]] is always nondecreasing at all points of the algorithm.

P = array of length N
M = array of length N + 1 // Using a 1 indexed array for ease of understanding
L = 0
for i in range 0 to N-1:
 // Binary search
 lo = 1
 hi = L
 while lo ≤ hi:
 mid = ceil((lo+hi)/2)
 if X[M[mid]] < X[i]:
 lo = mid+1
 else:
 hi = mid-1
newL = lo
P[i] = M[newL-1]
M[newL] = i
if newL > L:
 L = newL
S = array of length L
k = M[L]
for i in range L-1 to 0:
 S[i] = X[k]
 k = P[k]
return S

The pseudo-code can be found on the Wikipedia article for this.

If you do want to keep the out-of-sequence elements in the list, simply sort the array using insertion sort.

Question 8

Please give pseudo-code for detection in case of a single unsorted number. I can't iterate over and do "if(arr[k + 1] < arr[k]) throw arr[k + 1]" since that would fail in my 1st case, and I can't do "if(arr[k+1] < arr[k]) throw arr[k]" since that would fail in my 4th case.

Question 9

if(arr[k-1] < arr[k] && arr[k] > arr[k+1] || arr[k-1] > arr[k] && arr[k] < arr[k+1]) throw arr[k] Handle corner indices separately.

Question 10

Detecting only

It takes at least N-1 steps to check (check each element and next), to do it.

But it is ambiguous: In list 2, what is wrong ? 30/31, or 21/.../25 ?

If bad numbers are isolated, you just remove them. But if you have, say, 2 numbers, what to do ? You must define more rules.

Detecting, and sorting :

Complexity:

If your list is perfectly sorted, it takes N-1 steps (check each element and next), to do it.

If there is one unsorted element, it takes log N to replace it at good place (if I suppose everything else is sorted, and in a structure ad hoc like binary tree).

It there are k unsorted elements, it will take k log N .

So N (check) + k log N (insert).

And if everything is messed, N log N, which is classical complexity for sorting.

Algorithm:

So, simpliest algorithm is to iterate, and insert at good place, in a balanced tree. It is a sort by insertion.

It is like smoothsort: https://en.wikipedia.org/wiki/Smoothsort

Question 11

@HenkHolterman - yes - implicitely, I suppose Set is binary . I precise my answer.

Question 12

If there was only one unsorted element, what would be "N(check)"?

Question 13

@diAblo unfortunately, even with one unsorted, you need N (really N-1) iterations to check it.

Question 14

I think this should work for you. It finds the longest subsequence and then clears up the other elements. The implementation is in c#

public static void Main() {
 int[][] dataList = {
 new []{20,30,21,22,23,24,25},
 new []{30,31,21,22,23,24,25},
 new []{30,21,22,23,24,25,26},
 new []{20,21,22,23,18,25,26},
 new []{20,15,21,22,23,24,25}
 };
 foreach (var data in dataList)
 DetectAndRemoveUnsorted(data);
}
/// <summary>
/// Assumes ascending data. You can adapt it for descending data too
/// </summary>
static void DetectAndRemoveUnsorted(IList<int> data) {
 // first pass: Find the outliers; rather find the correct sequence
 int startOfLongestSeq = 0, lenOfLongestSeq = 0;
 int startOfCurrSeq = 0, lenOfCurrSeq = 0;
 for (int i = 0; i < data.Count - 1; ++i) {
 if (data[i] > data[i + 1]) { // we are breaking the ascending order, so this is another sequence
 lenOfCurrSeq = i - startOfCurrSeq + 1;
 if (lenOfCurrSeq > lenOfLongestSeq) {
 lenOfLongestSeq = lenOfCurrSeq;
 startOfLongestSeq = startOfCurrSeq;
 }
 startOfCurrSeq = i + 1;
 }
 }
 lenOfCurrSeq = data.Count - startOfCurrSeq;
 if (lenOfCurrSeq > lenOfLongestSeq) {
 lenOfLongestSeq = lenOfCurrSeq;
 startOfLongestSeq = startOfCurrSeq;
 }
 // second pass: cleanup outliers
 // now we know which sequence is the largest
 // we should get rid of the other sequences
 for (int i = startOfLongestSeq - 1; i >= 0; --i)
 data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
 for (int i = data.Count - 1; i >= startOfLongestSeq + lenOfLongestSeq; --i)
 data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
}

Question 15

Down voters, care to explain what's wrong with the solution?

Madhav Datt Madhav Datt 1,0832 gold badges10 silver badges25 bronze badges · Accepted Answer · 2016-01-15 12:14:52Z

Detecting is relatively simpler and takes less steps - you could do it in O(n) time. Simply iterate over the array and compare each element to the next. You would be able to find (and mark indices or throw-out) out-of-sequence numbers.

However, your second case makes doing this a problem. I will assume that you always want to keep the longest increasing sub-sequence of the list of numbers (like in the 2nd case).

You can solve this problem efficiently with arrays and binary search. The algorithm performs a single binary search per sequence element, its total time can be expressed as O(n log n).

Process the sequence elements in order, maintain the longest increasing sub-sequence found so far. Denote the sequence values as X[0], X[1], etc. L representing the length of the longest increasing sub-sequence found so far.

M[j] stores the index k of the smallest value X[k] such that there is an increasing sub-sequence of length j ending at X[k] on the range k ≤ i. j ≤ k ≤ i always. P[k] stores the index of the predecessor of X[k] in the longest increasing sub-sequence ending at X[k]

The sequence X[M[1]], X[M[2]], ..., X[M[L]] is always nondecreasing at all points of the algorithm.

P = array of length N
M = array of length N + 1 // Using a 1 indexed array for ease of understanding
L = 0
for i in range 0 to N-1:
 // Binary search
 lo = 1
 hi = L
 while lo ≤ hi:
 mid = ceil((lo+hi)/2)
 if X[M[mid]] < X[i]:
 lo = mid+1
 else:
 hi = mid-1
newL = lo
P[i] = M[newL-1]
M[newL] = i
if newL > L:
 L = newL
S = array of length L
k = M[L]
for i in range L-1 to 0:
 S[i] = X[k]
 k = P[k]
return S

The pseudo-code can be found on the Wikipedia article for this.

If you do want to keep the out-of-sequence elements in the list, simply sort the array using insertion sort.

Please give pseudo-code for detection in case of a single unsorted number. I can't iterate over and do "if(arr[k + 1] < arr[k]) throw arr[k + 1]" since that would fail in my 1st case, and I can't do "if(arr[k+1] < arr[k]) throw arr[k]" since that would fail in my 4th case.
if(arr[k-1] < arr[k] && arr[k] > arr[k+1] || arr[k-1] > arr[k] && arr[k] < arr[k+1]) throw arr[k] Handle corner indices separately.

CollectivesTM on Stack Overflow

Detecting unsorted elements in an almost sorted array

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related