I have code that does the following: for each value v in a 2D array, look at the neighboring positions (how far to look is specified by distThreshold). If the count of values identical to v in the neighborhood is less than or equal to countThreshold, change v to the value that is most prevalent in the neighborhood.
For example, input:
[5, 2, 1, 0, 5]
[1, 5, 5, 0, 4]
[0, 2, 0, 0, 5]
[1, 3, 2, 4, 0]
[3, 3, 4, 3, 0]
output for distThreshold = 1, countThreshold = 0:
[5, 5, 5, 0, 0]
[2, 5, 5, 0, 0]
[1, 2, 0, 0, 0]
[3, 3, 2, 4, 0]
[3, 3, 4, 0, 0]
The important functions:
public static int[][] correctedArray(int distThreshold, int countThreshold, int[][] arr)
{
    int[][] newArray = arrCopy(arr);
    for (int j = 0; j < arr.Length; j++)
    {
        for (int i = 0; i < arr[0].Length; i++)
        {
            int[] dcfac = dupeCountForAutoCorrect(i, j, distThreshold, countThreshold, arr);
            int dupeCount = dcfac[0];
            int replacementIndex = dcfac[1];
            if (dupeCount <= countThreshold)
            {
                newArray[j][i] = replacementIndex;
            }
        }
    }
    return newArray;
}
private static int[] dupeCountForAutoCorrect(int x, int y, int distThreshold, int countThreshold, int[][] arr)
{
    int testedIndex = arr[y][x];
    int[] result = new int[2];
    Dictionary<int, int> colorCounts = new Dictionary<int, int>();
    for (int j = y - distThreshold; j <= y + distThreshold; j++)
    {
        for (int i = x - distThreshold; i <= x + distThreshold; i++)
        {
            if (!(i == x && j == y))
            {
                try
                {
                    int currentIndex = arr[j][i];
                    if (currentIndex == testedIndex) result[0]++;
                    if (result[0] > countThreshold) return result;
                    if (colorCounts.TryGetValue(currentIndex, out int value))
                    {
                        value++;
                        colorCounts[currentIndex] = value;
                    }
                    else
                    {
                        colorCounts.Add(currentIndex, 1);
                    }
                }
                catch (IndexOutOfRangeException)
                {
                    // do nothing and continue
                }
            }
        }
    }
    // returns the index with highest count. I don't understand this at all, ripped from Stack Overflow
    result[1] = colorCounts.Aggregate((xx, yy) => xx.Value > yy.Value ? xx : yy).Key;
    return result;
}
Helper functions and testing:
public static void testArr()
{
    int maxIndex = 5;
    int arrWidth = 5;
    int arrHeight = 5;
    int distThreshold = 1;
    int countThreshold = 0;
    int[][] originalArray = new int[arrHeight][];
    Random rnd = new Random();
    for (int j = 0; j < arrHeight; j++)
    {
        int[] r = new int[arrWidth];
        for (int i = 0; i < arrWidth; i++)
        {
            int rndn = rnd.Next(maxIndex + 1);
            r[i] = rndn;
        }
        originalArray[j] = r;
    }
    DateTime nao = DateTime.Now;
    int[][] newArr = correctedArray(distThreshold, countThreshold, originalArray);
    Console.WriteLine((DateTime.Now - nao).TotalSeconds.ToString() + " seconds");
    string origArrString = arrToString(originalArray);
    string newArrString = arrToString(newArr);
    Console.Write(origArrString + System.Environment.NewLine + newArrString);
}
public static string arrToString(int[][] arr)
{
    string result = "";
    for (int j = 0; j < arr.Length; j++)
    {
        result += "[";
        for (int i = 0; i < arr[0].Length; i++)
        {
            int val = arr[j][i];
            result += val + (i == arr[0].Length - 1 ? "]" + System.Environment.NewLine : ", ");
        }
    }
    return result;
}
public static int[][] arrCopy(int[][] arr)
{
    int[][] result = new int[arr.Length][];
    for (int j = 0; j < arr.Length; j++)
    {
        result[j] = new int[arr[0].Length];
        for (int i = 0; i < arr[0].Length; i++)
        {
            result[j][i] = arr[j][i];
        }
    }
    return result;
}
A 100*100 array is processed in a little over 3 seconds on my machine, which seems a little slow to me. I am very new to C#, so pardon my naming and other convention violations :X
1 Answer
This submission is about performance. It does not include any profiler measurements.
robust testing
public static void testArr()
...
int arrWidth = 5;
int arrHeight = 5;
Prefer a 5 × 6 matrix, or other non-square shape. Why? Sometimes target code will accidentally invert (x, y) into (y, x), and we wish to provoke and fix such bugs.
Consider using a distance threshold greater than 1 in your testing if you wish to expose performance hotspots.
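For example, test parameters along these lines (the specific numbers are arbitrary choices of mine, not from the post):

int arrWidth = 5;       // columns
int arrHeight = 6;      // rows - deliberately different from arrWidth, to catch (x, y) swaps
int distThreshold = 3;  // a wider window makes the O(distThreshold^2) inner loops visible
int countThreshold = 2;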
variant spelling
DateTime nao = DateTime.Now;
You knao perfectly well how to spell that word. Prefer its standard spelling.
names
In many identifiers you describe something as an "index" where my understanding is it's more like a pixel "color" or "value".
if (colorCounts.TryGetValue(currentIndex, out int value))
{
    value++;
    colorCounts[currentIndex] = value;
Here value is too vague; prefer count.
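In that spirit the same lookup might read roughly like this (currentColor and count are suggested renames, not names from the original code):

if (colorCounts.TryGetValue(currentColor, out int count))
{
    // a neighbor of this color was seen before: bump its tally
    colorCounts[currentColor] = count + 1;
}
else
{
    // first time this color appears in the neighborhood
    colorCounts.Add(currentColor, 1);
}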
common exceptions
For a square matrix of side N, your test code throws about 12 × N IndexOutOfRangeExceptions, which seems like perhaps more than you'd prefer, since doing so takes time.
Consider expanding the array with borders, so we have (N + 2 × distThreshold)² cells in total. Fill the border with unique serial numbers, or with a periodic pattern of period at least 1 + 2 * distThreshold. That way a border value won't be the most popular.
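A minimal sketch of that padding idea, assuming a rectangular jagged array like yours; padWithBorder is a name I made up. Each border cell gets a unique negative serial number, so no border value can ever win the popularity contest:

private static int[][] padWithBorder(int[][] arr, int distThreshold)
{
    int height = arr.Length;
    int width = arr[0].Length;
    int paddedHeight = height + 2 * distThreshold;
    int paddedWidth = width + 2 * distThreshold;
    int serial = -1;  // values that cannot collide with real colors

    int[][] padded = new int[paddedHeight][];
    for (int j = 0; j < paddedHeight; j++)
    {
        padded[j] = new int[paddedWidth];
        for (int i = 0; i < paddedWidth; i++)
        {
            bool inside = j >= distThreshold && j < distThreshold + height
                       && i >= distThreshold && i < distThreshold + width;
            padded[j][i] = inside
                ? arr[j - distThreshold][i - distThreshold]
                : serial--;  // each border cell gets its own serial number
        }
    }
    return padded;
}

With a padded array the neighborhood loops never leave the bounds, so the try/catch (and its exceptions) disappears; reads and writes of the real data just need an offset of distThreshold. A plain bounds check in the inner loop is the other way to avoid the exceptions, without copying anything.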
common operations
Each point requests the allocation and deallocation of a Dictionary, causing GC work. Maybe hang onto a single Dictionary, and .Clear() it when starting on a new cell? Or maybe turn the dictionary into a vector of size C, the number of colors?
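A minimal sketch of the vector idea, assuming colors are small non-negative integers 0..C-1 (true of the test code, where maxIndex bounds them); the buffer field and mostPopularColor are names I invented:

// Allocated once; length is one more than the largest color value (MaxColor is a hypothetical constant).
private static readonly int[] colorCountBuffer = new int[MaxColor + 1];

private static int mostPopularColor(int x, int y, int distThreshold, int[][] arr)
{
    Array.Clear(colorCountBuffer, 0, colorCountBuffer.Length);  // O(C) per cell, no GC work
    for (int j = y - distThreshold; j <= y + distThreshold; j++)
    {
        for (int i = x - distThreshold; i <= x + distThreshold; i++)
        {
            if (i == x && j == y) continue;  // skip the center cell
            if (j < 0 || j >= arr.Length || i < 0 || i >= arr[0].Length) continue;  // skip out-of-bounds
            colorCountBuffer[arr[j][i]]++;
        }
    }
    int best = 0;
    for (int c = 1; c < colorCountBuffer.Length; c++)
    {
        if (colorCountBuffer[c] > colorCountBuffer[best]) best = c;
    }
    return best;
}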
When scanning a raster we re-compute the same old partial sums several times. For low values of distThreshold, such as what your test code uses, I'm skeptical that one can do much better. For moderate or large thresholds, consider an alternate approach.
Initialize by making a single scan to find the number of distinct pixel colors, C, in the input matrix. Depending on the size of the threshold, allocate either a byte or a UInt16 count matrix, with shape arrWidth × arrHeight × C, or slightly larger if we have tacked on border cells.
When computing window sums for a moving average, we can efficiently add a figure at the leading edge and subtract another at the trailing edge. Similarly, in this situation, we can efficiently compute "count of a certain pixel color in this cell's neighborhood" by copying the counts we had just finished computing, incrementing based on the leading edge, and decrementing based on the trailing edge. It is perfectly OK to have some garbage counts in the border, as we won't actually consult them.
The area of the neighborhood is quadratic in distThreshold. For this approach to work well, the number of distinct colors C should be smallish relative to the neighborhood area.
Sometimes result[1] must be set to the majority color, and computing this will have O(C) cost. (It could have O(log N) cost, but that sounds like a lot of pqueues.)
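A minimal sketch of just the window-slide step, assuming counts is indexed by color 0..C-1 and already holds the tallies for the window centered one column to the left; slideWindowRight is a name of my own:

private static void slideWindowRight(int[] counts, int newCenterX, int y,
                                     int distThreshold, int[][] arr)
{
    int leavingX = newCenterX - distThreshold - 1;  // column that just fell out of the window
    int enteringX = newCenterX + distThreshold;     // column that just came in
    for (int j = y - distThreshold; j <= y + distThreshold; j++)
    {
        if (j < 0 || j >= arr.Length) continue;  // border rows are simply skipped here
        if (leavingX >= 0 && leavingX < arr[j].Length) counts[arr[j][leavingX]]--;
        if (enteringX >= 0 && enteringX < arr[j].Length) counts[arr[j][enteringX]]++;
    }
}

Each step costs O(2 × distThreshold + 1) instead of re-scanning the whole window, plus O(C) whenever the majority has to be re-found. Note these counts include the center cell itself, so subtract one from the tally of arr[y][x] to reproduce the neighbors-only semantics of the original code.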
simplify the input
Initially computing a color histogram, the population count of each distinct pixel color, is fairly cheap. We might find there's a handful of very popular colors plus small numbers of rare color values. It might be acceptable to perform a lossy input transform, which lumps all unpopular colors together as a single designated value, or uses a randomly chosen mapping to turn them into just a few distinct colors. We could even put our thumb on the scale, so the function will never return an unpopular color.
This would give us an attractively small C, at the cost of slightly changing the output. You wouldn't want to use this technique in an adversarial setting.
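Roughly, the remapping could be built in one pass over the input; minPopulation and rareValue are knobs I made up for this sketch, not anything from the post:

// Count how often each color occurs, then map every color below the
// popularity cutoff onto one designated "rare" value, shrinking C.
private static Dictionary<int, int> buildRareColorRemap(int[][] arr, int minPopulation, int rareValue)
{
    var histogram = new Dictionary<int, int>();
    foreach (int[] row in arr)
        foreach (int color in row)
            histogram[color] = histogram.TryGetValue(color, out int n) ? n + 1 : 1;

    var remap = new Dictionary<int, int>();
    foreach (var pair in histogram)
        remap[pair.Key] = pair.Value >= minPopulation ? pair.Key : rareValue;
    return remap;
}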
median
The majority filter you propose appears to be intended to cope with salt-and-pepper noise. It seems like it lies on this neighborhood filtering continuum:
- mean
- median
- majority
In image regions that are somewhat distant from an edge, you may be able to get away with dynamically switching to one of those other filters, taking advantage of a rapid, mature implementation. Sampling every Nth pixel neighborhood to support such a decision may be sufficient to give acceptable results.
Consider relying on the scikit-image implementation of this or related filters.
- It was the exceptions!! Thank you so much, some of this is a little above my head like the common operations part, but I changed the !(i == x && j == y) to !(i == x && j == y) && i > 0 && j > 0 && i < arr[0].Length && j < arr.Length, removed the try catch, and now a 2000*3000 array is processed in 2 seconds. I also agree with everything else, I've had problems with testing on square arrays and then facing errors when trying rectangular ones. And that spelling part... yea, I'm just being silly. – andrewb, Feb 8, 2024 at 9:12
, removed the try catch and now 2000*3000 array is processed in 2 seconds. I also agree with everything else, I've had problems with testing on square arrays and then facing errors when trying rectangular ones. And that spelling part...yea, I'm just being silly \$\endgroup\$andrewb– andrewb2024年02月08日 09:12:29 +00:00Commented Feb 8, 2024 at 9:12 -
\$\begingroup\$ I also tried initializing the dictionary only once and clearing it after every point, which also gave me some performance improvement. \$\endgroup\$andrewb– andrewb2024年02月08日 13:45:48 +00:00Commented Feb 8, 2024 at 13:45
- arrCopy could be replaced with arr.Select(inner => inner.ToArray()).ToArray(). But of course you should measure your baseline, and then measure after any modification, to determine whether it introduced a performance improvement or degradation.