
For every build of my software I run a series of tests that execute a number of commands on a command line and capture their output.

Most of the time the output is a combination of:

- Static information that is, and should be, the same between every run.
- Dynamic information that is not easy to predict.

For the purpose of testing this software I only want to verify that the static information is the same between test runs.

Currently I run sed over the captured output of my automated tests to filter out the dynamic information, and then diff the filtered data to identify differences between test runs.

Because of the large variety of command output that is processed, maintaining the sed rules has become an onerous job.

Is there a tool that, given a large sample of test results, can show only the uncommon changes between two files in that sample? To clarify: if I were to diff two of the files, I would only want to see differences in items that change in, say, 5% of the files in the sample; any data that changes in 95% or more of the files would be excluded as differences when comparing two files within the set.

In this example, among the large set of test runs, the first column is almost always consistent:

Run 1:

1/*/1 6643 2433 12343 232424 112 
2/1/1 2311 455 1342 60233 5231
2/1/2 2355 1223 93 12389 1342

Run 2:

1/*/1 3121 112 98832 451 1233 
2/1/1 6342 2345 84421 3456 8362
2/1/2 72453 5 421 64321 7634

Run 3:

1/*/1 12 65 312 653 973 
2/1/1 12442 54231 46012 6 734 
2/1/3 734 28 623 76 1834 

On top of the three runs above, hundreds of samples of this same command's output are generated for each incremental build of the software. When diffing the three examples above, I would like the following results:

- No differences between Run #1 and Run #2.
- A difference showing that "2/1/2" changed to "2/1/3" when comparing Run #3 against either of the other two runs.

Currently I have a sed script that replaces all the "random-looking" numbers above (counts, timestamps, etc.) with hash marks or other constant indicators, so that all that is left to compare is the starting column.
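
To illustrate the idea, here is a rough Python equivalent of that normalization (not the sed script itself):

# Rough Python equivalent of the normalization idea, not the actual sed script:
# whitespace-separated fields that are purely numeric become '#', while keys
# such as "2/1/2" are left alone because they contain slashes.
import re

def normalize_line(line):
    return ' '.join('#' if re.fullmatch(r'\d+', f) else f for f in line.split())

print(normalize_line("2/1/2 2355 1223 93 12389 1342"))  # -> "2/1/2 # # # # #"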

What options, if any, exist in this area? Is there a name for this kind of text comparison?

asked Mar 4, 2012 at 3:55
  • Your question is understandable (and a good one) except for the terms: text that changes "rarely" versus text that changes all of the time. Please define these. Commented Mar 4, 2012 at 4:55
  • I did a bit of a re-work on the original question to clarify the "rarely" vs "all of the time" portion. Hopefully my question is clearer now. Commented Mar 4, 2012 at 19:41

2 Answers


It should be a matter of building a histogram for each field, then finding the most frequent value. If that value's frequency is more than 95%, it's a "static" field. Once you know which fields are dynamic, your idea of replacing them with a marker is a good one, or you can diff against the most frequent value from the histogram.

The most efficient way to build a histogram is to make a hash table with the field's values as keys. Each time that value appears in a run, you increment the counter stored in the hash table.
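
A rough sketch of that idea in Python; the whitespace-separated field layout, the 95% threshold, and keying fields by line and column position are assumptions based on the example data in the question:

# Build a per-field histogram across many runs, then mark a field static if its
# most frequent value covers at least `threshold` of the runs.
from collections import Counter, defaultdict

def find_static_values(runs, threshold=0.95):
    """runs: list of strings, each one captured run of the command.
    Returns a dict mapping (line_no, field_no) to the most frequent value
    if it occurs in at least `threshold` of the runs, else to None."""
    histograms = defaultdict(Counter)
    for run in runs:
        for line_no, line in enumerate(run.splitlines()):
            for field_no, field in enumerate(line.split()):
                histograms[(line_no, field_no)][field] += 1

    static = {}
    for pos, counter in histograms.items():
        value, count = counter.most_common(1)[0]
        static[pos] = value if count / len(runs) >= threshold else None
    return static

def filter_run(run, static):
    """Replace fields judged dynamic with '#' so two runs can be diffed."""
    lines = []
    for line_no, line in enumerate(run.splitlines()):
        lines.append(' '.join(
            f if static.get((line_no, i)) is not None else '#'
            for i, f in enumerate(line.split())))
    return '\n'.join(lines)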

answered Mar 5, 2012 at 13:30
  • I think this approach has the best chance of working within the constraints that exist for our project. Commented Mar 6, 2012 at 16:39

IMO, this is an anti-pattern in testing. Instead of post-processing the "dynamic" parts of the output, you should control the execution environment so that the output is unvarying. If you are reading the system time, or using a random number generator, mock them or wrap them so you can control the test output.
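
For example, a minimal sketch of the wrapping approach; the class and function names are illustrative, not from any particular framework:

# Wrap the time source so tests can inject a fixed value and the output never varies.
import time

class Clock:
    """Production code asks this wrapper for the time instead of calling time.time()."""
    def now(self):
        return time.time()

class FixedClock(Clock):
    """Injected in tests so timestamps in the output stay constant."""
    def __init__(self, fixed):
        self.fixed = fixed
    def now(self):
        return self.fixed

def report(clock):
    # Example of production code whose output becomes stable under a FixedClock.
    return "generated at %d" % clock.now()

assert report(FixedClock(1330800000)) == "generated at 1330800000"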

BTW, whatever you do, don't try to change the system time to run your tests. That may be OK on an embedded system, but creates a mess on Windows and Linux systems.

answered Mar 4, 2012 at 19:44
  • I agree that this would be a good approach if I could start greenfield. The system we are dealing with is developed and maintained by a large multi-site group. The amount of code changes, getting folks to march in the same direction, and the time involved exclude this as an option. I may be able to make these changes a few commands at a time, but I could never get the coverage I need in the time I have been given. I have a pretty solid sed script that detects timestamps in our system. It is one of the few places I am not overly concerned about. Commented Mar 4, 2012 at 20:18
