TL;DR: Given a repo with a lot of big fixture tests, when should I look for a solution where the golden files are not part of the repo anymore? Where should they be stored?
The setup
- Testing a complex data processing project with very diverse input sources
- Around 10^2 CSV files as fixture tests, each a few MB in size (Git LFS).
- These tests aim to be an exhaustive sample of the currently expected & supported inputs to the system, and the manually validated correct outputs.
- So far, these tests allowed us to create a high quality business logic core for our system.
- Current workflow:
- About every week, a column in each file is changed or added, or a new file is added.
- Column changes typically result from code changes (which usually arise from a change or new feature in the specs).
- New files are typically added as regression tests for testing new input scenarios.
- We already have good tooling to manually inspect and approve the diffs. This part is working well.
Pain points
- Biggest pain point: merge conflicts when two different columns are changed. (This is tedious: the conflict only arises because git merges are line-based, not CSV-column-based.)
- Inspecting changes in code reviews on GitLab/GitHub/... is hard, since our tooling for looking at the diffs is homegrown.
- Repo size is a few GB.
- This makes cloning and fully checking out the repo slow and cumbersome.
- Similarly, `git log -p` takes a long time and can clutter everything for users who have enabled automatic Git LFS checkout.
Incomplete ideas to deal with this
- Moving from CSV to an external database. Downsides: tests then need DB access, the DB needs maintenance, and DB migrations have to be kept in sync with code changes.
- Developing a CSV diff & merging tool that works with git. Downside: Dev effort, doesn't solve repo size issue
Questions
- Is this a common pattern in larger code bases that deal with data processing?
- Is this a sign of a deeper, fundamental error in the chosen testing approach?
- Are there tools and methodologies out there that can support the current workflow and make it easier?
EDIT: Clarified several points that were asked about in the comments, in particular I believe I used the term "golden test" incorrectly.
1 Answer
Is this a common pattern in larger code bases that deal with data processing?
Yes. When starting out on a processing pipeline, checking some "real" inputs against known good outputs is probably the simplest test strategy, especially for a team unfamiliar with unit testing and TDD.
Is this a sign of a deeper, fundamental error in the chosen testing approach?
Yes, there are many issues with relying only on characterisation tests:
- As the pipeline complexity increases, the complexity of maintaining the tests grows with something like O(total number of possible paths through the code), even if many of those paths are never traversed. In a unit-tested code base, the complexity should grow roughly linearly with the number of branches in the code. This means that as the test suite grows, developers pretty quickly have to reduce testing rigour to get anything done. What ends up happening is that they make the pipeline change, reprocess everything, briefly check that the output looks sane, and commit the result as job done. This normalises deviance, since "looks sane" is a far cry from a good test.
- New developers have almost zero chance of grasping the evolution of tests, since the tests are not in any way self-explanatory, just a bunch of CSVs.
- Test isolation goes out the window — a small fix somewhere in the internals could mean changing a lot of tests.
- For anything but a trivial pipeline, they are several orders of magnitude slower than unit tests.
Are there tools and methodologies out there that can support the current workflow and make it easier?
Linux distros typically come with GNU Coreutils, which can do a lot of text processing really fast. You could also look into the many solutions for transposing ("flipping the table on its side") the CSV, to be able to diff in a column-based fashion.
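As an illustration of the transpose idea (a sketch only; file names are placeholders), the following Python script rewrites a CSV so that each output line holds one original column. An ordinary line-based `diff` of two transposed files then reports changes per column rather than per row:

```python
import csv
import sys
from itertools import zip_longest


def transpose_csv(src, dst):
    """Rewrite a CSV so that every output line holds one original column."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="") as f:
        # zip_longest keeps ragged rows from silently truncating columns.
        csv.writer(f).writerows(zip_longest(*rows, fillvalue=""))


if __name__ == "__main__":
    transpose_csv(sys.argv[1], sys.argv[2])
```

Used as, say, `python transpose.py old.csv old.t && python transpose.py new.csv new.t && diff old.t new.t`, this turns a single column change into a compact diff; the same trick could be hooked into `git diff` via a textconv attribute.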
Finally, diversify your test strategy. Yes, you definitely want some end-to-end tests, but not for every single branch in your code. It's simply unsustainable. You'll want unit tests for your business rules and branches. If you can run your tests in a reasonable time you should probably look into mutation testing as well. It doesn't require writing any more tests, and is good at discovering missing tests (independently of code coverage, which you should also generate to get a baseline for how you're doing).
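To make the contrast with golden files concrete, a unit test for one business rule might look like the hypothetical pytest sketch below (module, function and rule are invented for illustration); it pins down a single branch of the logic without any fixture file:

```python
# test_discount_rules.py — hypothetical example; names are not from the real code base.
import pytest

from pipeline.rules import apply_discount  # assumed business-rule function


def test_bulk_discount_applies_at_threshold():
    # Hypothetical rule: orders of 100+ units get a 5% discount.
    assert apply_discount(quantity=100, unit_price=2.0) == pytest.approx(190.0)


def test_no_discount_below_threshold():
    assert apply_discount(quantity=99, unit_price=2.0) == pytest.approx(198.0)
```

A failing test like this points at one specific rule instead of producing a multi-megabyte CSV diff, and it is also the kind of test that mutation testing can meaningfully score.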
We do in fact have a big unit test suite as well, and we are also working on translating every identified feature in the fixture tests into a unit test. – Turion, Dec 15, 2023 at 8:43