TL;DR: Given a repo with a lot of big fixture tests, when should I look for a solution where the golden files are not part of the repo anymore? Where should they be stored?
The setup
- Testing a complex data processing project with very diverse input sources
- Around 10^2 CSV files as fixture tests, each a few MB in size (Git LFS).
- These tests aim to be an exhaustive sample of the currently expected & supported inputs to the system, and the manually validated correct outputs.
- So far, these tests allowed us to create a high quality business logic core for our system.
- Current workflow:
- About every week, a column in each file is changed or added, or a new file is added.
- Column changes typically result from code changes (which usually arise from a change or new feature in the specs).
- New files are typically added as regression tests for testing new input scenarios.
- We already have good tooling to manually inspect and approve the diffs. This part is working well.
Pain points
- Biggest pain point: merge conflicts when two different columns are changed. (This is tedious: the conflict only arises because git merges are line-based, not CSV-column-based.)
- Inspecting changes in code reviews on GitLab/GitHub/... is hard, since our tooling for looking at the diffs is homegrown.
- Repo size is a few GB.
- This makes cloning and fully checking out the repo slow and cumbersome.
- Similarly, `git log -p` takes a long time and can clutter everything for users who have enabled automatic Git LFS checkout.
Incomplete ideas to deal with this
- Moving from CSV to an external database. Downsides: tests then need DB access, the DB needs maintenance, and DB migrations have to be kept in sync with code changes.
- Developing a CSV diff & merging tool that works with git. Downside: Dev effort, doesn't solve repo size issue
Questions
- Is this a common pattern in larger code bases that deal with data processing?
- Is this a sign of a deeper, fundamental error in the chosen testing approach?
- Are there tools and methodologies out there that can support the current workflow and make it easier?
EDIT: Clarified several points that were asked about in the comments, in particular I believe I used the term "golden test" incorrectly.
1 Answer
Is this a common pattern in larger code bases that deal with data processing?
Yes. When starting out on a processing pipeline, checking some "real" inputs against known good outputs is probably the simplest test strategy, especially for a team unfamiliar with unit testing and TDD.
Is this a sign of a deeper, fundamental error in the chosen testing approach?
Yes, there are many issues with relying only on characterisation tests:
- As the pipeline complexity increases, the complexity of maintaining the tests grows with something like O(total number of possible paths through the code), even if many of those paths are never traversed. In a unit-tested code base, the complexity should grow roughly linearly with the number of branches in the code. This means that as the test suite grows, developers pretty quickly have to reduce testing rigour to get anything done. What ends up happening is that they make the pipeline change, reprocess everything, briefly check that the output looks sane, and commit the result as job done. This normalises deviance, since "looks sane" is a far cry from a good test.
- New developers have almost zero chance of grasping the evolution of tests, since the tests are not in any way self-explanatory, just a bunch of CSVs.
- Test isolation goes out the window — a small fix somewhere in the internals could mean changing a lot of tests.
- For anything but a trivial pipeline, they are several orders of magnitude slower than unit tests.
Are there tools and methodologies out there that can support the current workflow and make it easier?
Linux distros typically come with GNU Coreutils, which can do a lot of text processing really fast. You could also look into the many solutions for transposing ("flipping the table on its side") the CSV, to be able to diff in a column-based fashion.
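As an illustration of the transpose idea (a sketch only; file names are placeholders), the following Python script rewrites a CSV so that each output line holds one original column. An ordinary line-based `diff` of two transposed files then reports changes per column rather than per row:

```python
import csv
import sys
from itertools import zip_longest


def transpose_csv(src, dst):
    """Rewrite a CSV so that every output line holds one original column."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="") as f:
        # zip_longest keeps ragged rows from silently truncating columns.
        csv.writer(f).writerows(zip_longest(*rows, fillvalue=""))


if __name__ == "__main__":
    transpose_csv(sys.argv[1], sys.argv[2])
```

Used as, say, `python transpose.py old.csv old.t && python transpose.py new.csv new.t && diff old.t new.t`, this turns a single column change into a compact diff; the same trick could be hooked into `git diff` via a textconv attribute.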
Finally, diversify your test strategy. Yes, you definitely want some end-to-end tests, but not for every single branch in your code. It's simply unsustainable. You'll want unit tests for your business rules and branches. If you can run your tests in a reasonable time you should probably look into mutation testing as well. It doesn't require writing any more tests, and is good at discovering missing tests (independently of code coverage, which you should also generate to get a baseline for how you're doing).
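To make the contrast with golden files concrete, a unit test for one business rule might look like the hypothetical pytest sketch below (module, function and rule are invented for illustration); it pins down a single branch of the logic without any fixture file:

```python
# test_discount_rules.py — hypothetical example; names are not from the real code base.
import pytest

from pipeline.rules import apply_discount  # assumed business-rule function


def test_bulk_discount_applies_at_threshold():
    # Hypothetical rule: orders of 100+ units get a 5% discount.
    assert apply_discount(quantity=100, unit_price=2.0) == pytest.approx(190.0)


def test_no_discount_below_threshold():
    assert apply_discount(quantity=99, unit_price=2.0) == pytest.approx(198.0)
```

A failing test like this points at one specific rule instead of producing a multi-megabyte CSV diff, and it is also the kind of test that mutation testing can meaningfully score.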
We do in fact have a big unit test suite as well, and we are also working on translating every identified feature in the fixture tests into a unit test. – Turion, Dec 15, 2023 at 8:43