Can the agent distinguish stdout and stderr? #37
Open
Description
Hi!
I found that some tests fail against a generated executable because they expect a particular string on stderr, while the executable writes that string to stdout.
I also noticed that ProgramBench’s container execution helper appears to merge stdout/stderr streams here.
Is stdout/stderr separation intentionally unavailable to agents during exploration? (I think it is still possible for an agent to manually test stream placement with shell redirection, e.g. cmd >/tmp/out 2>/tmp/err, but the default observation format seems to make the distinction easy to miss.)
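To illustrate the difference I mean, here is a minimal Python sketch. It is not ProgramBench's actual helper: `CHILD`, `run_merged`, and `run_separate` are hypothetical names I made up, and the child process is just a stand-in that writes one line to each stream. It only relies on the standard `subprocess` module.

```python
import subprocess
import sys

# Hypothetical stand-in for the program under test: it writes one line to
# stdout and one line to stderr.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]


def run_merged(cmd):
    """Redirect stderr into the stdout pipe, so the two streams are indistinguishable."""
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)


def run_separate(cmd):
    """Capture stdout and stderr into separate pipes."""
    return subprocess.run(cmd, capture_output=True)


merged = run_merged(CHILD)
separate = run_separate(CHILD)

print("merged stdout:  ", merged.stdout)    # contains both lines
print("merged stderr:  ", merged.stderr)    # None, nothing captured separately
print("separate stdout:", separate.stdout)
print("separate stderr:", separate.stderr)
```

With the merged form, an agent reading the observation cannot tell which stream a message came from unless it redirects the streams itself.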
Thank you!
The commands I ran
uv run --with mini-swe-agent mini-extra programbench \
  --filter "abishekvashok__cmatrix.5c082c6" \
  --output output \
  --model openai/gpt-5.4
uv run programbench eval output
Score log
Evaluation Summary
Instance                         Score  Comment
abishekvashok__cmatrix.5c082c6   94     507 tests
Average                          94     1 instances
A failed test in eval.json
- task: abishekvashok/cmatrix
- test function name: eval.tests.test_cmatrix.TestColorOptions.test_color_invalid_shows_error_message
- branch: 1b991a57d4e9
self = <test_cmatrix.TestColorOptions object at 0x7732071f2bf0>

    def test_color_invalid_shows_error_message(self):
        """Test invalid color name produces error message."""
        result = run("-C", "purple")
        assert result.returncode == 0  # Exits with 0 but shows error
>       assert b"Invalid color" in result.stderr
E       AssertionError: assert b'Invalid color' in b''
E        +  where b'' = CompletedProcess(args=['./executable', '-C', 'purple'], returncode=0, stdout=b' Invalid color selection\n Valid colors are green, red, blue, white, yellow, cyan, magenta and black.\n', stderr=b'').stderr

eval/tests/test_cmatrix.py:120: AssertionError
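For reference, the shape of this failure can be reproduced without the real binary. The sketch below is only an assumption-laden illustration: `find_stream` is a hypothetical helper of mine, and `fake` is a stand-in command that mimics the generated executable by printing the message to stdout and exiting with 0. The test suite's own `run` helper is not shown in the log, so I call `subprocess.run` directly.

```python
import subprocess
import sys


def find_stream(result, needle):
    """Return the names of the captured streams that contain `needle`."""
    hits = []
    if needle in (result.stdout or b""):
        hits.append("stdout")
    if needle in (result.stderr or b""):
        hits.append("stderr")
    return hits


# Hypothetical stand-in for the generated executable: the error message goes
# to stdout and the process exits with status 0.
fake = [sys.executable, "-c", "print(' Invalid color selection')"]

result = subprocess.run(fake, capture_output=True)
where = find_stream(result, b"Invalid color")

print("return code:", result.returncode)
print("message found in:", where)
```

A check like this would make the mismatch visible during exploration: the message is present and the exit code matches, but the stream is not the one the test asserts on.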