[Task] Investigate slow integrate-test stalls in message sender barrier #350
Description
Background
The Integrate test job for HugeGraph Computer can take a long time even though the integration suite is small. The slow part is usually not the number of tests, but a long wait in the message/input synchronization path.
An observed CI log repeatedly prints:
EtcdClient - Wait for keys with prefix 'BSP_WORKER_INPUT_DONE' and timeout 86400000ms, expect 1 keys but actual got 0 keys
The same log shows the worker entering input step and starting vertex message sending before the wait:
WorkerService inputstep started
MessageSendManager - Start sending message(type=VERTEX)
So the master is waiting for the worker's BSP_WORKER_INPUT_DONE signal, but the worker has not reached Bsp4Worker.workerInputDone() yet.
Initial code pointers
- CI runs mvn test -P integrate-test -ntp in .github/workflows/computer-ci.yml.
- The integrate-test profile includes IntegrateTestSuite, which currently contains SenderIntegrateTest.
- SenderIntegrateTest has only a few cases, but testOneWorkerWithBusyClient() intentionally slows the send path by wrapping the client's send function with Thread.sleep(100).
- WorkerInputManager.loadGraph() sends vertices and edges first. Only after it returns does WorkerService.inputstep() call bsp4Worker.workerInputDone().
- ComputerOptions.BSP_WAIT_WORKERS_TIMEOUT and BSP_WAIT_MASTER_TIMEOUT default to 24 hours, so a hidden sender/session/input problem can become a very slow CI wait instead of a fast, actionable failure.
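The ordering described above can be modeled in a few lines. This is a minimal sketch, not the project's code: the class InputBarrierSketch and its methods are hypothetical, and a CountDownLatch stands in for the BSP_WORKER_INPUT_DONE key in etcd. It only shows that the signal is sent after loading returns, so a stalled load leaves the master waiting for the full timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the input barrier; all names here are hypothetical. */
final class InputBarrierSketch {

    // Stand-in for the BSP_WORKER_INPUT_DONE key (not the real etcd client)
    private final CountDownLatch inputDone = new CountDownLatch(1);

    /** Worker side: the signal is sent only after loading has returned. */
    void workerInputStep(Runnable loadGraph) {
        loadGraph.run();        // models WorkerInputManager.loadGraph()
        inputDone.countDown();  // models Bsp4Worker.workerInputDone()
    }

    /**
     * Master side: waits for the signal for at most timeoutMs.
     * Returns true if the worker signaled in time, false otherwise.
     */
    boolean masterWaitInputDone(long timeoutMs) {
        try {
            return inputDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a 24-hour timeout the false branch is effectively never reached in CI, which matches the repeated wait log; a timeout of a few seconds would surface the same condition as a failure.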
Relevant files:
.github/workflows/computer-ci.yml
computer/computer-test/src/main/java/org/apache/hugegraph/computer/suite/integrate/IntegrateTestSuite.java
computer/computer-test/src/main/java/org/apache/hugegraph/computer/suite/integrate/SenderIntegrateTest.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/input/WorkerInputManager.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/QueuedMessageSender.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
computer/computer-api/src/main/java/org/apache/hugegraph/computer/core/config/ComputerOptions.java
Related prior symptom: #203 reported "The origin future must be null" in SenderIntegrateTest. That may be in the same control-message/future/session area, but this task is specifically about the slow CI wait and about making the integration test fail fast and easier to debug.
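To make the kind of failure in #203 concrete, here is a hedged sketch of a single-slot control future. It is not the QueuedMessageSender implementation: the class ControlFutureSketch and its methods are hypothetical, and only the futureRef name and the error message are taken from the issue. The assumption is that one control message is in flight at a time and that a stale future in the slot blocks the next one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical single-slot holder for an in-flight control future. */
final class ControlFutureSketch {

    private final AtomicReference<CompletableFuture<Void>> futureRef =
            new AtomicReference<>();

    /** Registers a new control future; fails if a previous one is still set. */
    CompletableFuture<Void> begin() {
        CompletableFuture<Void> future = new CompletableFuture<>();
        if (!this.futureRef.compareAndSet(null, future)) {
            throw new IllegalStateException("The origin future must be null");
        }
        return future;
    }

    /** Clears the slot and completes the pending future normally. */
    void complete() {
        CompletableFuture<Void> future = this.futureRef.getAndSet(null);
        if (future != null) {
            future.complete(null);
        }
    }

    /** Clears the slot and propagates a sender failure to the waiter. */
    void fail(Throwable cause) {
        CompletableFuture<Void> future = this.futureRef.getAndSet(null);
        if (future != null) {
            future.completeExceptionally(cause);
        }
    }
}
```

In this model, a sender error that never calls fail() leaves the waiter blocked and the slot occupied, which is one plausible way a hidden failure could turn into a long barrier wait.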
Suggested investigation
- Reproduce the integration suite with etcd available:
    cd computer
    mvn test -P integrate-test -Dtest=IntegrateTestSuite -ntp
- Confirm which test case spends time before BSP_WORKER_INPUT_DONE. Start with SenderIntegrateTest#testOneWorkerWithBusyClient.
- Trace the input path:
    SenderIntegrateTest -> WorkerService.execute() -> WorkerService.inputstep() -> WorkerInputManager.loadGraph() -> MessageSendManager.startSend()/finishSend() -> QueuedMessageSender.send() -> Bsp4Worker.workerInputDone()
- Check whether START/FINISH control futures in QueuedMessageSender can be left stale, completed late, or hidden behind the sender thread. The stack trace around futureRef in the old issue #203 ("[Bug] The origin future must be null") is a useful clue.
- Make the test fail fast and print useful diagnostics. Possible directions:
  - set much smaller bsp.wait_workers_timeout / bsp.wait_master_timeout for integration tests;
  - add a JUnit/test-level timeout around each integration case;
  - dump worker/master thread states when the input barrier is not reached;
  - ensure sender exceptions propagate to both the worker future and the master-side wait;
  - replace the sleep-based busy-client simulation with a more deterministic back-pressure or blocked-client fixture.
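Two of the directions above, a per-case timeout and a thread dump on failure, can be combined in a small helper. The following is a sketch under assumptions: FailFastSketch, dumpThreads, and runWithTimeout are hypothetical names, and it uses only JDK classes rather than any HugeGraph or JUnit API. A real fix might use the test framework's own timeout support instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper: bounded wait plus thread dump on timeout. */
final class FailFastSketch {

    /** Renders the name, state, and stack of every live JVM thread. */
    static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean()
                                              .dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Runs the step; fails with a thread dump if it exceeds timeoutMs. */
    static void runWithTimeout(Runnable step, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(step);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Dump before cancelling so the stuck thread's stack is visible
            String dump = dumpThreads();
            future.cancel(true);
            throw new AssertionError("Step exceeded " + timeoutMs +
                                     "ms\n" + dump);
        } catch (ExecutionException e) {
            throw new AssertionError("Step failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Wrapping an integration case in runWithTimeout would replace minutes of barrier-wait logs with one failure message that includes the stack of the sender thread, assuming the stuck code responds to interruption or the JVM is allowed to exit afterwards.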
Expected result
- Integration tests should not spend many minutes printing only BSP_WORKER_INPUT_DONE wait logs.
- If the sender/input path is broken, the test should fail quickly with an actionable error and enough thread/session state to locate the failing component.
- The slow/busy-client path should have regression coverage so future changes do not reintroduce the long wait.
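For the regression coverage, one option is a blocked-client fixture that is controlled by latches instead of Thread.sleep(100). The sketch below is an assumption about how such a fixture could look: BlockedClientSketch and GatedSender are hypothetical, and the send function is modeled as a plain Consumer<String> rather than the real client interface.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical fixture: a send function the test can block and release. */
final class BlockedClientSketch {

    static final class GatedSender implements Consumer<String> {

        private final CountDownLatch entered = new CountDownLatch(1);
        private final CountDownLatch gate = new CountDownLatch(1);

        /** Blocks the calling (sender) thread until release() is called. */
        @Override
        public void accept(String message) {
            this.entered.countDown();
            try {
                this.gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Send interrupted", e);
            }
        }

        /** Returns true once a send has started, or false on timeout. */
        boolean awaitEntered(long timeoutMs) {
            try {
                return this.entered.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        /** Unblocks every pending and future send. */
        void release() {
            this.gate.countDown();
        }
    }
}
```

A test built on this can assert that the input step has not finished while the client is blocked, call release(), and then assert that the barrier is reached within a short bound, without depending on wall-clock sleeps.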
Newcomer scope
This is a good newcomer task because the suspected area is narrow: one integration suite, the input-step barrier, and the message sender control future path. A complete fix does not need a large algorithm or distributed-runtime redesign; first improving timeout/diagnostics and then isolating the sender/session condition would already be valuable.