[Task] Investigate slow integrate-test stalls in message sender barrier #350

Open
Description

Background

The Integrate test job for HugeGraph Computer can take a long time even though the integration suite is small. The slow part is usually not the number of tests, but a long wait in the message/input synchronization path.

An observed CI log repeatedly prints:

EtcdClient - Wait for keys with prefix 'BSP_WORKER_INPUT_DONE' and timeout 86400000ms, expect 1 keys but actual got 0 keys

The same log shows the worker entering the input step and starting to send vertex messages before the wait:

WorkerService inputstep started
MessageSendManager - Start sending message(type=VERTEX)

So the master is waiting for the worker's BSP_WORKER_INPUT_DONE signal, but the worker has not reached Bsp4Worker.workerInputDone() yet.

Initial code pointers

  • CI runs mvn test -P integrate-test -ntp in .github/workflows/computer-ci.yml.
  • The integrate-test profile includes IntegrateTestSuite, which currently contains SenderIntegrateTest.
  • SenderIntegrateTest has only a few cases, but testOneWorkerWithBusyClient() intentionally slows the send path by wrapping the client's send function with Thread.sleep(100).
  • WorkerInputManager.loadGraph() sends vertices and edges first. Only after it returns does WorkerService.inputstep() call bsp4Worker.workerInputDone().
  • ComputerOptions.BSP_WAIT_WORKERS_TIMEOUT and BSP_WAIT_MASTER_TIMEOUT default to 24 hours, so a hidden sender/session/input problem can become a very slow CI wait instead of a fast, actionable failure.
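To make the last two pointers concrete, here is a minimal, self-contained sketch of why the barrier timeout matters. It does not use any HugeGraph Computer classes: InputBarrierSketch, startWorker, and waitForInputDone are hypothetical names, a CountDownLatch stands in for the BSP_WORKER_INPUT_DONE key in etcd, and the timings are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the master/worker input barrier; not HugeGraph code.
class InputBarrierSketch {

    // Plays the role of the BSP_WORKER_INPUT_DONE key: counted down once by the worker.
    private final CountDownLatch inputDone = new CountDownLatch(1);

    // Worker side: "load the graph" (simulated by a sleep), then signal the barrier.
    // If loading fails silently, the signal is never sent.
    Thread startWorker(long loadMillis, boolean failDuringLoad) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(loadMillis);
                if (failDuringLoad) {
                    return; // simulated hidden sender failure: no signal
                }
                inputDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-worker");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    // Master side: wait for the signal, but only up to timeoutMillis.
    // Returns true if the worker signalled in time, false on timeout or interrupt.
    boolean waitForInputDone(long timeoutMillis) {
        try {
            return inputDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a 24-hour timeout, the failing case above would look like a hang; with a timeout of a few seconds, the same failure surfaces as a false return that the test can turn into an error.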

Relevant files:

.github/workflows/computer-ci.yml
computer/computer-test/src/main/java/org/apache/hugegraph/computer/suite/integrate/IntegrateTestSuite.java
computer/computer-test/src/main/java/org/apache/hugegraph/computer/suite/integrate/SenderIntegrateTest.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/input/WorkerInputManager.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/QueuedMessageSender.java
computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
computer/computer-api/src/main/java/org/apache/hugegraph/computer/core/config/ComputerOptions.java

Related prior symptom: #203 reported "The origin future must be null" in SenderIntegrateTest. That may be in the same control-message/future/session area, but this task is specifically about the slow CI wait and about making the integration test fail fast and easier to debug.

Suggested investigation

  1. Reproduce the integration suite with etcd available:

    cd computer
    mvn test -P integrate-test -Dtest=IntegrateTestSuite -ntp
  2. Confirm which test case spends time before BSP_WORKER_INPUT_DONE. Start with SenderIntegrateTest#testOneWorkerWithBusyClient.

  3. Trace the input path:

    SenderIntegrateTest
     -> WorkerService.execute()
     -> WorkerService.inputstep()
     -> WorkerInputManager.loadGraph()
     -> MessageSendManager.startSend()/finishSend()
     -> QueuedMessageSender.send()
     -> Bsp4Worker.workerInputDone()
    
  4. Check whether START/FINISH control futures in QueuedMessageSender can be left stale, completed late, or hidden behind the sender thread. The stack trace around futureRef in the older issue #203 ([Bug] The origin future must be null) is a useful clue.

  5. Make the test fail fast and print useful diagnostics. Possible directions:

    • set much smaller bsp.wait_workers_timeout / bsp.wait_master_timeout for integration tests;
    • add a JUnit/test-level timeout around each integration case;
    • dump worker/master thread states when the input barrier is not reached;
    • ensure sender exceptions propagate to both the worker future and the master-side wait;
    • replace the sleep-based busy-client simulation with a more deterministic back-pressure or blocked-client fixture.
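As one possible shape for the timeout and thread-dump directions above, the following sketch wraps a test body in a hard time limit and attaches a dump of all live threads when the limit is hit. FailFastSketch, runWithTimeout, and dumpThreads are hypothetical helper names, not existing project code; only standard java.util.concurrent and Thread APIs are used. Where the test framework already offers per-test timeouts, that may be the simpler choice, with a dump like this added on top.

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper; not part of HugeGraph Computer.
class FailFastSketch {

    // Formats the name, state, and stack of every live thread, so a stalled
    // barrier shows where each thread is parked.
    static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    // Runs the body on a separate daemon thread and gives up after timeoutMillis.
    // On timeout, throws an AssertionError whose message carries the thread dump.
    static void runWithTimeout(Runnable body, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "sketch-test-body");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(body);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            String dump = dumpThreads(); // capture before interrupting the body
            future.cancel(true);
            throw new AssertionError(
                    "Timed out after " + timeoutMillis + "ms\n" + dump);
        } catch (ExecutionException e) {
            throw new AssertionError("Test body failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A test case could then call FailFastSketch.runWithTimeout(() -> runTheIntegrationCase(), 60_000), where runTheIntegrationCase is a placeholder for the real test body and the limit is an arbitrary example value.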

Expected result

  • Integration tests should not spend many minutes printing only BSP_WORKER_INPUT_DONE wait logs.
  • If the sender/input path is broken, the test should fail quickly with an actionable error and enough thread/session state to locate the failing component.
  • The slow/busy-client path should have regression coverage so future changes do not reintroduce the long wait.
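For the regression coverage of the busy-client path, a latch-gated fixture is one way to get deterministic blocking instead of Thread.sleep(100). The sketch below is an assumption about how such a fixture could look: BlockedClientSketch is a hypothetical name, and it wraps a plain java.util.function.Consumer, whereas the real client send function in the project has its own signature and would need adapting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical fixture: holds every send until the test calls release().
class BlockedClientSketch<T> implements Consumer<T> {

    private final Consumer<T> delegate;
    // Closed gate: sends park here until release() opens it.
    private final CountDownLatch gate = new CountDownLatch(1);
    // Lets the test observe that at least one send has reached the gate.
    private final CountDownLatch firstSendArrived = new CountDownLatch(1);

    BlockedClientSketch(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T message) {
        firstSendArrived.countDown();
        try {
            gate.await(); // returns immediately once the gate is open
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while blocked", e);
        }
        delegate.accept(message);
    }

    // Waits until a send is parked at the gate; false on timeout or interrupt.
    boolean awaitBlocked(long timeoutMillis) {
        try {
            return firstSendArrived.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Opens the gate so the parked send and all later sends go through.
    void release() {
        gate.countDown();
    }
}
```

The test controls exactly when the client is "busy" and when it recovers, so it can assert that the input barrier is reached after release() without depending on wall-clock sleeps.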

Newcomer scope

This is a good newcomer task because the suspected area is narrow: one integration suite, the input-step barrier, and the message sender control future path. A complete fix does not need a large algorithm or distributed-runtime redesign; first improving timeout/diagnostics and then isolating the sender/session condition would already be valuable.
