Robotics

Showing new listings for Thursday, 2 July 2026

Total of 94 entries

New submissions (showing 53 of 53 entries)

[1] arXiv:2607.00020 [pdf, html, other]
Title: EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories
Subjects: Robotics (cs.RO)

Spatial grounding remains a key limitation of vision-language-action (VLA) systems for robotic manipulation. While current models can recognize objects and follow language instructions, they often lack an explicit representation of how objects are arranged in space, including support, containment, ordering, occlusion, and depth-sensitive relations. We introduce EmbodimentSemantic, a spatial scene-graph dataset and benchmark for evaluating relational grounding in embodied manipulation. EmbodimentSemantic represents scenes as directed object-relation-object triplets, where each triplet specifies a spatial relation between an ordered pair of objects using a fixed set of relations. This representation enables direct evaluation of object binding, relation prediction, and spatial consistency. The dataset includes real-world manipulation observations collected with the low-cost SO101 robot arm, together with generated scene graphs for studying spatial grounding in practical robotic settings. To provide controlled validation, we also introduce a simulator-grounded LIBERO benchmark with over 60K manipulation frames and more than 120K camera-specific scene graphs across paired third-person and wrist views, where ground-truth relations are derived automatically from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. We further test whether scene graphs improve downstream control by injecting them into existing VLA policy prompts. Experiments across open-source and commercial VLMs show that current models often predict plausible relations but struggle with exact depth-aware and viewpoint-dependent spatial structure. EmbodimentSemantic provides a unified framework for diagnosing spatial grounding in VLM perception and testing its utility for VLA manipulation.
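
As a rough illustration of the triplet representation described in this abstract, the following minimal Python sketch stores a scene graph as directed (subject, relation, object) triplets and scores a prediction by exact triplet match. The relation names, the `Triplet` class, and the `triplet_scores` function are assumptions made for illustration; the abstract does not list the dataset's actual relation set or its evaluation code.

```python
from dataclasses import dataclass

# Hypothetical relation vocabulary; the dataset's actual fixed set is not
# given in the abstract.
RELATIONS = {"on", "inside", "left_of", "behind", "occludes"}


@dataclass(frozen=True)
class Triplet:
    """A directed spatial relation: subject --relation--> obj."""
    subject: str
    relation: str
    obj: str

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")


def triplet_scores(predicted, ground_truth):
    """Return (precision, recall, F1) under exact directed-triplet matching."""
    pred, gold = set(predicted), set(ground_truth)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [Triplet("mug", "on", "table"), Triplet("spoon", "inside", "mug")]
# The second prediction names the right objects and relation but reverses the
# direction, so it does not match under this exact-match scoring.
pred = [Triplet("mug", "on", "table"), Triplet("mug", "inside", "spoon")]
scores = triplet_scores(pred, gold)
```

Because the triplets are ordered, a model that names the right objects and relation but reverses their roles is scored as wrong, which is one way object-binding errors can be separated from relation-prediction errors.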

[2] arXiv:2607.00022 [pdf, html, other]
Title: When to Personalize Household Object Search: A Rigidity-Gated Hybrid Policy
Comments: 8 pages. Accepted to IROS 2026
Subjects: Robotics (cs.RO)

Service robots searching for household objects rely on spatial priors to reduce search cost, yet object locations can vary with resident traits. Collecting longitudinal, trait-specific in-home trajectories is invasive and hard to scale. We study when personalization helps and propose PerSim, a rigidity-gated hybrid policy that combines a trait-conditioned prior with a population-frequency baseline, personalizing only when placement behavior is variable. To scale resident-conditioned dynamics, we employ a human-calibrated simulation pipeline to generate and validate object-placement transitions in diverse home layouts, and train a predictor that injects continuous Big Five vectors to output room-level priors and within-room co-occurrence cues. In a unified human study (N=200), dual-layer validation shows that (i) synthetic transitions are judged behaviorally plausible (mean 3.85/5, p < 1e-6), and (ii) in a blinded A/B comparison, personalization is favored primarily for low-rigidity objects (p=0.005), while the population-frequency baseline remains strong for universally placed items, yielding a decision rule for when to personalize. In an offline objective test, we observe a small but significant improvement on unseen continuous trait vectors over nearest discrete configuration matching (p=0.035), supporting interpolation in five-dimensional trait space. Finally, in a home digital twin we show that PerSim reduces expected search cost by combining room visitation effort with within-room cue checking, demonstrating end-to-end gains beyond isolated prediction metrics.
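
The gating idea in this abstract (personalize only when placement behavior is variable) can be sketched as follows. This is a minimal illustration under assumptions, not the PerSim implementation: the `gated_prior` function, the rigidity score in [0, 1], and the 0.5 threshold are all hypothetical.

```python
def gated_prior(population_prior, personalized_prior, rigidity, threshold=0.5):
    """Pick a room-level prior for one object and normalize it.

    rigidity: hypothetical score in [0, 1]; high values mean the object is
    placed in the same spot by almost everyone.
    threshold: hypothetical cut-off above which personalization is skipped.
    """
    chosen = population_prior if rigidity >= threshold else personalized_prior
    total = sum(chosen.values())
    if total <= 0:
        raise ValueError("prior must have positive total mass")
    return {room: weight / total for room, weight in chosen.items()}


population = {"kitchen": 3.0, "study": 1.0}
personal = {"kitchen": 1.0, "study": 3.0}

rigid_object = gated_prior(population, personal, rigidity=0.9)
variable_object = gated_prior(population, personal, rigidity=0.1)
```

A hard switch is used here for simplicity; a soft blend weighted by rigidity would be an equally plausible reading of "hybrid policy".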

[3] arXiv:2607.00024 [pdf, html, other]
Title: Decentralized Geometric Control for Cable-Suspended Payload Transport with Adaptive Mass Estimation
Comments: Accepted to be presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Cooperative aerial transport requires controllers that respect nonlinear manifold geometry, operate without centralized coordination, and satisfy operational safety constraints. To address these demands, we present GPAC, a four-layer hierarchical architecture that enables $N$ quadrotors to transport a cable-suspended payload without a central coordinator and without exchanging cable states or adaptive parameters. The key insight is implicit coordination: each quadrotor independently estimates its effective load share from local cable measurements, so combined forces converge to the correct total, even without knowledge of $N$ or the payload mass; the payload position is reconstructed locally from each agent's own cable geometry, and the only inter-agent communication is a low-rate neighbor-position broadcast for collision avoidance. GPAC operates directly on the full nonlinear configuration manifold and integrates geometric position and attitude control, anti-swing regulation, an extended-state observer for wind rejection, concurrent learning-based mass estimation without persistent excitation, and a priority-ordered control barrier function (CBF)-inspired safety filter that reduces operational risk, with input-to-state safety (ISSf) margins that hold exactly under single-constraint activation. A compatibility result shows that the filter's force modifications keep the desired attitude within the almost-global stability region of the $\mathrm{SO}(3)$ attitude controller. Finally, high-fidelity simulation with flexible cables, onboard sensor fusion, and wind turbulence -- with all control and estimation loops closed through the estimator -- yields a mean payload-tracking RMSE of 33.8 cm (2.8\% coefficient of variation over 13 seeds) at a low per-agent computational cost.

[4] arXiv:2607.00025 [pdf, html, other]
Title: FLYNN: Robust Neural Network for Robot Navigation using Fly Brain Topology
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

While deep learning models achieve state-of-the-art performance in complex tasks, they remain brittle when faced with new environments or sensory deprivation. In contrast, biological systems exhibit remarkable tolerance to these challenges. We address this vulnerability by developing a recurrent neural network (RNN) whose architecture is directly derived from the synaptic-resolution brain connectome of the fruit fly Drosophila melanogaster. We demonstrate the feasibility of training the fly connectome neural network (FLYNN) to perform vision-based navigation in MuJoCo, achieving performance comparable to modern hand-crafted networks of similar parameter counts. Crucially, FLYNN exhibits superior resistance to out-of-distribution (OOD) data and tolerance to sensory loss without further training. It remained functional even under total vision loss while hand-crafted networks largely failed, even when specifically trained with camera dropout. Principal Component Analysis (PCA) of the internal state of FLYNN suggests that it exhibits a particularly high degree of representational modularity, which might be related to its robustness. Our work provides a new direction for designing resilient artificial agents following the topology of biological brains.

[5] arXiv:2607.00026 [pdf, html, other]
Title: Invariant Stochastic Filtering on SE(3) for Inertial-Encoder State Estimation of Serial Rigid Manipulators
Comments: This document is an arXiv preprint posted for open access and citation purposes. It is under review and subject to revision
Subjects: Robotics (cs.RO)

An invariant extended Kalman filter (IEKF) is developed for state estimation of serial rigid manipulators with an arbitrary number of links, formulated entirely within the Lie group SE(3). The group-affine property of the kinematic equations makes the linearised error dynamics autonomous, so the Riccati equation governs the true error covariance rather than a local approximation. A physically separated noise model treats gyroscope and accelerometer channels independently: the accelerometer provides translational twist via gravity-compensated integration, yielding a measurement covariance that scales with the sample interval in exact analogy with process noise discretisation; a state-dependent Coriolis noise term captures gyroscope noise propagating through the nonlinear dynamics, vanishing at rest and growing with twist magnitude. The filter is structured as a modular chain of per-link IEKFs in which the predicted covariance of each link depends on its predecessor only through the Adjoint-transformed posterior, giving linear computational cost in link count. Exponential ultimate boundedness in mean square is established via a Lie algebra Lyapunov function, with per-link bounds chained through the Adjoint operator norm to yield a stability certificate that is modular and scalable to arbitrary chain length. Numerical results validate the design.

[6] arXiv:2607.00027 [pdf, other]
Title: Urban Deceleration Behavior Modes Under Scene Context: An Early-Kinematic Classifier from Argoverse 2 Multi-Agent Trajectories
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)

Urban deceleration is one of the most empirically studied yet least taxonomically organized behaviors in car-following research. Recent perception-equipped autonomous-vehicle datasets enable trajectory-anchored mode discovery. We extract 1,219 sustained deceleration events from 234 urban driving logs of the Argoverse 2 Sensor dataset, encode each event in a 19-dimensional kinematic feature vector, discover behavioral modes via K-means clustering with bootstrap stability analysis, and quantify modulation by eleven scene-context variables. A HistGradientBoosting classifier predicts mode membership from the first 1.0 s of each event. Four stable modes emerge with a bootstrap Adjusted Rand Index of 0.897 across 50 resamples: anticipatory soft (62.8%), reactive closing (30.6%), brake-like jerk (4.8%), and an outlier category (1.8%). Only pair age shows a medium effect (epsilon^2 = 0.085); scene geometry and vulnerable-road-user proximity show negligible effects. The early-event classifier achieves macro-F1 = 0.758 at 1.0 s, with scene context contributing +0.059 F1 over kinematics alone. Modes are regime-invariant in medium-speed driving (ARI = 0.817) but regime-dependent at low speed (ARI = 0.166). A small set of stable kinematic modes structures urban deceleration; early-window jerk dominates predictive signal; and pair age is the primary contextual modulator.
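
Since the four modes are heavily imbalanced (62.8% down to 1.8%), the macro-F1 metric quoted in this abstract weights each mode equally regardless of its frequency. The sketch below shows how macro-F1 is computed in general; the `macro_f1` function and the toy labels are illustrative assumptions and are unrelated to the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes present in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Toy labels: class 0 has F1 = 2/3, class 1 has F1 = 0.8.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
toy_score = macro_f1(toy_true, toy_pred)
```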

[7] arXiv:2607.00028 [pdf, html, other]
Title: Trajectory Learning with Graph Representations for Social Robot Navigation
Subjects: Robotics (cs.RO)

Autonomous mobile robots are expected to exhibit socially compliant navigation for minimizing pedestrian disturbance. While capturing social interactions and incorporating pedestrian motion estimations into decision-making are beneficial for compliance, prior methods fail to address both spatial and temporal characteristics present in real-world data. Reinforcement Learning offers high capability, but it requires hand-crafted reward functions that reduce social behavior to static criteria, limiting its ability to reproduce patterns that exist in real pedestrian behavior. Imitation Learning offers direct training from real-world data but lacks modeling of social interactions and suffers from error accumulation. To this end, we propose an imitation learning framework that leverages spatiotemporal dynamics for socially compliant navigation. To represent social context based on interactions, we introduce a graph-based auxiliary network that encodes crowd states by attending to pedestrians. In addition, we present a navigation module that captures temporal dynamics and mitigates error accumulations by incorporating encoded state predictions and employing a trajectory-level learning objective. Our framework outperforms established data-driven baselines on simulation and a real-world dataset across diverse social metrics.

[8] arXiv:2607.00029 [pdf, html, other]
Title: Memory-Native Non-Terrestrial Networks for Embodied Intelligence
Comments: 8 pages, 4 figures, 2 tables, submitted to IEEE for possible publication
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)

Non-terrestrial networks (NTN) provide ubiquitous connectivity for embodied intelligence (EI), enabling robots in the wilderness to leverage cloud resources or report critical information to remote centers. However, the synergy is nontrivial due to the highly dynamic, resource-constrained, topology-varying, and task-oriented environment. Existing memoryless NTN protocols become inefficient, since their decisions are driven by local channel conditions and instantaneous service demands. To address these limitations, this paper proposes the memory-native NTN (MemNTN) paradigm that leverages long-horizon contexts for memory-augmented system optimization. To realize this paradigm shift, we establish a dual-memory architecture that distinguishes between physical memory representing the state of the world and digital memory encoding historical network experience. We develop memory acquisition, compression, valuation, update, and utilization mechanisms that facilitate cross-layer, memory-native decision-making, spanning from the physical and access layers up to the network and application layers. Experiments in satellite embodied question answering (SEQA) demonstrate that the proposed MemNTN significantly outperforms conventional stateless NTN and terrestrial approaches.

[9] arXiv:2607.00030 [pdf, html, other]
Title: A Unified Benchmark for RCM-Constrained Visual Servoing: Modeling-Controller Interaction and Robustness Analysis in Laparoscopic Robots
Subjects: Robotics (cs.RO)

In robot-assisted laparoscopic minimally invasive surgery (MIS), accurate enforcement of the remote center of motion (RCM) constraint is critical for safe and stable automatic field-of-view (FoV) adjustment. Although control-based RCM strategies are widely adopted due to their flexibility and cost-effectiveness, systematic comparison of different RCM formulations and image-based visual servoing (IBVS) frameworks remains challenging due to the lack of a unified and reproducible benchmark. This paper presents an open-source simulation framework integrating three representative RCM modeling approaches and six IBVS-based control architectures within a unified velocity-level formulation, enabling controlled and consistent evaluation. Through structured case studies, the framework reveals key structural sensitivities arising from modeling and controller interactions, including the impact of tangent-plane definition, constraint dimensionality, open- versus closed-loop enforcement, and robustness near kinematic singularities. All resources are released and demonstrations are provided in the supplementary video, providing a reproducible foundation for RCM-constrained visual servoing research.

[10] arXiv:2607.00031 [pdf, html, other]
Title: Joint Discovery of Object and Action Symbols through Effect Prediction for Robotic Manipulation Planning
Subjects: Robotics (cs.RO)

To perform complex manipulation planning, autonomous robots are required to abstract continuous, high-dimensional sensorimotor interactions into discrete object and action representations. Earlier work either categorized objects based on visual appearances, which fails to distinguish objects that appear similar but behave differently, or based on effects under interaction, but was limited to predefined actions. To address these limitations, we propose a model that jointly discovers high-level manipulation primitives and object categories through a binary bottleneck layer, trained to predict multi-modal outcomes, including object motion, contact, and force feedback, from random interaction data. Building on these discovered binary representations, we leverage a discrete planning method that uses intermediate steps in the predicted effect trajectory to enable partial action executions for precise low-level control. Additionally, we evaluate our framework's generalization capabilities on novel objects by assigning object categories through comparing a small number of interaction effects with the predicted effects of learned object symbols, enabling few-shot generalization based on behavior rather than visual similarity. We conduct experiments on tabletop repositioning and stacking tasks, and confirm that our effect-driven planning approach outperforms both a state-of-the-art method and a visual-based alternative in planning precision across seen and novel objects.

[11] arXiv:2607.00033 [pdf, html, other]
Title: Learning Dexterous Manipulation Using Contact Wrench Guidance From Human Demonstration
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Dexterous robot manipulation can benefit from the abundance of human demonstrations, but transferring such demonstrations to robot policies remains challenging. We present Contact Wrench Guidance from Human Demonstration in Robotic Dexterous Manipulation (CHORD), a framework for long-horizon manipulation of rigid and articulated objects with reinforcement learning. The key idea is object-centric contact wrench space guidance: we represent human and robot motions by the forces and torques they can induce on the object, enabling similarity to be measured by the induced instantaneous motions. This guidance makes reinforcement learning more scalable for contact-rich dexterous manipulation. We further introduce a large-scale simulation benchmark with 4,739 bimanual dexterous manipulation tasks, constructed from motion-capture datasets and reconstructed in-house videos. Evaluated on 1,831 benchmark tasks, CHORD achieves an average success rate of 82.12%, demonstrating strong scalability. CHORD also generalizes to whole-body manipulation from hand-only and third-person demonstrations, achieving a 90.77% success rate, and the learned policies transfer to the real world in both open-loop and closed-loop settings.

[12] arXiv:2607.00065 [pdf, html, other]
Title: Optimal any-angle path planning in static and dynamic environments
Comments: 33 pages, 13 figures
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Any-angle path planning extends traditional graph-based path planning by allowing movement between any pair of vertices, rather than being restricted by predefined edges. It can find straighter and shorter paths in continuous space with graphs, making it particularly suitable for navigation in open areas such as airspaces, warehouses, and oceans. Many any-angle path-planning algorithms have been proposed, but only a few can guarantee optimal solutions, especially in the presence of dynamic obstacles. To address this challenge, this article focuses on optimal any-angle path planning on grids and introduces two general techniques that accelerate computation while preserving optimality in both static and dynamic environments: 1) elliptical forward expansion, which leverages ellipse-based neighborhoods to restrict the search space, and 2) field of view, which replaces traditional line-of-sight methods to speed up visibility checks. To integrate these two techniques, inverted and forward scanning are introduced. Inverted scanning establishes visual connections from open nodes, whereas forward scanning initiates scans from closed nodes. Building on the proposed techniques, Zeta* and Zeta*-SIPP are developed for static and dynamic environments respectively. Zeta*, when combined with forward scanning, is similar to the state-of-the-art algorithm Anya and attains comparable performance. Unlike Anya, Zeta* can be readily extended to other settings, such as dynamic environments (e.g., Zeta*-SIPP). Zeta*-SIPP, with either scanning method, is more than 20 times faster than the corresponding state-of-the-art optimal planner TO-AA-SIPP. Overall, this research identifies the key requirements for achieving optimal any-angle path planning and introduces a unified approach suitable for different environments.
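
For context, the traditional line-of-sight test that this abstract says is replaced by field of view can be sketched with a Bresenham-style walk between two cell centres. This is a generic illustration, not the Zeta* algorithm: the `line_of_sight` function and the cell-centre grid convention are assumptions, and this simple variant can pass between two diagonally adjacent blocked cells, which an optimal planner would need to handle more carefully.

```python
def line_of_sight(grid, start, goal):
    """Return True if no blocked cell lies on the Bresenham line start -> goal.

    grid[y][x] is truthy for blocked cells; start and goal are (x, y) tuples.
    """
    (x0, y0), (x1, y1) = start, goal
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    step_x = 1 if x1 >= x0 else -1
    step_y = 1 if y1 >= y0 else -1
    err = dx - dy
    x, y = x0, y0
    while True:
        if grid[y][x]:
            return False
        if (x, y) == (x1, y1):
            return True
        doubled = 2 * err
        if doubled > -dy:  # advance along x
            err -= dy
            x += step_x
        if doubled < dx:   # advance along y
            err += dx
            y += step_y


free = [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
blocked_centre = [[0, 0, 0],
                  [0, 1, 0],
                  [0, 0, 0]]

diagonal_free = line_of_sight(free, (0, 0), (2, 2))
diagonal_blocked = line_of_sight(blocked_centre, (0, 0), (2, 2))
top_row_clear = line_of_sight(blocked_centre, (0, 0), (2, 0))
```

When the check succeeds, an any-angle planner can connect the two vertices with a single straight segment instead of a chain of grid edges, which is where the shorter paths come from.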

[13] arXiv:2607.00066 [pdf, html, other]
Title: Learning Expert Strategy for Autonomous Robotic Endovascular Intervention via Decoupled Procedural Execution
Comments: This paper has been accepted by IEEE/RSJ IROS 2026. 8 pages, 4 figures, 3 tables
Subjects: Robotics (cs.RO)

Endovascular interventions are high-stakes procedures requiring precise device operation within complex and tortuous vascular anatomies. Autonomous endovascular navigation has the potential to standardize procedural quality and reduce the performance variability inherent in manual operation. Although Reinforcement Learning (RL) approaches have demonstrated promise in enabling autonomy in endovascular intervention, they often struggle with explicit constraint satisfaction and safety guarantees. To address these challenges, a learning-based expert strategy is introduced, enhancing procedural consistency in autonomous endovascular intervention by explicitly decoupling high-level strategic decision-making from low-level procedural execution. The proposed framework replicates the expert clinical decision-making process: a strategic RL policy generates global navigation intents, which are subsequently refined through an expert-informed execution module. This module ensures that robot movements strictly adhere to expert operational norms, real-time kinematic limits, and vessel safety constraints. Experimental evaluation across high-fidelity 3D simulations and a real-world robotic platform demonstrates that the proposed framework not only outperforms baseline policies but also effectively replicates expert-level proficiency. The framework achieves a high navigation success rate (> 96%) and a 29.3% reduction in operational steps, which translates to enhanced operative efficiency and minimized device-vessel interaction. Furthermore, a 13% reduction in trajectory variance indicates superior procedural standardization, aligning autonomous behavior with established clinical norms. These results underscore its potential to enhance the predictability, safety, and consistency of robotic endovascular interventions.

[14] arXiv:2607.00141 [pdf, other]
Title: AD-MPCC: Adaptive Differentiable Model Predictive Contouring Control for Autonomous Racing
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

This paper presents Adaptive Differentiable Model Predictive Contouring Control (AD-MPCC), a framework for autonomous racing that integrates differentiable MPCC with online parameter estimation to handle varying road-surface conditions. For online parameter estimation, we leverage a parameterized Pacejka Magic Formula together with a regularized moving-horizon estimation scheme with exponentially decaying weights to capture road interactions and update parameters in real time. Furthermore, we propose a differentiable MPCC (Diff-MPCC) framework that enables optimal adjustment of objective weights based on predefined long-horizon performance costs. To implement Diff-MPCC for online objective weight adaptation, we propose a Pacejka-informed machine learning model that is trained in a supervised manner using data generated by Diff-MPCC to tune the objective weights. Simulation results demonstrate that AD-MPCC reliably ensures safety and achieves faster lap times compared to baseline controllers in both single-surface and multiple-surface scenarios.
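
The Pacejka Magic Formula mentioned in this abstract has the standard form F(x) = D sin(C arctan(Bx - E(Bx - arctan(Bx)))), where x is a slip quantity and B, C, D, E are the stiffness, shape, peak, and curvature factors. The sketch below evaluates that standard form only; the `pacejka` function name and the coefficient values are illustrative assumptions, and the paper's own parameterization and estimator are not shown in the abstract.

```python
import math


def pacejka(slip, B=10.0, C=1.9, D=1.0, E=0.97):
    """Standard Magic Formula; the default coefficients are illustrative only.

    B: stiffness factor, C: shape factor, D: peak value, E: curvature factor.
    """
    bx = B * slip
    return D * math.sin(C * math.atan(bx - E * (bx - math.atan(bx))))


# Sample the curve at a few slip values; the output is bounded by D.
samples = [pacejka(s / 100.0) for s in range(-30, 31, 5)]
```

Online estimation schemes of the kind the abstract describes would adjust coefficients such as B, C, D, and E as the road surface changes, which shifts the peak and slope of this curve.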

[15] arXiv:2607.00142 [pdf, html, other]
Title: Stop Pretending Social Robots Are Inevitable
Comments: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
Subjects: Robotics (cs.RO)

This paper takes issue with the recent themes of both the RO-MAN and the HRI conferences for their portrayal of a future human-robot society as inevitable. The focus is on discussing how such statements ultimately shape research. By treating a future human-robot society as a fait accompli, license is given for user studies to imagine any scenario they like, no matter whether it has any ecological relevance, and to emphasise the scenario design over actually creating robot abilities needed to fulfil the imagined role. Meanwhile, research that focusses on actual societal needs, without assuming that robots are a solution, is deprioritised, as is technical development, in particular with respect to abilities that are necessary to enable robots that function as social agents rather than a mere automation of tasks. A frame that simply assumes a robot future not only detracts from scientific advancement in favour of a techno-solutionism we ought to resist, it is also self-defeating as it risks stifling the research needed to bring it about. We should therefore reject attempts to frame and promote the field in terms of the inevitable social robot and instead focus on one that facilitates advances in the field regardless of what the future holds. This paper suggests that a renewed focus on cognitive mechanisms necessary for the "I" in HRI would be a good starting point.

[16] arXiv:2607.00145 [pdf, html, other]
Title: Iterated Invariant EKF for 3D Landmark-Aided Inertial Navigation
Subjects: Robotics (cs.RO)

Inertial navigation systems aided by three-dimensional landmark measurements constitute a fundamental problem in robotic perception and state estimation. Classical SO(3)-based Extended Kalman Filter (SO(3)-EKF) approaches provide practical solutions, but suffer from the false observability problem, in which the filter becomes overconfident in unobservable directions, leading to degraded estimation performance. The Invariant EKF (IEKF) addresses this limitation by reformulating the system dynamics as a group-affine system on a Lie group, although its measurement update does not fully satisfy certain state compatibility properties. More recently, the Iterated Invariant EKF (IterIEKF) was proposed to further improve the IEKF by ensuring, in the low-noise regime, that the estimated state remains on the observed state manifold while the uncertainty is confined to its tangent space. In this work, we formulate and apply the IterIEKF to landmark-based inertial 3D localization for the first time. Through numerical simulations, we show that the proposed approach outperforms the classical SO(3)-EKF, the Iterated SO(3)-EKF, and the IEKF in terms of both estimation accuracy and consistency.

[17] arXiv:2607.00148 [pdf, html, other]
Title: 3D Point World Models: Point Completion Enables More Accurate Dynamics Learning
Comments: 21 Pages
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Learning predictive models of the world enables robotic control through planning, potentially allowing robots to improvise solutions on new tasks. However, large video-based dynamics models lack explicit 3D spatial structure and suffer from geometrically inconsistent long-term rollouts with compounding errors. Emerging 3D dynamics models based on partial point clouds improve geometric consistency but remain sensitive to occlusions and accumulated prediction drift. To address these challenges, we present 3D Point World Models (3DPWM) - a task-agnostic world model that operates entirely in 3D space by first completing partial point clouds and then learning action-conditioned dynamics in this completed 3D scene. By operating on completed geometry, 3DPWM enables reliable long-horizon rollouts and more accurate cost evaluation for model-based planning while supporting adaptation to new tasks. Experiments across different robotic embodiments and tabletop manipulation benchmarks demonstrate that 3DPWM achieves significantly more reliable long-horizon rollouts (100-300+ steps), supports both open-loop and closed-loop planning, and enables successful sim-to-real transfer.

[18] arXiv:2607.00156 [pdf, other]
Title: Dual-Informed Vertical Expansion for Multi-Objective Node Selection in Anytime Conflict-Based Search
Subjects: Robotics (cs.RO)

Conflict-Based Search (CBS) is a leading exact algorithm for Multi-Agent Path Finding (MAPF), but its high-level node-selection rule is usually treated as a fixed implementation detail. Standard best-first selection is strong for minimizing expanded nodes and closing the optimality certificate, yet it can maintain a large frontier, interrupt parent-child expansion sequences, and provide no feasible incumbent until termination. This paper studies node selection as a first-class design choice for exact CBS. We introduce Dual-Informed Vertical Expansion (DIVE), a policy that is best-bound between dives and depth-oriented within a dive. DIVE starts each dive from the current best-bound frontier, follows promising children to exploit parent-child locality, and uses incumbent pruning to limit unproductive excursions. We formalize CBS node selection through a branch-and-bound view, prove that the traversal policy can be changed without affecting exactness, and analyze the resulting trade-offs among expanded nodes, dive breaks, queue size, and primal-dual bound progress. The analysis predicts three complementary extremes. Best-first search is node efficient, iterative deepening is memory efficient, and DIVE is dive efficient while retaining regular best-bound reanchoring. Experiments on standard MAPF benchmarks support this trade-off map. DIVE consistently reduces dive breaks, provides early incumbents with certified gaps, uses substantially less queue memory than best-first search, and benefits from warm starts and simple responsive variants in dense or memory-limited regimes.

[19] arXiv:2607.00160 [pdf, html, other]
Title: Distributed Multi Robot Lunar Cargo Transportation via Phase Decomposed Reinforcement Learning
Comments: 8 pages, 9 Figures, Accepted at IROS2026
Subjects: Robotics (cs.RO)

Modular reconfigurable robotic systems provide a scalable solution for cooperative surface operations in future lunar missions. However, cooperative cargo transportation remains challenging due to morphology-dependent topology changes, strong payload-induced coupling, long-horizon decision making, and safety constraints. This paper proposes a phase-decomposed reinforcement learning framework for cooperative cargo transport with distributed robotic units. The task is decomposed into lifting, transportation, and placement, each optimized with a dedicated joint-state policy capturing inter-agent coupling. Centralized training promotes stable convergence, while deployment uses onboard proprioception for control and OptiTrack motion capture for ground-truth evaluation and post-processed metrics. A deterministic phase controller expressed in Markov state representation regulates transitions between stages, and a failure-sensitive synchronization mechanism ensures coordinated progression and safety-aware halting during real-world execution. The framework is evaluated in simulation and through controlled field experiments at a JAXA space exploration test facility. Results demonstrate reliable cooperative transport across all stages in both simulation and hardware experiments.

[20] arXiv:2607.00191 [pdf, html, other]
Title: HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems
Comments: Accepted at IROS 2026
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Collaborative-perception enables multi-robot systems to enhance situational awareness by sharing perceptual information. Existing collaborative-perception systems face an inherent trade-off between communication bandwidth requirements and perception accuracy, where methods that exchange more information achieve better perception results at the cost of increased communication overhead. However, real-world communication networks impose bandwidth constraints that require minimizing communication overhead without sacrificing perception performance. To address this challenge, we propose HydraCollab, an adaptive collaborative-perception framework that (i) selectively transmits the most informative sensor features and (ii) dynamically employs collaboration strategies (intermediate or late) based on spatial confidence maps. Extensive evaluations on the V2X-R, V2X-Radar and UAV3D-mini datasets demonstrate that HydraCollab achieves the best overall trade-off between accuracy and communication cost among existing collaborative-perception methods. Relative to SOTA Where2comm, HydraCollab uses only 41% of the bandwidth on V2X-R and 26% on V2X-Radar while improving performance by 0.78% and 0.75% respectively. Our code and models are available at this https URL.

[21] arXiv:2607.00215 [pdf, html, other]
Title: ELMP: Efficient Learning for Motion Planning via Analytical Policy Gradients
Comments: 8 pages, 7 figures, 4 tables
Subjects: Robotics (cs.RO)

Neural Motion Planners (NMPs) enable fast reactive motion generation, but adapting them to new environments typically requires recollecting large expert datasets, which is computationally prohibitive. We propose ELMP, a framework for data-efficient adaptation via self-supervised fine-tuning. Rather than generating additional expert trajectories with expensive global planners, ELMP directly optimizes the policy through a differentiable kinematic layer using dense collision, target-reaching, and smoothness objectives. This replaces expert data generation with rapid problem sampling, reducing per-sample adaptation cost by roughly two orders of magnitude. To further support robust generalization across changing kinematic chains, we introduce a mechanism to explicitly encode tool geometry via point clouds. Benchmarked against classical and neural baselines, ELMP achieves an 84.8% average success rate with orders-of-magnitude lower cold-start latency than classical methods. In unseen environments, self-supervised fine-tuning improves success rate from 57.3% (zero-shot) to 89.8%, removing the data collection bottleneck. Our approach maintains millisecond-level inference latency and is validated on a physical Franka Emika Panda robot.

[22] arXiv:2607.00272 [pdf, html, other]
Title: ASPIRE: Agentic Skills Discovery for Robotics
Comments: 43 pages, 12 figures, 9 tables. Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introduce ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system that autonomously writes and refines robot control programs in a code-as-policy paradigm while compounding experience into a reusable skill library. ASPIRE discovers skills that persist across tasks, simulation and real-world settings, and embodiments. It operates in an open-ended loop with three components: (1) a closed-loop robot execution engine that exposes fine-grained multimodal traces, enabling autonomous failure diagnosis, repair synthesis, and validation; (2) a continually expanding skill library that distills validated fixes into reusable, transferable knowledge; and (3) evolutionary search that generates diverse task sequences and control programs to explore beyond single-trajectory refinement. ASPIRE surpasses prior methods by up to 77% on LIBERO-Pro manipulation under perturbation, 72% on Robosuite bimanual handover, and 32% on BEHAVIOR-1K long-horizon household tasks. Its accumulated library also enables zero-shot generalization to unseen long-horizon tasks: on LIBERO-Pro Long, ASPIRE achieves 31% success versus 4% for prior methods despite their use of test-time reasoning and retries. Finally, simulation-discovered skills provide initial evidence of sim-to-real transfer, substantially reducing real-robot programming effort across different embodiments and robot APIs.

[23] arXiv:2607.00283 [pdf, other]
Title: What's Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models
Comments: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026). 9 pages, 5 figures, 5 tables
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Autonomous vehicles must safely navigate complex environments where planning-critical agents may be hidden from view. Current approaches often treat all occlusions with uniform conservatism, yielding needlessly defensive driving, or they infer hidden spaces without estimating the impact on the planner. This work bridges the critical gap between perception and planning by enabling Vision-Language Models (VLMs) to identify and reason about the specific hidden agents that are most critical to the ego-vehicle's trajectory. We introduce a novel framework that uses Planning KL-divergence (PKL), an information-theoretic metric, to systematically identify and rank occluded agents based on their impact on the ego vehicle's plan. Using this planning-aware ranking, we employ an expert VLM (GPT-5) to generate rich, structured annotations that capture the visual evidence and reasoning required for this task. We apply this framework to the nuScenes dataset to create a new benchmark focused on high-impact scenarios. We conduct comprehensive experiments on a wide range of general-purpose and domain-adapted VLMs, demonstrating that fine-tuning on our PKL-guided data yields dramatic performance improvements across all models. Notably, our results show that smaller, fine-tuned models significantly outperform their much larger zero-shot counterparts, and that our PKL-guided data selection strategy improves performance by approximately 30% over random sampling. Our work presents the first systematic approach for training VLMs to focus on planning-critical occlusions, enabling more semantically grounded and efficient risk assessment in autonomous driving.

[24] arXiv:2607.00326 [pdf, html, other]
Title: NeHMO: Neural Hamilton-Jacobi Reachability Learning for Decentralized Safe Multi-Arm Motion Planning
Subjects: Robotics (cs.RO)

Safe multi-arm motion planning is a challenging problem in robotics due to its high dimensionality, coupled configuration space, and complex collision constraints. Centralized planners are capable of coordinating all arms but often face scalability limitations, restricting applicability in real-time settings. On the other hand, decentralized methods are scalable and recent deep learning-based approaches have shown promising results. However, these depend on accurate behavior prediction or coordination protocols and may fail when other arms act unpredictably. To address these challenges, we introduce a neural Hamilton-Jacobi Reachability (HJR) learning-based approach to approximate a safety value function that captures worst-case inter-arm safety constraints. We further develop a decentralized trajectory optimization framework that uses the learned HJR representation for real-time planning. The proposed method is scalable and data-efficient, generalizes across multi-manipulator systems, and outperforms state-of-the-art baselines on challenging multi-arm motion planning tasks.

[25] arXiv:2607.00351 [pdf, html, other]
Title: Unleashing More Actions via Action Compositional Training for VLA Models
Subjects: Robotics (cs.RO)

Vision-Language-Action (VLA) models excel at robotic manipulation, driven by the scale and diversity of demonstration data. However, standard training paradigms often cause VLA models to severely overfit to specific behavioral patterns, rendering them unable to generalize to out-of-distribution scenarios even when those scenarios merely require novel combinations of identical sub-skills. While expanding datasets can mitigate this overfitting, acquiring high-quality robot data remains notoriously labor-intensive and cost-prohibitive. To resolve this impasse without expensive human teleoperation and to truly unleash more actions (i.e., enable VLA models to compose known sub-skills into a much broader set of executable behaviors beyond the original demonstrations), we propose ACT-VLA (Action Compositional Training for VLA Models), an offline data augmentation framework that leverages the model's latent task representations to synthesize novel, physically valid demonstrations directly from existing tasks for policy training. By eliminating additional manual data collection, our method automatically expands the training distribution and mitigates overfitting. We evaluate our approach on challenging manipulation tasks in simulation. Experiments demonstrate that while baseline VLA models generalize poorly because they overfit to the original distribution, policies trained with our synthesized data achieve substantially higher success rates, validating that leveraging existing tasks for automated demonstration synthesis provides an effective, scalable, and data-efficient route to broadening VLA generalization.

[26] arXiv:2607.00424 [pdf, html, other]
Title: Robust Operational Space Control with Conformal Disturbance Bounds for Safe Redundant Manipulation
Comments: Paper accepted to IROS 2026
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Redundant robotic manipulators operating in constrained and human-interactive environments require accurate task-space tracking together with rigorous safety guarantees under dynamic uncertainties. The classical operational space computed torque controller (OSCTC) relies on accurate dynamic models and degrades in the presence of disturbances. In contrast, the data-driven paradigm of residual learning approximates disturbances as functions learned from full-state measurements, which are often noisy in practice; it also lacks rigorous theoretical guarantees and introduces additional design complexity. This paper proposes a robust OSCTC framework that integrates an extended state observer (ESO) with conformal prediction to combine model-based robustness and data-driven adaptability. The ESO estimates lumped disturbances directly in operational space without requiring full-state measurements as in residual learning, and a robust control barrier function (CBF) is constructed to enforce safety under uncertainty. However, robust CBFs require a known disturbance-variation bound to guarantee absolute safety, which often leads to conservatism in practice. To address this limitation, we further employ a sliding-window conformal prediction mechanism to estimate the bound online in a distribution-free manner, thereby achieving practical probabilistic safety guarantees. Experiments on a 7-DoF Franka Research 3 manipulator demonstrate millimeter-level tracking accuracy and real-time safe control at 1 kHz under various disturbances.
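The sliding-window conformal step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SlidingWindowConformalBound`, the choice of a plain split-conformal quantile, and the use of a scalar nonnegative score (for example, the magnitude of the disturbance variation between control steps) are all assumptions made for illustration.

```python
import math
from collections import deque


class SlidingWindowConformalBound:
    """Hypothetical helper: online upper bound on a nonnegative scalar score.

    Keeps the most recent `window` scores and returns the standard
    split-conformal quantile at miscoverage level `alpha`.
    """

    def __init__(self, window: int, alpha: float):
        self.scores = deque(maxlen=window)  # old scores fall out automatically
        self.alpha = alpha

    def update(self, score: float) -> None:
        self.scores.append(score)

    def bound(self) -> float:
        n = len(self.scores)
        if n == 0:
            return math.inf  # no data yet: no finite bound can be claimed
        # Conformal rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
        k = math.ceil((n + 1) * (1 - self.alpha))
        if k > n:
            return math.inf  # window too short for the requested coverage
        return sorted(self.scores)[k - 1]


estimator = SlidingWindowConformalBound(window=5, alpha=0.5)
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    estimator.update(s)
first_bound = estimator.bound()
estimator.update(10.0)  # the oldest score (1.0) leaves the window
second_bound = estimator.bound()
```

Returning infinity when the window is too short is one conservative design choice; a controller using such a bound would need its own fallback for that case.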

[27] arXiv:2607.00442 [pdf, html, other]
Title: Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Reinforcement learning (RL) for quadruped locomotion commonly depends on fixed, hand-crafted, Markovian reward functions that limit the interpretability of learned policies and offer no explicit control over gait behaviors. We introduce a framework where distinct gaits are specified using parameterized constraints expressed in Signal Temporal Logic (STL). These include safety bounds, gait synchronization constraints, command tracking, and actuation bounds. From these specifications, we develop a reward shaping mechanism that provides learning agents a dense, continuous reward landscape that encodes desired behavior. We define parametric STL templates for three speed regimes (walking-trot, trot, bound), calibrate their parameters from reference rollouts, and compute rewards using smooth approximations of STL robustness over the rollouts. The generated rewards can be used to provide shaped gradients compatible with Proximal Policy Optimization (PPO). We instantiate the approach on Google's Barkour quadruped robot in MuJoCo XLA (MJX). We use parallelization within the simulator to improve training speeds and use domain randomization to robustify learned policies. We show that compared to a baseline of hand-crafted rewards, the STL-shaped rewards yield tighter velocity tracking and more stable training. Videos can be found on our project website: this https URL.
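As a rough illustration of what a smooth approximation of STL robustness can look like, the sketch below uses a log-sum-exp soft minimum for an "always" specification. The abstract does not say which approximation the authors use, so the soft-min form, the function names `soft_min` and `always_at_least`, and the `temperature` parameter are assumptions.

```python
import math


def soft_min(values, temperature=10.0):
    """Smooth under-approximation of min(values) via log-sum-exp.

    Larger `temperature` tracks the true minimum more closely.
    """
    m = min(values)  # shift by the minimum for numerical stability
    s = sum(math.exp(-temperature * (v - m)) for v in values)
    return m - math.log(s) / temperature


def always_at_least(signal, threshold, temperature=10.0):
    """Smooth robustness of 'always (signal >= threshold)' over a rollout.

    Positive values indicate the specification holds with margin.
    """
    return soft_min([x - threshold for x in signal], temperature)


# Illustrative rollout of forward speed; all numbers are made up.
satisfied = always_at_least([0.5, 0.6, 0.7], threshold=0.4)
violated = always_at_least([0.5, 0.3], threshold=0.4)
```

Because the soft minimum never exceeds the true minimum, a positive value here implies the exact robustness is also positive, which is one reason this form is a common choice for reward shaping.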

[28] arXiv:2607.00444 [pdf, html, other]
Title: Search-Based Spatiotemporal and Multi-Robot Motion Planning on Graphs of Space-Time Convex Sets
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Spatiotemporal motion planning, especially in multi-robot settings, requires robots to reason about collision-free regions that change over time, which is challenging in continuous spaces when feasible regions are transient and geometrically constrained. We present an algorithmic framework based on graphs of space-time convex sets (ST-GCSs), where collision-free regions are represented as convex sets in space-time and trajectories correspond to paths on the graph together with continuous motions within the selected sets. We formulate time-optimal planning on ST-GCSs as a graph-search problem over path-indexed states and develop a best-first search solver that evaluates partial paths via continuous trajectory optimization, guided by admissible heuristics and dominance checks. We further present an Exact Convex Decomposition (ECD) scheme to reserve trajectory occupancies in space-time, enabling unified handling of dynamic obstacles and multi-robot interactions. For multi-robot motion planning, we integrate ST-GCS planning and ECD into prioritized planning methods and introduce a windowed coordination scheme to improve efficiency. Extensive experiments on single-robot and multi-robot problems demonstrate substantial speedups over various planners while maintaining high solution quality, particularly in environments with narrow and transient feasible regions. Large-scale demonstrations further show that the proposed multi-robot motion planner can solve instances with up to 100 robots within only a few minutes. Project homepage: this https URL

[29] arXiv:2607.00483 [pdf, html, other]
Title: VLM-AR3L: Vision-Language Models for Absolute and Relative Rewards in Reinforcement Learning
Comments: Accepted at IJCAI 2026. Project website: this https URL
Subjects: Robotics (cs.RO)

Designing effective reward functions remains a major challenge in reinforcement learning (RL), particularly in open-ended environments where task goals are abstract and difficult to quantify. In this work, we present VLM-AR3L, a framework that leverages Vision-Language Models (VLMs) to provide both absolute and relative rewards for RL. VLM-AR3L interprets an agent's visual observations in the context of a natural language task goal, and learns both absolute and relative rewards from VLM-generated preference labels. The absolute reward model predicts scalar evaluations for individual states, while the relative reward model compares consecutive observations to infer progress or regression toward the task goal. Their integration combines the stability of state-based evaluation with the robustness of comparative supervision. We evaluate VLM-AR3L across benchmarks spanning classic control, manipulation, and open-world embodied tasks, with a particular focus on Minecraft given its visual complexity and long-horizon decision-making requirements. Experimental results show that VLM-AR3L consistently outperforms prior VLM-based reward learning methods.

[30] arXiv:2607.00530 [pdf, html, other]
Title: From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping
Comments: 8 pages
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Improvements in the technical performance of human-robot interaction (HRI) systems do not automatically translate into differences that human users can detect during live interaction. This paper investigates whether a 15 percentage point gain in end-to-end task success (from 75% in a multimodal baseline system to 90% in an improved configuration identified through a prior ablation study) is sufficient to produce consistent and measurable differences in user perception. The baseline system combines Whisper for speech recognition, Florence-2 for open-vocabulary object detection, LLaMA 3.1 for action extraction, and an interval Type-2 fuzzy logic controller for motion execution. The improved configuration replaces the perception and language modules with Grounding DINO + SAM and Qwen 3.5 9B, respectively, while retaining the same controller. A within-subject user study with 24 participants compared both systems on the same tabletop object-grasping task. After interacting with each configuration, participants rated perceived speed, reliability, and overall competence and fluency on a 7-point Likert scale. Results show that 17 out of 24 participants (70.83%) preferred the improved system (exact binomial test, p = 0.043, h = 0.43), and all three perceptual constructs were rated significantly higher for the improved configuration after Holm correction, with large to very large effect sizes (p < 0.001). These findings confirm that the identified technical improvements are perceptible to users in direct interaction and underscore the importance of complementing benchmark evaluation with user-centred evidence when assessing robotic manipulation pipelines.
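The preference share and effect size quoted in this abstract can be reproduced with a few lines of arithmetic. The sketch assumes that the reported h is Cohen's h, computed against a 50% null preference; the abstract does not name the statistic, and the function name `cohens_h` is introduced here for illustration only.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


preferred, total = 17, 24           # counts taken from the abstract
share = preferred / total           # fraction preferring the improved system
effect = cohens_h(share, 0.5)       # assumed null: no preference (50%)

print(f"share = {share:.2%}, h = {effect:.2f}")
```

Under these assumptions the computed values round to the 70.83% and 0.43 stated in the abstract.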

[31] arXiv:2607.00534 [pdf, html, other]
Title: Learning from Demonstration via Spatiotemporal Tubes for Unknown Euler-Lagrange Systems
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

We present STT-LfD, a unified Learning from Demonstration (LfD) framework that integrates motion learning with control for unknown Euler-Lagrange systems. Unlike traditional decoupled approaches that track a fixed reference, the proposed method treats demonstrations as a data-driven safety specification. Using heteroscedastic Gaussian Processes, STT-LfD learns Spatiotemporal Tubes (STTs) as an intent envelope that captures the time-varying precision requirements of a task. A closed-form feedback controller then enforces these learned constraints while respecting actuator limits, without requiring explicit system identification. The approach preserves the temporal structure of demonstrations and remains computationally efficient. Hardware experiments on a mobile robot and a 7-DOF manipulator show that it outperforms baselines in robustness to disturbances and computational speed.

[32] arXiv:2607.00569 [pdf, html, other]
Title: [Preprint] Dynamic Modeling, Gait Synthesis, and Control of a Novel Subsurface Bore Propagator
Comments: 8 pages
Subjects: Robotics (cs.RO)

In this article, we present dynamic modeling, gait synthesis, and feedback control design for a novel modular subsurface robot, designed for human-free subsurface exploration and excavation. The subsurface propagator design is based on two major aspects: 1) anchor-and-propel movement like an earthworm and 2) excavation similar to tunnel boring machines. This design is decoupled into five separate modules: one drill head to excavate and create a cavity for propagation, two modules to anchor the robot, and two modules to enable propagation of the body. In order to design a controller for each of the modules, dynamic models using the Euler-Lagrange framework are developed. These mathematical models are used as a baseline to design controlled decoupled operation of the different joint movements. The operation of the robotic assembly is constructed via a centralized state machine for gait synthesis with integration of the designed feedback controller. The controllers are tested on the real robot geometry to aid sim-to-real integration: a physics-based Unity simulation using a CAD model of the robot and integration of the trained controller via ROS verifies the performance of the robot. The experimental results demonstrate that the proposed design, controllers and the gait synthesis strategy together are capable of anchoring the robot in place and creating a total advancement of 30 mm into the soil after completing 3 gait cycles.

[33] arXiv:2607.00571 [pdf, other]
Title: Enhancing Robustness in Robot-Environment Interactions through Passive Compliant Degrees of Freedom: A Hybrid Position-Force Control Approach with Feedback Linearization
Subjects: Robotics (cs.RO)

Robot-environment interactions in dynamic or unstructured settings are often degraded by impact shocks, vibrations, and uncertainties in contact geometry and mechanical properties. This paper proposes an interaction architecture that combines feedback-linearized hybrid position-force control with a passive compliant degree of freedom embedded at the end-effector. Unlike conventional hybrid position-force control, which relies mainly on active feedback, force sensing, and gain tuning, the proposed architecture uses a physical spring-damper interface to store and dissipate impact energy at the contact point before high-frequency shocks propagate to the actuated joints and force-control loop. The approach is evaluated in MATLAB/Simulink on a 2-DOF planar manipulator with three end-effector configurations: rigid, spring-only, and spring-damper. Results under fixed and time-varying interaction conditions show that the spring-damper configuration provides stronger attenuation of contact-induced oscillations, lower force and velocity error variance, and smoother joint-torque response. Representative reductions include 36.5% in fixed-environment tangential force-error standard deviation, 25.4% in variable-environment normal force-error standard deviation, and 41.1% in variable-environment normal velocity-error standard deviation.

[34] arXiv:2607.00591 [pdf, html, other]
Title: From Real-Time Planning to Reliable Execution: Scalable Coordination for Heterogeneous Multi-Robot Fleets in Industrial Environments
Comments: 11 pages, 9 figures
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)

With the increasing deployment of heterogeneous robot fleets in industrial environments, efficient coordination remains a critical challenge. Real-time path planning must simultaneously accommodate high robot densities and heterogeneous motion capabilities, while communication delays, execution uncertainties, and other disturbances may cause robots to deviate from the temporal assumptions underlying planned paths. Such deviations can lead to excessive waiting and congestion propagation across the fleet. This paper presents SCALE, a reactive online coordination framework that enables real-time planning while maintaining robust execution. Within this framework, we introduce a motion-induced conflict reduction mechanism to support the online generation of feasible paths for online conflict resolution. To mitigate the effects of disturbances, we further design a generalized Conjugate Action-Precedence Hypergraph (CAPH) that adaptively adjusts precedence relations among robots. Extensive validation experiments, together with a three-day deployment in a warehouse, demonstrate the

[35] arXiv:2607.00666 [pdf, html, other]
Title: Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts
Comments: ECCV 2026. Project page: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e). Adapting these models to the shifted environment (i.e., target domain) often requires training on multiple demonstrations for each task, which are costly to collect. To reduce the burden of data curation and training, we propose an analogy-based method that adapts VLA models under environmental shifts through weight vector arithmetic with domain-specific information addition, named Domain ARiThmetic (DART). Unlike prior approaches, DART requires collecting only a single demonstration, enabling efficient adaptation. To accurately isolate domain-specific information for addition, DART performs subspace alignment between singular components in weight vectors to filter out noisy components. In both simulated and real-world experiments, DART outperforms existing VLA adaptation methods in one-shot scenarios across diverse visual and embodiment shifts. Code is available at this https URL.

[36] arXiv:2607.00673 [pdf, html, other]
Title: Path Planning in Physically Viable World Models
Comments: 18 pages, 7 figures, submitted to CoRL
Subjects: Robotics (cs.RO)

Robots deployed in unstructured outdoor environments often plan from scene reconstructions collected before deployment because operators cannot remap large or remote sites before every mission. As a result, robots must make long-horizon planning decisions using stale maps that assume the terrain remains unchanged, even though physical changes to the environment may render previously feasible routes unsafe or unreachable at execution time. We present a physically viable world model for evaluating what-if queries for robot navigation under future terrain change. The system augments reconstructed 3D Gaussian splat scenes with physics-based simulation to generate physically modified versions of the same environment without recollecting sensor data or rebuilding the map. We then implement a terrain-aware planner that accounts for physical events, obstacles, and deformations that are simulated by the world model. This allows robots and human operators to evaluate whether planned routes remain feasible before committing to a planned route, particularly in constrained environments where retreat or recovery may become impossible once conditions change. We evaluate the system on a real outdoor field site in Central Texas using simulated flooding across multiple severity levels. We measure route and mission feasibility as terrain conditions deteriorate under physically simulated interventions. Our results show that physically viable world models expose long-horizon route failures and rerouting behavior that are not apparent when planning only on the original reconstructed environment, allowing robots to evaluate how future terrain changes may affect route feasibility before deployment.

[37] arXiv:2607.00776 [pdf, html, other]
Title: From Prediction Uncertainty to Conformalized Distance Fields for Safe Motion Planning
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Safe motion planning in dynamic environments requires reasoning about the uncertainty in predicted obstacle motion without sacrificing real-time performance. Existing conformal approaches conformalize a scalar score that aggregates per-obstacle prediction errors, losing spatial coherence and scaling poorly with scene density. We instead conformalize the entire predicted distance field at once. This functional conformal prediction (FCP) framework yields a distribution-free, field-level lower bound, from which safety follows uniformly: any trajectory satisfying the resulting constraint is certified safe, independent of how the control space is sampled. The key enabler is that the residual distance field is empirically low-rank and approximately time-invariant, which makes the bound decomposable in coefficient space. An envelope is fitted offline via functional PCA and a Gaussian-mixture inductive conformal procedure, then refined online by a lightweight adaptive functional conformal (AFCP) update on a low-dimensional vector. This keeps the per-step cost largely insensitive to obstacle count and retains long-run field coverage under distribution shift. We embed the envelope as a tightened safety constraint in a sampling-based model predictive controller, FCP-MPC. On the ETH-UCY pedestrian benchmarks and a dense 3D quadrotor task with up to 280 dynamic obstacles, FCP-MPC attains a favorable balance of safety, feasibility, and efficiency, reaching goals where pointwise and egocentric conformal baselines become too conservative or too expensive, while keeping per-step computation far below online uncertainty-reasoning baselines.

[38] arXiv:2607.00836 [pdf, html, other]
Title: From World Models to World Action Models: A Concise Tutorial for Robotics
Comments: Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

World models are increasingly used in embodied intelligence and generative simulation, yet their scope remains ambiguous across communities. This tutorial presents a design-space view of world models as action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. We categorize existing methods into observation-space and state-space world models, comparing their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. We further introduce world action models, which connect predicted futures with executable robot actions, and summarize four representative paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. The goal of this tutorial is to clarify the conceptual scope of world (action) models and provide a structured taxonomy for embodied prediction and control.

[39] arXiv:2607.00874 [pdf, html, other]
Title: Beyond Line of Sight: Hybrid Validation of V2X Collective Perception in Complex Scenarios
Comments: 6 pages, 4 figures, to be presented in ITS World 2026
Subjects: Robotics (cs.RO)

This paper introduces a probabilistic framework and hybrid validation methodology for V2X-enabled Collective Perception (CP) in complex traffic scenarios. The proposed Bayesian fusion algorithm extends the perceptual horizon of connected and autonomous vehicles by integrating heterogeneous sensor observations from multiple agents into a shared probabilistic occupancy grid. Each cell of this grid encapsulates both occupancy likelihood and uncertainty, enabling explainable and trustworthy situational awareness beyond the ego vehicle's field of view. To bridge the gap between simulation and real-world evaluation, a hybrid testing framework is developed, combining CARLA-based virtual environments with vehicle-in-the-loop experimentation. Experimental results in a roundabout scenario demonstrate a 260 percent increase in field-of-view coverage and a rise in occupied-cell recall from 0.82 (ego-only) to 0.94 (six-agent CP) under nominal localization conditions. Overall, the proposed approach provides a reproducible and interpretable foundation for validating CP systems, supporting the safe and certifiable deployment of cooperative autonomous vehicles.
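One textbook way to fuse several agents' observations of a single occupancy-grid cell is log-odds Bayesian fusion, sketched below. This is not necessarily the fusion rule used in the paper: the independence assumption between agents, the uniform prior, the use of binary entropy as the uncertainty measure, and the function names `fuse_cell` and `binary_entropy` are all assumptions for illustration.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def fuse_cell(probabilities, prior: float = 0.5) -> float:
    """Fuse per-agent occupancy probabilities for one cell in log-odds form.

    Assumes conditionally independent observations (an assumption of this
    sketch). Inputs must lie strictly between 0 and 1.
    """
    log_odds = logit(prior) + sum(logit(p) - logit(prior) for p in probabilities)
    return 1.0 / (1.0 + math.exp(-log_odds))


def binary_entropy(p: float) -> float:
    """Uncertainty of a cell in bits: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Two agents each report 0.7 occupancy for the same cell (made-up values).
fused = fuse_cell([0.7, 0.7])
uncertainty = binary_entropy(fused)
```

In this sketch, agreeing observations push the fused probability away from 0.5 and lower the entropy, while conflicting ones (for example 0.7 and 0.3) cancel and leave the cell near the prior.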

[40] arXiv:2607.01029 [pdf, html, other]
Title: AMBUSH: Collaborative Capture in Complex Environments with Neural Acceleration
Subjects: Robotics (cs.RO)

Collaborative capture of dynamic targets is common in nature as an essential strategy for weaker species against the strong. Similar concepts have been shown to be useful for numerous robotic applications, such as security, surveillance, and search and rescue. However, most existing works focus on analytical and geometric solutions or end-to-end reinforcement learning methods, which are largely constrained to obstacle-free environments or scenarios with sparse, regularly distributed obstacles. This work tackles the problem from a unique perspective: the renowned strategy of "ambush" alone would suffice for multiple slower pursuers to efficiently capture one faster evader with different levels of intelligence in complex environments. A parameterized strategy of ambush (including discrete and continuous parameters) is designed first, which takes into account the topological properties of the workspace, the truncated line-of-sight visibility, the relative speed ratio and the limited capture range. Then, a Hybrid Monte Carlo Tree Search (H-MCTS) algorithm is proposed to optimize the associated parameters through long-term planning, enabling the identification of highly promising parameters for future capture. Lastly, the neural acceleration is trained offline to learn the ranking of different choices of parameters across various environments, and to directly predict scores, replacing the rollout process in H-MCTS. The neural acceleration is adopted during online H-MCTS to accelerate the planning procedure while guaranteeing the planning quality. Its efficiency and effectiveness are validated in extensive simulations and hardware experiments, against evaders with different capabilities and intelligence levels, including a twofold speed advantage and human-controlled behavior.

[41] arXiv:2607.01043 [pdf, html, other]
Title: DART-VLN: Test-Time Memory Decay and Anti-Loop Regularization for Discrete Vision-Language Navigation
Comments: Accepted by the 2026 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2026). Camera-ready version
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Memory-based discrete vision-language navigation (VLN) agents must act under partial observability, yet even strong frozen backbones remain vulnerable at test time. Two common failure modes are stale historical evidence at memory readout and inefficient local backtracking during action selection. We present DART-VLN, a training-free test-time control framework for discrete VLN. DART-VLN combines Test-Time Memory Decay, a read-side memory reweighting rule that suppresses stale and redundant evidence without rewriting stored content, with Anti-Loop Regularization, a lightweight next-hop penalty that discourages immediate reversals during action selection. The framework introduces no new learnable parameters and leaves the learned backbone unchanged. Experiments on R2R and REVERIE show a consistent pattern: decay-only provides stable read-side gains, while decay+anti-loop achieves the best overall quality-efficiency trade-off, yielding shorter trajectories, lower runtime, and improved navigation performance in key settings. Behavioral analysis further confirms that anti-loop regularization reduces local backtracking and improves path efficiency under frozen backbones. Overall, the results show that modest test-time control can make memory-based discrete VLN more reliable and efficient without retraining.
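
The two test-time rules described above can be sketched as follows. This is a hypothetical illustration only: the exponential decay form, the additive penalty, and every name and default value are assumptions, since the abstract does not give the exact rules. Redundancy suppression is not modeled here.

```python
import math

def decayed_read_weights(similarities, ages, decay_rate=0.3):
    """Reweight memory entries at read time without modifying them.

    similarities: non-negative relevance score per stored entry.
    ages: number of steps since each entry was written.
    Returns normalized weights; older entries are exponentially suppressed.
    """
    raw = [s * math.exp(-decay_rate * a) for s, a in zip(similarities, ages)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

def select_next_hop(candidate_scores, previous_node, loop_penalty=1.0):
    """Pick the best next node after penalizing an immediate reversal.

    candidate_scores: dict mapping node id to the frozen backbone's score.
    previous_node: the node the agent just came from, or None.
    """
    adjusted = {
        node: score - (loop_penalty if node == previous_node else 0.0)
        for node, score in candidate_scores.items()
    }
    return max(adjusted, key=adjusted.get)

weights = decayed_read_weights([1.0, 1.0], ages=[0, 5])
choice = select_next_hop({"A": 0.9, "B": 0.6}, previous_node="A")
print(weights, choice)
```

Both functions act only on scores at inference time, which matches the abstract's claim that no learnable parameters are added and the backbone is left unchanged.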

[42] arXiv:2607.01044 [pdf, html, other]
Title: Robots Ask the Way: Communication-Enabled Social Navigation
Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Subjects: Robotics (cs.RO)

Assistive autonomous robots operating in multi-agent environments require efficient strategies to locate specific individuals among multiple residents. Current social navigation methods focus on reactive collision avoidance and trajectory adaptation, but lack mechanisms to proactively gather information through human-robot communication.
We introduce Communication-enabled Social Navigation (CommNav). In this novel task, robotic agents actively seek assistance from residents to locate target individuals by requesting information about recent sightings, locations, and movements.
To evaluate CommNav, we extend Habitat 3.0 to create Habitat 3.0c, a communication-enabled variant supporting multi-human environments with information exchange protocols. Adding our communication module (COMM) to a state-of-the-art social navigation model yields a 10 percentage-point improvement in Episode Success. We further investigate the transition from structured data to natural language by evaluating models trained on LLM-generated instructions and on colloquial instructions collected from a human study.
Our experiments reveal that: (i) explicit human-robot communication substantially enhances multi-person navigation performance; (ii) pre-training COMM on a communication pretext task effectively addresses the challenge of occasional interaction signals; and (iii) the navigation policy is highly robust to natural, colloquial human language, achieving an episode success statistically similar to the model using perfect structured data.

[43] arXiv:2607.01051 [pdf, html, other]
Title: AutoSpeed: Annotation-Free Stage-Adaptive Motion Speed Learning for Robot Manipulation
Comments: Accepted by ECCV 2026
Subjects: Robotics (cs.RO)

Different stages of manipulation tasks exhibit varying levels of difficulty, suggesting stage-dependent motion speeds and temporal prediction horizons. However, existing IL-based visuomotor policies typically imitate the execution speed of expert demonstrations and operate with a fixed temporal prediction horizon, limiting flexibility and overall task throughput. In this paper, we introduce AutoSpeed, a model-agnostic learning framework that enables existing visuomotor policies to predict trajectories with stage-adaptive motion speeds, without requiring speed or stage annotations. We treat future trajectories at different speeds as candidate optimization targets, evaluate each candidate using a composite cost that trades off prediction error against prediction horizon, and optimize the policy toward the minimum-cost candidate. With a fixed-length action sequence, speed modulation adjusts the effective temporal prediction horizon: simple stages are executed faster with a longer prediction horizon, whereas complex stages are executed more slowly with a shorter prediction horizon. Specifically, we implement speed modulation in the frequency domain via the discrete cosine transform (DCT), which enables smooth, non-integer speed scaling and thus preserves motion continuity. Extensive evaluations show that AutoSpeed substantially reduces task execution time while also improving success rates. Under the AutoSpeed framework, the inferred motion speeds exhibit a strong correspondence with task stages.
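
One way to obtain smooth, non-integer speed scaling from a DCT is to treat the inverse transform as a continuous function of time and sample it at a scaled step. The sketch below does this for a single action dimension. It is an assumption about how such modulation could work, not AutoSpeed's implementation, and the function names are hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II coefficients of a 1-D sequence."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n)
    ]

def evaluate(coeffs, t):
    """Evaluate the inverse DCT at a continuous time index t."""
    n = len(coeffs)
    value = coeffs[0] / n
    for k in range(1, n):
        value += (2.0 / n) * coeffs[k] * math.cos(math.pi * k * (t + 0.5) / n)
    return value

def rescale_speed(x, speed, out_len):
    """Resample x so that execution runs `speed` times faster.

    Output step m reads the source at time speed * m, clamped to the last
    source sample so the read position never runs past the data.
    """
    coeffs = dct2(x)
    last = len(x) - 1
    return [evaluate(coeffs, min(speed * m, last)) for m in range(out_len)]

trajectory = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
faster = rescale_speed(trajectory, speed=1.5, out_len=4)
print(faster)
```

With a fixed output length, a larger `speed` covers more of the source trajectory, which corresponds to the longer effective prediction horizon described in the abstract.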

[44] arXiv:2607.01060 [pdf, html, other]
Title: RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation
Comments: ICML 2026 F2S workshop
Subjects: Robotics (cs.RO)

Video world models are emerging as a scalable alternative for evaluating generalist robot policies, bypassing the physical constraints and engineering burdens of real-world deployment. However, evaluating policies with video world models remains challenging, as world-model errors can make generated rollouts unreliable and slow inference limits large-scale throughput. We introduce RoboWorld, an automated evaluation pipeline that pairs a fast autoregressive video world model with task-progress-aware vision-language model scoring. To enable reliable long-horizon autoregressive world-model rollouts, we propose Step Forcing, which combines anchored and one-step self-forwarded contexts to reduce train-test mismatch while preserving action-observation dynamics. Together, these components enable RoboWorld to align strongly with real-world robot evaluation across tasks and environments, achieving Pearson's r = 0.989 and Spearman's ρ = 0.970.
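
For readers unfamiliar with the two agreement statistics reported above, here is a small self-contained sketch of how Pearson's r and Spearman's rank correlation are computed between simulated and real-world scores. The example scores are made up for illustration and are not data from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based average ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-policy success rates (illustrative values only).
real_world = [0.10, 0.40, 0.50, 0.90]
world_model = [0.20, 0.30, 0.60, 0.80]
print(pearson(real_world, world_model), spearman(real_world, world_model))
```

Pearson's r measures linear agreement of the scores themselves, while Spearman's coefficient only checks whether the policies are ranked in the same order.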

[45] arXiv:2607.01067 [pdf, html, other]
Title: Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation
Comments: The first two authors contribute equally. Orders are decided by flipping a coin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

As an essential modality for dexterous and contact-rich tasks, tactile sensing provides precise force feedback that cannot be reliably inferred from vision. However, limited by hardware and data collection systems, existing datasets with tactility remain small in scale and narrow in contact coverage. Meanwhile, Vision-Language-Action (VLA) models with tactile modality are constrained on dynamics-agnostic post-training, which limits the performance ceiling on downstream tasks. In this paper, we present H-Tac, a large-scale tactile-action dataset with 160-hour egocentric human videos containing more than 300 tasks and 135k episodes. Building upon this, we propose Transferable Tactile Pre-Training (TTP), a system of tactile-based pre-training on human data for fine-grained robotic tasks. To bridge the gap between humans and robots, we use unified tactile and action spaces throughout the pre-training and post-training phases, preserving prior knowledge during human-to-robot transfer. By leveraging a tactile expert for future tactile prediction, our framework explicitly models the contact dynamics and precise physical interactions. Extensive experiments in simulation and on real robots demonstrate that our model achieves superior performance, exhibiting robust generalization and fine-grained manipulation capabilities. TTP paves the way for scalable tactile pre-training via human-to-robot transfer.

[46] arXiv:2607.01079 [pdf, html, other]
Title: Where Am I? Semantic Map Grounding via Vision-Language Models for Multi-Modal Localization
Subjects: Robotics (cs.RO)

We address robot localization in GPS-denied indoor environments by reframing it as a semantic reasoning task rather than a geometric estimation problem. Motivated by how humans localize using object-level cues and labeled maps, we ask whether a vision-language model, given a front camera image, a polar LiDAR scan, and a top-down semantic grid map, can infer the robot pose. We fine-tune Qwen2.5-VL-7B with LoRA and attach a lightweight regression head that predicts continuous pose coordinates (x, y, theta) directly from the final hidden state, bypassing text generation. Training uses a composite position-and-direction loss with curriculum learning on a custom Gazebo dataset of 120,112 samples and 527 scenes. On the in-distribution test set of 18,017 samples, the model achieves 98.23 percent position accuracy, 98.00 percent direction accuracy, 96.75 percent full pose accuracy, a mean position error of 0.11 m, and a mean orientation error of 5.7 degrees at 0.62 s per sample. Position accuracy drops by only 7.2 percentage points on seven unseen object categories, reaching 90.99 percent, supporting semantic spatial reasoning rather than appearance memorization. With incomplete maps, fine-tuning recovers performance to 93.72 percent position accuracy, showing adaptability to stale or partial map information. Two ablations highlight cross-modal complementarity. Without LiDAR, using only camera and map inputs, position accuracy remains 95.06 percent, only 3.2 percentage points below the full system. However, when the camera sees no visible objects in a wall-facing view, LiDAR sustains 92.33 percent position accuracy, compared with 70.74 percent when neither LiDAR nor visible objects are available. This shows that LiDAR becomes the primary localization signal when camera semantics are unavailable and provides a reliable fallback under occlusion or sparse layouts.

[47] arXiv:2607.01088 [pdf, html, other]
Title: ROSA: A Robotics Foundation Model Serving System for Robot Factories
Subjects: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC)

Robotics foundation models (RFMs) are making general-purpose robots increasingly practical for factory deployments. While RFM serving systems are central to this vision, existing systems are largely shaped by a single-robot, single-model assumption: inference is treated as an edge-computing problem handled by an on-robot or dedicated nearby GPU, and the serving objective is to minimize the latency of a single action model. In this paper, we propose ROSA, an RFM serving system for robot factories designed around three key principles. First, ROSA adopts shared GPU-pool serving, allowing a fleet of robots to access powerful server-class GPUs over the network in order to improve inference performance, battery duration, and GPU utilization. Second, ROSA provides a robotics-aware programming abstraction and system design that supports multi-model pipelines, per-task performance requirements, and failure handling. Third, ROSA uses factory-objective-driven scheduling to maximize SLO-qualified factory productivity rather than minimizing individual request latency. We implement ROSA on top of Ray Serve for distributed orchestration, with vLLM, PyTorch, and JAX as model-serving backends, and evaluate it on both real robots and synthetic large-scale workloads. The results show that ROSA improves factory productivity by up to 12.06x over conventional dedicated serving systems.

[48] arXiv:2607.01106 [pdf, html, other]
Title: Technical Report: Asynchronous Distributed Trajectory Estimation of Multi-Robot Systems
Comments: 13 pages, 3 figures
Subjects: Robotics (cs.RO)

Distributed trajectory estimation arises in many applications across robotics, but existing implementations typically do not consider asynchrony in agents' communications and computations. Therefore, we propose an asynchronous block coordinate descent algorithm for distributed trajectory estimation. We consider a team of agents that observes a team of robots and estimates their states over a sliding window. The agents solve an approximation of the maximum a posteriori estimation problem, which we derive. We show this approximation introduces negligible errors and eliminates up to 96.9% of communications among agents. Next, we prove that agents' iterates converge exponentially fast to the optimal estimate of the robots' states. Simulations show that this approach has up to 64% less error than a comparable state-of-the-art algorithm. Experiments on mobile robots show the robustness of this approach to delays whose lengths span three orders of magnitude.

[49] arXiv:2607.01111 [pdf, html, other]
Title: FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Robot policies inevitably encounter failures when deployed in real environments. Naive retries often repeat the same mistakes, while many existing recovery methods rely on human intervention. In this paper, we propose Failure-Aware Retry (FAR), a framework that enables robots to learn from previous failures at test time, adapt their behavior accordingly, and eventually complete the task autonomously. FAR combines Failure-Contrastive Preference Adaptation, which constructs preference learning data from failures to steer the policy away from previously unsuccessful behaviors, with lightweight action perturbations during retries to encourage local exploration. We further incorporate successful recovery trajectories into a training loop for continual policy improvement. Experiments in both simulation and real-world manipulation tasks show that FAR substantially improves success rates and robustness, with average gains of 17.6% over the standard diffusion policy in simulation and 11.7% in the real world. In addition, FAR significantly improves data efficiency under both reset and timestep budgets during continual policy improvement by exploiting informative failure cases.

[50] arXiv:2607.01166 [pdf, html, other]
Title: Structured 4D Latent Predictive Model for Robot Planning
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at this https URL.

[51] arXiv:2607.01200 [pdf, html, other]
Title: FastBridge: Closing the Model-Based Realization Gap in Safety Filters on 3D Gaussian Splatting for Fast Quadrotor Flight
Comments: preprint, 9 pages, 4 figures
Subjects: Robotics (cs.RO)

Fast quadrotor flight requires safe obstacle avoidance under tight onboard compute limits. While 3D Gaussian Splatting (3DGS) provides a continuous, geometry-aware scene representation for perception-driven navigation, existing 3DGS safety filters use reduced-order models such as single- and double-integrators that ignore actuator limits and assume commanded accelerations are realized instantaneously. Building on an analytic collision cone barrier for 3DGS, we introduce a nonlinear, actuator-aware safety filter enforced through the full quadrotor dynamics. We derive a high-relative-degree collision cone exponential CBF and a backup CBF that preserves QP feasibility under input constraints using a forward-simulated backup policy. Compared with a state-of-the-art 3DGS safety filter, our approach reduces trajectory jerk by 47% and runs 2.25 times faster. We validate the method in simulation and on hardware for real-time navigation in cluttered, perception-derived environments.
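
To make the contrast concrete, the sketch below shows the kind of reduced-order safety filter the abstract says existing approaches rely on: a single-integrator control barrier function (CBF) filter around one circular obstacle, solved in closed form. It is not the paper's actuator-aware filter. The obstacle model, the names, and the `alpha` gain are assumptions for illustration.

```python
def cbf_filter(position, obstacle, radius, u_nom, alpha=1.0):
    """Minimally modify u_nom so a single integrator stays outside a disc.

    Barrier: h(x) = ||x - c||^2 - r^2, with dynamics x_dot = u.
    Safety condition: dh/dt + alpha * h >= 0, i.e. a . u + b >= 0
    with a = 2 (x - c) and b = alpha * h.
    The QP  min ||u - u_nom||^2  s.t.  a . u + b >= 0  has a closed form.
    """
    a = [2.0 * (p - c) for p, c in zip(position, obstacle)]
    h = sum((p - c) ** 2 for p, c in zip(position, obstacle)) - radius ** 2
    b = alpha * h
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("position coincides with the obstacle center")
    scale = -slack / norm_sq  # project onto the constraint boundary
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]

# Heading straight at the obstacle: the filter slows the approach.
print(cbf_filter((2.0, 0.0), (0.0, 0.0), 1.0, (-5.0, 0.0)))
```

Note that the filtered velocity is assumed to be realized instantly and without bound, which is exactly the modeling gap that motivates enforcing the barrier through the full quadrotor dynamics.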

[52] arXiv:2607.01201 [pdf, html, other]
Title: Sensorless Four-Channel Control Architecture Using Inverse Dynamics Modeling for Human-Scale Bilateral Teleoperation
Subjects: Robotics (cs.RO)

The four-channel teleoperation architecture is a well-established framework for achieving transparency in bilateral systems. However, its performance in human-scale teleoperation is limited by high inertia, modeling challenges, and reliance on noisy and costly force/torque sensors. This paper introduces a sensorless four-channel architecture based on inverse dynamics modeling. The controller is implemented and validated on a customized WAM bilateral teleoperation setup. Experiments demonstrate that the proposed approach outperforms conventional two- and four-channel schemes as well as transparency-enhancement methods, improving position and force tracking, reducing operator effort, and increasing maximum transmittable impedance without external sensors. A door-opening case study involving sustained whole-body contact along the manipulator further demonstrates the effectiveness of the method in realistic human-scale manipulation tasks.

[53] arXiv:2607.01212 [pdf, html, other]
Title: FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model
Comments: Project Page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only a 16% drop on the hardest task.

Cross submissions (showing 9 of 9 entries)

[54] arXiv:2607.00064 (cross-list from cs.AI) [pdf, html, other]
Title: Solution space path planning for supporting en-route air traffic control
Comments: 37 pages, 16 figures
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

As technology advances, many path-planning algorithms have been proposed for Air Traffic Management, yet their operational adoption in tactical control remains limited, revealing a misalignment between algorithmic design priorities and air traffic controllers' needs. This underscores the need for decision-support solutions that are inherently interpretable, computationally efficient, and explicitly designed for human use. Focusing on this design challenge, this study develops a conflict-free path-planning algorithm for en-route Air Traffic Control (ATC) designed to be compatible with two guiding considerations: (1) the interpretability and flexibility offered by solution-space displays, which motivate constructing an algorithm that exposes all feasible safe actions and accommodates shifting optimization goals; and (2) the decision logic controllers naturally apply when enforcing operational constraints, such as separation standards, maneuverability limits, waypoint minimization, and routing practicality. Centered on these principles, the algorithm integrates three intent-based conflict detection methods -- distance-based, time-interval-based, and zone-based -- within a solution-space framework to identify conflict-free paths in computationally efficient ways. Additionally, vertex-based and edge-based search nodes are proposed for solution space path planning (SSPP), resulting in two variants, SSPPV and SSPPE, respectively, which are evaluated in terms of computational speed and solution quality. Empirical results show that SSPPV paired with zone-based conflict detection achieves the best performance, computing paths in 3.69 ms on average in operationally relevant scenarios based on the Delta sector of the Maastricht Upper Area Control Centre (MUAC) using a 5 nmi grid.

[55] arXiv:2607.00221 (cross-list from cs.CG) [pdf, other]
Title: Guaranteed Escape for a Bouncing Robot in Pipe Chains
Comments: Accepted into CCCG2026
Subjects: Computational Geometry (cs.CG); Robotics (cs.RO)

We study the symmetric bouncing of a point robot within orthogonally-joined rectangles with equal width, which we refer to as pipes. We provide an exhaustive case analysis of every trajectory pattern inside a single rectangular pipe segment, identifying the conditions under which the robot exits. We then extend the analysis to L-shaped pipes and, more generally, to linear chains of $k$ orthogonally connected pipe segments. We prove exit guarantees for the special angle $\alpha = \pi/4$. Furthermore, these results extend to pipes with curved joints.

[56] arXiv:2607.00302 (cross-list from cs.CV) [pdf, html, other]
Title: Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs
Comments: ECCV 2026, Project page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)

Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Recent efforts to equip multimodal LLMs with this tactile sense, however, expose a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving the established vision-language reasoning. We present Splash, a mask-isolated tactile alignment learning framework for MLLMs. Splash quantifies the significance of each pretrained parameter, and partitions the parameter space into a dormant and a critical subspace. While the frozen critical subspace acts as a stable anchor to safeguard general visual knowledge, Splash updates the isolated dormant subspace to internalize tactile alignment towards LLMs. This selective update effectively prevents catastrophic forgetting and keeps the modality expansion non-destructive. Extensive experiments show that Splash effectively achieves tactile reasoning without additional inference overhead in the LLM part, demonstrating state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving its original general-purpose capabilities.

[57] arXiv:2607.00678 (cross-list from cs.CV) [pdf, html, other]
Title: ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
Comments: Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.

[58] arXiv:2607.00710 (cross-list from cs.CV) [pdf, html, other]
Title: Creating Impactful Autonomous Driving Datasets: A Strategic Guide from Research Gap to Benchmark
Comments: Keywords: Autonomous Driving, Dataset Design, Benchmarks, Research Gap Identification. 14 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Well-designed autonomous driving datasets have fundamentally shaped research progress, yet existing literature primarily describes what datasets contain rather than how to strategically design impactful ones. This is especially limiting for small and medium-sized labs and startups that cannot afford to misallocate scarce resources. We argue that impactful dataset creation begins with a diagnosis: whether a research question is blocked by a data problem or an evaluation problem, and proceeds by selecting the minimal data operator(s) that closes the resulting gap, recording new data only when no cheaper operator(s) suffices. We analyze the evolution of major autonomous driving (AD) datasets through this lens and distill a strategic framework spanning gap identification, operator choice, sensor suite design, and annotation strategy. We ground the framework in a running case study of our KITScenes dataset family. The datasets are available at: this https URL

[59] arXiv:2607.00978 (cross-list from cs.CV) [pdf, html, other]
Title: Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.

[60] arXiv:2607.01008 (cross-list from eess.IV) [pdf, html, other]
Title: Image-Domain Tilt Constrained Distributed Fusion for Maneuvering UAV Tracking with Multi-Camera Electro-Optical Observations
Comments: 24 pages, 20 figures
Subjects: Image and Video Processing (eess.IV); Robotics (cs.RO); Signal Processing (eess.SP)

Short-horizon prediction is essential for electro-optical UAV tracking, especially when the target is small, maneuvering, or intermittently observed. Image center, line-of-sight, and range measurements provide direct constraints on target position, but their constraints on acceleration are weak. As a result, prediction can lag during aggressive maneuvers.
This paper proposes an image-domain tilt constrained distributed fusion method for maneuvering UAV tracking. The method uses the apparent roll and pitch of a rotorcraft target in the image as low-level maneuver cues. A weak-prior auto-labeling pipeline first generates oriented bounding box and image-domain tilt labels from synchronized video, gimbal IMU, and UAV IMU data. A YOLO-OBB detector is then trained to provide online target position and tilt measurements. The front-end Python implementation is publicly available at this http URL.
In the fusion stage, the UAV state is modeled by position, velocity, and acceleration. Image-domain roll and pitch are introduced as acceleration-related pseudo-observations. For distributed tracking, one mobile gimbal camera and two fixed ground cameras are fused asynchronously. Camera attitude error states are augmented into the filter to absorb extrinsic drift and cross-camera systematic inconsistency. A Mahalanobis gate with time-since-last-valid covariance widening is used to reject false detections and handle dropouts.
In simulation, adding roll/pitch observations reduces the prediction RMSE from 1.991 m to 0.821 m and decreases the cumulative prediction error by 60.75%. In real distributed experiments, a self-consistency evaluation shows an 18.10% reduction in cumulative prediction error. The results show that image-domain tilt can provide useful acceleration constraints for robust short-horizon UAV prediction.
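
The gating step mentioned above can be illustrated with a short sketch of a Mahalanobis gate whose covariance widens with the time since the last valid detection. The linear widening rule, the `widen_rate` value, and the 2-D measurement are assumptions for illustration. The abstract does not specify these details.

```python
def mahalanobis_sq_2d(residual, cov):
    """Squared Mahalanobis distance for a 2-D residual and 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    rx, ry = residual
    # Apply the inverse of the 2x2 covariance to the residual.
    ix = (d * rx - b * ry) / det
    iy = (-c * rx + a * ry) / det
    return rx * ix + ry * iy

def gate(residual, innovation_cov, time_since_valid, widen_rate=4.0,
         threshold=9.21):
    """Accept a detection if it falls inside the widened gate.

    The innovation covariance is inflated on its diagonal by
    widen_rate * time_since_valid, so the gate opens up after a dropout.
    threshold=9.21 is roughly the 99% chi-square quantile for 2 degrees
    of freedom.
    """
    (a, b), (c, d) = innovation_cov
    extra = widen_rate * time_since_valid
    widened = ((a + extra, b), (c, d + extra))
    return mahalanobis_sq_2d(residual, widened) <= threshold

identity = ((1.0, 0.0), (0.0, 1.0))
print(gate((4.0, 0.0), identity, time_since_valid=0.0))
print(gate((4.0, 0.0), identity, time_since_valid=0.5))
```

The same residual is rejected right after a valid update but accepted after a dropout, because the predicted state is less certain the longer the target has gone unobserved.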

[61] arXiv:2607.01133 (cross-list from cs.CV) [pdf, html, other]
Title: Towards Metric-Agnostic Trajectory Forecasting
Comments: ECCV 2026. Project page at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Accurate trajectory forecasting of surrounding traffic participants is a core capability for autonomous driving, enabling vehicles to anticipate behavior and plan safe maneuvers. We observe that current state-of-the-art forecasting models on Argoverse 2 and the Waymo Open Motion Dataset tailor their training objectives to the different benchmark metrics. Because these metrics encourage conflicting behavior, we propose a paradigm change for trajectory forecasting: training models with metric-agnostic probabilistic objectives and treating metric optimization as a downstream task applied to the predictive distribution. Concretely, we introduce Trajectory Distribution Evaluation (TraDiE) policies, metric-specific policies that map a predictive distribution to the set of $K$ trajectories and confidences required by trajectory forecasting metrics. We evaluate this framework by introducing DONUT-NLL, which adapts the training objective of the state-of-the-art trajectory forecasting model DONUT to directly optimize the predictive distribution. Using our policies, DONUT-NLL achieves state-of-the-art results on all metrics of the Waymo motion prediction benchmark.

[62] arXiv:2607.01203 (cross-list from eess.SY) [pdf, html, other]
Title: GPU-Parallel Linearization Error Bounds for Real-Time Robust Optimal Control of Nonlinear and Neural Network Dynamics
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)

This paper studies real-time robust optimal control for uncertain nonlinear systems, where linear time-varying (LTV) approximations make planning tractable but require sound linearization error bounds (LEBs) to guarantee robust constraint satisfaction. We develop tight, differentiable, GPU-parallel LEBs for LTV approximations of nonlinear and neural network (NN) dynamics. For analytic dynamics, we introduce path-based Hessian bounds that are tighter than standard interval methods. For NN dynamics, we derive certified LEBs using NN verifier-generated affine relaxations and local Jacobian corrections. We adapt a GPU-parallel system-level synthesis LTV-based robust control solver to be compatible with these LEBs by extending it to handle right-invertible disturbance matrices and non-zero-centered disturbance sets for tight zonotopic uncertainty propagation. Our method, GPUSLS-LEO, enables online optimization of robust feedback policies that account for linearization error, producing tight, formally verified reachable tubes. On complex nonlinear and NN dynamics up to 168 state dimensions, our method can compute robust control policies on the GPU at rates up to 67 Hz, reducing solve times and conservativeness relative to baselines while preserving formal guarantees and real-time performance.

Replacement submissions (showing 32 of 32 entries)

[63] arXiv:2505.20857 (replaced) [pdf, html, other]
Title: Multi-Embodiment Robotic Retargeting via Guided Diffusion Model
Subjects: Robotics (cs.RO)

Retargeting motions from existing motion datasets to a specific robot is a critical step in transferring motion patterns from human behaviors to and across various robots. However, inconsistencies in topological structure, geometrical parameters, and joint correspondence make it difficult to handle diverse embodiments with a unified retargeting architecture. In this work, we propose a novel unified graph-conditioned diffusion-based motion generation framework for retargeting reference motions across diverse embodiments. The intrinsic characteristics of heterogeneous embodiments are represented with a graph structure that effectively captures topological and geometrical features of different robots. Such a graph-based encoding further allows for knowledge exploitation at the joint level with a customized attention mechanism developed in this work. Because ground-truth motions for the desired embodiment are lacking, we utilize an energy-based guidance formulated as retargeting losses to train the diffusion model. As one of the first cross-embodiment motion retargeting methods in robotics, our experiments validate that the proposed model can retarget motions across heterogeneous embodiments in a unified manner. Moreover, it demonstrates a certain degree of generalization to both diverse skeletal structures and similar motion patterns.

[64] arXiv:2506.12851 (replaced) [pdf, html, other]
Title: KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills
Comments: NeurIPS 2025. Project Page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Humanoid robots hold promise for acquiring various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework that aims to master highly-dynamic human behaviors such as Kungfu and dancing through multi-step motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is this https URL.

[65] arXiv:2511.03591 (replaced) [pdf, html, other]
Title: Manifold-constrained Hamilton-Jacobi Reachability Learning for Decentralized Multi-Agent Motion Planning
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Safe multi-agent motion planning (MAMP) under task-induced constraints is a critical challenge in robotics. Many real-world scenarios require robots to navigate dynamic environments while adhering to manifold constraints imposed by tasks. For example, service robots must carry cups upright while avoiding collisions with humans or other robots. Despite recent advances in decentralized MAMP for high-dimensional systems, incorporating manifold constraints remains difficult. To address this, we propose a manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for decentralized MAMP. Our method solves HJR problems under manifold constraints to capture task-aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. This enables robots to generate motion plans that are both safe and task-feasible without requiring assumptions about other agents' policies. Our approach generalizes across diverse manifold-constrained tasks and scales effectively to high-dimensional multi-agent manipulation problems. Experiments show that our method outperforms existing constrained motion planners and operates at speeds suitable for real-world applications. Video demonstrations are available at this https URL.

[66] arXiv:2512.11173 (replaced) [pdf, html, other]
Title: Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance
Subjects: Robotics (cs.RO)

Precise positioning of a mobile manipulator's base is essential for the manipulation actions that follow. Most RGB-based navigation systems guarantee only coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator robot to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To evaluate this comprehensively, we introduce two metrics: an edge-alignment metric, which uses ground-truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 74.58% success in edge-alignment and 89.42% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at the category level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: this https URL

[67] arXiv:2603.10263 (replaced) [pdf, html, other]
Title: From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: this https URL.

[68] arXiv:2603.17720 (replaced) [pdf, html, other]
Title: VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning
Comments: Accepted to IROS 2026
Subjects: Robotics (cs.RO)

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code and videos are available on the project page: this https URL

[69] arXiv:2603.19124 (replaced) [pdf, html, other]
Title: Tendon-Actuated Robots with a Tapered, Flexible Polymer Backbone: Design, Fabrication, and Modeling
Subjects: Robotics (cs.RO)

This paper presents the design, modeling, and fabrication of 3D-printed, tendon-actuated continuum robots featuring a flexible, tapered backbone constructed from thermoplastic polyurethane (TPU). Our scalable design incorporates an integrated electronics base housing that enables direct tendon tension control and sensing via actuators and compression load cells. Unlike many continuum robots that are single-purpose and costly, the proposed design prioritizes customizability, rapid assembly, and low cost while enabling high curvature and enhanced distal compliance through geometric tapering, thereby supporting a broad range of compliant robotic inspection and manipulation tasks. We develop a generalized forward kinetostatic model of the tapered backbone based on Cosserat rod theory using a Newtonian approach, extending existing tendon-actuated Cosserat rod formulations to explicitly account for spatially varying backbone cross-sectional geometry. The model captures the graded stiffness profile induced by the tapering and enables systematic exploration of the configuration space as a function of the geometric design parameters. Specifically, we analyze how the backbone taper angle influences the robot's configuration space and manipulability. The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating Young's modulus via a line search that minimizes modeling error. We further demonstrate teleoperated grasping using an endoscopic gripper routed along the continuum robot, mounted on a 6-DoF robotic arm. Parameterized iLogic/CAD scripts are provided for rapid geometry generation and scaling. The presented framework establishes a simple, rapid, and reproducible pathway from parametric design to controlled tendon actuation for tapered, tendon-driven continuum robots manufactured using fused deposition modeling 3D printers.

[70] arXiv:2603.25418 (replaced) [pdf, html, other]
Title: Visualizing Impedance Control in Augmented Reality for Teleoperation: Design and User Evaluation
Comments: 6 pages, 5 figures. Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)
Subjects: Robotics (cs.RO)

Teleoperation for contact-rich manipulation remains challenging, especially when using low-cost, motion-only interfaces that provide no haptic feedback. Virtual reality controllers enable intuitive motion control but do not allow operators to directly perceive or regulate contact forces, limiting task performance. To address this, we propose an augmented reality (AR) visualization of the impedance controller's target pose and its displacement from each robot end effector. This visualization conveys the forces generated by the controller, providing operators with intuitive, real-time feedback without expensive haptic hardware. We evaluate the design in a dual-arm manipulation study with 17 participants who repeatedly reposition a box with and without the AR visualization. Results show that AR visualization reduces completion time by 24% for force-critical lifting tasks, with no significant effect on sliding tasks where precise force control is less critical. These findings indicate that making the impedance target visible through AR is a viable approach to improve human-robot interaction for contact-rich teleoperation.

[71] arXiv:2603.25623 (replaced) [pdf, html, other]
Title: Neural Surface and Reflectance Modelling from 3D Radar Data
Comments: Accepted for publication at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Subjects: Robotics (cs.RO)

Robust scene representation is essential for autonomous systems to safely operate in challenging low-visibility environments. In these conditions, radar has a clear advantage over cameras and lidars due to its resilience to environmental factors such as fog, smoke, or dust. However, radar data is inherently sparse and noisy, making reliable 3D surface reconstruction challenging. To address this, we propose a neural implicit approach for 3D mapping from radar point clouds that jointly models scene geometry and view-dependent radar intensities. Our method leverages a memory-efficient hybrid feature encoding to learn a continuous Signed Distance Field (SDF) for surface reconstruction, while also capturing radar-specific reflective properties. We show that our approach produces smoother, more accurate 3D surface reconstructions compared to existing lidar-based reconstruction methods applied to radar data and can reconstruct view-dependent radar intensities. We also show that, in general, as input point clouds get sparser, neural implicit representations render more faithful surfaces than traditional explicit SDFs and meshing techniques.

[72] arXiv:2605.26284 (replaced) [pdf, html, other]
Title: PhyPush: One Push is All You Need for Sensorless Physical Property Estimation with Physics-Guided Transformers
Comments: Accepted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems
Subjects: Robotics (cs.RO)

Accurately estimating object mass and friction is fundamental to reliable robotic manipulation. While interactive perception is powerful, most approaches rely on specialized hardware like force/torque sensors, limiting scalability. This paper introduces PhyPush, a physics-guided Transformer that estimates an object's mass and friction coefficient using only end-effector velocity from a single push, data readily available on standard robotic arms. By incorporating Newton's second law and the Coulomb friction model through a physics-guided loss, the model improves physical consistency and generalizes to unseen objects and surfaces. Across diverse setups, PhyPush consistently achieves highly accurate estimations in challenging out-of-domain conditions. In simulation, it reduces error by over 10% compared to a baseline with privileged force data, while in real-world experiments, it successfully zero-shot transfers from simulation to outperform a purely data-driven baseline.

[73] arXiv:2606.23531 (replaced) [pdf, html, other]
Title: BiliVLA: Scene-Aware Vision-Language-Action Model with Reinforcement Learning for Autonomous Biliary Endoscopic Navigation
Subjects: Robotics (cs.RO)

Endoscopic retrograde cholangiopancreatography (ERCP) demands precise endoscopic navigation and stable biliary cannulation within a narrow monocular field characterized by specular reflections, partial occlusions, and frequent tissue contact. Although recent robotic systems and vision-based assistance techniques improve operator ergonomics and provide perceptual cues, their performance degrades under pronounced anatomical variability and safety-critical visual artifacts, which hinders reliable autonomy in cannulation-grade procedures. Here, we present BiliVLA, a scene-aware Vision-Language-Action (VLA) framework that formulates biliary endoscopic navigation as an instruction-conditioned visuomotor learning problem. Given an endoscopic observation and a stage-specific language instruction, BiliVLA jointly predicts the target category, a grounded bounding box, and a discrete three degrees of freedom (DoF) motor command for a continuum endoscope. The proposed framework incorporates scene-aware supervision to enhance semantic target consistency and safety-aware recovery supervision to induce conservative retreat behaviors under luminal wall contact. A key component of BiliVLA is a two-stage training paradigm that combines grounding-enhanced supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves action reliability and decision consistency during closed-loop navigation. Across three ERCP subtasks, BiliVLA achieves an average action precision of 91.96% and an overall success rate (SR) of 84.85% in real-world phantom experiments. These results indicate that integrating semantic grounding, scene-aware learning, and reward-guided optimization improves perception-action alignment and enables robust autonomous endoscopic navigation.

[74] arXiv:2606.25111 (replaced) [pdf, html, other]
Title: ADM-Fusion: Adaptive Deep Multi-Sensor Fusion for Robust Ego-Motion Estimation in Diverse Conditions
Comments: 8 pages, 4 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Robust multi-sensor fusion is essential for reliable autonomy in diverse and degraded environments, where sensor reliability can fluctuate rapidly. Because different modalities fail in distinct ways, effective fusion should adaptively balance complementary cues rather than rely on fixed weighting. This adaptability is particularly important for ego-motion estimation, since accurate updates depend on the consistent integration of complementary sensor information. We propose ADM-Fusion, an end-to-end deep-learning-based multi-sensor fusion method designed to adapt to environmental changes and sensor degradation. ADM-Fusion employs an adaptive sensor mixture-of-experts framework with content-aware routing to dynamically assign weights to sensor inputs in real time. The system further incorporates separate translation and rotation branches, coupled through a cross-task attention mechanism to preserve task-specific specialization while enabling information sharing. ADM-Fusion is trained on the CARLA-LOC simulated dataset and subsequently fine-tuned on KITTI real-world data, demonstrating effective simulation-to-real transfer. Experiments show that ADM-Fusion remains robust under degraded conditions while maintaining competitive performance against existing methods.
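
As a rough illustration of what routing-based sensor weighting means in general, the sketch below fuses per-sensor feature vectors with softmax weights computed from router scores. It is a generic mixture-of-experts combination, not the ADM-Fusion architecture: the function names, the two-sensor setup, and the hand-set router scores are all hypothetical, and in a learned system the scores would come from a trained routing network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_outputs, router_scores):
    """Weighted sum of per-sensor expert outputs.

    expert_outputs: one feature vector per sensor expert (equal lengths).
    router_scores:  one scalar score per expert; higher means more trusted.
    """
    weights = softmax(router_scores)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


# Hypothetical camera and LiDAR expert features with equal trust.
fused, weights = fuse([[1.0, 0.0], [3.0, 2.0]], router_scores=[0.0, 0.0])
print(fused, weights)

# A strongly degraded second sensor receives a near-zero weight.
_, degraded = fuse([[1.0, 0.0], [3.0, 2.0]], router_scores=[10.0, -10.0])
print(degraded)
```

Because the weights always sum to one, down-weighting a degraded sensor automatically shifts the fused estimate toward the remaining modalities.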

[75] arXiv:2606.25965 (replaced) [pdf, html, other]
Title: Mixture-of-Experts RL for Fault-Tolerant Legged Locomotion
Subjects: Robotics (cs.RO)

Legged robots deployed in planetary exploration and other remote environments must maintain reliable locomotion despite actuator failures and challenging terrain conditions. Although reinforcement learning has achieved strong results in legged locomotion, monolithic policies can struggle to efficiently represent the diverse control strategies required to compensate for different fault conditions. In this work, we propose a fault-aware modular control architecture that explicitly leverages fault-diagnosis information to activate specialized control experts associated with distinct actuator failure modes. Experimental results show that explicit fault-conditioned modular policies consistently outperform monolithic policies of comparable size, achieving higher locomotion performance across failure scenarios. Moreover, the proposed modular architecture retains competitive performance even under significantly reduced network capacity, highlighting its suitability for compute-constrained robotic platforms, such as those typically employed in space applications. The code associated with this work is available at: this https URL.

[76] arXiv:2606.27962 (replaced) [pdf, html, other]
Title: Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence
Subjects: Robotics (cs.RO)

This paper presents a cloud-native simulation infrastructure framework for embodied intelligence that supports large-scale training, standardized evaluation, and simulation-based data collection. The framework unifies simulation environment generation, task execution, trajectory collection, model evaluation, data management, and cloud services into a scalable and reproducible platform. To address the high cost, limited scalability, and poor reproducibility of real-world robotic data collection, the framework adopts cloud-native technologies including elastic resource scheduling, containerized simulation, unified data management, and service-oriented system design, enabling efficient large-scale simulation for multi-model and multi-task workloads. Built on a four-layer architecture, the framework provides standardized environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization. It further integrates representative systems including D-VLA, RL-VLA3, Sword, and Pre-VLA to support scalable simulation, dynamic scheduling, visual augmentation, and real-time data filtering. We argue that cloud-native simulation infrastructure provides a unified foundation for data generation, model training, standardized evaluation, and real-world deployment, and will play a key role in the future development of embodied intelligence.

[77] arXiv:2606.28805 (replaced) [pdf, html, other]
Title: Physics Models for Sim-to-Real Transfer in Professional-Level Robot Table Tennis
Christian Conti (1), Bilan Yang (1), Alexander Sigrist (2), Lorenzo Miele (2), Yamen Saraiji (1), Peter Dürr (2), Naoya Takahashi (2) ((1) Sony AI, Tokyo, Japan, (2) Sony AI, Zurich, Switzerland)
Comments: 8 pages, 7 figures, additional information: this https URL, Submitted to IEEE Robotics and Automation Letters (RA-L)
Subjects: Robotics (cs.RO)

At competitive speeds and spins, a table tennis ball follows complex, counterintuitive trajectories that a robot must track and precisely counter within fractions of a second. Training a reinforcement learning policy capable of these skills is prohibitively expensive and dangerous in the real world, making high-fidelity simulation essential. Transferability of such policies, however, critically depends on how faithfully the simulation captures real-world dynamics - a requirement made even more stringent by the adversarial nature of the game, where any modeling inaccuracy becomes an exploitable weakness for the opponent. Prior state-of-the-art work in robot table tennis generally focuses on a limited range of velocities and spins and fails to capture the richness of ball behaviors encountered in professional-level play. In this work, we present physics models for aerodynamic ball flight, ball-table contact, and ball-racket contact that accurately capture the ball behavior over a vast range of speeds and spins relevant to the game. Specifically, we model drag and Magnus force coefficients as functions of Reynolds number and spin ratio in the aerodynamics equations. For the table contact model, we model the effects of ball buckling on the coefficient of restitution and incorporate residuals into the instantaneous point-contact models. For the racket contact model, we introduce a residual neural network component to complement the coefficients governing normal and tangential restitution as well as torsional spin damping. Evaluated on an unprecedentedly large dataset of competitive matches (277 games), the proposed models significantly reduce prediction errors (e.g., 59% median landing-position error reduction). The resulting models were used to train the RL policies for the first real-world robot table tennis AI agent capable of competing against professional players.
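
To make the aerodynamic part concrete, the sketch below evaluates a textbook drag-plus-Magnus acceleration for a spinning ball, with the coefficients supplied as functions of Reynolds number and spin ratio. It is an assumption-laden illustration rather than the paper's model: the air properties, the force parameterization, and the constant coefficient values in the usage lines are placeholders, and the fitted coefficient functions from the paper are not reproduced here. The ball diameter (40 mm) and mass (2.7 g) are the standard regulation values.

```python
import math

RHO = 1.225        # air density in kg/m^3 (sea-level value, assumed)
MU_AIR = 1.81e-5   # dynamic viscosity of air in Pa*s (assumed)
DIAMETER = 0.040   # ball diameter in m
MASS = 0.0027      # ball mass in kg
G = 9.81           # gravitational acceleration in m/s^2


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reynolds_number(speed):
    return RHO * speed * DIAMETER / MU_AIR


def spin_ratio(speed, spin_rate):
    """Surface speed due to spin divided by translational speed."""
    return 0.5 * DIAMETER * spin_rate / speed if speed > 0.0 else 0.0


def acceleration(v, omega, cd_fn, cl_fn):
    """Gravity + drag + Magnus acceleration; z points up.

    cd_fn and cl_fn are hypothetical callables (reynolds, spin_ratio) -> coefficient.
    """
    acc = [0.0, 0.0, -G]
    speed = norm(v)
    if speed == 0.0:
        return acc
    re = reynolds_number(speed)
    s = spin_ratio(speed, norm(omega))
    area = math.pi * (0.5 * DIAMETER) ** 2
    q = 0.5 * RHO * area * speed * speed  # dynamic pressure times area, in N
    cd, cl = cd_fn(re, s), cl_fn(re, s)
    for i in range(3):                    # drag opposes the velocity
        acc[i] -= q * cd * v[i] / speed / MASS
    w_x_v = cross(omega, v)               # Magnus force acts along omega x v
    n = norm(w_x_v)
    if n > 0.0:
        for i in range(3):
            acc[i] += q * cl * w_x_v[i] / n / MASS
    return acc


# Topspin ball moving along +x; constant coefficients are placeholders only.
print(acceleration((10.0, 0.0, 0.0), (0.0, 100.0, 0.0),
                   lambda re, s: 0.5, lambda re, s: 0.2))
```

With topspin about the +y axis and motion along +x, the Magnus term points downward, so the ball dips faster than gravity alone would predict.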

[78] arXiv:2606.30552 (replaced) [pdf, html, other]
Title: Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Cross-embodiment transfer in vision-language-action (VLA) models remains challenging because low-level state and action spaces differ fundamentally across robot platforms. We observe that the high-level cognitive process underlying manipulation, including scene perception, object identification, task planning, and sub-task decomposition, is largely shared across embodiments. Based on this observation, we present ZR-0, a 2.6 billion parameter end-to-end VLA model that uses dense Embodied Chain-of-Thought (ECoT) supervision to align cross-embodiment representations within the vision-language model (VLM). ZR-0 adopts a dual-stream architecture: a pre-trained VLM (System 2) generates structured ECoT reasoning during training, while a Diffusion Transformer-based action expert (System 1) produces continuous action chunks via flow matching. The two components are coupled through cross-attention, with an attention mask that restricts the action expert to input prompt features only, enabling ECoT generation to be entirely skipped at inference without any performance loss. ZR-0 is pre-trained on ProcCorpus-60M, a large-scale dataset comprising approximately 60 million frames (approximately 1,000 hours) from over 400K trajectories, with dense ECoT annotations covering 96.8% of all frames. We evaluate ZR-0 on three simulation benchmarks spanning single-arm (LIBERO), bimanual (RoboTwin 2.0), and humanoid (RoboCasa GR-1 Tabletop) embodiments, as well as real-world experiments on the xArm platform, demonstrating strong performance across all settings. Code and model checkpoints are available at this https URL.

[79] arXiv:2606.30900 (replaced) [pdf, html, other]
Title: The Quadruped Soft Tail: Compliant Grasping and Swabbing for Contamination Surveys in Harsh Environments
Subjects: Robotics (cs.RO)

Beryllium contamination surveys in radioactive areas are challenging for robots in environments cluttered with cables and electronics. To address this problem, we have developed a novel quadruped system augmentation: a lightweight, soft, and compliant tendon-actuated robotic tail mounted on a quadruped robot. The tail features a hollow, flexible backbone and a tendon-actuated soft gripper that enables the robot to pick up sampling tissues, swab contaminated surfaces, and release the tissues at designated collection locations for subsequent beryllium analysis. To enable intuitive teleoperation, a closed-form kinematic model and a singularity-robust task-space controller are developed. Experimental results demonstrate that gripper actuation has a negligible effect on robot shape, while common-mode tendon actuation provides an effective mechanism for stiffness modulation and preload control. Furthermore, experimental validation indicates that the proposed kinematic model provides a suitable basis for real-time task-space control. The proposed system combines the agility of legged locomotion with the compliance of soft robotic manipulation, enabling the complete contamination-survey procedure to be performed without human exposure. While motivated by beryllium contamination surveys at CERN, the proposed quadruped soft-tail concept is broadly applicable to legged robots operating in cluttered, confined, or hazardous environments where conventional rigid-link manipulators are undesirable.

[80] arXiv:2606.30988 (replaced) [pdf, html, other]
Title: Multisensory Continual Learning: Adapting Pretrained Visuomotor Policies to Force
Subjects: Robotics (cs.RO)

Robot manipulation often relies on sensory feedback beyond vision, particularly in contact-rich settings where force, tactile, or audio signals reveal interaction states that are not directly observable from images. However, these modalities are often hardware- and task-specific, and large-scale multisensory robot datasets remain scarce. As a result, it is impractical to pretrain policies with every sensor they may encounter. We study multisensory continual learning: adapting a pretrained robot policy to new tasks with newly introduced modalities while preserving performance under the original sensor suite. We propose MuSe, which incorporates limited multisensory data into pretrained vision-only policies through multi-stage fusion, multisensory future prediction, and experience replay over pretraining data. We instantiate MuSe by augmenting a pretrained vision-only policy with force-torque sensing and evaluate it on real-world manipulation tasks. Our experiments show that MuSe performs strongly on contact-rich finetuning tasks while preserving, and in some cases improving, performance on the original pretraining tasks. These results suggest that a modest multisensory dataset can improve general robot capabilities beyond the finetuning distribution. Project website: this https URL

[81] arXiv:2606.31037 (replaced) [pdf, html, other]
Title: Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory
Comments: Project page: this https URL
Subjects: Robotics (cs.RO)

Laboratory automation has made remarkable progress through robotic platforms and AI-driven scientific reasoning. However, many laboratory operations (e.g., solid-solid transfer) remain inherently dynamic and require real-time adaptation to different materials and experimental conditions. Such precision-critical manipulations are difficult to standardize, motivating the use of humanoid robots with dexterous hands. Despite this opportunity, no existing benchmark evaluates humanoid manipulation in precision-critical laboratory environments. We present Labimus, to our knowledge the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories. Labimus reconstructs over 30 functionally faithful assets from real organic chemistry workstations through real-to-sim modeling, collectively covering the core operations of routine organic chemistry experiments. The benchmark integrates articulated laboratory instruments, particle-based powder physics, and closed-loop instrument readouts, enabling a complete manipulation-to-measurement pipeline. It further defines six atomic operations and a seven-step solid-weighing workflow derived from real laboratory standard operating procedures. We introduce a precision-aware evaluation protocol designed to jointly measure task completion, experimental precision, and long-horizon execution. We benchmark three representative policies under procedural layouts and environmental perturbations. Results reveal a precision gap: policies that successfully complete laboratory tasks can still fail to satisfy the quantitative tolerances required by experimental protocols. Our benchmark exposes a fundamental disconnect between task completion and experimental validity, providing a new testbed for developing reliable humanoid robots for scientific laboratories.

[82] arXiv:2606.31329 (replaced) [pdf, html, other]
Title: 3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
Comments: Published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026. Code: this https URL. Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector trajectories predicted by a Vision-Language Model (VLM) as explicit guidance for a downstream policy. However, state-of-the-art low-level policies operate in 3D metric space on point clouds, and feeding them 2D guidance that lacks depth forces each waypoint to be assigned the depth of whatever scene surface lies beneath it, producing geometrically distorted trajectories. We propose 3D HAMSTER, a hierarchical framework that closes this gap by having the planner directly output metrically reliable 3D trajectories. We augment a VLM with a dedicated depth encoder and a dense depth reconstruction objective to predict 3D waypoint sequences, which are directly integrated into a point-cloud-based low-level policy. Across 3D trajectory prediction, simulation, and real-world manipulation, 3D HAMSTER consistently outperforms proprietary VLMs and 2D-guided baselines, with the largest gains under appearance-altering shifts and unseen language, spatial, and visual conditions. The project page is available at this https URL.

[83] arXiv:2606.31483 (replaced) [pdf, html, other]
Title: A Large-Language-Model Supported Personalized Driving Framework for Lane Change in Highway Scenarios
Subjects: Robotics (cs.RO)

Personalized driving can improve the user acceptance of automated driving systems. However, existing methods still provide limited support for translating natural-language driving preferences, especially when such preferences are expressed implicitly, into executable and distinguishable driving behaviors. This paper proposes a large language model (LLM)-supported personalized driving framework for highway lane-change scenarios. The framework maps natural-language driving commands to executable planning parameters in the open-source Apollo automated driving stack according to three driving styles: aggressive, normal, and conservative. To establish this mapping, candidate planning parameters are evaluated based on the resulting lane-change behaviors, and style-specific parameter sets are constructed through clustering and style-intensity ranking. For command interpretation, a retrieval dataset is constructed to support retrieval-augmented generation (RAG), enabling LLM-based interpretation of implicit user commands. Experimental results show that the derived parameter sets generate distinguishable personalized lane-change behaviors, while RAG consistently improves preference interpretation, particularly for implicit commands. These results indicate the potential of integrating LLM-based natural-language interaction with Apollo to support personalized lane-change behavior generation. The source code and the relevant datasets are available at: this https URL.

[84] arXiv:2407.15283 (replaced) [pdf, html, other]
Title: Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Industry is moving toward autonomous, network-connected machines that detect and adapt to changing conditions, including hardware faults. Conventional fault-tolerant design duplicates hardware and reroutes control logic; reinforcement learning (RL) offers a learning-based alternative. This paper presents the first systematic comparison of two RL algorithms -- Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) -- for integrating fault tolerance into control. Beyond algorithm choice, we investigate four knowledge-transfer strategies: retaining or discarding model parameters, and retaining or discarding storage contents. Performance is evaluated in two Gymnasium environments: Ant-v5 and FetchReachDense-v3. Results show rapid, fault-specific recovery with clear trade-offs. In Ant-v5, retaining PPO's parameters boosts early returns and remains the safest choice across all faults, while retaining SAC's parameters yields mixed outcomes. SAC's early performance further depends on whether the replay buffer is retained: beneficial when prior experiences match current dynamics, but harmful when they diverge. In FetchReachDense-v3, discarding both PPO's and SAC's parameters was most effective under sensor corruption. Across tasks, both algorithms recover near-normal performance within minutes in low-dimensional settings and within days in high-dimensional settings, highlighting a clear trade-off between adaptation speed and asymptotic performance. These findings demonstrate that RL can deliver robust fault tolerance and offer practical guidelines.

[85] arXiv:2505.07254 (replaced) [pdf, html, other]
Title: Towards Accurate State Estimation: Motion Dynamics Kalman Filter for 3D Multi-Object Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Precise 3D state estimation in multi-object tracking (MOT) is critical for self-driving cars, particularly for occluded objects. Motion modeling in the Kalman filter with a constant-motion assumption is widely used in MOT methods, but it neglects the continuous changes in objects' motion caused by traffic in urban environments. Although recent research introduces multimodel Kalman filters that incorporate multiple motion models, these approaches incur significant computational overhead from the simultaneous processing of multiple models. To this end, this work introduces a motion-dynamics Kalman filter (MD-KF) that overcomes the constant-motion assumption while retaining a single motion model. MD-KF models the changes in objects' motion over successive measurements as Gaussian distributions, and adaptively adjusts a weighted motion model to account for these variations. MD-KF consistently outperforms constant-motion and multimodel KFs across multiple datasets with a significant reduction in computation latency compared to multimodel approaches. The proposed approach demonstrates its superiority in trajectory estimation during occlusion and state estimation stability for stationary objects.
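For readers unfamiliar with the constant-motion baseline that the abstract contrasts against, the following is a minimal sketch of a one-dimensional constant-velocity Kalman filter. It is not the MD-KF algorithm: the function names, the diagonal process noise, and the numeric defaults are all illustrative assumptions.

```python
def predict(x, P, dt, q):
    """Constant-velocity prediction. x = [position, velocity], P is a 2x2 covariance."""
    p, v = x
    x_pred = [p + v * dt, v]
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(q, q) (assumed form).
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    return x_pred, [[n00, n01], [n10, n11]]


def update(x, P, z, r):
    """Measurement update for a position-only observation z with variance r."""
    s = P[0][0] + r                      # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain for H = [1, 0]
    y = z - x[0]                         # innovation
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    # P' = (I - K H) P
    P_new = [
        [(1.0 - k0) * P[0][0], (1.0 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new


def track(measurements, dt=1.0, q=0.01, r=0.1):
    """Run the filter over a list of position measurements and return the final state."""
    x = [0.0, 0.0]                       # hypothetical initial state
    P = [[10.0, 0.0], [0.0, 10.0]]       # hypothetical initial uncertainty
    for z in measurements:
        x, P = predict(x, P, dt, q)
        x, P = update(x, P, z, r)
    return x
```

Under this model the velocity is assumed fixed between measurements, which is the assumption the abstract says MD-KF relaxes.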

[86] arXiv:2603.06254 (replaced) [pdf, html, other]
Title: NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Comments: Accepted to IROS 2026. Code will be available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an autoregressive association formulation that shifts the data association stage from fragmented distance-based matching toward trajectory-conditioned spatio-semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at this https URL.

[87] arXiv:2603.14354 (replaced) [pdf, html, other]
Title: Deconfounded Lifelong Learning for Autonomous Driving via Dynamic Knowledge Spaces
Comments: Accepted by ECCV 2026
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

End-to-End autonomous driving (E2E-AD) systems face challenges in lifelong learning, including catastrophic forgetting, difficulty in knowledge transfer across diverse scenarios, and spurious correlations between unobservable confounders and true driving intents. To address these issues, we propose DeLL, a Deconfounded Lifelong Learning framework that integrates a Dirichlet process mixture model (DPMM) with the front-door adjustment mechanism from causal inference. The DPMM is employed to construct two dynamic knowledge spaces: a trajectory knowledge space for clustering explicit driving behaviors and an implicit feature knowledge space for discovering latent driving abilities. Leveraging the non-parametric Bayesian nature of DPMM, our framework enables adaptive expansion and incremental updating of knowledge without predefining the number of clusters, thereby mitigating catastrophic forgetting. Meanwhile, the front-door adjustment mechanism utilizes the DPMM-derived knowledge as mediators to deconfound spurious correlations, such as those induced by sensor noise or environmental changes, and enhances the causal expressiveness of the learned representations. Additionally, we introduce an evolutionary trajectory decoder that enables non-autoregressive planning. To evaluate the lifelong learning performance of E2E-AD, we propose new evaluation protocols and metrics based on Bench2Drive. Extensive evaluations in the closed-loop CARLA simulator demonstrate that our framework significantly improves adaptability to new driving scenarios and overall driving performance, while effectively retaining previously acquired knowledge. Code: this https URL

[88] arXiv:2603.22055 (replaced) [pdf, other]
Title: MineRobot: An Actuator-Centered Kinematic Modeling and Solving Framework for Underground Mining Robots
Subjects: Graphics (cs.GR); Robotics (cs.RO)

Underground mining robots are increasingly modeled for planning, operator training, and digital-twin workflows, where reliable actuator-level kinematics is needed to reduce hazardous in situ trials. Unlike typical open-chain industrial manipulators, representative mining machines are often linear-actuator-driven closed-chain mechanisms with planar four-bar linkages, making reusable kinematic modeling and real-time FK/IK solving challenging. We present MineRobot, an actuator-centered framework for modeling and solving the kinematics of this representative mechanism class. MineRobot introduces the Mining Robot Description Format (MRDF), a domain-specific representation that parameterizes mining-robot kinematics with native semantics for actuators and loop closures. It then contracts planar four-bar substructures into generalized joints and extracts, for each actuator, an Independent Topologically Equivalent Path (ITEP) classified into four canonical types. Based on this decomposition, per-type solvers are composed into a sequential forward-kinematics (FK) pipeline, while inverse kinematics (IK) is formulated as a bound-constrained actuator-length optimization solved by a Gauss-Seidel-style update scheme. By converting coupled closed-chain kinematics into small topology-aware solves, MineRobot reduces robot-specific hand derivations and supports efficient repeated FK/IK computation without treating each query as a full coupled constraint-solving problem. Experiments on representative underground mining robots demonstrate real-time FK performance and robust IK convergence within the tested operating ranges, supporting the use of MineRobot as an actuator-centered kinematic layer for planning, training, and digital-twin workflows.
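To make the notion of actuator-level kinematics concrete, here is a minimal sketch of the simplest linear-actuator case: an actuator spanning two links that meet at a revolute joint forms a triangle, so the joint angle follows from the law of cosines. This is a textbook relation, not MineRobot's MRDF format or its ITEP solvers. The function names and the stroke-limit helper are hypothetical.

```python
import math


def joint_angle_from_length(a, b, length):
    """Joint angle (rad) of a triangle with fixed sides a, b and actuator side `length`.

    a and b are the distances from the revolute joint to the two actuator
    attachment points (assumed fixed geometry).
    """
    cos_theta = (a * a + b * b - length * length) / (2.0 * a * b)
    if not -1.0 <= cos_theta <= 1.0:
        raise ValueError("actuator length is outside the reachable range")
    return math.acos(cos_theta)


def length_from_joint_angle(a, b, theta):
    """Inverse relation: actuator length that produces joint angle theta."""
    return math.sqrt(a * a + b * b - 2.0 * a * b * math.cos(theta))


def clamp_length(length, lo, hi):
    """Project a requested actuator length onto its stroke limits [lo, hi]."""
    return max(lo, min(hi, length))
```

A bound-constrained solve such as the one the abstract mentions would need some projection onto stroke limits; `clamp_length` only illustrates that idea for a single actuator.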

[89] arXiv:2603.23405 (replaced) [pdf, html, other]
Title: Planning over MAPF Agent Dependencies via Multi-Dependency PIBT
Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT and its variants, such as Enhanced PIBT (EPIBT), are constrained by their rule-based planning procedure and lack generality because they restrict their search to paths that collide with at most one other agent. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT's priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT), which searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations generalize PIBT and EPIBT to multi-step planning capable of reasoning about paths that collide with more than one other agent. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents. Our code is available at this https URL.

[90] arXiv:2603.27757 (replaced) [pdf, html, other]
Title: E-TIDE: Fast, Structure-Preserving Motion Forecasting from Event Sequences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.

[91] arXiv:2604.04198 (replaced) [pdf, html, other]
Title: DriveVA: Video Action Models are Zero-Shot Drivers
Comments: Accepted to ECCV 2026. 30 pages, 12 figures, 11 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves a PDM score of 90.9 on the NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes, and by 52.5% and 52.4% on Bench2Drive (built on CARLA v2), compared with the state-of-the-art world-model-based planner.

[92] arXiv:2604.16993 (replaced) [pdf, html, other]
Title: Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.

[93] arXiv:2605.00271 (replaced) [pdf, html, other]
Title: REALM: An RGB- and Event-Aligned Latent Manifold for Cross-Modal Perception
Comments: Accepted to the European Conference on Computer Vision (ECCV), Malmö, SE, 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Event cameras provide several unique advantages over standard frame-based sensors, including high temporal resolution, low latency, and robustness to extreme lighting. However, existing learning-based approaches for event processing are typically confined to narrow, task-specific silos and lack the ability to generalize across modalities. We address this gap with REALM, a cross-modal framework that learns an RGB- and Event-Aligned Latent Manifold by projecting event representations into the pretrained latent space of RGB foundation models. Instead of task-specific training, we leverage low-rank adaptation (LoRA) to bridge the modality gap, effectively unlocking the geometric and semantic priors of frozen RGB backbones for asynchronous event streams. We demonstrate that REALM effectively maps events into the ViT-based foundation latent space. Our method performs downstream tasks, such as depth estimation and semantic segmentation, by simply transferring linear heads trained on the RGB teacher. Most significantly, REALM enables the direct, zero-shot application of complex, frozen image-trained decoders, such as MASt3R, to raw event data. We demonstrate state-of-the-art performance in wide-baseline feature matching, significantly outperforming specialized architectures. Code and models are available at this https URL.
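The abstract does not say which layers are adapted or what rank is used. As a generic illustration of low-rank adaptation only, the sketch below applies the standard LoRA update y = W x + (alpha / r) B A x on top of a frozen weight matrix W. The function names and the tiny example matrices are hypothetical; initializing B to zero is the usual LoRA convention and makes the adapted layer start out identical to the frozen one.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Forward pass of a linear layer with a low-rank update.

    W: frozen weights, shape d_out x d_in (never modified here).
    A: trainable down-projection, shape r x d_in.
    B: trainable up-projection, shape d_out x r.
    """
    r = len(A)                          # rank of the update
    base = matvec(W, x)                 # frozen path
    low_rank = matvec(B, matvec(A, x))  # adapter path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]
```

Only A and B would be trained in such a setup, which is why the frozen backbone's priors remain available to the new input modality.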

[94] arXiv:2606.31043 (replaced) [pdf, html, other]
Title: Warp RL: Reshaping Base Policy Distributions for Dynamics Adaptation
Comments: 17 pages, 7 figures
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

Residual reinforcement learning adapts a pretrained robot policy by learning an additive correction to its actions. While effective when adaptation amounts to shifting the base policy's action distribution, additive corrections cannot change the distribution's shape, scale, or state-dependent geometry -- limitations we formalize as wrong variance, miscalibrated confidence, and non-uniform correction. We show that these matter under dynamics shift: when the base distribution is geometrically mismatched to the shifted system, residual correction can underperform even the unadapted policy. We propose Warp RL, a policy adaptation method that replaces additive residuals with an invertible, state-conditioned transformation of the base policy's action distribution. Instantiated with monotonic rational-quadratic spline flows (arXiv:1906.04032), Warp RL preserves identity initialization, strictly generalizes additive residual correction, and exposes a structured adaptation space suitable for both policy-gradient and gradient-free optimization. Across a variety of ManiSkill3 manipulation tasks with controlled dynamics shifts, Warp RL matches residual correction when translation is sufficient and substantially outperforms it when adaptation requires distributional reshaping. We further demonstrate that warping can replace additive correction in an off-policy sim-to-real pipeline, achieving comparable success rate with 30% faster task completion on a real-robot peg-insertion task.
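The contrast between additive residuals and an invertible warp can be sketched with a much simpler transformation than the one the abstract names. The example below swaps the rational-quadratic spline flow for an elementwise affine map a' = exp(s) * a + t, purely for illustration. All function and parameter names are hypothetical, and in a state-conditioned version s and t would be produced by a learned function of the state.

```python
import math


def residual_adapt(base_action, delta):
    """Additive residual: shifts the base action but cannot rescale its spread."""
    return [a + d for a, d in zip(base_action, delta)]


def affine_warp(base_action, log_scale, shift):
    """Invertible elementwise warp a' = exp(log_scale) * a + shift.

    With log_scale = 0 and shift = 0 this is the identity (identity initialization);
    with log_scale = 0 alone it reduces to the additive residual.
    """
    return [math.exp(s) * a + t for a, s, t in zip(base_action, log_scale, shift)]


def affine_warp_inverse(warped, log_scale, shift):
    """Exact inverse of affine_warp, showing the map loses no information."""
    return [(w - t) * math.exp(-s) for w, s, t in zip(warped, log_scale, shift)]
```

Applying a nonzero `log_scale` to samples from the base policy changes their variance, which an additive correction alone cannot do. A spline flow extends the same idea to more general monotonic reshaping.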
