MotionLM: Multi-Agent Motion Forecasting as Language Modeling

Authors

Ari Seff
Brian Cera
Dian Chen
Mason Ng
Aurick Zhou
Nigamaa Nayakanti
Khaled S. Refaat
Rami Al-Rfou
Benjamin Sapp

Abstract

Reliable forecasting of the future behavior of road agents is a critical component of safe planning in autonomous vehicles. Here, we represent continuous trajectories as sequences of discrete motion tokens and cast multi-agent motion prediction as a language modeling task over this domain. Our model, MotionLM, provides several advantages: First, it does not require anchors or explicit latent variable optimization to learn multimodal distributions. Instead, we leverage a single standard language modeling objective, maximizing the average log probability over sequence tokens. Second, our approach bypasses post-hoc interaction heuristics where individual agent trajectory generation is conducted prior to interactive scoring. Instead, MotionLM produces joint distributions over interactive agent futures in a single autoregressive decoding process. In addition, the model's sequential factorization enables temporally causal conditional rollouts. The proposed approach establishes new state-of-the-art performance for multi-agent motion prediction on the Waymo Open Motion Dataset, ranking 1st on the interactive challenge leaderboard.

Overview

[Image: Overall framework of MotionLM displaying continuous trajectories represented as sequences of discrete motion tokens.]

MotionLM autoregressively generates sequences of discrete tokens for a set of agents to produce interactive trajectory forecasts. At each timestep, a token is sampled for each agent from a finite vocabulary and appended to the global sequence.
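The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not the MotionLM implementation: the five-token 1-D vocabulary `DELTAS`, the `toy_logits` stand-in for the transformer decoder, and all function names are assumptions made for the example.

```python
import math
import random

# Hypothetical 1-D vocabulary: each token is a per-step displacement in metres.
DELTAS = [-1.0, -0.5, 0.0, 0.5, 1.0]

def toy_logits(agent, history):
    """Stand-in for the decoder (assumption): one score per vocabulary token.

    `history` is the global sequence so far, a list of per-step tuples holding
    one token per agent. This toy simply favours repeating the agent's
    previous token.
    """
    scores = [0.0] * len(DELTAS)
    if history:
        scores[history[-1][agent]] += 2.0
    return scores

def sample_token(scores, rng):
    """Sample a token index from softmax(scores)."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

def rollout(num_agents, num_steps, rng, logits_fn=toy_logits):
    """Sample one token per agent per timestep and append it to the sequence."""
    history = []
    for _ in range(num_steps):
        # Every agent at this step is scored against earlier steps only.
        step = tuple(sample_token(logits_fn(a, history), rng)
                     for a in range(num_agents))
        history.append(step)
    return history

def decode(history, agent, start=0.0):
    """Turn one agent's tokens back into positions by accumulating deltas."""
    positions, x = [], start
    for step in history:
        x += DELTAS[step[agent]]
        positions.append(x)
    return positions

rng = random.Random(0)
tokens = rollout(num_agents=2, num_steps=5, rng=rng)
trajectory = decode(tokens, agent=0)
```

Calling `rollout` repeatedly with different random states yields different token sequences for the same scene, which is how a sampled set of candidate futures would be collected in this sketch.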

[Image: MotionLM architecture diagram displaying the autoregressive transformer decoder sampling sequences of motion tokens.]

Because MotionLM bypasses geometric anchors and latent variable optimization, multimodal distributions emerge solely from per-step sampling. The training objective is kept simple, with minimal assumptions: just next-token prediction.
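To make the next-token objective concrete, the sketch below quantizes a ground-truth sequence of per-step displacements into tokens and computes the average log probability that a model assigns to them. The uniform five-bin quantizer, the value range, and the function names are illustrative assumptions and do not come from this page.

```python
import math

def quantize(deltas, lo=-1.0, hi=1.0, num_bins=5):
    """Map each continuous delta to the index of the nearest uniform bin.

    Values outside [lo, hi] are clipped; exact ties follow Python's round().
    """
    width = (hi - lo) / (num_bins - 1)
    tokens = []
    for d in deltas:
        clipped = min(max(d, lo), hi)
        tokens.append(round((clipped - lo) / width))
    return tokens

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

def average_log_prob(step_scores, targets):
    """Average log probability of the target tokens; training maximizes this."""
    total = sum(log_softmax(scores)[t]
                for scores, t in zip(step_scores, targets))
    return total / len(targets)

# Made-up ground-truth displacements (metres) for a single agent.
targets = quantize([0.1, 0.6, 0.9])
# A model that is uniform over the five tokens at every step.
uniform_scores = [[0.0] * 5 for _ in targets]
value = average_log_prob(uniform_scores, targets)
```

With a uniform model, `value` equals `log(1/5)` regardless of the targets; a trained model would raise it by placing more probability on the observed tokens.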

The resulting model can perform marginal, joint, and conditional forecasting. MotionLM establishes new state-of-the-art performance on both the Waymo Open Motion Dataset motion prediction and interaction prediction benchmarks.

Examples

Marginal vs. Joint

Attention-based interactive modeling during decoding allows for scene-level consistency. While marginal (independent per agent) predictions may lead to unrealistic overlap (left), joint predictions exhibit appropriate reactions across agents (right).

Marginal	Joint
[Image: Abstract depiction of marginal predictions for road agents leading to unrealistic overlap/collisions.]	[Image: Abstract depiction of joint predictions for road agents leading to realistic interactions.]
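The contrast shown above can be reproduced with a deliberately simple toy, assuming a two-token vocabulary and hand-written policies in place of a learned decoder. The only difference between the two modes is whether the follower can see where the leader's earlier tokens have taken it. Decoding here is greedy rather than sampled so that the outcome is deterministic, and hiding the leader entirely in the marginal case is a simplification of this toy.

```python
STOP, GO = 0, 1  # hypothetical vocabulary: advance 0 or 1 cell per step

def leader_policy(step_index):
    """Scripted leader: drives for two steps, then brakes."""
    return GO if step_index < 2 else STOP

def follower_policy(own_pos, leader_pos_seen):
    """Advance unless the leader is known to occupy the next cell."""
    if leader_pos_seen is not None and own_pos + 1 >= leader_pos_seen:
        return STOP
    return GO

def simulate(num_steps, joint, leader_start=2, follower_start=0):
    """Return (follower position, leader position, whether they overlapped)."""
    leader, follower = leader_start, follower_start
    overlapped = False
    for t in range(num_steps):
        # Joint: the follower conditions on the leader's tokens from earlier
        # steps, i.e. on the leader's current position.
        # Marginal: the leader's future tokens are invisible to the follower.
        seen = leader if joint else None
        follower_token = follower_policy(follower, seen)
        leader_token = leader_policy(t)
        follower += follower_token
        leader += leader_token
        overlapped = overlapped or follower >= leader
    return follower, leader, overlapped

joint_result = simulate(6, joint=True)      # follower stops behind the leader
marginal_result = simulate(6, joint=False)  # follower drives into the leader
```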

Marginal vs. Conditional

When the model is conditioned on a query agent's trajectory (magenta), the predicted agent's trajectory (cyan) responds appropriately.

Marginal	Conditional
[Image: Abstract depiction of a marginal prediction for a single road agent.] The marginal prediction for the pedestrian (cyan) crosses the street as the vehicle turns, leading to a collision.	[Image] When conditioning on the turning vehicle’s trajectory (magenta), the pedestrian (cyan) is predicted to yield.
[Image: Abstract depiction of a marginal prediction for a single road agent.] The marginal prediction for the modeled vehicle (cyan) collides with the lead vehicle.	[Image: Abstract depiction of a marginal prediction for a single road agent, conditioned on a query trajectory for a nearby agent.] When conditioning on the lead vehicle’s trajectory (magenta), the modeled vehicle (cyan) comes to an appropriate stop.
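A conditional rollout of the kind illustrated above can be sketched as follows, under the assumption that the query agent's tokens are supplied as a fixed sequence and only the target agent is sampled. At each step the target is scored against the query's position from earlier steps only, which is one way to read "temporally causal" conditioning. The two-token vocabulary, `target_scores`, and the start positions are hypothetical.

```python
import math
import random

DELTAS = [0.0, 1.0]  # hypothetical vocabulary: stay, or advance one cell

def target_scores(target_pos, query_pos):
    """Stand-in for the decoder (assumption): strongly prefer stopping when
    the query agent is in the next cell, otherwise prefer advancing."""
    blocked = query_pos - target_pos <= 1.0
    return [4.0, 0.0] if blocked else [0.0, 4.0]

def sample(scores, rng):
    """Sample a token index from softmax(scores)."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

def conditional_rollout(query_tokens, rng, query_start=2.0, target_start=0.0):
    """Sample the target's tokens while the query's tokens stay fixed."""
    query_pos, target_pos = query_start, target_start
    target_tokens = []
    for query_token in query_tokens:
        # The target sees the query only up to the previous step.
        token = sample(target_scores(target_pos, query_pos), rng)
        target_tokens.append(token)
        target_pos += DELTAS[token]
        # The query token is taken from the given sequence, never sampled.
        query_pos += DELTAS[query_token]
    return target_tokens

query_tokens = [1, 1, 0, 0, 0, 0]  # query agent advances twice, then stops
rng = random.Random(0)
target_tokens = conditional_rollout(query_tokens, rng)
```

Because the target is still sampled, repeated calls can produce different responses to the same query trajectory; only the query side is held fixed.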