Why are large language models so terrible at video games?!

DEV Community

Causality: Understanding that "shooting a barrel causes an explosion" requires more than just co-occurrence in text. It requires a causal model that LLMs do not inherently possess.

State Representation: The internal state of an LLM is primarily its hidden activations, which are not directly interpretable as game states (e.g., player coordinates, object properties).

To overcome this, researchers often combine LLMs with other AI components:

State Trackers: Explicit modules that monitor and interpret the game state.
World Simulators: External physics engines or game logic simulators.
Planning Modules: AI planners that use the LLM's high-level understanding to generate strategic goals.
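
The division of labor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference design: `GameState`, `StateTracker`, `WorldSimulator`, and `plan` are hypothetical names, the grid world and its event format are invented for the example, and the `goal` string stands in for whatever high-level intent an LLM would supply. The point is that the causal rule ("shooting a barrel causes an explosion") lives in explicit code rather than in the model's text statistics.

```python
from dataclasses import dataclass, field
from typing import Optional

Pos = tuple[int, int]  # (x, y) tile coordinates; hypothetical grid world


@dataclass
class GameState:
    """Explicit, interpretable state kept outside the LLM."""
    player_pos: Pos = (0, 0)
    barrels: set[Pos] = field(default_factory=set)
    enemies: set[Pos] = field(default_factory=set)


class StateTracker:
    """Updates the game state from observed events (assumed event format)."""

    def __init__(self, state: GameState) -> None:
        self.state = state

    def observe(self, event: dict) -> None:
        if event["type"] == "player_moved":
            self.state.player_pos = event["pos"]
        elif event["type"] == "enemy_spawned":
            self.state.enemies.add(event["pos"])


class WorldSimulator:
    """Encodes causal rules explicitly instead of relying on co-occurrence."""

    def simulate_shot(self, state: GameState, target: Pos) -> set[Pos]:
        """Return the enemies that shooting `target` would destroy."""
        if target in state.barrels:
            # Causal rule: a shot barrel explodes and hits adjacent tiles.
            tx, ty = target
            return {
                (ex, ey)
                for ex, ey in state.enemies
                if abs(ex - tx) <= 1 and abs(ey - ty) <= 1
            }
        # A direct shot only destroys an enemy standing on the target tile.
        return {target} & state.enemies


def plan(state: GameState, simulator: WorldSimulator, goal: str) -> Optional[Pos]:
    """Pick the shot whose simulated outcome best serves the high-level goal."""
    if goal != "eliminate enemies":
        return None  # other goals are out of scope for this sketch
    candidates = sorted(state.barrels | state.enemies)  # sorted for determinism
    return max(
        candidates,
        key=lambda t: len(simulator.simulate_shot(state, t)),
        default=None,
    )
```

With one barrel at `(5, 5)` and enemies at `(4, 5)`, `(6, 6)`, and `(9, 9)`, the planner selects the barrel because the simulator predicts it removes two enemies, whereas any direct shot removes one.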

Examples and Current Research Directions

Despite these challenges, significant research is underway to bridge the gap. These efforts often involve hybrid architectures:

LLM-as-a-Planner/Advisor: Using an LLM to generate high-level strategies or advice, which are then translated into executable actions by a lower-level controller or RL agent. For instance, in a strategy game, an LLM might suggest "focus on building defenses and researching technology," and a separate AI agent would manage the micro-level unit production and research queues.

# Conceptual example of LLM as a high-level planner.
# llm_model, describe_game_state, execute_actions, and current_state are
# placeholders for components supplied by the surrounding game integration.
def get_strategic_advice(game_state_description):
    prompt = f"""
    You are an expert RTS player. Based on the current game situation,
    provide a concise, high-level strategic recommendation.
    Game State: {game_state_description}
    Recommendation:
    """
    recommendation = llm_model.generate_text(prompt)
    return recommendation


def translate_recommendation_to_actions(recommendation, current_game_state):
    # Map the high-level recommendation to specific game commands.
    # Match case-insensitively on keywords, because free-form model output
    # rarely repeats an exact phrase.
    text = recommendation.lower()
    if "defenses" in text:
        return ["build_turret(location='base')", "research_armor_upgrade()"]
    elif "attack enemy base" in text:
        return ["gather_army('infantry', 'tanks')", "move_army(target='enemy_base')"]
    # ... more complex translation logic
    return []


# In the game loop:
game_state_text = describe_game_state(current_state)  # Convert game state to text
strategy = get_strategic_advice(game_state_text)
actions = translate_recommendation_to_actions(strategy, current_state)
execute_actions(actions)

Multimodal LLMs for Game Understanding: Employing models like GPT-4V, LLaVA, or specialized vision-language models that can directly process image inputs alongside text. These models can interpret visual cues and game state information simultaneously.

# Conceptual example using a multimodal LLM.
# multimodal_llm_api and MultiModalLLMClient are placeholders that stand in
# for whichever vision-language model SDK is actually used.
from multimodal_llm_api import MultiModalLLMClient

client = MultiModalLLMClient(api_key="YOUR_API_KEY")


def decide_action_multimodal(image_frame, text_overlay, game_state_dict):
    # Plain (non-f) string: the placeholders are filled in by .format() below.
    prompt = """
    You are an AI playing this game. Analyze the screen and game state.
    What is the best action to take right now?
    Current Game State: {game_state_dict}
    Visual Input: (image)
    Text Overlay: {text_overlay}
    Action:
    """
    response = client.generate_response(
        prompt=prompt.format(game_state_dict=game_state_dict, text_overlay=text_overlay),
        images=[image_frame],
    )
    return response.text  # e.g., "Move right and shoot"

LLMs as Knowledge Bases for Game AI: Using LLMs to provide game-specific knowledge, lore, or character motivations that can inform the decision-making of traditional AI agents, making them more believable or strategic.
LLM-driven Level Generation or Narrative: LLMs are well-suited for generating content. They can be used to create game levels, dialogue, quests, or storylines, which are then populated and made playable by other game systems.
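
When an LLM generates a level, another game system still has to confirm that the result is playable. The sketch below shows one way that check might look, under assumptions made for illustration only: the model is asked to return JSON of the form `{"rows": [...]}`, where `S` is the start, `E` the exit, `#` a wall, and `.` open floor. The function names `parse_level` and `is_solvable` are hypothetical, and the breadth-first search is a generic reachability check rather than any particular engine's validator.

```python
import json
from collections import deque


def parse_level(llm_output: str) -> list[str]:
    """Parse assumed JSON like {"rows": ["S.#", "#.E"]} into grid rows."""
    rows = json.loads(llm_output)["rows"]
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("level must be a non-empty rectangular grid")
    return rows


def is_solvable(rows: list[str]) -> bool:
    """Breadth-first search from 'S' to 'E', treating '#' as a wall."""
    start = next(
        ((r, c) for r, row in enumerate(rows) for c, ch in enumerate(row) if ch == "S"),
        None,
    )
    if start is None:
        return False  # a level without a start tile cannot be played

    queue = deque([start])
    seen = {start}
    while queue:
        r, c = queue.popleft()
        if rows[r][c] == "E":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(rows) and 0 <= nc < len(rows[0])
            if in_bounds and rows[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A generation loop could call `is_solvable(parse_level(text))` on each candidate and request a new level from the model whenever the check fails or the parse raises an error.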

Conclusion: Not "Terrible," but Fundamentally Mismatched for Direct Control

Large language models are not inherently "terrible" at video games in the sense of being incapable of processing game-related information. Instead, their current architecture and training paradigms present significant challenges for direct, real-time control and decision-making in dynamic, multimodal environments. A model that consumes and emits tokens one at a time is poorly matched to the high-dimensional visual input, real-time reactivity, continuous action spaces, and sparse reward structures inherent to most video games.
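
The real-time constraint is easy to quantify with some back-of-the-envelope arithmetic. The sketch below uses assumed, illustrative numbers (a 60 FPS game and a 500 ms model response), not measurements of any particular model, and `frames_elapsed` is a hypothetical helper name.

```python
def frames_elapsed(latency_ms: float, fps: int) -> float:
    """Number of frames the game renders while waiting for one model response."""
    return latency_ms * fps / 1000


# Illustrative assumptions: a 60 FPS game and a 500 ms round trip per decision.
print(frames_elapsed(500, 60))  # 30.0
```

Under those assumptions, the game advances 30 frames before a single decision arrives, which is one reason hybrid designs keep a fast low-level controller in the frame loop and consult the LLM only occasionally.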

However, LLMs are proving to be powerful components within broader AI systems for games. Their strengths in understanding context, generating coherent sequences, and reasoning about abstract concepts can be leveraged for high-level planning, narrative generation, and providing strategic advice. Future advancements will likely focus on more efficient multimodal integration, improved temporal reasoning, and seamless combination with reinforcement learning and traditional game AI techniques to unlock their full potential in interactive entertainment.

The limitations observed are not necessarily an indictment of LLMs' intelligence but a reflection of their design being optimized for a different modality and task. As research progresses, we can expect to see more sophisticated architectures that harness the power of LLMs within the complex domain of video games.


Originally published in Spanish at www.mgatc.com/blog/why-are-large-language-models-so-terrible-at-video-games/