Bachelor thesis: Level generation with artificial intelligence

By Moritz Leopold

Created on June 10, 2025


As part of my bachelor’s thesis at the University of Applied Sciences Augsburg, I explored the combination of two modern approaches in AI-driven game development: text-based generation of game content using Large Language Models (LLMs) and automated evaluation of playable content using reinforcement learning (RL).

An agent plays generated levels and collects metrics for validation.


🔍 State of the art

Procedural Content Generation

Procedural level generation is an established technique in game development. Traditional methods rely on hand-crafted rules or randomness and are rarely context-sensitive.

LLMs for level generation

Recent work demonstrates that language models like GPT-4 or domain models can generate structured game content (e.g., JSON, XML, or Doom WAD) from prompt text.

RL for evaluation

Instead of manual playtesting, an RL agent can be trained to assess the playability, difficulty, and navigability of a level purely through interaction with the generated environment.

ViZDoom as platform

ViZDoom provides a controllable first-person shooter environment based on the Doom engine, with Python API access, support for .wad/.cfg levels, and high modifiability.


🧠 Design & methodology

Objective

How feasible and efficient is it to generate levels with LLMs and evaluate them automatically?

System architecture

  • LLM generates Doom levels in a textual format
  • Levels are automatically converted to .wad/.cfg
  • RL agent plays levels via Sample Factory (APPO)
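
End to end, the architecture can be sketched as a generate/convert/evaluate loop. Every function below is an illustrative stub with a hypothetical name; the real ASCII-to-.wad conversion and Sample Factory integration are considerably more involved:

```python
# Illustrative sketch of the pipeline; every function body is a stub.

def generate_ascii_level(llm_model: str) -> str:
    """Stub for the LLM call: returns a fixed, constraint-conforming 15x15 grid."""
    return (
        "#" * 15 + "\n"
        + "#P" + " " * 12 + "#\n"            # player in row index 1
        + ("#" + " " * 13 + "#\n") * 10      # open interior
        + "#  E        E #\n"                # two enemies
        + "#G" + " " * 12 + "#\n"            # goal in row index 13
        + "#" * 15
    )

def convert_to_wad(ascii_level: str, out_path: str) -> str:
    """Stub for the ASCII -> .wad/.cfg conversion step."""
    return out_path

def evaluate_with_agent(wad_path: str, episodes: int = 100) -> dict:
    """Stub for the Sample Factory (APPO) evaluation run."""
    return {"survival_rate": 0.7, "avg_reward": 1.0}

level = generate_ascii_level("gemini-2.0-flash")
metrics = evaluate_with_agent(convert_to_wad(level, "level_01.wad"))
```

Each stub corresponds to one bullet of the architecture above.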

💡 Implementation

Tech stack

  • LLMs: DeepSeek-1, Gemini-2.0-flash, CodeLLaMA
  • RL engine: Sample Factory, APPO
  • Game engine: ViZDoom
  • Tools: Python, PyTorch, Git
import json
import os

from openai import OpenAI

TOOL = {...}  # function-calling schema for the generate_level tool (elided)


def request_level(llm_model: str) -> dict:
    user_prompt = """
Generate a valid 15x15 ASCII-based ViZDoom level.
### HARD CONSTRAINTS (must never be violated):
1. There must be **exactly one player**, placed in the **second row** (index 1). Do not place the player anywhere else.
2. There must be **exactly one goal item**, placed in the **second last row** (index 13). Do not place the goal anywhere else.
3. There must be **exactly 2 to 4 enemies**, located anywhere except player and goal positions.
...
"""

    role_prompt = """
You are an AI designed to generate ASCII-based ViZDoom levels in a strict 15x15 format.
The first and last line must be only walls.
Allowed characters:
"#": "wall"
" ": "space for open areas"
"P": "player"
"G": "goal item"
"E": "enemy"
"""

    # Gemini exposes an OpenAI-compatible endpoint, so the standard client works.
    client = OpenAI(
        api_key=os.getenv("GOOGLE_API_KEY"),
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": user_prompt},
        ],
        tools=[TOOL],
        # Force the model to answer through the generate_level tool call.
        tool_choice={"type": "function", "function": {"name": "generate_level"}},
    )

    # The level arrives as the JSON arguments of the forced tool call.
    arguments_json = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(arguments_json)

Simplified example prompt for level generation using gemini-2.0-flash.
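
Since the model does not always respect the hard constraints, each returned grid should be validated before conversion to .wad/.cfg. A minimal checker for the constraints stated in the prompts above might look like this (the function name is an assumption, not the thesis code):

```python
def validate_level(ascii_level: str) -> bool:
    """Check the hard constraints from the prompt on a 15x15 ASCII grid."""
    rows = ascii_level.strip("\n").split("\n")
    # Strict 15x15 shape with only the allowed characters.
    if len(rows) != 15 or any(len(r) != 15 for r in rows):
        return False
    flat = "".join(rows)
    if any(ch not in "# PGE" for ch in flat):
        return False
    # First and last row must be solid walls.
    if rows[0] != "#" * 15 or rows[14] != "#" * 15:
        return False
    # Exactly one player in row index 1, exactly one goal in row index 13.
    if flat.count("P") != 1 or "P" not in rows[1]:
        return False
    if flat.count("G") != 1 or "G" not in rows[13]:
        return False
    # Between two and four enemies.
    return 2 <= flat.count("E") <= 4
```

Grids that fail the check can simply be regenerated.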


📊 Evaluation

Test design

For quantitative evaluation, each generated level was tested with a fixed agent configuration (Sample Factory, 100 episodes per level). The following metrics were computed:

  • avg_reward: average total reward per episode
  • avg_hit_rate: ratio of successful hits to attempts
  • avg_damage_taken: average damage taken per run
  • avg_damage_per_kill: damage per eliminated enemy
  • avg_ammo_efficiency: ratio of damage to ammo consumption
  • avg_health: average health at end of episode
  • survival_rate: share of successful runs without death

These metrics were extracted automatically from trajectories and aggregated (mean over episodes) as a score for each level.
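
The aggregation step reduces to a plain mean over per-episode records. The keys below follow the metric list above, while the exact episode record format is an assumption about the agent's trajectory logger:

```python
from statistics import mean

def aggregate_metrics(episodes: list[dict]) -> dict:
    """Mean over all episodes for each metric, plus the survival rate."""
    keys = ["reward", "hit_rate", "damage_taken",
            "damage_per_kill", "ammo_efficiency", "health"]
    score = {f"avg_{k}": mean(ep[k] for ep in episodes) for k in keys}
    # Survival rate: share of episodes the agent finished without dying.
    score["survival_rate"] = mean(1.0 if ep["survived"] else 0.0
                                  for ep in episodes)
    return score
```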

Results

  • Playability (survival_rate > 0.5) was achieved in approximately 70% of levels
  • Ammo efficiency proved to be a sensitive indicator for map density and enemy placement
  • avg_reward correlated strongly with health kit distribution and map complexity

The metric combination allowed a nuanced evaluation, detecting playable but poorly balanced levels.
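
Such a verdict can be expressed as a small rule-based classifier. The 0.5 survival threshold matches the playability criterion from the results; the ammo-efficiency cutoff is an illustrative assumption:

```python
def classify_level(score: dict,
                   min_survival: float = 0.5,
                   min_ammo_eff: float = 0.3) -> str:
    """Label a level from its aggregated metrics.

    The survival threshold mirrors the playability criterion above;
    the ammo-efficiency cutoff is an assumed example value.
    """
    if score["survival_rate"] <= min_survival:
        return "unplayable"
    if score["avg_ammo_efficiency"] < min_ammo_eff:
        return "playable but poorly balanced"
    return "playable"
```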


🧩 Discussion

LLMs can generate Doom levels but not always in a usable format. The agent-based evaluator provides functional feedback and reveals balancing issues. Together they form a viable loop for automated level validation.


✅ Conclusion & Outlook

The combination of text-based generation and RL-based evaluation offers an exciting basis for future tools in game development – for example, for automatic test levels, difficulty analysis, or assisted level design.