Bachelor thesis: Level generation with artificial intelligence

By Moritz Leopold

Created on June 10, 2025


As part of my bachelor’s thesis at the University of Applied Sciences Augsburg, I explored the combination of two modern approaches in AI-driven game development: text-based generation of game content using Large Language Models (LLMs) and automated evaluation of playable content using reinforcement learning (RL).

An agent plays generated levels and collects metrics for validation.


🔍 State of the art

Procedural Content Generation

Procedural level generation is an established technique in game development. Traditional methods rely on hand-crafted rules or randomness and are rarely context-sensitive.

LLMs for level generation

Recent work demonstrates that language models like GPT-4 or domain models can generate structured game content (e.g., JSON, XML, or Doom WAD) from prompt text.

RL for evaluation

Instead of manual playtesting, an RL agent can be trained to assess the playability, difficulty, and navigability of a level purely through interaction with the generated environment.

ViZDoom as platform

ViZDoom provides a controllable first-person shooter environment based on the Doom engine, with Python API access, support for .wad/.cfg levels, and high modifiability.


🧠 Design & methodology

Objective

How feasible and efficient is it to generate levels with LLMs and evaluate them automatically?

System architecture

  • LLM generates Doom levels in a textual format
  • Levels are automatically converted to .wad/.cfg
  • RL agent plays levels via Sample Factory (APPO)
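
End to end, the architecture can be sketched as a generate/convert/evaluate loop. Every function below is an illustrative stub with a hypothetical name; the real ASCII-to-.wad conversion and Sample Factory integration are considerably more involved:

```python
# Illustrative sketch of the pipeline; every function body is a stub.

def generate_ascii_level(llm_model: str) -> str:
    """Stub for the LLM call: returns a fixed, constraint-conforming 15x15 grid."""
    return (
        "#" * 15 + "\n"
        + "#P" + " " * 12 + "#\n"            # player in row index 1
        + ("#" + " " * 13 + "#\n") * 10      # open interior
        + "#  E        E #\n"                # two enemies
        + "#G" + " " * 12 + "#\n"            # goal in row index 13
        + "#" * 15
    )

def convert_to_wad(ascii_level: str, out_path: str) -> str:
    """Stub for the ASCII -> .wad/.cfg conversion step."""
    return out_path

def evaluate_with_agent(wad_path: str, episodes: int = 100) -> dict:
    """Stub for the Sample Factory (APPO) evaluation run."""
    return {"survival_rate": 0.7, "avg_reward": 1.0}

level = generate_ascii_level("gemini-2.0-flash")
metrics = evaluate_with_agent(convert_to_wad(level, "level_01.wad"))
```

Each stub corresponds to one bullet of the architecture above.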

💡 Implementation

Tech stack

  • LLMs: DeepSeek-1, Gemini-2.0-flash, CodeLLaMA
  • RL engine: Sample Factory, APPO
  • Game engine: ViZDoom
  • Tools: Python, PyTorch, Git
import json
import os

from openai import OpenAI

TOOL = {...}  # function-calling schema for the generate_level tool (elided)


def request_level(llm_model: str) -> dict:
    user_prompt = """
Generate a valid 15x15 ASCII-based ViZDoom level.
### HARD CONSTRAINTS (must never be violated):
1. There must be **exactly one player**, placed in the **second row** (index 1). Do not place the player anywhere else.
2. There must be **exactly one goal item**, placed in the **second last row** (index 13). Do not place the goal anywhere else.
3. There must be **exactly 2 to 4 enemies**, located anywhere except player and goal positions.
...
"""

    role_prompt = """
You are an AI designed to generate ASCII-based ViZDoom levels in a strict 15x15 format.
The first and last line must be only walls.
Allowed characters:
"#": "wall"
" ": "space for open areas"
"P": "player"
"G": "goal item"
"E": "enemy"
"""

    # Gemini exposes an OpenAI-compatible endpoint, so the standard client works.
    client = OpenAI(
        api_key=os.getenv("GOOGLE_API_KEY"),
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": user_prompt},
        ],
        tools=[TOOL],
        # Force the model to answer through the generate_level tool call.
        tool_choice={"type": "function", "function": {"name": "generate_level"}},
    )

    # The level arrives as the JSON arguments of the forced tool call.
    arguments_json = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(arguments_json)

Simplified example prompt for level generation using gemini-2.0-flash.
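
Since the model does not always respect the hard constraints, each returned grid should be validated before conversion to .wad/.cfg. A minimal checker for the constraints stated in the prompts above might look like this (the function name is an assumption, not the thesis code):

```python
def validate_level(ascii_level: str) -> bool:
    """Check the hard constraints from the prompt on a 15x15 ASCII grid."""
    rows = ascii_level.strip("\n").split("\n")
    # Strict 15x15 shape with only the allowed characters.
    if len(rows) != 15 or any(len(r) != 15 for r in rows):
        return False
    flat = "".join(rows)
    if any(ch not in "# PGE" for ch in flat):
        return False
    # First and last row must be solid walls.
    if rows[0] != "#" * 15 or rows[14] != "#" * 15:
        return False
    # Exactly one player in row index 1, exactly one goal in row index 13.
    if flat.count("P") != 1 or "P" not in rows[1]:
        return False
    if flat.count("G") != 1 or "G" not in rows[13]:
        return False
    # Between two and four enemies.
    return 2 <= flat.count("E") <= 4
```

Grids that fail the check can simply be regenerated.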


📊 Evaluation

Test design

For quantitative evaluation, each generated level was tested with a fixed agent configuration (Sample Factory, 100 episodes per level). The following metrics were computed:

  • avg_reward: average total reward per episode
  • avg_hit_rate: ratio of successful hits to attempts
  • avg_damage_taken: average damage taken per run
  • avg_damage_per_kill: damage per eliminated enemy
  • avg_ammo_efficiency: ratio of damage to ammo consumption
  • avg_health: average health at end of episode
  • survival_rate: share of successful runs without death

These metrics were extracted automatically from trajectories and aggregated (mean over episodes) as a score for each level.
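
The aggregation step reduces to a plain mean over per-episode records. The keys below follow the metric list above, while the exact episode record format is an assumption about the agent's trajectory logger:

```python
from statistics import mean

def aggregate_metrics(episodes: list[dict]) -> dict:
    """Mean over all episodes for each metric, plus the survival rate."""
    keys = ["reward", "hit_rate", "damage_taken",
            "damage_per_kill", "ammo_efficiency", "health"]
    score = {f"avg_{k}": mean(ep[k] for ep in episodes) for k in keys}
    # Survival rate: share of episodes the agent finished without dying.
    score["survival_rate"] = mean(1.0 if ep["survived"] else 0.0
                                  for ep in episodes)
    return score
```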

Results

  • Playability (survival_rate > 0.5) was achieved in approximately 70% of levels
  • Ammo efficiency proved to be a sensitive indicator for map density and enemy placement
  • avg_reward correlated strongly with health kit distribution and map complexity

The metric combination allowed a nuanced evaluation, detecting playable but poorly balanced levels.
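
Such a verdict can be expressed as a small rule-based classifier. The 0.5 survival threshold matches the playability criterion from the results; the ammo-efficiency cutoff is an illustrative assumption:

```python
def classify_level(score: dict,
                   min_survival: float = 0.5,
                   min_ammo_eff: float = 0.3) -> str:
    """Label a level from its aggregated metrics.

    The survival threshold mirrors the playability criterion above;
    the ammo-efficiency cutoff is an assumed example value.
    """
    if score["survival_rate"] <= min_survival:
        return "unplayable"
    if score["avg_ammo_efficiency"] < min_ammo_eff:
        return "playable but poorly balanced"
    return "playable"
```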


🧩 Discussion

LLMs can generate Doom levels but not always in a usable format. The agent-based evaluator provides functional feedback and reveals balancing issues. Together they form a viable loop for automated level validation.


✅ Conclusion & Outlook

The combination of text-based generation and RL-based evaluation offers an exciting basis for future tools in game development – for example, for automatic test levels, difficulty analysis, or assisted level design.