As part of my bachelor’s thesis at the University of Applied Sciences Augsburg, I explored the combination of two modern approaches in AI-driven game development: text-based generation of game content using Large Language Models (LLMs) and automated evaluation of playable content using reinforcement learning (RL).

An agent plays generated levels and collects metrics for validation.
🔍 State of the art
Procedural Content Generation
Procedural level generation is an established technique in game development. Traditional methods are rule-based or random, and are rarely context-sensitive.
LLMs for level generation
Recent work demonstrates that language models such as GPT-4 or domain-specific models can generate structured game content (e.g., JSON, XML, or Doom WAD files) from prompt text.
RL for evaluation
Instead of manual testing, an RL agent was trained to assess playability, difficulty and navigability of a level purely through interaction with the generated environment.
ViZDoom as platform
ViZDoom provides a controllable FPS environment based on the Doom engine — with API access, .wad/.cfg level support and high modifiability.
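To give a sense of that API, here is a minimal interaction sketch in Python; the config and .wad file names are placeholders rather than files from the thesis, and a trained agent would pick actions instead of the fixed one used here.

import vizdoom as vzd

# Load a scenario config and a generated level, then step through one episode.
game = vzd.DoomGame()
game.load_config("scenario.cfg")              # placeholder .cfg
game.set_doom_scenario_path("generated.wad")  # placeholder generated level
game.init()

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()                  # screen buffer + game variables
    reward = game.make_action([1, 0, 0])      # e.g. [MOVE_FORWARD, TURN_LEFT, ATTACK]
print("total reward:", game.get_total_reward())
game.close()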
🧠 Design & methodology
Objective
How feasible and efficient is it to generate levels with LLMs and evaluate them automatically?
System architecture
- LLM generates Doom levels in a textual format
- Levels are automatically converted to .wad/.cfg
- RL agent plays the levels via Sample Factory (APPO), as sketched below
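A compressed sketch of how these stages could be chained; the helper names (generate_ascii_level, build_wad, evaluate_with_agent) are illustrative stand-ins, not the actual module names from the thesis.

# Hypothetical glue code for the generate -> convert -> evaluate loop.
def generate_and_evaluate(llm_model: str, n_levels: int = 10) -> list[dict]:
    results = []
    for i in range(n_levels):
        level = generate_ascii_level(llm_model)                 # 1) LLM -> ASCII grid
        wad_path = build_wad(level, f"level_{i:03d}.wad")       # 2) ASCII -> playable .wad
        metrics = evaluate_with_agent(wad_path, episodes=100)   # 3) APPO agent rollouts
        results.append({"level": i, **metrics})
    return results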
💡 Implementation
Tech stack
- LLMs: DeepSeek-R1, Gemini-2.0-flash, CodeLLaMA
- RL engine: Sample Factory, APPO
- Game engine: ViZDoom
- Tools: Python, PyTorch, Git
import json
import os
from openai import OpenAI

TOOL = {...}  # function-calling schema for generate_level (omitted)

user_prompt = """
Generate a valid 15x15 ASCII-based ViZDoom level.
### HARD CONSTRAINTS (must never be violated):
1. There must be **exactly one player**, placed in the **second row** (index 1). Do not place the player anywhere else.
2. There must be **exactly one goal item**, placed in the **second last row** (index 13). Do not place the goal anywhere else.
3. There must be **exactly 2 to 4 enemies**, located anywhere except player and goal positions.
...
"""
role_prompt = """
You are an AI designed to generate ASCII-based ViZDoom levels in a strict 15x15 format.
The first and last line must be only walls.
Allowed characters:
"#": "wall"
" ": "space for open areas"
"P": "player"
"G": "goal item"
"E": "enemy"
"""

def request_level(llm_model: str) -> dict:
    # Gemini is called through its OpenAI-compatible endpoint.
    client = OpenAI(
        api_key=os.getenv("GOOGLE_API_KEY"),
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    response = client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": user_prompt},
        ],
        tools=[TOOL],
        tool_choice={"type": "function", "function": {"name": "generate_level"}},
    )
    # The level comes back as the arguments of the forced tool call.
    arguments_json = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(arguments_json)
Simplified example prompt for level generation using gemini-2.0-flash.
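Because the model does not always respect the hard constraints, a small validator can reject malformed grids before conversion. The following check is a sketch based on the constraints listed above; the function name and return convention are my own, not code from the thesis.

def validate_level(level: list[str]) -> bool:
    """Check an ASCII grid against the hard constraints from the prompt."""
    if len(level) != 15 or any(len(row) != 15 for row in level):
        return False
    flat = "".join(level)
    if flat.count("P") != 1 or "P" not in level[1]:        # exactly one player in row index 1
        return False
    if flat.count("G") != 1 or "G" not in level[13]:       # exactly one goal item in row index 13
        return False
    if not 2 <= flat.count("E") <= 4:                      # 2 to 4 enemies
        return False
    if set(level[0]) != {"#"} or set(level[14]) != {"#"}:  # first and last row must be walls only
        return False
    return True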
import json
from omg import *

class WADBuilder:
    ...
    def build(self):
        # Add a vertex for every wall block and a Thing for every player/enemy/goal marker.
        for h, row in enumerate(self.level):
            for w, block in enumerate(row.strip()):
                if block == '#':
                    self.__add_vertex(w, h)
                elif block in self.thing_ids:
                    self.__add_thing(w, h, block)
        # Surround the map with a closed rectangle of boundary lines.
        corners = [(0, 0), (self.max_w, 0), (self.max_w, self.max_h), (0, self.max_h)]
        for v in corners:
            self.__add_vertex(*v)
        for i in range(len(corners)):
            if i != len(corners) - 1:
                self.__add_line(corners[i], corners[i + 1], True)
            else:
                self.__add_line(corners[i], corners[0], True)
        # Now connect the walls: adjacent wall vertices become linedefs.
        for h, row in enumerate(self.level):
            for w, _ in enumerate(row):
                if (w, h) not in self.v_indexes:
                    continue
                if (w + 1, h) in self.v_indexes:
                    self.__add_line((w, h), (w + 1, h))
                if (w, h + 1) in self.v_indexes:
                    self.__add_line((w, h), (w, h + 1))
        return self.things, self.vertexes, self.linedefs
The conversion from JSON level representations to playable .wad format is based on a modified version of MazeExplorer.
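As a hedged usage sketch, the builder can be fed the ASCII grid directly; the constructor signature is assumed here, and the step that assembles the returned lumps into a MAP01 entry and writes the .wad is omitted because it is not shown above.

# Illustrative usage with a truncated 15x15 grid.
ascii_level = [
    "###############",
    "#P            #",
    # ... remaining rows of the 15x15 grid ...
    "#      G      #",
    "###############",
]

builder = WADBuilder(ascii_level)                 # assumes the constructor takes the grid
things, vertexes, linedefs = builder.build()
print(f"{len(things)} things, {len(vertexes)} vertexes, {len(linedefs)} linedefs")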
Training:
python train_custom_vizdoom_env.py --env=perfect_example --train_for_env_steps=100000000 --algo=APPO \
--env_frameskip=4 --use_rnn=True --batch_size=2048 --wide_aspect_ratio=False --num_workers=20 \
--num_envs_per_worker=20 --num_policies=1 --experiment=ml_bachelor_doom --save_every_sec=300 \
--experiment_summaries_interval=10
Evaluating:
python eval_custom_vizdoom_env.py --algo=APPO --env=perfect_example --experiment=ml_bachelor_doom \
--eval_scenario_dir=../doom-map-converter/generated_levels/ --num_episodes_per_scenario=100
Enjoying:
python enjoy_custom_vizdoom_env.py --env=perfect_example --algo=APPO --experiment=ml_bachelor_doom
📊 Evaluation
Test design
For quantitative evaluation, each generated level was tested with a fixed agent configuration (Sample Factory, 100 episodes per level). The following metrics were computed:
- avg_reward: average total reward per episode
- avg_hit_rate: ratio of successful hits to attempts
- avg_damage_taken: average damage taken per run
- avg_damage_per_kill: damage per eliminated enemy
- avg_ammo_efficiency: ratio of damage to ammo consumption
- avg_health: average health at end of episode
- survival_rate: share of successful runs without death
These metrics were extracted automatically from trajectories and aggregated (mean over episodes) as a score for each level.
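As a rough illustration of that aggregation step, assuming per-episode metrics are collected as dictionaries (the field names below are assumptions chosen to match the metric list; the helper itself is not thesis code):

from statistics import mean

def aggregate_episodes(episodes: list[dict]) -> dict:
    """Average per-episode values into a level score."""
    fields = ["reward", "hit_rate", "damage_taken",
              "damage_per_kill", "ammo_efficiency", "health"]
    score = {f"avg_{f}": mean(ep[f] for ep in episodes) for f in fields}
    # survival_rate: share of episodes the agent finished without dying
    score["survival_rate"] = mean(1.0 if ep["survived"] else 0.0 for ep in episodes)
    return score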
Results
- Playability (survival_rate > 0.5) was achieved in approximately 70% of levels
- Ammo efficiency proved to be a sensitive indicator for map density and enemy placement
- avg_reward correlated strongly with health kit distribution and map complexity
The metric combination allowed a nuanced evaluation, detecting playable but poorly balanced levels.
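For example, the playability criterion used above could be applied as a simple filter over the aggregated level scores; the threshold comes from the results, the helper is illustrative only.

PLAYABLE_SURVIVAL_THRESHOLD = 0.5  # threshold from the results above

def is_playable(score: dict) -> bool:
    """A level counts as playable if the agent survives more than half of its episodes."""
    return score["survival_rate"] > PLAYABLE_SURVIVAL_THRESHOLD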
🧩 Discussion
LLMs can generate Doom levels but not always in a usable format. The agent-based evaluator provides functional feedback and reveals balancing issues. Together they form a viable loop for automated level validation.
✅ Conclusion & Outlook
The combination of text-based generation and RL-based evaluation offers an exciting basis for future tools in game development – for example, for automatic test levels, difficulty analysis, or assisted level design.