Grand Strategy Game - Data Linking & Reference Resolution Architecture

Implementation Status: ✅ Implemented (CrossReferenceBuilder, ReferenceResolver, DataValidator exist)

Recent Update (2025-10-09): ProvinceState refactored for engine-game separation. Game-specific fields moved to HegemonProvinceData. See phase-3-complete-scenario-loader-bug-fixed.md.

Executive Summary

Challenge: Loaded data has string references that need linking to actual game objects
Solution: Multi-phase loading with efficient ID mapping and reference resolution
Key Principle: Convert strings to IDs once at load time, never use strings at runtime
Result: Fast lookups, type-safe references, clear data relationships

The Core Problem

When you load Paradox-style data files, you get string references everywhere (owner="ENG", religion="catholic", trade_good="grain"). These strings need to be resolved to actual runtime IDs and validated for consistency.

Four-Phase Loading Architecture

Phase 1: Discovery & Registration

First pass - discover all entities and assign IDs

Phase 2: Loading & Parsing

Load actual data with string references intact

Phase 3: Linking & Resolution

Convert all string references to runtime IDs

Phase 4: Validation

Validate data integrity after linking

The Registry Pattern

Central Registry System

Registries provide bidirectional mapping between string tags and numeric IDs:

Register(key, item) → assigns ID
Get(id) → retrieves item
GetId(key) → retrieves ID from string
TryGet(key, out item) → safe lookup
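
The four operations above can be sketched as follows. This is a minimal illustration in Python, not the project's actual registry: the class name `Registry`, the snake_case method names, the choice to return ID 0 for an unknown key, and the ushort range check are all assumptions made for the example.

```python
from typing import Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Bidirectional mapping between string keys and dense numeric IDs."""

    def __init__(self) -> None:
        # Assumption: slot 0 is reserved so that ID 0 can mean "none".
        self._items: List[Optional[T]] = [None]
        self._ids: Dict[str, int] = {}

    def register(self, key: str, item: T) -> int:
        """Store the item and assign it the next free ID."""
        if key in self._ids:
            raise KeyError(f"duplicate key: {key}")
        new_id = len(self._items)
        if new_id > 0xFFFF:
            raise OverflowError("registry exceeds the ushort range")
        self._items.append(item)
        self._ids[key] = new_id
        return new_id

    def get(self, item_id: int) -> T:
        """Retrieve an item by numeric ID (plain list indexing)."""
        if not 0 < item_id < len(self._items):
            raise KeyError(f"unknown id: {item_id}")
        return self._items[item_id]

    def get_id(self, key: str) -> int:
        """Retrieve the ID for a string key, or 0 if the key is unknown."""
        return self._ids.get(key, 0)

    def try_get(self, key: str) -> Optional[T]:
        """Safe lookup: return the item, or None if the key is unknown."""
        item_id = self._ids.get(key, 0)
        return self._items[item_id] if item_id else None


countries = Registry[str]()
eng = countries.register("ENG", "England")
fra = countries.register("FRA", "France")
```

String keys are only touched in `register`, `get_id`, and `try_get`, which are meant for load time; `get` is the only call a runtime system would need.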

Game-Specific Registries

Common registries include Countries, Religions, Cultures, TradeGoods, Buildings, Technologies, Governments, Terrains, Units. Provinces use special handling due to value type storage.

ID Mapping Strategy

String Tags to Runtime IDs

Runtime uses numeric IDs (ushort) instead of strings for performance:

O(1) array indexing vs string hashing and comparison, which cost O(n) in the string length
Type-safe when using ID wrapper structs (optional)
Zero string comparisons during gameplay

Type-Safe ID Trade-offs

Pros: Can't accidentally mix different ID types, self-documenting
Cons: Extra type complexity, implicit conversions can confuse debugging
Recommendation: Use plain ushort unless you have many entity types
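
To make the trade-off concrete, here is a small Python sketch of ID wrapper types. The names `CountryId`, `ReligionId`, and `owner_name` are hypothetical. In Python the mix-up is caught by a runtime check; a statically typed language would typically reject it at compile time instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CountryId:
    value: int


@dataclass(frozen=True)
class ReligionId:
    value: int


def owner_name(country_names: List[str], owner: CountryId) -> str:
    """Look up a country name; refuses any ID that is not a CountryId."""
    if not isinstance(owner, CountryId):
        raise TypeError(f"expected CountryId, got {type(owner).__name__}")
    return country_names[owner.value]


names = ["", "England", "France"]  # index 0 is the "none" slot
england = owner_name(names, CountryId(1))
```

The cost is visible too: every access goes through `.value`, and there is one extra type per entity kind to maintain.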

Province ID Handling (Value Types vs Reference Types)

Critical difference: Provinces are value types in NativeArray, not reference types in List.

For reference types (Countries, Religions, Buildings):

Registry stores items in managed List
Dictionary maps string keys to IDs

For provinces (value types in NativeArray):

ProvinceSystem stores states in unmanaged NativeArray
Burst-compatible for performance
Dictionary maps definition IDs to runtime IDs (if needed)

Sparse vs Dense Province IDs

When sparse→dense mapping is worth it:

Province IDs have large gaps
Wasted memory from gaps exceeds ~10% of total
Example: Max ID 10000, but only 3000 actual provinces = 70% waste

When direct indexing is simpler:

Province IDs are mostly contiguous
Wasted memory negligible
Simpler code with direct array indexing

Most Paradox-style games have nearly contiguous IDs. Unless profiling shows memory issues, direct indexing is simpler.
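
The decision can be checked with a few lines of arithmetic. The Python sketch below computes the wasted fraction the same way as the example above (unused slots divided by the maximum ID) and builds a sparse→dense map when needed. The function names and the rule that runtime ID 0 means "none" are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def wasted_fraction(definition_ids: List[int]) -> float:
    """Share of array slots left unused when indexing directly by definition ID."""
    return 1.0 - len(definition_ids) / max(definition_ids)


def build_dense_map(definition_ids: List[int]) -> Tuple[Dict[int, int], List[int]]:
    """Assign contiguous runtime IDs, starting at 1, in ascending definition order."""
    to_runtime: Dict[int, int] = {}
    to_definition: List[int] = [0]  # runtime ID 0 is reserved for "none"
    for definition_id in sorted(definition_ids):
        to_runtime[definition_id] = len(to_definition)
        to_definition.append(definition_id)
    return to_runtime, to_definition


# 3000 provinces whose highest ID is 10000, as in the example above.
sparse_ids = list(range(1, 3000)) + [10000]
waste = wasted_fraction(sparse_ids)

to_runtime, to_definition = build_dense_map([10, 500, 9000])
```

The dictionary is only needed while loading; once every reference holds a runtime ID, lookups are plain array indexing again.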

Reference Resolution System

Raw Data with String References

Data loaded from files contains strings ("ENG", "catholic", "grain"). Runtime needs numeric IDs for performance.

Engine-Game Separation

ENGINE LAYER (8-byte ProvinceState): Generic primitives only - ownerID, controllerID, terrainType, gameDataSlot

GAME LAYER (4-byte HegemonProvinceData): Game-specific hot data - development, fortLevel, unrest, population

Cold Data: Stored separately - names, positions, neighbors, religion, culture, trade goods, buildings
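
The byte budgets above can be sanity-checked with Python's `struct` module. The field widths are an assumption: the text gives only the totals and field names, so this sketch guesses four 16-bit fields for the engine struct and four 8-bit fields for the game struct. The real layout may differ (for example, by using bit fields).

```python
import struct

# Assumed layout: ownerID, controllerID, terrainType, gameDataSlot as unsigned 16-bit.
PROVINCE_STATE = struct.Struct("<HHHH")

# Assumed layout: development, fortLevel, unrest, population as unsigned 8-bit.
HEGEMON_PROVINCE_DATA = struct.Struct("<BBBB")

# Pack one province owned and controlled by country 1 on terrain 3.
packed = PROVINCE_STATE.pack(1, 1, 3, 0)
owner_id, controller_id, terrain_type, game_data_slot = PROVINCE_STATE.unpack(packed)
```

Keeping both structs at fixed, small sizes is what makes it practical to store them in flat arrays indexed by province ID.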

Reference Resolver Pattern

ReferenceResolver converts string references to numeric IDs:

ResolveCountryRef(tag) → countryID
ResolveTerrainRef(terrain) → terrainID
ResolveBuildingList(names) → buildingIDs[]
Collects errors for reporting
Supports deferred resolution for forward references
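
A resolver with these responsibilities might look like the Python sketch below. It is not the project's `ReferenceResolver`: the constructor arguments, the `context` parameter, the error message format, and the callback-based deferral are assumptions chosen to keep the example short.

```python
from typing import Callable, Dict, List


class ReferenceResolver:
    """Converts string references to numeric IDs and collects failures."""

    def __init__(self, countries: Dict[str, int], buildings: Dict[str, int]) -> None:
        self._countries = countries
        self._buildings = buildings
        self.errors: List[str] = []
        self._deferred: List[Callable[[], None]] = []

    def _resolve(self, table: Dict[str, int], kind: str, key: str, context: str) -> int:
        ref_id = table.get(key, 0)  # 0 means "none" / unresolved
        if ref_id == 0:
            self.errors.append(f"{context}: unknown {kind} '{key}'")
        return ref_id

    def resolve_country_ref(self, tag: str, context: str) -> int:
        return self._resolve(self._countries, "country", tag, context)

    def resolve_building_list(self, names: List[str], context: str) -> List[int]:
        ids = [self._resolve(self._buildings, "building", n, context) for n in names]
        return [i for i in ids if i != 0]  # drop unresolved entries, keep the errors

    def defer(self, action: Callable[[], None]) -> None:
        """Queue a resolution that may depend on entities registered later."""
        self._deferred.append(action)

    def run_deferred(self) -> None:
        pending, self._deferred = self._deferred, []
        for action in pending:
            action()


country_ids = {"ENG": 1}
resolver = ReferenceResolver(country_ids, {"temple": 1})
province = {"owner": 0}


def link_owner() -> None:
    province["owner"] = resolver.resolve_country_ref("FRA", "province 183")


resolver.defer(link_owner)   # "FRA" is not registered yet (forward reference)
country_ids["FRA"] = 2       # registered later in the load
resolver.run_deferred()      # now the reference resolves
```

Collecting errors instead of raising on the first one lets a single load report every bad reference in a data set.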

Cross-Reference System

Bidirectional References

After loading, build reverse lookups for performance:

Country → Provinces mapping (O(1) vs O(n))
CultureGroup → Cultures mapping
TradeNode → Provinces mapping

Without reverse mapping, a query like "What does France own?" requires iterating over all provinces (O(n)). With reverse mapping, it is an O(1) array lookup. The key: ONE system owns BOTH mappings and keeps them synchronized.
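
The "one owner for both mappings" rule can be sketched like this in Python. `OwnershipIndex` is a hypothetical name, and using a set per country is an assumption; the point is only that every ownership change goes through a single method that updates both directions.

```python
from typing import List, Set


class OwnershipIndex:
    """Owns both province -> owner and country -> provinces, updated together."""

    def __init__(self, province_count: int, country_count: int) -> None:
        # Index 0 is the "none" slot in both arrays.
        self._owner_of: List[int] = [0] * (province_count + 1)
        self._provinces_of: List[Set[int]] = [set() for _ in range(country_count + 1)]

    def set_owner(self, province_id: int, country_id: int) -> None:
        old_owner = self._owner_of[province_id]
        if old_owner == country_id:
            return
        if old_owner != 0:
            self._provinces_of[old_owner].discard(province_id)
        self._owner_of[province_id] = country_id
        if country_id != 0:
            self._provinces_of[country_id].add(province_id)

    def owner_of(self, province_id: int) -> int:
        return self._owner_of[province_id]

    def provinces_of(self, country_id: int) -> Set[int]:
        return self._provinces_of[country_id]


index = OwnershipIndex(province_count=3, country_count=2)
index.set_owner(1, 2)
index.set_owner(3, 2)
index.set_owner(1, 1)  # province 1 changes hands; both mappings follow
```

Because no other code writes to either array, the two views cannot drift apart.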

Validation System

Data Integrity Checker

Validates loaded data after reference resolution:

Every owned province must have valid owner
Controller must be valid if different from owner
Capital must be owned by country
Technology groups must exist
Countries should own at least one province

Error vs Warning strategy:

Errors: Critical data integrity issues (throw exception)
Warnings: Suspicious but playable situations (log and continue)
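
A Python sketch of the error/warning split, covering three of the checks listed above. The function signature, message wording, and the `ValidationReport` type are assumptions; this is not the project's `DataValidator`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ValidationReport:
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    def raise_if_errors(self) -> None:
        """Errors are fatal; warnings are left for the caller to log."""
        if self.errors:
            raise ValueError("; ".join(self.errors))


def validate(province_owner: List[int], capitals: Dict[int, int],
             country_count: int) -> ValidationReport:
    """province_owner[i] is the owner ID of province i (0 = unowned, index 0 unused)."""
    report = ValidationReport()
    owned = [0] * (country_count + 1)

    for province_id, owner in enumerate(province_owner):
        if province_id == 0 or owner == 0:
            continue
        if owner > country_count:
            report.errors.append(f"province {province_id}: invalid owner {owner}")
        else:
            owned[owner] += 1

    for country_id, capital in capitals.items():
        in_range = 0 < capital < len(province_owner)
        if not in_range or province_owner[capital] != country_id:
            report.errors.append(f"country {country_id}: capital {capital} is not owned by it")

    for country_id in range(1, country_count + 1):
        if owned[country_id] == 0:
            report.warnings.append(f"country {country_id} owns no provinces")

    return report


report = validate([0, 1, 1, 9, 0], {1: 1, 2: 4}, country_count=2)
```

Running every check before raising means one load attempt surfaces all integrity problems rather than only the first.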

Loading Pipeline Implementation

Complete loading process follows sequential steps:

Load static data (definitions with no dependencies)
Register all entities (assign IDs)
Load entity data with string refs
Resolve all references to IDs
Build cross-references (bidirectional mappings)
Validate data integrity
Optimize for runtime
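
The ordering of these steps is easiest to see end to end on toy data. The Python sketch below compresses the steps into one function; the input shapes (`raw_countries`, `raw_provinces`) and the function name `load_all` are made up for the example, and the "optimize" step is omitted.

```python
from typing import Dict, List, Tuple


def load_all(raw_countries: List[str],
             raw_provinces: List[dict]) -> Tuple[Dict[str, int], List[int], List[List[int]]]:
    # Register entities: assign IDs starting at 1 (0 is reserved for "none").
    country_ids = {tag: i for i, tag in enumerate(raw_countries, start=1)}

    # Entity data still carries string references at this point (raw_provinces).

    # Resolve references: strings become IDs exactly once.
    errors: List[str] = []
    max_province = max(p["id"] for p in raw_provinces)
    owner_of = [0] * (max_province + 1)
    for p in raw_provinces:
        owner = country_ids.get(p["owner"], 0)
        if owner == 0:
            errors.append(f"province {p['id']}: unknown owner '{p['owner']}'")
        owner_of[p["id"]] = owner

    # Build cross-references: country -> provinces.
    provinces_of: List[List[int]] = [[] for _ in range(len(country_ids) + 1)]
    for province_id, owner in enumerate(owner_of):
        if owner != 0:
            provinces_of[owner].append(province_id)

    # Validate: fail the load if any reference was bad.
    if errors:
        raise ValueError("; ".join(errors))
    return country_ids, owner_of, provinces_of


country_ids, owner_of, provinces_of = load_all(
    ["ENG", "FRA"],
    [{"id": 1, "owner": "ENG"}, {"id": 2, "owner": "FRA"}, {"id": 3, "owner": "ENG"}],
)
```

After `load_all` returns, nothing downstream needs the tag strings: `owner_of` and `provinces_of` are indexed purely by numeric ID.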

Optimization Strategies

String Interning

During loading, many strings are repeated (government types, religions shared across countries). String interning reduces memory by ensuring identical strings reference same memory location.
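
As an illustration of the idea, Python exposes interning through `sys.intern`. The helper name `intern_fields` and the record shape are assumptions; other runtimes offer their own interning facilities.

```python
import sys
from typing import Iterable


def intern_fields(record: dict, fields: Iterable[str]) -> dict:
    """Replace the named string fields with their interned equivalents, in place."""
    for name in fields:
        value = record.get(name)
        if isinstance(value, str):
            record[name] = sys.intern(value)
    return record


# Two records parsed separately: equal text, built at runtime as separate strings.
country_a = {"tag": "ENG", "religion": "".join(["cath", "olic"])}
country_b = {"tag": "FRA", "religion": "".join(["cath", "olic"])}

intern_fields(country_a, ["religion"])
intern_fields(country_b, ["religion"])

# After interning, both records share one string object.
shared = country_a["religion"] is country_b["religion"]
```

This only matters while strings are still in memory; once references are resolved to IDs, the interned strings can be released along with the raw data.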

Compile-Time ID Generation (Optional)

For frequently-used constants, generate IDs at build time from data files. Enables compile-time constants for common cases, eliminating runtime lookups. Trade-off: Requires build-time code generation.
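
A build-time generator can be very small. The Python sketch below emits one constant per tag as source text. The function name, the `PREFIX_TAG` naming scheme, and the output format are assumptions; a real generator would emit code in the project's own language.

```python
from typing import Iterable


def generate_id_constants(prefix: str, tags: Iterable[str]) -> str:
    """Emit one constant per tag; IDs start at 1 because 0 means 'none'."""
    lines = ["# Generated file - do not edit by hand."]
    for runtime_id, tag in enumerate(tags, start=1):
        lines.append(f"{prefix}_{tag.upper()} = {runtime_id}")
    return "\n".join(lines) + "\n"


source = generate_id_constants("COUNTRY", ["ENG", "FRA"])
```

The generated constants are only valid if the runtime registers entities in the same order the generator saw them, so a load-time check comparing each constant against `GetId(key)` is a sensible safeguard.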

Lazy Loading for Cold Data

Rarely accessed data can be loaded on-demand using lazy loaders. Reduces initial load time and memory usage.
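
A lazy loader is a thin wrapper around a load function that runs on first access. The Python sketch below uses a hypothetical `Lazy` class and a stand-in loader; the real cold data source (file, archive, database) is not specified in this document.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Defers a load until the value is first requested, then caches it."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False

    @property
    def value(self) -> T:
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value


load_calls: List[str] = []


def load_province_names() -> dict:
    # Stand-in for an expensive read of cold data.
    load_calls.append("names")
    return {1: "Hypothetical Province"}


province_names = Lazy(load_province_names)  # nothing is loaded yet
```

The first read of `province_names.value` triggers the load; later reads return the cached result without calling the loader again.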

Error Handling

Missing Reference Strategies

ThrowException: Fail fast (use for critical refs like countries)
LogWarning: Continue with warning (use for optional refs)
UseDefault: Silent fallback (use for cosmetic refs)
CreatePlaceholder: Generate missing entity (use for modding)
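
The four strategies map naturally onto an enum plus one dispatch function. In this Python sketch the names `MissingRefStrategy` and `resolve_or_handle` are hypothetical, and the placeholder ID is computed as `len(table) + 1`, which assumes IDs are dense and start at 1.

```python
from enum import Enum, auto
from typing import Dict, List


class MissingRefStrategy(Enum):
    THROW_EXCEPTION = auto()
    LOG_WARNING = auto()
    USE_DEFAULT = auto()
    CREATE_PLACEHOLDER = auto()


def resolve_or_handle(table: Dict[str, int], key: str, strategy: MissingRefStrategy,
                      log: List[str], default_id: int = 0) -> int:
    """Return the ID for key, or apply the chosen strategy if it is missing."""
    found = table.get(key)
    if found is not None:
        return found
    if strategy is MissingRefStrategy.THROW_EXCEPTION:
        raise KeyError(f"missing reference: {key}")
    if strategy is MissingRefStrategy.LOG_WARNING:
        log.append(f"warning: missing reference '{key}'")
        return 0
    if strategy is MissingRefStrategy.USE_DEFAULT:
        return default_id  # silent fallback
    # CREATE_PLACEHOLDER: register the missing key so later lookups succeed.
    new_id = len(table) + 1
    table[key] = new_id
    log.append(f"created placeholder '{key}' with id {new_id}")
    return new_id


religions = {"catholic": 1}
log: List[str] = []
warned = resolve_or_handle(religions, "norse", MissingRefStrategy.LOG_WARNING, log)
fallback = resolve_or_handle(religions, "norse", MissingRefStrategy.USE_DEFAULT, log, default_id=1)
placeholder = resolve_or_handle(religions, "norse", MissingRefStrategy.CREATE_PLACEHOLDER, log)
```

Choosing the strategy per reference kind, rather than globally, is what lets critical references fail fast while cosmetic ones fall back quietly.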

Performance Considerations

Memory Layout

ENGINE (8-byte ProvinceState): Generic primitives only, compact for cache efficiency

GAME (4-byte HegemonProvinceData): Game-specific hot data, separate from engine

Cold Data: Separate storage accessed rarely

This separation keeps hot data cache-friendly for high performance.

Lookup Performance

O(1) array lookup for runtime IDs (direct array access)
O(1) dictionary lookup for string tags (only during loading)
Avoid O(n) searches at runtime

Burst Compatibility

Ensuring Burst Can Compile Hot Path

Burst-compatible: Value type structs in NativeArray for hot data (ProvinceState)
NOT Burst-compatible: Managed references (List, Dictionary) for cold data

Key principle: Hot data must be in NativeArray for Burst. Cold data can use managed collections since it's accessed rarely.

Usage Examples

Loading Process

Complete loading process: Load all game data → Link with IDs → Runtime uses IDs only. No string comparisons at runtime.

Modding Support

Mods can add new entities or overwrite base game data. Reference resolution runs again for mod data. Re-validate after mod loading to catch conflicts.
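
A sketch of why re-validation matters after mod loading, in Python. The merge rule (mod entries replace base entries with the same key) and the function names are assumptions; the actual mod system is described in the modding design document.

```python
from typing import List


def apply_mod(base: dict, mod: dict) -> dict:
    """Return a merged copy: mod entries override matching keys and add new ones."""
    merged = dict(base)
    merged.update(mod)
    return merged


def find_unresolved_owners(provinces: dict, countries: dict) -> List[int]:
    """Province IDs whose owner tag does not exist after merging."""
    return sorted(pid for pid, data in provinces.items() if data["owner"] not in countries)


base_countries = {"ENG": {"name": "England"}, "FRA": {"name": "France"}}
base_provinces = {1: {"owner": "ENG"}, 2: {"owner": "FRA"}}

mod_countries = {"FRA": {"name": "Kingdom of France"}}        # overwrites base data
mod_provinces = {2: {"owner": "BUR"}, 3: {"owner": "ENG"}}    # "BUR" is never defined

countries = apply_mod(base_countries, mod_countries)
provinces = apply_mod(base_provinces, mod_provinces)
conflicts = find_unresolved_owners(provinces, countries)
```

Here the mod is individually well-formed but references a tag that no loaded data defines, which only a second resolution pass over the merged data can detect.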

Best Practices

Never use strings at runtime - Convert everything to IDs during loading
Validate early and often - Catch bad references during loading, not gameplay
Use ushort for IDs - Simpler than typed wrappers unless you have complex cross-referencing
Dense arrays over dictionaries - Array indexing is much faster (unless IDs are very sparse)
Separate hot/cold data - Compact hot structs in NativeArray, cold data in Dictionary
Build reverse lookups once - Don't search arrays repeatedly (O(n) → O(1))
Reserve 0 for "none" - Makes checking for unset values easy
Log all resolution failures - Help modders debug their data
Support partial loading - Allow game to run with some missing optional data
Keep hot data Burst-compatible - NativeArray with value types only

Summary

This linking architecture ensures:

Type safety through ID-based references
Performance through array indexing instead of string lookups
Validation that catches bad references at load time
Flexibility for mods to extend base game data
Memory efficiency through compact structs and separate cold data
Burst compatibility through NativeArray and value types
Clear error messages for debugging data issues

The key is the phased approach: discover entities, load raw data with strings, resolve strings to IDs, then validate the result. This allows handling forward references, validating everything, and converting to efficient runtime representations.

Related Documents

data-flow-architecture.md - System communication and bidirectional mappings
performance-architecture-guide.md - Memory layout and cache optimization
../Planning/modding-design.md - Mod system uses same reference resolution

Last Updated: 2025-10-15
