In my LLMApi project I wanted to support asking for LOTS of data; however, LLMs can only output a limited amount of data at a time, and they aren't particularly fast about it. So I needed a clever way of pre-generating that data and delivering it in 'chunks', so you CAN get big datasets back quickly.
Along with that there was the problem of contexts: a great feature, but until now a context lasted as long as the app. This release adds full support for a sliding cache to eliminate that potential source of memory leaks if you leave the simulator running.
Release 1.8.0 - Advanced Performance & Resource Management
This document describes two powerful systems that make mostlylucid.mockllmapi intelligent, efficient, and production-ready: Automatic Chunking and Response Caching.
Quick Start: See ChunkingAndCaching.http for ready-to-run HTTP examples demonstrating all features.
Auto-chunking is an intelligent system that automatically breaks large requests into optimal chunks when they would exceed your LLM's token limits. It's enabled by default and works transparently—you don't need to change your API calls.
Requesting 100 items with a complex shape might generate a response that exceeds your LLM's output token limit (typically 2048-4096 tokens), causing truncated or failed responses.
The system estimates tokens per item from your shape, splits the request into appropriately sized chunks, carries context between chunks for consistency, and combines the results into a single response.
Add to appsettings.json:
{
"MockLlmApi": {
"BaseUrl": "http://localhost:11434/v1/",
"ModelName": "llama3",
// Auto-Chunking Settings
"MaxOutputTokens": 2048, // LLM output limit (default: 2048)
"EnableAutoChunking": true, // Enable automatic chunking (default: true)
"MaxItems": 1000, // Maximum items per response (default: 1000)
// Input Token Settings
"MaxInputTokens": 2048 // LLM input limit (default: 2048)
}
}
| Setting | Default | Description |
|---|---|---|
| MaxOutputTokens | 2048 | Maximum tokens your LLM can generate. Common values: 512 (small models), 2048 (Llama 3), 4096 (large models) |
| EnableAutoChunking | true | Globally enable/disable automatic chunking |
| MaxItems | 1000 | Hard limit on items per response. Requests exceeding this are capped with a warning |
| MaxInputTokens | 2048 | Maximum prompt size. Used to truncate context history if needed |
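If you prefer strongly typed configuration, these settings bind naturally to an options class via the standard ASP.NET Core options pattern. The sketch below is illustrative only — the class and property names are assumptions, not necessarily the library's actual options type:

```csharp
// Program.cs — bind the "MockLlmApi" section from appsettings.json
// to a strongly typed options class (the class name here is illustrative).
var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<MockLlmApiOptions>(
    builder.Configuration.GetSection("MockLlmApi"));

public class MockLlmApiOptions
{
    public string BaseUrl { get; set; } = "http://localhost:11434/v1/";
    public string ModelName { get; set; } = "llama3";
    public int MaxOutputTokens { get; set; } = 2048;
    public bool EnableAutoChunking { get; set; } = true;
    public int MaxItems { get; set; } = 1000;
    public int MaxInputTokens { get; set; } = 2048;
}
```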
flowchart TD
A[API Request: ?count=100] --> B{Extract Count}
B --> C{Check MaxItems}
C -->|Exceeds| D[Cap at MaxItems]
C -->|Within| E{Check autoChunk param}
D --> E
E -->|false| F[Execute Single Request]
E -->|true/default| G{Estimate Tokens}
G --> H{Will Exceed<br/>MaxOutputTokens?}
H -->|No| F
H -->|Yes| I[Calculate Chunk Strategy]
I --> J[Execute Chunk 1]
J --> K[Build Context for Chunk 2]
K --> L[Execute Chunk 2]
L --> M{More Chunks?}
M -->|Yes| N[Build Context for Next]
M -->|No| O[Combine All Results]
N --> L
O --> P[Return Unified Response]
F --> P
style I fill:#e1f5ff
style O fill:#d4edda
style P fill:#d4edda
The system analyzes your shape JSON to estimate tokens per item:
// Simple shape → ~50 tokens/item
{"id": 1, "name": ""}
// Complex nested shape → ~200 tokens/item
{
"user": {
"id": 1,
"profile": {
"name": "",
"address": {"street": "", "city": ""},
"contacts": [{"type": "", "value": ""}]
}
}
}
Estimation factors:
flowchart LR
A[Shape JSON] --> B[Parse Structure]
B --> C[Count Properties]
B --> D[Measure Nesting Depth]
B --> E[Count Arrays]
C --> F[Base Token Count:<br/>Length / 4]
D --> G[Complexity Multiplier]
E --> G
C --> G
G --> H[Estimated Tokens/Item:<br/>Base × Multiplier]
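To make the flowchart concrete, here is a rough sketch of that kind of analysis. The weights (and therefore the exact numbers it produces) are placeholder assumptions, not the library's actual heuristics — it only illustrates the "length / 4 base × complexity multiplier" idea:

```csharp
using System.Text.Json;

// Illustrative estimator (not the library's actual code): base token count from
// JSON length / 4, scaled by an assumed complexity multiplier for property count,
// nesting depth, and arrays.
static int EstimateTokensPerItem(string shapeJson)
{
    using var doc = JsonDocument.Parse(shapeJson);

    int properties = 0, arrays = 0, maxDepth = 0;
    void Walk(JsonElement element, int depth)
    {
        maxDepth = Math.Max(maxDepth, depth);
        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (var prop in element.EnumerateObject()) { properties++; Walk(prop.Value, depth + 1); }
        }
        else if (element.ValueKind == JsonValueKind.Array)
        {
            arrays++;
            foreach (var item in element.EnumerateArray()) Walk(item, depth + 1);
        }
    }
    Walk(doc.RootElement, 0);

    double baseTokens = shapeJson.Length / 4.0;                                    // ~4 characters per token
    double multiplier = 1 + 0.05 * properties + 0.2 * maxDepth + 0.5 * arrays;     // assumed weights
    return (int)Math.Ceiling(baseTokens * multiplier);
}
```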
Available Output Tokens = MaxOutputTokens × 75% (25% reserved for prompt overhead)
Items Per Chunk = Available Output Tokens / Estimated Tokens Per Item
Total Chunks = Ceiling(Requested Items / Items Per Chunk)
Example: with MaxOutputTokens = 2048, roughly 1536 output tokens are available per chunk. A shape estimated at ~150 tokens/item gives 10 items per chunk, so a request for 100 items becomes 10 chunks.
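The same arithmetic in code — a minimal sketch that applies the three formulas above (the 75% reservation comes from the formulas; the method name is illustrative):

```csharp
// Compute a chunk plan from the formulas above.
static (int ItemsPerChunk, int TotalChunks) PlanChunks(
    int requestedItems, int estimatedTokensPerItem, int maxOutputTokens)
{
    int availableOutputTokens = (int)(maxOutputTokens * 0.75);           // 25% reserved for prompt overhead
    int itemsPerChunk = Math.Max(1, availableOutputTokens / estimatedTokensPerItem);
    int totalChunks = (int)Math.Ceiling(requestedItems / (double)itemsPerChunk);
    return (itemsPerChunk, totalChunks);
}

// PlanChunks(100, 150, 2048) => (10 items/chunk, 10 chunks)
```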
Each chunk is executed as its own LLM request, in order. Subsequent chunks receive context like:
IMPORTANT CONTEXT - Multi-part Response (Part 2/10):
This is a continuation of a larger request. Previous parts have generated:
Part 1: 10 items (first: id=1, name="Alice", last: id=10, name="Jane")
Ensure consistency with the above data (IDs, names, relationships, style).
Continue numbering, IDs, and patterns logically from where the previous part left off.
This ensures IDs don't restart, names stay consistent, and the data feels cohesive.
# Request 100 users
GET /api/mock/users?count=100
Content-Type: application/json
{
"shape": {
"id": 1,
"name": "string",
"email": "email@example.com"
}
}
What Happens: the estimate (100 items at ~100 tokens/item) exceeds the available output tokens, so the request is split into 4 chunks of 25 items, executed in sequence, and combined into a single 100-item response.
Logs:
[INFO] Request needs chunking: 100 items × 100 tokens/item = 10000 tokens > 1536 available
[INFO] AUTO-CHUNKING ENABLED: Breaking request into 4 chunks (25 items/chunk)
[INFO] AUTO-CHUNKING: Executing chunk 1/4 (items 1-25 of 100)
[INFO] AUTO-CHUNKING: Executing chunk 2/4 (items 26-50 of 100)
[INFO] AUTO-CHUNKING: Executing chunk 3/4 (items 51-75 of 100)
[INFO] AUTO-CHUNKING: Executing chunk 4/4 (items 76-100 of 100)
[INFO] AUTO-CHUNKING COMPLETE: Combined 4 chunks into 100 items
# Disable chunking for this specific request
GET /api/mock/users?count=100&autoChunk=false
Use when: you deliberately want a single LLM call (for example, to test how your client handles a truncated response) or you know the response fits within MaxOutputTokens.
# Request 2000 items (exceeds MaxItems=1000)
GET /api/mock/products?count=2000
Result:
[WARN] AUTO-LIMIT: Request for 2000 items exceeds MaxItems limit (1000). Capping to 1000 items
Returns 1000 items automatically chunked.
POST /api/mock/orders?count=50
Content-Type: application/json
{
"shape": {
"orderId": 1,
"customer": {
"id": 1,
"name": "string",
"address": {
"street": "string",
"city": "string",
"country": "string"
}
},
"items": [
{"productId": 1, "name": "string", "quantity": 1, "price": 9.99}
],
"shipping": {"method": "string", "trackingNumber": "string"},
"payment": {"method": "string", "status": "string"}
}
}
What Happens: the deeply nested shape raises the per-item token estimate, so the 50 orders are split into more, smaller chunks (around 10 chunks of 5 items) before being combined into one response.
All chunking operations are logged for observability:
Debug Logs:
Found explicit count in query parameter 'count': 100
No chunking needed for this request

Info Logs:
Request needs chunking: 100 items × 150 tokens = 15000 tokens > 1536 available
Chunking strategy: 100 items → 4 chunks × 25 items/chunk
AUTO-CHUNKING: Executing chunk 2/4 (items 26-50 of 100)
AUTO-CHUNKING COMPLETE: Combined 4 chunks into 100 items

Warning Logs:
AUTO-LIMIT: Request for 2000 items exceeds MaxItems (1000). Capping to 1000

Response caching pre-generates multiple LLM responses for the same request, storing them in memory and serving them one-by-one to provide variety without repeated LLM calls.
Unlike traditional caching (same response every time), this system serves a different pre-generated variant on each call and refills the pool in the background when it runs low.
{
"MockLlmApi": {
// Cache Settings
"MaxCachePerKey": 5, // Variants per unique request (default: 5)
"CacheSlidingExpirationMinutes": 15, // Idle time before expiration (default: 15)
"CacheAbsoluteExpirationMinutes": 60, // Max lifetime (default: 60, null = none)
"CacheRefreshThresholdPercent": 50, // Trigger refill at % empty (default: 50)
"MaxItems": 1000, // Max total cached items (default: 1000)
"CachePriority": 1, // Memory priority: 0=Low, 1=Normal, 2=High, 3=Never (default: 1)
// Advanced Cache Options
"EnableCacheStatistics": false, // Track cache hits/misses (default: false)
"EnableCacheCompression": false // Compress cached responses (default: false)
}
}
stateDiagram-v2
[*] --> ColdCache: First Request
ColdCache --> Generating: Generate N variants
Generating --> WarmCache: Store variants
WarmCache --> ServingFromCache: Subsequent requests
ServingFromCache --> WarmCache: Variants remaining
ServingFromCache --> RefillTriggered: Last variant served
RefillTriggered --> BackgroundRefill: Async refill
BackgroundRefill --> WarmCache: Refill complete
ServingFromCache --> Expired: 15min idle
Expired --> [*]
note right of ColdCache
Cache miss
5x LLM calls
Slow first response
end note
note right of ServingFromCache
Cache hit
Instant response
No LLM call
end note
note right of BackgroundRefill
Transparent refill
User does not wait
Fresh variants ready
end note
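Conceptually, each cache key maps to a small pool of pre-generated variants plus a refill flag. The sketch below illustrates only that rotation-and-refill idea; the type and member names are assumptions, not the library's internals:

```csharp
using System.Collections.Concurrent;

// Simplified variant pool: serve pre-generated responses in rotation,
// and signal a background refill once the pool is drained.
class VariantPool
{
    private readonly ConcurrentQueue<string> _variants = new();
    private int _refilling; // 0 = idle, 1 = refill in progress

    public void Fill(IEnumerable<string> responses)
    {
        foreach (var r in responses) _variants.Enqueue(r);
        Interlocked.Exchange(ref _refilling, 0);
    }

    public bool TryServe(out string? response, out bool needsRefill)
    {
        var served = _variants.TryDequeue(out response);
        // Request a refill when the pool is empty and none is already running.
        needsRefill = _variants.IsEmpty &&
                      Interlocked.CompareExchange(ref _refilling, 1, 0) == 0;
        return served;
    }
}
```

In the library the refill itself runs as asynchronous LLM calls (the "Async refill" transition above); here it is only signalled via needsRefill.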
GET /api/mock/users?shape={"id":1,"name":""}
X-Cache-Count: 5
What Happens: cache miss. The system generates 5 variants (5 LLM calls), stores them under this request's key, and returns variant 1 — the first response is slow.
# Request 2
GET /api/mock/users?shape={"id":1,"name":""}
X-Cache-Count: 5
What Happens: cache hit. Variant 2 is returned instantly, with no LLM call.
# Request 5 (last variant)
GET /api/mock/users?shape={"id":1,"name":""}
X-Cache-Count: 5
What Happens: variant 5 (the last one) is served, and a background refill is triggered to regenerate the pool.
# Request 6 (refill in progress)
GET /api/mock/users?shape={"id":1,"name":""}
X-Cache-Count: 5
What Happens: the background refill is still running; the request is served without waiting for it, and freshly generated variants become available for subsequent requests.
Cache keys are derived from the full request — the endpoint, its parameters (such as count), and the shape — so different requests maintain separate caches:
GET /api/mock/users?count=10 # Cache Key A
GET /api/mock/users?count=20 # Cache Key B (different count)
GET /api/mock/products?count=10 # Cache Key C (different endpoint)
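A minimal sketch of that kind of key derivation, assuming the key is simply a hash over method, path, query string, and shape body (the real library may normalise these differently):

```csharp
using System.Security.Cryptography;
using System.Text;

// Illustrative cache-key derivation: hash everything that distinguishes a request.
static string BuildCacheKey(string method, string path, string queryString, string? shapeJson)
{
    var raw = $"{method}|{path}|{queryString}|{shapeJson}";
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(raw));
    return Convert.ToHexString(hash);
}

// BuildCacheKey("GET", "/api/mock/users", "count=10", null)    // Cache Key A
// BuildCacheKey("GET", "/api/mock/users", "count=20", null)    // Cache Key B
```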
GET /api/mock/users?count=10
X-Cache-Count: 3
Generates 3 variants, serves them sequentially.
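From a client, X-Cache-Count is just a regular request header. For example, with HttpClient (the host and port below are placeholders for wherever the mock API is running):

```csharp
// Placeholder base address — point this at your running mock API instance.
using var client = new HttpClient { BaseAddress = new Uri("http://localhost:5116/") };

var request = new HttpRequestMessage(HttpMethod.Get, "api/mock/users?count=10");
request.Headers.Add("X-Cache-Count", "3"); // pre-generate and rotate 3 variants

var response = await client.SendAsync(request);
Console.WriteLine(await response.Content.ReadAsStringAsync());
```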
POST /api/mock/users
Content-Type: application/json
{
"shape": {
"$cache": 5,
"id": 1,
"name": "string"
}
}
Generates 5 variants.
# Don't cache this request
GET /api/mock/random-data?count=10
No X-Cache-Count or $cache means no caching—fresh LLM response every time.
GET /api/mock/users?count=100
X-Cache-Count: 3
Behavior: the two features compose — each of the 3 cached variants is itself a complete, auto-chunked 100-item response. The first (cold) request is therefore expensive, but subsequent requests are served instantly from the cache.
API contexts (conversation history for maintaining consistency across requests) are stored with 15-minute sliding expiration.
gantt
title Context Sliding Window Expiration (15 minutes)
dateFormat mm:ss
axisFormat %M:%S
section Context Lifecycle
Request 1 creates context :milestone, m1, 00:00, 0m
Active (15min window) :active, 00:00, 10m
Request 2 extends window :milestone, m2, 10:00, 0m
Active (window reset) :active2, 10:00, 15m
Request 3 after 20min idle :milestone, m3, 30:00, 0m
Expired - starts fresh :crit, 25:00, 5m
# Request 1
GET /api/mock/users/123?context=user-session-1
# Creates context "user-session-1" with 15-minute expiration
# Request 2 (10 minutes later)
GET /api/mock/orders?context=user-session-1
# Context still exists, expiration resets to 15 minutes
# Request 3 (20 minutes after Request 2, no activity for 20 minutes)
GET /api/mock/profile?context=user-session-1
# Context expired, starts fresh
Contexts use the same sliding/absolute expiration settings as the response cache, and are additionally truncated to fit the configured input limit:
{
"MockLlmApi": {
"MaxInputTokens": 2048 // Contexts are truncated to fit within this limit
}
}
Context storage is implemented using IMemoryCache with sliding expiration (15 minutes of inactivity) and the absolute expiration described above.
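For illustration, storing a context under those expiration rules looks roughly like this — a sketch using the standard Microsoft.Extensions.Caching.Memory API, not the library's exact code:

```csharp
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());
string conversationHistory = "user: GET /api/mock/users/123 -> { ... }"; // prior exchanges

var entryOptions = new MemoryCacheEntryOptions
{
    SlidingExpiration = TimeSpan.FromMinutes(15),               // resets on every access
    AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(60), // hard upper bound
    Priority = CacheItemPriority.Normal
};

// Store the conversation history for a named context.
cache.Set("context:user-session-1", conversationHistory, entryOptions);

// A later request for the same context refreshes the sliding window.
if (cache.TryGetValue("context:user-session-1", out string? history))
{
    // append the new exchange, truncate to MaxInputTokens, and re-store
}
```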
Use chunking when:
- You request large counts or complex nested shapes that would exceed MaxOutputTokens
- You need the complete dataset rather than a truncated response
Don't disable chunking unless:
- You deliberately want a single LLM call (for example, to test how your client handles truncation)
- You know the response comfortably fits within MaxOutputTokens
Use caching when:
- You call the same endpoint repeatedly and want fast, varied responses without an LLM call every time
- Response time matters more than having every response freshly generated
Don't cache when:
- Every response must be unique and freshly generated
- Memory is constrained and holding extra variants would add noticeable overhead
Small LLMs (e.g., tinyllama):
{
"MaxOutputTokens": 512,
"MaxInputTokens": 1024,
"MaxCachePerKey": 3,
"EnableAutoChunking": true
}
Medium LLMs (e.g., Llama 3):
{
"MaxOutputTokens": 2048,
"MaxInputTokens": 2048,
"MaxCachePerKey": 5,
"EnableAutoChunking": true
}
Large LLMs (e.g., Llama 3 70B):
{
"MaxOutputTokens": 4096,
"MaxInputTokens": 4096,
"MaxCachePerKey": 5,
"EnableAutoChunking": true
}
Problem: "Request for 100 items only returns 50"
Solution: Check logs for chunking execution. If not chunking, increase MaxOutputTokens:
{
"MaxOutputTokens": 4096 // Increase limit
}
Problem: "Chunks have inconsistent IDs (IDs restart at 1 for each chunk)"
Solution: This is a rare LLM behavior issue. Try:
- Adjusting Temperature for more creativity
- Making the instruction explicit in the shape: {"id": "start at 1 and increment"}
Problem: "Too many chunks generated (performance impact)"
Solution:
- Increase MaxOutputTokens
- Use ?autoChunk=false if you don't need the full dataset
Problem: "Same response every time despite cache"
Solution: Cache is empty, background refill may be slow. Check:
- Increase MaxCachePerKey for more variants
Problem: "Cache not being used"
Solution: Ensure you're specifying cache with either:
- Query parameter: ?cache=5
- Header: X-Cache-Count: 5
- In the shape: "$cache": 5
Problem: "Memory usage growing"
Solution:
- Reduce MaxCachePerKey (fewer variants)
- Reduce CacheSlidingExpirationMinutes (expire faster)
- Set EnableCacheCompression: true (trade CPU for memory)
- Lower MaxItems (hard cap on cache size)
Problem: "Context lost between requests"
Solution: Context expired (15 minutes of inactivity). If you need longer, increase CacheSlidingExpirationMinutes (and the absolute expiration, if set).
Problem: "Context becoming too large"
Solution: Contexts are automatically summarized when they exceed MaxInputTokens. Check:
- Increase MaxInputTokens if needed

| Scenario | Without Chunking | With Chunking | Improvement |
|---|---|---|---|
| 100 simple items | Truncated | Complete (4 chunks) | Works vs. Fails |
| 50 complex items | Truncated | Complete (10 chunks) | Works vs. Fails |
| 10 items | 2.3s | 2.3s | No overhead |
| Scenario | Without Cache | With Cache (Warm) | Improvement |
|---|---|---|---|
| Simple request | 2.1s | 0.002s | 1000× faster |
| Complex request | 5.4s | 0.003s | 1800× faster |
| First request | 2.1s | 10.5s (5 variants) | 5× slower (priming) |
| Configuration | Estimated Memory |
|---|---|
| No caching | ~10 MB baseline |
| Cache: 5 variants × 100 keys | +50 MB |
| Cache: 5 variants × 1000 keys | +500 MB |
| Context storage (100 active) | +10 MB |
For more details, see RELEASE_NOTES.md.