
Performance

Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.

📋 Table of Contents

  1. Performance Overview
  2. Latency Breakdown
  3. Throughput
  4. Memory Usage
  5. Benchmarking
  6. Optimization Strategies
  7. Scalability Limits
  8. Profiling
  9. Quick Tuning Cheatsheet

Performance Overview

OpenQuery is designed for low-latency interactive use (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.

Key Metrics

| Metric | Typical | Best Case | Worst Case |
|---|---|---|---|
| End-to-End Latency | 15-50s | 10s | 120s+ |
| API Cost | $0.01-0.05 | $0.005 | $0.20+ |
| Memory Footprint | 100-300MB | 50MB | 1GB+ |
| Network I/O | 5-20MB | 1MB | 100MB+ |

Note: Wide variance due to network latency, content size, and LLM speed.


Latency Breakdown

Default Configuration

-q 3 -r 5 -c 3 (3 queries, 5 results each, 3 final chunks)

| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|---|---|---|---|---|---|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| **Total** | | | 16-50s | ~60s | |

Phase Details

Phase 1: Query Generation (2-5s)

  • Single non-streaming LLM call
  • Input: system prompt + user question (~200 tokens)
  • Output: JSON array of 3-5 short strings (~50 tokens)
  • Fast because small context and output

Phase 2a: Searches (3-8s)

  • 3 parallel SearxngClient.SearchAsync calls
  • Each: query → SearxNG → aggregator engines → scraped results
  • Latency highly variable based on:
    • SearxNG instance performance
    • Network distance to SearxNG
    • SearxNG's upstream search engines

Phase 2b: Article Fetching (5-15s)

  • ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
  • Up to 10 concurrent fetches (semaphore)
  • Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
  • Latency:
    • Fast sites (CDN, cached): 200-500ms
    • Normal sites: 1-3s
    • Slow/unresponsive sites: timeout after ~30s

Why 5-15s for 15 URLs with 10 concurrent?

  • First wave (10 URLs): bounded by the slowest of the 10, ≈3s
  • Second wave (5 URLs): another ≈3s → ≈6s total
  • If most URLs are fast (≈500ms), the total can drop to ≈2-3s
  • But a few slow sites (5-10s) dominate the total when present

Tail latency: Slowest few URLs can dominate total time. Cannot proceed until all fetch attempts complete (or fail).
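The wave arithmetic above can be captured in a small, language-agnostic scheduling model (sketched in Python here; the actual implementation uses a .NET semaphore): each fetch occupies the earliest free slot, so a handful of slow sites sets the total.

```python
import heapq

def fetch_makespan(latencies, max_concurrent):
    """Estimate wall-clock time to fetch URLs with bounded concurrency:
    each fetch starts as soon as a slot frees, and the total is the
    finish time of the last slot."""
    slots = [0.0] * max_concurrent      # finish time of each concurrency slot
    heapq.heapify(slots)
    for t in latencies:
        start = heapq.heappop(slots)    # earliest free slot
        heapq.heappush(slots, start + t)
    return max(slots)
```

For 15 URLs at 3s each with 10 slots this gives the two-wave 6s figure; a single 10s straggler among fast URLs pushes the makespan to 10s, matching the tail-latency note above.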

Phase 2c: Chunking (<1s)

  • CPU-bound HTML cleaning and splitting
  • SmartReader is fast for a managed (C#) HTML parser
  • Typically 100-300 chunks total
  • <1s on modern CPU

Phase 3: Embeddings (1.5-4s)

  • Query embedding: 1 call, ~200 tokens, ≈0.5-1s
  • Chunk embeddings: ≈50 chunks → 1 batch (batch size 300, so one batch suffices)
    • A batch of 50 is still a single API call, ~6K tokens (50 × ~500 chars ≈ 25K chars)
    • With text-embedding-3-small at $0.00002 per 1K tokens → ~$0.0001 per batch
    • Latency: 1-3s for the embedding API

With more chunks (say 500), there would be 2 batches → maybe 2-4s.

Parallel batches (4 concurrent) only start to help once there are multiple batches (1500+ chunks → 5+ batches).

Phase 4: Ranking (<0.1s)

  • Cosine similarity for 50-100 chunks
  • Each: dot product + normalization, O(d) with d = 1536
  • 100 × 1536 ≈ 150K multiply-adds → negligible on a modern CPU
  • SIMD acceleration from TensorPrimitives
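The ranking math is plain cosine similarity per chunk. A minimal Python sketch of the idea (the real code uses C# TensorPrimitives with SIMD; `rank_chunks` is an illustrative name, not a method from the codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_chunks(query_emb, chunk_embs, top_k=3):
    """Score every chunk against the query embedding, keep the best top_k."""
    scores = [(cosine_similarity(query_emb, e), i) for i, e in enumerate(chunk_embs)]
    return sorted(scores, reverse=True)[:top_k]
```

At 100 chunks × 1536 dimensions even this naive loop finishes in milliseconds, which is why Phase 4 never shows up in profiles.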

Phase 5: Final Answer (5-20s)

  • Streaming chat completion
  • Input: system prompt + context (a few hundred tokens for 3 × 500-char chunks) + question
  • Output: varies wildly (200-2000 tokens typically)
  • Longer context slightly increases latency
  • Model choice major factor:
    • Qwen Flash: fast (5-10s for 1000 output tokens)
    • Gemini Flash: moderate (10-15s)
    • Llama-class: slower (20-40s)

Throughput

Sequential Execution

Running queries one after another (default CLI behavior):

  • Latency per query: 16-50s
  • Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)

But API rate limits will kick in before that:

  • OpenRouter free tier: limited RPM/TPM
  • Even paid: soft limits

Concurrent Execution (Multiple OpenQuery Instances)

You could run multiple OpenQuery processes in parallel (different terminals), but they share:

  • Same API key (OpenRouter rate limit is per API key, not per process)
  • Same SearxNG instance (could saturate it)

Practical: 3-5 concurrent processes before hitting diminishing returns or rate limits.

Throughput Optimization

To maximize queries per hour:

  1. Use fastest model (Qwen Flash)
  2. Reduce --chunks to 1-2
  3. Reduce --queries to 1
  4. Use local/fast SearxNG
  5. Cache embedding results (not implemented)
  6. Batch multiple questions in one process (not implemented; would require redesign)

Achievable: Maybe 500-1000 queries/hour on paid OpenRouter plan with aggressive settings.


Memory Usage

Baseline

.NET 10 AOT app with dependencies:

  • Code: ~30MB (AOT compiled native code)
  • Runtime: ~20MB (.NET runtime overhead)
  • Base Memory: ~50MB

Per-Query Memory

| Component | Memory | Lifetime |
|---|---|---|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items, text) | ~50KB | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| **Total per query** | ~2-5MB (excluding base) | Released after completion |

Peak: When all articles fetched but not yet embedded, we have text ~500KB + chunks ~650KB = ~1.2MB + overhead ≈ 2-3MB.

If processing many queries in parallel (unlikely for CLI), memory would scale linearly.
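As a sanity check on the table, a back-of-the-envelope calculator (Python sketch; all sizes are the table's estimates, not measurements):

```python
def per_query_memory_kb(n_chunks=100, dim=1536, text_kb=500, chunk_text_kb=50):
    """Rough steady-state memory per query: extracted article text,
    chunk text, and float32 embeddings (n_chunks * dim * 4 bytes)."""
    embeddings_kb = n_chunks * dim * 4 / 1024
    return text_kb + chunk_text_kb + embeddings_kb
```

With the defaults this comes to ~1150KB (~1.1MB), consistent with the ~1.2MB peak figure below; HTTP buffers and runtime overhead account for the rest of the 2-5MB range.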

Memory Leak Risks

  • HttpClient instances: created per OpenRouterClient and SearxngClient; they should be disposed but currently are not. Acceptable only because the short-lived process exits anyway.
  • StatusReporter background task: Disposed via using
  • RateLimiter semaphore: Disposed via IAsyncDisposable if wrapped in using (not currently, but short-lived)

No major leaks observed.

Memory Optimization Opportunities

  1. Reuse HttpClient with IHttpClientFactory (but not needed for CLI)
  2. Stream article fetching instead of buffering all articles before embedding (possible: embed as URLs complete)
  3. Early chunk filtering: Discard low-quality chunks before embedding to reduce embedding count
  4. Cache embeddings: By content hash, avoid re-embedding seen text (would need persistent storage)

Benchmarking

Methodology

Measure with time command and verbose logging:

time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt

Parse log for timestamps (or add them manually by modifying code).
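A helper for the "parse log for timestamps" step might look like this (Python sketch; the `[HH:MM:SS]` prefix format is hypothetical — adjust the regex to whatever your verbose log actually emits):

```python
import re
from datetime import datetime

def stage_durations(log_lines):
    """Extract '[HH:MM:SS] <stage name>' lines and compute how long each
    stage ran, as the gap to the next timestamped line."""
    stamped = []
    for line in log_lines:
        m = re.match(r"\[(\d{2}:\d{2}:\d{2})\]\s+(.*)", line)
        if m:
            stamped.append((datetime.strptime(m.group(1), "%H:%M:%S"), m.group(2)))
    return [
        (name, (nxt - t).total_seconds())
        for (t, name), (nxt, _) in zip(stamped, stamped[1:])
    ]
```

Lines without a timestamp (stack traces, streamed answer text) are skipped, so the output maps directly onto the stage breakdown below.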

Sample Benchmark

Environment:

  • Linux x64, .NET 10 AOT
  • SearxNG local Docker (localhost:8002)
  • OpenRouter API (US East)
  • Model: qwen/qwen3.5-flash-02-23

Run 1:

real    0m23.4s
user    0m1.2s
sys     0m0.3s

Log breakdown:

  • Query generation: 3.2s
  • Searches: 4.1s
  • Article fetching: 8.7s (12 URLs)
  • Embeddings: 2.8s (45 chunks)
  • Final answer: 4.6s (325 tokens)

Run 2 (cached SearxNG results, same URLs):

real    0m15.8s

Faster article fetching (2.3s): sites had cached the response or simply responded faster on the second request.

Run 3 (with -s for a short answer):

real    0m18.2s

Final answer faster (2.1s instead of 4.6s) due to shorter output.

Benchmarking Tips

  1. Warm up: First run slower (JIT or AOT cold start). Discard first measurement.
  2. Network variance: Run multiple times and average.
  3. Control variables: Same question, same SearxNG instance, same network conditions.
  4. Measure API costs: Check OpenRouter dashboard for token counts.
  5. Profile with dotTrace or perf if investigating CPU bottlenecks.

Optimization Strategies

1. Tune Concurrent Limits

Edit SearchTool.cs where _options is created:

_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 5,       // ↓ from 10
    MaxConcurrentEmbeddingRequests = 2,    // ↓ from 4
    EmbeddingBatchSize = 300               // ↑ or ↓ (rarely matters)
};

Why tune down?

  • Hit OpenRouter rate limits
  • Network bandwidth saturated
  • Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)

Why tune up?

  • Fast network, powerful CPU, no rate limits
  • Many chunks (>500) needing parallel embedding batches

Monitor:

  • openquery -v shows embedding progress: [Generating embeddings: batch X/Y]
  • If Y=1 (all chunks fit in one batch), the batch size is fine
  • If Y>1 and max concurrent = Y, you're using full parallelism

2. Reduce Data Volume

Fewer search results:

openquery -r 3 "question"  # instead of 5 or 10

Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.

Fewer queries:

openquery -q 1 "question"

Effect: One search instead of N. Quality may suffer (less diverse sources).

Fewer chunks:

openquery -c 1 "question"

Effect: Only top 1 chunk in context → fewer tokens → faster final answer, but may miss relevant info.

Chunk size (compile-time constant): Edit ChunkingService.cs:

private const int MAX_CHUNK_SIZE = 300;  // instead of 500

Effect: more, shorter chunks → more granular ranking, but also more embeddings to generate and more chunks to rank. Could increase or decrease total time. Likely more total context tokens, since -c selects a fixed number of chunks regardless of their size.
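To see the trade-off concretely, a naive fixed-size splitter (Python sketch; the real ChunkingService also strips HTML and splits on sensible boundaries, which this ignores):

```python
def chunk_text(text, max_chunk_size=500):
    """Split text into fixed-size chunks; smaller max_chunk_size means
    more chunks from the same article, hence more embeddings to generate."""
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]
```

A 1500-char article yields 3 chunks at size 500 but 5 chunks at size 300 — a ~67% increase in embedding and ranking work for the same source text.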

3. Change Embedding Model

Currently hardcoded to openai/text-embedding-3-small. Could use:

  • openai/text-embedding-3-large (higher quality, slower, more expensive)
  • intfloat/multilingual-e5-large (multilingual, smaller)

Modify EmbeddingService constructor:

public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")

Then pass:

var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");

Impact: Different dimensionality (1536 vs 1024 vs 4096). Memory scales with dim. Quality may vary for non-English queries.

4. Caching

Current: No caching. Every query hits all APIs.

Embedding cache (by text hash):

  • Could store in memory: Dictionary<string, float[]>
  • Or disk: ~/.cache/openquery/embeddings/
  • Invalidation: embeddings are deterministic per model, so long-term cache viable
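A minimal in-memory version of that cache could look like this (Python sketch; `EmbeddingCache` and its methods are illustrative names, not part of the codebase):

```python
import hashlib

class EmbeddingCache:
    """Embedding cache keyed by (model, content hash). Embeddings are
    deterministic per model, so entries never need to expire."""

    def __init__(self):
        self._store = {}

    def _key(self, model, text):
        return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())

    def get_or_compute(self, model, text, compute):
        key = self._key(model, text)
        if key not in self._store:
            self._store[key] = compute(text)   # API call happens only on a miss
        return self._store[key]
```

Keying on the model name matters: switching embedding models must not return vectors of the wrong dimensionality. A disk-backed variant would serialize the same mapping under `~/.cache/openquery/embeddings/`.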

Search cache (by query hash):

  • Cache List<SearxngResult> for identical queries
  • TTL: maybe 1 hour (search results change over time)

Article cache (by URL hash):

  • Cache Article (text content) per URL
  • Invalidation: could check Last-Modified header or use TTL (1 day)

Implementation effort: Medium. Would need cache abstraction (interface, in-memory + disk options).

Benefit: Repeat queries (common in testing or similar questions) become instant.

5. Parallelize More (Aggressive)

Currently:

  • Searches: unbounded (as many as --queries)
  • Fetches: max 10
  • Embeddings: max 4

Could increase:

  • Fetches to 20 or 50 (if network/CPU can handle)
  • Embeddings to 8-16 (if OpenRouter rate limit allows)

Risk:

  • Overwhelming target sites (unethical scraping)
  • API rate limits → 429 errors
  • Local bandwidth saturation

6. Local Models (Self-Hosted)

Replace OpenRouter with local LLM:

  • Query generation: Could run tiny model locally (no API latency)
  • Embeddings: Could run all-MiniLM-L6-v2 locally (fast, free after setup)
  • Answer: Could run Llama 3 8B locally (no cost, but slower than GPT-4/Gemini)

Benefits:

  • Zero API costs (after hardware)
  • No network latency
  • Unlimited queries

Drawbacks:

  • GPU required for decent speed (or CPU very slow)
  • Setup complexity (Ollama, llama.cpp, vLLM, etc.)
  • Model quality may lag behind commercial APIs

Integration: Would need to implement local inference backends (separate project scope).


Scalability Limits

API Rate Limits

OpenRouter:

  • Free tier: Very limited (few RPM)
  • Paid: Varies by model, but typical ~10-30 requests/second
  • Embedding API has separate limits

Mitigation:

  • Reduce concurrency (see tuning)
  • Add exponential backoff (already have for embeddings)
  • Batch embedding requests (already done)

SearxNG Limits

Single instance:

  • Can handle ~10-50 QPS depending on hardware
  • Upstream search engines may rate limit per instance
  • Memory ~100-500MB

Mitigation:

  • Run multiple SearxNG instances behind load balancer
  • Use different public instances
  • Implement client-side rate limiting (currently only per-URL fetches limited, not searches)

Network Bandwidth

Typical data transfer:

  • Searches: 1KB per query × 3 = 3KB
  • Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
  • Extracted text: ~10% of HTML size = 150-750KB
  • Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
  • Final answer: 2-10KB

Total: ~3-10MB per query

100 queries/hour: ~300MB-1GB data transfer

Not an issue for broadband, but could matter on metered connections.


Scaling with Chunk Count

Let:

  • C = number of chunks with valid embeddings
  • d = embedding dimension (1536)
  • B = embedding batch size (300)
  • P = max parallel embedding batches (4)

  • Embedding time ≈ O(C/B × 1/P) (batches divided by parallelism)
  • Ranking time ≈ O(C × d) (dot product per chunk)
  • Context tokens (for final answer) ≈ C × avg_chunk_tokens (≈500 chars ≈ 125 tokens)

As C increases:

  • Embedding time: linear in C/B (sublinear if batch fits in one)
  • Ranking time: linear in C
  • Final answer latency: more tokens in context → longer context processing + potentially longer answer (more relevant chunks to synthesize)

Practical limit:

  • With defaults, C ~ 50-100 (from 15 articles)
  • Could reach C ~ 500-1000 if:
    • --queries = 10
    • --results = 20 (200 URLs)
    • Many articles long → many chunks each
  • At C = 1000:
    • Embeddings: ⌈1000/300⌉ = 4 batches; with 4 parallel slots all four run at once → time ≈ one batch duration
    • But OpenRouter may have per-minute limits on embedding requests
    • Ranking: 1000 × 1536 = 1.5M FLOPs → still <0.01s
    • Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so fits, but expensive and slow.
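The arithmetic above, parameterized over the symbols defined earlier (Python sketch):

```python
import math

def scaling_estimates(C, d=1536, B=300, P=4, avg_chunk_tokens=125):
    """Scaling with chunk count C: number of embedding batches, sequential
    batch steps (P batches run in parallel), ranking multiply-adds, and
    worst-case context tokens if every chunk went into the prompt."""
    batches = math.ceil(C / B)
    sequential_steps = math.ceil(batches / P)
    ranking_mults = C * d
    context_tokens = C * avg_chunk_tokens
    return batches, sequential_steps, ranking_mults, context_tokens
```

Plugging in C = 1000 reproduces the figures above: 4 batches, one parallel step, ~1.5M multiply-adds for ranking, and 125K worst-case context tokens.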

Conclusion: Current defaults scale to C ~ 100-200 comfortably. Beyond that:

  • Need to increase batch size or parallelism for embeddings
  • May hit embedding API rate limits
  • Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long context)

Profiling

CPU Profiling

Use dotnet-trace or perf:

# Collect trace for 30 seconds while running query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 30s -o trace.nettrace

# Analyze with Visual Studio or PerfView

Look for:

  • Hot methods: ChunkingService.ChunkText, EmbeddingService.GetEmbeddingsAsync, cosine similarity
  • Allocation hotspots

Memory Profiling

dotnet-gcdump collect -p <pid>
# Open in VS or dotnet-gcdump analyze

Check heap size, object counts (look for large string objects from article content).

Network Profiling

Use tcpdump or wireshark:

tcpdump -i any port 8002 or port 443 -w capture.pcap

Or simpler: time on individual curl commands to measure latency components.



Quick Tuning Cheatsheet

# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"

# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"

# Custom code edit for concurrency
# In SearchTool.cs:
_options = new ParallelProcessingOptions {
    MaxConcurrentArticleFetches = 20,  // if network can handle
    MaxConcurrentEmbeddingRequests = 8  // if API allows
};