Performance
Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.
📋 Table of Contents
- Performance Overview
- Latency Breakdown
- Throughput
- Memory Usage
- Benchmarking
- Optimization Strategies
- Scalability Limits
Performance Overview
OpenQuery is designed for low-latency interactive use (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.
Key Metrics
| Metric | Typical | Best Case | Worst Case |
|---|---|---|---|
| End-to-End Latency | 15-50s | 10s | 120s+ |
| API Cost | $0.01-0.05 | $0.005 | $0.20+ |
| Memory Footprint | 100-300MB | 50MB | 1GB+ |
| Network I/O | 5-20MB | 1MB | 100MB+ |
Note: Wide variance due to network latency, content size, and LLM speed.
Latency Breakdown
Default Configuration
-q 3 -r 5 -c 3 (3 queries, 5 results each, 3 final chunks)
| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|---|---|---|---|---|---|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| | **Total** | | 16-50s | ~60s | |
Phase Details
Phase 1: Query Generation (2-5s)
- Single non-streaming LLM call
- Input: system prompt + user question (~200 tokens)
- Output: JSON array of 3-5 short strings (~50 tokens)
- Fast because small context and output
Phase 2a: Searches (3-8s)
- 3 parallel `SearxngClient.SearchAsync` calls
- Each: query → SearxNG → aggregator engines → scraped results
- Latency highly variable based on:
- SearxNG instance performance
- Network distance to SearxNG
- SearxNG's upstream search engines
Phase 2b: Article Fetching (5-15s)
- ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
- Up to 10 concurrent fetches (semaphore)
- Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
- Latency:
- Fast sites (CDN, cached): 200-500ms
- Normal sites: 1-3s
- Slow/unresponsive sites: timeout after ~30s
Why 5-15s for 15 URLs with 10 concurrent?
- First wave (10 URLs): max latency among them ≈ 3s → 3s
- Second wave (5 URLs): another ≈ 3s → total 6s
- But many URLs faster (500ms) → total ≈ 2-3s
- However, some sites take 5-10s → dominates
Tail latency: Slowest few URLs can dominate total time. Cannot proceed until all fetch attempts complete (or fail).
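The wave arithmetic above can be sketched with a toy model (illustrative numbers, simplified scheduling):

```python
def fetch_time(latencies, max_concurrent=10):
    """Estimate total fetch time under a simplified wave model:
    URLs run in waves of `max_concurrent`, and each wave lasts as
    long as its slowest URL. (A real semaphore backfills freed
    slots, so this slightly overestimates.)"""
    waves = [latencies[i:i + max_concurrent]
             for i in range(0, len(latencies), max_concurrent)]
    return sum(max(wave) for wave in waves)

# 15 URLs: twelve fast CDN-style fetches, three slow sites
latencies = [0.5] * 12 + [3.0, 5.0, 10.0]
print(fetch_time(latencies))  # 10.5 -- the slowest site dominates
```

Even with 12 of 15 URLs finishing in 500ms, one 10s straggler sets the floor for the whole phase.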
Phase 2c: Chunking (<1s)
- CPU-bound HTML cleaning and splitting
- SmartReader (a C# HTML/readability parser) is fast
- Typically 100-300 chunks total
- <1s on modern CPU
Phase 3: Embeddings (1.5-4s)
- Query embedding: 1 call, ~200 tokens, ≈ 0.5-1s
- Chunk embeddings: ≈50 chunks → 1 batch of 50 (batch size 300 unused here)
- Batch of 50: still a single API call, ~15K characters (50 × 300 chars ≈ 4K tokens)
- If using `text-embedding-3-small`: $0.00002 per 1K tokens → ~$0.0001 per batch
- Latency: 1-3s for embedding API
If more chunks (say 500), would be 2 batches → maybe 2-4s.
Parallel batches (4 concurrent) help if many batches (1500+ chunks).
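The batch arithmetic above can be sketched directly (batch size and parallelism taken from the defaults):

```python
import math

def embedding_rounds(chunks, batch_size=300, max_parallel=4):
    """Sequential embedding rounds: chunks are grouped into batches of
    `batch_size`, and up to `max_parallel` batches run concurrently."""
    batches = math.ceil(chunks / batch_size)
    return math.ceil(batches / max_parallel)

print(embedding_rounds(50))    # 1 batch  -> 1 round (the default case)
print(embedding_rounds(1500))  # 5 batches -> 2 rounds (parallelism now helps)
```

Below ~1200 chunks (4 batches × 300) everything completes in a single parallel round, which is why batch parallelism rarely matters at default settings.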
Phase 4: Ranking (<0.1s)
- Cosine similarity for 50-100 chunks
- Each: dot product + normalization (O(dim)=1536)
- 100 × 1536 ≈ 150K FLOPs → negligible on modern CPU
- SIMD acceleration from `TensorPrimitives`
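The ranking step is just a cosine similarity per chunk; a plain-Python sketch of the math (OpenQuery's C# version uses `TensorPrimitives` for the SIMD-accelerated equivalent):

```python
import math

def cosine_similarity(a, b):
    """Dot product over the product of norms -- O(dim) work per chunk."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Ranking = score every chunk embedding against the query embedding
query = [1.0, 0.0, 1.0]
chunks = {"relevant": [2.0, 0.0, 2.0], "unrelated": [0.0, 1.0, 0.0]}
ranked = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]),
                reverse=True)
print(ranked)  # ['relevant', 'unrelated']
```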
Phase 5: Final Answer (5-20s)
- Streaming chat completion
- Input: system prompt + context (~400 tokens for 3 × 500-char chunks) + question
- Output: varies wildly (200-2000 tokens typically)
- Longer context slightly increases latency
- Model choice major factor:
- Qwen Flash: fast (5-10s for 1000 output tokens)
- Gemini Flash: moderate (10-15s)
- Llama-class: slower (20-40s)
Throughput
Sequential Execution
Running queries one after another (default CLI behavior):
- Latency per query: 16-50s
- Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)
But API rate limits will kick in before that:
- OpenRouter free tier: limited RPM/TPM
- Even paid: soft limits
Concurrent Execution (Multiple OpenQuery Instances)
You could run multiple OpenQuery processes in parallel (different terminals), but they share:
- Same API key (OpenRouter rate limit is per API key, not per process)
- Same SearxNG instance (could saturate it)
Practical: 3-5 concurrent processes before hitting diminishing returns or rate limits.
Throughput Optimization
To maximize queries per hour:
- Use fastest model (Qwen Flash)
- Reduce `--chunks` to 1-2
- Reduce `--queries` to 1
- Use a local/fast SearxNG
- Cache embedding results (not implemented)
- Batch multiple questions in one process (not implemented; would require redesign)
Achievable: Maybe 500-1000 queries/hour on paid OpenRouter plan with aggressive settings.
Memory Usage
Baseline
.NET 10 AOT app with dependencies:
- Code: ~30MB (AOT compiled native code)
- Runtime: ~20MB (.NET runtime overhead)
- Base Memory: ~50MB
Per-Query Memory
| Component | Memory | Lifetime |
|---|---|---|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items) | ~50KB text + embeddings 600KB | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| Total per query | ~2-5MB (excluding base) | Released after complete |
Peak: When all articles fetched but not yet embedded, we have text ~500KB + chunks ~650KB = ~1.2MB + overhead ≈ 2-3MB.
If processing many queries in parallel (unlikely for CLI), memory would scale linearly.
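The embedding figures in the table follow directly from float32 storage:

```python
def embedding_memory_bytes(chunks, dim=1536, bytes_per_float=4):
    """float32 embedding storage: chunks x dimensions x 4 bytes each."""
    return chunks * dim * bytes_per_float

print(embedding_memory_bytes(100) / 1024)  # 600.0 KiB for the default pipeline
```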
Memory Leak Risks
- `HttpClient` instances: created per `OpenRouterClient` and `SearxngClient`. Should be disposed (currently not), but the short-lived process exits anyway.
- `StatusReporter` background task: disposed via `using`
- `RateLimiter` semaphore: would be disposed via `IAsyncDisposable` if wrapped in `using` (not currently, but the process is short-lived)
No major leaks observed.
Memory Optimization Opportunities
- Reuse `HttpClient` with `IHttpClientFactory` (but not needed for CLI)
- Stream article fetching instead of buffering all articles before embedding (possible: embed as URLs complete)
- Early chunk filtering: Discard low-quality chunks before embedding to reduce embedding count
- Cache embeddings: By content hash, avoid re-embedding seen text (would need persistent storage)
Benchmarking
Methodology
Measure with the `time` command and verbose logging:
time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt
Parse log for timestamps (or add them manually by modifying code).
Sample Benchmark
Environment:
- Linux x64, .NET 10 AOT
- SearxNG local Docker (localhost:8002)
- OpenRouter API (US East)
- Model: qwen/qwen3.5-flash-02-23
Run 1:
real 0m23.4s
user 0m1.2s
sys 0m0.3s
Log breakdown:
- Query generation: 3.2s
- Searches: 4.1s
- Article fetching: 8.7s (12 URLs)
- Embeddings: 2.8s (45 chunks)
- Final answer: 4.6s (325 tokens)
Run 2 (cached SearxNG results, same URLs):
real 0m15.8s
Faster article fetching (2.3s) because sites cached or faster second request.
Run 3 (with `-s`, short answer):
real 0m18.2s
Final answer faster (2.1s instead of 4.6s) due to shorter output.
Benchmarking Tips
- Warm up: First run slower (JIT or AOT cold start). Discard first measurement.
- Network variance: Run multiple times and average.
- Control variables: Same question, same SearxNG instance, same network conditions.
- Measure API costs: Check OpenRouter dashboard for token counts.
- Profile with dotTrace or `perf` if investigating CPU bottlenecks.
Optimization Strategies
1. Tune Concurrent Limits
Edit SearchTool.cs where _options is created:
_options = new ParallelProcessingOptions
{
MaxConcurrentArticleFetches = 5, // ↓ from 10
MaxConcurrentEmbeddingRequests = 2, // ↓ from 4
EmbeddingBatchSize = 300 // ↑ or ↓ (rarely matters)
};
Why tune down?
- Hit OpenRouter rate limits
- Network bandwidth saturated
- Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)
Why tune up?
- Fast network, powerful CPU, no rate limits
- Many chunks (>500) needing parallel embedding batches
Monitor:
- `openquery -v` shows embedding progress: `[Generating embeddings: batch X/Y]`
- If Y=1 (everything fit in one batch), the batch size is fine
- If Y>1 and max concurrency equals Y, you're using full parallelism
2. Reduce Data Volume
Fewer search results:
openquery -r 3 "question" # instead of 5 or 10
Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.
Fewer queries:
openquery -q 1 "question"
Effect: One search instead of N. Quality may suffer (less diverse sources).
Fewer chunks:
openquery -c 1 "question"
Effect: Only top 1 chunk in context → fewer tokens → faster final answer, but may miss relevant info.
Chunk size (compile-time constant):
Edit ChunkingService.cs:
private const int MAX_CHUNK_SIZE = 300; // instead of 500
Effect: More chunks (more granular ranking) but each chunk shorter → more chunks to rank, more embeddings to generate. Could increase or decrease total time. Likely more tokens overall (more chunks in context if -c is fixed number).
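A naive fixed-width chunker illustrates the trade-off (the real `ChunkingService` also cleans HTML and respects boundaries, so this is only a sketch):

```python
def chunk_text(text, max_chunk_size=500):
    """Naive fixed-width chunker: split into `max_chunk_size`-char pieces."""
    return [text[i:i + max_chunk_size]
            for i in range(0, len(text), max_chunk_size)]

article = "x" * 3000  # a 3000-char extracted article
print(len(chunk_text(article, 500)))  # 6 chunks
print(len(chunk_text(article, 300)))  # 10 chunks: smaller size -> more chunks
```

Shrinking MAX_CHUNK_SIZE from 500 to 300 turns the same article into ~1.7× as many chunks to embed and rank.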
3. Change Embedding Model
Currently hardcoded to `openai/text-embedding-3-small`. Could use:
- `openai/text-embedding-3-large` (higher quality, slower, more expensive)
- `intfloat/multilingual-e5-large` (multilingual, smaller)
Modify EmbeddingService constructor:
public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")
Then pass:
var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");
Impact: Different dimensionality (1536 vs 1024 vs 4096). Memory scales with dim. Quality may vary for non-English queries.
4. Caching
Current: No caching. Every query hits all APIs.
Embedding cache (by text hash):
- Could store in memory: `Dictionary<string, float[]>`
- Or disk: `~/.cache/openquery/embeddings/`
- Invalidation: embeddings are deterministic per model, so a long-term cache is viable
Search cache (by query hash):
- Cache `List<SearxngResult>` for identical queries
- TTL: maybe 1 hour (search results change over time)
Article cache (by URL hash):
- Cache `Article` (text content) per URL
- Invalidation: could check the `Last-Modified` header or use a TTL (1 day)
Implementation effort: Medium. Would need cache abstraction (interface, in-memory + disk options).
Benefit: Repeat queries (common in testing or similar questions) become instant.
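A minimal sketch of the embedding-cache idea, keyed by model + content hash (all names here are hypothetical, not part of the codebase):

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by model + content hash.
    Embeddings are deterministic per model, so entries never go stale."""

    def __init__(self):
        self._store = {}

    def _key(self, model, text):
        return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

    def get_or_compute(self, model, text, compute):
        key = self._key(model, text)
        if key not in self._store:
            self._store[key] = compute(text)  # cache miss: call the API
        return self._store[key]

calls = []
cache = EmbeddingCache()
embed = lambda text: calls.append(text) or [0.1, 0.2]  # stand-in for the API
cache.get_or_compute("text-embedding-3-small", "hello", embed)
cache.get_or_compute("text-embedding-3-small", "hello", embed)  # cache hit
print(len(calls))  # 1 -- the second lookup never hit the "API"
```

Including the model name in the key means switching embedding models never serves stale vectors; a disk-backed variant would swap the dictionary for files under a cache directory.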
5. Parallelize More (Aggressive)
Currently:
- Searches: unbounded (as many as `--queries`)
- Fetches: max 10
- Embeddings: max 4
Could increase:
- Fetches to 20 or 50 (if network/CPU can handle)
- Embeddings to 8-16 (if OpenRouter rate limit allows)
Risk:
- Overwhelming target sites (unethical scraping)
- API rate limits → 429 errors
- Local bandwidth saturation
6. Local Models (Self-Hosted)
Replace OpenRouter with local LLM:
- Query generation: Could run tiny model locally (no API latency)
- Embeddings: Could run `all-MiniLM-L6-v2` locally (fast, free after setup)
- Answer: Could run Llama 3 8B locally (no cost, but slower than GPT-4/Gemini)
Benefits:
- Zero API costs (after hardware)
- No network latency
- Unlimited queries
Drawbacks:
- GPU required for decent speed (or CPU very slow)
- Setup complexity (Ollama, llama.cpp, vLLM, etc.)
- Model quality may lag behind commercial APIs
Integration: Would need to implement local inference backends (separate project scope).
Scalability Limits
API Rate Limits
OpenRouter:
- Free tier: Very limited (few RPM)
- Paid: Varies by model, but typical ~10-30 requests/second
- Embedding API has separate limits
Mitigation:
- Reduce concurrency (see tuning)
- Add exponential backoff (already implemented for embeddings)
- Batch embedding requests (already done)
SearxNG Limits
Single instance:
- Can handle ~10-50 QPS depending on hardware
- Upstream search engines may rate limit per instance
- Memory ~100-500MB
Mitigation:
- Run multiple SearxNG instances behind load balancer
- Use different public instances
- Implement client-side rate limiting (currently only per-URL fetches limited, not searches)
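Client-side search rate limiting could look like this sliding-window sketch (hypothetical; OpenQuery currently only limits per-URL fetches):

```python
import time

class SearchRateLimiter:
    """Sliding-window limiter: at most `max_calls` per `window` seconds."""

    def __init__(self, max_calls, window=1.0):
        self.max_calls = max_calls
        self.window = window
        self._stamps = []

    def acquire(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        self._stamps = [t for t in self._stamps if now - t < self.window]
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            time.sleep(self.window - (now - self._stamps[0]))
        self._stamps.append(time.monotonic())

limiter = SearchRateLimiter(max_calls=10)  # ~10 searches/second
for _ in range(3):
    limiter.acquire()  # would wrap each SearxNG search call
```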
Network Bandwidth
Typical data transfer:
- Searches: 1KB per query × 3 = 3KB
- Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
- Extracted text: ~10% of HTML size = 150-750KB
- Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
- Final answer: 2-10KB
Total: ~3-10MB per query
100 queries/hour: ~300MB-1GB data transfer
Not an issue for broadband, but could matter on metered connections.
Scaling with Chunk Count
Let:
- C = number of chunks with valid embeddings
- d = embedding dimension (1536)
- B = embedding batch size (300)
- P = max parallel embedding batches (4)
Embedding Time ≈ O(C/B * 1/P) (batches divided by parallelism)
Ranking Time ≈ O(C * d) (dot product per chunk)
Context Tokens (for final answer) ≈ C * avg_chunk_tokens (≈ 500 chars = 125 tokens)
As C increases:
- Embedding time: linear in C/B (sublinear if batch fits in one)
- Ranking time: linear in C
- Final answer latency: more tokens in context → longer context processing + potentially longer answer (more relevant chunks to synthesize)
Practical limit:
- With defaults, C ~ 50-100 (from 15 articles)
- Could reach C ~ 500-1000 if:
--queries= 10--results= 20 (200 URLs)- Many articles long → many chunks each
- At C = 1000:
- Embeddings: 1000/300 ≈ 4 batches; with 4 parallel, all batches run at once → time ≈ one batch duration
- But OpenRouter may have per-minute limits on embedding requests
- Ranking: 1000 × 1536 = 1.5M FLOPs → still <0.01s
- Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so fits, but expensive and slow.
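The C = 1000 arithmetic above can be checked with a small model of the scaling formulas (illustrative units):

```python
import math

D, B, P = 1536, 300, 4        # dimension, batch size, parallel batches
TOKENS_PER_CHUNK = 125        # ~500 chars per chunk

def scaling(c):
    """Back-of-envelope scaling for C chunks."""
    embed_rounds = math.ceil(math.ceil(c / B) / P)  # sequential batch rounds
    ranking_flops = c * D                           # one dot product per chunk
    context_tokens = c * TOKENS_PER_CHUNK
    return embed_rounds, ranking_flops, context_tokens

print(scaling(100))   # (1, 153600, 12500)   -- the default range
print(scaling(1000))  # (1, 1536000, 125000) -- 4 batches, all parallel
```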
Conclusion: Current defaults scale to C ~ 100-200 comfortably. Beyond that:
- Need to increase batch size or parallelism for embeddings
- May hit embedding API rate limits
- Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long context)
Profiling
CPU Profiling
Use dotnet-trace or perf:
# Collect trace for 30 seconds while running query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 00:00:30 -o trace.nettrace
# Analyze with Visual Studio or PerfView
Look for:
- Hot methods:
ChunkingService.ChunkText,EmbeddingService.GetEmbeddingsAsync, cosine similarity - Allocation hotspots
Memory Profiling
dotnet-gcdump collect -p <pid>
# Open in VS or dotnet-gcdump analyze
Check heap size, object counts (look for large string objects from article content).
Network Profiling
Use tcpdump or wireshark:
tcpdump -i any -w capture.pcap 'port 8002 or port 443'
Or simpler: time on individual curl commands to measure latency components.
Next Steps
- Configuration - Tune for your environment
- Troubleshooting - Diagnose slow performance
- Architecture - Understand pipeline bottlenecks
Quick Tuning Cheatsheet
# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"
# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"
# Custom code edit for concurrency
# In SearchTool.cs:
_options = new ParallelProcessingOptions {
MaxConcurrentArticleFetches = 20, // if network can handle
MaxConcurrentEmbeddingRequests = 8 // if API allows
};