# Performance

Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.

## 📋 Table of Contents

1. [Performance Overview](#performance-overview)
2. [Latency Breakdown](#latency-breakdown)
3. [Throughput](#throughput)
4. [Memory Usage](#memory-usage)
5. [Benchmarking](#benchmarking)
6. [Optimization Strategies](#optimization-strategies)
7. [Scalability Limits](#scalability-limits)

## Performance Overview

OpenQuery is designed for **low-latency interactive use** (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.

### Key Metrics

| Metric | Typical | Best Case | Worst Case |
|--------|---------|-----------|------------|
| **End-to-End Latency** | 15-50s | 10s | 120s+ |
| **API Cost** | $0.01-0.05 | $0.005 | $0.20+ |
| **Memory Footprint** | 100-300MB | 50MB | 1GB+ |
| **Network I/O** | 5-20MB | 1MB | 100MB+ |

**Note**: The wide variance is due to network latency, content size, and LLM speed.

---

## Latency Breakdown

### Default Configuration

`-q 3 -r 5 -c 3` (3 queries, 5 results each, 3 final chunks)

| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|-------|-----------|-------------|------------|------------|-----------------|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| **Total** | | | **16-50s** | **~60s** | |

### Phase Details

#### Phase 1: Query Generation (2-5s)

- Single non-streaming LLM call
- Input: system prompt + user
question (~200 tokens)
- Output: JSON array of 3-5 short strings (~50 tokens)
- Fast because both context and output are small

#### Phase 2a: Searches (3-8s)

- 3 parallel `SearxngClient.SearchAsync` calls
- Each: query → SearxNG → aggregator engines → scraped results
- Latency is highly variable based on:
  - SearxNG instance performance
  - Network distance to SearxNG
  - SearxNG's upstream search engines

#### Phase 2b: Article Fetching (5-15s)

- ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
- Up to 10 concurrent fetches (semaphore)
- Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
- Latency:
  - Fast sites (CDN, cached): 200-500ms
  - Normal sites: 1-3s
  - Slow/unresponsive sites: timeout after ~30s

Why 5-15s for 15 URLs with 10 concurrent?

- First wave (10 URLs): max latency among them ≈ 3s → 3s
- Second wave (5 URLs): another ≈ 3s → total 6s
- But many URLs are faster (500ms) → total ≈ 2-3s
- However, some sites take 5-10s → they dominate

**Tail latency**: The slowest few URLs can dominate the total time. The pipeline cannot proceed until all fetch attempts complete (or fail).

#### Phase 2c: Chunking (<1s)

- CPU-bound HTML cleaning and splitting
- SmartReader (a C# HTML parser) is surprisingly fast
- Typically 100-300 chunks total
- <1s on a modern CPU

#### Phase 3: Embeddings (1.5-4s)

- **Query embedding**: 1 call, ~200 tokens, ≈ 0.5-1s
- **Chunk embeddings**: ≈50 chunks → 1 batch of 50 (the batch-size limit of 300 is not reached here)
- Batch of 50: still a single API call, ~15K characters ≈ 4K tokens (50 × 300 chars, at ~4 chars/token)
- If using `text-embedding-3-small`: $0.00002 per 1K tokens → ~$0.0001 per batch
- Latency: 1-3s for the embedding API

With more chunks (say 500), there would be 2 batches → maybe 2-4s. Parallel batches (4 concurrent) only help with many batches (1500+ chunks).
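The wave arithmetic in Phase 2b can be sanity-checked with a small scheduling sketch. This is illustrative Python, not part of the (C#) codebase: with a bounded number of concurrent slots, the total fetch time is the makespan of a greedy schedule, and a few slow stragglers dominate it.

```python
import heapq

def makespan(latencies, max_concurrent):
    """Greedy bounded-concurrency schedule: each fetch starts as soon as a slot frees up."""
    slots = [0.0] * max_concurrent   # time at which each slot becomes free
    heapq.heapify(slots)
    for lat in latencies:
        start = heapq.heappop(slots)           # earliest-free slot takes the next URL
        heapq.heappush(slots, start + lat)
    return max(slots)                          # done when the last slot finishes

# 15 URLs, 10 concurrent: two slow stragglers dominate the total
print(makespan([0.5] * 13 + [8.0, 10.0], 10))  # 10.5 — not sum/10 ≈ 2.5
```

With uniform 3s latencies the sketch reproduces the two-wave estimate above (`makespan([3.0]*15, 10)` → 6.0).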
#### Phase 4: Ranking (<0.1s)

- Cosine similarity for 50-100 chunks
- Each: dot product + normalization (O(dim), dim = 1536)
- 100 × 1536 ≈ 150K FLOPs → negligible on a modern CPU
- SIMD acceleration from `TensorPrimitives`

#### Phase 5: Final Answer (5-20s)

- Streaming chat completion
- Input: system prompt + context (3 top chunks × ~500 chars ≈ 400 tokens) + question
- Output: varies wildly (200-2000 tokens typically)
- Longer context slightly increases latency
- Model choice is the major factor:
  - Qwen Flash: fast (5-10s for 1000 output tokens)
  - Gemini Flash: moderate (10-15s)
  - Llama-class: slower (20-40s)

---

## Throughput

### Sequential Execution

Running queries one after another (default CLI behavior):

- Latency per query: 16-50s
- Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)

But API rate limits will kick in before that:

- OpenRouter free tier: limited RPM/TPM
- Even paid plans: soft limits

### Concurrent Execution (Multiple OpenQuery Instances)

You could run multiple OpenQuery processes in parallel (different terminals), but they share:

- The same API key (OpenRouter rate limits are per API key, not per process)
- The same SearxNG instance (which they could saturate)

**Practical**: 3-5 concurrent processes before hitting diminishing returns or rate limits.

### Throughput Optimization

To maximize queries per hour:

1. Use the fastest model (Qwen Flash)
2. Reduce `--chunks` to 1-2
3. Reduce `--queries` to 1
4. Use a local/fast SearxNG
5. Cache embedding results (not implemented)
6. Batch multiple questions in one process (not implemented; would require a redesign)

**Achievable**: Maybe 500-1000 queries/hour on a paid OpenRouter plan with aggressive settings.
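Throughput is bounded either by latency or by rate limits, whichever bites first. A back-of-envelope sketch in Python (illustrative, not project code; the figure of ~3 API calls per query — query generation, embedding batch, final answer — is an assumption):

```python
def queries_per_hour(latency_s, processes=1, rpm_limit=None, calls_per_query=3):
    """Upper bound on sustained throughput: latency-bound vs. rate-limit-bound."""
    latency_bound = 3600 / latency_s * processes
    if rpm_limit is None:
        return latency_bound
    rate_bound = rpm_limit * 60 / calls_per_query   # requests/hour ÷ calls per query
    return min(latency_bound, rate_bound)

print(queries_per_hour(20))                              # 180.0 — matches the estimate above
print(queries_per_hour(20, processes=5, rpm_limit=30))   # 600.0 — rate-limited, not latency-limited
```

This is why adding processes stops helping: past a few instances, the per-API-key rate limit becomes the binding constraint.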
---

## Memory Usage

### Baseline

.NET 10 AOT app with dependencies:

- **Code**: ~30MB (AOT-compiled native code)
- **Runtime**: ~20MB (.NET runtime overhead)
- **Base Memory**: ~50MB

### Per-Query Memory

| Component | Memory | Lifetime |
|-----------|--------|----------|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items) | ~50KB text + ~600KB embeddings | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| **Total per query** | **~2-5MB** (excluding base) | Released after completion |

**Peak**: When all articles are fetched but not yet embedded, we hold ~500KB of text + ~650KB of chunks = ~1.2MB + overhead ≈ 2-3MB.

**If processing many queries in parallel** (unlikely for the CLI), memory would scale linearly.

### Memory Leak Risks

- `HttpClient` instances: created per `OpenRouterClient` and `SearxngClient`. They should be disposed but currently are not; acceptable only because the short-lived process exits anyway.
- `StatusReporter` background task: disposed via `using`
- `RateLimiter` semaphore: would be disposed via `IAsyncDisposable` if wrapped in `using` (not currently, but the process is short-lived)

No major leaks observed.

### Memory Optimization Opportunities

1. **Reuse HttpClient** with `IHttpClientFactory` (not needed for a CLI)
2. **Stream article fetching** instead of buffering all articles before embedding (possible: embed as URLs complete)
3. **Early chunk filtering**: Discard low-quality chunks before embedding to reduce embedding count
4. **Cache embeddings**: Key by content hash to avoid re-embedding seen text (would need persistent storage)

---

## Benchmarking

### Methodology

Measure with the `time` command and verbose logging:

```bash
time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt
```

Parse the log for timestamps (or add them manually by modifying the code).
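If you'd rather instrument stages than parse logs, the pattern is just a monotonic clock wrapped around each phase. Here is a minimal sketch of the idea in Python (the codebase itself is C#, where `System.Diagnostics.Stopwatch` plays the same role):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name, timings):
    """Record the wall-clock duration of one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

timings = {}
with stage("chunking", timings):
    _ = sum(len(w) for w in ["some", "article", "text"])  # stand-in for real work
print(sorted(timings))  # ['chunking']
```

Summing the recorded stages against the `time` total also tells you how much latency the measurement itself fails to account for.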
### Sample Benchmark

**Environment**:

- Linux x64, .NET 10 AOT
- SearxNG local Docker (localhost:8002)
- OpenRouter API (US East)
- Model: qwen/qwen3.5-flash-02-23

**Run 1**:

```
real    0m23.4s
user    0m1.2s
sys     0m0.3s
```

Log breakdown:

- Query generation: 3.2s
- Searches: 4.1s
- Article fetching: 8.7s (12 URLs)
- Embeddings: 2.8s (45 chunks)
- Final answer: 4.6s (325 tokens)

**Run 2** (cached SearxNG results, same URLs):

```
real    0m15.8s
```

Article fetching was faster (2.3s) because sites were cached or responded faster on the second request.

**Run 3** (`-s` short answer):

```
real    0m18.2s
```

The final answer was faster (2.1s instead of 4.6s) due to shorter output.

### Benchmarking Tips

1. **Warm up**: The first run is slower (JIT or AOT cold start). Discard the first measurement.
2. **Network variance**: Run multiple times and average.
3. **Control variables**: Same question, same SearxNG instance, same network conditions.
4. **Measure API costs**: Check the OpenRouter dashboard for token counts.
5. **Profile with dotTrace** or `perf` if investigating CPU bottlenecks.

---

## Optimization Strategies

### 1. Tune Concurrency Limits

Edit `SearchTool.cs` where `_options` is created:

```csharp
var _options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 5,      // ↓ from 10
    MaxConcurrentEmbeddingRequests = 2,   // ↓ from 4
    EmbeddingBatchSize = 300              // ↑ or ↓ (rarely matters)
};
```

**Why tune down?**

- Hitting OpenRouter rate limits
- Saturated network bandwidth
- Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)

**Why tune up?**

- Fast network, powerful CPU, no rate limits
- Many chunks (>500) needing parallel embedding batches

**Monitor**:

- `openquery -v` shows embedding progress: `[Generating embeddings: batch X/Y]`
- If Y=1 (everything fits in one batch), the batch size is fine
- If Y>1 and the max concurrency equals Y, you're using full parallelism

### 2. Reduce Data Volume

**Fewer search results**:

```bash
openquery -r 3 "question"   # instead of 5 or 10
```

Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.

**Fewer queries**:

```bash
openquery -q 1 "question"
```

Effect: One search instead of N. Quality may suffer (less diverse sources).

**Fewer chunks**:

```bash
openquery -c 1 "question"
```

Effect: Only the top chunk goes into context → fewer tokens → faster final answer, but it may miss relevant info.

**Chunk size** (compile-time constant): Edit `ChunkingService.cs`:

```csharp
private const int MAX_CHUNK_SIZE = 300;  // instead of 500
```

Effect: More, shorter chunks → more granular ranking, but also more chunks to rank and more embeddings to generate. With `-c` fixed, shorter chunks mean fewer context tokens but less information per chunk. Total time could go up or down.

### 3. Change the Embedding Model

Currently hardcoded to `openai/text-embedding-3-small`. Alternatives:

- `openai/text-embedding-3-large` (higher quality, slower, more expensive)
- `intfloat/multilingual-e5-large` (multilingual, smaller)

Modify the `EmbeddingService` constructor:

```csharp
public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")
```

Then pass:

```csharp
var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");
```

**Impact**: Different dimensionality (1536 vs 1024 vs 4096). Memory scales with the dimension. Quality may vary for non-English queries.

### 4. Caching

**Current**: No caching. Every query hits all APIs.
**Embedding cache** (by text hash):

- Could store in memory in a `Dictionary` keyed by content hash
- Or on disk: `~/.cache/openquery/embeddings/`
- Invalidation: embeddings are deterministic per model, so a long-term cache is viable

**Search cache** (by query hash):

- Cache the result list for identical queries
- TTL: maybe 1 hour (search results change over time)

**Article cache** (by URL hash):

- Cache the extracted `Article` text content per URL
- Invalidation: could check the `Last-Modified` header or use a TTL (1 day)

**Implementation effort**: Medium. Would need a cache abstraction (interface, with in-memory and disk options).

**Benefit**: Repeat queries (common in testing, or similar questions) become instant.

### 5. Parallelize More (Aggressive)

**Currently**:

- Searches: unbounded (as many as `--queries`)
- Fetches: max 10
- Embeddings: max 4

Could increase:

- Fetches to 20 or 50 (if the network/CPU can handle it)
- Embeddings to 8-16 (if the OpenRouter rate limit allows)

**Risk**:

- Overwhelming target sites (unethical scraping)
- API rate limits → 429 errors
- Local bandwidth saturation

### 6. Local Models (Self-Hosted)

Replace OpenRouter with local models:

- **Query generation**: a tiny model run locally (no API latency)
- **Embeddings**: `all-MiniLM-L6-v2` run locally (fast, free after setup)
- **Answer**: Llama 3 8B run locally (no cost, but slower than GPT-4/Gemini)

**Benefits**:

- Zero API costs (after hardware)
- No network latency
- Unlimited queries

**Drawbacks**:

- A GPU is required for decent speed (CPU-only is very slow)
- Setup complexity (Ollama, llama.cpp, vLLM, etc.)
- Model quality may lag behind commercial APIs

**Integration**: Would require implementing local inference backends (separate project scope).
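Of these caches, the embedding cache is the simplest to prototype, since embeddings are deterministic per model and need no TTL. A minimal in-memory sketch (illustrative Python; `embed_fn` is a hypothetical stand-in for the real embedding API call):

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by content hash; deterministic per model, so no TTL needed."""
    def __init__(self, embed_fn):
        self._embed = embed_fn
        self._store = {}

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed(text)   # miss: one API call
        return self._store[key]                    # hit: no network round-trip

api_calls = []
def fake_embed(text):                              # stand-in for the embedding API
    api_calls.append(text)
    return [0.1, 0.2, 0.3]

cache = EmbeddingCache(fake_embed)
cache.get("quantum entanglement")
cache.get("quantum entanglement")                  # served from cache
print(len(api_calls))  # 1
```

A disk-backed variant would serialize `_store` under `~/.cache/openquery/embeddings/`, keyed by the same hash plus the model name (re-embedding is required if the model changes).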
---

## Scalability Limits

### API Rate Limits

**OpenRouter**:

- Free tier: very limited (a few RPM)
- Paid: varies by model, but typically ~10-30 requests/second
- The embedding API has separate limits

**Mitigation**:

- Reduce concurrency (see the tuning section above)
- Add exponential backoff (already in place for embeddings)
- Batch embedding requests (already done)

### SearxNG Limits

**Single instance**:

- Can handle ~10-50 QPS depending on hardware
- Upstream search engines may rate limit per instance
- Memory ~100-500MB

**Mitigation**:

- Run multiple SearxNG instances behind a load balancer
- Use different public instances
- Implement client-side rate limiting (currently only per-URL fetches are limited, not searches)

### Network Bandwidth

**Typical data transfer**:

- Searches: 1KB per query × 3 = 3KB
- Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
- Extracted text: ~10% of HTML size = 150-750KB
- Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
- Final answer: 2-10KB

**Total**: ~3-10MB per query

**100 queries/hour**: ~300MB-1GB of data transfer

**Not an issue** for broadband, but could matter on metered connections.
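The transfer figures above can be reproduced with simple arithmetic. A back-of-envelope sketch in Python (illustrative only, using the same per-item estimates as the list):

```python
def per_query_kb(queries=3, fetches=15, html_kb=(100, 500), chunks=100, dim=1536):
    """Low/high estimate of data transferred per query, in KB."""
    lo = hi = queries * 1.0                                   # search responses, ~1KB each
    lo += fetches * html_kb[0]; hi += fetches * html_kb[1]    # raw HTML
    lo += fetches * html_kb[0] * 0.10                         # extracted text ≈ 10% of HTML
    hi += fetches * html_kb[1] * 0.10
    emb = chunks * dim * 4 / 1024                             # float32 embeddings, both directions
    lo += emb; hi += emb
    return lo, hi

lo, hi = per_query_kb()
print(round(lo / 1024, 1), round(hi / 1024, 1))  # ≈ 2.2 to 8.6 MB per query
```

That lands in the same ~3-10MB band as the list above; raw HTML is the dominant term, which is why reducing `-r` cuts bandwidth almost linearly.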
---

## Scaling with Chunk Count

Let:

- C = number of chunks with valid embeddings
- d = embedding dimension (1536)
- B = embedding batch size (300)
- P = max parallel embedding batches (4)

**Embedding Time** ≈ `O(C/B × 1/P)` (batches divided by parallelism)

**Ranking Time** ≈ `O(C × d)` (one dot product per chunk)

**Context Tokens** (for the final answer) ≈ `C × avg_chunk_tokens` (≈ 500 chars = 125 tokens), assuming all C chunks reach the context

**As C increases**:

- Embedding time: linear in C/B (flat while everything fits in one batch)
- Ranking time: linear in C
- Final answer latency: more tokens in context → longer context processing + potentially a longer answer (more relevant chunks to synthesize)

**Practical limits**:

- With defaults, C ≈ 50-100 (from 15 articles)
- Could reach C ≈ 500-1000 if:
  - `--queries` = 10
  - `--results` = 20 (200 URLs)
  - Many long articles → many chunks each
- At C = 1000:
  - Embeddings: 1000/300 ≈ 4 batches; with 4 running in parallel, total time ≈ 1 batch duration
  - But OpenRouter may have per-minute limits on embedding requests
  - Ranking: 1000 × 1536 ≈ 1.5M FLOPs → still <0.01s
  - Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so it fits, but it is expensive and slow.

**Conclusion**: The current defaults scale to C ≈ 100-200 comfortably.
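The scaling model above fits in a tiny calculator (illustrative Python; the constants mirror the defaults stated in the text, and it assumes all C chunks are sent to the context):

```python
import math

def scale(C, d=1536, B=300, P=4, chunk_chars=500, chars_per_token=4):
    """Back-of-envelope scaling figures for C chunks."""
    batches = math.ceil(C / B)
    waves = math.ceil(batches / P)                        # sequential steps given P parallel batches
    rank_flops = C * d                                    # one dot product per chunk
    context_tokens = C * chunk_chars // chars_per_token   # if all C chunks reach the context
    return {"batches": batches, "waves": waves,
            "rank_flops": rank_flops, "context_tokens": context_tokens}

print(scale(100))   # defaults: 1 batch, trivial ranking, 12500 context tokens
print(scale(1000))  # 4 batches in 1 parallel wave, 1.5M ranking FLOPs, 125000 context tokens
```

Plugging in other values shows where each term starts to bite: embedding waves only grow past C = B × P = 1200, while context tokens grow linearly from the start — which is why the context, not the embedding pipeline, is the practical ceiling.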
Beyond that:

- Increase the batch size or parallelism for embeddings
- Expect to hit embedding API rate limits
- Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long contexts)

---

## Profiling

### CPU Profiling

Use `dotnet-trace` or `perf`:

```bash
# Collect a trace for 30 seconds while running a query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 30s -o trace.nettrace

# Analyze with Visual Studio or PerfView
```

Look for:

- Hot methods: `ChunkingService.ChunkText`, `EmbeddingService.GetEmbeddingsAsync`, cosine similarity
- Allocation hotspots

### Memory Profiling

```bash
dotnet-gcdump collect -p <pid>

# Open in Visual Studio or analyze with dotnet-gcdump
```

Check heap size and object counts (look for large `string` objects from article content).

### Network Profiling

Use `tcpdump` or Wireshark:

```bash
tcpdump -i any port 8002 or port 443 -w capture.pcap
```

Or simpler: run `time` on individual curl commands to measure latency components.

---

## Next Steps

- [Configuration](../configuration.md) - Tune for your environment
- [Troubleshooting](../troubleshooting.md) - Diagnose slow performance
- [Architecture](../architecture.md) - Understand pipeline bottlenecks

---

**Quick Tuning Cheatsheet**

```bash
# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"

# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"
```

```csharp
// Custom code edit for concurrency, in SearchTool.cs:
_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 20,    // if the network can handle it
    MaxConcurrentEmbeddingRequests = 8   // if the API allows
};
```