docs: add comprehensive documentation with README and detailed guides

- Add user-friendly README.md with quick start guide
- Create docs/ folder with structured technical documentation:
  - installation.md: Build and setup instructions
  - configuration.md: Complete config reference
  - usage.md: CLI usage guide with examples
  - architecture.md: System design and patterns
  - components/: Deep dive into each component (OpenQueryApp, SearchTool, Services, Models)
  - api/: CLI reference, environment variables, programmatic API
  - troubleshooting.md: Common issues and solutions
  - performance.md: Latency, throughput, and optimization
- All documentation fully cross-referenced with internal links
- Covers project overview, architecture, components, APIs, and support

See individual files for complete documentation.
This commit is contained in:
OpenQuery Documentation
2026-03-19 10:01:58 +01:00
parent b28d8998f7
commit 65ca2401ae
16 changed files with 7073 additions and 0 deletions

docs/performance.md (new file, 522 lines)

# Performance
Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.
## 📋 Table of Contents
1. [Performance Overview](#performance-overview)
2. [Latency Breakdown](#latency-breakdown)
3. [Throughput](#throughput)
4. [Memory Usage](#memory-usage)
5. [Benchmarking](#benchmarking)
6. [Optimization Strategies](#optimization-strategies)
7. [Scalability Limits](#scalability-limits)
## Performance Overview
OpenQuery is designed for **low-latency interactive use** (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.
### Key Metrics
| Metric | Typical | Best Case | Worst Case |
|--------|---------|-----------|------------|
| **End-to-End Latency** | 15-50s | 10s | 120s+ |
| **API Cost** | $0.01-0.05 | $0.005 | $0.20+ |
| **Memory Footprint** | 100-300MB | 50MB | 1GB+ |
| **Network I/O** | 5-20MB | 1MB | 100MB+ |
**Note**: Wide variance due to network latency, content size, and LLM speed.
---
## Latency Breakdown
### Default Configuration
`-q 3 -r 5 -c 3` (3 queries, 5 results each, 3 final chunks)
| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|-------|-----------|-------------|------------|------------|-----------------|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| **Total** | | | **16-50s** | **~60s** | |
### Phase Details
#### Phase 1: Query Generation (2-5s)
- Single non-streaming LLM call
- Input: system prompt + user question (~200 tokens)
- Output: JSON array of 3-5 short strings (~50 tokens)
- Fast because small context and output
#### Phase 2a: Searches (3-8s)
- 3 parallel `SearxngClient.SearchAsync` calls
- Each: query → SearxNG → aggregator engines → scraped results
- Latency highly variable based on:
- SearxNG instance performance
- Network distance to SearxNG
- SearxNG's upstream search engines
#### Phase 2b: Article Fetching (5-15s)
- ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
- Up to 10 concurrent fetches (semaphore)
- Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
- Latency:
- Fast sites (CDN, cached): 200-500ms
- Normal sites: 1-3s
- Slow/unresponsive sites: timeout after ~30s
Why 5-15s for 15 URLs with 10 concurrent fetches?
- First wave (10 URLs): bounded by the slowest of the wave, ≈3s
- Second wave (5 URLs): another ≈3s → ≈6s total if sites are typical
- If most URLs respond fast (≈500ms), the total drops to ≈2-3s
- But a single slow site (5-10s) dominates the whole stage
**Tail latency**: Slowest few URLs can dominate total time. Cannot proceed until all fetch attempts complete (or fail).
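The wave arithmetic above can be checked with a toy scheduler (hypothetical latencies; the real fetcher uses a semaphore, but the makespan behaves the same way):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Greedy simulation: each new fetch starts on the slot that frees up first.
static double Makespan(IEnumerable<double> latencies, int maxConcurrent)
{
    var slots = new double[maxConcurrent]; // time each slot becomes free
    foreach (var lat in latencies)
    {
        int idx = Array.IndexOf(slots, slots.Min());
        slots[idx] += lat;
    }
    return slots.Max();
}

// 14 fast URLs at 0.5s plus one slow site at 8s, 10 concurrent:
var latencies = Enumerable.Repeat(0.5, 14).Append(8.0);
Console.WriteLine(Makespan(latencies, 10)); // 8.5 — the single slow site dominates
```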
#### Phase 2c: Chunking (<1s)
- CPU-bound HTML cleaning and splitting
- SmartReader, a C# readability/HTML parser, is fast
- Typically 100-300 chunks total
- <1s on modern CPU
#### Phase 3: Embeddings (1.5-4s)
- **Query embedding**: 1 call, ~200 tokens, ≈ 0.5-1s
- **Chunk embeddings**: ≈50 chunks → 1 batch of 50 (the 300-item batch size is not reached here)
- Batch of 50: still a single API call, ≈15K chars (50 × ~300 chars), i.e. a few thousand tokens
- If using `text-embedding-3-small` ($0.00002 per 1K tokens): well under $0.001 per batch
- Latency: 1-3s for embedding API
If more chunks (say 500), would be 2 batches → maybe 2-4s.
Parallel batches (4 concurrent) help if many batches (1500+ chunks).
#### Phase 4: Ranking (<0.1s)
- Cosine similarity for 50-100 chunks
- Each: dot product + normalization (O(dim)=1536)
- 100 × 1536 ≈ 150K FLOPs → negligible on modern CPU
- SIMD acceleration from `TensorPrimitives`
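For reference, the vector math in this phase reduces to calls like the following (minimal sketch; requires the `System.Numerics.Tensors` package, with toy 4-dim vectors standing in for the 1536-dim embeddings):

```csharp
using System;
using System.Numerics.Tensors;

float[] query     = { 1f, 0f, 1f, 0f };
float[] match     = { 2f, 0f, 2f, 0f }; // same direction, different magnitude
float[] unrelated = { 0f, 1f, 0f, 1f };

// TensorPrimitives vectorizes the dot product and norms internally; scores lie in [-1, 1]
Console.WriteLine(TensorPrimitives.CosineSimilarity(query, match));     // 1
Console.WriteLine(TensorPrimitives.CosineSimilarity(query, unrelated)); // 0
```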
#### Phase 5: Final Answer (5-20s)
- Streaming chat completion
- Input: system prompt + context (≈1.5K chars ≈ 400 tokens for 3 × 500-char chunks) + question
- Output: varies wildly (200-2000 tokens typically)
- Longer context slightly increases latency
- Model choice major factor:
- Qwen Flash: fast (5-10s for 1000 output tokens)
- Gemini Flash: moderate (10-15s)
- Llama-class: slower (20-40s)
---
## Throughput
### Sequential Execution
Running queries one after another (default CLI behavior):
- Latency per query: 16-50s
- Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)
But API rate limits will kick in before that:
- OpenRouter free tier: limited RPM/TPM
- Even paid: soft limits
### Concurrent Execution (Multiple OpenQuery Instances)
You could run multiple OpenQuery processes in parallel (different terminals), but they share:
- Same API key (OpenRouter rate limit is per API key, not per process)
- Same SearxNG instance (could saturate it)
**Practical**: 3-5 concurrent processes before hitting diminishing returns or rate limits.
### Throughput Optimization
To maximize queries per hour:
1. Use fastest model (Qwen Flash)
2. Reduce `--chunks` to 1-2
3. Reduce `--queries` to 1
4. Use local/fast SearxNG
5. Cache embedding results (not implemented)
6. Batch multiple questions in one process (not implemented; would require redesign)
**Achievable**: Maybe 500-1000 queries/hour on paid OpenRouter plan with aggressive settings.
---
## Memory Usage
### Baseline
.NET 10 AOT app with dependencies:
- **Code**: ~30MB (AOT compiled native code)
- **Runtime**: ~20MB (.NET runtime overhead)
- **Base Memory**: ~50MB
### Per-Query Memory
| Component | Memory | Lifetime |
|-----------|--------|----------|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items) | ~50KB text + embeddings 600KB | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| **Total per query** | **~2-5MB** (excluding base) | Released after complete |
**Peak**: When all articles fetched but not yet embedded, we have text ~500KB + chunks ~650KB = ~1.2MB + overhead ≈ 2-3MB.
**If processing many queries in parallel** (unlikely for CLI), memory would scale linearly.
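To spot-check these numbers on your own machine, a two-line probe can be dropped anywhere in the pipeline (sketch; the labels are illustrative):

```csharp
using System;

// Managed heap vs. OS-level working set, in MB
long managedMb    = GC.GetTotalMemory(forceFullCollection: false) / (1024 * 1024);
long workingSetMb = Environment.WorkingSet / (1024 * 1024);
Console.WriteLine($"managed: {managedMb} MB, working set: {workingSetMb} MB");
```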
### Memory Leak Risks
- `HttpClient` instances: created per `OpenRouterClient` and `SearxngClient` and never disposed; acceptable because the short-lived process exits anyway
- `StatusReporter` background task: Disposed via `using`
- `RateLimiter` semaphore: Disposed via `IAsyncDisposable` if wrapped in `using` (not currently, but short-lived)
No major leaks observed.
### Memory Optimization Opportunities
1. **Reuse HttpClient** with `IHttpClientFactory` (but not needed for CLI)
2. **Stream article fetching** instead of buffering all articles before embedding (possible: embed as URLs complete)
3. **Early chunk filtering**: Discard low-quality chunks before embedding to reduce embedding count
4. **Cache embeddings**: By content hash, avoid re-embedding seen text (would need persistent storage)
---
## Benchmarking
### Methodology
Measure with `time` command and verbose logging:
```bash
time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt
```
Parse log for timestamps (or add them manually by modifying code).
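One minimal way to add those timestamps (a sketch; `TimedAsync` and the phase names are illustrative, not existing code):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Wrap each pipeline phase and log its wall time to stderr
static async Task<T> TimedAsync<T>(string phase, Func<Task<T>> work)
{
    var sw = Stopwatch.StartNew();
    T result = await work();
    Console.Error.WriteLine($"[timing] {phase}: {sw.Elapsed.TotalSeconds:F1}s");
    return result;
}

// e.g. in SearchTool:
// var queries = await TimedAsync("query generation", () => GenerateQueriesAsync(question));
int demo = await TimedAsync("demo", () => Task.FromResult(42));
Console.WriteLine(demo); // 42
```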
### Sample Benchmark
**Environment**:
- Linux x64, .NET 10 AOT
- SearxNG local Docker (localhost:8002)
- OpenRouter API (US East)
- Model: qwen/qwen3.5-flash-02-23
**Run 1**:
```
real 0m23.4s
user 0m1.2s
sys 0m0.3s
```
Log breakdown:
- Query generation: 3.2s
- Searches: 4.1s
- Article fetching: 8.7s (12 URLs)
- Embeddings: 2.8s (45 chunks)
- Final answer: 4.6s (325 tokens)
**Run 2** (cached SearxNG results, same URLs):
```
real 0m15.8s
```
Faster article fetching (2.3s) because sites cached or faster second request.
**Run 3** (`-s` short answer):
```
real 0m18.2s
```
Final answer faster (2.1s instead of 4.6s) due to shorter output.
### Benchmarking Tips
1. **Warm up**: First run slower (JIT or AOT cold start). Discard first measurement.
2. **Network variance**: Run multiple times and average.
3. **Control variables**: Same question, same SearxNG instance, same network conditions.
4. **Measure API costs**: Check OpenRouter dashboard for token counts.
5. **Profile with dotTrace** or `perf` if investigating CPU bottlenecks.
---
## Optimization Strategies
### 1. Tune Concurrent Limits
Edit `SearchTool.cs` where `_options` is created:
```csharp
_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 5,     // ↓ from 10
    MaxConcurrentEmbeddingRequests = 2,  // ↓ from 4
    EmbeddingBatchSize = 300             // ↑ or ↓ (rarely matters)
};
```
**Why tune down?**
- Hit OpenRouter rate limits
- Network bandwidth saturated
- Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)
**Why tune up?**
- Fast network, powerful CPU, no rate limits
- Many chunks (>500) needing parallel embedding batches
**Monitor**:
- `openquery -v` shows embedding progress: `[Generating embeddings: batch X/Y]`
- If Y=1 (all fitted in one batch), batch size is fine
- If Y>1 and max concurrent = Y, you're using full parallelism
### 2. Reduce Data Volume
**Fewer search results**:
```bash
openquery -r 3 "question" # instead of 5 or 10
```
Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.
**Fewer queries**:
```bash
openquery -q 1 "question"
```
Effect: One search instead of N. Quality may suffer (less diverse sources).
**Fewer chunks**:
```bash
openquery -c 1 "question"
```
Effect: Only top 1 chunk in context → fewer tokens → faster final answer, but may miss relevant info.
**Chunk size** (compile-time constant):
Edit `ChunkingService.cs`:
```csharp
private const int MAX_CHUNK_SIZE = 300; // instead of 500
```
Effect: More, shorter chunks → finer-grained ranking but more embeddings to generate. With `-c` held fixed, each selected chunk carries less text, so the final context shrinks; raise `-c` to keep comparable coverage. Net effect on total time can go either way.
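A rough model of the trade-off (sketch; assumes ~50K chars of extracted text and ~4 chars per token, both illustrative):

```csharp
using System;

static (int chunks, int contextTokens) ChunkModel(int totalChars, int chunkSize, int topC)
{
    int chunks = (totalChars + chunkSize - 1) / chunkSize; // ceil: how many chunks to embed and rank
    int contextTokens = topC * (chunkSize / 4);            // top-C chunks in context, ≈4 chars/token
    return (chunks, contextTokens);
}

Console.WriteLine(ChunkModel(50_000, 500, 3)); // (100, 375)
Console.WriteLine(ChunkModel(50_000, 300, 3)); // (167, 225)
```

Shrinking chunks raises the embedding/ranking workload but, with `-c` fixed, actually lowers the context token count.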
### 3. Change Embedding Model
Currently hardcoded to `openai/text-embedding-3-small`. Could use:
- `openai/text-embedding-3-large` (higher quality, slower, more expensive)
- `intfloat/multilingual-e5-large` (multilingual, smaller)
Modify `EmbeddingService` constructor:
```csharp
public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")
```
Then pass:
```csharp
var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");
```
**Impact**: Different dimensionality (1536 for 3-small, 3072 for 3-large, 1024 for e5-large). Memory scales with dim. Quality may vary for non-English queries.
### 4. Caching
**Current**: No caching. Every query hits all APIs.
**Embedding cache** (by text hash):
- Could store in memory: `Dictionary<string, float[]>`
- Or disk: `~/.cache/openquery/embeddings/`
- Invalidation: embeddings are deterministic per model, so long-term cache viable
**Search cache** (by query hash):
- Cache `List<SearxngResult>` for identical queries
- TTL: maybe 1 hour (search results change over time)
**Article cache** (by URL hash):
- Cache `Article` (text content) per URL
- Invalidation: could check `Last-Modified` header or use TTL (1 day)
**Implementation effort**: Medium. Would need cache abstraction (interface, in-memory + disk options).
**Benefit**: Repeat queries (common in testing or similar questions) become instant.
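A minimal in-memory version of the embedding cache (sketch; `GetOrEmbed` and the key scheme are hypothetical, not existing code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

var cache = new ConcurrentDictionary<string, float[]>();

// Embeddings are deterministic per model, so (model, content hash) is a stable key
static string KeyFor(string model, string text) =>
    model + ":" + Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

float[] GetOrEmbed(string model, string text, Func<string, float[]> embed) =>
    cache.GetOrAdd(KeyFor(model, text), _ => embed(text));

// The second call for identical text never reaches the embed function:
int apiCalls = 0;
float[] FakeEmbed(string t) { apiCalls++; return new float[] { t.Length }; }
GetOrEmbed("text-embedding-3-small", "hello", FakeEmbed);
GetOrEmbed("text-embedding-3-small", "hello", FakeEmbed);
Console.WriteLine(apiCalls); // 1
```

A disk-backed variant would serialize the same keys under `~/.cache/openquery/embeddings/`.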
### 5. Parallelize More (Aggressive)
**Currently**:
- Searches: unbounded (as many as `--queries`)
- Fetches: max 10
- Embeddings: max 4
Could increase:
- Fetches to 20 or 50 (if network/CPU can handle)
- Embeddings to 8-16 (if OpenRouter rate limit allows)
**Risk**:
- Overwhelming target sites (unethical scraping)
- API rate limits → 429 errors
- Local bandwidth saturation
### 6. Local Models (Self-Hosted)
Replace OpenRouter with local LLM:
- **Query generation**: Could run tiny model locally (no API latency)
- **Embeddings**: Could run `all-MiniLM-L6-v2` locally (fast, free after setup)
- **Answer**: Could run Llama 3 8B locally (no cost, but slower than GPT-4/Gemini)
**Benefits**:
- Zero API costs (after hardware)
- No network latency
- Unlimited queries
**Drawbacks**:
- GPU required for decent speed (or CPU very slow)
- Setup complexity (Ollama, llama.cpp, vLLM, etc.)
- Model quality may lag behind commercial APIs
**Integration**: Would need to implement local inference backends (separate project scope).
---
## Scalability Limits
### API Rate Limits
**OpenRouter**:
- Free tier: Very limited (few RPM)
- Paid: Varies by model, but typical ~10-30 requests/second
- Embedding API has separate limits
**Mitigation**:
- Reduce concurrency (see tuning)
- Add exponential backoff (already have for embeddings)
- Batch embedding requests (already done)
### SearxNG Limits
**Single instance**:
- Can handle ~10-50 QPS depending on hardware
- Upstream search engines may rate limit per instance
- Memory ~100-500MB
**Mitigation**:
- Run multiple SearxNG instances behind load balancer
- Use different public instances
- Implement client-side rate limiting (currently only per-URL fetches limited, not searches)
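Client-side limiting for searches could reuse the same `SemaphoreSlim` pattern the article fetcher uses (sketch; the 2-slot gate and `Task.Delay` stand in for real `SearchAsync` calls):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var gate = new SemaphoreSlim(2); // at most 2 searches in flight
int current = 0, peak = 0;

async Task LimitedSearch(int i)
{
    await gate.WaitAsync();
    try
    {
        int c = Interlocked.Increment(ref current);
        int seen;
        while (c > (seen = Volatile.Read(ref peak)))   // record high-water mark
            Interlocked.CompareExchange(ref peak, c, seen);
        await Task.Delay(30);                          // stand-in for the network call
    }
    finally { Interlocked.Decrement(ref current); gate.Release(); }
}

await Task.WhenAll(Enumerable.Range(0, 6).Select(LimitedSearch));
Console.WriteLine(peak); // never exceeds 2
```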
### Network Bandwidth
**Typical data transfer**:
- Searches: 1KB per query × 3 = 3KB
- Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
- Extracted text: ~10% of HTML size = 150-750KB
- Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
- Final answer: 2-10KB
**Total**: ~3-10MB per query
**100 queries/hour**: ~300MB-1GB data transfer
**Not an issue** for broadband, but could matter on metered connections.
---
## Scaling with Chunk Count
Let:
- C = number of chunks with valid embeddings
- d = embedding dimension (1536)
- B = embedding batch size (300)
- P = max parallel embedding batches (4)
**Embedding Time**: `O(C / (B * P))` (batches divided by parallelism)
**Ranking Time**: `O(C * d)` (dot product per chunk)
**Context Tokens** (for final answer): ≈ `C * avg_chunk_tokens` (≈500 chars ≈ 125 tokens per chunk)
**As C increases**:
- Embedding time: linear in C/B (sublinear if batch fits in one)
- Ranking time: linear in C
- Final answer latency: more tokens in context → longer context processing + potentially longer answer (more relevant chunks to synthesize)
**Practical limit**:
- With defaults, C ~ 50-100 (from 15 articles)
- Could reach C ~ 500-1000 if:
- `--queries` = 10
- `--results` = 20 (200 URLs)
- Many articles long → many chunks each
- At C = 1000:
- Embeddings: 1000/300 → 4 batches; with 4 batches running in parallel, wall time ≈ one batch duration
- But OpenRouter may have per-minute limits on embedding requests
- Ranking: 1000 × 1536 = 1.5M FLOPs → still <0.01s
- Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so fits, but expensive and slow.
**Conclusion**: Current defaults scale to C ~ 100-200 comfortably. Beyond that:
- Need to increase batch size or parallelism for embeddings
- May hit embedding API rate limits
- Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long context)
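The argument above as a back-of-envelope calculation (constants from this section; purely illustrative, and it follows the doc's convention of counting all C chunks toward context):

```csharp
using System;

static (int batches, long rankFlops, long contextTokens) ScaleModel(int c)
{
    const int B = 300;              // embedding batch size
    const int d = 1536;             // embedding dimension
    const int tokensPerChunk = 125; // ≈500 chars per chunk
    return ((c + B - 1) / B,        // ceil(C / B) embedding batches
            (long)c * d,            // one dot product per chunk
            (long)c * tokensPerChunk);
}

Console.WriteLine(ScaleModel(100));  // (1, 153600, 12500)
Console.WriteLine(ScaleModel(1000)); // (4, 1536000, 125000)
```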
---
## Profiling
### CPU Profiling
Use `dotnet-trace` or `perf`:
```bash
# Collect trace for 30 seconds while running query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 00:00:30 -o trace.nettrace
# Analyze with Visual Studio or PerfView
```
Look for:
- Hot methods: `ChunkingService.ChunkText`, `EmbeddingService.GetEmbeddingsAsync`, cosine similarity
- Allocation hotspots
### Memory Profiling
```bash
dotnet-gcdump collect -p <pid>
# Open in Visual Studio, or summarize with: dotnet-gcdump report <file>.gcdump
```
Check heap size, object counts (look for large `string` objects from article content).
### Network Profiling
Use `tcpdump` or `wireshark`:
```bash
tcpdump -i any -w capture.pcap 'port 8002 or port 443'
```
Or simpler: `time` on individual curl commands to measure latency components.
---
## Next Steps
- [Configuration](../configuration.md) - Tune for your environment
- [Troubleshooting](../troubleshooting.md) - Diagnose slow performance
- [Architecture](../architecture.md) - Understand pipeline bottlenecks
---
**Quick Tuning Cheatsheet**
```bash
# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"

# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"
```

```csharp
// Custom code edit for concurrency, in SearchTool.cs:
_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 20,    // if network can handle
    MaxConcurrentEmbeddingRequests = 8   // if API allows
};
```