# Performance
Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.
## 📋 Table of Contents
1. [Performance Overview](#performance-overview)
2. [Latency Breakdown](#latency-breakdown)
3. [Throughput](#throughput)
4. [Memory Usage](#memory-usage)
5. [Benchmarking](#benchmarking)
6. [Optimization Strategies](#optimization-strategies)
7. [Scalability Limits](#scalability-limits)
8. [Profiling](#profiling)
## Performance Overview
OpenQuery is designed for **low-latency interactive use** (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.
### Key Metrics
| Metric | Typical | Best Case | Worst Case |
|--------|---------|-----------|------------|
| **End-to-End Latency** | 15-50s | 10s | 120s+ |
| **API Cost** | $0.01-0.05 | $0.005 | $0.20+ |
| **Memory Footprint** | 100-300MB | 50MB | 1GB+ |
| **Network I/O** | 5-20MB | 1MB | 100MB+ |
**Note**: Wide variance due to network latency, content size, and LLM speed.
---
## Latency Breakdown
### Default Configuration
`-q 3 -r 5 -c 3` (3 queries, 5 results each, 3 final chunks)
| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|-------|-----------|-------------|------------|------------|-----------------|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| **Total** | | | **16-50s** | **~60s** | |
### Phase Details
#### Phase 1: Query Generation (2-5s)
- Single non-streaming LLM call
- Input: system prompt + user question (~200 tokens)
- Output: JSON array of 3-5 short strings (~50 tokens)
- Fast because small context and output
#### Phase 2a: Searches (3-8s)
- 3 parallel `SearxngClient.SearchAsync` calls
- Each: query → SearxNG → aggregator engines → scraped results
- Latency highly variable based on:
  - SearxNG instance performance
  - Network distance to SearxNG
  - SearxNG's upstream search engines
#### Phase 2b: Article Fetching (5-15s)
- ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
- Up to 10 concurrent fetches (semaphore)
- Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
- Latency:
  - Fast sites (CDN, cached): 200-500ms
  - Normal sites: 1-3s
  - Slow/unresponsive sites: timeout after ~30s
Why 5-15s for 15 URLs with 10 concurrent?
- First wave (10 URLs): bounded by the slowest among them, ≈3s
- Second wave (5 URLs): another ≈3s → ≈6s total
- If most URLs finish fast (≈500ms), the total drops to ≈2-3s
- But a few sites taking 5-10s each dominate the total
**Tail latency**: Slowest few URLs can dominate total time. Cannot proceed until all fetch attempts complete (or fail).
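The wave/tail effect above can be modeled with a few lines of Python (a sketch, not the C# implementation — the latency figures are the illustrative numbers from this section):

```python
def makespan(latencies, max_concurrent):
    """Estimate total fetch time for semaphore-limited concurrency:
    greedily assign each URL (longest first) to the worker that
    frees up earliest. An approximation, not an exact scheduler."""
    workers = [0.0] * max_concurrent
    for lat in sorted(latencies, reverse=True):
        i = min(range(max_concurrent), key=lambda w: workers[w])
        workers[i] += lat
    return max(workers)

# 13 fast sites (~0.5s) plus 2 slow ones (8s): the slow tail dominates.
print(makespan([0.5] * 13 + [8.0, 8.0], 10))  # → 8.0
```

Even though 13 of 15 fetches complete in half a second, total time equals the slowest site's latency — which is why aggressive timeouts matter more than raw concurrency here.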
#### Phase 2c: Chunking (<1s)
- CPU-bound HTML cleaning and splitting
- SmartReader (a C# readability/HTML parser) is fast
- Typically 100-300 chunks total
- <1s on modern CPU
#### Phase 3: Embeddings (1.5-4s)
- **Query embedding**: 1 call, ~200 tokens, ≈ 0.5-1s
- **Chunk embeddings**: ≈50 chunks → a single batch (batch size 300, so all 50 fit in one)
- Batch of 50: still a single API call, ~25K chars (50 chunks × ~500 chars) ≈ 6K tokens
- If using `text-embedding-3-small`: $0.00002 per 1K tokens → ~$0.0001 per batch
- Latency: 1-3s for embedding API
With more chunks (say 500), there would be 2 batches → maybe 2-4s.
Parallel batches (4 concurrent) help if many batches (1500+ chunks).
#### Phase 4: Ranking (<0.1s)
- Cosine similarity for 50-100 chunks
- Each: dot product + normalization, O(d) with d = 1536
- 100 × 1536 ≈ 150K FLOPs → negligible on modern CPU
- SIMD acceleration from `TensorPrimitives`
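The ranking step is just cosine similarity plus a sort. A plain-Python equivalent (for illustration — the actual implementation uses SIMD-accelerated `TensorPrimitives` in C#):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_chunks(query_emb, chunk_embs, top_k=3):
    """Return indices of the top_k chunks most similar to the query."""
    order = sorted(range(len(chunk_embs)),
                   key=lambda i: cosine(query_emb, chunk_embs[i]),
                   reverse=True)
    return order[:top_k]

# Toy 2-dimensional embeddings: chunk 2 aligns exactly with the query.
q = [1.0, 0.0]
chunks = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
print(rank_chunks(q, chunks, top_k=2))  # → [2, 0]
```

At d = 1536 and ~100 chunks this is a few hundred thousand floating-point operations, which is why ranking never shows up in profiles.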
#### Phase 5: Final Answer (5-20s)
- Streaming chat completion
- Input: system prompt + context (~400 tokens for 3 × 500-char chunks) + question
- Output: varies wildly (200-2000 tokens typically)
- Longer context slightly increases latency
- Model choice major factor:
- Qwen Flash: fast (5-10s for 1000 output tokens)
- Gemini Flash: moderate (10-15s)
- Llama-class: slower (20-40s)
---
## Throughput
### Sequential Execution
Running queries one after another (default CLI behavior):
- Latency per query: 16-50s
- Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)
But API rate limits will kick in before that:
- OpenRouter free tier: limited RPM/TPM
- Even paid: soft limits
### Concurrent Execution (Multiple OpenQuery Instances)
You could run multiple OpenQuery processes in parallel (different terminals), but they share:
- Same API key (OpenRouter rate limit is per API key, not per process)
- Same SearxNG instance (could saturate it)
**Practical**: 3-5 concurrent processes before hitting diminishing returns or rate limits.
### Throughput Optimization
To maximize queries per hour:
1. Use fastest model (Qwen Flash)
2. Reduce `--chunks` to 1-2
3. Reduce `--queries` to 1
4. Use local/fast SearxNG
5. Cache embedding results (not implemented)
6. Batch multiple questions in one process (not implemented; would require redesign)
**Achievable**: Maybe 500-1000 queries/hour on paid OpenRouter plan with aggressive settings.
---
## Memory Usage
### Baseline
.NET 10 AOT app with dependencies:
- **Code**: ~30MB (AOT compiled native code)
- **Runtime**: ~20MB (.NET runtime overhead)
- **Base Memory**: ~50MB
### Per-Query Memory
| Component | Memory | Lifetime |
|-----------|--------|----------|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items) | ~50KB text + ~600KB embeddings | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| **Total per query** | **~2-5MB** (excluding base) | Released after complete |
**Peak**: When all articles fetched but not yet embedded, we have text ~500KB + chunks ~650KB = ~1.2MB + overhead ≈ 2-3MB.
**If processing many queries in parallel** (unlikely for CLI), memory would scale linearly.
### Memory Leak Risks
- `HttpClient` instances: created per `OpenRouterClient` and `SearxngClient`; should be disposed but currently are not. Acceptable because the short-lived process exits anyway.
- `StatusReporter` background task: Disposed via `using`
- `RateLimiter` semaphore: Disposed via `IAsyncDisposable` if wrapped in `using` (not currently, but short-lived)
No major leaks observed.
### Memory Optimization Opportunities
1. **Reuse HttpClient** with `IHttpClientFactory` (but not needed for CLI)
2. **Stream article fetching** instead of buffering all articles before embedding (possible: embed as URLs complete)
3. **Early chunk filtering**: Discard low-quality chunks before embedding to reduce embedding count
4. **Cache embeddings**: By content hash, avoid re-embedding seen text (would need persistent storage)
---
## Benchmarking
### Methodology
Measure with `time` command and verbose logging:
```bash
time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt
```
Parse log for timestamps (or add them manually by modifying code).
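If the verbose log carries timestamps, per-stage durations can be extracted with a small script. A sketch — the log format here (`HH:MM:SS.mmm [Stage] message`) is hypothetical, so adjust the regex to whatever your logging actually emits:

```python
import re
from datetime import datetime

# Hypothetical format: "10:00:04.100 [Fetch] 12 urls" — adjust to taste.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[([^\]]+)\]")

def stage_durations(log_lines):
    """Seconds elapsed between the first line of each stage and the next."""
    marks = []
    for line in log_lines:
        m = LINE.match(line)
        if m and (not marks or marks[-1][0] != m.group(2)):
            marks.append((m.group(2), datetime.strptime(m.group(1), "%H:%M:%S.%f")))
    return {stage: (marks[i + 1][1] - t).total_seconds()
            for i, (stage, t) in enumerate(marks[:-1])}

log = ["10:00:00.000 [Search] starting",
       "10:00:04.100 [Fetch] 12 urls",
       "10:00:12.800 [Embed] 45 chunks"]
print(stage_durations(log))  # → {'Search': 4.1, 'Fetch': 8.7}
```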
### Sample Benchmark
**Environment**:
- Linux x64, .NET 10 AOT
- SearxNG local Docker (localhost:8002)
- OpenRouter API (US East)
- Model: qwen/qwen3.5-flash-02-23
**Run 1**:
```
real 0m23.4s
user 0m1.2s
sys 0m0.3s
```
Log breakdown:
- Query generation: 3.2s
- Searches: 4.1s
- Article fetching: 8.7s (12 URLs)
- Embeddings: 2.8s (45 chunks)
- Final answer: 4.6s (325 tokens)
**Run 2** (cached SearxNG results, same URLs):
```
real 0m15.8s
```
Faster article fetching (2.3s) because sites cached or faster second request.
**Run 3** (verbose `-s` short answer):
```
real 0m18.2s
```
Final answer faster (2.1s instead of 4.6s) due to shorter output.
### Benchmarking Tips
1. **Warm up**: First run slower (JIT or AOT cold start). Discard first measurement.
2. **Network variance**: Run multiple times and average.
3. **Control variables**: Same question, same SearxNG instance, same network conditions.
4. **Measure API costs**: Check OpenRouter dashboard for token counts.
5. **Profile with dotTrace** or `perf` if investigating CPU bottlenecks.
---
## Optimization Strategies
### 1. Tune Concurrent Limits
Edit `SearchTool.cs` where `_options` is created:
```csharp
var _options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 5,    // ↓ from 10
    MaxConcurrentEmbeddingRequests = 2, // ↓ from 4
    EmbeddingBatchSize = 300            // ↑ or ↓ (rarely matters)
};
```
**Why tune down?**
- Hit OpenRouter rate limits
- Network bandwidth saturated
- Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)
**Why tune up?**
- Fast network, powerful CPU, no rate limits
- Many chunks (>500) needing parallel embedding batches
**Monitor**:
- `openquery -v` shows embedding progress: `[Generating embeddings: batch X/Y]`
  - If Y = 1 (all chunks fit in one batch), the batch size is fine
  - If Y > 1 and max concurrency equals Y, you're using full parallelism
### 2. Reduce Data Volume
**Fewer search results**:
```bash
openquery -r 3 "question" # instead of 5 or 10
```
Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.
**Fewer queries**:
```bash
openquery -q 1 "question"
```
Effect: One search instead of N. Quality may suffer (less diverse sources).
**Fewer chunks**:
```bash
openquery -c 1 "question"
```
Effect: Only top 1 chunk in context → fewer tokens → faster final answer, but may miss relevant info.
**Chunk size** (compile-time constant):
Edit `ChunkingService.cs`:
```csharp
private const int MAX_CHUNK_SIZE = 300; // instead of 500
```
Effect: More chunks (more granular ranking) but each chunk shorter → more chunks to rank, more embeddings to generate. Could increase or decrease total time. Likely more tokens overall (more chunks in context if `-c` is fixed number).
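The tradeoff can be quantified with a back-of-the-envelope calculator (a sketch; the ~4 chars/token ratio and batch size of 300 come from this document's own estimates):

```python
import math

def chunk_stats(total_chars, max_chunk_size, batch_size=300, chars_per_token=4):
    """Rough effect of MAX_CHUNK_SIZE on chunk count, embedding batches,
    and tokens contributed per chunk to the final-answer context."""
    chunks = math.ceil(total_chars / max_chunk_size)
    return {
        "chunks": chunks,
        "embedding_batches": math.ceil(chunks / batch_size),
        "tokens_per_chunk": max_chunk_size // chars_per_token,
    }

# 50K chars of extracted article text at the default vs. a smaller chunk size:
print(chunk_stats(50_000, 500))  # → {'chunks': 100, 'embedding_batches': 1, 'tokens_per_chunk': 125}
print(chunk_stats(50_000, 300))  # → {'chunks': 167, 'embedding_batches': 1, 'tokens_per_chunk': 75}
```

Halving-ish the chunk size adds ~70% more chunks to embed and rank, but each selected chunk carries fewer tokens into the final prompt.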
### 3. Change Embedding Model
Currently hardcoded to `openai/text-embedding-3-small`. Could use:
- `openai/text-embedding-3-large` (higher quality, slower, more expensive)
- `intfloat/multilingual-e5-large` (multilingual, smaller)
Modify `EmbeddingService` constructor:
```csharp
public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")
```
Then pass:
```csharp
var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");
```
**Impact**: Different dimensionality (1536 vs 1024 vs 4096). Memory scales with dim. Quality may vary for non-English queries.
### 4. Caching
**Current**: No caching. Every query hits all APIs.
**Embedding cache** (by text hash):
- Could store in memory: `Dictionary<string, float[]>`
- Or disk: `~/.cache/openquery/embeddings/`
- Invalidation: embeddings are deterministic per model, so long-term cache viable
**Search cache** (by query hash):
- Cache `List<SearxngResult>` for identical queries
- TTL: maybe 1 hour (search results change over time)
**Article cache** (by URL hash):
- Cache `Article` (text content) per URL
- Invalidation: could check `Last-Modified` header or use TTL (1 day)
**Implementation effort**: Medium. Would need cache abstraction (interface, in-memory + disk options).
**Benefit**: Repeat queries (common in testing or similar questions) become instant.
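The embedding cache sketched above is mostly a hash-keyed dictionary. An illustrative Python version (the class name, TTL handling, and key scheme are assumptions, not existing OpenQuery code):

```python
import hashlib
import time

class EmbeddingCache:
    """Content-hash cache with optional TTL. Embeddings are deterministic
    per model, so ttl=None is reasonable for them; search results should
    expire (e.g. ttl=3600)."""
    def __init__(self, ttl=None):
        self.ttl = ttl
        self._store = {}

    @staticmethod
    def key(model, text):
        # Keyed on model + content, so switching models never returns stale vectors.
        return hashlib.sha256(f"{model}\0{text}".encode()).hexdigest()

    def get(self, model, text):
        entry = self._store.get(self.key(model, text))
        if entry is None:
            return None
        value, stamp = entry
        if self.ttl is not None and time.monotonic() - stamp > self.ttl:
            return None
        return value

    def put(self, model, text, embedding):
        self._store[self.key(model, text)] = (embedding, time.monotonic())

cache = EmbeddingCache()
cache.put("text-embedding-3-small", "hello world", [0.1, 0.2])
print(cache.get("text-embedding-3-small", "hello world"))  # → [0.1, 0.2]
```

A disk-backed variant would serialize the same hash→vector mapping under `~/.cache/openquery/embeddings/`.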
### 5. Parallelize More (Aggressive)
**Currently**:
- Searches: unbounded (as many as `--queries`)
- Fetches: max 10
- Embeddings: max 4
Could increase:
- Fetches to 20 or 50 (if network/CPU can handle)
- Embeddings to 8-16 (if OpenRouter rate limit allows)
**Risk**:
- Overwhelming target sites (unethical scraping)
- API rate limits → 429 errors
- Local bandwidth saturation
### 6. Local Models (Self-Hosted)
Replace OpenRouter with local LLM:
- **Query generation**: Could run tiny model locally (no API latency)
- **Embeddings**: Could run `all-MiniLM-L6-v2` locally (fast, free after setup)
- **Answer**: Could run Llama 3 8B locally (no cost, but slower than GPT-4/Gemini)
**Benefits**:
- Zero API costs (after hardware)
- No network latency
- Unlimited queries
**Drawbacks**:
- GPU required for decent speed (or CPU very slow)
- Setup complexity (Ollama, llama.cpp, vLLM, etc.)
- Model quality may lag behind commercial APIs
**Integration**: Would need to implement local inference backends (separate project scope).
---
## Scalability Limits
### API Rate Limits
**OpenRouter**:
- Free tier: Very limited (few RPM)
- Paid: Varies by model, but typical ~10-30 requests/second
- Embedding API has separate limits
**Mitigation**:
- Reduce concurrency (see tuning)
- Add exponential backoff (already have for embeddings)
- Batch embedding requests (already done)
### SearxNG Limits
**Single instance**:
- Can handle ~10-50 QPS depending on hardware
- Upstream search engines may rate limit per instance
- Memory ~100-500MB
**Mitigation**:
- Run multiple SearxNG instances behind load balancer
- Use different public instances
- Implement client-side rate limiting (currently only per-URL fetches limited, not searches)
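Client-side rate limiting for searches could follow a standard token-bucket pattern. A minimal sketch (illustrative only — not existing OpenQuery code):

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=5)
# A burst of rapid calls: the first 5 succeed, the rest are throttled.
print(sum(bucket.try_acquire() for _ in range(10)))
```

A blocking variant would sleep until the next token instead of returning `False`; the same bucket could gate `SearxngClient.SearchAsync` calls.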
### Network Bandwidth
**Typical data transfer**:
- Searches: 1KB per query × 3 = 3KB
- Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
- Extracted text: ~10% of HTML size = 150-750KB
- Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
- Final answer: 2-10KB
**Total**: ~3-10MB per query
**100 queries/hour**: ~300MB-1GB data transfer
**Not an issue** for broadband, but could matter on metered connections.
---
## Scaling with Chunk Count
Let:
- C = number of chunks with valid embeddings
- d = embedding dimension (1536)
- B = embedding batch size (300)
- P = max parallel embedding batches (4)
**Embedding Time** ≈ `O(C/B × 1/P)` (batches divided by parallelism)
**Ranking Time** ≈ `O(C × d)` (dot product per chunk)
**Context Tokens** (for final answer) ≈ `C * avg_chunk_tokens` (≈ 500 chars = 125 tokens)
**As C increases**:
- Embedding time: linear in C/B (sublinear if batch fits in one)
- Ranking time: linear in C
- Final answer latency: more tokens in context → longer context processing + potentially longer answer (more relevant chunks to synthesize)
**Practical limit**:
- With defaults, C ~ 50-100 (from 15 articles)
- Could reach C ~ 500-1000 if:
- `--queries` = 10
- `--results` = 20 (200 URLs)
- Many articles long → many chunks each
- At C = 1000:
- Embeddings: 1000/300 ≈ 4 batches; with 4 running in parallel, all complete in ≈ one batch duration
- But OpenRouter may have per-minute limits on embedding requests
- Ranking: 1000 × 1536 = 1.5M FLOPs → still <0.01s
- Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so fits, but expensive and slow.
**Conclusion**: Current defaults scale to C ~ 100-200 comfortably. Beyond that:
- Need to increase batch size or parallelism for embeddings
- May hit embedding API rate limits
- Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long context)
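The estimates above can be computed directly from the formulas. A sketch, where the 2-second per-batch latency is an assumed figure consistent with this document's embedding timings:

```python
import math

def scaling_estimate(C, d=1536, B=300, P=4,
                     batch_latency_s=2.0, tokens_per_chunk=125):
    """Back-of-the-envelope scaling in chunk count C, using the
    definitions above (d = dimension, B = batch size, P = parallelism)."""
    batches = math.ceil(C / B)
    embed_waves = math.ceil(batches / P)  # sequential waves of P parallel batches
    return {
        "embedding_batches": batches,
        "embedding_time_s": embed_waves * batch_latency_s,
        "ranking_flops": C * d,               # one dot product per chunk
        "context_tokens": C * tokens_per_chunk,
    }

# At C = 1000: 4 batches in one parallel wave, trivial ranking cost,
# but 125K context tokens for the final answer.
print(scaling_estimate(1000))
```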
---
## Profiling
### CPU Profiling
Use `dotnet-trace` or `perf`:
```bash
# Collect trace for 30 seconds while running query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 30s -o trace.nettrace
# Analyze with Visual Studio or PerfView
```
Look for:
- Hot methods: `ChunkingService.ChunkText`, `EmbeddingService.GetEmbeddingsAsync`, cosine similarity
- Allocation hotspots
### Memory Profiling
```bash
dotnet-gcdump collect -p <pid>
# Open in VS or dotnet-gcdump analyze
```
Check heap size, object counts (look for large `string` objects from article content).
### Network Profiling
Use `tcpdump` or `wireshark`:
```bash
tcpdump -i any port 8002 or port 443 -w capture.pcap
```
Or simpler: `time` on individual curl commands to measure latency components.
---
## Next Steps
- [Configuration](../configuration.md) - Tune for your environment
- [Troubleshooting](../troubleshooting.md) - Diagnose slow performance
- [Architecture](../architecture.md) - Understand pipeline bottlenecks
---
**Quick Tuning Cheatsheet**
```bash
# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"

# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"
```

Custom code edit for concurrency, in `SearchTool.cs`:

```csharp
_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 20,  // if network can handle
    MaxConcurrentEmbeddingRequests = 8 // if API allows
};
```