
Performance

Performance characteristics, optimization strategies, and scalability considerations for OpenQuery.

📋 Table of Contents

  1. Performance Overview
  2. Latency Breakdown
  3. Throughput
  4. Memory Usage
  5. Benchmarking
  6. Optimization Strategies
  7. Scalability Limits
  8. Profiling
  9. Quick Tuning Cheatsheet

Performance Overview

OpenQuery is designed for low-latency interactive use (15-50 seconds end-to-end) while maximizing parallelization to minimize wait time.

Key Metrics

| Metric | Typical | Best Case | Worst Case |
|---|---|---|---|
| End-to-End Latency | 15-50s | 10s | 120s+ |
| API Cost | $0.01-0.05 | $0.005 | $0.20+ |
| Memory Footprint | 100-300MB | 50MB | 1GB+ |
| Network I/O | 5-20MB | 1MB | 100MB+ |

Note: Wide variance due to network latency, content size, and LLM speed.


Latency Breakdown

Default Configuration

-q 3 -r 5 -c 3 (3 queries, 5 results each, 3 final chunks)

| Stage | Operation | Parallelism | Time (p50) | Time (p95) | Dominant Factor |
|---|---|---|---|---|---|
| 1 | Query Generation | 1 | 2-5s | 10s | LLM inference speed |
| 2a | Searches (3 queries × 5 results) | 3 concurrent | 3-8s | 15s | SearxNG latency |
| 2b | Article Fetching (≈15 URLs) | 10 concurrent | 5-15s | 30s | Each site's response time |
| 2c | Chunking | 10 concurrent | <1s | 2s | CPU (HTML parsing) |
| 3a | Query Embedding | 1 | 0.5-1s | 3s | Embedding API latency |
| 3b | Chunk Embeddings (≈50 chunks) | 4 concurrent | 1-3s | 10s | Batch API latency |
| 4 | Ranking | 1 | <0.1s | 0.5s | CPU (vector math) |
| 5 | Final Answer Streaming | 1 | 5-20s | 40s | LLM generation speed |
| **Total** | | | 16-50s | ~60s | |

Phase Details

Phase 1: Query Generation (2-5s)

  • Single non-streaming LLM call
  • Input: system prompt + user question (~200 tokens)
  • Output: JSON array of 3-5 short strings (~50 tokens)
  • Fast because small context and output

Phase 2a: Searches (3-8s)

  • 3 parallel SearxngClient.SearchAsync calls
  • Each: query → SearxNG → aggregator engines → scraped results
  • Latency highly variable based on:
    • SearxNG instance performance
    • Network distance to SearxNG
    • SearxNG's upstream search engines

Phase 2b: Article Fetching (5-15s)

  • ≈15 URLs to fetch (3 queries × 5 results minus duplicates)
  • Up to 10 concurrent fetches (semaphore)
  • Each: TCP connect + TLS handshake + HTTP GET + SmartReader parse
  • Latency:
    • Fast sites (CDN, cached): 200-500ms
    • Normal sites: 1-3s
    • Slow/unresponsive sites: timeout after ~30s

Why 5-15s for 15 URLs with 10 concurrent?

  • First wave (10 URLs): bounded by the slowest of the 10, ≈3s
  • Second wave (5 URLs): another ≈3s → ≈6s total
  • If most URLs are fast (≈500ms), the total can drop to ≈2-3s
  • But a few slow sites (5-10s) dominate the total when present

Tail latency: Slowest few URLs can dominate total time. Cannot proceed until all fetch attempts complete (or fail).
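The wave arithmetic above can be captured in a small, language-agnostic scheduling model (sketched in Python here; the actual implementation uses a .NET semaphore): each fetch occupies the earliest free slot, so a handful of slow sites sets the total.

```python
import heapq

def fetch_makespan(latencies, max_concurrent):
    """Estimate wall-clock time to fetch URLs with bounded concurrency:
    each fetch starts as soon as a slot frees, and the total is the
    finish time of the last slot."""
    slots = [0.0] * max_concurrent      # finish time of each concurrency slot
    heapq.heapify(slots)
    for t in latencies:
        start = heapq.heappop(slots)    # earliest free slot
        heapq.heappush(slots, start + t)
    return max(slots)
```

For 15 URLs at 3s each with 10 slots this gives the two-wave 6s figure; a single 10s straggler among fast URLs pushes the makespan to 10s, matching the tail-latency note above.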

Phase 2c: Chunking (<1s)

  • CPU-bound HTML cleaning and splitting
  • SmartReader is fast for a managed (C#) HTML parser
  • Typically 100-300 chunks total
  • <1s on modern CPU

Phase 3: Embeddings (1.5-4s)

  • Query embedding: 1 call, ~200 tokens, ≈0.5-1s
  • Chunk embeddings: ≈50 chunks → 1 batch (batch size 300, so one batch suffices)
    • A batch of 50 is still a single API call, ~6K tokens (50 × ~500 chars ≈ 25K chars)
    • With text-embedding-3-small at $0.00002 per 1K tokens → ~$0.0001 per batch
    • Latency: 1-3s for the embedding API

With more chunks (say 500), there would be 2 batches → maybe 2-4s.

Parallel batches (4 concurrent) only start to help once there are multiple batches (1500+ chunks → 5+ batches).

Phase 4: Ranking (<0.1s)

  • Cosine similarity for 50-100 chunks
  • Each: dot product + normalization, O(d) with d = 1536
  • 100 × 1536 ≈ 150K multiply-adds → negligible on a modern CPU
  • SIMD acceleration from TensorPrimitives
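The ranking math is plain cosine similarity per chunk. A minimal Python sketch of the idea (the real code uses C# TensorPrimitives with SIMD; `rank_chunks` is an illustrative name, not a method from the codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_chunks(query_emb, chunk_embs, top_k=3):
    """Score every chunk against the query embedding, keep the best top_k."""
    scores = [(cosine_similarity(query_emb, e), i) for i, e in enumerate(chunk_embs)]
    return sorted(scores, reverse=True)[:top_k]
```

At 100 chunks × 1536 dimensions even this naive loop finishes in milliseconds, which is why Phase 4 never shows up in profiles.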

Phase 5: Final Answer (5-20s)

  • Streaming chat completion
  • Input: system prompt + context (a few hundred tokens for 3 × 500-char chunks) + question
  • Output: varies wildly (200-2000 tokens typically)
  • Longer context slightly increases latency
  • Model choice major factor:
    • Qwen Flash: fast (5-10s for 1000 output tokens)
    • Gemini Flash: moderate (10-15s)
    • Llama-class: slower (20-40s)

Throughput

Sequential Execution

Running queries one after another (default CLI behavior):

  • Latency per query: 16-50s
  • Throughput: 1 query / 20s ≈ 180 queries/hour (theoretically)

But API rate limits will kick in before that:

  • OpenRouter free tier: limited RPM/TPM
  • Even paid: soft limits

Concurrent Execution (Multiple OpenQuery Instances)

You could run multiple OpenQuery processes in parallel (different terminals), but they share:

  • Same API key (OpenRouter rate limit is per API key, not per process)
  • Same SearxNG instance (could saturate it)

Practical: 3-5 concurrent processes before hitting diminishing returns or rate limits.

Throughput Optimization

To maximize queries per hour:

  1. Use fastest model (Qwen Flash)
  2. Reduce --chunks to 1-2
  3. Reduce --queries to 1
  4. Use local/fast SearxNG
  5. Cache embedding results (not implemented)
  6. Batch multiple questions in one process (not implemented; would require redesign)

Achievable: Maybe 500-1000 queries/hour on paid OpenRouter plan with aggressive settings.


Memory Usage

Baseline

.NET 10 AOT app with dependencies:

  • Code: ~30MB (AOT compiled native code)
  • Runtime: ~20MB (.NET runtime overhead)
  • Base Memory: ~50MB

Per-Query Memory

| Component | Memory | Lifetime |
|---|---|---|
| Search results (15 items) | ~30KB | Pipeline |
| Articles (raw HTML) | ~5MB (transient) | Freed after parse |
| Articles (extracted text) | ~500KB | Until pipeline complete |
| Chunks (≈100 items, text) | ~50KB | Until pipeline complete |
| Embeddings (100 × 1536 floats) | ~600KB | Until pipeline complete |
| HTTP buffers | ~1MB per concurrent request | Short-lived |
| **Total per query** | ~2-5MB (excluding base) | Released after completion |

Peak: When all articles fetched but not yet embedded, we have text ~500KB + chunks ~650KB = ~1.2MB + overhead ≈ 2-3MB.

If processing many queries in parallel (unlikely for CLI), memory would scale linearly.
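As a sanity check on the table, a back-of-the-envelope calculator (Python sketch; all sizes are the table's estimates, not measurements):

```python
def per_query_memory_kb(n_chunks=100, dim=1536, text_kb=500, chunk_text_kb=50):
    """Rough steady-state memory per query: extracted article text,
    chunk text, and float32 embeddings (n_chunks * dim * 4 bytes)."""
    embeddings_kb = n_chunks * dim * 4 / 1024
    return text_kb + chunk_text_kb + embeddings_kb
```

With the defaults this comes to ~1150KB (~1.1MB), consistent with the ~1.2MB peak figure below; HTTP buffers and runtime overhead account for the rest of the 2-5MB range.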

Memory Leak Risks

  • HttpClient instances: created per OpenRouterClient and SearxngClient; they should be disposed but currently are not. Acceptable only because the short-lived process exits anyway.
  • StatusReporter background task: Disposed via using
  • RateLimiter semaphore: Disposed via IAsyncDisposable if wrapped in using (not currently, but short-lived)

No major leaks observed.

Memory Optimization Opportunities

  1. Reuse HttpClient with IHttpClientFactory (but not needed for CLI)
  2. Stream article fetching instead of buffering all articles before embedding (possible: embed as URLs complete)
  3. Early chunk filtering: Discard low-quality chunks before embedding to reduce embedding count
  4. Cache embeddings: By content hash, avoid re-embedding seen text (would need persistent storage)

Benchmarking

Methodology

Measure with time command and verbose logging:

time openquery -v "What is quantum entanglement?" 2>&1 | tee log.txt

Parse log for timestamps (or add them manually by modifying code).
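A helper for the "parse log for timestamps" step might look like this (Python sketch; the `[HH:MM:SS]` prefix format is hypothetical — adjust the regex to whatever your verbose log actually emits):

```python
import re
from datetime import datetime

def stage_durations(log_lines):
    """Extract '[HH:MM:SS] <stage name>' lines and compute how long each
    stage ran, as the gap to the next timestamped line."""
    stamped = []
    for line in log_lines:
        m = re.match(r"\[(\d{2}:\d{2}:\d{2})\]\s+(.*)", line)
        if m:
            stamped.append((datetime.strptime(m.group(1), "%H:%M:%S"), m.group(2)))
    return [
        (name, (nxt - t).total_seconds())
        for (t, name), (nxt, _) in zip(stamped, stamped[1:])
    ]
```

Lines without a timestamp (stack traces, streamed answer text) are skipped, so the output maps directly onto the stage breakdown below.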

Sample Benchmark

Environment:

  • Linux x64, .NET 10 AOT
  • SearxNG local Docker (localhost:8002)
  • OpenRouter API (US East)
  • Model: qwen/qwen3.5-flash-02-23

Run 1:

real    0m23.4s
user    0m1.2s
sys     0m0.3s

Log breakdown:

  • Query generation: 3.2s
  • Searches: 4.1s
  • Article fetching: 8.7s (12 URLs)
  • Embeddings: 2.8s (45 chunks)
  • Final answer: 4.6s (325 tokens)

Run 2 (cached SearxNG results, same URLs):

real    0m15.8s

Faster article fetching (2.3s): sites had cached the response or simply responded faster on the second request.

Run 3 (with -s for a short answer):

real    0m18.2s

Final answer faster (2.1s instead of 4.6s) due to shorter output.

Benchmarking Tips

  1. Warm up: First run slower (JIT or AOT cold start). Discard first measurement.
  2. Network variance: Run multiple times and average.
  3. Control variables: Same question, same SearxNG instance, same network conditions.
  4. Measure API costs: Check OpenRouter dashboard for token counts.
  5. Profile with dotTrace or perf if investigating CPU bottlenecks.

Optimization Strategies

1. Tune Concurrent Limits

Edit SearchTool.cs where _options is created:

_options = new ParallelProcessingOptions
{
    MaxConcurrentArticleFetches = 5,       // ↓ from 10
    MaxConcurrentEmbeddingRequests = 2,    // ↓ from 4
    EmbeddingBatchSize = 300               // ↑ or ↓ (rarely matters)
};

Why tune down?

  • Hit OpenRouter rate limits
  • Network bandwidth saturated
  • Too many concurrent fetches overwhelm target sites (ethical/scraping etiquette)

Why tune up?

  • Fast network, powerful CPU, no rate limits
  • Many chunks (>500) needing parallel embedding batches

Monitor:

  • openquery -v shows embedding progress: [Generating embeddings: batch X/Y]
  • If Y=1 (all chunks fit in one batch), the batch size is fine
  • If Y>1 and max concurrent = Y, you're using full parallelism

2. Reduce Data Volume

Fewer search results:

openquery -r 3 "question"  # instead of 5 or 10

Effect: Fetches fewer URLs, extracts fewer chunks. Linear reduction in work.

Fewer queries:

openquery -q 1 "question"

Effect: One search instead of N. Quality may suffer (less diverse sources).

Fewer chunks:

openquery -c 1 "question"

Effect: Only top 1 chunk in context → fewer tokens → faster final answer, but may miss relevant info.

Chunk size (compile-time constant): Edit ChunkingService.cs:

private const int MAX_CHUNK_SIZE = 300;  // instead of 500

Effect: more, shorter chunks → more granular ranking, but also more embeddings to generate and more chunks to rank. Could increase or decrease total time. Likely more total context tokens, since -c selects a fixed number of chunks regardless of their size.
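To see the trade-off concretely, a naive fixed-size splitter (Python sketch; the real ChunkingService also strips HTML and splits on sensible boundaries, which this ignores):

```python
def chunk_text(text, max_chunk_size=500):
    """Split text into fixed-size chunks; smaller max_chunk_size means
    more chunks from the same article, hence more embeddings to generate."""
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]
```

A 1500-char article yields 3 chunks at size 500 but 5 chunks at size 300 — a ~67% increase in embedding and ranking work for the same source text.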

3. Change Embedding Model

Currently hardcoded to openai/text-embedding-3-small. Could use:

  • openai/text-embedding-3-large (higher quality, slower, more expensive)
  • intfloat/multilingual-e5-large (multilingual, smaller)

Modify EmbeddingService constructor:

public EmbeddingService(OpenRouterClient client, string embeddingModel = "your-model")

Then pass:

var embeddingService = new EmbeddingService(client, "intfloat/multilingual-e5-large");

Impact: Different dimensionality (1536 vs 1024 vs 4096). Memory scales with dim. Quality may vary for non-English queries.

4. Caching

Current: No caching. Every query hits all APIs.

Embedding cache (by text hash):

  • Could store in memory: Dictionary<string, float[]>
  • Or disk: ~/.cache/openquery/embeddings/
  • Invalidation: embeddings are deterministic per model, so long-term cache viable
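A minimal in-memory version of that cache could look like this (Python sketch; `EmbeddingCache` and its methods are illustrative names, not part of the codebase):

```python
import hashlib

class EmbeddingCache:
    """Embedding cache keyed by (model, content hash). Embeddings are
    deterministic per model, so entries never need to expire."""

    def __init__(self):
        self._store = {}

    def _key(self, model, text):
        return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())

    def get_or_compute(self, model, text, compute):
        key = self._key(model, text)
        if key not in self._store:
            self._store[key] = compute(text)   # API call happens only on a miss
        return self._store[key]
```

Keying on the model name matters: switching embedding models must not return vectors of the wrong dimensionality. A disk-backed variant would serialize the same mapping under `~/.cache/openquery/embeddings/`.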

Search cache (by query hash):

  • Cache List<SearxngResult> for identical queries
  • TTL: maybe 1 hour (search results change over time)

Article cache (by URL hash):

  • Cache Article (text content) per URL
  • Invalidation: could check Last-Modified header or use TTL (1 day)

Implementation effort: Medium. Would need cache abstraction (interface, in-memory + disk options).

Benefit: Repeat queries (common in testing or similar questions) become instant.

5. Parallelize More (Aggressive)

Currently:

  • Searches: unbounded (as many as --queries)
  • Fetches: max 10
  • Embeddings: max 4

Could increase:

  • Fetches to 20 or 50 (if network/CPU can handle)
  • Embeddings to 8-16 (if OpenRouter rate limit allows)

Risk:

  • Overwhelming target sites (unethical scraping)
  • API rate limits → 429 errors
  • Local bandwidth saturation

6. Local Models (Self-Hosted)

Replace OpenRouter with local LLM:

  • Query generation: Could run tiny model locally (no API latency)
  • Embeddings: Could run all-MiniLM-L6-v2 locally (fast, free after setup)
  • Answer: Could run Llama 3 8B locally (no cost, but slower than GPT-4/Gemini)

Benefits:

  • Zero API costs (after hardware)
  • No network latency
  • Unlimited queries

Drawbacks:

  • GPU required for decent speed (or CPU very slow)
  • Setup complexity (Ollama, llama.cpp, vLLM, etc.)
  • Model quality may lag behind commercial APIs

Integration: Would need to implement local inference backends (separate project scope).


Scalability Limits

API Rate Limits

OpenRouter:

  • Free tier: Very limited (few RPM)
  • Paid: Varies by model, but typical ~10-30 requests/second
  • Embedding API has separate limits

Mitigation:

  • Reduce concurrency (see tuning)
  • Add exponential backoff (already have for embeddings)
  • Batch embedding requests (already done)

SearxNG Limits

Single instance:

  • Can handle ~10-50 QPS depending on hardware
  • Upstream search engines may rate limit per instance
  • Memory ~100-500MB

Mitigation:

  • Run multiple SearxNG instances behind load balancer
  • Use different public instances
  • Implement client-side rate limiting (currently only per-URL fetches limited, not searches)

Network Bandwidth

Typical data transfer:

  • Searches: 1KB per query × 3 = 3KB
  • Articles: 100-500KB per fetch × 15 = 1.5-7.5MB (raw HTML)
  • Extracted text: ~10% of HTML size = 150-750KB
  • Embeddings: 100 chunks × 1536 × 4 bytes = 600KB (request + response)
  • Final answer: 2-10KB

Total: ~3-10MB per query

100 queries/hour: ~300MB-1GB data transfer

Not an issue for broadband, but could matter on metered connections.


Scaling with Chunk Count

Let:

  • C = number of chunks with valid embeddings
  • d = embedding dimension (1536)
  • B = embedding batch size (300)
  • P = max parallel embedding batches (4)

  • Embedding time ≈ O(C/B × 1/P) (batches divided by parallelism)
  • Ranking time ≈ O(C × d) (dot product per chunk)
  • Context tokens (for final answer) ≈ C × avg_chunk_tokens (≈500 chars ≈ 125 tokens)

As C increases:

  • Embedding time: linear in C/B (sublinear if batch fits in one)
  • Ranking time: linear in C
  • Final answer latency: more tokens in context → longer context processing + potentially longer answer (more relevant chunks to synthesize)

Practical limit:

  • With defaults, C ~ 50-100 (from 15 articles)
  • Could reach C ~ 500-1000 if:
    • --queries = 10
    • --results = 20 (200 URLs)
    • Many articles long → many chunks each
  • At C = 1000:
    • Embeddings: ⌈1000/300⌉ = 4 batches; with 4 parallel slots all four run at once → time ≈ one batch duration
    • But OpenRouter may have per-minute limits on embedding requests
    • Ranking: 1000 × 1536 = 1.5M FLOPs → still <0.01s
    • Context tokens: 1000 × 125 = 125K tokens! Many LLMs have 200K context, so fits, but expensive and slow.
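The arithmetic above, parameterized over the symbols defined earlier (Python sketch):

```python
import math

def scaling_estimates(C, d=1536, B=300, P=4, avg_chunk_tokens=125):
    """Scaling with chunk count C: number of embedding batches, sequential
    batch steps (P batches run in parallel), ranking multiply-adds, and
    worst-case context tokens if every chunk went into the prompt."""
    batches = math.ceil(C / B)
    sequential_steps = math.ceil(batches / P)
    ranking_mults = C * d
    context_tokens = C * avg_chunk_tokens
    return batches, sequential_steps, ranking_mults, context_tokens
```

Plugging in C = 1000 reproduces the figures above: 4 batches, one parallel step, ~1.5M multiply-adds for ranking, and 125K worst-case context tokens.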

Conclusion: Current defaults scale to C ~ 100-200 comfortably. Beyond that:

  • Need to increase batch size or parallelism for embeddings
  • May hit embedding API rate limits
  • Context token count becomes expensive and may degrade answer quality (LLMs lose focus in very long context)

Profiling

CPU Profiling

Use dotnet-trace or perf:

# Collect trace for 30 seconds while running query
dotnet-trace collect --process-id $(pgrep OpenQuery) --duration 30s -o trace.nettrace

# Analyze with Visual Studio or PerfView

Look for:

  • Hot methods: ChunkingService.ChunkText, EmbeddingService.GetEmbeddingsAsync, cosine similarity
  • Allocation hotspots

Memory Profiling

dotnet-gcdump collect -p <pid>
# Open in VS or dotnet-gcdump analyze

Check heap size, object counts (look for large string objects from article content).

Network Profiling

Use tcpdump or wireshark:

tcpdump -i any port 8002 or port 443 -w capture.pcap

Or simpler: time on individual curl commands to measure latency components.



Quick Tuning Cheatsheet

# Fast & cheap (factual Q&A)
openquery -q 1 -r 3 -c 2 -s "What is X?"

# Thorough (research)
openquery -q 5 -r 10 -c 5 -l "Deep dive on X"

# Custom code edit for concurrency
# In SearchTool.cs:
_options = new ParallelProcessingOptions {
    MaxConcurrentArticleFetches = 20,  // if network can handle
    MaxConcurrentEmbeddingRequests = 8  // if API allows
};