docs: add comprehensive documentation with README and detailed guides

- Add user-friendly README.md with quick start guide
- Create docs/ folder with structured technical documentation:
  - installation.md: Build and setup instructions
  - configuration.md: Complete config reference
  - usage.md: CLI usage guide with examples
  - architecture.md: System design and patterns
  - components/: Deep dive into each component (OpenQueryApp, SearchTool, Services, Models)
  - api/: CLI reference, environment variables, programmatic API
  - troubleshooting.md: Common issues and solutions
  - performance.md: Latency, throughput, and optimization
- All documentation fully cross-referenced with internal links
- Covers project overview, architecture, components, APIs, and support

See individual files for complete documentation.
This commit is contained in:
OpenQuery Documentation
2026-03-19 10:01:58 +01:00
parent b28d8998f7
commit 65ca2401ae
16 changed files with 7073 additions and 0 deletions

docs/troubleshooting.md (699 additions)

# Troubleshooting
Solve common issues, errors, and performance problems with OpenQuery.
## 📋 Table of Contents
1. [Common Errors](#common-errors)
2. [Performance Issues](#performance-issues)
3. [Debugging Strategies](#debugging-strategies)
4. [Getting Help](#getting-help)
## Common Errors
### ❌ "API Key is missing"
**Error Message**:
```
[Error] API Key is missing. Set OPENROUTER_API_KEY environment variable or run 'configure -i' to set it up.
```
**Cause**: No API key available from environment or config file.
**Solutions**:
1. **Set environment variable** (temporary):
```bash
export OPENROUTER_API_KEY="sk-or-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```
2. **Configure interactively** (persistent):
```bash
openquery configure -i
# Follow prompts to enter API key
```
3. **Check config file**:
```bash
cat ~/.config/openquery/config
# Should contain: ApiKey=sk-or-...
```
4. **Verify environment**:
```bash
echo $OPENROUTER_API_KEY
# If empty, the variable was not exported, or was exported in a different shell
```
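The lookup steps above can be folded into one helper. A minimal sketch, assuming the tool's precedence is environment variable first, then config file (inputs are passed explicitly so the logic is easy to exercise; the real tool's lookup may differ):

```shell
# Hypothetical helper: report where the API key would come from.
key_source() {
  env_key="$1"
  config_file="$2"
  if [ -n "$env_key" ]; then
    echo "environment"
  elif grep -q '^ApiKey=' "$config_file" 2>/dev/null; then
    echo "config"
  else
    echo "none"
  fi
}

# Usage with your real environment and config path:
key_source "$OPENROUTER_API_KEY" ~/.config/openquery/config
```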
---
### ❌ "Network request failed"
**Error Message**:
```
[Error] Network request failed. Details: Name or service not known
```
**Cause**: Cannot reach OpenRouter or SearxNG API endpoints.
**Solutions**:
1. **Check internet connectivity**:
```bash
ping 8.8.8.8
curl https://openrouter.ai
```
2. **Verify SearxNG is running**:
```bash
curl "http://localhost:8002/search?q=test&format=json"
# Should return JSON
```
If connection refused:
```bash
# Start SearxNG if using Docker
docker start searxng
# Or run fresh
docker run -d --name searxng -p 8002:8080 searxng/searxng:latest
```
3. **Check firewall/proxy**:
```bash
# Test OpenRouter API
curl -H "Authorization: Bearer $OPENROUTER_API_KEY" \
https://openrouter.ai/api/v1/models
```
4. **Test from different network** (if behind restrictive firewall)
---
### ❌ "No search results found"
**Error Message**:
```
No search results found.
```
**Cause**: Search queries returned zero results from SearxNG.
**Solutions**:
1. **Test SearxNG manually**:
```bash
curl "http://localhost:8002/search?q=test&format=json" | jq '.results | length'
# Should be > 0
```
2. **Check SearxNG configuration**:
- If self-hosted: ensure internet access is enabled in `/etc/searxng/settings.yml`
- Some public instances disable certain engines or have rate limits
3. **Try a different SearxNG instance**:
```bash
export SEARXNG_URL="https://searx.example.com"
openquery "question"
```
4. **Use simpler queries**: Some queries may be too obscure or malformed
5. **Verbose mode to see queries**:
```bash
openquery -v "complex question"
# See what queries were generated
```
---
### ❌ "Found search results but could not extract readable content."
**Cause**: SearxNG returned results but `ArticleService` failed to extract content from all URLs.
**Common Reasons**:
- JavaScript-heavy sites (React, Vue apps) where content loaded dynamically
- Paywalled sites (NYT, academic journals)
- PDFs or non-HTML content
- Malformed HTML
- Server returned error (404, 403, 500)
- `robots.txt` blocked crawler
**Solutions**:
1. **Accept that some sites can't be scraped** - try a different query to get different results
2. **Use site:reddit.com or site:wikipedia.org** - these are usually scrape-friendly
3. **Increase `--results`** to get more URLs (some will work)
4. **Check verbose output**:
```bash
openquery -v "question"
# Look for "Warning: Failed to fetch article"
```
5. **Try a local SearxNG instance with more engines** - some engines fetch different sources
---
### ❌ Rate Limiting (429 Too Many Requests)
**Symptoms**:
```bash
[Error] Response status code does not indicate success: 429 (Too Many Requests).
```
Or retries being exhausted after Polly's retry attempts.
**Cause**: Too many concurrent requests to OpenRouter API.
**Solutions**:
1. **Reduce concurrency** (edit `SearchTool.cs`):
```csharp
var _options = new ParallelProcessingOptions
{
MaxConcurrentArticleFetches = 5, // reduce from 10
MaxConcurrentEmbeddingRequests = 2, // reduce from 4
EmbeddingBatchSize = 150 // reduce from 300
};
```
2. **Add delay** between embedding batches (custom implementation)
3. **Upgrade OpenRouter plan** to higher rate limits
4. **Wait and retry** - rate limits reset after time window
---
### ❌ Slow Performance
**Symptom**: Queries take 60+ seconds when they usually take 20s.
**Diagnosis Steps**:
1. **Run with verbose mode**:
```bash
openquery -v "question"
```
Watch which phase takes longest:
- Query generation?
- Searching?
- Fetching articles?
- Embeddings?
2. **Check network latency**:
```bash
time curl "https://openrouter.ai/api/v1/models"
time curl "http://localhost:8002/search?q=test&format=json"
```
**Common Causes & Fixes**:
| Phase | Cause | Fix |
|-------|-------|-----|
| Searches | SearxNG overloaded/slow | Check CPU/memory, restart container |
| Fetching | Target sites slow | Reduce `--results` to fewer URLs |
| Embeddings | API rate limited | Reduce concurrency (see above) |
| Answer | Heavy model/load | Switch to faster model (e.g., Qwen Flash) |
3. **Resource monitoring**:
```bash
htop # CPU/memory usage
iftop # network throughput
```
4. **Reduce parameters**:
```bash
openquery -q 2 -r 3 -c 2 "question" # lighter load
```
---
### ❌ Out of Memory
**Symptoms**:
- Process killed by OOM killer (Linux)
- `System.OutOfMemoryException`
- System becomes unresponsive
**Cause**: Processing too many large articles simultaneously.
**Why**: Each article can be 100KB+ of text, split into many chunks; each chunk's embedding is ~6KB (1536 floats × 4 bytes). 200 chunks ≈ 1.2MB of embeddings, plus ~100KB of text ≈ 1.3MB. Not huge on its own, but many large articles can produce thousands of chunks.
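The arithmetic above can be reproduced in the shell (the 1536-float dimension is assumed from the text; substitute whatever your embedding model actually returns):

```shell
# Per-chunk embedding size: 1536 floats x 4 bytes each.
bytes_per_chunk=$((1536 * 4))
echo "$bytes_per_chunk"            # → 6144

# 200 chunks of embeddings, in bytes (~1.2MB).
echo $((200 * bytes_per_chunk))    # → 1228800
```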
**Solutions**:
1. **Reduce `--results`** (fewer URLs per query):
```bash
openquery -r 3 "question" # instead of 10
```
2. **Reduce `--queries`** (fewer search queries):
```bash
openquery -q 2 "question"
```
3. **Concurrent fetches are already limited** to 10 by default, which is reasonable
4. **Check article size**: Some sites (PDFs, long documents) may yield megabytes of text; SmartReader should truncate but may not
---
### ❌ Invalid JSON from Query Generation
**Symptom**: Query generation fails silently, falls back to original question.
**Cause**: The LLM returned non-JSON output (even though instructed to). Possible reasons:
- Model not instruction-following
- Output exceeded context window
- API error in response
**Detection**: Run with `-v` to see:
```
[Failed to generate queries, falling back to original question. Error: ...]
```
**Solutions**:
- Try a different model (configure to use Gemini or DeepSeek)
- Reduce `--queries` count (simpler task)
- Tune system prompt (would require code change)
- Accept fallback - the original question often works as sole query
---
### ❌ Spinner Artifacts in Output
**Symptom**: When redirecting output to file, you see weird characters like `⠋`, `<60>`, etc.
**Cause**: Spinner uses Unicode Braille characters and ANSI escape codes.
**Fix**: Use `2>/dev/null | sed 's/.\x08//g'` to clean:
```bash
openquery "question" 2>/dev/null | sed 's/.\x08//g' > answer.md
```
Or run with `--verbose` (no spinner, only newline-separated messages):
```bash
openquery -v "question" > answer.txt
```
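The `sed` filter can be sanity-checked on a synthetic string containing a backspace-overwritten character (GNU sed; BSD sed does not interpret `\x08` in the pattern):

```shell
# 'a' followed by a backspace is stripped; the rest passes through.
printf 'a\x08bhello\n' | sed 's/.\x08//g'
# → bhello
```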
---
### ❌ "The type or namespace name '...' does not exist" (Build Error)
**Cause**: Missing NuGet package or wrong .NET SDK version.
**Solution**:
1. **Verify .NET SDK 10.0**:
```bash
dotnet --version
# Should be 10.x
```
If lower: https://dotnet.microsoft.com/download/dotnet/10.0
2. **Restore packages**:
```bash
dotnet restore
```
3. **Clean and rebuild**:
```bash
dotnet clean
dotnet build
```
4. **Check OpenQuery.csproj** for package references:
```xml
<PackageReference Include="Polly.Core" Version="8.6.6" />
<PackageReference Include="Polly.RateLimiting" Version="8.6.6" />
<PackageReference Include="SmartReader" Version="0.11.0" />
<PackageReference Include="System.CommandLine" Version="2.0.0-beta4.22272.1" />
<PackageReference Include="System.Numerics.Tensors" Version="9.0.0" />
```
If restore fails, these packages may not be available for .NET 10 preview. Consider:
- Downgrade to .NET 8.0 (if packages incompatible)
- Or find package versions compatible with .NET 10
---
### ❌ AOT Compilation Fails
**Error**: `error NETSDK1085: The current .NET SDK does not support targeting .NET 10.0.`
**Cause**: Using .NET SDK older than 10.0.
**Fix**: Install .NET SDK 10.0 preview.
**Or**: Disable AOT for development (edit `.csproj`):
```xml
<!-- Remove or set to false -->
<PublishAot>false</PublishAot>
```
---
## Performance Issues
### Slow First Request
**Expected**: The first query is slower (JIT compilation for the .NET runtime if not AOT, plus initial API connections).
If not using AOT:
- Consider publishing with `/p:PublishAot=true` for production distribution
- Development builds use JIT, which adds 500ms-2s warmup
**Mitigation**: Accept as warmup cost, or pre-warm with dummy query.
---
### High Memory Usage
**Check**:
```bash
ps aux | grep OpenQuery
# Look at RSS (resident set size)
```
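To get just the RSS value for a single PID instead of eyeballing the `ps aux` table, `ps -o rss=` prints only that column (shown here against the current shell's own PID):

```shell
# Print only the resident set size (in KB) of a given PID; $$ = this shell.
ps -o rss= -p $$
```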
**Typical**: 50-200MB (including .NET runtime, AOT code, data structures)
**If >500MB**:
- Likely processing very many articles
- Check `--results` and `--queries` values
- Use `--verbose` to see counts: `[Fetched X search results]`, `[Extracted Y chunks]`
**Reduce**:
- `--queries 2` instead of 10
- `--results 3` instead of 15
- These directly limit number of URLs to fetch
---
### High CPU Usage
**Cause**:
- SmartReader HTML parsing (CPU-bound)
- Cosine similarity calculations (many chunks, but usually fast)
- Spinner animation (negligible)
**Check**: `htop` → which core is at 100%? A single core suggests parsing; all cores suggest parallel fetching.
**Mitigation**:
- Ensure `MaxConcurrentArticleFetches` is not excessively high (the default of 10 is okay)
- Accept it - CPU spikes are normal during the fetch phase
---
### API Costs Higher Than Expected
**Symptom**: OpenRouter dashboard shows high token usage.
**Causes**:
1. Using expensive model (check `OPENROUTER_MODEL`)
2. High `--chunks` → more tokens in context
3. High `--queries` + `--results` → many articles → many embedding tokens (usually cheap)
4. Long answers (many completion tokens) - especially with `--long`
**Mitigation**:
- Use `qwen/qwen3.5-flash-02-23` (cheapest good option)
- Reduce `--chunks` to 2-3
- Use `--short` when detailed answer not needed
- Set `MaxTokens` on the request (this would require a code change)
---
## Debugging Strategies
### 1. Enable Verbose Mode
Always start with:
```bash
openquery -v "question" 2>&1 | tee debug.log
```
Logs everything:
- Generated queries
- URLs fetched
- Progress counts
- Errors/warnings
**Analyze log**:
- How many queries generated? (Should match `--queries`)
- How many search results per query? (Should be ≤ `--results`)
- How many articles fetched successfully?
- How many chunks extracted?
- Any warnings?
---
### 2. Isolate Components
**Test SearxNG**:
```bash
curl "http://localhost:8002/search?q=test&format=json" | jq '.results[0]'
```
**Test OpenRouter API**:
```bash
curl -X POST https://openrouter.ai/api/v1/chat/completions \
-H "Authorization: Bearer $OPENROUTER_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"qwen/qwen3.5-flash-02-23","messages":[{"role":"user","content":"Hello"}]}'
```
**Test Article Fetching** (with known good URL):
```bash
curl -L "https://example.com/article" | head -50
```
Then check if SmartReader can parse.
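As a rough local proxy (not SmartReader itself), you can check whether the fetched HTML contains paragraph tags at all; pages with none rarely yield readable text. The sample file and path here are illustrative:

```shell
# Count <p> tags in a saved page; zero suggests a JS-rendered or non-article page.
cat > /tmp/sample.html <<'EOF'
<html><body><p>First paragraph.</p><p>Second one.</p></body></html>
EOF
grep -o '<p>' /tmp/sample.html | wc -l
# → 2
```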
---
### 3. Reduce Scope
Test with minimal parameters to isolate failing phase:
```bash
# 1 query, 2 results, 1 chunk - should be fast and simple
openquery -q 1 -r 2 -c 1 "simple test question" -v
# If that works, gradually increase:
openquery -q 1 -r 5 -c 1 "simple question"
openquery -q 3 -r 5 -c 1 "simple question"
openquery -q 3 -r 5 -c 3 "simple question"
# Then try complex question
```
---
### 4. Check Resource Limits
**File descriptors**: If fetching many articles, may hit limit.
```bash
ulimit -n # usually 1024, should be fine
```
**Memory**: Monitor with `free -h` while running.
**Disk space**: Not much disk use, but logs could fill if verbose mode used repeatedly.
---
### 5. Examine Config File
```bash
cat ~/.config/openquery/config
# Ensure no spaces around '='
# Correct: ApiKey=sk-or-...
# Wrong: ApiKey = sk-or-... (spaces become part of value)
```
Reconfigure if needed:
```bash
openquery configure --key "sk-or-..."
```
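The no-spaces rule can also be checked mechanically. A hedged sketch (the pattern simply flags any whitespace adjacent to `=`; file path and entries are illustrative):

```shell
# Flag config lines with whitespace around '=' in a sample file.
cat > /tmp/config-sample <<'EOF'
ApiKey=sk-or-good
Model = bad-entry
EOF
grep -nE '[[:space:]]=|=[[:space:]]' /tmp/config-sample
# → 2:Model = bad-entry
```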
---
### 6. Clear Cache / Reset
No persistent cache exists, but:
- Re-start SearxNG container: `docker restart searxng`
- Clear DNS cache if network issues: `sudo systemd-resolve --flush-caches` (or `resolvectl flush-caches` on newer systemd)
---
## Getting Help
### Before Asking
Gather information:
1. **OpenQuery version** (commit or build date if available)
2. **OS and architecture**: `uname -a` (Linux/macOS) or `systeminfo` (Windows)
3. **Full command** you ran
4. **Verbose output**: `openquery -v "question" 2>&1 | tee log.txt`
5. **Config** (redact API key):
```bash
sed 's/ApiKey=.*/ApiKey=REDACTED/' ~/.config/openquery/config
```
6. **SearxNG test**:
```bash
curl -s "http://localhost:8002/search?q=test&format=json" | jq '.results | length'
```
7. **OpenRouter test**:
```bash
curl -s -H "Authorization: Bearer $OPENROUTER_API_KEY" \
https://openrouter.ai/api/v1/models | jq '.data[0].id'
```
---
### Where to Ask
1. **GitHub Issues** (if repository hosted there):
- Search existing issues first
- Provide all info from above
- Include log file (or link to gist)
2. **Community Forum** (if exists)
3. **Self-Diagnose**:
- Check `docs/troubleshooting.md` (this file)
- Check `docs/configuration.md`
- Check `docs/usage.md`
---
### Example Bug Report
```
Title: OpenQuery hangs on "Fetching article X/Y"
Platform: Ubuntu 22.04, .NET 10.0, OpenQuery built from commit abc123
Command: openquery -v "What is Docker?" 2>&1 | tee log.txt
Verbose output shows:
[...]
[Fetching article 1/15: docker.com]
[Fetching article 2/15: hub.docker.com]
[Fetching article 3/15: docs.docker.com]
# Hangs here indefinitely, no more progress
SearxNG test:
$ curl "http://localhost:8002/search?q=docker&format=json" | jq '.results | length'
15 # SearxNG works
Config:
ApiKey=sk-or-xxxx (redacted)
Model=qwen/qwen3.5-flash-02-23
DefaultQueries=3
DefaultChunks=3
DefaultResults=5
Observation:
- Fetches 3 articles fine, then stalls
- Nothing in log after "Fetching article 3/15"
- Process uses ~150MB memory, CPU 0% (idle)
- Ctrl+C exits immediately
Expected: Should fetch remaining 12 articles (concurrent up to 10)
Actual: Only 3 fetched, then silent hang
```
---
## Known Issues
### Issue: Spinner Characters Not Displaying
Some terminals don't support Braille Unicode patterns.
**Symptoms**: Spinner shows as `?` or boxes.
**Fix**: Use a font with Unicode support, or disable the spinner by setting `TERM=dumb` or by using `--verbose`.
---
### Issue: Progress Messages Overwritten
In very fast operations, progress updates may overlap.
**Cause**: `StatusReporter` uses `Console.Write` without lock in compact mode; concurrent writes from channel processor and spinner task could interleave.
**Mitigation**: Unlikely in practice (the channel serializes writes, and the spinner only updates when `_currentMessage` is set). If it becomes a problem, add a lock around Console operations.
---
### Issue: Articles with No Text Content
Some URLs return articles with empty `TextContent`.
**Cause**: SmartReader's quality heuristic (`IsReadable`) failed, or article truly has no text (image, script, error page).
**Effect**: Those URLs contribute zero chunks.
**Acceptable**: Part of normal operation; not all URLs yield readable content.
---
### Issue: Duplicate Sources in Answer
Same website may appear multiple times (different articles).
**Cause**: Different URLs from different search results may be from same domain but different pages.
**Effect**: `[Source 1]` and `[Source 3]` could both be `example.com`. Not necessarily bad - they're different articles.
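To see how often each domain repeats among your sources, a quick one-liner (the URL list is illustrative):

```shell
# Reduce source URLs to their domains and count occurrences.
printf '%s\n' \
  'https://example.com/a' \
  'https://example.com/b' \
  'https://other.org/c' \
| sed -E 's#https?://([^/]+)/.*#\1#' | sort | uniq -c
# two hits for example.com, one for other.org
```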
---
## Performance Tuning Reference
| Setting | Default | Fastest | Most Thorough | Notes |
|---------|---------|---------|---------------|-------|
| `--queries` | 3 | 1 | 8+ | More queries = more searches |
| `--results` | 5 | 2 | 15+ | Fewer = fewer articles to fetch |
| `--chunks` | 3 | 1 | 5+ | More chunks = more context tokens |
| `MaxConcurrentArticleFetches` | 10 | 5 | 20 | Higher = more parallel fetches |
| `MaxConcurrentEmbeddingRequests` | 4 | 2 | 8 | Higher = faster embeddings (may hit rate limits) |
| `EmbeddingBatchSize` | 300 | 100 | 1000 | Larger = fewer API calls, more data per call |
**Start**: Defaults are balanced.
**Adjust if**:
- Slow: Reduce `--results`, `--queries`, or concurrency limits
- Poor quality: Increase `--chunks`, `--results`, `--queries`
- Rate limited: Reduce concurrency limits
- High cost: Use `--short`, reduce `--chunks`, choose cheaper model
---
## Next Steps
- [Performance](../performance.md) - Detailed performance analysis
- [Configuration](../configuration.md) - Adjust settings
- [Usage](../usage.md) - Optimize workflow
---
**Quick Diagnostic Checklist**
```bash
# 1. Check API key
echo $OPENROUTER_API_KEY | head -c 10
# 2. Test SearxNG
curl -s "http://localhost:8002/search?q=test&format=json" | jq '.results | length'
# 3. Test OpenRouter
curl -s -H "Authorization: Bearer $OPENROUTER_API_KEY" \
https://openrouter.ai/api/v1/models | jq '.data[0].id'
# 4. Run verbose
openquery -v "test" 2>&1 | grep -E "Fetching|Generated|Found"
# 5. Check resource usage while running
htop
# 6. Reduce scope and retry
openquery -q 1 -r 2 -c 1 "simple test"
```