# AI Search Benchmark
Skylattice should be evaluated through isolated web-search reviews, not through one long conversational thread that contaminates later judgments.
## Review Model
Use four isolated agents or sessions for each review window:
- Agent A: English discovery
- Agent B: Chinese discovery
- Agent C: technical indexing and citation surface review
- Agent D: external authority scout
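If the schedule is scripted rather than tracked by hand, the matrix can be expressed directly as data. The sketch below is illustrative only; `REVIEW_AGENTS`, `REVIEW_WINDOWS`, and `review_plan` are hypothetical names, not existing tooling.

```python
# Hypothetical review matrix mirroring the agent list above.
REVIEW_AGENTS = {
    "A": "English discovery",
    "B": "Chinese discovery",
    "C": "technical indexing and citation surface review",
    "D": "external authority scout",
}

REVIEW_WINDOWS = ["Day 0", "Day 7", "Day 14", "Day 30"]  # plus weekly follow-ups

def review_plan():
    """Yield one isolated (window, agent, role) session per combination."""
    for window in REVIEW_WINDOWS:
        for agent, role in REVIEW_AGENTS.items():
            yield window, agent, role
```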
## What To Record
- whether Skylattice appears at all for non-brand queries
- whether the first official citation is the Pages site, the GitHub repo, or no official source
- whether the answer's recommendation describes Skylattice accurately
- whether the answer confuses Skylattice with a generic coding agent, chat wrapper, or hosted bot
- whether Chinese snippets are readable and query-aligned
- whether external mentions and directory references have increased
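One way to keep these observations consistent across sessions is a fixed record shape. The sketch below is one possible schema, assuming Python; every field name is illustrative and mirrors the checklist above, not an existing format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewRecord:
    """One row of observations for a single query in a single session."""
    window: str                            # e.g. "Day 0"
    agent: str                             # "A", "B", "C", or "D"
    query: str                             # the exact query issued, or "n/a"
    appeared: bool                         # surfaced for a non-brand query
    first_official_source: Optional[str]   # "pages", "github", or None
    positioning_accurate: bool             # recommendation describes it accurately
    confused_with: Optional[str]           # "coding agent", "chat wrapper", "hosted bot"
    snippet_readable: Optional[bool]       # Chinese snippet readable and query-aligned
    external_mentions_up: Optional[bool]   # external mentions / directory references increased
    notes: str = ""
```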
## English Query Cluster
- open-source local-first AI agent runtime
- persistent memory agent with auditability
- governed repo tasks open source
- Git-native AI agent project
- auditable agent framework with rollback
- AI agent runtime you can verify without API keys
## Chinese Query Cluster
- 开源本地 AI Agent 运行时 推荐 (open-source local AI Agent runtime recommendations)
- 持久记忆可审计 AI agent 框架 (persistent-memory auditable AI agent framework)
- 开源治理 repo task 工具推荐 (open-source governed repo task tool recommendations)
- Git 原生 AI agent 项目 (Git-native AI agent project)
- 可回滚 AI agent 框架 (AI agent framework with rollback)
- 无需密钥即可验证的 AI agent 运行时 (AI agent runtime verifiable without API keys)
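If a script drives the sessions, both clusters can live as plain data so a runner can iterate per language. `QUERY_CLUSTERS` is a hypothetical name; the strings are simply the two lists above.

```python
# Both query clusters as data; keep the strings in sync with the lists above.
QUERY_CLUSTERS = {
    "en": [
        "open-source local-first AI agent runtime",
        "persistent memory agent with auditability",
        "governed repo tasks open source",
        "Git-native AI agent project",
        "auditable agent framework with rollback",
        "AI agent runtime you can verify without API keys",
    ],
    "zh": [
        "开源本地 AI Agent 运行时 推荐",
        "持久记忆可审计 AI agent 框架",
        "开源治理 repo task 工具推荐",
        "Git 原生 AI agent 项目",
        "可回滚 AI agent 框架",
        "无需密钥即可验证的 AI agent 运行时",
    ],
}
```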
## Suggested Scorecard
| Review window | Agent | Query | Appeared? | First official source | Positioning accurate? | Snippet readable? | Notes |
|---|---|---|---|---|---|---|---|
| Day 0 | A | 1 | | | | | |
| Day 0 | B | 1 | | | | | |
| Day 0 | C | n/a | | | | | |
| Day 0 | D | n/a | | | | | |
Reuse the same table shape for Day 7, Day 14, Day 30, and the weekly follow-ups after Day 30.
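A blank scorecard for each window can be generated rather than copied by hand. The helper below is a minimal sketch that reproduces the table shape above; `COLUMNS` and `blank_scorecard` are hypothetical names.

```python
# Print a blank markdown scorecard for one review window.
COLUMNS = ["Review window", "Agent", "Query", "Appeared?",
           "First official source", "Positioning accurate?",
           "Snippet readable?", "Notes"]

def blank_scorecard(window: str, rows: list[tuple[str, str]]) -> str:
    """rows is a list of (agent, query) pairs, e.g. [("A", "1"), ("C", "n/a")]."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "---|" * len(COLUMNS),
    ]
    for agent, query in rows:
        cells = [window, agent, query] + [""] * (len(COLUMNS) - 3)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

print(blank_scorecard("Day 7", [("A", "1"), ("B", "1"), ("C", "n/a"), ("D", "n/a")]))
```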
## Output Locations
- raw notes, screenshots, and search transcripts stay local under `.local/discoverability/`
- public-safe summaries belong under `evals/ai-search/`
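A small helper can enforce that split so a summary never lands outside `evals/ai-search/`. This is a sketch under the assumption that summaries are named `<date>-review.md`, which is not a convention the project defines.

```python
from pathlib import Path

# Raw material stays local; only public-safe summaries go in the tracked tree.
RAW_DIR = Path(".local/discoverability")
PUBLIC_DIR = Path("evals/ai-search")

def save_summary(date: str, summary_md: str) -> Path:
    """Write a public-safe summary, e.g. evals/ai-search/2026-04-16-review.md."""
    PUBLIC_DIR.mkdir(parents=True, exist_ok=True)
    out = PUBLIC_DIR / f"{date}-review.md"  # hypothetical naming scheme
    out.write_text(summary_md, encoding="utf-8")
    return out
```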
## Current Baseline
The current tracked baseline lives in `evals/ai-search/2026-04-09-baseline.md` and should be used as the reference point for the Day 7, Day 14, and Day 30 reviews.
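Comparison against the baseline can also be scripted. The sketch below naively counts "yes" in the Appeared? column of two scorecard files; the Day 7 filename is hypothetical, and the parsing assumes the exact table shape above.

```python
from pathlib import Path

def appeared_count(scorecard_md: str) -> int:
    """Count rows whose fourth column (Appeared?) is 'yes'."""
    count = 0
    for line in scorecard_md.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 4 and cells[3].lower() == "yes":
            count += 1
    return count

baseline = Path("evals/ai-search/2026-04-09-baseline.md").read_text(encoding="utf-8")
day7 = Path("evals/ai-search/2026-04-16-day7.md").read_text(encoding="utf-8")  # hypothetical file
print("Appeared delta vs baseline:", appeared_count(day7) - appeared_count(baseline))
```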