How We Test
Standardized benchmarks. Same conditions for all tools. Real measurements, not marketing claims.
The Problem
AI tools promise "one-click" solutions and "10x productivity". But what's the real cost?
- Marketing shows generation time, not time-to-publish
- Reviews focus on features, not actual efficiency
- "AI-generated" often means "needs heavy editing"
- Credit-based pricing hides true cost per usable output
We measure what matters: how much of your time goes to fixing AI output?
Friction Score
Our primary metric. A simple formula with a powerful insight: how much of your time goes to fixing the output? Scores fall into four bands, from best to worst (an illustrative sketch follows the list):
- Output mostly usable as-is. Minor tweaks only.
- Requires refinement but still saves time overall.
- More editing than generating. Marginal time savings.
- Faster to do manually. Tool creates more work than it saves.
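The published formula isn't reproduced here, so the sketch below only illustrates the idea: it maps timing measurements to the four bands above using an assumed edit-to-generation ratio and the manual baseline. The ratio and thresholds are assumptions for illustration, not the actual Friction Score definition.

```python
def friction_band(generation_min: float, edit_min: float, manual_baseline_min: float) -> str:
    """Illustrative only: map timing measurements to the four friction bands.

    The edit/generation ratio and the cut-offs are assumptions for this sketch,
    not the published Friction Score formula.
    """
    total = generation_min + edit_min
    if total >= manual_baseline_min:
        return "Faster to do manually. Tool creates more work than it saves."
    ratio = edit_min / generation_min if generation_min else float("inf")
    if ratio > 1.0:
        return "More editing than generating. Marginal time savings."
    if ratio > 0.25:  # assumed cut-off
        return "Requires refinement but still saves time overall."
    return "Output mostly usable as-is. Minor tweaks only."


# Example: 2 min to generate, 9 min to edit, 25 min to do manually.
print(friction_band(2, 9, 25))  # -> "More editing than generating. ..."
```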
Standardized Benchmarks
Like crash tests for cars. Same conditions for every tool.
The Process
1. Define the task — Clear objective, quality criteria, manual baseline time.
2. Create two prompts — Beginner (minimal context) and Expert (detailed instructions).
3. Run the test — Same prompts for all tools in the category.
4. Measure everything — Generation time, edit time, quality assessment.
5. Document the output — Raw output, edits made, final result (see the record sketch below).
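To show what one run captures, here is a minimal sketch of a benchmark record; the field names and sample values are illustrative, not the actual schema used for the benchmarks.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One benchmark run for one tool. Field names are illustrative only."""
    task: str                   # e.g. "LinkedIn post about a product launch"
    prompt_level: str           # "beginner" or "expert"
    tool: str
    generation_min: float       # time the tool spent generating
    edit_min: float             # time spent fixing the output
    manual_baseline_min: float  # how long the task takes by hand
    quality_notes: str          # assessment against the task's quality criteria
    raw_output: str
    final_output: str


run = BenchmarkRun(
    task="LinkedIn post about a product launch",
    prompt_level="beginner",
    tool="ExampleTool",  # hypothetical tool name
    generation_min=1.5,
    edit_min=12.0,
    manual_baseline_min=25.0,
    quality_notes="Generic hook; invented statistic removed during editing.",
    raw_output="...",
    final_output="...",
)
```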
Why Two Prompt Levels?
This separates "tool is bad" from "user can't prompt".
"Write a LinkedIn post about our product launch"
Minimal context. How most users actually prompt.
Detailed: audience, tone, structure, length, what to avoid, examples...
Maximum context. Tests tool's ceiling.
Other Metrics
Time-to-Publish (TTP)
Total time from opening the tool to having publish-ready output. Includes: generation, review, editing, fact-checking.
Cost per Usable Unit
Monthly subscription ÷ usable outputs. For credit-based tools, we factor in wasted generations.
Intervention Rate
What percentage of outputs require manual intervention? 100% = every output needs editing.
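To make the arithmetic concrete, here is a small worked sketch with made-up numbers; the function names and sample figures are ours, not measurements from the benchmarks.

```python
def time_to_publish(generation_min, review_min, edit_min, factcheck_min):
    # TTP: everything from opening the tool to publish-ready output.
    return generation_min + review_min + edit_min + factcheck_min


def cost_per_usable_unit(monthly_subscription, usable_outputs):
    # Subscription divided by outputs that were actually usable; wasted
    # generations raise the cost because they never enter the denominator.
    return monthly_subscription / usable_outputs


def intervention_rate(outputs_needing_edits, total_outputs):
    # 100% means every output needed manual editing.
    return 100 * outputs_needing_edits / total_outputs


# Made-up example month: $29 plan, 40 generations, 25 usable, 34 needed edits.
print(time_to_publish(2, 5, 10, 3))            # 20 minutes
print(round(cost_per_usable_unit(29, 25), 2))  # 1.16 dollars per usable output
print(intervention_rate(34, 40))               # 85.0 percent
```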
Prompt Sensitivity
- Linear — Quality scales with prompt effort. Worth investing in prompts.
- Plateau — Diminishing returns. Basic prompts are enough.
- Unpredictable — No correlation. Results vary regardless of input.
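These three patterns could be operationalized in several ways. Below is a minimal sketch, assuming quality is scored 0-10 over repeated runs at each prompt level; the thresholds are illustrative assumptions, not published cut-offs.

```python
from statistics import mean, pstdev


def prompt_sensitivity(beginner_scores, expert_scores,
                       gain_threshold=1.5, noise_threshold=2.0):
    """Classify prompt sensitivity from repeated quality scores (0-10).

    Thresholds are assumptions for this sketch, not the site's cut-offs.
    """
    # High run-to-run variance at either level: results vary regardless of input.
    spread = max(pstdev(beginner_scores), pstdev(expert_scores))
    if spread > noise_threshold:
        return "Unpredictable"
    # Large quality gain from better prompts: worth investing in prompts.
    gain = mean(expert_scores) - mean(beginner_scores)
    if gain >= gain_threshold:
        return "Linear"
    # Otherwise diminishing returns: basic prompts are enough.
    return "Plateau"


print(prompt_sensitivity([6, 6, 7], [8.5, 9, 9]))  # -> "Linear"
```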
User Reports
Our benchmarks are controlled tests. User reports add real-world context.
What we collect:
- Task type and description
- Generation time and edit time
- Was output usable? (Yes / With edits / No)
- Issues encountered
- Hourly rate (optional, for cost calculation)
Spam prevention:
- Rate limiting (1 report per tool per day per IP)
- Outlier detection
- Manual review for suspicious patterns
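As an illustration of the first rule, here is a minimal sketch of a per-IP, per-tool, per-day limiter; the in-memory store and key format are assumptions, not the production implementation.

```python
from datetime import date

# In-memory stand-in for whatever store the real system uses (an assumption).
_seen_reports = set()


def allow_report(ip: str, tool_slug: str) -> bool:
    """Allow at most one report per tool per day per IP."""
    key = (ip, tool_slug, date.today().isoformat())
    if key in _seen_reports:
        return False  # already reported this tool today from this IP
    _seen_reports.add(key)
    return True


print(allow_report("203.0.113.7", "exampletool"))  # True  (first report today)
print(allow_report("203.0.113.7", "exampletool"))  # False (rate-limited)
```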
Limitations
We're transparent about what we can and can't measure.
- Subjectivity: "Usable" quality is somewhat subjective. We use consistent criteria, but your standards may differ.
- Task coverage: We can't test every possible use case. We focus on common, representative tasks.
- Tool updates: AI tools change frequently. We retest quarterly and note when data may be outdated.
- User skill: Even "beginner" prompts assume basic literacy. Expert prompts assume prompt engineering knowledge.
Help improve our data
Your experience matters. Submit your efficiency data to help others.