How We Test

Standardized benchmarks. Same conditions for all tools. Real measurements, not marketing claims.

The Problem

AI tools promise "one-click" solutions and "10x productivity". But what's the real cost?

  • Marketing shows generation time, not time-to-publish
  • Reviews focus on features, not actual efficiency
  • "AI-generated" often means "needs heavy editing"
  • Credit-based pricing hides true cost per usable output

We measure what matters: how much of your time goes to fixing AI output?

Friction Score

Our primary metric. Simple formula, powerful insight (a worked sketch follows the tiers below):

Friction Score = (Edit Time ÷ Total Time) × 100%
  • 0-20% (Low Friction ✅): Output mostly usable as-is. Minor tweaks only.
  • 21-50% (Medium Friction ⚡): Requires refinement but still saves time overall.
  • 51-80% (High Friction ⚠️): More editing than generating. Marginal time savings.
  • 81-100% (Critical 🛑): Faster to do manually. Tool creates more work than it saves.
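
As a worked example, here is a minimal Python sketch of the formula and tier mapping above. It is illustrative only; the function names and the sample numbers are ours, not part of any tool.

```python
def friction_score(edit_minutes: float, total_minutes: float) -> float:
    """Friction Score = (edit time / total time) * 100, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total time must be positive")
    return (edit_minutes / total_minutes) * 100


def friction_tier(score: float) -> str:
    """Map a score to the tiers described above."""
    if score <= 20:
        return "Low Friction"
    if score <= 50:
        return "Medium Friction"
    if score <= 80:
        return "High Friction"
    return "Critical"


# Example: 25 minutes of editing in a 40-minute session.
score = friction_score(edit_minutes=25, total_minutes=40)
print(f"{score:.0f}% -> {friction_tier(score)}")  # 62% -> High Friction
```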

Standardized Benchmarks

Like crash tests for cars. Same conditions for every tool.

The Process

  1. Define the task — Clear objective, quality criteria, manual baseline time.
  2. Create two prompts — Beginner (minimal context) and Expert (detailed instructions).
  3. Run the test — Same prompts for all tools in the category.
  4. Measure everything — Generation time, edit time, quality assessment.
  5. Document the output — Raw output, edits made, final result (each run is logged in a record like the sketch after this list).
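
As a rough illustration of what gets measured and documented per run, a record could look like this. The field names and schema are hypothetical, not a published format.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One tool x one task x one prompt level (hypothetical schema)."""
    tool: str
    task: str
    prompt_level: str               # "beginner" or "expert"
    generation_minutes: float       # time the tool spent producing output
    edit_minutes: float             # time we spent fixing the output
    manual_baseline_minutes: float  # how long the task takes by hand
    usable: bool                    # did the final result meet the quality criteria?
    notes: str = ""                 # raw output, edits made, final result


run = BenchmarkRun(
    tool="ExampleTool",
    task="LinkedIn post about a product launch",
    prompt_level="beginner",
    generation_minutes=2,
    edit_minutes=18,
    manual_baseline_minutes=25,
    usable=True,
)
```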

Why Two Prompt Levels?

This separates "tool is bad" from "user can't prompt".

🟢 Beginner Prompt

"Write a LinkedIn post about our product launch"

Minimal context. How most users actually prompt.

🔵 Expert Prompt

Detailed: audience, tone, structure, length, what to avoid, examples...

Maximum context. Tests the tool's ceiling.

Other Metrics

Time-to-Publish (TTP)

Total time from opening the tool to having publish-ready output. Includes: generation, review, editing, fact-checking.
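
Put another way, TTP is just the sum of every phase before publish. A tiny sketch with made-up numbers:

```python
def time_to_publish(generation: float, review: float,
                    editing: float, fact_checking: float) -> float:
    """Total minutes from opening the tool to publish-ready output."""
    return generation + review + editing + fact_checking


# A "2-minute" generation can still cost half an hour end to end.
print(time_to_publish(generation=2, review=5, editing=18, fact_checking=7))  # 32
```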

Cost per Usable Unit

Monthly subscription ÷ usable outputs. For credit-based tools, we factor in wasted generations.
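
A minimal sketch of the division, with wasted generations removed from the denominator; the numbers are hypothetical:

```python
def cost_per_usable_unit(monthly_price: float,
                         total_outputs: int,
                         unusable_outputs: int = 0) -> float:
    """Monthly subscription divided by outputs that were actually usable."""
    usable = total_outputs - unusable_outputs
    if usable <= 0:
        return float("inf")  # you paid for nothing usable
    return monthly_price / usable


# Example: $30/month, 40 generations, 15 of them thrown away.
print(round(cost_per_usable_unit(30, 40, 15), 2))  # 1.2
```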

Intervention Rate

What percentage of outputs require manual intervention? 100% = every output needs editing.
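
A sketch of the calculation, assuming each output is simply flagged as needing edits or not:

```python
def intervention_rate(outputs_needing_edits: int, total_outputs: int) -> float:
    """Percentage of outputs that required manual intervention."""
    if total_outputs == 0:
        return 0.0
    return outputs_needing_edits / total_outputs * 100


print(intervention_rate(9, 10))  # 90.0 -> nine out of ten outputs needed editing
```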

Prompt Sensitivity

  • Linear — Quality scales with prompt effort. Worth investing in prompts.
  • Plateau — Diminishing returns. Basic prompts are enough.
  • Unpredictable — No correlation. Results vary regardless of input (a rough classification sketch follows this list).
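
One way to operationalise the three patterns, as a sketch: compare average quality at the two prompt levels against run-to-run noise. The 0-10 scale and the thresholds below are illustrative, not our published cut-offs.

```python
def prompt_sensitivity(beginner_quality: float, expert_quality: float,
                       run_to_run_spread: float) -> str:
    """Classify how quality responds to prompt effort (illustrative thresholds).

    Quality scores are on a 0-10 scale; run_to_run_spread is the typical
    variation between repeated runs of the *same* prompt.
    """
    gain = expert_quality - beginner_quality
    if run_to_run_spread >= abs(gain):
        return "Unpredictable"   # noise swamps any prompt effect
    if gain >= 2.0:
        return "Linear"          # effort clearly pays off
    return "Plateau"             # diminishing returns past a basic prompt


print(prompt_sensitivity(beginner_quality=5.0, expert_quality=8.0,
                         run_to_run_spread=0.5))  # Linear
```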

User Reports

Our benchmarks are controlled tests. User reports add real-world context.

What we collect:

  • Task type and description
  • Generation time and edit time
  • Was output usable? (Yes / With edits / No)
  • Issues encountered
  • Hourly rate (optional, for cost calculation; a rough report schema is sketched after this list)
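
A rough sketch of the report shape; the field names are hypothetical and do not reflect our actual form or API:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserReport:
    """One self-reported data point (hypothetical schema)."""
    tool: str
    task_type: str
    task_description: str
    generation_minutes: float
    edit_minutes: float
    usable: str                          # "yes", "with_edits", or "no"
    issues: list[str] = field(default_factory=list)
    hourly_rate: Optional[float] = None  # optional, used for cost calculations
```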

Spam prevention:

  • Rate limiting (1 report per tool per day per IP; a minimal sketch follows this list)
  • Outlier detection
  • Manual review for suspicious patterns
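
A minimal sketch of the one-report-per-tool-per-day-per-IP rule; real storage, outlier checks, and manual review are out of scope here.

```python
from datetime import date

_seen: set[tuple[str, str, date]] = set()  # (ip, tool, day) combinations already used


def allow_report(ip: str, tool: str, day: date | None = None) -> bool:
    """Accept at most one report per tool per day from a given IP."""
    key = (ip, tool, day or date.today())
    if key in _seen:
        return False
    _seen.add(key)
    return True


print(allow_report("203.0.113.7", "ExampleTool"))  # True
print(allow_report("203.0.113.7", "ExampleTool"))  # False: same IP, tool, and day
```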

Limitations

We're transparent about what we can and can't measure.

  • Subjectivity: "Usable" quality is somewhat subjective. We use consistent criteria, but your standards may differ.
  • Task coverage: We can't test every possible use case. We focus on common, representative tasks.
  • Tool updates: AI tools change frequently. We retest quarterly and note when data may be outdated.
  • User skill: Even "beginner" prompts assume basic literacy. Expert prompts assume prompt engineering knowledge.

Help improve our data

Your experience matters. Submit your efficiency data to help others.