Benchmark


Voxstr's pitch is speed and locality, not adjectives. This page shows the numbers behind that pitch and how they compare against the named alternatives developers usually evaluate. Voxstr's numbers come from our own evaluation harness in the open-source repo. Competitor numbers are pulled from each vendor's site or from independent reviews, with retrieval dates so you can verify them yourself.

How Voxstr compares

Four products, eight rows. Voxstr's numbers come from our public eval run on 2026-05-11; competitor numbers are linked to the source we pulled them from on the same day.

| Metric | Voxstr | Wispr Flow | Aqua Voice | Apple Dictation |
| --- | --- | --- | --- | --- |
| p50 latency | 85ms [1] | ~700ms [3] | 450ms [4] | 150ms to 400ms [5] |
| p95 latency | 124ms [1] | unconfirmed | unconfirmed | unconfirmed |
| WER on LibriSpeech test-clean | 2.41% [1] | unconfirmed | 2.7% (AISpeak) [6] | ~8% [7] |
| Empty-output rate on speech | 0% (0/100) [1] | unconfirmed | unconfirmed | unconfirmed |
| Silence hallucination rate | 0% (0/5 clean) [1] | unconfirmed | unconfirmed | unconfirmed |
| Local-only? | Yes [2] | No [8] | No [9] | Yes (Apple Silicon) [10] |
| Cloud round-trip required? | No [2] | Yes [8] | Yes [9] | No (on Apple Silicon) [10] |
| Open source? | Yes (MIT) [2] | No | No | No |

When AI cleanup is enabled, the cleanup pass adds 207ms p50 and 1019ms p95 on top of the speech-to-text numbers above. Cleanup is local too: both stages run on-device. [11]
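One subtlety when reading those two rows together: percentiles don't simply add across stages, so a pipeline p95 should come from per-clip totals rather than from summing the stage p95s, which typically overstates the tail. A minimal sketch, using hypothetical per-clip timings in place of the real ones under eval/results/:

```python
# A minimal sketch: combining per-stage, per-clip timings into pipeline
# percentiles. The sample timings are hypothetical placeholders; the real
# per-clip numbers live under eval/results/ in the repo.
import math

def pctile(samples, q):
    """Nearest-rank percentile: smallest value covering q% of samples."""
    xs = sorted(samples)
    k = max(0, math.ceil(q / 100 * len(xs)) - 1)
    return xs[k]

stt_ms     = [82, 85, 88, 91, 120, 84, 86, 83, 90, 124]           # speech-to-text
cleanup_ms = [150, 207, 190, 310, 1019, 180, 220, 205, 400, 260]  # AI cleanup

# Pipeline percentiles come from per-clip totals, not from adding the
# per-stage percentiles together.
total_ms = [s + c for s, c in zip(stt_ms, cleanup_ms)]
for name, xs in (("stt", stt_ms), ("cleanup", cleanup_ms), ("total", total_ms)):
    print(f"{name:>8}: p50={pctile(xs, 50)}ms  p95={pctile(xs, 95)}ms")
```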

Methodology

Voxstr's latency, WER, empty-output, and silence-hallucination numbers come from a 105-clip public evaluation set: 75 clips from LibriSpeech test-clean, 20 from the VCTK corpus, 5 silence clips, and 5 vocabulary-dense clips. Hardware is an Apple M4 Max with 64 GB of unified memory. The reference transcripts and the model output are both passed through OpenAI's EnglishTextNormalizer before WER is computed, which prevents punctuation and contraction artefacts from inflating the error rate. The exact harness, dataset list, and raw per-clip results are in the open-source repo under eval/. [1]
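A minimal sketch of that normalization step, assuming the openai-whisper and jiwer packages; the example strings are ours, not drawn from the eval set:

```python
# Both reference and hypothesis pass through OpenAI's EnglishTextNormalizer
# before scoring, as described above.
from whisper.normalizers import EnglishTextNormalizer
import jiwer

normalize = EnglishTextNormalizer()

reference = "Mr. Brown couldn't have known, could he?"
hypothesis = "mister brown could not have known could he"

# Without normalization, "Mr."/"mister" and "couldn't"/"could not" count
# as word errors even though the transcription is semantically exact.
raw_wer = jiwer.wer(reference, hypothesis)
norm_wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_wer:.2%}  normalized WER: {norm_wer:.2%}")
```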

The cleanup-stage numbers come from a separate 117-fixture editorial-cleanup evaluation, run on the same hardware on 2026-05-11 against the production prompt and LoRA bundle that ship in the app. The full report is in the same repo. [11]

How we measure

The Voxstr p50 and p95 latency rows above measure speech-to-text engine time on the Apple Neural Engine. The user-facing metric in everyday use is the total pipeline time from hotkey release to text appearing in the focused app: the speech-to-text stage, plus the cleanup stage if AI cleanup is on, plus the text-injection stage. The 85ms speech-to-text figure is the part competitors generally publish, so we surface it here for a like-for-like comparison. The full pipeline figure including cleanup is what you actually feel, and Voxstr's worst case on long inputs sits in the seconds range, like every other AI-cleanup tool in the category. Turn cleanup off and Voxstr returns to the 85ms regime.
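In code terms, here is roughly what "hotkey release to text appearing" covers. The three stage functions are hypothetical stand-ins, not Voxstr's actual API; they exist only so the timing skeleton runs:

```python
import time

# Hypothetical stand-ins for the three stages named above.
def transcribe(audio):  return "raw transcript"
def cleanup(text):      return text.capitalize() + "."
def inject(text):       pass  # e.g. paste into the focused app

def run_pipeline(audio, cleanup_enabled=True):
    """Time each stage from hotkey release to text injection."""
    timings = {}

    t0 = time.perf_counter()
    text = transcribe(audio)                      # speech-to-text stage
    timings["stt_ms"] = (time.perf_counter() - t0) * 1000

    if cleanup_enabled:                           # optional AI cleanup stage
        t1 = time.perf_counter()
        text = cleanup(text)
        timings["cleanup_ms"] = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    inject(text)                                  # text-injection stage
    timings["inject_ms"] = (time.perf_counter() - t2) * 1000

    timings["total_ms"] = sum(timings.values())   # what the user feels
    return text, timings

print(run_pipeline(b"\x00" * 16000)[1])
```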

Why the cloud round-trip matters

Wispr Flow and Aqua Voice both run transcription on their own servers. That is a structural design decision, not a tradeoff that can be patched away: every dictation pays the network cost of a TLS handshake, an audio upload, server inference, and a text download. Independent reviews put Wispr Flow at roughly 700 milliseconds end-to-end; Aqua Voice publishes 450 milliseconds. Voxstr runs the entire pipeline on the Apple Neural Engine inside your process, so there is no round trip to pay for and no privacy story to debate. The numbers above are the consequence of that architectural choice, not a marketing claim.
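To make the decomposition concrete, here is an illustrative cost stack. Every figure is a hypothetical placeholder, not a measurement of any product:

```python
# Illustrative only: how a cloud dictation round trip stacks up.
# Every number is a hypothetical placeholder, not a measured value;
# real costs vary with network conditions and server load.
round_trip_ms = {
    "tls_handshake": 60,     # amortized away if the connection is reused
    "audio_upload": 150,     # a few seconds of compressed speech
    "server_inference": 300,
    "text_download": 20,
}
print(f"{sum(round_trip_ms.values())}ms before any local work")  # 530ms
```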

Sources

  1. Voxstr speech-to-text numbers (latency, WER, empty-output rate, silence hallucination): README, Evaluation section. Hardware: Apple M4 Max, 64 GB. Eval set: 105 public clips (75 LibriSpeech test-clean + 20 VCTK + 5 silence + 5 vocab-dense). Retrieved 2026-05-11.
  2. Voxstr architecture (local-only, MIT open source): repo CLAUDE.md and the Privacy and local-first docs page. Retrieved 2026-05-11.
  3. Wispr Flow latency, ~700ms end-to-end (cloud round trip): Voibe Wispr Flow review. Retrieved 2026-05-11.
  4. Aqua Voice latency, 450ms response time: Voibe Aqua Voice pricing review; the Aqua site itself uses "sub-second latency". Retrieved 2026-05-11.
  5. Apple Dictation / SpeechAnalyzer latency, 150ms to 400ms: Dictato engine benchmark. Retrieved 2026-05-11.
  6. Aqua Voice "Avalon" model accuracy on AISpeak benchmark (97.3% accuracy, i.e. ~2.7% WER): aquavoice.com homepage. Note this is a different benchmark than LibriSpeech test-clean and is the vendor's own published number, not an independent measurement. Retrieved 2026-05-11.
  7. Apple SpeechAnalyzer WER ~8% (CER ~3%): MacRumors transcription benchmark. The benchmark dataset is not LibriSpeech test-clean and the methodology differs, so this row is a like-with-caveats comparison. Retrieved 2026-05-11.
  8. Wispr Flow cloud transcription: wisprflow.ai/privacy: "transcription always happens in the cloud to provide the best speed and accuracy". Retrieved 2026-05-11.
  9. Aqua Voice cloud architecture: Hacker News launch thread for Aqua Voice 2, where the developers said local models could not yet meet their quality bar at speed. Retrieved 2026-05-11.
  10. Apple Dictation on-device transcription on Apple Silicon: Apple support page. On Apple Silicon, processing is on-device; Intel Macs route through Apple's servers. Retrieved 2026-05-11.
  11. Voxstr cleanup-stage latency (p50 207ms, p95 1019ms), measured against the 117-fixture editorial-cleanup eval set on the production prompt and LoRA bundle that ship in the app: v5p6-shipping-baseline-2026-05-11.md. Hardware: Apple M4 Max, 64 GB. Retrieved 2026-05-11.

Reproducing the Voxstr numbers

The eval harness is open source. Clone the repo, run make eval-setup and make eval-datasets-fetch to install dependencies and download the public clip set, then make eval-ab ENGINE=parakeet DATASET=public to produce the speech-to-text numbers; make eval regenerates the cleanup-stage report. The full sequence is collected below. We will not update this page silently when results change: new baselines get a new file in eval/results/ with a date and a new link from this page.
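The same commands in one place. The repo URL is not given on this page, so the clone line uses a placeholder:

```sh
git clone <repo-url> && cd <repo>             # placeholder; use the project's repo URL
make eval-setup                               # install harness dependencies
make eval-datasets-fetch                      # download the 105-clip public set
make eval-ab ENGINE=parakeet DATASET=public   # produce the speech-to-text numbers
make eval                                     # regenerate the cleanup-stage report
```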