How long until Mythos-level capability is generally available?
Anthropic’s Claude Mythos Preview can autonomously discover zero-day vulnerabilities at industrial scale — finding flaws in OpenBSD, FFmpeg, and Firefox that survived decades of expert review. Anthropic concluded general release would make large-scale cyberattacks “far more likely” and restricted access to eleven partner organizations.
This project tracks one metric, Time-to-Parity (TTP): how many days until unrestricted, downloadable AI models match Mythos across reasoning, software engineering, and cybersecurity. Historically, TTP on SWE-bench Verified has compressed from roughly 440 days to roughly 106 days, about a 4x acceleration. At current rates, the projection puts the probability of full parity before the end of 2026 at 85%.
Three pillars to parity
Mythos-level capability requires reasoning, software engineering, and cybersecurity. The headline number tracks the last pillar to fall.
Reasoning
GPQA Diamond (198 graduate-level science questions) measures deep reasoning on problems that cannot be answered by search. Anthropic identifies this as the root capability behind Mythos's cybersecurity performance. Frontier models have essentially closed the gap already: Gemini 3.1 Pro (94.3%) is 0.2 points from Mythos (94.5%). The best open-weight model (Qwen3.5-397B, 88.4%) is 6.1 points behind.
GPQA Diamond — top models
Why reasoning matters
Anthropic's own assessment is that Mythos's cybersecurity capability is a downstream effect of general reasoning improvement, not specialized cyber training. GPQA Diamond is among the hardest public general reasoning benchmarks.
The causal chain: General reasoning → Code reasoning → Cybersecurity capability. If the reasoning gap closes, the others follow.
Frontier has essentially achieved reasoning parity. Open-weight is 6.1 points behind and closing at ~2 points per quarter. This pillar will not be the bottleneck.
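As a rough check on that closing rate, the sketch below divides the open-weight gap by the stated rate. It assumes the ~2 points per quarter holds constant, which is a simplification, and the function name `quarters_to_close` is illustrative, not part of this project's tooling.

```python
def quarters_to_close(gap_points: float, rate_per_quarter: float) -> float:
    """Quarters needed to close a score gap at a constant closing rate."""
    if rate_per_quarter <= 0:
        raise ValueError("closing rate must be positive")
    return gap_points / rate_per_quarter


# Scores quoted in the text (GPQA Diamond, percent correct).
mythos_score = 94.5
best_open_weight_score = 88.4

# Round to one decimal to avoid floating-point noise in the subtraction.
gap = round(mythos_score - best_open_weight_score, 1)

# Assumption: the ~2 points/quarter closing rate stays constant.
quarters = quarters_to_close(gap, 2.0)
```

With the figures above this comes to about three quarters, under the constant-rate assumption.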
Software Engineering
SWE-bench measures autonomous software engineering, meaning the ability to understand, navigate, and modify real codebases. This is the current bottleneck: the best non-restricted model is 18.7 points from Mythos's 77.8% on SWE-bench Pro. The historical Verified variant shows TTP compressing from 440 to 106 days.
SWE-bench Verified (historical)
Deprecated Feb 2026 due to confirmed contamination. Historical context only.
SWE-bench Pro — top models
Cybersecurity
This project uses CyberGym as its primary cybersecurity benchmark: proof-of-concept exploit generation against 1,507 real-world vulnerabilities across 188 open-source projects.
The previous standard — Cybench — was adopted by both the U.S. AI Safety Institute (NIST) and UK AI Safety Institute as their sole open-source measure of offensive cyber capability in pre-deployment testing. Mythos achieved a 100% pass rate, saturating the benchmark entirely. Anthropic concluded it is “no longer sufficiently informative of current frontier model capabilities.”
On April 13, 2026, the UK AI Security Institute (AISI) independently confirmed the step-change: Mythos is the first model to complete an AISI cyber range end-to-end, progressing from initial reconnaissance through full network takeover across all nine milestones. The next closest model, Claude Opus 4.6, stalls at infrastructure compromise. Every other model plateaus at credential theft.
CyberGym provides the statistical depth to measure what Cybench no longer can.
CyberGym — capability over time
CyberGym — top models
The Open-Weight Inversion
On the most direct cybersecurity benchmark, open-weight models have already surpassed every frontier model. The gap we’re tracking isn’t between open-weight and frontier — it’s between open-weight and a single restricted model.
The frontier tier — models with audit trails, guardrails, and usage policies — is already behind the unmonitored, decentralized tier on direct cybersecurity capability. Open-weight models should be on every security leader’s radar.
The Convergence
The gap between frontier and open-weight is compressing across every benchmark. In the charts below, converging lines mean a closing gap, and where the amber line crosses above the blue one, open-weight has surpassed frontier.
GPQA Diamond — frontier vs open-weight
SWE-bench Pro — frontier vs open-weight
CyberGym — frontier vs open-weight
The Trendline
Time-to-parity has compressed from ~440 days (at the 49% SWE-bench threshold) to ~106 days (at the 80% threshold). That’s a 4x compression over roughly two years. The question isn’t whether open-weight catches up — it’s how fast.
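The arithmetic behind the compression figure can be sketched as follows. This assumes TTP at a threshold is the lag in days between a score first being reached and an unrestricted model matching it; the function names and the example dates are hypothetical placeholders, not release dates from this project's data.

```python
from datetime import date


def time_to_parity_days(first_reached: date, open_reached: date) -> int:
    """Days between a score first being reached and an unrestricted model matching it."""
    return (open_reached - first_reached).days


def compression_ratio(earlier_ttp_days: float, later_ttp_days: float) -> float:
    """How many times shorter the later lag is than the earlier one."""
    if later_ttp_days <= 0:
        raise ValueError("later TTP must be positive")
    return earlier_ttp_days / later_ttp_days


# Placeholder dates chosen only so the lag equals 440 days; not real release dates.
example_lag = time_to_parity_days(date(2024, 1, 1), date(2025, 3, 16))

# Figures quoted in the text: ~440 days at the 49% threshold, ~106 days at 80%.
ratio = compression_ratio(440, 106)
```

The ratio works out to a little over 4, which is where the "4x" figure comes from.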
The Inflection Points
Accelerators
Three risk factors could compress the projected parity window below what the historical trendline suggests:
“Citizen Hacker”
Just as citizen developers use AI to build software without formal training, citizen hackers will use AI to discover and exploit vulnerabilities without deep security expertise. When open-weight models reach Mythos-level capability, anyone with a GPU can run autonomous offensive campaigns — with no audit trail, no guardrails, and no way for the originating lab to intervene.
The VulnApocalypse is already underway. Research from Nicholas Carlini (presented at UnPrompted 2025), MOAK.AI, and AISLE demonstrates that current-generation models already provide real offensive uplift. Mythos is a step-change beyond that, and eleven organizations already have access to it. This project tracks how fast unrestricted models are closing in.
CISO Response
The same convergence that makes offensive capability more accessible also makes defensive capability cheaper — open-weight models that reach 90%+ SWE capability let enterprises self-host autonomous remediation without sending proprietary code to third-party APIs. The resources below are where the practitioner community is coordinating the response.
How We Measure — the benchmarks, projection methods, confidence intervals, and what we deliberately don’t control for.
Forward projections fit a logistic growth curve, with a linear fallback, to the leading-edge (best-so-far) trajectory of non-restricted models. Confidence intervals are 95% bootstrap intervals from 1,000 resamples. An all-model regression is reported as a cross-check. The parity probability is the fraction of bootstrap draws that project parity on or before Dec 31, 2026; the combined probability is the product of the pillar probabilities, which are treated as independent.
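A minimal, standard-library sketch of that procedure is given below. It is not the project's implementation: to stay dependency-free it uses only the linear fit (the fallback named above) in place of the logistic fit, and every function name, the day-number encoding of dates, and the sample points are assumptions made for illustration.

```python
import random
from math import prod

# A point is (days since a reference date, benchmark score in percent).


def leading_edge(points):
    """Keep only observations that set a new best-so-far score."""
    edge, best = [], float("-inf")
    for day, score in sorted(points):
        if score > best:
            edge.append((day, score))
            best = score
    return edge


def linear_fit(points):
    """Ordinary least squares; returns (slope, intercept), or None if degenerate."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_parity_day(points, target):
    """Day on which the fitted leading-edge line reaches the target score."""
    fit = linear_fit(leading_edge(points))
    if fit is None or fit[0] <= 0:
        return None  # no upward trend, so no projected parity
    slope, intercept = fit
    return (target - intercept) / slope


def parity_probability(points, target, deadline_day, resamples=1000, seed=0):
    """Fraction of bootstrap draws projecting parity on or before the deadline."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(resamples):
        draw = [rng.choice(points) for _ in points]  # resample with replacement
        day = projected_parity_day(draw, target)
        if day is not None and day <= deadline_day:
            hits += 1
    return hits / resamples


def combined_probability(pillar_probabilities):
    """Product of pillar probabilities, treating the pillars as independent."""
    return prod(pillar_probabilities)


# Hypothetical observations, not real benchmark results.
sample = [(0, 50.0), (100, 60.0), (200, 70.0), (300, 80.0)]
p_by_day_500 = parity_probability(sample, target=90.0, deadline_day=500)
```

Draws whose resample has no upward trend count as misses here, which is one possible convention; a fuller version would fit the logistic curve first and fall back to the line only when that fit fails.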