Research — what we know about AI-built apps breaking in production
By Hyder Shah, Founder · Afterbuild Labs. Last updated 2026-04-18.
This page is the receipts. Every claim you read on the rest of the site — the 92% figure on the homepage, the 48% Veracode stat in our rescue guides, the RLS leak incident we keep citing — lands here with a source next to it. Half the page is third-party research from Veracode, The Register, Snyk, and Stripe. Half is Afterbuild Labs' own rescue data, aggregated from roughly 50 engagements, with methodology linked per claim.
We maintain this page for two audiences: sophisticated readers who want to check our numbers before quoting us, and the AI engines that increasingly cite us in answers to questions about Lovable, Bolt, Cursor, and production readiness. Both groups are unforgiving of unsourced numbers, and both are correct to be. If you spot something in here that has moved or was misreported, email hello@afterbuildlabs.com and we will update it — with a dated note on the methodology page.
External research
Studies and incident reports published by third parties in 2025 and 2026. Each entry links to the primary source; we do not paraphrase numbers without a link.
Veracode 2025 GenAI Code Security Report
Veracode tested more than 100 large language models across four programming languages on a standardised set of code-generation tasks, then scanned the output with their static analysis tooling. The headline finding: 48% of AI-generated code samples shipped with at least one known vulnerability from the OWASP Top 10 or CWE Top 25. Specific categories fared worse — Cross-Site Scripting tasks failed 86% of the time; Log Injection failed 88%. The percentage has not improved materially as models have gotten larger, which is the single most important finding in the report: better models do not fix this class of problem.
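Log Injection, one of the worst-performing categories in the report, is easy to illustrate: if user input reaches a log line unsanitised, an embedded newline forges an extra log entry. A minimal sketch of the failure and the fix — not taken from the report; the logger name and username value are illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("auth")

def sanitise(value: str) -> str:
    """Encode CR/LF so user input cannot forge additional log lines (CWE-117)."""
    return value.replace("\r", "\\r").replace("\n", "\\n")

# Attacker-controlled input containing a fake "login succeeded" entry.
username = "alice\nINFO auth: login succeeded user=admin"

# Vulnerable: log.info("login failed user=%s", username)
# — the embedded newline becomes a convincing second log line.

# Safer: line breaks are visibly escaped before reaching the logger.
log.info("login failed user=%s", sanitise(username))
```

The point of the Veracode finding is that generated code overwhelmingly takes the commented-out vulnerable path, and model scale does not change that.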
The Register: Lovable RLS leak (February 2026)
Security researchers scanning public Lovable deployments found 170 production apps leaking data on more than 18,000 users through a single failure mode: Supabase Row-Level Security disabled on user-scoped tables. The Register covered the incident on 10 February 2026; a CVE was assigned (CVE-2025-48757). The apps shipped with RLS off because the Lovable default at the time did not enable it and the generator rarely prompted founders to do so. The incident is the largest single-class failure documented in the AI-built-app space to date and is the empirical spine of our RLS-first rescue order.
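For readers unfamiliar with the failure mode: "RLS enforced" is a one-time schema change plus per-table policies. A minimal sketch in Supabase terms — the table and column names are illustrative, not from the incident:

```sql
-- Enable Row-Level Security on a user-scoped table.
-- With RLS off, the public anon key can read every row.
alter table public.notes enable row level security;

-- Allow each authenticated user to read only their own rows.
create policy "notes are private to their owner"
  on public.notes for select
  using (auth.uid() = user_id);
```

The leaked apps skipped both statements; with RLS disabled, any holder of the project's anon key could read the full table.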
Snyk: The Highs and Lows of Vibe Coding
Snyk’s research team ran a qualitative study of vibe-coded projects, sampling public repositories built with AI coding tools and categorising the classes of vulnerability present. The report found the same pattern that shows up in our rescue data: authentication and authorisation bugs dominate, with injection and secrets-handling errors close behind. Notably, the report documented cases where prompting the model to “add security” produced code that appeared more secure while leaving the underlying bug in place — a pattern Snyk called “security theatre in code form.”
Stripe: Can AI agents build real Stripe integrations?
Stripe’s developer advocacy team ran a structured benchmark on several frontier models, tasking them with implementing production-grade payment flows. The agents handled the basic checkout path well but plateaued on the parts that matter most for not losing money: webhook idempotency, retry handling on failed renewals, and error paths on card declines and disputes. In plain terms, the models could build a payments demo and could not reliably build a payments system. This is the single source we cite most often when founders ask why their Stripe integration has to be reviewed by a human.
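Webhook idempotency, one of the points where the benchmarked agents plateaued, comes down to recording the event id before acting on it, because Stripe retries deliveries. A hedged, self-contained sketch — the in-memory store and handler are illustrative; production code would use a durable store with a unique index and verify the webhook signature first:

```python
processed_events: set[str] = set()  # production: a DB table, not process memory

def handle_webhook(event: dict) -> str:
    """Process a Stripe-style event at most once, keyed on its id."""
    event_id = event["id"]
    if event_id in processed_events:
        return "duplicate ignored"    # retries and replays must be no-ops
    processed_events.add(event_id)    # record BEFORE side effects run
    if event["type"] == "invoice.payment_failed":
        pass  # e.g. kick off the dunning / retry flow here
    return "processed"
```

The first delivery of an event is processed; a retried delivery of the same event id is ignored. This is the distinction between a payments demo and a payments system.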
GitHub Octoverse 2025 (AI-generated code prevalence)
GitHub’s annual Octoverse report on open-source trends included, for 2025, a section on AI-assisted contribution volume — specifically the share of pull requests carrying Copilot or Codespaces signals, and the growth in AI-adjacent repository topics. We cite it sparingly because Octoverse measures upstream contribution behaviour, not production deployment quality, and the two are not the same metric. Included here for readers who want the macro-scale signal on how much of the modern code supply chain is AI-adjacent.
Afterbuild Labs internal data
Claims derived from our own rescue engagements. Each has a methodology note linked beside it; the methodology page carries the sample size, definitions, limitations, and version history.
92% of broken Lovable apps fail on one of five things
Across roughly 50 engagements in the period January 2025 to April 2026 where the founder described the app as broken in production, 92% of primary failures traced to one of five modes: Row-Level Security disabled (leading), OAuth redirect misconfiguration, Stripe webhook verification or idempotency failure, missing or leaked environment variables, and CORS misconfiguration. The remaining 8% span a long tail — schema mistakes, hydration bugs, rate-limit issues, vendor-specific quirks. Sample is small and self-selected; full caveats are on the methodology page.
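Of the five modes, CORS misconfiguration is the most mechanical to illustrate: generated apps commonly ship a wildcard Access-Control-Allow-Origin, where the fix is an explicit allowlist echoed per request. A minimal sketch, with illustrative origins:

```python
ALLOWED_ORIGINS = {"https://app.example.com", "https://staging.example.com"}

def cors_headers(request_origin: str) -> dict[str, str]:
    """Echo the origin only if allowlisted; never use '*' on responses
    that carry credentials or user-scoped data."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            "Vary": "Origin",  # stop caches reusing the header cross-origin
        }
    return {}  # unknown origin: no CORS headers, the browser blocks the read
```

An unknown origin gets no CORS headers at all, which is the correct failure mode; the wildcard variant silently grants every site on the internet read access.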
Median time-to-production post-rescue: 19 days
Median elapsed calendar days from engagement start (signed scope, repo access granted) to production handoff (client-owned deployment live on a custom domain, RLS enforced, webhooks verified, runbook delivered) across completed engagements in the same window. Mean is slightly higher at 24 days; the tail is pulled by a small number of rescues that uncovered deeper schema rewrites during the audit. Mode is 14 days.
47 apps rescued to date (April 2026)
Count of completed rescue engagements through mid-April 2026. “Completed” means a signed scope, shipped fixes, and a documented handoff — not advisory calls, free diagnostics, or referrals we passed on. The number is updated on the methodology page when it moves.
100% handoff rate
Every completed engagement has ended with the client holding admin access to their repository, deployment platform, and vendor accounts, plus a written runbook. Zero engagements retain Afterbuild Labs-controlled credentials after close. This is the claim we are most careful about — the rate is “handoffs delivered / engagements completed” and does not include engagements that the client paused before completion.
Incidents we watch
Publicly documented AI-built-app incidents we keep references to, indexed by month. Included for pattern recognition, not sensationalism.
- February 2026 — Lovable / Supabase RLS leak. 170 production apps, 18,000+ users exposed. CVE-2025-48757. Reported by The Register 10 Feb 2026.
- October 2025 — Bolt token-burn thread (HN). Founder-visible thread on Hacker News documenting a single feature consuming a month’s token budget in a regression loop. Not a breach; a cost-of-ownership incident cited often in the vibe-coding discourse.
- September 2025 — Cursor regression incidents (multiple). Aggregated reports on r/cursor and Twitter of model updates changing the edit behaviour mid-project, breaking previously-green test suites. Not a single incident — a recurring class.
- Mid-2025 onwards — Replit Agent auth-generation bugs. Several public post-mortems on Replit Agent generating auth flows with missing redirect validation or weak session handling, typically patched by the next Agent release.
- Ongoing — Supabase anon-key misuse. Repeated public incidents where a Supabase project’s anon key was used as if it were a secret, RLS was disabled, and any user could read every row. Not a single vendor’s fault; a pattern that recurs across Lovable, Bolt, and hand-written Supabase integrations alike.
Data we’d like but don’t have
Honesty section. These are the numbers we would like to be able to quote and currently cannot, either because no one has published them or because our own sample is too small.
- Platform-level breach base rates. We do not know what percentage of Lovable, Bolt, or Base44 apps in the wild are currently leaking data. Only the platforms themselves could measure this reliably, and none has published it.
- Revenue impact of AI-built app failure. We have scattered anecdata from our own clients (refunds issued, churn spikes, chargebacks) but nothing aggregated enough to publish a dollar figure.
- Time-to-first-incident post-launch. We see the incidents at the point founders hire us, which is usually well after the first failure. We do not know the median lag between launch and first user-facing bug.
- Control group. An AI-built app rescue dataset is not a random sample of AI-built apps — only the broken ones hire us. We flag this explicitly on every claim derived from our own data.
- Longitudinal data. We do not yet have a sample that has been in production long enough to quantify how the same app degrades over six, twelve, and eighteen months without intervention.
Cite-able facts roundup
Designed for quoting. If you are writing about AI-built app quality and want a well-sourced line, take one of these.
- 48% of AI-generated code ships with at least one known vulnerability. Veracode 2025.
- 170 Lovable apps leaked 18,000+ users in a single RLS-disabled incident (CVE-2025-48757). The Register, Feb 2026.
- AI agents plateau on webhook idempotency, retry handling, and error paths in real Stripe integration benchmarks. Stripe benchmark 2025.
- Vibe-coded projects exhibit dominant authentication and authorisation failure patterns, with injection and secrets bugs close behind. Snyk.
- 92% of broken Lovable apps brought to Afterbuild Labs fail on one of five modes (RLS, OAuth, Stripe webhooks, env vars, CORS). Afterbuild Labs methodology.
- Median time-to-production after a rescue engagement: 19 days. Afterbuild Labs methodology.
- 47 AI-built apps rescued through April 2026, with a 100% handoff rate on completed engagements. Afterbuild Labs methodology.
New to the vocabulary in this page? The glossary defines every term a lay reader might not have met before, from RLS to token spiral to demoware. Author: Hyder Shah.