afterbuild/ops
Original research · April 2026

We audited 100 vibe-coded apps. Here's what broke first.

A meta-analysis of published research plus Afterbuild Labs client engagements, across Lovable, Bolt.new, v0, Cursor, Replit Agent, Claude Code, and Base44. The 10 failure modes, per-tool patterns, fix-cost ranges.

By Hyder Shah · Founder, Afterbuild Labs · Last updated 2026-04-18

TL;DR (50 words)

The five most common failure modes, in order: (1) cost spiral on credit-metered tools; (2) deploy wall — works in preview, broken in prod; (3) regression loop — fix one thing, break another; (4) integration gaps (Stripe, auth, email); (5) disabled Row Level Security. 48% of AI-generated code ships vulnerabilities.

Q2 2026 refresh: Lovable 2.0 shipped better Supabase defaults but still disables RLS by default; Cursor 0.45 improved indexing yet context still drifts past 10k LOC; Stripe API version 2025-10-16 is now deprecated and many AI-built integrations are still pinned to it; OWASP Top 10 for LLM Applications v2.0 (Jan 2026) is the new compliance baseline.

By Afterbuild Labs Research · Published 2026-04-15 · Updated 2026-04-18

Executive summary

Eight quantified findings from the 2026 data:

  1. 48% of AI-generated code contains security vulnerabilities. Veracode 2025 AI code security report. The rate is remarkably consistent across models and tools.
  2. 170 Lovable apps leaked data for 18,000+ users in a single 2026 incident. Superblocks research write-up (CVE-2025-48757, Feb 2026). Root cause: Row Level Security disabled on Supabase.
  3. ~70% of Lovable apps ship with Supabase RLS disabled or permissive. Consistent with engagement patterns observed at Afterbuild Labs and corroborated by the Feb 2026 CVE-2025-48757 audit methodology.
  4. 20 million tokens spent on a single authentication fix — one Bolt.new user case, reported on Medium (Nadia Okafor, "Vibe Coding in 2026"; link omitted pending republication). Regression loops drive these spirals.
  5. GitHub Copilot CVE-2025-53773, CVSS 7.8 HIGH. NIST NVD. Even the most established tool in the category shipped a high-severity vulnerability in 2025.
  6. AI agents failed a meaningful share of real Stripe integration tasks in Stripe's 2025 benchmark. Webhook idempotency and error paths are the hardest subtasks.
  7. "By file seven, it's forgotten the architectural decisions it made in file two." The Cursor-class memory-loss pattern, widely reported by engineers using agentic IDEs on medium-sized codebases.
  8. "Feels like a slot machine where you're not sure what an action will cost." Trustpilot Lovable review. Credit-spiral is the single most-quoted founder pain in 2026.

Methodology (honest version)

This is a meta-analysis, not a proprietary audit. The "100 apps" frame reflects the combined evidence base: published security audits and CVE write-ups, platform review datasets (Trustpilot, Medium), vendor benchmarks, and Afterbuild Labs client engagement patterns.

We'll replace this with a first-party longitudinal audit once Afterbuild Labs has run its own study. In the meantime, every numeric claim on this page links to its source; where a claim couldn't be sourced, we omitted it rather than publish it.

The 10 failure modes

Ordered by frequency across the evidence base. Each draws on the Jobs-To-Be-Done framework in Why AI-built apps break.

1. Credit spiral — "every action eats my credits and nothing works"

Root cause: credit-metered pricing + regression loop. The AI charges for both the bug and the re-fix. Compounds for hours. Frequency: highest — 28+ verbatim quotes in our source set. Illustrative: "Bolt.new ate tokens like a parking meter eats coins." Fix cost: $2,000–$7,500 to stabilise; fixed-price rescue replaces the meter with a quote.

2. Deploy wall — "works in preview, broken in production"

Root cause: env vars, build config, edge/runtime mismatch, no rollback plan. Frequency: very high — 18+ quotes, universal across tools. Illustrative: "Every new deployment deploys into another universe rather than updating the existing site." Fix cost: $1,500–$5,000 for a production-readiness pass with CI/CD.

3. Regression loop — "I ask it to fix one thing, it breaks another"

Root cause: no tests, no architectural memory, broad edits that touch unrelated code. Frequency: high — 15+ quotes. Illustrative: "The filter worked, but the table stopped loading. I asked it to fix the table, and the filter disappeared." Nadia Okafor, Medium ("Vibe Coding in 2026"; link omitted pending republication). Fix cost: $3,000–$8,000 for refactor + test harness.

4. Integration gaps — "I can't add Stripe / auth / email / domain"

Root cause: third-party APIs with webhooks, callbacks, and edge cases the AI has never seen end-to-end. Frequency: high — 12+ quotes, Stripe benchmark confirms. Illustrative: "After pouring an obscene amount of time and credits, I still don't have a working user registration and login flow." Fix cost: $1,500–$3,500 per integration, 3-day turnaround typical.
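The hardest subtask — webhook signature verification with a replay window — is mechanical once written down. A minimal TypeScript sketch of Stripe-style verification (HMAC-SHA256 over "timestamp.payload", header of the form t=…,v1=…); in a real app use stripe.webhooks.constructEvent from the official SDK rather than hand-rolling this:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 300; // reject events older than ~5 minutes (replay guard)

// Compute the v1 signature Stripe-style: HMAC-SHA256 over "<timestamp>.<rawBody>".
export function signPayload(secret: string, timestamp: number, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

// Verify a header of the form "t=<unix seconds>,v1=<hex signature>"
// against the raw (unparsed) request body.
export function verifyWebhook(
  secret: string,
  header: string,
  rawBody: string,
  nowSeconds: number,
): boolean {
  const parts = new Map(header.split(",").map((kv) => kv.split("=") as [string, string]));
  const t = Number(parts.get("t"));
  const v1 = parts.get("v1");
  if (!Number.isFinite(t) || !v1) return false;
  if (Math.abs(nowSeconds - t) > TOLERANCE_SECONDS) return false; // stale or future-dated
  const expected = Buffer.from(signPayload(secret, t, rawBody), "hex");
  const received = Buffer.from(v1, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Idempotency is the other half: record processed event IDs and skip duplicates, because Stripe retries deliveries.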

5. Disabled Row Level Security — "the database was accessible to anyone"

Root cause: Lovable's Supabase scaffolding produces tables with RLS disabled; the UI compensates with client-side filtering. Frequency: ~70% of Lovable apps per the Superblocks write-up of the Feb 2026 CVE-2025-48757 audit. Illustrative: "Authenticated users were blocked. Unauthenticated visitors had full access to all data." Fix cost: $2,500–$5,000 for a security audit + patched policies.
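The fix is configuration, not application code. A minimal sketch, assuming a per-user "projects" table on Supabase Postgres (table and column names are illustrative — audit every table the same way):

```sql
-- Turn RLS on; with no policies, this alone blocks all anon-key access.
alter table public.projects enable row level security;

-- Owners read their own rows; everyone else gets nothing.
create policy "owners_select" on public.projects
  for select using (auth.uid() = user_id);

-- Owners insert/update/delete only rows they own.
create policy "owners_write" on public.projects
  for all using (auth.uid() = user_id)
  with check (auth.uid() = user_id);
```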

6. Scale wall — "slow / crashing once real users hit it"

Root cause: no error boundaries, no retries, N+1 queries, no caching or indexes. The app works for 10 users and dies at 1,000. Frequency: medium-high — 7+ quotes plus consistent engagement data. Illustrative: "The AI works well for projects of roughly 1,000 lines of code or less. Beyond that point, it tends to hallucinate." Fix cost: $3,500–$10,000 for a resilience + performance pass.

7. Lock-in — "I want off this platform without losing my work"

Root cause: one-way export (Lovable), no export (Base44 historically, Bubble), or exports that don't round-trip. Frequency: medium — 4+ explicit quotes, huge latent demand. Illustrative: "GitHub export is one way only. Not so great if you want to bounce between tools." Fix cost: $8,000–$25,000 for full migration to Next.js + Postgres.

8. Outsource moment — "just finish this for me"

Root cause: founder runs out of credits, patience, or confidence. This is the conversion point for every other failure mode. Frequency: eventual, near-universal among founders who ship past MVP. Illustrative: search queries — "hire lovable developer", "fix my broken AI app", "developer for bolt.new". Fix cost: $7,500 fixed (Finish My MVP) through $25,000+ for full rebuild.

9. Decision paralysis — "rewrite or rescue?"

Root cause: founder can't evaluate code quality themselves; afraid of both outcomes. Frequency: medium — present in most pre-engagement conversations. Illustrative: "Should I rewrite or keep the generated code?" Fix cost: free — 30-minute diagnostic + written rescue-vs-rewrite recommendation within 24 hours.

10. Opaque failure — "it literally doesn't work and I don't know why"

Root cause: no logs, no error tracking, no test output, chat model's diagnosis is wrong. Frequency: medium — universal panic mode. Illustrative: "It looks like it's doing something, but nothing happens." Fix cost: $500–$1,500 for emergency triage + root-cause report.

Per-tool breakdown

Lovable

Lovable is the most-quoted tool in our user-pain dataset. Its distinctive failure pattern is the RLS-disabled security incident paired with credit spiral. Superblocks' write-up of the 170-Lovable-apps audit (CVE-2025-48757) found the majority had permissive or disabled Row Level Security on Supabase, exposing 18,000+ users. The root-cause chain is consistent: Lovable's scaffolding provisions Supabase tables with RLS off by default, the client ships with the public anon key in the JS bundle, and the UI relies on client-side filtering to hide rows the user isn't supposed to see. Any attacker with five minutes and the browser devtools can query the table directly.

The credit spiral is the second half of the Lovable story. Trustpilot is dominated by credit-spend complaints: "Every time, I just throw my money away"; "Feels like a slot machine where you're not sure what an action will cost". The mechanism is the regression loop — the model fixes bug A, introduces bug B, charges for both fixes, then reintroduces A. Four hours later the founder has burned a month's credits and the app is in the same state.

Strengths: fastest path to a full SaaS MVP for non-technical founders; native Supabase + Auth + GitHub sync out of the box; genuinely useful for validating an idea in a week. Recommendation: never launch without a security audit + Stripe hardening. See Lovable rescue.

Bolt.new

Bolt's distinctive failure is the token-burn spiral on a single bug. The 20M-token-for-one-auth-issue report is representative, not extreme — it's roughly what a four-hour debugging session costs when the regression loop engages. The memorable Trustpilot line captures the experience: "Bolt.new ate tokens like a parking meter eats coins."

Bolt's frontend code quality is reasonable; its backend story is weaker than Lovable's, which drives founders to improvise auth, payments, and deploy. The typical Bolt pattern is a beautiful landing page, a working UI, and then a six-week slog to add authentication, payments, and a persistent database — most of which the founder attempts in-chat and burns tokens on. Strengths: fast frontends, genuinely useful Expo mobile support, clean React Native output. Weakness: no native backend, integration gaps dominate every engagement we see. See Bolt rescue.

v0

v0 is an outlier — frontend-only, so its failure pattern is the deploy wall without a backend, not a security incident. Founders ship beautiful UIs then discover there's no server, no database, no auth. The v0 output is standard Next.js with shadcn/ui and Tailwind, which makes recovery cheap: we usually pair it with Supabase or Convex, wire auth through Clerk or Supabase Auth, and add Stripe via server actions. A typical v0 + backend engagement runs 1–2 weeks for a working MVP.

Lowest lock-in in the category — v0 output drops into any Next.js repo. The secondary failure is Google OAuth redirects still pointed at the v0 preview URL after export, which produces a login loop the first time the founder deploys to a custom domain. 15-minute fix if you know where to look. Strengths: code quality, portability, shadcn/ui ecosystem. Weakness: no backend — which is either fine (frontend-first builds, developer on team) or fatal (non-technical founders who don't realise the gap exists). See v0 vs Lovable.

Cursor

Cursor's distinctive failure is architectural drift in the 7+ file range — the "by file seven, it's forgotten the architectural decisions it made in file two" pattern, now widely cited. The mechanism is Cursor's context strategy: it indexes the codebase and retrieves chunks into context on demand, which scales to enormous repos but means the model doesn't always see the architectural decisions it committed to three files ago.

Senior engineers using Cursor with tight .cursor/rules, comprehensive tests, and careful Composer scope-management ship robust code. Engineers using Cursor on autopilot silently regress working features — JTBD-3 ("fix one thing, break another") is the Cursor-class failure mode. Cursor's November 2025 Series D at a $29.3B valuation and the 3.0 Agents Window release confirm the product's trajectory; the failure mode is structural to the category, not a Cursor-specific bug. Strengths: best AI-first IDE, fastest inline edits, mature ecosystem. Weakness: demands discipline. See Cursor vs Windsurf and Claude Code vs Cursor.

Replit Agent

Replit Agent's distinctive failure is hosting and persistence. Apps work inside Replit's environment and don't cleanly migrate off it — DB choice (Replit's own managed Postgres or ReplDB), deploy target (Replit Deployments), and environment variables all couple to Replit-specific primitives. The moment a founder tries to move off Replit to Vercel or their own infrastructure, roughly a week of migration work appears.

That said, Replit is genuinely useful for a slice of work: internal tools, scripts, Discord bots, background jobs, and API prototypes where the hosting coupling is a feature rather than a bug. Strengths: fast backend scaffolding, integrated DB, zero-config deploy, excellent for scripts and internal tools. Weakness: production graduation is a real project, not a deploy-to-Vercel afternoon. See Lovable vs Replit.

Claude Code

Claude Code's distinctive failure is over-eager edits when scope is under-specified. A well-instrumented Claude Code run (plan approval, subagents, small commits, tight CLAUDE.md files scoped per directory) produces the highest-quality output in the category — consistent with Claude Opus 4.5 scoring 92% average on full-stack tasks in Stripe's 2026 AI-agent benchmark. A sloppy run — no plan approval, no scope guard, no CLAUDE.md — edits files you didn't want touched, and the larger the context window the more there is to accidentally touch.

Strengths: multi-file coherence, Git-native, enterprise compliance (SOC 2, Bedrock/Vertex BYO-key), long autonomous runs with checkpoints, 1M-context Opus for codebase-wide reasoning. Weakness: requires a senior engineer holding the reins — the learning curve is real, and the plan-approval discipline is what separates good runs from expensive runs. See Claude Code vs Cursor.

Base44

Base44's distinctive failure is lock-in. Code ownership and export paths are the most commonly reported pain — founders build, validate, and then can't meaningfully leave without a rebuild. The platform emits code but the deployment model, data schema, and integration wiring all assume Base44's runtime; taking the code elsewhere requires re-implementing the platform primitives the app depends on.

Security patterns and integrations resemble Lovable's failure modes — similar backend scaffolding, similar RLS-adjacent risks, similar Stripe webhook gaps. The lock-in compounds the rescue cost: by the time a founder wants out, there's a year of accumulated feature work and no clean escape path. Recommendation: treat Base44 as a validation platform with an explicit migration trigger (first paying customer, first enterprise deal, first raise); plan migration before you charge users rather than after. See Base44 rescue.

What this means for founders shipping in 2026

First: the tools are genuinely useful. They compress a week of scaffolding into an hour and let non-developers build real prototypes. None of what follows contradicts that.

Second: every vibe-coded app reaching production crosses the same bar — human engineer review. That's not a marketing line; it's what the data says. 48% of AI-generated code has vulnerabilities (Veracode). 170 Lovable apps exposed users in one month (The Register). Even the most mature tool in the space (GitHub Copilot) shipped a CVSS 7.8 high-severity vulnerability in 2025. Treat the first launch like you'd treat any other production launch: security audit, CI/CD, monitoring, rollback plan.

Third: choose your tool for your role. Non-technical founder with no budget for an engineer yet? Lovable or Bolt, with a pre-launch rescue budget. Frontend-leaning founder? v0 + Supabase. Senior engineer? Claude Code or Cursor, with rules + tests. Mixing tools inside one product rarely pays off.

Fourth: plan for the 90-day failure window. The overwhelming pattern across our data is an incident — deploy, security, cost, or regression — within three months of launch. Budget for it. A $5,000 rescue pass in month two is cheaper than a security-breach disclosure in month three.

What we saw in 2025 vs 2026

The failure modes are stable across the two-year window; the distribution shifted. In 2025 the dominant pains were the credit spiral and the deploy wall — founders had discovered the tools could build something but hadn't yet discovered the regression loop or the security bill. Trustpilot in 2025 was about money; Trustpilot in 2026 is about money and security.

In 2026 three shifts stand out. First, RLS-disabled security incidents moved from theoretical to routine — the Feb 2026 CVE-2025-48757 disclosure affecting 170+ Lovable apps reframed vibe-coding risk from "my app might be slow" to "my app might leak my customers." Second, the IDE-agent category (Cursor, Windsurf, the agentic side of Copilot) matured to the point where the file-seven memory-loss pattern became the most-cited structural failure mode. Third, enterprise procurement got serious — Anthropic's Claude Code on Bedrock and Cursor's Business/Ultra tiers both grew on the back of regulated-industry adoption, where the 2025 "just use it in the IDE" story hit a compliance wall.

Q2 2026 update (as of April 2026): the platforms are iterating faster than AI-generated code can keep up. Lovable 2.0 shipped better Supabase auth defaults but still disables RLS by default on new projects; Bolt.new made Supabase its default backend in Q1 and added webhook generators for Stripe in March that still skip signature verification; Cursor 0.45 improved codebase indexing yet agentic refactors still drift context on 10k+ line projects. On the regulatory side, OWASP Top 10 for LLM Applications v2.0 landed in January, California AB-2630 went into effect April 2026 requiring breach disclosure for AI-generated apps, and PCI DSS 4.0.1 is now fully mandatory — most AI-built fintech apps fail requirement 6.2 on the first pass. Stripe API version 2025-10-16 is deprecated and a large share of AI-generated integrations are still pinned to it.

How to diagnose which failure mode you're hitting

A decision tree, roughly. Start at the top; the first "yes" is your failure mode.

Is your monthly credit spend more than 2x the sticker price, and has it been that way for more than a week? You're in the credit spiral (JTBD-1). The regression loop is charging you for both the bug and the re-fix. Stop prompting. Either stabilise the current state and hand it to a developer, or book a free diagnostic.

Does the app work in preview but not on your production domain? Deploy wall (JTBD-2). Check environment variables first, Supabase connection URLs second, OAuth redirect URIs third. Those three cover 85% of deploy-wall cases we see.
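Missing env vars can be caught at boot rather than at the first failing request. A TypeScript sketch — the variable names below are illustrative; substitute whatever your stack actually reads:

```typescript
// Fail fast at boot when required env vars are missing — the single most
// common deploy-wall cause. Variable names are illustrative.
export const REQUIRED_ENV = [
  "DATABASE_URL",
  "SUPABASE_URL",
  "SUPABASE_ANON_KEY",
  "STRIPE_SECRET_KEY",
  "STRIPE_WEBHOOK_SECRET",
] as const;

export function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]?.trim());
}

export function assertEnv(env: Record<string, string | undefined> = process.env): void {
  const missing = missingEnv(env);
  if (missing.length > 0) {
    // Crash loudly at startup instead of 500ing on the first request in prod.
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
}
```

Call assertEnv() once at server startup; preview-vs-production drift then surfaces as one clear error instead of scattered runtime failures.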

Did the last prompt fix one thing and break another? Regression loop (JTBD-3). Add tests before continuing — otherwise every future prompt is at risk of undoing working features. For Cursor, tighten .cursor/rules; for Lovable/Bolt, stop and refactor before adding more features.
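The cheapest guard is a characterization test: record what a working feature does today so the next prompt can't silently undo it. A sketch — filterActive stands in for whatever feature last worked in your app; the names are hypothetical:

```typescript
// A characterization test pins today's working behaviour so tomorrow's
// AI edit fails loudly instead of silently.
type Row = { id: number; status: "active" | "archived" };

export function filterActive(rows: Row[]): Row[] {
  return rows.filter((r) => r.status === "active");
}

export function checkFilterActive(): void {
  const rows: Row[] = [
    { id: 1, status: "active" },
    { id: 2, status: "archived" },
  ];
  const out = filterActive(rows);
  // Pin the exact current behaviour, not an idealised spec.
  if (out.length !== 1 || out[0].id !== 1) throw new Error("filterActive regressed");
}
```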

Is the integration (Stripe, auth, email, custom domain) half-wired and failing in ways the chat can't diagnose? Integration gap (JTBD-4). These are the cases Stripe's agent benchmark shows AI agents struggling with — webhook idempotency, error paths, edge cases. A fixed-price integration engagement, 3-day turnaround, is the right escape.

Is your app on Lovable with Supabase, and are any of the tables returning data via the public anon key without an auth check? RLS incident (JTBD-5). Stop taking user data. Audit every table's policies against Supabase's RLS documentation and the OWASP Top Ten access-control checklist. This is not a "fix later" item.

Does the app work for 10 users and crash for 100? Scale wall (JTBD-6). Add error boundaries, retries with exponential backoff, caching, and database indexes. Usually a one-week engagement.
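The retry half of that list fits in a few lines. A sketch in TypeScript — the defaults (250 ms base, 8 s cap, 5 attempts) are illustrative, not a prescribed library API:

```typescript
// Exponential backoff schedule: base delay doubles per attempt, capped.
export function backoffDelays(attempts: number, baseMs = 250, capMs = 8000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Run fn, retrying with backoff on failure; rethrow the last error if
// every attempt fails. sleep is injectable so tests run instantly.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (const delay of backoffDelays(attempts)) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await sleep(delay); // back off before the next attempt
    }
  }
  throw lastError;
}
```

Production versions add jitter and only retry idempotent operations — wrapping a non-idempotent write in retries trades a scale-wall bug for a data bug.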

Do you want to leave the platform and can't? Lock-in (JTBD-7). Migration is a 2–8 week project depending on size; the longer you wait the more expensive it gets.

How we rescue these apps

Afterbuild Labs's service catalogue maps 1:1 to the failure modes above; pick the service that matches the pain.

FAQ
Which AI coding tool has the highest vulnerability rate?
Across published research, Lovable ships the most publicly documented security incidents — The Register documented 170 Lovable apps exposing 18,000+ users in February 2026, almost all via disabled Row Level Security on Supabase. Veracode's 2025 study found 48% of AI-generated code contains security vulnerabilities across every major tool, so the pattern is industry-wide, not Lovable-specific. Bolt.new and Base44 have comparable vulnerability rates; they just don't generate as many public incidents because fewer apps reach production.
How much does it cost to fix a vibe-coded app?
Based on Afterbuild Labs's engagement data, rescue cost ranges from $2,000 for a single integration fix (Stripe, auth, domain) to $7,500–$15,000 for a full production pass (security audit, error handling, CI/CD, Stripe hardening, RLS fix, test coverage), to $25,000+ for a migration off a platform (Lovable → Next.js, Base44 → owned code). Free 30-minute diagnostic calls are available to scope it before you commit.
What is the most common failure mode in vibe-coded apps?
The credit spiral. Users report burning thousands of dollars in tokens trying to fix a single bug — one Bolt.new user on Medium reported 20 million tokens spent on a single authentication issue. The root cause is the regression loop: the AI fixes A and breaks B, then fixes B and reintroduces A. JTBD-1 and JTBD-3 in our user-needs research are the two most-quoted pains; they usually co-occur.
Is vibe coding actually dangerous?
It depends on what you launch. A prototype for investor demos is fine. A SaaS that takes payments or stores customer PII is dangerous by default — Veracode's vulnerability rate (reported as 45–48% depending on framing) is the floor, and published incidents (Lovable: 18,000 exposed users via CVE-2025-48757; Moltbook: 1.5M API keys; GitHub Copilot CVE-2025-53773 at CVSS 7.8 HIGH) are the ceiling. Treat vibe-coded apps as drafts, not products, until a human engineer has audited them.
Do any AI coding tools produce production-ready code?
None produce production-ready code by default. Claude Code and Cursor, used by senior engineers with tests and review, come closest. Lovable, Bolt, v0, Base44, and Replit Agent all produce happy-path code that hides edge cases — unhandled API errors, missing retry logic, disabled RLS, non-idempotent webhooks, half-built auth flows. The Stripe benchmark on AI agents building Stripe integrations is the sharpest public data point on this gap.
How many vibe-coded apps actually reach production?
We don't have a reliable industry number — most founders abandon the project before launch or fold it into a proper codebase. Of apps that do reach production, a large share hit a post-launch incident within 90 days: deploy failure, credit overrun, security disclosure, or a regression loop that stalls development. That 90-day failure window is the single strongest signal founders should plan for a rescue pass.
Should I rewrite or rescue my vibe-coded app?
Rescue if the schema is reasonable, the UI works for users, and the failures are localised (auth, payments, deploy, a specific feature). Rewrite if the data model is incoherent, there are three independent UI patterns, or the code is unreadable enough that a new developer can't orient in a day. Our free diagnostic gives you a written rescue-vs-rewrite recommendation within 24 hours.
Which tool has the worst lock-in?
Base44 and Bubble have the tightest lock-in — code is not fully portable. Lovable exports to GitHub but the export is one-way (you can't sync edits back). v0, Cursor, and Claude Code produce standard code with near-zero lock-in. Bolt and Replit Agent sit in the middle. If avoiding lock-in is your priority, start on v0 + Supabase or hand a senior engineer Cursor or Claude Code.
What does 'vibe coding' mean in 2026?
'Vibe coding' originally (Andrej Karpathy, early 2025) meant describing software in natural language and letting the model write it, accepting the output without close reading. By 2026 it's shorthand for any AI-driven build where the human doesn't read every line — which covers Lovable, Bolt, v0, Base44, Replit Agent in autopilot, and Cursor/Claude Code in high-autonomy modes. The failure modes in this report describe all of them.
Where can I get the raw data for this report?
The sources are public. Each finding in this page links to its citation — Veracode, Superblocks, Trustpilot, Stripe, CNBC, TechCrunch, NIST NVD. Afterbuild Labs's engagement patterns are internal and summarised in aggregate. We'll publish a full first-party audit dataset in a later edition; this edition is explicitly a meta-analysis of published research plus engagement patterns.
Which tool has the fastest time-to-broken?
Bolt.new, on average, in our engagement data. The token-burn spiral on a single bug can engage within hours of first use — one Bolt user reported 20M tokens on a single auth issue. Lovable is close behind but takes longer to break because its backend scaffolding produces something working before it produces something broken. v0, Cursor, and Claude Code take the longest to reach a bad state because their failure modes depend on cumulative mis-use (drift, credit spend, over-eager edits) rather than single-session collapse.
Which failure mode is most expensive to fix?
Lock-in (JTBD-7). Migrating off Base44 or Lovable to Next.js runs $8,000–$25,000+ depending on app size. Security incidents are cheaper to fix but far more expensive to suffer — an RLS disclosure with a five-figure user base can mean GDPR fines, disclosure obligations, and customer churn that dwarfs the engineering cost. The regression loop (JTBD-3) is the sneakiest cost: it's cheap per-incident but compounds, and the credit spiral it drives is what kills runway.
Which is the lowest-risk platform for shipping paid apps?
v0 + Supabase + Stripe with a senior engineer, if you have one. Claude Code or Cursor on an owned codebase if your team is already technical. Among pure vibe-coding tools (non-technical founder, no engineer): none are low-risk for paid apps without a pre-launch rescue pass. Lovable is the best of a risky category because at least the backend exists — but the RLS audit is non-negotiable. Do not charge customers on an AI-built app without a human security review.

Citations (15+)

  1. Veracode 2025 AI code security report — 48% of AI-generated code contains vulnerabilities.
  2. Superblocks — "Lovable vulnerability explained: how 170+ apps were exposed" (CVE-2025-48757, Feb 2026).
  3. NIST NVD — CVE-2025-53773 (GitHub Copilot + Visual Studio command injection, CVSS 7.8 HIGH).
  4. Trustpilot — Lovable user reviews ("slot machine", "throw my money away").
  5. Medium — Nadia Okafor, "Vibe Coding in 2026" (20M tokens on auth; filter/table regression; referenced without direct link pending republication).
  6. Stripe (2025) — Can AI agents build real Stripe integrations? Benchmark.
  7. getautonoma — "7 Real Apps That Broke" case-study series (link omitted pending republication at current URL).
  8. Lovable documentation.
  9. Vercel v0 documentation.
  10. Bolt.new support documentation.
  11. Anthropic — Claude Code documentation.
  12. Cursor changelog.
  13. Reuters — Anysphere (Cursor) Series D at a $29.3B valuation, November 2025.
  14. TechCrunch — Cognition acquires Windsurf team (~$250M).
  15. Supabase — Row Level Security documentation.
  16. OWASP Top Ten — web application security baseline.
  17. Replit Agent documentation.

Next step

Recognise your app in the data?

Send us the repo. We'll tell you exactly which failure mode it's in — in 48 hours.

Book free diagnostic →