Frontend SLO/SLI and Error Budgets: Operating Quality at Scale

Frontend reliability work starts by naming the terms correctly. A Service Level Indicator measures observed user experience, a Service Level Objective sets the target, a Service Level Agreement is an external promise, and the error budget converts the target into release policy.

Definitions First

| Term | Full form | What it means | Frontend example |
| --- | --- | --- | --- |
| SLI | Service Level Indicator | The measured signal. | p75 Largest Contentful Paint, Interaction to Next Paint, JavaScript error-free sessions, checkout success rate. |
| SLO | Service Level Objective | The target for an SLI over a time window. | 95% of product-detail page views have LCP under 2.5 seconds over 28 days. |
| SLA | Service Level Agreement | An external customer-facing commitment, often contractual. | Enterprise contract promises availability or support response with penalties or credits. |
| Error budget | Allowed unreliability | The amount of failure allowed before the SLO is missed. | If the SLO allows 0.5% failed checkout sessions, that 0.5% is the budget. |

Frontend Mental Model

Measure the Experience Users Actually Receive

Backend uptime can be green while the frontend is broken. A user can still fail because JavaScript crashes, chunks fail to load, hydration breaks, a third-party script blocks the main thread, or the page is technically available but unusably slow.

Frontend SLIs should therefore represent browser-visible outcomes:

•Did the page load successfully?
•Did the route become usable fast enough?
•Did the interaction respond within the target?
•Did the critical flow complete without a frontend error?
•Did the user experience layout instability?

The staff-level move is translating reliability from infrastructure availability into user journey quality.

Choosing Frontend SLIs

Good SLIs Are Specific and Measurable

Useful frontend Service Level Indicators include:

•Page load success rate: percentage of route visits that load without fatal render, chunk, or network failure.
•JavaScript error-free session rate: percentage of sessions without uncaught runtime errors or unhandled promise rejections.
•Critical flow completion: percentage of checkout, signup, login, or search journeys that complete successfully.
•Core Web Vitals: LCP for loading, INP for responsiveness, CLS for visual stability.
•API success as seen by the browser: request success rate from the user agent, not just server-side logs.
•Route transition latency: time from navigation intent to useful content visible.

Avoid vague SLIs like average page speed. Percentiles and route-level segmentation are usually more useful because real user experience is not evenly distributed.
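To make the point about averages concrete, the sketch below compares the mean with nearest-rank percentiles on a small set of made-up LCP samples. The sample values and the helper names `percentile` and `mean` are illustrative assumptions, not part of any library.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Hypothetical LCP samples in milliseconds: eight fast visits, two very slow ones.
const lcpSamplesMs = [900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 7000, 9000];

const avgLcp = mean(lcpSamplesMs);           // 2600
const p75Lcp = percentile(lcpSamplesMs, 75); // 1600
const p95Lcp = percentile(lcpSamplesMs, 95); // 9000
```

In this data the mean (2600 ms) sits above a 2.5 second threshold even though eight of ten visits were fast, while p75 and p95 separate the typical experience from the slow tail.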

SLI Quality Checklist

A Good SLI Can Be Operated

Before turning a frontend metric into a Service Level Indicator, ask:

•Does it represent user pain or only engineering curiosity?
•Can it be measured reliably in real browsers?
•Can it be segmented by route, device, release, and cohort?
•Does the team know what action to take when it degrades?
•Can it be protected from bot traffic, duplicate events, and sampling bias?
•Is it stable enough to compare across releases?

A metric that cannot drive a decision is not a good SLI.

Writing Good Frontend SLOs

SLO = Metric + Target + Window + Scope

A Service Level Objective should be precise enough to operate:

95% of product-detail page views on mobile and desktop should have
Largest Contentful Paint under 2.5 seconds over a rolling 28-day window.

That sentence has the required parts:

•Metric: Largest Contentful Paint.
•Threshold: under 2.5 seconds for a page view to count as good.
•Target: 95% of page views.
•Scope: product-detail page views.
•Window: rolling 28 days.
•Segmentation expectation: mobile and desktop should both be examined.

Another frontend example:

99.5% of checkout sessions should complete without a fatal frontend error
or failed payment UI transition over a rolling 30-day window.

Good SLOs are strict enough to protect users and loose enough to leave room for product velocity.
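One way to keep an SLO operable is to encode its parts as data and evaluate it mechanically. The following is a minimal sketch under stated assumptions: the `ThresholdSlo` shape, the function names, and the sample values are hypothetical, and the caller is assumed to have already filtered events to the SLO's scope and window.

```typescript
// A threshold-based SLO: an event is "good" when its value is under the threshold.
interface ThresholdSlo {
  metric: string;      // what is measured, e.g. "LCP"
  thresholdMs: number; // per-event cutoff for "good"
  targetRatio: number; // required share of good events, e.g. 0.95
  windowDays: number;  // rolling window length
  scope: string;       // population the SLO covers
}

// Hypothetical encoding of the product-detail example above.
const pdpLcpSlo: ThresholdSlo = {
  metric: "LCP",
  thresholdMs: 2500,
  targetRatio: 0.95,
  windowDays: 28,
  scope: "product-detail page views",
};

// Share of events under the threshold. `valuesMs` is assumed to contain only
// events that fall inside the SLO's scope and window.
function attainment(slo: ThresholdSlo, valuesMs: number[]): number {
  if (valuesMs.length === 0) throw new Error("no events in window");
  const good = valuesMs.filter((v) => v < slo.thresholdMs).length;
  return good / valuesMs.length;
}

function isMet(slo: ThresholdSlo, valuesMs: number[]): boolean {
  return attainment(slo, valuesMs) >= slo.targetRatio;
}

// Ten made-up page views: nine under the threshold, one over it.
const pdpViewsMs = [1200, 1400, 1500, 1800, 1900, 2000, 2100, 2200, 2400, 4100];
const pdpAttainment = attainment(pdpLcpSlo, pdpViewsMs); // 0.9
const pdpMet = isMet(pdpLcpSlo, pdpViewsMs);             // false, because 0.9 < 0.95
```

Keeping the threshold (what counts as good) separate from the target (how many events must be good) makes it harder to confuse the two when the SLO is reviewed.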

Core Web Vitals as SLIs

Web Vitals Fit Naturally Into Frontend SLOs

Core Web Vitals are good frontend SLIs because they map to user-centered outcomes:

•Largest Contentful Paint (LCP): loading experience.
•Interaction to Next Paint (INP): responsiveness.
•Cumulative Layout Shift (CLS): visual stability.

A strong frontend SLO does not simply say "make LCP good." It names the route, population, percentile, threshold, and time window.

Example:

At least 75% of landing-page visits should meet all Core Web Vitals thresholds,
segmented by mobile and desktop, over the last 28 days.

For business-critical flows, a team may set internal targets that are stricter than the public "good" thresholds reported by search and performance tooling.
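The example SLO above can be sketched as a pass-rate calculation over per-visit samples. This is an illustration, not a definitive implementation: the `VitalsSample` shape and the visit data are assumptions, and the constants are the publicly documented "good" thresholds (LCP at or under 2.5 seconds, INP at or under 200 milliseconds, CLS at or under 0.1).

```typescript
type Device = "mobile" | "desktop";

interface VitalsSample {
  device: Device;
  lcpMs: number;
  inpMs: number;
  cls: number;
}

// Publicly documented "good" thresholds for the three Core Web Vitals.
const GOOD_VITALS = { lcpMs: 2500, inpMs: 200, cls: 0.1 };

function meetsAllVitals(s: VitalsSample): boolean {
  return (
    s.lcpMs <= GOOD_VITALS.lcpMs &&
    s.inpMs <= GOOD_VITALS.inpMs &&
    s.cls <= GOOD_VITALS.cls
  );
}

// Share of visits that meet every threshold.
function passRate(samples: VitalsSample[]): number {
  if (samples.length === 0) throw new Error("no samples");
  return samples.filter(meetsAllVitals).length / samples.length;
}

// Made-up landing-page visits.
const landingVisits: VitalsSample[] = [
  { device: "desktop", lcpMs: 1400, inpMs: 90, cls: 0.02 },
  { device: "desktop", lcpMs: 1700, inpMs: 120, cls: 0.05 },
  { device: "desktop", lcpMs: 2100, inpMs: 150, cls: 0.08 },
  { device: "desktop", lcpMs: 1900, inpMs: 110, cls: 0.01 },
  { device: "mobile", lcpMs: 2300, inpMs: 180, cls: 0.04 },
  { device: "mobile", lcpMs: 2400, inpMs: 160, cls: 0.09 },
  { device: "mobile", lcpMs: 3900, inpMs: 170, cls: 0.03 }, // slow LCP
  { device: "mobile", lcpMs: 2200, inpMs: 420, cls: 0.06 }, // slow INP
];

const overallRate = passRate(landingVisits); // 0.75
const mobileRate = passRate(landingVisits.filter((s) => s.device === "mobile"));   // 0.5
const desktopRate = passRate(landingVisits.filter((s) => s.device === "desktop")); // 1
```

The overall rate just reaches 75% here, yet only half of the mobile visits pass, which is why the SLO text asks for the device split explicitly.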

Error Budget Policy

Budgets Turn Metrics Into Decisions

If an SLO allows 0.5% failure over a window, that 0.5% is the error budget. The budget says how much unreliability the team can spend before changing behavior.
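The budget arithmetic is simple enough to sketch directly. In the example below, `budgetStatus` and its field names are hypothetical, and the session counts are invented for illustration.

```typescript
interface BudgetStatus {
  allowedBad: number; // bad events the SLO tolerates for the traffic seen so far
  consumed: number;   // fraction of the budget spent (1 means exhausted)
  remaining: number;  // fraction of the budget left (negative means overspent)
}

// budgetRatio is the allowed failure share, e.g. 0.005 for a 99.5% SLO.
function budgetStatus(
  budgetRatio: number,
  totalEvents: number,
  badEvents: number
): BudgetStatus {
  if (budgetRatio <= 0 || budgetRatio >= 1) {
    throw new Error("budgetRatio must be between 0 and 1");
  }
  if (totalEvents <= 0) throw new Error("no events in window");
  const allowedBad = budgetRatio * totalEvents;
  const consumed = badEvents / allowedBad;
  return { allowedBad, consumed, remaining: 1 - consumed };
}

// 0.5% budget, 200,000 checkout sessions in the window so far, 600 of them failed.
const checkoutBudget = budgetStatus(0.005, 200000, 600);
// allowedBad is about 1000, consumed about 0.6, remaining about 0.4
```

Expressing the budget as a count of allowed failures ("about 1000 failed sessions this window, 600 already spent") tends to be easier to reason about in release discussions than a bare percentage.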

Example policy:

•Healthy budget: normal release cadence.
•Elevated burn: require extra review for risky launches.
•Fast burn: pause the rollout, activate mitigation, and page the owner if user impact is high.
•Budget exhausted: prioritize reliability work and block unrelated risky releases.

The point is not punishment. The point is aligning product speed with user trust.
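The example policy above can be sketched as a small decision function. The cutoffs `ELEVATED_BURN` and `FAST_BURN` and the policy names are assumptions chosen for illustration; a real team would set them from its own SLO window and risk tolerance. Burn rate here means the observed failure rate divided by the rate the SLO allows.

```typescript
type ReleasePolicy =
  | "normal-cadence"
  | "extra-review"
  | "pause-rollout"
  | "reliability-work-only";

// Hypothetical cutoffs, expressed as multiples of the sustainable burn rate (1x).
const ELEVATED_BURN = 2;
const FAST_BURN = 10;

// budgetRemaining: fraction of the error budget left (0 or below means exhausted).
// burnRate: observed failure rate divided by the failure rate the SLO allows.
function releasePolicy(budgetRemaining: number, burnRate: number): ReleasePolicy {
  if (budgetRemaining <= 0) return "reliability-work-only";
  if (burnRate >= FAST_BURN) return "pause-rollout";
  if (burnRate >= ELEVATED_BURN) return "extra-review";
  return "normal-cadence";
}

const healthy = releasePolicy(0.4, 1);   // "normal-cadence"
const elevated = releasePolicy(0.4, 3);  // "extra-review"
const fast = releasePolicy(0.4, 12);     // "pause-rollout"
const exhausted = releasePolicy(0, 0.5); // "reliability-work-only"
```

Checking for an exhausted budget first means a quiet period cannot unlock risky releases while the window is still overspent.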

Burn Rate Alerting

Alert on Consumption Speed

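Burn rate measures how quickly the system is consuming its error budget: the observed failure rate divided by the failure rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. A short, severe incident and a slow reliability leak need different alerting windows.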

Useful alerting combines:

•fast-burn alerts for acute incidents
•slow-burn alerts for persistent degradation
•route, release, and cohort segmentation for diagnosis

Alert fatigue is a design failure. Alerts should represent user-impacting budget burn, not every noisy metric spike.
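A minimal sketch of two-window burn-rate alerting follows. The thresholds are assumptions for a 30-day SLO window, chosen so that each alert corresponds to a fixed share of the budget; the type and function names are hypothetical.

```typescript
interface WindowCounts {
  bad: number;   // failed events observed in the lookback window
  total: number; // all events observed in the lookback window
}

type BurnAlert = "page" | "ticket" | "none";

// Burn rate = observed failure ratio / failure ratio the SLO allows.
// A value of 1 spends the budget exactly over the full SLO window.
function burnRate(counts: WindowCounts, budgetRatio: number): number {
  if (budgetRatio <= 0) throw new Error("budgetRatio must be positive");
  if (counts.total === 0) return 0; // no traffic, nothing burned
  return counts.bad / counts.total / budgetRatio;
}

// Assumed thresholds for a 30-day (720-hour) SLO window:
//   14.4x sustained for 1 hour spends 14.4 / 720 = 2% of the budget
//   3x sustained for 24 hours spends 3 * 24 / 720 = 10% of the budget
const FAST_BURN_1H = 14.4;
const SLOW_BURN_24H = 3;

function classifyBurn(
  lastHour: WindowCounts,
  lastDay: WindowCounts,
  budgetRatio: number
): BurnAlert {
  if (burnRate(lastHour, budgetRatio) >= FAST_BURN_1H) return "page";
  if (burnRate(lastDay, budgetRatio) >= SLOW_BURN_24H) return "ticket";
  return "none";
}

// With a 0.5% budget: 8% of sessions failing in the last hour is a 16x burn.
const acute = classifyBurn({ bad: 80, total: 1000 }, { bad: 100, total: 24000 }, 0.005);
// 2% of sessions failing across the last day is a 4x burn.
const leak = classifyBurn({ bad: 5, total: 1000 }, { bad: 480, total: 24000 }, 0.005);
```

The short window catches acute incidents quickly, while the long window catches degradation that never spikes high enough to page anyone.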

Segmentation

Aggregates Hide Broken Experiences

Frontend reliability must be sliced by:

•route or journey
•device class
•browser version
•geography
•release
•experiment or feature flag variant
•authenticated vs anonymous users

A global SLO can be green while mobile checkout is broken. Senior engineers inspect the distribution, not only the headline number.
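The sketch below shows how a green aggregate can hide a red segment. The record shape, the helper names, the 97% target, and the traffic mix are all assumptions made up for the example.

```typescript
interface SessionRecord {
  route: string;
  device: string;
  errorFree: boolean;
}

interface SegmentStat {
  good: number;
  total: number;
  ratio: number;
}

// Groups sessions by an arbitrary key and computes the error-free ratio per group.
function statsBySegment(
  sessions: SessionRecord[],
  keyOf: (s: SessionRecord) => string
): Record<string, SegmentStat> {
  const stats: Record<string, SegmentStat> = Object.create(null); // no inherited keys
  for (const s of sessions) {
    const key = keyOf(s);
    const stat = stats[key] || (stats[key] = { good: 0, total: 0, ratio: 0 });
    stat.total += 1;
    if (s.errorFree) stat.good += 1;
    stat.ratio = stat.good / stat.total;
  }
  return stats;
}

function failingSegments(stats: Record<string, SegmentStat>, targetRatio: number): string[] {
  return Object.keys(stats).filter((key) => stats[key].ratio < targetRatio);
}

// Helper for building made-up traffic.
function repeatSession(count: number, session: SessionRecord): SessionRecord[] {
  const out: SessionRecord[] = [];
  for (let i = 0; i < count; i++) out.push(session);
  return out;
}

const checkoutSessions = repeatSession(900, { route: "checkout", device: "desktop", errorFree: true })
  .concat(repeatSession(80, { route: "checkout", device: "mobile", errorFree: true }))
  .concat(repeatSession(20, { route: "checkout", device: "mobile", errorFree: false }));

const globalStats = statsBySegment(checkoutSessions, () => "all");
const deviceStats = statsBySegment(checkoutSessions, (s) => s.route + "|" + s.device);
const brokenSegments = failingSegments(deviceStats, 0.97); // ["checkout|mobile"]
```

Globally, 98% of sessions are error-free and a 97% target looks met, but mobile checkout sits at 80% because its failures are diluted by the much larger desktop population.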

Instrumentation Risks

The Metric Can Lie

Common pitfalls:

•dropped telemetry during crashes
•missing aborted requests
•duplicate session counting
•client clock issues
•sampled data hiding rare critical failures
•inconsistent route naming
•bot or synthetic traffic mixed with real users

If the instrumentation is wrong, the budget is fiction. Validate the measurement path before using it for release governance.
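Some of these pitfalls can be reduced with a cleaning pass before events are counted. The sketch below handles three of them: duplicate events, bot traffic, and inconsistent route naming. Everything in it is an assumption for illustration: the event shape, the crude user-agent pattern, and the rule that numeric path segments are identifiers. It does not address dropped telemetry, clock skew, or sampling.

```typescript
interface TelemetryEvent {
  eventId: string;   // assumed to be unique per logical event
  sessionId: string;
  userAgent: string;
  route: string;     // raw path as reported by the browser
}

// Crude, hypothetical bot heuristic; real filtering usually needs more signals.
const BOT_PATTERN = /bot|crawler|spider|headless/i;

// Collapses query strings, trailing slashes, and numeric ids so that
// "/product/123?ref=ad" and "/product/789/" count as the same route.
function normalizeRoute(path: string): string {
  const withoutQuery = path.split("?")[0];
  const templated = withoutQuery.replace(/\/\d+(?=\/|$)/g, "/:id");
  return templated.replace(/\/+$/, "") || "/";
}

function cleanEvents(events: TelemetryEvent[]): TelemetryEvent[] {
  const seen: Record<string, boolean> = Object.create(null); // no inherited keys
  const out: TelemetryEvent[] = [];
  for (const e of events) {
    if (BOT_PATTERN.test(e.userAgent)) continue; // drop likely bots
    if (seen[e.eventId]) continue;               // drop duplicate deliveries
    seen[e.eventId] = true;
    out.push({ ...e, route: normalizeRoute(e.route) });
  }
  return out;
}

// Made-up raw events: one duplicate delivery and one crawler hit.
const rawEvents: TelemetryEvent[] = [
  { eventId: "e1", sessionId: "s1", userAgent: "Mozilla/5.0 (iPhone)", route: "/product/123?ref=ad" },
  { eventId: "e1", sessionId: "s1", userAgent: "Mozilla/5.0 (iPhone)", route: "/product/123?ref=ad" },
  { eventId: "e2", sessionId: "s2", userAgent: "Googlebot/2.1", route: "/product/456" },
  { eventId: "e3", sessionId: "s3", userAgent: "Mozilla/5.0 (Windows NT 10.0)", route: "/product/789/" },
];

const cleanedEvents = cleanEvents(rawEvents); // two events, both on "/product/:id"
```

Running the cleaning step in one shared place, rather than in each dashboard query, keeps every SLI computed from the same definition of a valid event.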

Interview Framing

Senior Answer Pattern

I would define Service Level Indicators around critical frontend user journeys, set Service Level Objectives with a target and time window, track error budget burn, segment by route/device/release, and tie release policy to budget health.

For frontend, I would include JavaScript error-free sessions, successful page loads, Core Web Vitals, interaction latency, chunk load success, and critical flow completion. I would not rely only on backend uptime, because users experience the browser, not just the server.

Key Takeaways

1. SLI means Service Level Indicator: the measured signal.
2. SLO means Service Level Objective: the target for an SLI over a time window.
3. SLA means Service Level Agreement: an external customer-facing commitment, often contractual.
4. An error budget is the amount of unreliability allowed before the SLO is missed.
5. Frontend SLIs should measure real browser user experience, not only server uptime.
6. A good frontend SLI must be measurable, segmentable, actionable, and tied to user pain.
7. Core Web Vitals can be frontend SLIs when scoped to routes, percentiles, and windows.
8. Error budgets convert reliability targets into release policy.
9. Segmented metrics prevent aggregates from hiding broken cohorts.