Design Google Search Latency Dashboard - Frontend System Design Interview Guide
Design a production-grade Latency Debug Dashboard for a Google Search–like system.
Goal: when latency spikes, an oncall should be able to answer quickly:
- Is the regression real? For which segments?
- Which stage got slower (edge, retrieval, ranking, ads, rendering, etc.)?
- Which queries/segments/clusters are driving the spike?
- What changed (deploys, config, experiments)?
- How do we reproduce and mitigate?
You are not required to build UI code.
Design the system: requirements, architecture, data pipeline, storage, APIs, dashboard UX, and debugging workflow.
When latency spikes in production, engineers need answers fast. This solution designs a latency dashboard that transforms raw metrics into a guided debugging workflow. Instead of just showing charts, we build a system that helps oncall engineers detect problems, identify root causes, and take action—all within minutes. The key insight: a good latency dashboard is a debugging tool, not just a monitoring tool.
HLD interview focus: Requirements, architecture, tradeoffs, data flow, and scaling decisions. Any implementation snippets shown are optional unless explicitly asked.
I'll start by defining what success looks like—what questions does an oncall engineer need to answer when latency spikes? Then I'll design the data pipeline that produces two things: (1) fast pre-aggregated metrics for interactive charts, and (2) sampled traces/logs for deep debugging. Finally, I'll design the dashboard UX as a guided workflow with correlation overlays and mitigation actions.
Why this approach?
Most candidates build a "chart dashboard"—just showing latency over time. Strong candidates build a "debugging workflow"—a system that guides engineers from problem detection to resolution. The difference is actionability: charts show what's wrong, workflows help you fix it.
For junior developers: Think of this like building a debugging tool in your browser DevTools. You don't just show network requests—you help developers understand which request failed, why it failed, and how to fix it. Same principle here.
For senior architects: This is about designing observability systems that scale. We need to balance query performance, storage costs, privacy constraints, and debugging depth. Every decision has trade-offs.
UI Implementation Guide
To implement this approach in the dashboard UI:
- Fast Pre-Aggregated Metrics (Charts): See "A - Architecture" section for data pipeline design. Use pre-computed histograms stored in metrics store (TSDB/OLAP). Query these for interactive charts that load in < 500ms.
- Sampled Traces/Logs (Debugging): See "A - Architecture" section for tail-based sampling strategy. Store sampled traces in trace store (Tempo/Jaeger). Use these for deep debugging when drilling into specific latency spikes.
- Guided Workflow UX: See "D - Dashboard UX" section for step-by-step workflow design. Implement as a multi-step interface: Overview → Segment Drilldown → Stage Breakdown → Trace Explorer → Actions.
- Correlation Overlays: See "D - Dashboard UX" → Step 1 (Overview) for change timeline overlay. Fetch from the /api/changes endpoint and overlay on latency charts.
- Mitigation Actions: See "D - Dashboard UX" → Step 6 (Actions) for one-click mitigation buttons. Connect to rollback/reroute/cache APIs.
Requirements Exploration Questions
Before designing anything, let's define what success looks like. When latency spikes, an oncall engineer needs to answer these questions quickly:
Core Questions (The "Must-Haves")
- Is the regression real? Not just a blip, but a sustained problem affecting users
- Which segments are affected? Is it all users, or specific regions/devices/query types?
- Which stage got slower? Edge routing? Document retrieval? Ranking? Ads? Rendering?
- What changed? Was there a deploy, config change, or experiment rollout?
- How do we reproduce it? Can we find sample slow requests and compare them to normal ones?
- How do we fix it? Can we rollback, reroute traffic, or adjust cache policies?
Functional Requirements
Monitoring (The "What's Happening" View)
- Real-time latency metrics: p50, p90, p95, p99 percentiles (see Core Web Vitals for performance metrics)
- Error and timeout rates: Overlay on latency charts to see correlation
- Segment breakdowns: Region, data center, device type, query class, language, cache hit/miss, experiment bucket, backend cluster
Debugging (The "Why Is It Happening" View)
- Stage decomposition: Waterfall view showing time spent in each stage (edge → retrieval → ranking → blending → ads → render)
- Top offenders: Worst-performing segments, clusters, shards, experiment buckets
- Privacy-safe query analysis: Hashed query fingerprints (not raw queries) to identify problematic patterns
- Trace explorer: Click on a latency spike → see sampled slow traces → compare against baseline traces
- Change correlation: Timeline overlay showing deploys, config pushes, incidents, experiment rollouts
Actions (The "How Do We Fix It" View)
- One-click mitigation: Links to rollback experiments, disable feature flags, reroute traffic, adjust cache policies
- Incident creation: Pre-fill incident tickets with context (affected segment, regressing stage, suspect change, sample traces)
Non-Functional Requirements
Performance
- Interactive charts: Dashboard queries should complete in < 500ms (pre-aggregated data, cached responses)
- Partial rendering: Show available data even if some metrics are delayed (handle ingestion lag gracefully)
Scalability
- Cardinality control: Bounded dimensions prevent query explosion (e.g., don't use raw query strings as dimensions)
- Efficient pagination: Support large datasets with cursor-based pagination
- Memory efficiency: Virtualize long lists (see List Virtualization)
Privacy & Security
- Default redaction: Raw query strings hidden by default
- Hashed fingerprints: Use cryptographic hashes for query analysis
- K-anonymity: Only show query exemplars if they meet minimum threshold (e.g., query appears in at least 10 requests)
- RBAC: Role-based access control for sensitive drill-downs
- Audit logs: Track who accessed what traces and when
Reliability
- Handle missing data: Graceful degradation when metrics are delayed or missing
- Data retention: Aggregates stored 30-90 days, trace samples 7-14 days (balance storage cost vs debugging needs)
Accessibility
- WCAG 2.1 AA compliance: Screen reader support, keyboard navigation, proper ARIA labels
- Color-blind friendly: Don't rely solely on color for critical information
Design Trade-offs
Key design decisions involve trade-offs between real-time vs batch processing, storage strategies (aggregates vs raw traces), sampling approaches, and more. These trade-offs are discussed in detail in the "Trade-offs and Design Decisions" section below.
Let's design the system architecture. I'll start with the data sources, then show how data flows through processing, storage, and finally to the dashboard.
System Architecture Overview
┌───────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES (What We Collect) │
├───────────────────────────────────────────────────────────────────────────┤
│ • Client RUM: Time to First Byte (TTFB), Core Web Vitals (LCP, INP) │
│ • Edge Logs: Cache hits/misses, routing decisions, edge processing time │
│ • Backend Spans: Request traces with stage timers (retrieval, ranking) │
│ • Metrics: Pre-aggregated histograms, error counts, timeout rates │
│ • Change Events: Deploy timestamps, config pushes, experiment rollouts │
└───────────────┬───────────────────────────────┬──────────────────────────┘
▼ ▼
┌──────────────────────────────┐ ┌─────────────────────────────────────┐
│ Streaming Ingestion │ │ Batch Backfill (Optional) │
│ (Kafka/PubSub) │ │ (Recompute aggregates if needed) │
│ • Real-time processing │ │ • Historical data reprocessing │
│ • Low latency │ │ • Data correction │
└───────────────┬──────────────┘ └───────────────────┬─────────────────┘
▼ ▼
┌───────────────────────────────────────────────────────────────────────────┐
│ PROCESSING & ENRICHMENT LAYER │
│ │
│ 1. Normalize Schema │
│ • Standardize request_id, trace_id across all sources │
│ • Map different naming conventions to unified schema │
│ │
│ 2. Enrich Dimensions │
│ • Add region, device type, query class from request metadata │
│ • Determine cache status (hit/miss/partial) │
│ • Identify experiment bucket for A/B tests │
│ │
│ 3. Privacy Transforms │
│ • Hash query strings to fingerprints (SHA-256) │
│ • Apply k-anonymity thresholds (only show if ≥10 occurrences) │
│ • Redact PII from logs/traces │
│ │
│ 4. Compute Aggregates │
│ • Build histograms per time bucket and dimension │
│ • Calculate percentiles (p50, p90, p95, p99) from histograms │
│ • Break down latency by stage (edge, retrieval, ranking, ads, render) │
│ │
│ 5. Tail-Based Sampling │
│ • Sample slow requests (p95+) for deep debugging │
│ • Index traces by tags (cluster, shard, experiment bucket) │
└───────────────┬───────────────────────────────┬──────────────────────────┘
▼ ▼
┌──────────────────────────────┐ ┌─────────────────────────────────────┐
│ Metrics Store (Fast Queries) │ │ Trace/Log Store (Deep Debugging) │
│ (TSDB/OLAP Database) │ │ (Tempo, Jaeger, or S3) │
│ • Pre-aggregated histograms │ │ • Sampled traces with full detail │
│ • Time-series rollups │ │ • Indexed by tags for fast search │
│ • Optimized for chart queries │ │ • Retained 7-14 days │
│ • Retained 30-90 days │ │ │
└───────────────┬──────────────┘ └───────────────────┬─────────────────┘
▼ ▼
┌───────────────────────────────────────────────────────────────────────────┐
│ DASHBOARD BFF API │
│ (Backend-for-Frontend: Composes data, handles auth, provides fast APIs) │
│ │
│ • Chart Queries: Fast reads from metrics store │
│ • Drilldowns: Paginated segment breakdowns │
│ • Trace Search: Query trace store by filters │
│ • Trace Comparison: Diff slow trace vs baseline │
│ • Change Correlation: Overlay deploys/configs/experiments │
│ • Auth/RBAC: Enforce permissions, audit access │
│ • Caching: Cache frequent queries (5-60s TTL) │
└───────────────┬───────────────────────────────────────────────────────────┘
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ DASHBOARD UI │
│ (React/Next.js frontend with guided debugging workflow) │
│ │
│ Overview → Segment Drilldown → Stage Breakdown → Trace Explorer → Actions│
└───────────────────────────────────────────────────────────────────────────┘
Why This Architecture?
For junior developers: This is a classic "separation of concerns" pattern. Each layer has a specific job:
- Data sources collect raw data
- Processing cleans and enriches it
- Storage optimizes for different query patterns (fast aggregates vs detailed traces)
- BFF composes data and handles business logic
- UI focuses on presentation and user experience
For senior architects: This is a two-store model with a BFF pattern. We separate:
- Hot path (metrics store): Optimized for fast, frequent queries
- Cold path (trace store): Optimized for infrequent, deep queries
- BFF layer: Prevents the frontend from making N+1 queries, handles auth, and provides a stable API contract
Component Breakdown (Frontend Architecture)
DashboardShell (Root Layout)
├── GlobalFilters
│ ├── TimeRangePicker (last 1h, 6h, 24h, 7d, custom)
│ ├── RegionSelector (multi-select)
│ ├── DeviceTypeFilter (mobile, desktop, tablet)
│ └── QueryClassFilter (web, image, video, news)
│
├── OverviewPage
│ ├── SLOCockpit (p95/p99 latency, error rate, timeout rate)
│ ├── LatencyChart (time series with percentile lines)
│ ├── ChangeOverlay (deploys, configs, experiments on timeline)
│ └── AlertBanner (active incidents, SLO violations)
│
├── SegmentDrilldown
│ ├── TopContributors (table: segment → contribution %)
│ ├── Heatmap (region × device × latency)
│ └── SegmentChart (breakdown by selected dimension)
│
├── StageBreakdown
│ ├── WaterfallChart (edge → retrieval → ranking → ads → render)
│ ├── StageDelta (current vs baseline, highlight regressions)
│ └── StageTable (detailed timing per stage)
│
├── OffendersPanel
│ ├── TopClusters (worst-performing backend clusters)
│ ├── TopExperiments (A/B test buckets with latency impact)
│ └── TopQueries (privacy-safe query fingerprints)
│
├── TraceExplorer
│ ├── TraceSearch (filter by time, segment, stage, cluster)
│ ├── TraceList (paginated, sortable)
│ ├── TraceDetail (waterfall view, span details)
│ └── TraceComparison (side-by-side slow vs baseline)
│
└── ActionsPanel
├── RollbackButton (experiment/config rollback)
├── RerouteButton (traffic rerouting)
├── CacheTuneButton (cache policy adjustments)
    └── CreateIncidentButton (pre-filled incident ticket)
Key Design Decisions
Two-Store Model
Why? Different query patterns need different optimizations.
- Metrics store: Optimized for time-series queries ("show me p95 latency over the last 24 hours")
- Trace store: Optimized for search queries ("find slow traces from region X between 2-3pm")
Trade-off: More complexity (two systems to maintain), but better performance and cost efficiency.
Pre-Aggregation
Why? Raw data is too large to query in real-time.
- Pre-compute histograms per time bucket and dimension
- Calculate percentiles from histograms (not from raw data)
- Store rollups at multiple granularities (1min, 5min, 1hour)
Trade-off: Storage cost vs query performance. More rollups = faster queries but more storage.
Tail-Based Sampling
Why? We can't store every trace (too expensive), but we need slow traces for debugging.
- Sample 100% of slow requests (p95+)
- Sample 1% of normal requests (for baseline comparison)
- Index by tags for fast search
Trade-off: Sampling accuracy vs storage cost. Higher sampling = better debugging but more storage.
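The hybrid policy above can be sketched as a small decision function (the tier thresholds and keep rates here are illustrative; in practice they would come from a rolling latency estimate, not hard-coded constants):

```typescript
// Hypothetical sketch of a hybrid tail-based sampling decision.
interface SamplingThresholds {
  p50Ms: number; // rolling p50 latency estimate
  p95Ms: number; // rolling p95 latency estimate
}

// Probability with which a finished request's trace is kept.
function sampleProbability(latencyMs: number, t: SamplingThresholds): number {
  if (latencyMs >= t.p95Ms) return 1.0;  // keep 100% of slow (tail) requests
  if (latencyMs >= t.p50Ms) return 0.1;  // 10% of medium requests
  return 0.01;                           // 1% baseline for comparison
}

function shouldSample(
  latencyMs: number,
  t: SamplingThresholds,
  rand: number = Math.random()
): boolean {
  return rand < sampleProbability(latencyMs, t);
}
```

Because the decision runs after the request finishes (we need the final latency), the collector must buffer spans until the trace completes, which is the main source of the extra complexity noted above.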
BFF Pattern
Why? Frontend shouldn't make N+1 queries or handle complex data composition.
- BFF composes data from multiple sources
- Handles auth, caching, and rate limiting
- Provides stable API contract (frontend doesn't break when backend changes)
Trade-off: Additional service to maintain, but better separation of concerns and frontend performance.
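One concrete piece of the BFF worth sketching is its short-TTL response cache (the 5-60s TTLs mentioned above). This is an illustrative in-memory version with an injectable clock for testability; a real deployment would more likely use a shared cache such as Redis:

```typescript
// Illustrative sketch of the BFF's short-TTL response cache.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number,
    private now: () => number = Date.now // injectable clock for tests
  ) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired: drop and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

The cache key would typically be the normalized query string of the dashboard request (endpoint plus sorted filter parameters), so identical chart queries from different oncallers hit the same entry.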
Data Model: Keeping Cardinality Under Control
The Cardinality Problem
If we use raw query strings as a dimension, we could have millions of unique values. This explodes storage and makes queries slow. Instead, we use bounded dimensions.
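As a sketch of the privacy-safe alternative (function names are illustrative), the pipeline can normalize each query and store only a SHA-256 fingerprint, so equivalent queries collapse to one bounded key and no raw text reaches the dashboard:

```typescript
import { createHash } from "node:crypto";

// Collapse formatting differences so "Foo  Bar" and " foo bar " map
// to the same fingerprint.
function normalizeQuery(q: string): string {
  return q.trim().toLowerCase().replace(/\s+/g, " ");
}

// Privacy-safe, bounded dimension value: a SHA-256 hex digest of the
// normalized query, never the raw string.
function queryFingerprint(q: string): string {
  return createHash("sha256").update(normalizeQuery(q)).digest("hex");
}
```

Note the hash is one-way: the dashboard can group and count by fingerprint, but cannot recover the original query, which is exactly the property the k-anonymity exemplar rules rely on.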
Aggregates (Fast Charts)
// Store histograms per time bucket and bounded dimensions
interface LatencyHistogram {
timeBucket: string; // "2024-01-15T10:00:00Z"
region: string; // "us-east", "eu-west" (bounded: ~10 values)
device: string; // "mobile", "desktop", "tablet" (bounded: ~3 values)
queryClass: string; // "web", "image", "video" (bounded: ~5 values)
cacheStatus: string; // "hit", "miss", "partial" (bounded: ~3 values)
expBucket: string; // "control", "treatment-a" (bounded: ~10 values)
histogram: number[]; // Bucket counts for latency distribution
p50: number; // Pre-computed percentiles
p90: number;
p95: number;
p99: number;
}
Stage Breakdown
// Per-stage histograms (for waterfall decomposition)
interface StageBreakdown {
timeBucket: string;
region: string;
// ... other dimensions
stages: {
edge: { histogram: number[]; p95: number; }; // Edge routing time
retrieval: { histogram: number[]; p95: number; }; // Document retrieval
ranking: { histogram: number[]; p95: number; }; // Ranking algorithm
ads: { histogram: number[]; p95: number; }; // Ads selection
render: { histogram: number[]; p95: number; }; // Result rendering
};
}
Traces (Sampled)
// Full trace details for debugging
interface TraceSample {
traceId: string;
requestId: string;
timestamp: string;
totalLatency: number;
tags: {
region: string;
device: string;
cluster: string;
shard: string;
expBucket: string;
cacheMissReason?: string;
};
spans: Array<{
name: string; // "edge", "retrieval", "ranking", etc.
startTime: number;
duration: number;
tags: Record<string, string>;
}>;
}
Query Exemplars (Privacy-Safe)
// Hashed query fingerprints (not raw queries)
interface QueryExemplar {
queryFingerprint: string; // SHA-256 hash of normalized query
count: number; // How many times this query appeared
avgLatency: number;
p95Latency: number;
// Safe metadata (not PII)
queryLength: number; // Character count
tokenCount: number; // Word/token count
language: string; // Detected language
// Only shown if k-anonymity threshold met (e.g., count >= 10)
}
Why Histograms Instead of Raw Data?
For junior developers: Imagine you have 1 million requests per minute. Storing every single latency value would be huge. Instead, we group them into "buckets" (0-10ms, 10-20ms, 20-50ms, etc.) and count how many fall into each bucket. This is a histogram. From histograms, we can calculate percentiles (p95 = 95% of requests are faster than this value).
For senior architects: Histograms are space-efficient and allow percentile calculations without storing raw data. Use Prometheus-style histograms for efficient storage and querying.
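A minimal sketch of percentile estimation from histogram buckets, assuming explicit bucket upper bounds and linear interpolation within the bucket that contains the target rank (the same idea Prometheus-style histograms use):

```typescript
// Estimate a percentile from bucket counts. upperBoundsMs[i] is the
// inclusive upper edge of bucket i; counts[i] is how many requests fell
// into that bucket. Assumes both arrays have the same length.
function estimatePercentile(
  upperBoundsMs: number[], // e.g. [10, 20, 50, 100, 250, 500, 1000]
  counts: number[],
  percentile: number       // e.g. 95 for p95
): number {
  const total = counts.reduce((a, b) => a + b, 0);
  const target = (percentile / 100) * total;
  let cumulative = 0;
  for (let i = 0; i < counts.length; i++) {
    const prev = cumulative;
    cumulative += counts[i];
    if (cumulative >= target) {
      const lower = i === 0 ? 0 : upperBoundsMs[i - 1];
      // Linear interpolation inside the bucket that crosses the target rank.
      const fraction = counts[i] === 0 ? 0 : (target - prev) / counts[i];
      return lower + fraction * (upperBoundsMs[i] - lower);
    }
  }
  return upperBoundsMs[upperBoundsMs.length - 1];
}
```

The estimate's error is bounded by the bucket width, which is why bucket edges should be densest around the SLO targets you care about (e.g. fine-grained buckets near the p95 target).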
The dashboard UI isn't just a collection of charts—it's a guided workflow that helps engineers debug latency issues step-by-step.
The Debugging Workflow (Step-by-Step)
Step 1: Overview ("Is There a Problem?")
Goal: Quickly identify if there's a real regression and when it started.
What to show:
- SLO Cockpit: Large, prominent p95/p99 latency with color coding (green/yellow/red)
- Time Series Chart: Latency over time with percentile lines (p50, p90, p95, p99)
- Error Overlay: Error rate and timeout rate on the same chart (correlation)
- Change Timeline: Overlay deploys, config pushes, experiments ("did something change?")
- Alert Banner: Active incidents, SLO violations
UX Patterns:
- Show loading states while data loads
- Handle empty states gracefully (no data available)
- Make charts responsive (mobile-friendly)
For junior developers: This is like the "homepage" of the dashboard. It answers: "Is everything okay?" If not, you drill down.
For senior architects: This is the "detection" phase. We need fast queries (< 500ms) and clear visual hierarchy.
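The change-timeline overlay reduces to a simple correlation query: which changes landed shortly before the spike started? A sketch, assuming change events carry epoch-millisecond timestamps (the event shape is hypothetical, loosely mirroring what a /api/changes endpoint might return):

```typescript
// Hypothetical change-event shape for the overlay.
interface ChangeEvent {
  id: string;
  type: "deploy" | "config" | "experiment" | "incident";
  timestamp: number; // epoch ms
}

// Changes within `windowMs` before the spike onset, newest first —
// the most likely suspects to highlight on the timeline.
function suspectChanges(
  changes: ChangeEvent[],
  spikeStartMs: number,
  windowMs: number = 30 * 60 * 1000 // default: 30-minute lookback
): ChangeEvent[] {
  return changes
    .filter(c => c.timestamp <= spikeStartMs && spikeStartMs - c.timestamp <= windowMs)
    .sort((a, b) => b.timestamp - a.timestamp);
}
```

Correlation is not causation, of course: the UI should rank these as suspects for the oncaller to confirm via the stage breakdown and traces, not auto-blame the nearest deploy.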
Step 2: Segment Drilldown ("Who Is Affected?")
Goal: Identify which user segments are driving the regression.
What to show:
- Top Contributors Table: Ranked list of segments by contribution to regression
- Example: "Region: us-east contributes 45% of p95 latency increase"
- Heatmap: Visual grid showing latency by region × device type
- Segment Chart: Breakdown by selected dimension (region, device, query class, etc.)
UX Patterns:
- Click a segment → filter all downstream views
- Show percentage contribution ("this segment is 3x worse than baseline")
- Use sortable, paginated tables (see Data Table for implementation patterns)
- Add proper ARIA labels for accessibility
For junior developers: This is like filtering data in Excel. You're narrowing down: "Is it all users, or just mobile users in Europe?"
For senior architects: This is the "segmentation" phase. We need efficient queries that group by dimensions.
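The "contribution %" column can be computed as each segment's share of the total traffic-weighted p95 increase. A sketch (this weighting scheme is one reasonable choice, not the only one):

```typescript
// Per-segment stats for the drilldown table.
interface SegmentStat {
  key: string;          // e.g. "us-east/mobile"
  p95Ms: number;        // current window
  baselineP95Ms: number;
  requestCount: number;
}

// Contribution = this segment's (delta × traffic) over the sum across
// all segments. Negative deltas (improvements) are clamped to zero.
function contributions(segments: SegmentStat[]): Map<string, number> {
  const weighted = segments.map(s => ({
    key: s.key,
    delta: Math.max(0, s.p95Ms - s.baselineP95Ms) * s.requestCount,
  }));
  const total = weighted.reduce((acc, s) => acc + s.delta, 0);
  return new Map(weighted.map(s => [s.key, total === 0 ? 0 : s.delta / total]));
}
```

Weighting by request count matters: a 3x regression in a segment serving 0.1% of traffic usually matters less than a 1.2x regression in one serving 40%.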
Step 3: Stage Breakdown ("Where Is It Slow?")
Goal: Identify which stage of the request pipeline is regressing.
What to show:
- Waterfall Chart: Visual breakdown of time spent in each stage
- Edge routing: 50ms
- Document retrieval: 200ms ← This is the problem!
- Ranking: 100ms
- Ads selection: 80ms
- Result rendering: 50ms
- Stage Delta: Compare current vs baseline (highlight regressions in red)
- Stage Table: Detailed timing per stage with percentiles
UX Patterns:
- Use waterfall visualization (similar to browser DevTools Network tab)
- Highlight regressions (red background, bold text)
- Show tooltips with detailed metrics
- Make stages clickable → filter traces by stage
For junior developers: This is like the Network tab in DevTools. You see which request took longest and why. Here, you see which stage of search took longest.
For senior architects: This is the "decomposition" phase. We need stage-level histograms and efficient percentile calculations.
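The stage-delta highlight reduces to comparing each stage's p95 against a baseline window and flagging the outliers. A sketch, with an illustrative 20% regression threshold:

```typescript
// One stage's p95 in a given window.
interface StageTiming {
  stage: string; // "edge", "retrieval", "ranking", "ads", "render"
  p95Ms: number;
}

// Compare current vs baseline per stage; `regressed` drives the red
// highlight in the waterfall.
function stageDeltas(
  current: StageTiming[],
  baseline: StageTiming[],
  regressionThresholdPct: number = 20
) {
  const base = new Map(baseline.map(s => [s.stage, s.p95Ms]));
  return current.map(s => {
    const b = base.get(s.stage) ?? s.p95Ms; // missing baseline: treat as unchanged
    const deltaMs = s.p95Ms - b;
    const deltaPct = b === 0 ? 0 : (deltaMs / b) * 100;
    return {
      stage: s.stage,
      deltaMs,
      deltaPct,
      regressed: deltaPct >= regressionThresholdPct,
    };
  });
}
```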
Step 4: Offenders ("What's Causing It?")
Goal: Identify specific clusters, experiments, or query patterns driving the regression.
What to show:
- Top Clusters: Worst-performing backend clusters/shards
- Top Experiments: A/B test buckets with latency impact
- Top Queries: Privacy-safe query fingerprints (only if k-anonymity threshold met)
UX Patterns:
- Sortable tables with impact metrics ("this cluster is 2.5x slower")
- Click to filter traces by offender
- Show confidence intervals for experiments (statistical significance)
- Add privacy indicators ("query data anonymized")
For junior developers: This is like finding the "smoking gun." You've narrowed it down to a specific cluster or experiment that's causing the problem.
For senior architects: This is the "root cause identification" phase. We need efficient aggregation queries and privacy-safe exemplars.
Step 5: Trace Explorer ("How Do We Reproduce It?")
Goal: Inspect actual slow requests to understand what went wrong.
What to show:
- Trace Search: Filter by time, segment, stage, cluster
- Trace List: Paginated, sortable list of sampled traces
- Trace Detail: Waterfall view with span details (similar to Jaeger/Tempo UI)
- Trace Comparison: Side-by-side slow trace vs baseline trace (highlight differences)
UX Patterns:
- Use virtual scrolling for long trace lists (see List Virtualization)
- Add keyboard navigation (arrow keys, Enter to open)
- Show loading skeletons while traces load
For junior developers: This is like looking at a specific failed request in your API logs. You see exactly what happened, step-by-step.
For senior architects: This is the "reproduction" phase. We need efficient trace search and comparison.
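The side-by-side comparison can be driven by a per-stage diff that matches spans by name, sorted so the biggest regression lands on top (a simplification that assumes one span per stage; real traces with repeated span names would need aggregation first):

```typescript
// Minimal span shape for the comparison view.
interface Span {
  name: string;     // stage name, e.g. "retrieval"
  duration: number; // milliseconds
}

// Diff a slow trace against a baseline trace, stage by stage.
function diffTraces(slow: Span[], baseline: Span[]) {
  const base = new Map(baseline.map(s => [s.name, s.duration]));
  return slow
    .map(s => {
      const baselineMs = base.get(s.name) ?? 0; // stage absent in baseline
      return {
        stage: s.name,
        slowMs: s.duration,
        baselineMs,
        deltaMs: s.duration - baselineMs,
      };
    })
    .sort((a, b) => b.deltaMs - a.deltaMs); // biggest regression first
}
```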
Step 6: Actions ("How Do We Fix It?")
Goal: Provide one-click mitigation actions.
What to show:
- Rollback Button: Rollback experiment or config change
- Reroute Button: Reroute traffic away from problematic cluster
- Cache Tune Button: Adjust cache policies (TTL, invalidation)
- Create Incident Button: Pre-fill incident ticket with context
UX Patterns:
- Show confirmation dialogs for destructive actions
- Show action status ("Rolling back...", "Rollback complete")
- Use toast notifications for success/error
For junior developers: This is the "fix it" step. You've identified the problem, now you take action.
For senior architects: This is the "mitigation" phase. We need idempotent actions, audit logs, and rollback capabilities.
Step 7: Verify ("Did It Work?")
Goal: Confirm that the mitigation resolved the issue.
What to show:
- Recovery Indicator: Show latency returning to baseline
- Before/After Comparison: Side-by-side metrics
- Documentation: Link to incident postmortem
UX Patterns:
- Real-time updates (WebSocket or polling)
- Progress indicators ("Latency improving...")
- Success states ("Issue resolved")
UX Principles
Progressive Disclosure
Don't show everything at once. Start with overview, then drill down.
Context Preservation
When drilling down, keep the context visible (breadcrumbs, "back" button).
Fast Feedback
Show loading states, use optimistic updates, cache frequent queries.
Error Handling
Handle missing data gracefully, show warnings for delayed metrics.
Accessibility
Full keyboard navigation, screen reader support, color-blind friendly.
Let's define the APIs that the dashboard needs. These are the contracts between the frontend and backend.
API Design Principles
RESTful Design
Use standard HTTP methods and status codes.
Query Parameters for Filtering
Use query parameters for filters (not request body for GET requests).
Pagination
Use cursor-based pagination for large datasets (more efficient than offset-based).
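A sketch of an opaque cursor for this: encode the sort position of the last returned row, so the next page resumes strictly after it regardless of inserts (field names are illustrative):

```typescript
// Sort position of the last row on the current page; the next query
// resumes after (lastLatencyMs, lastTraceId).
interface TraceCursor {
  lastLatencyMs: number;
  lastTraceId: string; // tie-breaker for rows with equal latency
}

function encodeCursor(c: TraceCursor): string {
  return Buffer.from(JSON.stringify(c)).toString("base64url");
}

function decodeCursor(s: string): TraceCursor {
  return JSON.parse(Buffer.from(s, "base64url").toString("utf8"));
}
```

Unlike `OFFSET`, this stays O(page size) on the server (a range scan from the cursor position) and doesn't skip or duplicate rows when new traces arrive between page loads.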
Caching
Return appropriate cache headers (Cache-Control, ETag) for cacheable responses.
Error Handling
Return consistent error format with proper HTTP status codes.
Core APIs
1. Latency Overview
Endpoint: GET /api/latency/overview
Purpose: Get high-level SLO metrics for the overview page.
Query Parameters:
- from (required): ISO 8601 timestamp (e.g., 2024-01-15T10:00:00Z)
- to (required): ISO 8601 timestamp
- interval (optional): Aggregation interval (1m, 5m, 1h, 1d)
Response:
interface LatencyOverviewResponse {
timeRange: { from: string; to: string; };
intervals: Array<{
timestamp: string;
p50: number; // milliseconds
p90: number;
p95: number;
p99: number;
errorRate: number; // 0-1 (e.g., 0.01 = 1%)
timeoutRate: number; // 0-1
}>;
currentSLO: {
p95Target: number; // SLO target in ms
p95Current: number; // Current p95 latency
status: "healthy" | "warning" | "violation";
};
}
Caching: Cache for 30-60 seconds (metrics update frequently but not in real-time).
2. Latency Breakdown
Endpoint: GET /api/latency/breakdown
Purpose: Get latency breakdown by dimensions (region, device, query class, etc.).
Query Parameters:
- from, to, interval (same as overview)
- dims (required): Comma-separated dimensions (e.g., region,device)
- filters (optional): JSON-encoded filters (e.g., {"region":"us-east","device":"mobile"})
Response:
interface LatencyBreakdownResponse {
dimensions: string[]; // e.g., ["region", "device"]
breakdowns: Array<{
values: Record<string, string>; // e.g., { region: "us-east", device: "mobile" }
p95: number;
p99: number;
requestCount: number;
contribution: number; // 0-1, contribution to overall latency
}>;
}
Caching: Cache for 60 seconds (drilldowns are less frequent than overview).
3. Stage Breakdown
Endpoint: GET /api/latency/stages
Purpose: Get latency decomposition by stage (edge, retrieval, ranking, ads, render).
Query Parameters:
- from, to, interval, filters (same as above)
Response:
interface StageBreakdownResponse {
intervals: Array<{
timestamp: string;
stages: {
edge: { p50: number; p95: number; p99: number; };
retrieval: { p50: number; p95: number; p99: number; };
ranking: { p50: number; p95: number; p99: number; };
ads: { p50: number; p95: number; p99: number; };
render: { p50: number; p95: number; p99: number; };
};
total: { p50: number; p95: number; p99: number; };
}>;
baseline?: { // Optional: compare against baseline period
stages: { /* same structure */ };
total: { /* same structure */ };
};
deltas?: Array<{ // Optional: show deltas vs baseline
stage: string;
p95Delta: number; // milliseconds increase
p95DeltaPercent: number; // percentage increase
}>;
}
4. Top Offenders
Endpoint: GET /api/offenders
Purpose: Get top contributors to latency regression (clusters, experiments, query patterns).
Query Parameters:
- from, to, filters (same as above)
- groupBy (required): cluster | experiment | query_fingerprint
- limit (optional): Number of results (default: 10)
Response:
interface OffendersResponse {
groupBy: string;
offenders: Array<{
key: string; // e.g., cluster name, experiment bucket, query fingerprint
p95: number;
p95Baseline: number; // For comparison
p95Delta: number; // Increase in ms
p95DeltaPercent: number; // Percentage increase
requestCount: number;
contribution: number; // 0-1, contribution to overall regression
}>;
}
5. Trace Search
Endpoint: GET /api/traces/search
Purpose: Search for sampled traces matching filters.
Query Parameters:
- from, to, filters (same as above)
- limit (optional): Number of results (default: 50, max: 500)
- cursor (optional): Pagination cursor
- sortBy (optional): latency | timestamp (default: latency)
- order (optional): asc | desc (default: desc)
Response:
interface TraceSearchResponse {
traces: Array<{
traceId: string;
requestId: string;
timestamp: string;
totalLatency: number;
tags: Record<string, string>; // region, device, cluster, etc.
}>;
nextCursor?: string; // For pagination
hasMore: boolean;
}
6. Trace Detail
Endpoint: GET /api/traces/:traceId
Purpose: Get full trace details (spans, timing, tags).
Response:
interface Span {
  spanId: string;
  name: string; // "edge", "retrieval", "ranking", etc.
  startTime: number; // Relative to trace start (ms)
  duration: number; // milliseconds
  tags: Record<string, string>;
  children?: Span[]; // Nested spans
}

interface TraceDetailResponse {
  traceId: string;
  requestId: string;
  timestamp: string;
  totalLatency: number;
  tags: Record<string, string>;
  spans: Span[];
}
7. Trace Comparison
Endpoint: GET /api/traces/compare
Purpose: Compare two traces side-by-side (slow vs baseline).
Query Parameters:
- slowTraceId (required): Trace ID of slow request
- baselineTraceId (required): Trace ID of baseline (normal) request
Response:
interface TraceComparisonResponse {
slow: TraceDetailResponse;
baseline: TraceDetailResponse;
differences: Array<{
stage: string;
slowDuration: number;
baselineDuration: number;
delta: number; // milliseconds
deltaPercent: number; // percentage
}>;
}
8. Changes (Deploys, Configs, Experiments)
Endpoint: GET /api/changes
Purpose: Get timeline of changes (deploys, config pushes, experiments, incidents).
Query Parameters:
- from, to (same as above)
- filters (optional): Filter by type (deploy, config, experiment, incident)
Response:
interface ChangesResponse {
changes: Array<{
id: string;
type: "deploy" | "config" | "experiment" | "incident";
timestamp: string;
description: string;
metadata: Record<string, unknown>; // e.g., { version: "v1.2.3", service: "search-api" }
}>;
}
Error Handling
All APIs should return consistent error format:
interface ErrorResponse {
error: {
code: string; // e.g., "INVALID_TIME_RANGE", "TRACE_NOT_FOUND"
message: string;
details?: Record<string, unknown>;
};
}
HTTP Status Codes:
- 200 OK: Success
- 400 Bad Request: Invalid parameters
- 401 Unauthorized: Missing or invalid auth
- 403 Forbidden: Insufficient permissions
- 404 Not Found: Resource not found
- 429 Too Many Requests: Rate limited
- 500 Internal Server Error: Server error
Rate Limiting
- Overview/Breakdown APIs: 100 requests/minute per user
- Trace APIs: 50 requests/minute per user (more expensive)
- Changes API: 200 requests/minute per user
Return 429 Too Many Requests with Retry-After header when rate limited.
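A minimal sketch of a fixed-window limiter matching these per-user limits, with an injectable clock for testability; a production BFF would keep the counters in a shared store (e.g. Redis) so all instances agree:

```typescript
// Illustrative fixed-window rate limiter, one window per user.
class RateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private limit: number,            // e.g. 50 for the trace APIs
    private windowMs: number = 60_000,
    private now: () => number = Date.now
  ) {}

  // true = serve the request, false = respond 429 with Retry-After.
  allow(userId: string): boolean {
    const t = this.now();
    const w = this.windows.get(userId);
    if (!w || t - w.windowStart >= this.windowMs) {
      this.windows.set(userId, { windowStart: t, count: 1 }); // new window
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count++;
    return true;
  }
}
```

Fixed windows allow up to 2x the limit across a window boundary; a sliding window or token bucket smooths that out at the cost of a little more bookkeeping.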
Caching Strategy
- Overview: 30-60 seconds (frequently accessed, but metrics update often)
- Breakdown: 60 seconds (less frequent, but still needs freshness)
- Stages: 60 seconds
- Offenders: 120 seconds (changes less frequently)
- Traces: No cache (always fresh, user-specific)
- Changes: 300 seconds (changes infrequently)
Use Cache-Control headers and ETag for conditional requests.
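A sketch of how the BFF might derive an ETag and answer conditional requests (the hashing scheme is illustrative; any stable digest of the response body works):

```typescript
import { createHash } from "node:crypto";

// Stable ETag for a JSON-serializable response body.
function etagFor(body: unknown): string {
  const digest = createHash("sha256").update(JSON.stringify(body)).digest("hex");
  return '"' + digest.slice(0, 16) + '"'; // quoted, per the ETag header syntax
}

// If the client's If-None-Match matches the current ETag, the BFF can
// reply 304 Not Modified with an empty body.
function isNotModified(ifNoneMatch: string | undefined, currentEtag: string): boolean {
  return ifNoneMatch !== undefined && ifNoneMatch === currentEtag;
}
```

This pairs naturally with the short TTLs above: once the Cache-Control max-age lapses, the client revalidates with If-None-Match and usually pays only a 304 instead of re-downloading the full chart payload.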
Every design decision has trade-offs. Let's discuss the key ones.
Real-Time vs Batch Processing
Real-Time (Streaming)
- ✅ Faster detection (seconds, not minutes)
- ✅ Better for critical alerts
- ❌ More complex infrastructure (Kafka, Pub/Sub, stream processing)
- ❌ Higher operational cost
Batch Processing
- ✅ Simpler to build and maintain
- ✅ Lower operational cost
- ✅ Easier to debug and reprocess data
- ❌ Delayed detection (5-15 minutes)
Recommendation
Start with batch processing (5-minute intervals). Add real-time if needed for critical alerts.
Storage: Aggregates vs Raw Data
Pre-Aggregated Metrics Only
- ✅ Fast queries (< 100ms)
- ✅ Low storage cost
- ❌ Lose detail (can't drill down to individual requests)
- ❌ Can't answer "show me the slowest request"
Raw Data Only
- ✅ Full detail (can answer any question)
- ❌ Slow queries (seconds to minutes)
- ❌ High storage cost
Two-Store Model (Recommended)
- ✅ Fast queries for charts (aggregates)
- ✅ Deep debugging capability (sampled traces)
- ✅ Balanced cost and performance
- ❌ More complexity (two systems to maintain)
Recommendation
Use two-store model. It's the industry standard (Prometheus + Tempo, Datadog, New Relic all use this).
Sampling Strategy
Head-Based Sampling (Sample First N Requests)
- ✅ Simple to implement
- ✅ Representative of overall traffic
- ❌ May miss tail latency (the slow requests we care about)
Tail-Based Sampling (Sample Slow Requests)
- ✅ Captures the requests we care about most (slow ones)
- ✅ Better for debugging
- ❌ More complex (need to determine "slow" threshold)
- ❌ May miss edge cases if threshold is wrong
Hybrid Sampling (Recommended)
- Sample 100% of slow requests (p95+)
- Sample 1% of normal requests (for baseline comparison)
- Sample 10% of medium requests (p50-p95)
Recommendation
Use tail-based sampling with a hybrid approach. It's more complex but provides better debugging capability.
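The hybrid policy above boils down to one decision function at trace-emit time. A minimal sketch, assuming illustrative names and that rolling p50/p95 thresholds are available to the sampler:

```typescript
// Hybrid sampling decision: keep all slow traces, some medium, few fast.
// Thresholds and rates mirror the text; names are illustrative assumptions.
interface SamplingConfig {
  p50Ms: number;      // current rolling p50
  p95Ms: number;      // current rolling p95
  slowRate: number;   // 1.0  → 100% of p95+ requests
  mediumRate: number; // 0.10 → 10% of p50–p95 requests
  baseRate: number;   // 0.01 → 1% of the rest (baseline comparison)
}

function shouldSampleTrace(
  latencyMs: number,
  cfg: SamplingConfig,
  rand: () => number = Math.random, // injectable for deterministic tests
): boolean {
  if (latencyMs >= cfg.p95Ms) return rand() < cfg.slowRate;
  if (latencyMs >= cfg.p50Ms) return rand() < cfg.mediumRate;
  return rand() < cfg.baseRate;
}
```

Note this decision has to happen after the request completes (you need the latency), which is why tail-based sampling requires buffering spans until the trace finishes.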
Cardinality Control
Unbounded Dimensions (Use Raw Query Strings)
- ✅ Full flexibility (can drill down to any query)
- ❌ Explodes storage (millions of unique values)
- ❌ Slow queries (can't index efficiently)
- ❌ Privacy concerns (raw queries may contain PII)
Bounded Dimensions (Use Query Classes, Hashed Fingerprints)
- ✅ Controlled storage growth
- ✅ Fast queries (limited unique values)
- ✅ Privacy-safe (hashed fingerprints)
- ❌ Less flexibility (can't drill down to exact query)
Recommendation
Use bounded dimensions with hashed fingerprints. Provide "query exemplars" (top queries) only if the k-anonymity threshold is met.
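Hashed fingerprints can be produced at ingestion time: normalize the raw query, hash it, and keep only privacy-safe metadata. A sketch, assuming the normalization rules shown (lowercase, collapsed whitespace) — real systems would also strip or class PII-bearing tokens:

```typescript
import { createHash } from "node:crypto";

// Privacy-safe query fingerprint: normalize, hash, keep only safe metadata.
// The normalization rules are illustrative assumptions.
interface QueryFingerprint {
  fingerprint: string; // SHA-256 hex of the normalized query
  length: number;      // safe metadata: character count
  tokenCount: number;  // safe metadata: whitespace-delimited tokens
}

function fingerprintQuery(rawQuery: string): QueryFingerprint {
  const normalized = rawQuery.trim().toLowerCase().replace(/\s+/g, " ");
  return {
    fingerprint: createHash("sha256").update(normalized, "utf8").digest("hex"),
    length: normalized.length,
    tokenCount: normalized === "" ? 0 : normalized.split(" ").length,
  };
}
```

The raw query string never reaches the metrics store; only the fingerprint and safe metadata are indexed, which bounds cardinality and removes PII from the hot path.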
BFF Pattern vs Direct API Calls
Direct API Calls (Frontend Calls Backend Directly)
- ✅ Simpler architecture (one less service)
- ✅ Lower latency (one less hop)
- ❌ Frontend makes N+1 queries
- ❌ Frontend handles auth, caching, rate limiting
- ❌ Tight coupling (frontend breaks when backend changes)
BFF Pattern (Backend-for-Frontend)
- ✅ Single API contract (frontend doesn't break)
- ✅ Composes data from multiple sources
- ✅ Handles auth, caching, rate limiting
- ✅ Better separation of concerns
- ❌ Additional service to maintain
- ❌ Slightly higher latency (one more hop)
Recommendation
Use BFF pattern for complex dashboards. The benefits outweigh the costs.
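The core of the BFF is composition: one frontend round trip fans out to several backend services in parallel and returns a single payload. A sketch with injected service clients so the shape is testable; all endpoint and field names here are assumptions, not the original design's API:

```typescript
// Sketch of a BFF "overview" handler composing data from multiple backends.
// Clients are injected for testability; names are illustrative assumptions.
interface OverviewResponse {
  percentiles: { p50: number; p95: number; p99: number };
  errorRatePct: number;
  activeAlerts: string[];
}

interface BackendClients {
  fetchPercentiles(): Promise<{ p50: number; p95: number; p99: number }>;
  fetchErrorRate(): Promise<number>;
  fetchAlerts(): Promise<string[]>;
}

async function getOverview(clients: BackendClients): Promise<OverviewResponse> {
  // One round trip for the frontend; the BFF fans out in parallel,
  // so total latency is the max of the three calls, not the sum.
  const [percentiles, errorRatePct, activeAlerts] = await Promise.all([
    clients.fetchPercentiles(),
    clients.fetchErrorRate(),
    clients.fetchAlerts(),
  ]);
  return { percentiles, errorRatePct, activeAlerts };
}
```

Auth, caching, and rate limiting wrap this handler once in the BFF, instead of being reimplemented in the frontend for each backend.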
Chart Rendering: Canvas vs SVG
SVG Charts
- ✅ Scalable (vector graphics)
- ✅ Easy to style with CSS
- ✅ Accessible (can add ARIA labels)
- ❌ Slow for large datasets (thousands of points)
Canvas Charts
- ✅ Fast for large datasets (raster graphics)
- ✅ Lower memory usage
- ❌ Less accessible (harder to add ARIA labels)
- ❌ Harder to style
Recommendation
Use canvas for large time series (performance), SVG for small charts (accessibility).
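Canvas handles large series, but you still shouldn't draw every raw point. A common trick is to downsample to roughly the pixel width while keeping the min and max of each bucket, so latency spikes stay visible. A sketch under assumed names (`Point`, `downsampleMinMax`):

```typescript
// Downsample a dense time series before canvas rendering, preserving
// extremes per bucket so spikes survive. Names are illustrative assumptions.
interface Point {
  t: number; // timestamp
  v: number; // value (e.g. latency ms)
}

function downsampleMinMax(points: Point[], targetBuckets: number): Point[] {
  if (points.length <= targetBuckets * 2) return points; // already small enough
  const out: Point[] = [];
  const bucketSize = Math.ceil(points.length / targetBuckets);
  for (let i = 0; i < points.length; i += bucketSize) {
    const bucket = points.slice(i, i + bucketSize);
    let min = bucket[0];
    let max = bucket[0];
    for (const p of bucket) {
      if (p.v < min.v) min = p;
      if (p.v > max.v) max = p;
    }
    // Emit in time order so the polyline doesn't zig-zag backwards.
    out.push(...(min.t <= max.t ? [min, max] : [max, min]));
  }
  return out;
}
```

A week of 5-second samples (~120k points) reduces to a few hundred points, which either renderer can draw smoothly — but the spike you're debugging is never averaged away.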
Caching Strategy
No Caching
- ✅ Always fresh data
- ❌ Slow queries (every request hits database)
- ❌ High database load
Aggressive Caching (Long TTL)
- ✅ Fast queries (served from cache)
- ✅ Low database load
- ❌ Stale data (may miss recent changes)
Smart Caching (Recommended)
- Cache overview/breakdown for 30-60 seconds (frequently accessed, but needs freshness)
- Cache offenders for 120 seconds (changes less frequently)
- Don't cache traces (always fresh, user-specific)
- Use `ETag` for conditional requests
Recommendation
Use smart caching with appropriate TTLs. Balance freshness vs performance.
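The TTL table plus ETag handling can be sketched as one small function in the BFF that decides the status and headers per endpoint. The TTLs follow the text; the response shape and helper names are assumptions:

```typescript
import { createHash } from "node:crypto";

// Smart caching sketch: per-endpoint TTLs + ETag conditional requests.
// Response shape and helper names are illustrative assumptions.
const TTL_SECONDS: Record<string, number> = {
  overview: 60,   // frequently accessed, needs freshness
  breakdown: 60,
  offenders: 120, // changes less frequently
  changes: 300,   // changes infrequently
  traces: 0,      // never cached: always fresh, user-specific
};

function etagFor(body: string): string {
  return `"${createHash("sha256").update(body).digest("hex").slice(0, 16)}"`;
}

function cacheResponse(
  endpoint: string,
  body: string,
  ifNoneMatch?: string,
): { status: number; headers: Record<string, string> } {
  const ttl = TTL_SECONDS[endpoint] ?? 0;
  const etag = etagFor(body);
  const headers: Record<string, string> = {
    ETag: etag,
    "Cache-Control": ttl > 0 ? `private, max-age=${ttl}` : "no-store",
  };
  // Conditional request: unchanged body → 304 Not Modified, no payload re-sent.
  if (ifNoneMatch === etag) return { status: 304, headers };
  return { status: 200, headers };
}
```

Even after the TTL expires, an unchanged dashboard payload costs only a 304 round trip rather than a full re-download.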
Privacy: K-Anonymity Threshold
No Threshold (Show All Queries)
- ✅ Full visibility
- ❌ Privacy risk (may expose rare queries that identify users)
High Threshold (Show Only Common Queries)
- ✅ Privacy-safe
- ❌ Less useful (may hide important edge cases)
Balanced Threshold (Recommended)
- Show query exemplars only if they appear in ≥10 requests (k=10)
- Hash query fingerprints (SHA-256)
- Show safe metadata (length, token count, language)
Recommendation
Use k=10 threshold. It's a good balance between privacy and usefulness.
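Enforcing k-anonymity on exemplars is then a one-line filter over the aggregated fingerprints: suppress anything seen fewer than k times before it ever reaches the UI. A sketch, assuming an illustrative `Exemplar` aggregate shape:

```typescript
// Surface a query exemplar only when ≥ k requests share its fingerprint
// (k = 10, per the text). The Exemplar shape is an illustrative assumption.
const K_THRESHOLD = 10;

interface Exemplar {
  fingerprint: string; // hashed, never the raw query
  count: number;       // requests sharing this fingerprint in the window
  avgLatencyMs: number;
}

function safeExemplars(exemplars: Exemplar[], k = K_THRESHOLD): Exemplar[] {
  return exemplars
    .filter((e) => e.count >= k)        // suppress rare, re-identifiable queries
    .sort((a, b) => b.avgLatencyMs - a.avgLatencyMs); // worst offenders first
}
```

The filter runs server-side (in the BFF or aggregation layer), so suppressed exemplars never leave the backend.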
Summary: Key Trade-offs
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Processing | Real-time | Batch | Start with batch, add real-time if needed |
| Storage | Aggregates only | Two-store | Two-store (aggregates + traces) |
| Sampling | Head-based | Tail-based | Tail-based (hybrid) |
| Dimensions | Unbounded | Bounded | Bounded (with hashed fingerprints) |
| Architecture | Direct API | BFF | BFF (for complex dashboards) |
| Charts | SVG | Canvas | Canvas (large), SVG (small) |
| Caching | None | Aggressive | Smart (30-120s TTL) |
| Privacy | No threshold | High threshold | Balanced (k=10) |
Let's walk through a real incident scenario to see how this dashboard helps.
Scenario: Latency Spike at 2:00 PM
Step 1: Detection (2:00 PM)
Engineer opens dashboard, sees:
- p95 latency: 800ms (baseline: 200ms) ← 4x increase!
- Error rate: 2% (baseline: 0.1%)
- Alert banner: "SLO violation detected"
Action: Engineer confirms this is a real regression (not a blip).
Step 2: Segmentation (2:01 PM)
Engineer clicks "Segment Drilldown", sees:
- Top contributor: Region `us-east` contributes 60% of the latency increase
- Device: Mobile devices are 3x slower than desktop
- Query class: Web search is affected, image search is fine
Action: Engineer filters by `region=us-east, device=mobile, queryClass=web`.
Step 3: Stage Breakdown (2:02 PM)
Engineer views stage breakdown, sees:
- Edge: 50ms (normal)
- Retrieval: 600ms ← This is the problem! (baseline: 150ms)
- Ranking: 100ms (normal)
- Ads: 80ms (normal)
- Render: 50ms (normal)
Action: Engineer identifies that document retrieval is the regressing stage.
Step 4: Correlation (2:03 PM)
Engineer views change timeline, sees:
- 1:45 PM: Deploy `search-api v1.2.3` to `us-east` cluster
- 1:50 PM: Config push (cache TTL increased)
- 2:00 PM: Latency spike starts
Action: Engineer suspects the deploy or config change caused the regression.
Step 5: Offenders (2:04 PM)
Engineer views top offenders, sees:
- Cluster `us-east-1` is 4x slower than baseline
- Experiment `ranking-v2` is 2x slower (but only 10% of traffic)
Action: Engineer focuses on us-east-1 cluster.
Step 6: Trace Inspection (2:05 PM)
Engineer searches traces:
- Filters: `region=us-east, device=mobile, cluster=us-east-1, stage=retrieval`
- Finds slow trace: `traceId=abc123`, latency: 800ms
- Compares with baseline trace: `traceId=def456`, latency: 150ms
Trace comparison shows:
- Slow trace: Retrieval took 600ms (database query timeout, retried 3 times)
- Baseline trace: Retrieval took 150ms (normal database query)
Action: Engineer identifies root cause: database query timeout in us-east-1 cluster.
Step 7: Mitigation (2:06 PM)
Engineer takes action:
- Clicks "Reroute Traffic" → reroutes 50% of traffic away from `us-east-1`
- Clicks "Create Incident" → creates a ticket with context:
  - Affected: `us-east`, mobile, web search
  - Root cause: Database query timeout in `us-east-1`
  - Suspect change: Deploy `search-api v1.2.3`
  - Sample traces: `abc123` (slow), `def456` (baseline)
Step 8: Verification (2:10 PM)
Engineer monitors recovery:
- p95 latency: 250ms (improving, but not back to baseline)
- Rerouting helped, but root cause still needs fixing
- Engineer escalates to database team
Step 9: Resolution (2:30 PM)
Database team fixes issue:
- Root cause: Slow query in `search-api v1.2.3` (missing index)
- Fix: Roll back to `v1.2.2` or add the missing index
- Latency returns to baseline: 200ms
Key Takeaways from This Playbook
- Time to Detection: 1 minute (dashboard shows problem immediately)
- Time to Root Cause: 5 minutes (guided workflow finds the issue)
- Time to Mitigation: 6 minutes (one-click actions)
- Total Time to Resolution: 30 minutes (including fix)
Without this dashboard: Engineers would spend 30-60 minutes manually digging through logs, traces, and metrics. This dashboard reduces that to 5-10 minutes.
Common Patterns
Pattern 1: Experiment Rollout
Symptoms: Latency spike in specific experiment bucket
Action: Rollback experiment
Time: 2-3 minutes
Pattern 2: Config Change
Symptoms: Latency spike after config push
Action: Revert config change
Time: 2-3 minutes
Pattern 3: Cluster Degradation
Symptoms: Latency spike in specific cluster
Action: Reroute traffic away from cluster
Time: 1-2 minutes
Pattern 4: Cache Miss Storm
Symptoms: High cache miss rate, retrieval stage slow
Action: Adjust cache TTL or warm cache
Time: 5-10 minutes
Best Practices
- Start with overview: Don't jump to details without understanding the big picture
- Use filters: Narrow down to affected segments before drilling down
- Correlate with changes: Always check the change timeline
- Compare traces: Slow vs baseline traces reveal the root cause
- Document incidents: Create tickets with full context for postmortem
- Verify fixes: Monitor recovery after taking action
Core Principles
- A latency dashboard is a debugging workflow, not just charts.
- Charts show what's wrong, workflows help you fix it.
- Design for actionability, not just visibility.
- Latency decomposition by stage is mandatory for actionability.
- You can't fix what you can't measure.
- Stage breakdown (waterfall) reveals where time is spent.
- Control cardinality aggressively; use sampling + privacy-safe exemplars.
- Unbounded dimensions explode storage and slow queries.
- Use bounded dimensions, hashed fingerprints, and k-anonymity.
- Correlate spikes with deploys/config/experiments and provide trace diffing.
- Most latency spikes are caused by changes.
- Change timeline + trace comparison = faster root cause identification.
- Optimize dashboard performance via rollups, caching, and paginated drilldowns.
- Pre-aggregate metrics for fast queries.
- Cache frequent queries, paginate large datasets.
- Use two-store model (fast aggregates + sampled traces).
For Junior Developers
- Start simple: Build overview page first, then add drilldowns.
- Learn the patterns: Two-store model, BFF pattern, tail-based sampling.
- Focus on UX: Make it easy to use, not just functional.
- Handle edge cases: Missing data, delayed metrics, errors.
For Senior Architects
- Design for scale: Cardinality control, efficient queries, smart caching.
- Balance trade-offs: Performance vs cost, real-time vs batch, detail vs speed.
- Privacy first: Hash queries, k-anonymity, audit logs.
- Observability: Monitor the dashboard itself (p95 latency, error rate).
Further Reading
- Core Web Vitals: Understanding p50, p90, p95, p99 and performance metrics
- List Virtualization: Efficiently render large lists
- Data Table: Building sortable, paginated tables
- Rendering Strategies: CSR vs SSR vs SSG vs ISR
Key Takeaways
- ✓ A useful latency dashboard is a debugging workflow, not just charts. Design for actionability, not just visibility.
- ✓ Latency decomposition by stage is mandatory for actionability. You can't fix what you can't measure.
- ✓ Control cardinality aggressively; use sampling + privacy-safe exemplars. Unbounded dimensions explode storage.
- ✓ Correlate spikes with deploys/config/experiments and provide trace diffing. Most latency spikes are caused by changes.
- ✓ Optimize dashboard performance via rollups, caching, and paginated drilldowns. Use the two-store model (fast aggregates + sampled traces).