May 22, 2026
How to Debug Flaky Playwright Tests in CI Without Guesswork
A practical workflow to debug flaky Playwright tests in CI and isolate timing issues, selector instability, environment differences, retry behavior, and data dependencies.
Flaky Playwright tests in CI are frustrating because they fail for reasons that are often real, but not obvious. A test may pass locally, fail only in GitHub Actions, disappear after a retry, and leave behind a screenshot that does not explain much. The result is a debugging loop built on guesses, and guesses are expensive when the same suite runs on every pull request.
The good news is that most flaky CI failures follow a small number of patterns. They are usually caused by timing, selector instability, environment differences, shared test data, or state leakage between tests. If you debug them systematically, you can turn a random failure into a reproducible failure, then into a fix.
This guide focuses on a practical workflow for teams that use Playwright in continuous integration, especially when the failure only appears in CI and not on a developer laptop. The goal is not to eliminate all test flakiness overnight, but to isolate the source quickly and make the next failure cheaper to diagnose.
Start with the right mental model
A flaky test is not just a test that fails intermittently. It is a test whose outcome depends on something the test does not fully control or observe. In CI, that hidden dependency can be anything from machine load to network timing to a browser state carried over from a previous test.
That matters because the fix depends on the cause:
- A timing problem needs better synchronization, not more retries.
- A selector problem needs more stable locators, not longer timeouts.
- An environment problem needs parity between local and CI, not another waitForTimeout.
- A data problem needs isolation, not a rerun.
Retries can hide noise, but they do not explain it. Treat a retry as a clue, not a solution.
Before changing code, decide whether you are trying to reproduce the issue, observe the failure more clearly, or narrow the cause. Those are different tasks.
Step 1, make the failure observable
The first job is to capture enough evidence that the failure is actionable. A vague “test failed in CI” is not enough. You want the exact test name, the action that failed, the browser, the CI job, and the artifacts that show the page state.
Turn on the right artifacts
Playwright already gives you several useful debugging signals:
- traces
- screenshots
- video
- console logs
- network logs
For flaky CI failures, traces are usually the most valuable artifact because they combine actions, DOM snapshots, network activity, and timing in one place.
A typical Playwright config for CI might enable traces on first retry:
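import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry only in CI so local failures surface immediately.
  retries: process.env.CI ? 2 : 0,
  use: {
    // Record a trace when a test is retried for the first time.
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});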
If a test fails only in CI, inspect the trace before changing the test. The trace often shows whether the failure came from the app, the test, or the environment.
Add targeted logging, not noise
Avoid dumping large logs on every step. Instead, log only the values that answer questions you actually have:
- Which selector was used?
- Which URL was loaded?
- Which API response did the test wait for?
- Which test data record was created?
If the test depends on a backend call, log the request identifier or response status. If it depends on generated data, log the fixture or seed.
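One way to keep logging targeted is to record only the API responses that failed. The following is a minimal sketch, assuming the app serves its API under a hypothetical `/api/` prefix. `ResponseLike` and `describeFailedResponse` are illustrative names, and the interface mirrors only the `url()` and `status()` methods that Playwright's `Response` object provides.

```typescript
// Minimal shape of the response methods this helper reads.
// Playwright's Response object provides url() and status().
interface ResponseLike {
  url(): string;
  status(): number;
}

// Returns a one-line log entry for failed API calls and null for
// everything else, so the CI log stays small.
// The '/api/' prefix is an assumption about the app under test.
function describeFailedResponse(
  response: ResponseLike,
  apiPrefix: string = '/api/',
): string | null {
  const status = response.status();
  if (!response.url().includes(apiPrefix) || status < 400) {
    return null;
  }
  return `[api-failure] ${status} ${response.url()}`;
}

// Possible wiring inside a test (not executed here):
//   page.on('response', (response) => {
//     const line = describeFailedResponse(response);
//     if (line) console.log(line);
//   });
```

Keeping the filter in a small pure function makes it easy to adjust the prefix or status threshold without touching individual tests.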
Step 2, classify the failure before changing code
Once you have artifacts, classify the failure into one of four buckets.
1. Timing or synchronization
These tests usually fail because the test moves faster than the app. Common symptoms include:
- TimeoutError on click, fill, expect, or navigation
- assertion passes locally but fails in CI
- element exists in the DOM but is not yet visible or enabled
- intermittent failure after navigation or modal opening
This is often caused by waiting for the wrong thing. For example, waiting for a DOM node to appear is not enough if the app still renders a skeleton state, blocks input, or updates in the next animation frame.
2. Selector instability
These failures happen when the locator depends on text or structure that changes too often. Common symptoms include:
- locator finds multiple elements
- test clicks the wrong element after a layout change
- failure starts after a small UI redesign
- works in one locale but not another
3. Environment differences
The test passes locally and fails in CI because the runtime is not the same. Common causes:
- smaller CPU or memory allocation in CI
- different browser version or channel
- headless rendering differences
- missing fonts, locale, or timezone settings
- network latency or blocked third-party resources
4. Data or state dependence
These tests fail because they rely on previous tests, shared records, or mutable backend state. Common symptoms include:
- failure depends on test order
- rerun passes after a failed first run
- record already exists, duplicate key, or stale session issues
- one test creates state that another test assumes is absent
Step 3, reproduce under CI-like conditions locally
If the failure only happens in CI, try to make your local environment more like CI instead of the other way around. This is often faster than staring at a green local run and a red pipeline.
Match browser and runtime versions
Use the same Playwright version, browser channels, Node version, and OS container image as the CI job when possible. Small version differences can affect timing, layout, or browser behavior.
If your CI uses a Docker image, run that image locally. If your pipeline uses Ubuntu, do not debug a Linux-only issue from macOS and assume equivalence.
Reduce local performance advantage
Your laptop is often faster than CI. That matters because a test that accidentally depends on speed may pass locally and fail in CI.
You can simulate this by:
- running the suite in headless mode
- limiting local CPU resources with containers
- disabling debug-only waits
- running with the same viewport and locale as CI
If the issue disappears when the machine is faster, that points toward synchronization or state timing.
Use the exact failing seed or data set
If the test generates random data, capture the seed, payload, or request body from the failing run. Then rerun with that same input. Randomized data is useful, but only if you can replay it.
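Replayable random data can be sketched with a small seeded generator. The example below uses mulberry32, a common public-domain PRNG, as a stand-in for whatever generator a suite actually uses. `seededRandom`, `buildUser`, and the `TEST_SEED` environment variable are hypothetical names chosen for illustration.

```typescript
// mulberry32: a small seeded pseudo-random generator.
// The same seed always yields the same sequence, which is what
// makes a failing run replayable.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical test-data builder driven entirely by the seed.
function buildUser(seed: number): { email: string; age: number } {
  const random = seededRandom(seed);
  const id = Math.floor(random() * 1000000);
  const age = 18 + Math.floor(random() * 60);
  return { email: `user-${id}@example.test`, age };
}

// Possible usage in a test (not executed here). Log the seed with
// the failure, then set TEST_SEED to replay the same data:
//   const seed = Number(process.env.TEST_SEED ?? Date.now());
//   console.log(`test data seed: ${seed}`);
//   const user = buildUser(seed);
```

The design choice that matters is that every random value flows from one logged number, so a single line in the CI output is enough to rebuild the failing input.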
Step 4, inspect the trace like a timeline
A Playwright trace is more than a debugging artifact. It is a timeline of what the browser saw and what the test did.
When reviewing the trace, ask these questions:
- Did the test click before the element was fully ready?
- Was the locator resolved to the expected element?
- Did the app navigate or re-render unexpectedly?
- Did a network request fail or take longer than expected?
- Was the page already in a bad state before the failing step?
A useful habit is to compare the last successful action to the first failing one. Often the problem is not at the exact line that failed, but one step earlier.
Common timing bug example
A test that opens a drawer and immediately clicks inside it may fail if the drawer animates in or if the input is disabled until hydration completes.
typescript
await page.getByRole('button', { name: 'Open settings' }).click();
await expect(page.getByRole('dialog')).toBeVisible();
await page.getByLabel('Display name').fill('Test User');
The important part is not the wait itself, but the condition. Waiting for the dialog to be visible is more meaningful than sleeping for a fixed number of milliseconds.
Step 5, fix synchronization with signals, not sleeps
The most common anti-pattern in flaky Playwright tests is adding arbitrary delay. It can make the failure rarer, but it also makes the suite slower and the underlying issue harder to see.
Prefer assertions that wait for state
Playwright's web-first assertions retry until the condition becomes true or the timeout expires. Use that behavior instead of manual polling where possible.
typescript
await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
await expect(page.getByTestId('save-button')).toBeEnabled();
Wait on network or application signals when needed
If your UI depends on a backend response, wait for the response or for the visible result of that response. The best choice depends on what the user actually experiences.
typescript
const responsePromise = page.waitForResponse(resp =>
  resp.url().includes('/api/profile') && resp.status() === 200
);
await page.getByRole('button', { name: 'Save' }).click();
await responsePromise;
This is better than a blind timeout because it ties the wait to an observable event.
Avoid over-waiting
A test can also become flaky when it waits for the wrong terminal state. For example, waiting for the network to be idle may be inappropriate on pages that keep background polling active. In that case, wait for the specific app state that proves readiness.
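Waiting for a specific app state can be wrapped in a small helper that names each readiness marker, so a timeout says which one was missing. This is a sketch under assumptions: `WaitableLocator` mirrors only the `waitFor` method of Playwright's `Locator`, and `waitForReadiness` plus the `orders-table` test ID are hypothetical.

```typescript
// Minimal shape of the locator method this helper calls.
// Playwright's Locator provides waitFor({ state, timeout }).
interface WaitableLocator {
  waitFor(options?: { state?: 'visible'; timeout?: number }): Promise<void>;
}

// Waits until every named marker is visible. On failure, the error
// names the marker that never appeared instead of a generic timeout.
async function waitForReadiness(
  markers: Record<string, WaitableLocator>,
  timeout: number = 10000,
): Promise<void> {
  await Promise.all(
    Object.entries(markers).map(async ([name, locator]) => {
      try {
        await locator.waitFor({ state: 'visible', timeout });
      } catch (error) {
        const reason = error instanceof Error ? error.message : String(error);
        throw new Error(
          `Readiness marker "${name}" not visible within ${timeout}ms: ${reason}`,
        );
      }
    }),
  );
}

// Possible usage in a test (not executed here):
//   await waitForReadiness({
//     heading: page.getByRole('heading', { name: 'Dashboard' }),
//     table: page.getByTestId('orders-table'),
//   });
```

Unlike a network-idle wait, this finishes as soon as the markers appear, even if background polling continues.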
Step 6, make selectors stable and human-readable
Selector instability is a major source of intermittent failures, especially after UI refactors.
Use roles and accessible names first
When possible, prefer locators that represent user intent instead of layout structure.
typescript
await page.getByRole('button', { name: 'Submit order' }).click();
await page.getByRole('textbox', { name: 'Email address' }).fill('qa@example.com');
These locators survive DOM restructuring better than brittle CSS chains or text that changes with copy updates.
Reserve test IDs for ambiguous cases
Sometimes several elements have the same role and label, or the app has repeated patterns. In those cases, a stable data-testid is acceptable. The goal is not to avoid test IDs entirely, but to use them when they improve clarity and stability.
Check for duplicate matches
If a locator matches more than one element, a test can become flaky as the DOM changes. Debug duplicates explicitly by narrowing the locator or by asserting on the count.
typescript
const rows = page.getByRole('row', { name: /invoice/i });
await expect(rows).toHaveCount(1);
This turns an ambiguous selector into a failure you can understand.
Step 7, isolate data dependencies and state leakage
Some flaky Playwright tests are not UI problems at all. They are test isolation problems.
Create and destroy your own test data
If one test depends on a record created by another test, it is fragile by design. Each test should set up the data it needs and clean up after itself when possible.
For API-backed test setups, create fixtures through the backend instead of clicking through the UI every time. This makes the test faster and reduces unrelated UI failure points.
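A create-then-clean-up pattern might look like the sketch below. It assumes a hypothetical REST endpoint that returns `{ id }` on create and accepts `DELETE` at `<collection>/<id>`. `ApiClientLike` mirrors only the `post` and `delete` methods that Playwright's `request` fixture exposes, and `withTestRecord` is an illustrative name.

```typescript
// Minimal shapes of the API client methods used below.
interface ApiResponseLike {
  ok(): boolean;
  status(): number;
  json(): Promise<unknown>;
}

interface ApiClientLike {
  post(url: string, options?: { data?: unknown }): Promise<ApiResponseLike>;
  delete(url: string): Promise<ApiResponseLike>;
}

// Creates a record through the backend, runs the test body with its
// id, and deletes the record afterwards even if the body throws.
async function withTestRecord<T>(
  api: ApiClientLike,
  collectionUrl: string,
  payload: Record<string, unknown>,
  run: (id: string) => Promise<T>,
): Promise<T> {
  const created = await api.post(collectionUrl, { data: payload });
  if (!created.ok()) {
    throw new Error(`Setup failed: ${created.status()} on POST ${collectionUrl}`);
  }
  // Assumption: the create response body contains an `id` field.
  const body = (await created.json()) as { id: string };
  try {
    return await run(body.id);
  } finally {
    await api.delete(`${collectionUrl}/${body.id}`);
  }
}

// Possible usage in a test (not executed here):
//   test('edits a profile', async ({ page, request }) => {
//     await withTestRecord(request, '/api/users', { name: 'Test User' },
//       async (id) => { await page.goto(`/users/${id}`); });
//   });
```

Putting cleanup in `finally` means a failed assertion does not leave a record behind for the next test to trip over.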
Use unique data per run
Shared usernames, emails, and tenant names are a common source of collisions in CI. Use a run-specific suffix, timestamp, or UUID.
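A run-specific suffix can be built from values the runner already knows. In this sketch, `RunIdentity`, `slugify`, and `uniqueEmail` are hypothetical helpers. The usage comment assumes GitHub Actions (which sets `GITHUB_RUN_ID`) and Playwright's `testInfo` object.

```typescript
// Values that together identify one attempt of one test in one run.
interface RunIdentity {
  runId: string;       // e.g. the CI run identifier
  workerIndex: number; // e.g. Playwright's testInfo.workerIndex
  retry: number;       // e.g. Playwright's testInfo.retry
  testTitle: string;
}

// Lowercases the text and collapses anything non-alphanumeric to "-".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Builds an email address that is unique per run, worker, retry, and test.
function uniqueEmail(prefix: string, identity: RunIdentity): string {
  const suffix = [
    identity.runId,
    `w${identity.workerIndex}`,
    `r${identity.retry}`,
    slugify(identity.testTitle),
  ].join('-');
  return `${prefix}+${suffix}@example.test`;
}

// Possible usage in a test (not executed here):
//   const email = uniqueEmail('qa', {
//     runId: process.env.GITHUB_RUN_ID ?? String(Date.now()),
//     workerIndex: testInfo.workerIndex,
//     retry: testInfo.retry,
//     testTitle: testInfo.title,
//   });
```

Including the retry number matters: without it, a retried test would try to reuse the address its failed first attempt already created.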
Reset storage and authentication state
Browser storage, cookies, and local state can leak between tests if you reuse contexts incorrectly. In Playwright, isolate sessions by using fresh browser contexts unless you intentionally share state.
If a flaky test passes only after a clean rerun, inspect whether the first run left state behind that the second run inherited.
A rerun that succeeds after cleanup is a strong hint that the bug is in state management, not in the assertion.
Step 8, compare local and CI environment differences
If the test still seems random, the next step is to compare environments systematically. Do not guess. Check the variables that change behavior.
Browser and OS differences
Record the browser version, headless or headed mode, OS image, and viewport. Layout issues can show up only under a specific resolution or font set.
Locale, timezone, and language
Dates, number formatting, and even text direction can affect UI output. If your assertions compare visible text, ensure the test environment uses the expected locale and timezone.
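Pinning these settings in the Playwright config removes them as variables between local and CI runs. The values below are illustrative assumptions, not recommendations. The option names (`headless`, `viewport`, `locale`, `timezoneId`) are standard Playwright `use` options.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Run the same way locally and in CI.
    headless: true,
    // Fixed viewport so layout-dependent locators behave the same.
    viewport: { width: 1280, height: 720 },
    // Fixed locale and timezone so dates and numbers render identically.
    locale: 'en-US',
    timezoneId: 'UTC',
  },
});
```

If a failure disappears once these are pinned, the original cause was most likely an environment difference rather than a timing bug.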
Resource limits
CI runners may have fewer CPU cores, less memory, or more contention. An app that hydrates slowly or a server that starts under load can cause timing-sensitive tests to fail.
Network behavior
External dependencies are a frequent source of test failures. If a page loads third-party scripts or analytics that are not relevant to the test, consider stubbing them. For API-driven tests, use request interception or a test double when the real service is not under your control.
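Blocking irrelevant third-party requests can be sketched as a URL predicate plus a route handler. Note that this example aborts those requests rather than stubbing them; `route.fulfill` would be the choice when the page needs a canned response. `isThirdParty` and the host names are hypothetical.

```typescript
// Returns true when the request targets a host outside the
// first-party list. Subdomains of a listed host count as first party.
function isThirdParty(requestUrl: string, firstPartyHosts: string[]): boolean {
  let host: string;
  try {
    host = new URL(requestUrl).hostname;
  } catch {
    return false; // Malformed URL: leave the request untouched.
  }
  if (host === '') {
    return false; // data: and about: URLs have no host.
  }
  return !firstPartyHosts.some(
    (allowed) => host === allowed || host.endsWith(`.${allowed}`),
  );
}

// Possible usage in a test (not executed here):
//   await page.route('**/*', (route) =>
//     isThirdParty(route.request().url(), ['localhost', 'myapp.test'])
//       ? route.abort()
//       : route.continue(),
//   );
```

An allowlist of first-party hosts tends to age better than a blocklist of vendors, because new third-party scripts are blocked by default.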
Step 9, use retries carefully, then remove the illusion of safety
Retries are useful in CI because they can keep a pipeline moving while you investigate. But they should be a temporary mitigation, not a root-cause strategy.
What retries can tell you
- Pass on retry often indicates timing or environment sensitivity.
- Fail consistently across retries points toward a deterministic defect.
- Fail once, pass once, and then fail again can indicate a data race or shared state.
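The signals above can be written down as a small function over the ordered outcomes of a test's attempts. This is a heuristic sketch, not a diagnosis: `interpretAttempts` and its labels are hypothetical, and the `inconclusive` case is an added assumption for a single failed attempt.

```typescript
type AttemptStatus = 'passed' | 'failed';

type RetrySignal =
  | 'stable'
  | 'inconclusive'
  | 'timing-or-environment'
  | 'deterministic-defect'
  | 'data-race-or-shared-state';

// Maps an ordered list of attempt outcomes to the hint it suggests.
function interpretAttempts(attempts: AttemptStatus[]): RetrySignal {
  if (attempts.length === 0) {
    throw new Error('At least one attempt is required');
  }
  const firstPass = attempts.indexOf('passed');
  const lastFail = attempts.lastIndexOf('failed');

  if (lastFail === -1) {
    return 'stable'; // Never failed.
  }
  if (firstPass === -1) {
    // Never passed: one failure alone does not prove determinism.
    return attempts.length >= 2 ? 'deterministic-defect' : 'inconclusive';
  }
  // A failure after a pass suggests state that changes between attempts.
  return lastFail > firstPass
    ? 'data-race-or-shared-state'
    : 'timing-or-environment';
}
```

For example, `interpretAttempts(['failed', 'passed'])` yields `'timing-or-environment'`, while `interpretAttempts(['failed', 'passed', 'failed'])` yields `'data-race-or-shared-state'`.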
What retries cannot tell you
- why the test failed
- whether the fix is correct
- whether the app or test is healthy
A better pattern is to keep retries low and require investigation for repeated flaky signatures. Otherwise, you build a pipeline that normalizes instability.
Step 10, build a small debugging playbook for the team
The easiest way to debug flaky Playwright tests in CI without guesswork is to make the process repeatable for everyone.
A practical triage checklist
When a CI test fails, ask these questions in order:
- Did the test fail on the first attempt or only after retries?
- What exact action failed: click, fill, wait, assertion, or navigation?
- Do the trace and screenshot show the expected page state?
- Is the locator unique and stable?
- Is the failure tied to a specific browser, viewport, or runner image?
- Does the test rely on shared or pre-existing data?
- Can the failure be reproduced with the same seed, input, or account?
Add labels to flaky failures
If your CI or test reporting system supports it, tag failures by symptom, such as:
- selector
- timeout
- auth state
- data collision
- environment
This helps you see patterns across builds. A single failing test can look random. Ten tests failing for the same reason is a signal.
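A first-pass label can sometimes be derived from the error message. The sketch below is a rough heuristic: `classifyFailure` is hypothetical, and the patterns are assumptions about typical messages (for example, Playwright reports "strict mode violation" when a locator matches several elements). The `environment` label is left for manual tagging because an error message alone rarely reveals it.

```typescript
type FlakeLabel =
  | 'selector'
  | 'timeout'
  | 'auth state'
  | 'data collision'
  | 'environment'
  | 'unclassified';

// Ordered rules: the first matching pattern wins.
// These patterns are illustrative and should be tuned per project.
const labelRules: Array<{ label: FlakeLabel; pattern: RegExp }> = [
  { label: 'selector', pattern: /strict mode violation/i },
  { label: 'data collision', pattern: /duplicate key|already exists/i },
  { label: 'auth state', pattern: /\b401\b|unauthorized|session expired/i },
  { label: 'timeout', pattern: /timeout .*exceeded|TimeoutError/i },
];

// Returns a symptom label for an error message, or 'unclassified'
// when no rule matches and a human needs to look.
function classifyFailure(message: string): FlakeLabel {
  for (const rule of labelRules) {
    if (rule.pattern.test(message)) {
      return rule.label;
    }
  }
  return 'unclassified';
}
```

Automatic labels are only a starting point for triage. The `unclassified` bucket is the one worth reviewing regularly, since it shows where the rules no longer match reality.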
Keep failure notes near the code
When you fix a flaky test, add a comment or commit note about why the previous approach failed. Future maintainers should not have to rediscover the same issue.
A few concrete patterns and their fixes
Pattern: click fails only in CI after navigation
Likely cause, the app is still rendering or the element is covered.
Try:
- waiting for a visible, enabled control
- checking that the page has reached the expected route
- using a locator tied to the user-facing control, not a CSS container
Pattern: test passes locally but fails with a timeout in CI
Likely cause, CI is slower or the environment is under load.
Try:
- replacing sleeps with assertions
- waiting for a response or state transition
- checking whether the app has an extra async dependency in CI
Pattern: one test fails only after another test ran first
Likely cause, shared state or cleanup failure.
Try:
- isolating user accounts, browser contexts, and test data
- making setup and teardown explicit
- running the test alone and in a shuffled order
Pattern: selector works until the UI changes slightly
Likely cause, fragile locator strategy.
Try:
- switching to role-based locators
- using test IDs for repeated elements
- asserting uniqueness before interacting
A sample CI workflow that surfaces flakes faster
A good pipeline does more than run tests. It helps you answer whether the failure is reproducible, environment-specific, or state-related.
name: e2e
on: [pull_request]
jobs:
  playwright:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=line
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-artifacts
          path: test-results/
The useful part here is not the YAML itself, but the habit behind it: install the same browsers the tests expect, preserve artifacts on failure, and make inspection cheap.
When to refactor the test versus the app
Sometimes the test is wrong, and sometimes the app exposes a real user-facing timing bug. The distinction matters.
Refactor the test when:
- the locator is brittle
- the wait condition is artificial
- the test depends on shared state
- the assertion is too indirect
Investigate the app when:
- the UI remains unresponsive for a long time
- the page becomes interactive before data is ready
- navigation or modal state is inconsistent
- the same issue appears in manual testing
If the page is genuinely not ready when the user can still interact with it, the product may have a bug, not just the test.
A simple rule for deciding what to do next
If you can reproduce the failure locally with the same environment and data, fix the underlying cause. If you cannot reproduce it locally, instrument the gap between local and CI until you can. If a retry hides the issue, assume it is still there until proven otherwise.
That workflow sounds obvious, but it prevents the most common wasteful response to flaky CI failures, which is to tweak timeouts until the red builds become less frequent and the root cause becomes harder to find.
Closing thoughts
To debug flaky Playwright tests in CI without guesswork, focus on evidence, reproduction, and isolation. Use traces and targeted logs to understand the failure. Classify the issue into timing, selector instability, environment differences, or data dependence. Then fix the cause, not the symptom.
That discipline pays off quickly. Stable selectors reduce false failures. Better waits reduce timing noise. Cleaner test data reduces order dependence. And a CI pipeline with good artifacts makes the next failure faster to explain.
For teams building a broader test automation strategy, the same pattern applies across software testing, test automation, and continuous integration. The tools differ, but the debugging logic stays the same. When the pipeline tells you exactly what happened, flaky tests stop being a mystery and start becoming fixable engineering work.