Flaky Playwright tests in CI are frustrating because they fail for reasons that are often real, but not obvious. A test may pass locally, fail only in GitHub Actions, disappear after a retry, and leave behind a screenshot that does not explain much. The result is a debugging loop built on guesses, and guesses are expensive when the same suite runs on every pull request.

The good news is that most flaky CI failures follow a small number of patterns. They are usually caused by timing, selector instability, environment differences, shared test data, or state leakage between tests. If you debug them systematically, you can turn a random failure into a reproducible failure, then into a fix.

This guide focuses on a practical workflow for teams that use Playwright in continuous integration, especially when the failure only appears in CI and not on a developer laptop. The goal is not to eliminate all test flakiness overnight, but to isolate the source quickly and make the next failure cheaper to diagnose.

Start with the right mental model

A flaky test is not just a test that fails intermittently. It is a test whose outcome depends on something the test does not fully control or observe. In CI, that hidden dependency can be anything from machine load to network timing to a browser state carried over from a previous test.

That matters because the fix depends on the cause:

  • A timing problem needs better synchronization, not more retries.
  • A selector problem needs more stable locators, not longer timeouts.
  • An environment problem needs parity between local and CI, not another waitForTimeout.
  • A data problem needs isolation, not a rerun.

Retries can hide noise, but they do not explain it. Treat a retry as a clue, not a solution.

Before changing code, decide whether you are trying to reproduce the issue, observe the failure more clearly, or narrow the cause. Those are different tasks.

Step 1, make the failure observable

The first job is to capture enough evidence that the failure is actionable. A vague “test failed in CI” is not enough. You want the exact test name, the action that failed, the browser, the CI job, and the artifacts that show the page state.

Turn on the right artifacts

Playwright already gives you several useful debugging signals:

  • traces
  • screenshots
  • video
  • console logs
  • network logs

For flaky CI failures, traces are usually the most valuable artifact because they combine actions, DOM snapshots, network activity, and timing in one place.

A typical Playwright config for CI might enable traces on first retry:

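typescript

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry only in CI so local failures surface immediately.
  retries: process.env.CI ? 2 : 0,
  use: {
    // Record a trace when a failed test is retried for the first time.
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});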

If a test fails only in CI, inspect the trace before changing the test. The trace often shows whether the failure came from the app, the test, or the environment.

Add targeted logging, not noise

Avoid dumping large logs on every step. Instead, log only the values that answer questions you actually have:

  • Which selector was used?
  • Which URL was loaded?
  • Which API response did the test wait for?
  • Which test data record was created?

If the test depends on a backend call, log the request identifier or response status. If it depends on generated data, log the fixture or seed.
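One way to keep this logging targeted is to emit a single structured line per step that contains only the fields you care about. The sketch below is a minimal illustration; the `StepLog` shape and the `formatStepLog` name are hypothetical and not part of Playwright.

```typescript
// Hypothetical shape: only fields that answer a concrete debugging question.
interface StepLog {
  test: string;
  step: string;
  selector?: string;
  url?: string;
  responseStatus?: number;
  recordId?: string;
}

// Serialize one step as a single JSON line. JSON.stringify omits properties
// whose value is undefined, so unset fields do not clutter the CI log.
function formatStepLog(entry: StepLog): string {
  return JSON.stringify(entry);
}

console.log(
  formatStepLog({
    test: 'profile save',
    step: 'wait for /api/profile',
    url: 'https://app.example.test/settings', // illustrative URL
    responseStatus: 200,
  }),
);
```

Inside a Playwright test, the same line could be printed with `console.log` or attached to the report with `testInfo.attach`, depending on where your team prefers to read it.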

Step 2, classify the failure before changing code

Once you have artifacts, classify the failure into one of four buckets.

1. Timing or synchronization

These tests usually fail because the test moves faster than the app. Common symptoms include:

  • TimeoutError on click, fill, expect, or navigation
  • assertion passes locally but fails in CI
  • element exists in the DOM but is not yet visible or enabled
  • intermittent failure after navigation or modal opening

This is often caused by waiting for the wrong thing. For example, waiting for a DOM node to appear is not enough if the app still renders a skeleton state, blocks input, or updates in the next animation frame.

2. Selector instability

These failures happen when the locator depends on text or structure that changes too often. Common symptoms include:

  • locator finds multiple elements
  • test clicks the wrong element after a layout change
  • failure starts after a small UI redesign
  • works in one locale but not another

3. Environment differences

The test passes locally and fails in CI because the runtime is not the same. Common causes:

  • smaller CPU or memory allocation in CI
  • different browser version or channel
  • headless rendering differences
  • missing fonts, locale, or timezone settings
  • network latency or blocked third-party resources

4. Data or state dependence

These tests fail because they rely on previous tests, shared records, or mutable backend state. Common symptoms include:

  • failure depends on test order
  • rerun passes after a failed first run
  • record already exists, duplicate key, or stale session issues
  • one test creates state that another test assumes is absent

Step 3, reproduce under CI-like conditions locally

If the failure only happens in CI, try to make your local environment more like CI instead of the other way around. This is often faster than staring at a green local run and a red pipeline.

Match browser and runtime versions

Use the same Playwright version, browser channels, Node version, and OS container image as the CI job when possible. Small version differences can affect timing, layout, or browser behavior.

If your CI uses a Docker image, run that image locally. If your pipeline uses Ubuntu, do not debug a Linux-only issue from macOS and assume equivalence.

Reduce local performance advantage

Your laptop is often faster than CI. That matters because a test that accidentally depends on speed may pass locally and fail in CI.

You can simulate this by:

  • running the suite in headless mode
  • limiting local CPU resources with containers
  • disabling debug-only waits
  • running with the same viewport and locale as CI

If the issue disappears when the machine is faster, that points toward synchronization or state timing.

Use the exact failing seed or data set

If the test generates random data, capture the seed, payload, or request body from the failing run. Then rerun with that same input. Randomized data is useful, but only if you can replay it.
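Replaying an input only works if the data generator is driven by a seed you can record. The following is a minimal sketch using a small deterministic generator (mulberry32); the `buildUser` helper, the name list, and the hard-coded seed are illustrative assumptions.

```typescript
// mulberry32: a small deterministic PRNG. The same seed yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Hypothetical test-data builder whose output depends only on the seed.
function buildUser(seed: number): { name: string; age: number } {
  const random = mulberry32(seed);
  const names = ['Ada', 'Grace', 'Linus', 'Margaret'];
  return {
    name: names[Math.floor(random() * names.length)],
    age: 18 + Math.floor(random() * 60),
  };
}

// Assumption: in practice the seed is generated per run or read from the environment.
const runSeed = 20240517;
// Print the seed next to the data so a failing CI run can be replayed locally.
console.log(`seed=${runSeed}`, buildUser(runSeed));
```

Rerunning with the seed printed by the failing job reproduces the same generated user, which turns "random data" into a fixed input.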

Step 4, inspect the trace like a timeline

A Playwright trace is more than a debugging artifact. It is a timeline of what the browser saw and what the test did.

When reviewing the trace, ask these questions:

  • Did the test click before the element was fully ready?
  • Was the locator resolved to the expected element?
  • Did the app navigate or re-render unexpectedly?
  • Did a network request fail or take longer than expected?
  • Was the page already in a bad state before the failing step?

A useful habit is to compare the last successful action to the first failing one. Often the problem is not at the exact line that failed, but one step earlier.

Common timing bug example

A test that opens a drawer and immediately clicks inside it may fail if the drawer animates in or if the input is disabled until hydration completes. The version below waits for the dialog to become visible before filling the field:

typescript

await page.getByRole('button', { name: 'Open settings' }).click();
await expect(page.getByRole('dialog')).toBeVisible();
await page.getByLabel('Display name').fill('Test User');

The important part is not the wait itself, but the condition. Waiting for the dialog to be visible is more meaningful than sleeping for a fixed number of milliseconds.

Step 5, fix synchronization with signals, not sleeps

The most common anti-pattern in flaky Playwright tests is adding arbitrary delay. It can make the failure rarer, but it also makes the suite slower and the underlying issue harder to see.

Prefer assertions that wait for state

Playwright's web-first assertions retry until the condition becomes true or the assertion timeout expires. Use that behavior instead of manual polling where possible.

typescript

await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
await expect(page.getByTestId('save-button')).toBeEnabled();

Wait on network or application signals when needed

If your UI depends on a backend response, wait for the response or for the visible result of that response. The best choice depends on what the user actually experiences.

typescript

const responsePromise = page.waitForResponse(resp =>
  resp.url().includes('/api/profile') && resp.status() === 200
);
await page.getByRole('button', { name: 'Save' }).click();
await responsePromise;

This is better than a blind timeout because it ties the wait to an observable event.

Avoid over-waiting

A test can also become flaky when it waits for the wrong terminal state. For example, waiting for the network to be idle (such as the networkidle load state) may be inappropriate on pages that keep background polling active, because the page may never become idle. In that case, wait for the specific app state that proves readiness.

Step 6, make selectors stable and human-readable

Selector instability is a major source of intermittent failures, especially after UI refactors.

Use roles and accessible names first

When possible, prefer locators that represent user intent instead of layout structure.

typescript

await page.getByRole('button', { name: 'Submit order' }).click();
await page.getByRole('textbox', { name: 'Email address' }).fill('qa@example.com');

These locators survive DOM restructuring better than brittle CSS chains or text that changes with copy updates.

Reserve test IDs for ambiguous cases

Sometimes several elements have the same role and label, or the app has repeated patterns. In those cases, a stable data-testid is acceptable. The goal is not to avoid test IDs entirely, but to use them when they improve clarity and stability.

Check for duplicate matches

If a locator matches more than one element, a test can become flaky as the DOM changes. Debug duplicates explicitly by narrowing the locator or by asserting on the count.

typescript

const rows = page.getByRole('row', { name: /invoice/i });
await expect(rows).toHaveCount(1);

This turns an ambiguous selector into a failure you can understand.

Step 7, isolate data dependencies and state leakage

Some flaky Playwright tests are not UI problems at all. They are test isolation problems.

Create and destroy your own test data

If one test depends on a record created by another test, it is fragile by design. Each test should set up the data it needs and clean up after itself when possible.

For API-backed test setups, create fixtures through the backend instead of clicking through the UI every time. This makes the test faster and reduces unrelated UI failure points.

Use unique data per run

Shared usernames, emails, and tenant names are a common source of collisions in CI. Use a run-specific suffix, timestamp, or UUID.
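A small factory can make this uniqueness automatic. This is a sketch under assumptions: `createUniqueFactory` is a hypothetical helper, and `runId` stands for whatever identifier your CI exposes for the current run.

```typescript
// Build identifiers that are unique per CI run and per call.
// `runId` is an assumption: pass a build number, commit SHA, or similar.
function createUniqueFactory(runId: string): (prefix: string) => string {
  let counter = 0;
  // A random salt per factory reduces collisions when several workers
  // share the same runId and each starts its own counter at zero.
  const workerSalt = Math.random().toString(36).slice(2, 8);
  return (prefix) => `${prefix}-${runId}-${workerSalt}-${++counter}`;
}

const uniqueName = createUniqueFactory('run1234');
const signupEmail = `${uniqueName('qa')}@example.com`;
const tenantName = uniqueName('tenant');
console.log(signupEmail, tenantName);
```

Because the run identifier is embedded in every value, leftover records are also easy to trace back to the job that created them.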

Reset storage and authentication state

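Browser storage, cookies, and local state can leak between tests if you reuse contexts incorrectly. Playwright Test gives each test a fresh browser context by default, so leakage usually comes from a shared storageState file, a manually reused context or page, or sessions kept on the server. Keep sessions isolated unless you intentionally share state.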

If a flaky test passes only after a clean rerun, inspect whether the first run left state behind that the second run inherited.

A rerun that succeeds after cleanup is a strong hint that the bug is in state management, not in the assertion.
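If the project config supplies a shared signed-in state, one way to opt a test file out of it is to reset the storage state for that file. This is a minimal configuration sketch, assuming Playwright Test and a `baseURL` set in the config; the test title and the `/checkout` route are hypothetical.

```typescript
import { test } from '@playwright/test';

// Start every test in this file with no cookies and no localStorage,
// even if the project config points at a shared storageState file.
test.use({ storageState: { cookies: [], origins: [] } });

test('checkout works for a signed-out visitor', async ({ page }) => {
  await page.goto('/checkout'); // assumption: baseURL is configured
  // ... assertions for the signed-out experience go here
});
```

If the flake disappears with an empty storage state, the shared session is a likely source of the leak.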

Step 8, compare local and CI environment differences

If the test still seems random, the next step is to compare environments systematically. Do not guess. Check the variables that change behavior.

Browser and OS differences

Record the browser version, headless or headed mode, OS image, and viewport. Layout issues can show up only under a specific resolution or font set.
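Recording these values in both environments makes the comparison mechanical. The sketch below assumes a hypothetical `EnvFingerprint` shape; the field names and sample values are illustrative, and how you collect them depends on your setup.

```typescript
// Hypothetical fingerprint: capture the same fields locally and in CI.
interface EnvFingerprint {
  browserVersion: string;
  headless: boolean;
  osImage: string;
  viewport: string;
}

// Return one line per field that differs, so the gap is explicit, not guessed.
function diffEnvironments(local: EnvFingerprint, ci: EnvFingerprint): string[] {
  const keys = Object.keys(local) as Array<keyof EnvFingerprint>;
  return keys
    .filter((key) => local[key] !== ci[key])
    .map((key) => `${key}: local=${String(local[key])} ci=${String(ci[key])}`);
}

// Illustrative values only.
const localEnv: EnvFingerprint = {
  browserVersion: '124.0', headless: false, osImage: 'macos-14', viewport: '1440x900',
};
const ciEnv: EnvFingerprint = {
  browserVersion: '124.0', headless: true, osImage: 'ubuntu-22.04', viewport: '1280x720',
};
console.log(diffEnvironments(localEnv, ciEnv).join('\n'));
```

Each differing field is a candidate to align locally before looking at the test code itself.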

Locale, timezone, and language

Dates, number formatting, and even text direction can affect UI output. If your assertions compare visible text, ensure the test environment uses the expected locale and timezone.
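To see why this matters, the following sketch formats the same instant and the same number under different settings using the standard `Intl` API. The chosen date, locales, and timezones are arbitrary examples.

```typescript
// The same instant can fall on different calendar days depending on the timezone.
const instant = new Date('2024-01-01T23:30:00Z');

function formatDate(date: Date, locale: string, timeZone: string): string {
  return new Intl.DateTimeFormat(locale, {
    year: 'numeric',
    month: 'short',
    day: 'numeric',
    timeZone,
  }).format(date);
}

const utcText = formatDate(instant, 'en-US', 'UTC');
const tokyoText = formatDate(instant, 'en-US', 'Asia/Tokyo');
console.log(utcText, '|', tokyoText); // the day differs between the two zones

// Grouping and decimal separators depend on the locale.
const enNumber = new Intl.NumberFormat('en-US').format(1234.5);
const deNumber = new Intl.NumberFormat('de-DE').format(1234.5);
console.log(enNumber, '|', deNumber);
```

An assertion on visible text that passes on a laptop in one timezone can therefore fail on a CI runner set to UTC. In Playwright, the `locale` and `timezoneId` options pin these values for the browser context, so both environments render the same text.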

Resource limits

CI runners may have fewer CPU cores, less memory, or more contention. An app that hydrates slowly or a server that starts under load can cause timing-sensitive tests to fail.

Network behavior

External dependencies are a frequent source of test failures. If a page loads third-party scripts or analytics that are not relevant to the test, consider stubbing them. For API-driven tests, use request interception or a test double when the real service is not under your control.
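A small predicate can decide which requests to stub. This is a sketch, assuming you can list the hosts your app and its API actually use; `isThirdParty` and the host names are hypothetical.

```typescript
// Decide whether a request targets a host the test does not own.
// `firstPartyHosts` is an assumption: list the hosts your app really uses.
function isThirdParty(requestUrl: string, firstPartyHosts: string[]): boolean {
  const { hostname } = new URL(requestUrl);
  return !firstPartyHosts.some(
    (host) => hostname === host || hostname.endsWith(`.${host}`),
  );
}

const ownHosts = ['app.example.test'];
console.log(isThirdParty('https://www.google-analytics.com/collect', ownHosts));
console.log(isThirdParty('https://api.app.example.test/profile', ownHosts));

// In a Playwright test, the predicate could drive request interception:
//   await page.route('**/*', (route) =>
//     isThirdParty(route.request().url(), ownHosts) ? route.abort() : route.continue(),
//   );
```

Aborting is the bluntest option. If the page breaks without a third-party script, fulfilling the request with a stub response may be a better fit than aborting it.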

Step 9, use retries carefully, then remove the illusion of safety

Retries are useful in CI because they can keep a pipeline moving while you investigate. But they should be a temporary mitigation, not a root-cause strategy.

What retries can tell you

  • Pass on retry often indicates timing or environment sensitivity.
  • Fail consistently across retries points toward a deterministic defect.
  • Fail once, pass once, and then fail again can indicate a data race or shared state.

What retries cannot tell you

  • why the test failed
  • whether the fix is correct
  • whether the app or test is healthy

A better pattern is to keep retries low and require investigation for repeated flaky signatures. Otherwise, you build a pipeline that normalizes instability.
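The retry patterns described in this step can be encoded as a small triage helper, so that a report says more than "flaky". This is a sketch; the `interpretAttempts` name and the hint strings are hypothetical, and the attempt history may span retries or several runs.

```typescript
type Outcome = 'passed' | 'failed';

// Map a test's attempt history to the hint suggested by the retry patterns above.
function interpretAttempts(attempts: Outcome[]): string {
  if (attempts.length === 0) return 'no attempts recorded';
  const failures = attempts.filter((outcome) => outcome === 'failed').length;
  if (failures === 0) return 'stable in this history';
  if (failures === attempts.length) {
    return 'consistent failure: likely a deterministic defect';
  }
  // A failure directly after a pass means the result flipped back.
  const flippedBack = attempts.some(
    (outcome, i) => i > 0 && outcome === 'failed' && attempts[i - 1] === 'passed',
  );
  return flippedBack
    ? 'alternating results: suspect a data race or shared state'
    : 'passed on retry: suspect timing or environment sensitivity';
}

console.log(interpretAttempts(['failed', 'passed']));
console.log(interpretAttempts(['failed', 'passed', 'failed']));
```

The output is only a starting hypothesis for the investigation, not a diagnosis.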

Step 10, build a small debugging playbook for the team

The easiest way to debug flaky Playwright tests in CI without guesswork is to make the process repeatable for everyone.

A practical triage checklist

When a CI test fails, ask these questions in order:

  1. Did the test fail on the first attempt or only after retries?
  2. What exact action failed: click, fill, wait, assertion, or navigation?
  3. Do the trace and screenshot show the expected page state?
  4. Is the locator unique and stable?
  5. Is the failure tied to a specific browser, viewport, or runner image?
  6. Does the test rely on shared or pre-existing data?
  7. Can the failure be reproduced with the same seed, input, or account?

Add labels to flaky failures

If your CI or test reporting system supports it, tag failures by symptom, such as:

  • selector
  • timeout
  • auth state
  • data collision
  • environment

This helps you see patterns across builds. A single failing test can look random. Ten tests failing for the same reason is a signal.
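Labeling can start as a few heuristic rules applied to the error message. The sketch below is an assumption-heavy illustration: the regular expressions are guesses at typical error text and should be tuned against the messages your own suite produces.

```typescript
type FlakeLabel =
  | 'selector'
  | 'timeout'
  | 'auth state'
  | 'data collision'
  | 'environment'
  | 'unlabeled';

// Heuristic rules, checked in order. Timeouts come last because a timeout
// message often wraps a more specific cause.
const labelRules: Array<[RegExp, FlakeLabel]> = [
  [/strict mode violation|resolved to \d+ elements/i, 'selector'],
  [/duplicate key|already exists|unique constraint/i, 'data collision'],
  [/\b401\b|unauthorized|session expired/i, 'auth state'],
  [/ECONNREFUSED|ERR_NAME_NOT_RESOLVED|browser has been closed/i, 'environment'],
  [/timeout .*exceeded|timed out/i, 'timeout'],
];

function labelFailure(message: string): FlakeLabel {
  for (const [pattern, label] of labelRules) {
    if (pattern.test(message)) return label;
  }
  return 'unlabeled';
}

console.log(labelFailure('locator.click: Timeout 30000ms exceeded.'));
console.log(labelFailure('strict mode violation: locator resolved to 2 elements'));
```

Counting labels across builds then shows whether ten failures are ten problems or one problem seen ten times.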

Keep failure notes near the code

When you fix a flaky test, add a comment or commit note about why the previous approach failed. Future maintainers should not have to rediscover the same issue.

A few concrete patterns and their fixes

Pattern: click fails only in CI after navigation

Likely cause: the app is still rendering or the element is covered.

Try:

  • waiting for a visible, enabled control
  • checking that the page has reached the expected route
  • using a locator tied to the user-facing control, not a CSS container

Pattern: test passes locally but fails with a timeout in CI

Likely cause: CI is slower or the environment is under load.

Try:

  • replacing sleeps with assertions
  • waiting for a response or state transition
  • checking whether the app has an extra async dependency in CI

Pattern: one test fails only after another test ran first

Likely cause: shared state or cleanup failure.

Try:

  • isolating user accounts, browser contexts, and test data
  • making setup and teardown explicit
  • running the test alone and in a shuffled order

Pattern: selector works until the UI changes slightly

Likely cause: fragile locator strategy.

Try:

  • switching to role-based locators
  • using test IDs for repeated elements
  • asserting uniqueness before interacting

A sample CI workflow that surfaces flakes faster

A good pipeline does more than run tests. It helps you answer whether the failure is reproducible, environment-specific, or state-related.

yaml

name: e2e

on: [pull_request]

jobs:
  playwright:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=line
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-artifacts
          path: test-results/

The useful part here is not the YAML itself, but the habit behind it: install the same browsers the tests expect, preserve artifacts on failure, and make inspection cheap.

When to refactor the test versus the app

Sometimes the test is wrong, and sometimes the app exposes a real user-facing timing bug. The distinction matters.

Refactor the test when:

  • the locator is brittle
  • the wait condition is artificial
  • the test depends on shared state
  • the assertion is too indirect

Investigate the app when:

  • the UI remains unresponsive for a long time
  • the page becomes interactive before data is ready
  • navigation or modal state is inconsistent
  • the same issue appears in manual testing

If the page is genuinely not ready when the user can still interact with it, the product may have a bug, not just the test.

A simple rule for deciding what to do next

If you can reproduce the failure locally with the same environment and data, fix the underlying cause. If you cannot reproduce it locally, instrument the gap between local and CI until you can. If a retry hides the issue, assume it is still there until proven otherwise.

That workflow sounds obvious, but it prevents the most common wasteful response to flaky CI failures, which is to tweak timeouts until the red builds become less frequent and the root cause becomes harder to find.

Closing thoughts

To debug flaky Playwright tests in CI without guesswork, focus on evidence, reproduction, and isolation. Use traces and targeted logs to understand the failure. Classify the issue into timing, selector instability, environment differences, or data dependence. Then fix the cause, not the symptom.

That discipline pays off quickly. Stable selectors reduce false failures. Better waits reduce timing noise. Cleaner test data reduces order dependence. And a CI pipeline with good artifacts makes the next failure faster to explain.

For teams building a broader test automation strategy, the same pattern applies across software testing, test automation, and continuous integration: the tools differ, but the debugging logic stays the same. When the pipeline tells you exactly what happened, flaky tests stop being a mystery and start becoming fixable engineering work.