Your Feature Flags Are Lying To Your Tests

You're using feature flags wrong. Most teams treat them as a deployment lever, a safety net for production, or a tool for A/B experiments. That view leaves real quality and agility gains on the table: it misses their value as a continuous, dynamic testing primitive, especially when paired with a robust CI/CD pipeline.

The conventional wisdom says that feature flags isolate new code, enabling trunk-based development by letting unreleased features sit in production without affecting users. That is true, but it is a side effect of their real utility: parallel, isolated testing of features before they are ever exposed to users. Your staging environment is a snapshot of one state, but a properly integrated feature flag system can simulate many future states, so you can validate several concurrent features both in isolation and in combination.

Your Flags Are Just A/B Switches. They Should Be Test Harnesses.

The default mental model for feature flags is a simple boolean toggle: feature-x-enabled: true/false. You flip it in LaunchDarkly or Split.io, and a new experience lights up. This is incredibly limiting. We started treating flags not just as configuration variables but as direct input parameters to our test environments. Imagine a scenario where you have three interdependent features – new-search-algo, search-results-facets, and user-profile-integration. Most teams would build these, merge them to main, and then hope they don't blow up staging when enabled together.

We don't do that. Each feature lives behind its own flag, and our test suites are designed to run against specific flag combinations. This means our Playwright tests, for example, don't just test the "current" state of main. They test main with new-search-algo on, main with search-results-facets on, and most importantly, main with all three flags on, even when none are live for users. This proactive validation catches integration issues between features months earlier than traditional staging cycles would.
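As a rough sketch of what "specific flag combinations" can mean in practice, the helper below builds the list of flag sets a run would cover: everything off, each flag on by itself, and everything on. The names `flagCombinations` and `withFlags` are hypothetical, and the all-off baseline is an assumption added here for comparison; it is not a description of our actual tooling.

```typescript
// Hypothetical helper: enumerate the flag sets a test run should cover.
type FlagSet = Record<string, boolean>;

// Build one flag set by asking, per flag, whether it should be enabled.
function withFlags(flags: string[], enabled: (flag: string) => boolean): FlagSet {
  const set: FlagSet = {};
  for (const flag of flags) set[flag] = enabled(flag);
  return set;
}

function flagCombinations(flags: string[]): FlagSet[] {
  const baseline = withFlags(flags, () => false); // main as users see it today
  const isolated = flags.map((on) => withFlags(flags, (f) => f === on)); // one flag at a time
  const allOn = withFlags(flags, () => true); // every in-flight feature together
  return [baseline, ...isolated, allOn];
}

const combos = flagCombinations([
  "new-search-algo",
  "search-results-facets",
  "user-profile-integration",
]);
console.log(combos.length); // 5: baseline + 3 isolated + all on
```

Covering every possible combination grows as 2^n, so a linear set like this one (n + 2 runs) is a common compromise; pairs of closely related flags can be added by hand where the risk justifies it.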

The Integration Hell You Didn't Know You Could Avoid

Let's be blunt: your "integration environment" is probably a mess. It's a single, shared state where everyone's half-baked features collide, leading to flaky tests, inexplicable failures, and endless blame games. "It works on my branch!" becomes the team's mantra. Feature flags, when used as test harnesses, obliterate this problem.

Instead of a single, monolithic staging environment, we leverage Testcontainers to spin up isolated, transient environments for each GitHub Actions workflow run. Within these containers, we inject specific feature flag configurations. This allows a PR for new-search-algo to run its entire suite of unit, integration, and even end-to-end tests against an environment where only new-search-algo is enabled. Simultaneously, another PR for user-profile-integration can run its tests against an environment where only its flag is on. The magic happens when we create a "feature integration" pipeline that runs daily, explicitly enabling combinations of flags for features nearing completion.
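One way to picture the injection step is a small function that turns a flag set into the environment variables handed to the application container. This is a minimal sketch, assuming flags are read from variables named `FEATURE_<NAME>_ENABLED`; `flagEnvName` and `flagsToEnv` are hypothetical names, and the image name and port in the trailing comment are placeholders.

```typescript
// Hypothetical naming convention: new-search-algo -> FEATURE_NEW_SEARCH_ALGO_ENABLED
function flagEnvName(flag: string): string {
  return `FEATURE_${flag.replace(/[^a-zA-Z0-9]+/g, "_").toUpperCase()}_ENABLED`;
}

// Convert a flag set into string-valued environment variables.
function flagsToEnv(flags: Record<string, boolean>): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [flag, on] of Object.entries(flags)) {
    env[flagEnvName(flag)] = String(on);
  }
  return env;
}

const containerEnv = flagsToEnv({
  "new-search-algo": true,
  "search-results-facets": false,
});
console.log(containerEnv);

// With the Node testcontainers package, the map could then be passed to the
// application container before it starts, for example:
//   await new GenericContainer("my-app:test")   // placeholder image name
//     .withEnvironment(containerEnv)
//     .withExposedPorts(8080)                   // placeholder port
//     .start();
```

Because the variables are fixed when the container starts, each flag combination gets its own container instead of sharing one mutable environment.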

This approach reduced the share of our test failures attributed to conflicts in the shared integration environment from an average of 34% to under 2% within six months. The shift was profound: instead of debugging why feature-A broke feature-B in a shared environment, we now catch those conflicts in isolated, reproducible test runs tied to specific combinations of flags.

Trunk-Based Development? You Need Flag-Driven Tests

Everyone talks about trunk-based development. Few actually achieve it safely at scale. The fear of breaking main is real, and it often leads to long-lived feature branches, which, ironically, are the antithesis of trunk-based development. Feature flags are the bedrock of true trunk-based development, but only if your testing strategy explicitly leverages them.

If every new feature is developed behind a flag, and your CI/CD pipeline can reliably test that feature in isolation even when merged to main, then main becomes a truly shippable artifact at any given commit. Our GitHub Actions workflows are configured to run specific test matrices based on changes detected in feature flag configurations or code related to a specific flag.
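To make "changes detected in code related to a specific flag" concrete, here is a minimal sketch that maps changed file paths to the flags whose suites should run. The `flagOwners` table and its path prefixes are invented for illustration; a real mapping depends on how the repository is laid out.

```typescript
// Hypothetical ownership map: path prefix -> flag guarding that code.
const flagOwners: Array<{ prefix: string; flag: string }> = [
  { prefix: "src/search/algo/", flag: "new-search-algo" },
  { prefix: "src/search/facets/", flag: "search-results-facets" },
  { prefix: "src/profile/", flag: "user-profile-integration" },
];

// Return the flags whose code is touched by a change set, sorted and de-duplicated.
function flagsTouchedBy(changedFiles: string[]): string[] {
  const touched = new Set<string>();
  for (const file of changedFiles) {
    for (const { prefix, flag } of flagOwners) {
      if (file.startsWith(prefix)) touched.add(flag);
    }
  }
  return Array.from(touched).sort();
}

const touchedFlags = flagsTouchedBy([
  "src/search/algo/rank.ts",
  "src/profile/card.tsx",
  "README.md", // matches no prefix, so it selects no flag
]);
console.log(touchedFlags); // ["new-search-algo", "user-profile-integration"]
```

The resulting list could feed a job matrix so that only the affected flag suites run on a pull request, while a scheduled pipeline still covers the full set.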

Here's a simplified YAML snippet showing how a workflow might run each Playwright suite with a specific set of feature flags enabled, as a pre-release validation step. As written, it runs on every pull request and push to main; narrowing it to flag-related changes would need additional path filters. This isn't for A/B testing; this is for asserting the functionality of a feature before it's even visible to internal users.

name: Feature Flagged E2E Tests

on:
  pull_request:
    branches: [ main ]
  push:
    branches: [ main ]

jobs:
  run-flagged-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Install Playwright browsers
        # Fresh runners have no browser binaries; the tests fail without this.
        run: npx playwright install --with-deps chromium

      - name: Start services with Docker Compose (e.g., mock services, DB)
        # Compose v2 syntax; the standalone docker-compose v1 binary is not
        # guaranteed to be present on current hosted runners.
        run: docker compose -f docker-compose.test.yml up -d
        # In docker-compose.test.yml, we'd inject environment variables
        # to our application service that enable specific feature flags.
        # Example: MY_APP_FEATURE_NEW_SEARCH_ALGO_ENABLED=true

      - name: Run Playwright tests for New Search Algorithm
        env:
          # Step-level env reaches only the Playwright process. The suite must
          # pass the flag on to the app (test endpoint, cookie, localStorage),
          # because the containers started above do not see it.
          FEATURE_NEW_SEARCH_ALGO_ENABLED: 'true'
          BASE_URL: http://localhost:8080
        run: npx playwright test tests/e2e/new-search-algo.spec.ts --project=chromium

      - name: Run Playwright tests for Search Results Facets
        env:
          FEATURE_SEARCH_RESULTS_FACETS_ENABLED: 'true'
          BASE_URL: http://localhost:8080
        run: npx playwright test tests/e2e/search-facets.spec.ts --project=chromium

      - name: Run Playwright tests for BOTH New Search Algorithm AND Search Results Facets
        env:
          FEATURE_NEW_SEARCH_ALGO_ENABLED: 'true'
          FEATURE_SEARCH_RESULTS_FACETS_ENABLED: 'true'
          BASE_URL: http://localhost:8080
        run: npx playwright test tests/e2e/search-algo-facets-integration.spec.ts --project=chromium

      - name: Stop services
        if: always()
        run: docker compose -f docker-compose.test.yml down

This ensures that even if new-search-algo is merged and sitting dormant in main, any change related to it (or any other feature) can trigger a test run that specifically validates its functionality, both in isolation and in combination with other flags. This cut down our "late stage integration defect" rate by 70% in critical areas.

Playwright + Flags: Asserting The Unreleased Future

Our E2E test suite, built mostly with Playwright 1.42, has become the primary mechanism for validating feature flags. We've developed a custom Playwright fixture that lets us set feature flag states dynamically for each test, so a single *.spec.ts file can contain tests that assert behavior with a flag on and with the flag off.

Consider a new "dark mode" feature. Instead of two separate test files or complex conditional logic within a single test, our fixture injects the flag state directly.

// fixtures/feature-flags.ts
import { test as baseTest, expect } from '@playwright/test';

type FeatureFlags = {
  featureFlags: {
    set: (flagName: string, value: boolean) => Promise<void>;
    get: (flagName: string) => Promise<boolean>;
  };
};

// Runs in the browser: persist one flag where the app is assumed to read it.
const writeFlag = ([name, val]: [string, boolean]) => {
  window.localStorage.setItem(`featureFlag:${name}`, String(val));
};

// Extend the base Playwright test object
const test = baseTest.extend<FeatureFlags>({
  featureFlags: async ({ page }, use) => {
    // This assumes your application has an API or localStorage/cookie mechanism
    // to set feature flags for the current user/session.
    // For Mendix apps, we have a custom API endpoint for test environments.
    const featureFlagApi = {
      set: async (flagName: string, value: boolean) => {
        const flag: [string, boolean] = [flagName, value];
        // This is a simplified example. In reality, this would hit a
        // local dev server endpoint or modify a client-side store.
        // A fresh page sits on about:blank, where localStorage is not
        // accessible, so register the write to run before the app's own
        // scripts on every subsequent navigation (including reloads).
        await page.addInitScript(writeFlag, flag);
        // If the app is already loaded, apply the flag to the current document too.
        if (page.url() !== 'about:blank') {
          await page.evaluate(writeFlag, flag);
          // await page.reload(); // Might be necessary depending on app architecture
        }
      },
      get: async (flagName: string) => {
        // Only meaningful after page.goto(), once the page is on the app's origin.
        return await page.evaluate(
          (name) => window.localStorage.getItem(`featureFlag:${name}`) === 'true',
          flagName
        );
      },
    };
    await use(featureFlagApi);
  },
});

export { test, expect };

Now, in our tests:

// tests/e2e/dark-mode.spec.ts
import { test, expect } from '../../fixtures/feature-flags';

test.describe('Dark Mode Feature', () => {
  test('should display light mode by default when flag is off', async ({ page, featureFlags }) => {
    await featureFlags.set('darkModeEnabled', false); // Explicitly disable for this test
    await page.goto('/settings');
    await expect(page.locator('.app-theme')).toHaveClass(/light-theme/);
  });

  test('should display dark mode when flag is on', async ({ page, featureFlags }) => {
    await featureFlags.set('darkModeEnabled', true); // Explicitly enable for this test
    await page.goto('/settings');
    await expect(page.locator('.app-theme')).toHaveClass(/dark-theme/);
  });

  test('should allow user to toggle dark mode even if flag is off initially', async ({ page, featureFlags }) => {
    await featureFlags.set('darkModeEnabled', false);
    await page.goto('/settings');
    await expect(page.locator('.app-theme')).toHaveClass(/light-theme/);

    await page.locator('[data-test-id="toggle-dark-mode"]').click();
    await expect(page.locator('.app-theme')).toHaveClass(/dark-theme/);

    // Verify the flag state is persisted/reflected if applicable
    expect(await featureFlags.get('darkModeEnabled')).toBe(true);
  });
});

This pattern has been crucial for us at Mendix, especially when building AI-powered features. We can validate the AI models' output and UI integration under various flag states, which lets us iterate rapidly on new AI capabilities without impacting existing functionality. It's not just about enabling or disabling; it's about validating the contract of the feature, regardless of its deployment status. By reducing the need for manual exploratory testing on shared environments, this approach cut pipeline time for feature validation by an average of 18 minutes.

Cutting Through The Noise: What To Do Next

Stop treating feature flags as a post-deployment concern. They are a pre-release testing superpower. Your current approach to integration testing is likely a bottleneck, hiding bugs until the last minute. This isn't just theory; it's how we reduced critical defects and accelerated our development cycle.

This week, identify one upcoming feature that will be behind a feature flag. Instead of just building the feature, explicitly integrate its flag state into your local development and CI testing environments. Ensure your Playwright (or Cypress, Selenium, etc.) tests can run against both the "flag on" and "flag off" states of this feature. Do not just test the "default" state of your main branch. This small shift in mindset will dramatically change how you approach quality and release confidence.
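A minimal starting point, assuming the flag is read from an environment variable in local development and CI: the hypothetical `isFlagEnabled` helper below resolves a flag from an environment map and falls back to a default when the variable is absent. The `FEATURE_<NAME>_ENABLED` convention mirrors the workflow above, but the helper itself is an illustration and not part of any flag vendor's SDK.

```typescript
// Hypothetical reader: resolve a flag from an environment map, with a default.
function isFlagEnabled(
  flag: string,
  env: Record<string, string | undefined>,
  fallback: boolean = false
): boolean {
  const key = `FEATURE_${flag.replace(/[^a-zA-Z0-9]+/g, "_").toUpperCase()}_ENABLED`;
  const raw = env[key];
  if (raw === undefined) return fallback; // unset: keep the default behavior
  return raw.trim().toLowerCase() === "true"; // anything other than "true" counts as off
}

// "Flag on" and "flag off" runs differ only in the environment they receive.
const flagOnEnv = { FEATURE_DARK_MODE_ENABLED: "true" };
const flagOffEnv = { FEATURE_DARK_MODE_ENABLED: "false" };

console.log(isFlagEnabled("dark-mode", flagOnEnv)); // true
console.log(isFlagEnabled("dark-mode", flagOffEnv)); // false
console.log(isFlagEnabled("dark-mode", {})); // false (fallback)
```

In a Node process the second argument would typically be `process.env`; passing the map in explicitly keeps the helper easy to exercise in both states from a unit test.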
