Beyond A/B Testing: A Practical Introduction to Switchback Experiments
- Iavor Bojinov
- Aug 30
- 5 min read
Most technology companies are now incredibly adept at running A/B tests, and for good reason: they are the gold standard for causal inference and the engine of data-driven decision-making. But what happens when the very structure of your product or problem makes a traditional A/B test impossible?
Imagine you've developed a new pricing algorithm for a ride-sharing service in a specific city. You can't split the city in half—riders and drivers from the "control" group would inevitably interact with those in the "treatment" group, contaminating the experiment. Similarly, consider a third-party seller wanting to test a new bidding algorithm for their ads on an e-commerce site. If they tried to A/B test on users, the two bidding strategies would compete in the same ad auctions, creating marketplace interference that is incredibly difficult to analyze. To avoid this, they must apply one strategy to their entire account at a time. In these scenarios, the unit of experimentation—the city, the seller's account, the server—is singular.
This is a common challenge that my colleagues and I have seen across many industries. The solution is an experimental design that is both powerful and elegant: the switchback experiment. This post will introduce you to what they are and how they work without any scary mathematics.
What is a Switchback Experiment?
At its core, a switchback experiment, also known as a crossover design or time series experiment, is simple. Instead of splitting a population of units into distinct control and treatment groups, we apply both treatments to the same unit, just at different points in time.
The unit of experimentation switches back and forth between treatments according to a randomized schedule. For example, over six hours, a single unit might receive treatments in the following order:
Hour 1: A (Control) | Hour 2: B (Treatment) | Hour 3: B (Treatment) | Hour 4: A (Control) | Hour 5: B (Treatment) | Hour 6: A (Control)
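To make this concrete, here is a minimal sketch in Python that generates such a schedule, assuming each period is assigned independently with a fair coin flip (other randomization schemes are possible, as discussed below). The function name `switchback_schedule` is illustrative, not from any particular library.

```python
import numpy as np

def switchback_schedule(n_periods, seed=None):
    """Assign 'A' (control) or 'B' (treatment) to each period independently
    with probability 1/2 -- one simple randomization scheme among many."""
    rng = np.random.default_rng(seed)
    return list(rng.choice(["A", "B"], size=n_periods))

# Example: a six-hour experiment with hour-long periods
print(switchback_schedule(n_periods=6, seed=7))
```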
The primary benefit here is that switchback experiments transform the problem of interference (if you are not familiar with this concept, check out this blog post) into one of carryover effects, which we will come back to below.
Designing Your First Switchback Experiment
When setting up a switchback experiment, you have three key decisions to make:
Length of Periods: How long should each treatment period be? This is a trade-off: each period must be long enough for the treatment effect to appear and stabilize, but longer periods mean fewer switches within a fixed experiment window and therefore less opportunity to average out temporal noise.
Number of Periods: How many times should you switch? More switches can help average out time-based noise (like seasonality or day-of-week effects), but each switch introduces the risk of carryover.
Randomization: It is crucial to randomize the sequence of treatments so that predictable temporal patterns (such as rush hour or weekends) do not systematically line up with one treatment.
There is a considerable amount of research on how to pick these, but I will leave that for another post. If you are interested, a good starting point is my paper Design and Analysis of Switchback Experiments, which provides a formal treatment of the subject.
The Challenge of Carryover Effects
Of course, this design introduces its own unique challenge: carryover effects. What happens when the effect of the treatment "leaks" or "carries over" into the period when the unit is receiving the control?
For instance, if users have a fantastic experience with the new pricing algorithm (Treatment B), their positive sentiment might persist even after we switch them back to the control experience. This residual effect from the previous period can bias the results of the current period, making the control look better than it is and masking the true effect of the treatment.
The most common mitigation strategy is to introduce a "washout period" between treatment switches. This is a buffer of time during which we don't use the data in the analysis, allowing the effects of the previous period to dissipate. However, washout periods are not a silver bullet; they can be costly in terms of time and lost data.
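As a rough illustration of how a washout period might be applied in practice, the sketch below drops any observation recorded within a fixed window after a switch. It assumes your data sit in a pandas DataFrame with a 'time' column measured from the start of the experiment and that every period has the same length; the function name and column are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_washout(df, schedule, period_length, washout):
    """Drop observations recorded within `washout` time units after a switch.

    Assumes `df` has a 'time' column measured from the start of the experiment
    and that period k covers [k * period_length, (k + 1) * period_length).
    """
    period = (df["time"] // period_length).astype(int).to_numpy()
    time_into_period = (df["time"] % period_length).to_numpy()
    # A switch occurs at the start of period k if its assignment differs from
    # the previous period's; the first period has no preceding switch.
    switched = np.array([False] + [schedule[k] != schedule[k - 1]
                                   for k in range(1, len(schedule))])
    in_washout = switched[period] & (time_into_period < washout)
    return df[~in_washout]
```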
The Tricky Part: Analyzing the Results
Once the experiment is done, it’s tempting to simply take the average of all the 'A' periods and compare it to the average of all the 'B' periods with a t-test. This is almost always incorrect. This naïve approach ignores the complexity caused by temporal trends and, more importantly, can be severely biased by carryover effects.
In my paper with Neil Shephard, Time series experiments and causal estimands: exact randomization tests and trading, we developed two powerful, non-parametric approaches that rely purely on the randomization of the treatment, without needing to model the underlying time-series data.
The Exact Randomization Test: This method is ideal for testing the "sharp null hypothesis," which is the strong claim that the treatment has no effect at any point in time. If we assume this null is true, then the outcome we observed at any given time would have been the same regardless of which treatment was assigned. This key insight allows us to use a computer to simulate thousands of alternative random treatment assignments that could have happened. For each simulation, we calculate our test statistic (e.g., the difference in means). This process builds the exact distribution of the test statistic under the null hypothesis. We can then compare our actual observed result to this distribution to get an exact p-value. This is often called a Fisher Randomization test, and you can learn more about it in this blog post.
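Here is a simplified sketch of that idea, assuming period-level outcomes and a design that randomized the order of a fixed number of A and B periods (if your design used a different assignment mechanism, you would re-draw from that mechanism instead). The difference in period means is used as the test statistic purely for illustration.

```python
import numpy as np

def randomization_test(period_means, assignments, n_sims=10_000, seed=0):
    """Fisher randomization test of the sharp null of no treatment effect.

    period_means: observed average outcome in each period (length-T array).
    assignments:  observed treatment label per period, e.g. ['A','B','B',...].

    Under the sharp null, outcomes are unchanged by the assignment, so we can
    recompute the test statistic under alternative assignments the design
    could have produced.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(period_means, dtype=float)
    a = np.asarray(assignments)

    def diff_in_means(assign):
        return y[assign == "B"].mean() - y[assign == "A"].mean()

    observed = diff_in_means(a)
    # Re-draw assignments from the design. Permuting the observed labels is
    # correct if the design randomized the order of a fixed number of A and B
    # periods; otherwise, sample from your actual assignment mechanism.
    sims = np.array([diff_in_means(rng.permutation(a)) for _ in range(n_sims)])
    # Two-sided exact p-value: fraction of simulated statistics at least as
    # extreme as the observed one.
    return float(np.mean(np.abs(sims) >= np.abs(observed)))
```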
The Conservative (Asymptotic) Test: Often, we're interested in a weaker question: is the average treatment effect zero? For this, the sharp null is too strong, and without it we have to rely on an asymptotic result. I won't get into the details here, but the upshot is that you can construct a test statistic and a corresponding (asymptotic) confidence interval for the average treatment effect. Importantly, the test statistic does not require any parametric assumptions about the outcome model, so the approach remains non-parametric. This is often referred to as the Neymanian approach, and you can learn more about it in this blog post.
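For intuition only, here is a rough sketch of the Neymanian idea: a difference in period-level means paired with a conservative variance estimate and a normal-approximation interval. This is a deliberate simplification that ignores the serial dependence and carryover structure the paper's estimator is built to handle, so treat it as a starting point rather than the method from the paper.

```python
import numpy as np
from scipy import stats

def neyman_interval(period_means, assignments, alpha=0.05):
    """Difference in period-level means with a conservative (Neyman-style)
    variance estimate and a normal-approximation confidence interval.
    A simplified sketch; requires at least two periods in each arm."""
    y = np.asarray(period_means, dtype=float)
    a = np.asarray(assignments)
    y_b, y_a = y[a == "B"], y[a == "A"]
    est = y_b.mean() - y_a.mean()
    # Conservative variance: sum of arm-wise sample variances divided by the
    # number of periods in each arm.
    se = np.sqrt(y_b.var(ddof=1) / len(y_b) + y_a.var(ddof=1) / len(y_a))
    z = stats.norm.ppf(1 - alpha / 2)
    return est, (est - z * se, est + z * se)
```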
The key takeaway is that, with modern methods, switchback experiments can be as robust and trustworthy as traditional A/B tests.
When Should You Use a Switchback Experiment?
This design is a specialized tool, but it's invaluable in the right circumstances. Consider using it when:
There is complex interference: This occurs in marketplaces like ride-sharing or food delivery, where experimenting on one driver's incentives inevitably affects other drivers and riders, making a standard A/B test impossible to interpret.
You have only a single unit to experiment on: This is common in psychology, where these are called "N-of-1 trials," but it also applies to business settings like testing a change on a single warehouse, store, or an entire market.
There is high heterogeneity between units: If you have a few very different units (e.g., a handful of enterprise customers with wildly different usage patterns), letting each unit act as its own control is a powerful way to reduce variance.
The treatment effect is expected to be quick to manifest and reverse: Switchback designs are ideal when the impact of a treatment appears quickly and fades quickly after the treatment is removed, minimizing carryover effects.
Switchback experiments are a powerful addition to your experimentation toolkit. While they require more careful planning and more sophisticated analysis than a standard A/B test, they unlock the ability to learn and iterate in situations that were previously untestable.
