Most Statsig problems are not obvious. Events fire, experiments run, dashboards populate, and everything looks fine until someone questions the data or an experiment result does not match reality. The issues are almost always in the setup: wrong SDK, contaminated environments, inconsistent event naming, or misconfigured experiments. This article covers the 10 most common Statsig setup issues, what causes them, and exactly how to fix each one.
Statsig is one of the most powerful experimentation and feature flagging platforms available. But power does not equal reliability by default. The gap between a Statsig setup that works and one that produces data you can trust comes down to implementation quality, and most teams do not know there is a gap until something goes wrong.
Here are the 10 most common reasons Statsig setups produce unreliable data, and what to do about each one.
1. Using the Wrong SDK for Your Use Case
Statsig offers multiple SDK types: client-side SDKs for browser and mobile applications, server-side SDKs for backend services, and edge SDKs for CDN-level evaluation. Each is designed for a specific context and produces different behavior.
The most common mistake is using a client-side SDK where a server-side SDK is needed, typically in cases where experiment assignment needs to happen before the page renders, or where the evaluation logic involves sensitive business data that should not be exposed to the client.
Using the wrong SDK does not always cause obvious errors. It causes subtle issues: flicker in web experiments, inconsistent assignment across sessions, or exposure events firing at the wrong point in the user journey.
Fix: review your SDK choice against Statsig's official SDK documentation for your specific tech stack. Client-side SDKs evaluate feature gates and experiments on the device. Server-side SDKs evaluate on the server before responding to the client. The choice depends on where in your stack the assignment needs to happen.
2. Not Separating Environments With Correct API Keys
Statsig uses separate API keys for different environments: development, staging, and production. Each environment should have its own key, and events or experiment exposures logged in development should never mix with production data.
Teams that use the same API key across environments contaminate their production data with internal traffic, test events, and QA sessions. This inflates event counts, skews experiment results, and makes it impossible to trust any analysis that spans a significant time period.
Fix: in your Statsig console, create separate projects or use Statsig's environment configuration to generate distinct API keys for each environment. Ensure your deployment pipeline uses the correct key for each context: development builds use the development key, staging builds use the staging key, and production builds use the production key. Never hardcode production keys into development environments.
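As a minimal sketch of the key-selection rule above, the snippet below picks a key by environment and refuses to start if a non-production build has been handed the production key. The environment variable names (APP_ENV, STATSIG_KEY_DEVELOPMENT, and so on) are hypothetical, not Statsig conventions, and the code does not call the Statsig SDK.

```python
import os

VALID_ENVIRONMENTS = ("development", "staging", "production")


def select_statsig_key(env=None):
    """Return the API key configured for the current environment.

    Variable names are illustrative assumptions, not Statsig conventions.
    """
    env = dict(os.environ) if env is None else env
    app_env = env.get("APP_ENV", "development")
    if app_env not in VALID_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {app_env!r}")

    key = env.get(f"STATSIG_KEY_{app_env.upper()}")
    if not key:
        raise RuntimeError(f"no key configured for {app_env}")

    # Guard against a production key leaking into dev or staging config.
    prod_key = env.get("STATSIG_KEY_PRODUCTION")
    if app_env != "production" and prod_key and key == prod_key:
        raise RuntimeError(f"{app_env} is configured with the production key")
    return key
```

Failing at startup is a deliberate choice here: a crashed staging build is cheaper than weeks of contaminated production data.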
3. Incorrect Exposure Event Logging
In Statsig, an exposure event is logged when a user is assigned to an experiment variant. If exposures are logged at the wrong point (too early, too late, or multiple times per session), your experiment results will be unreliable.
Logging exposures before the user actually sees the variant inflates your experiment population with users who were never meaningfully exposed. Logging them after the primary metric event creates selection bias. Logging them multiple times per session inflates exposure counts and skews statistical calculations.
Fix: log the exposure event at the exact moment the user encounters the variant, not before and not after. In practice, this means calling checkGate (for feature gates) or getExperiment (for experiments) at the point where the feature is rendered or the decision is made. Use Statsig's diagnostics tool to verify that exposure counts match your expected traffic volume.
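To make the two failure modes concrete, here is a rough audit sketch over an exported event log. It flags users with more than one exposure and users whose metric event precedes their first exposure. The (user_id, event_name, timestamp) tuple layout is an assumption for illustration, not a Statsig export format, and the log should be filtered to the experiment window first.

```python
from collections import defaultdict


def audit_exposures(events, exposure_name, metric_name):
    """Flag users with repeated exposures or a metric event before exposure.

    `events` is an iterable of (user_id, event_name, timestamp) tuples.
    The tuple layout is an assumption for this sketch.
    """
    exposures = defaultdict(list)
    metrics = defaultdict(list)
    for user_id, name, ts in events:
        if name == exposure_name:
            exposures[user_id].append(ts)
        elif name == metric_name:
            metrics[user_id].append(ts)

    duplicated = sorted(u for u, ts in exposures.items() if len(ts) > 1)
    # A metric event earlier than the first exposure suggests the exposure fired too late.
    late = sorted(
        u for u, ts in exposures.items()
        if metrics.get(u) and min(metrics[u]) < min(ts)
    )
    return {"duplicated": duplicated, "metric_before_exposure": late}
```

A non-empty result is a prompt to inspect where the exposure call sits in the code path, not proof of a bug on its own.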
4. Inconsistent Event Naming Conventions
Event naming is one of the most common long-term data quality issues in any analytics implementation. In Statsig, inconsistent naming (mixing snake_case and camelCase, abbreviating event names differently across platforms, or using different names for the same action on web and mobile) creates fragmented data that cannot be reliably aggregated or compared.
This becomes a significant problem when you try to use events as metrics in experiments. If the same action is tracked as add_to_cart on web and addToCart on mobile, Statsig treats them as two different events. Your experiment metrics will be incomplete and your results will understate the true effect.
Fix: establish a naming convention before instrumenting any events and apply it consistently across all platforms and teams. Statsig recommends using a consistent format such as object_action (for example cart_item_added or checkout_completed). Document the convention in a shared tracking plan and review new events against it before they go to production. This is one of the core issues a Statsig audit identifies and corrects.
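A convention is easier to enforce when it is checked automatically. The sketch below validates names against a lowercase snake_case pattern and converts camelCase names such as addToCart. The function names are hypothetical, and a regex can only check the shape of a name, not whether it follows object_action word order; that still needs human review.

```python
import re

# Lowercase snake_case with at least two words, e.g. cart_item_added.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_valid_event_name(name):
    """True if `name` is lowercase snake_case with at least two words."""
    return bool(SNAKE_CASE.match(name))


def to_snake_case(name):
    """Convert camelCase or PascalCase names to snake_case."""
    # Insert an underscore at every lowercase-or-digit to uppercase boundary.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.replace("-", "_").replace(" ", "_").lower()
```

Running is_valid_event_name in CI against the tracking plan catches drift before a new event reaches production.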
5. Missing or Incorrect Event Properties
Events without properties are significantly less useful for experiment analysis. If your checkout_completed event does not include properties like order value, product category, or user type, you cannot segment experiment results by these dimensions or use them as metrics in more granular analyses.
Incorrect properties are equally problematic. A property that sometimes passes a string and sometimes passes a number for the same field, or a user ID property that is sometimes null, will produce inconsistent results in any analysis that depends on it.
Fix: define the required properties for every key event in your tracking plan before implementation. For experiment metrics, identify which properties you will need for segmentation and ensure they are always present and correctly typed. Use Statsig's event explorer to spot-check incoming events and verify that properties are populating as expected.
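One way to keep properties present and correctly typed is to validate events against the tracking plan before they are sent. This is a sketch only: the REQUIRED_PROPERTIES table and its property names are hypothetical examples based on the checkout_completed event mentioned above.

```python
REQUIRED_PROPERTIES = {
    # Hypothetical tracking-plan entry: property name -> accepted type(s).
    "checkout_completed": {
        "order_value": (int, float),
        "product_category": str,
        "user_type": str,
    },
}


def validate_event(name, properties):
    """Return a list of problems found in one event's properties."""
    schema = REQUIRED_PROPERTIES.get(name)
    if schema is None:
        return [f"{name}: not in tracking plan"]
    problems = []
    for prop, expected in schema.items():
        value = properties.get(prop)
        if value is None:
            problems.append(f"{name}.{prop}: missing or null")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}.{prop}: unexpected type {type(value).__name__}")
    return problems
```

An empty list means the event matches the plan; anything else can be logged or used to fail a test build.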
6. Incorrect Experiment Targeting and Exposure Rules
Statsig allows you to define who is eligible for an experiment through targeting rules, based on user properties, custom attributes, or environment conditions. Misconfigured targeting is one of the most common causes of experiments that produce inconclusive or misleading results.
Common mistakes include targeting rules that are too narrow (producing insufficient sample sizes), rules that accidentally exclude key user segments (creating a biased experiment population), or rules that are correctly configured in staging but not in production.
Fix: review your targeting rules against your intended experiment population before launch. Verify that the user properties used in targeting are being correctly passed to Statsig at the point of SDK initialization. Use Statsig's diagnostic tools to confirm that the experiment population matches your expectations after the first 24 hours of running.
7. Not Using Statsig's Holdout Groups Correctly
Statsig supports holdout groups: a percentage of users held back from all experiments for a set period in order to measure the cumulative impact of your experimentation program. Teams that run multiple experiments simultaneously without holdouts cannot accurately measure the combined effect of all their changes.
Teams either do not know holdouts exist or configure them incorrectly, setting holdout percentages too high (which reduces traffic available for experiments) or too low (which makes the holdout measurement statistically unreliable).
Fix: configure a global holdout group of 5-10% of your user base in the Statsig console. This gives you a control group that receives no experimental changes, allowing you to measure the aggregate revenue or engagement impact of your full experiment program over time. Do not use holdouts as a substitute for individual experiment control groups; they serve a different analytical purpose.
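For intuition about what a holdout does, the sketch below shows generic deterministic hash bucketing: the same user always lands in the same bucket, and a fixed share of buckets is held back. This is an illustration of the concept, not Statsig's internal assignment algorithm, and the salt value is a made-up example.

```python
import hashlib


def in_holdout(user_id, holdout_percent, salt="global_holdout"):
    """Deterministically place a stable share of users in the holdout.

    Generic hash bucketing for illustration; not Statsig's internal algorithm.
    """
    if not 0 <= holdout_percent <= 100:
        raise ValueError("holdout_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% resolution
    return bucket < holdout_percent * 100
```

Because assignment depends only on the salt and the user ID, a user stays in or out of the holdout across sessions and devices that share that ID.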
8. Not Enabling Sequential Testing
By default, Statsig uses fixed-horizon testing: you set a target sample size, run the experiment, and evaluate results at the end. The problem is that most teams peek at results before reaching the sample size and make decisions based on early data, which inflates false positive rates.
Statsig supports sequential testing, which adjusts the significance threshold dynamically as data accumulates and allows you to monitor results continuously without inflating false positive rates. Most teams either do not know this feature exists or have not enabled it.
Fix: enable sequential testing in your experiment configuration before launch. In the Statsig console, navigate to your experiment settings and select the sequential testing option. This is particularly valuable for teams running high-velocity experimentation programs where waiting for fixed-horizon results creates bottlenecks. The guide to avoiding false winners covers why this matters in practice.
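The cost of peeking is easy to demonstrate with a small A/A simulation, where both arms come from the same distribution and every significant result is therefore a false positive. The sketch below compares checking at every interim look against checking once at the end. It illustrates the problem only; it is not Statsig's sequential testing method, and all parameter values are arbitrary choices.

```python
import math
import random


def false_positive_rates(n_sims=400, n_per_arm=500, peek_every=50, z_crit=1.96, seed=7):
    """Simulate A/A tests and compare peeking against a single final look.

    Both arms draw from the same N(0, 1) distribution, so any significant
    result is a false positive. Known unit variance keeps the z-test simple.
    Assumes n_per_arm is a multiple of peek_every so the last peek is the final look.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        flagged_any = False
        z = 0.0
        for n in range(1, n_per_arm + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if n % peek_every == 0:
                # z-statistic for the difference in means with unit variance
                z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
                if abs(z) > z_crit:
                    flagged_any = True
        peeking_hits += flagged_any      # significant at any look
        final_hits += abs(z) > z_crit    # significant at the final look only
    return peeking_hits / n_sims, final_hits / n_sims
```

With ten looks at a nominal 5% threshold, the peeking rate typically lands well above 5%, which is the inflation that sequential testing is designed to control.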
9. Not Using CUPED for Variance Reduction
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses historical user behavior to reduce noise in experiment results. In practical terms, it makes your experiment results more reliable with smaller sample sizes, which means faster and more confident decisions.
Statsig supports CUPED natively, but it requires pre-experiment data to be available. Teams that have not been logging events consistently before running experiments cannot use CUPED effectively, and teams that have the data but have not enabled the feature are leaving statistical power on the table.
Fix: enable CUPED in your experiment analysis settings in Statsig. The feature uses pre-experiment metric data from the same users in your experiment to reduce variance in the treatment effect estimate. Ensure you have at least two weeks of pre-experiment event data for the metric you are using as your primary outcome before relying on CUPED results.
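The core CUPED adjustment fits in a few lines. The sketch below applies the standard textbook form, subtracting theta times the centered pre-experiment value from each user's outcome, with theta = cov(pre, post) / var(pre). The data is synthetic and the function name is hypothetical; Statsig's own implementation may differ in its details.

```python
import random
from statistics import mean, variance


def cuped_adjust(pre, post):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    `pre` is each user's pre-experiment metric value, `post` the in-experiment value.
    theta = cov(pre, post) / var(pre), estimated on the pooled sample.
    """
    x_bar, y_bar = mean(pre), mean(post)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov / variance(pre)
    return [y - theta * (x - x_bar) for x, y in zip(pre, post)]


# Synthetic data: in-experiment spend correlates with pre-experiment spend.
rng = random.Random(0)
pre = [rng.gauss(50, 10) for _ in range(2000)]
post = [0.8 * x + rng.gauss(10, 5) for x in pre]
adjusted = cuped_adjust(pre, post)

# Share of outcome variance removed by the adjustment.
reduction = 1 - variance(adjusted) / variance(post)
```

The adjustment leaves the mean unchanged and removes the share of variance explained by pre-experiment behavior, so the benefit depends on how strongly the pre-period metric correlates with the outcome.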
10. No Guardrail Metrics Configured
Guardrail metrics are secondary metrics that should not move negatively as a result of your experiment, even if the primary metric improves. A variant that increases checkout conversion but also increases page load time, support ticket volume, or refund rate has not produced a net positive result.
Teams that only configure a primary metric and no guardrails miss these tradeoffs entirely. They ship a variant that looks like a win and discover the downstream damage weeks later.
Fix: for every experiment, configure at least two guardrail metrics in addition to your primary metric. Guardrail metrics should sit both upstream and downstream of the primary metric. In Statsig, add these as secondary metrics in your experiment configuration and set the direction to "neutral or positive" -- any statistically significant negative movement in a guardrail metric should pause the shipping decision until investigated. This is a core part of running reliable A/B tests regardless of which experimentation tool you use.
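The decision rule above can be written down explicitly. In this sketch, a guardrail counts as violated when its confidence interval lies entirely on the harmful side of zero. The MetricResult structure and its field names are hypothetical and would need to be mapped to whatever your results export actually contains.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    ci_low: float            # lower bound of the confidence interval on the relative delta
    ci_high: float           # upper bound of the same interval
    higher_is_better: bool   # False for metrics like refund rate or page load time


def guardrail_violations(results):
    """Return names of guardrails with a statistically significant move in the bad direction."""
    violations = []
    for r in results:
        significantly_down = r.ci_high < 0
        significantly_up = r.ci_low > 0
        if (r.higher_is_better and significantly_down) or (
            not r.higher_is_better and significantly_up
        ):
            violations.append(r.name)
    return violations


def shipping_decision(results):
    """Hold the launch if any guardrail is violated, otherwise allow shipping."""
    violations = guardrail_violations(results)
    return ("hold", violations) if violations else ("ship", [])
```

A "hold" outcome means investigate before shipping, not reject the variant outright.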
The Pattern Behind All 10 Issues
Looking across these 10 issues, the pattern is consistent. None of them are caused by Statsig not working. They are caused by implementation decisions made without a clear understanding of best practices, often by teams moving fast under pressure to get experiments running.
The good news is that all 10 are fixable. A structured audit of your Statsig setup identifies which of these are present in your implementation and produces a prioritized remediation plan. Most fixes are straightforward once the issue is clearly identified.
See how we fixed these issues for Unravel
Unravel came to us needing Statsig set up correctly and fast. We audited their implementation, identified the misconfigurations, delivered a live best practices session, and handed over a complete findings document, all in two weeks.
👉Read the Unravel Statsig Audit case study
Think your Statsig setup might have issues?
Book a free 30-minute call and we will review your setup and tell you exactly what needs fixing.
FAQ
Why is my Statsig experiment data unreliable?
The most common causes of unreliable Statsig experiment data are incorrect SDK type for the use case, API keys not separated across environments, exposure events logged at the wrong point in the user journey, inconsistent event naming across platforms, and missing or incorrect event properties. A structured audit of your implementation identifies which of these are present and how to fix them.
How do I separate Statsig environments correctly?
Create separate API keys for development, staging, and production in your Statsig console. Ensure your deployment pipeline uses the correct key for each environment. Never use a production API key in a development or staging context -- doing so contaminates your production event data with internal traffic and test sessions.
What is the difference between sequential testing and fixed-horizon testing in Statsig?
Fixed-horizon testing requires you to define a target sample size before launch and evaluate results only after reaching it. Sequential testing adjusts the significance threshold dynamically as data accumulates, allowing you to monitor results at any point without inflating false positive rates. Sequential testing is the better choice for teams that need to check results regularly during an experiment.
What are guardrail metrics in Statsig and why do I need them?
Guardrail metrics are secondary metrics that should not move negatively as a result of your experiment. They catch cases where a variant improves the primary metric at the cost of something else -- for example, increasing conversion rate while also increasing page abandonment or refund rate. Configure at least two guardrail metrics for every experiment in addition to your primary metric.
How does CUPED work in Statsig?
CUPED uses pre-experiment data from the same users in your experiment to reduce variance in the treatment effect estimate. This makes results more reliable with smaller sample sizes, enabling faster and more confident decisions. It requires at least two weeks of pre-experiment event data for the metric being used as the primary outcome.
How do I fix inconsistent event naming in Statsig?
Establish a consistent naming convention -- such as object_action in snake_case -- and apply it across all platforms and teams. Document the convention in a shared tracking plan, review new events against it before production, and use Statsig's event explorer to audit existing events for inconsistencies. Rename events that do not conform to the convention and update all references in your experiment metric configurations.





