One of the fundamental tenets of creating better customer experiences (CX) is embracing continuous change and measuring outcomes you hope to influence through change.
In the contact center this often means creating pilot programs with one set of agents as your control group and another set of agents, using the new behaviors, as your test group. Often, we see companies dismiss potentially meaningful change because their pilots didn't show the results they wanted, or run wildly successful pilots that don't translate into real-world results.
The underlying cause of these two issues is often a poor or incomplete understanding of the application of statistical principles to run a successful pilot.
DISCLAIMER: the following post is by no means an exhaustive treatment of the statistical techniques used to measure pilot performance, but it covers the fundamental aspects, and common mistakes, of a statistical technique often used in pilot programs: the t-test.
Let's start with the basics. What's a t-test?
A t-test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups.
In layman's terms, we are comparing the average performance of one group against another to see if the differences between groups are meaningful and not an "accident". The tricky part becomes understanding what type of t-test to run.
What are the types of t-tests and how do I choose?
Rule #1 - t-tests are only meant to measure the difference between two groups. If you want to measure differences between more than two groups, you need another type of test.
A t-test can be one-tailed (testing whether a change happened in a specific direction) or two-tailed (testing for any difference, in either direction). Since the purpose of a pilot is to measure improvement (i.e. we want to see if something increased or decreased in a way that is beneficial to our business), a one-tailed test is the appropriate choice.
In addition, you have to determine what type of t-test to use. The two most commonly used in the contact center are independent samples t-tests and paired samples t-tests.
What's an independent samples t-test and when should I use it?
An independent samples t-test is when we are comparing differences between two distinct groups. In practice, this is selecting your "control" group of agents who will use the same methods, systems, and techniques you have been using in your day-to-day operations and then selecting a second group of agents to be the "guinea pigs" for your new process.
This is the most common method used in the contact center, but there are several things to be careful of before making this your default choice. One of the most common mistakes we see is priming the pilot group with your best agents because they will "pick it up the fastest" when it comes to new systems and procedures.
This leads to one of the common issues in pilot programs: pilot results that don't materialize in the real world even though the pilot was a success. You have already selected a pilot group that outperforms the control group, so any measured improvement may reflect the skill of the agents you selected rather than the change itself.
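Once the pilot has run, the independent samples comparison can be sketched in a few lines of Python using SciPy. The agent counts and AHT values below are made up for illustration:

```python
# Minimal sketch: one-tailed independent samples t-test comparing a
# hypothetical pilot group's AHT (seconds) against a control group.
from scipy import stats

# Hypothetical per-agent average handle times in seconds.
control = [345, 360, 352, 341, 358, 349, 355, 362, 347, 351]
pilot = [318, 331, 305, 322, 310, 327, 315, 309, 324, 312]

# One-tailed test: is the pilot group's mean AHT *lower* than control's?
result = stats.ttest_ind(pilot, control, alternative="less")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

Note that `ttest_ind` assumes equal variances by default; if your groups' variances differ noticeably, passing `equal_var=False` runs Welch's t-test instead.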
How do I select my pilot group?
There are a few common methods used to select pilot groups:
- Random sampling is considered one of the best methods to ensure that your sample is representative of the entire population. In this method, every individual or element in the population has an equal chance of being selected.
Simple random sampling can be achieved using random number generators, drawing names from a hat, or other randomization techniques.
- Stratified sampling is another popular technique. This is where we divide the population (all your agents that could be affected by the change you are testing) into distinct subgroups based on certain characteristics and then conducting random sampling in each subgroup.
You may want to create subgroups by shift, experience level, types of calls handled, etc. The goal of stratification is to ensure appropriate representation across your population where a random sample may leave people out because they represent a smaller part of the population (e.g. overnight shift agents in contact centers that run small shifts due to low call volumes at that time).
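Both selection methods can be sketched with Python's standard library alone. The agent names, shift mix, and 30% sampling fraction below are hypothetical:

```python
# Minimal sketch: simple random sampling and stratified sampling of a
# hypothetical 100-agent contact center.
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the draws are reproducible

# --- Simple random sampling: every agent has an equal chance ---
agents = [f"agent_{i:03d}" for i in range(1, 101)]  # 100 hypothetical agents
pilot = random.sample(agents, k=30)                 # 30 agents, no replacement
control = [a for a in agents if a not in pilot]

# --- Stratified sampling: sample the same fraction from each shift ---
# Assume agents 1-6 work the small overnight shift, the rest work days.
shifts = {f"agent_{i:03d}": ("overnight" if i <= 6 else "day") for i in range(1, 101)}
strata = defaultdict(list)
for agent, shift in shifts.items():
    strata[shift].append(agent)

stratified_pilot = []
for shift, members in strata.items():
    # Sample 30% of each stratum; max(1, ...) guarantees even the
    # smallest shift is represented.
    k = max(1, round(len(members) * 0.3))
    stratified_pilot.extend(random.sample(members, k))

print(len(pilot), len(control), len(stratified_pilot))
```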
How do I know if I sampled properly?
First you must ensure that you select enough people for your sample. You can find sample size calculators online. These calculators take three primary inputs: population size, confidence level (95% is considered standard), and margin of error (5% is standard).
Population size is very straightforward. This is the number of agents that would be impacted if you rolled out the changes in your pilot program to "everyone".
Confidence level and margin of error are better understood together. You probably are most familiar with these during elections when exit poll results are shared. You will often see exit polls saying candidate A is leading candidate B by 3% points and hear the coverage note that this is within the margin of error with a 95% confidence interval. So what does all that mean?
In simple terms, it means we can be 95% confident that the poll accurately reflects the outcome of the election. The margin of error represents how far the reported numbers may stray from the true values (the industry standard is +/- 5%).
If you are a fan of candidate A in our example above, you shouldn't celebrate just yet because the 3% points are within our margin of error. If candidate A had an 8% point lead you can feel much more confident popping that bubbly.
So what does this mean for our pilot group selection process? If you want to be really sure that the results of your pilot will stick, go with the sample size calculated using a 95% confidence level and a 5% margin of error.
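If you'd rather not rely on an online calculator, the standard formula behind most of them (Cochran's formula with a finite population correction, assuming the conventional p = 0.5 for maximum variability) is easy to sketch:

```python
# Minimal sketch: minimum sample size from population size, confidence
# level, and margin of error (the same inputs online calculators ask for).
import math

def sample_size(population, confidence_z=1.96, margin=0.05, p=0.5):
    """Minimum sample size: z = 1.96 for a 95% confidence level,
    margin = 0.05 for a +/-5% margin of error."""
    # Cochran's formula for an infinite population...
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    # ...adjusted downward for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# e.g. a hypothetical 200-agent contact center at 95% confidence, +/-5% margin:
print(sample_size(200))   # 132
```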
Now that you have selected the right size pilot group there are a few more sanity checks to go through prior to launching your pilot:
- Check that your control and pilot groups have a normal distribution of data across the metric you hope to improve (we all know this as the bell curve).
- Check that you have a similar amount of variance in each group (you can calculate this in Excel).
- Check that the mean value of your success measure is roughly the same in each group (i.e. if we want to see if AHT is improving, we don't want group A's mean AHT to be 5m45s and group B's to be 10m34s).
If you are failing the tests above, consider resampling or using stratified sampling if you used a random sample to select your pilot group.
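A minimal sketch of those three sanity checks in Python, using SciPy's Shapiro-Wilk test for normality; the variance-ratio and mean-gap thresholds below are illustrative rules of thumb, not fixed standards:

```python
# Minimal sketch: pre-launch sanity checks on hypothetical AHT data.
import statistics
from scipy import stats

control = [345, 360, 352, 341, 358, 349, 355, 362, 347, 351]
pilot = [348, 356, 350, 344, 359, 346, 354, 361, 343, 352]

# 1. Normality: a Shapiro-Wilk p-value above 0.05 means we cannot
#    reject the assumption that the data is normally distributed.
_, p_control = stats.shapiro(control)
_, p_pilot = stats.shapiro(pilot)
normal_ok = p_control > 0.05 and p_pilot > 0.05

# 2. Similar variance: ratio of larger to smaller sample variance
#    (a ratio under ~4 is a common rule of thumb).
v1, v2 = statistics.variance(control), statistics.variance(pilot)
variance_ok = max(v1, v2) / min(v1, v2) < 4

# 3. Similar means: starting means within ~10% of each other.
m1, m2 = statistics.mean(control), statistics.mean(pilot)
means_ok = abs(m1 - m2) / m1 < 0.10

print(normal_ok, variance_ok, means_ok)
```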
Should you use a paired samples t-test?
Paired-samples t-tests measure the performance differences between the same group of agents at two points in time. At first glance, this seems to solve many of the problems associated with independent samples t-tests since we are measuring the same agents before and after the change but it comes with some unique challenges.
Pro tip: You still need to ensure you select a large enough sample so you can't skip using the sample size calculator.
The biggest challenge is ensuring that your pilot program changes are actually behind any performance changes. In our independent samples t-test both our control group and pilot group are being measured under similar conditions (i.e. same types of calls at the same time of year, etc.).
In our paired samples t-test conditions other than our pilot changes may be impacting performance since we are measuring our results in a different time frame than our "control" period. If you have seasonal variance in the metric you are trying to improve, this may not be a good fit.
You should also try to control for other changes during your post change measurement period. For example, having your post change measurement period coincide with a new product or service launch (which typically tanks call center metrics since you get more calls, agents are still learning, etc.) is setting you up for a scenario where you may throw out meaningful changes because your pilot didn't perform as well as you wanted.
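The paired version looks nearly identical in code; the only change is that both lists describe the same agents, in the same order, at two points in time. The AHT values below are made up for illustration:

```python
# Minimal sketch: one-tailed paired samples t-test on the same ten
# hypothetical agents' AHT (seconds) before and after the change.
from scipy import stats

before = [345, 360, 352, 341, 358, 349, 355, 362, 347, 351]
after = [330, 348, 335, 329, 342, 333, 340, 351, 331, 337]

# One-tailed test: did AHT go *down* for the same agents after the change?
result = stats.ttest_rel(after, before, alternative="less")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```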
How long should I run a pilot program for?
In general, you want to see stability in the metric you want to improve in your pilot group's performance, and then add 30 days. If your pilot group's FCR rate is improving every week (because they are getting used to new processes and systems) before plateauing around day 30, measure your pilot group's performance from day 30-60 and then you can safely perform your t-test.
How do I know if it worked?
You can use Excel or a number of online tools to calculate your p-value. YouTube is a great resource for learning how to use these tools.
Remember the key decisions we covered when selecting the type of test to run, as you will need to select these in your tool of choice. Pick the appropriate test type (paired, or independent, sometimes called two-sample) and look at the output generated for a one-tailed test (since we want to measure improvement; two-tailed tests just tell us whether things are statistically different without regard to positive or negative change).
If your p-value is less than 0.05, you can reject the null hypothesis (the assumption that there is no real difference between groups) and celebrate running a pilot that produced real results to improve your business.
If your p-value is equal to or greater than 0.05, we fail to reject the null hypothesis: any measured change, even if it was positive, may not be related to your pilot but due to normal variance in your data and performance. This doesn't mean you should throw your pilot out. While 0.05 is considered the gold standard for research, you can "accept" results with higher p-values.
A p-value tells us the probability of witnessing the result through natural variance alone, so a p-value of .0001 means there is a one-in-10,000 chance of observing your pilot results "in the wild", which means the result is very unlikely to be caused by natural variance.
For the contact center, you may want to use a p-value threshold of 0.10, since this tells us there is only a 10% chance that your pilot results are part of the natural ebbs and flows of your day-to-day operations rather than the changes you made.
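The decision rule can be sketched as a tiny helper, with the 0.05 default and the looser 0.10 threshold as an option:

```python
# Minimal sketch: turning a one-tailed p-value into a pilot decision.
def interpret(p_value, alpha=0.05):
    """Compare the p-value against a significance threshold (alpha)."""
    if p_value < alpha:
        return "reject the null: the improvement is statistically significant"
    return "fail to reject the null: the change may be normal variance"

print(interpret(0.012))              # significant at the 0.05 standard
print(interpret(0.08))               # not significant at 0.05...
print(interpret(0.08, alpha=0.10))   # ...but acceptable at the looser 0.10
```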
Want help improving your contact center performance and your customer experience by selecting the right tools to pilot and making sure your pilot is run right? Contact us here.