What is the process of using ANOVA to detect a difference among samples?

Answered by: Nick, An Expert in the Statistics and Research Category
Electronic Vaulting is one of Iron Mountain’s fastest growing lines. While technically quite sophisticated, the back-up solution is quite easy to understand. Electronic Vaulting, or EV for short, allows users to install a software program on their PC or server that automatically backs up the contents of the system on a daily or scheduled basis. The entire content of the system on which EV is installed is sent periodically over the Internet to a receiving “Vault server” that is hosted underground. In the event that the user’s systems fail due to a virus, employee sabotage, or some other type of disaster, the user can retrieve, via the same Internet connection, the latest version of the “backed-up system”.

Electronic Vaulting is a new approach to making sure a system is properly backed up. The more prevalent way is for a user to back up his/her system to a media tape and then to have this tape stored in a secure, off-site facility. Electronic Vaulting is popular with many users because data can be recovered more quickly. Moreover, the process is automated, requiring little involvement from the user.

Lead Qualification and Prioritization

The popularity of the EV system has thankfully created widespread interest in the solution. As a result, Iron Mountain has hundreds of prospects in the pipeline throughout the United States. One of the biggest concerns is to identify how much time should be spent trying to convert a lead before moving on to the next. Proper qualification of the lead is important because it will (1) allow the lead to be assigned to a properly qualified sales person and (2) give sales people a focus they can use to convert the lead in question.

Iron Mountain currently uses a process of classifying its leads according to the number of employees and locations the company has.

Iron Mountain has very general company segments for classifying its prospects and customers. The segments are:

Small and Medium Business (SMB) = Fewer than 100 employees and only one location

Multi-Location (ML) = Any number of employees and more than 1 location

Key = Greater than 100 employees and only one location

It is the current thought that more time should be spent converting bids from Multi-location companies because the deals are likely to be larger than those from small and medium sized business prospects. The current practice is that if a Multi-location company is a prospect, the sales person will spend more time developing the lead. The lead will be considered active for approximately 6 months. Moreover, the lead goes to a more senior sales person. An SMB lead, however, will only be considered active for 2 months and is handled by a more junior sales person. The thought is that because these firms have fewer locations and often fewer employees, the SMB prospect is worth less than a Multi-location lead.

Is the average Multi-location prospect significantly larger than the average SMB prospect?

A key question is to determine if Multi-location prospects are significantly larger than SMB prospects and thus warrant the current focus they receive.

Many people within the company are influenced by large one-time deals, which are much larger than the typical new deal. Most of the time these deals are converted from large Multi-location prospects. It seems that these deals receive too much focus, however, and are not representative of our typical business. It will be interesting to study these large prospects (if any occur in the sample) to see if the data points are true outliers from the other data.

Suggested tests to determine how prospect sizes compare

Because the Multi-Location (ML) segment can have a very large range in number of employees, we would be able to do a more sophisticated analysis by breaking this segment into two categories: (1) Small and Medium Multi-location (<100 employees) and (2) Large Multi-location (>100 employees).

With three groups, I can test to see if the mean prospect sizes are all the same or if they differ. In order to test my 3 groups, I suggest using an ANOVA to detect a difference - the Analysis of Variance Test.

While it is more sophisticated to test the 3 groups, I will only be able to conclude that the means differ or that they don’t. While this will be great information for Iron Mountain, if the means do, in fact, differ, people will be yearning for more information. Using ANOVA to detect a difference among samples would be a statistically appropriate test.
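As a rough sketch of what the ANOVA test computes, the F statistic compares the variation between the segment means with the variation inside each segment. The deal sizes below are made-up numbers for illustration only, and the function name is my own; in practice a statistics package (for example `scipy.stats.f_oneway`) would also report the p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical projected deal sizes (illustrative values, not real data).
smb = [4.0, 5.0, 6.0]
small_ml = [6.0, 7.0, 8.0]
large_ml = [8.0, 9.0, 10.0]

f_stat, df_b, df_w = one_way_anova_f([smb, small_ml, large_ml])
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom suggests that at least one segment mean differs from the others; it does not say which one.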

If I find that the means of the 3 groups do, in fact, differ, I then think it would be appropriate to follow up with a t-test to compare two of the groups. Specifically I would use a Two-sample T-Test for the difference between the means of two independent groups if it seems appropriate.

If I do find out that there is a difference between the means of 2 groups, I will certainly construct a 95% Confidence Interval that shows the range of how much larger the mean of one segment is than the other.
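The follow-up comparison can be sketched as below. This is only an illustration under some assumptions: it uses the unpooled standard error (the two groups are not assumed to have equal variances), and it uses a Normal critical value in place of the exact t critical value, which is a reasonable approximation only when each group is fairly large, such as the roughly 40 prospects per segment I am targeting. The data and function name are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_of_means_interval(a, b, confidence=0.95):
    """Return (difference, standard error, (low, high)) for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    # Unpooled standard error of the difference between two independent means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Normal approximation to the t critical value (assumes large samples).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, se, (diff - z * se, diff + z * se)


# Hypothetical projected deal sizes (illustrative values, not real data).
large_ml = [8.0, 9.0, 10.0, 11.0, 12.0]
smb = [4.0, 5.0, 6.0, 7.0, 8.0]

diff, se, (low, high) = diff_of_means_interval(large_ml, smb)
t_stat = diff / se  # test statistic for the difference between the means
print(diff, se, t_stat, low, high)
```

If the interval does not contain zero, the two segment means differ at the chosen confidence level, and the interval itself shows how much larger one segment's mean is likely to be.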

Data Collection

In order to obtain the data necessary for analysis, I will contact our 20 sales people and ask them to send me their current prospects in progress, projected deal size, and a few details about each prospect such as total number of employees and number of locations. I plan to contact sales people from all regions across the United States. I know that each sales person typically has about 10-15 current Electronic Vaulting prospects. I expect to get timely responses back from roughly 15 sales people.

Sample Size

I would like to get a large enough sample that I can assume that my models are safe to use even with slightly skewed data. To this end, my target is to have 40 prospects in each of the 3 segments for a total of 120 prospects. From past experience with the sales team, I believe that there are currently 200 active Electronic Vaulting prospects.

I recently completed a large survey for which I needed data from sales people. For this recently completed project I received data back in a timely manner from approximately 75% of the team. Hence if my assumptions hold I should be able to receive the following number of data points:

20 sales people * 10 current prospects * .75 = 150 prospects
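The same estimate, written out as a small calculation so the inputs can be changed easily. The variable names are my own; the figures are the ones given above.

```python
sales_people = 20          # sales people I will contact
prospects_per_person = 10  # low end of the typical 10-15 current prospects
response_rate = 0.75       # share of the team that responded on my last project

expected_prospects = sales_people * prospects_per_person * response_rate
target_prospects = 3 * 40  # 40 prospects in each of the 3 segments

print(expected_prospects, target_prospects)
```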

Whatever number of total prospects that I obtain, I would like to try to have an equal number in each group or close to an equal number although I know that this is not a necessary condition.

The number of 120-150 prospects is really the maximum amount of data that I will be able to obtain given the time frame. Moreover, there are most likely not more than 200 active prospects available throughout the United States.

At the moment, I am not even sure that I will be completing a confidence interval for the difference between two groups. However, assuming I do, an outsider may conclude that my margin of error is large. After discussing this with senior sales executives at my firm, they seemed to be very comfortable with the margin of error that my projected sample sizes will yield. This analysis is really the first of its kind, and we will make it a goal to add to the data and build a larger sample size if the analysis proves useful. The important point is that my firm feels that a margin of error of .05 will still yield meaningful results.

Ensuring a Random Sample

One of my biggest goals with this project is to make sure I am making comparisons of prospects that truly are representative of what sales people might experience. To this end I will put some special controls on the data collection such as:

3 Month Active Limit: Only send prospects that have been active for at most 3 months. Some sales people may only want to send in their largest deals. Because these deals really only come up once every 6 months, on average, restricting the data to deals that have been active for at most 3 months will make it harder to send in only a bunch of large deals.

Regional Representation: While we have not identified any variation in electronic vaulting deals across the country, I still want to make sure that our 4 regions: West, Midwest, Southeast, and Northeast are fairly represented. As the data comes in I will keep track of how many prospects are coming from each area and make sure that the regions are as equal as possible. If one region is becoming over represented I will eliminate some prospects or put more pressure on the sales person from an under represented region to send in their data. It should be noted that I think it will be relatively easy to obtain equal representation as the business currently is almost equal among all four regions.

“Blind-Data Collection”: In order to avoid having sales people send me only the deals they want to, I will not tell them about my experiment. I may give them some other type of project description, but I will not tell them the goals of this test.

I think by following the blind data collection and regional representation goals, I will be able to get a representative list of prospects. However, just for a challenge, I thought I would try to add a bit more randomization to my data collection.

Once I get a list of current prospects, I will make up cards and write down each detail about the deal on a recipe card. Each card will contain:


     Company Name

     Total Company Employees

     Total Locations

     Iron Mountain Segment

     Prospective Deal Size

I will then sort the cards into the three segments. I believe that each segment will have roughly the same number of cards. I will also thoroughly shuffle the cards to make sure there is no order concerning the regions (such as the majority of the cards from the Northeast coming at the beginning).

After the cards are sorted into the three groups and shuffled thoroughly, I will select the cards for my analysis. In order to select the cards I will use a random number table. If the random number is between 0 and 4, I will select the card (containing the prospect data) for my analysis; if the random number is between 5 and 9, the card will not be selected.

For example for my first 10 cards

Card Position    1  2  3  4  5  6  7  8  9  10
Random Number    9  6  2  9  9  0  7  1  9  8

Table Page (A49)

From this example I would select cards 3, 6, and 8. I would repeat this sampling process until I had about 40 prospects in each segment.
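The selection rule can be sketched in a few lines, assuming a card is kept when its random digit is between 0 and 4 (the rule that matches the example, where cards 3, 6, and 8 are chosen). The function name is hypothetical.

```python
def select_cards(random_digits):
    """Return the 1-based card positions whose random digit is 0-4."""
    return [
        position
        for position, digit in enumerate(random_digits, start=1)
        if 0 <= digit <= 4
    ]


# Random digits for the first 10 cards, taken from the example above.
digits = [9, 6, 2, 9, 9, 0, 7, 1, 9, 8]
selected = select_cards(digits)
print(selected)  # [3, 6, 8]
```

Because each digit is equally likely, each card has about a 50% chance of being selected, so roughly twice as many cards as the target sample size are needed in each segment.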

Meeting the Assumptions

In order to use the two tests that I propose it will be very important that the data satisfies the assumptions for each test. The following outlines the assumptions for each test that are necessary to be met and how I will ensure that my data meets them.

Using ANOVA to detect a difference among samples (Conditions)

Even before I begin this test, I will make sure to complete side by side box plots of the 3 groups to look for outliers and the spread of the different data groups.
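As a sketch of how the box plots flag outliers, the snippet below computes the quartiles for one group and applies the usual 1.5 x IQR fences. The deal sizes are made up, and the quartile method chosen here (`method="inclusive"`) is an assumption; different software may compute quartiles slightly differently.

```python
from statistics import quantiles


def boxplot_outliers(values):
    """Return (q1, q3, outliers) using the 1.5 * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, outliers


# Hypothetical deal sizes for one segment, with one very large deal.
deal_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
q1, q3, outliers = boxplot_outliers(deal_sizes)
print(q1, q3, outliers)
```

Running this once per segment gives the same information as reading the side by side box plots: the quartiles show the spread of each group, and any flagged values are candidates for the large one-time deals discussed earlier.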

Independence Assumption:

Check and Verify Independence of data within group: This is probably the most important thing to think about. Theoretically, two deals could come from different divisions within the same company. As a result, if one division signed up, it might have a great influence on how likely it would be for the other prospect to become an actual customer. Moreover, if a prospect is just a branch within a larger firm that is a customer already, it would certainly be more likely to become a customer than a firm that had no existing relationship with Iron Mountain for Electronic Vaulting. To avoid this, I will NOT accept multiple prospects from the same organization, nor will I accept prospects within existing customers. Sales people will be instructed only to send data on truly NEW prospects. I will rigorously check these prospects to make sure the company name is not within our existing database of customers. The randomization design for choosing the prospects for my analysis is a good additional check to help make sure the data points are independent.

Check and verify Independence of groups: This is really the less important assumption because the segments most definitely are independent of each other. Moreover, companies almost never move out of the segment they are classified in. My randomization design should also help ensure that these groups are independent.

Equal Variance Assumption

Check and verify similar Variance Condition: I will check this by looking at side by side box plots of the groups to see if they have similar spreads. I could also look at residuals plotted against the predicted values.
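A simple numeric companion to the box plots is to compare the standard deviations of the groups directly. The sketch below uses made-up data; the idea that a largest-to-smallest ratio of about 2 or less is acceptable is only a commonly cited rule of thumb, not a formal test.

```python
from statistics import stdev


def spread_ratio(groups):
    """Return the largest group standard deviation divided by the smallest."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)


# Hypothetical deal sizes for two segments (illustrative values only).
segment_a = [1.0, 2.0, 3.0]
segment_b = [2.0, 4.0, 6.0]

ratio = spread_ratio([segment_a, segment_b])
print(ratio)
```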

Normal Population Assumption

Check and verify nearly Normal Condition: I wanted to try to get a relatively large sample so that I could rely on the sampling distribution of the group means approaching the Normal model. As mentioned previously, I am aiming to have at least 40 prospect data points for each segment. Nevertheless, there is a possibility that I will not meet this number for one or all segments. Moreover, instead of just assuming normality and moving on, I believe it makes sense to take a look at the data as long as the tests are relatively straightforward. Personally, I think it will be interesting to see how the data is distributed. Even if I do have a sample of 40 or more in each segment, I will still do the following checks:

--Box Plots: While I have mentioned box plots many times, I want to reiterate them here, as they are a helpful tool for looking at how skewed the distributions are and whether there are any outliers. I will most likely have quite a few observations, so I will be able to detect outliers. I will do Box Plots for each group.

--Normal Probability Plot: I will also do a normal probability plot of all the residuals together.

--Histogram: I will also do a histogram of the residuals to see if the distribution is reasonably symmetric.
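The residual checks can be sketched as follows, assuming a residual is each deal size minus the mean of its own segment. The plotting position `(i + 0.5) / n` used for the theoretical quantiles is one common convention and is my assumption; the data and function names are made up.

```python
from statistics import NormalDist, mean


def pooled_residuals(groups):
    """Return every observation minus its own group mean, pooled together."""
    residuals = []
    for group in groups:
        group_mean = mean(group)
        residuals.extend(x - group_mean for x in group)
    return residuals


def normal_plot_points(values):
    """Return (theoretical Normal quantile, observed value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    return [
        (standard_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


# Hypothetical deal sizes for two segments (illustrative values only).
groups = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
points = normal_plot_points(pooled_residuals(groups))
for quantile, residual in points:
    print(round(quantile, 3), residual)
```

If the points fall close to a straight line when plotted, the residuals are reasonably close to Normal; strong curvature or a few points far off the line would suggest skew or outliers.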

Two-sample T-Test for the difference between the means of two independent groups

***My approach would be to contact our 20 sales people and ask them to send me their current deals in progress and projected size, as well as what type of company the prospect is (SMB or Multi-location). I will plan to survey sales people from all regions across the United States. Once I get a list of about 150 current deals in progress (the range of responses I could get may be from 100 to 150), I will make up cards and write down each detail about the deal on a recipe card. I will number the cards on the back and shuffle them thoroughly. I will then use a random number generator to pick about 100 total cards (I should have roughly 50 SMBs and 50 MLs, although it could be 60-40).

***I am not completely sure of all the tests that would need to be done, but I think the main one would be:

1.) To test the difference of the means between the SMB and ML deal size (independent samples) ****I think this would be the most likely test

2.) I could possibly have 3 company categories to look at if we had SMB, small ML and large ML based on employee size ----In this case, I could try to look at the ANOVA test. This statistics research method is an appropriate method to test for a difference.
