
When Should You Mistrust Statistics?


Figures lie, so they say, and liars figure. A recent post at Ben Orlin's always-amusing mathwithbaddrawings.com blog nicely encapsulates why so many people feel wary about anything related to statistics and data analysis. Do take a moment to check it out; it's a fast read.

ask about the mean

In all of the scenarios Orlin offers in his post, the statistical statements are completely accurate, but the person offering the statistics is committing a lie of omission by not putting the statement in context. Holding back critical information prevents an audience from making an accurate assessment of the situation.

Ethical data analysts know better.

Unfortunately, unethical data analysts know how to spin outcomes to put them in the most flattering, if not the most direct, light. Done deliberately, that's the sort of behavior that leads many people to mistrust statistics completely.

Lessons for People Who Consume Statistics

So, where does this leave us as consumers of statistics? Should we mistrust statistics? The first question to ask is whether we trust the people who deliver statistical pronouncements. I believe most people try to do the right thing.

However, we all know that it's easy—all too easy—for humans to make mistakes. And since statistics can be confusing, and not everyone who wants or needs to analyze data is a trained statistician, great potential exists for erroneous conclusions and interpretive blunders.

Bottom line: whether their intentions are good or bad, people often cite statistics in ways that may be statistically correct, but practically misleading. So how can you avoid getting fooled?

The solution is simple, and it's one most statisticians internalized long ago, but it doesn't necessarily occur to people who haven't spent much time in the data trenches:

Always look at the underlying distribution of the data.

Especially if the statistic in question pertains to something extremely important to you—like mean salary at your company, for example—ask about the distribution of the data if those details aren't volunteered. If you're told the mean or median as a number, are you also given a histogram, boxplot, or individual value plot that lets you see how the data are arranged? My colleague Michelle Paret wrote an excellent post about this. 

If someone is trying to keep the distribution of the data a mystery, then the ultimate meaning of parameters like mean, median, or mode is also unknown...and your mistrust is warranted.

Lessons for People Who Produce Statistics

As purveyors and producers of statistics, who need to communicate results with people who aren't statistically savvy, what lessons can we take from this? After reading the Math with Bad Drawings blog, I thought about it and came up with two rules of thumb.

1. Don't use statistics to obscure or deflect attention from a situation.

Most people do not deliberately set out to distort the truth or mislead others. Most people would never use the mean to support one conclusion when they know the median supports a far different story. Our conscience rebels when we set out to deceive others. I'm usually willing to ascribe even the most horrendous analysis to gross incompetence rather than outright malice. On the other hand, I've read far too many papers and reports that torture language to mischaracterize statistical findings.

Sometimes we don't get the outcomes we expected. Statisticians aren't responsible for what the data show—but we are responsible for making sure we've performed appropriate analyses, satisfied checks and assumptions, and that we have trustworthy data. It should go without saying that we are ethically compelled to report our results honestly, and...

2. Provide all of the information the audience needs to make informed decisions.

When we present the results of an analysis, we need to be thorough. We need to offer all of the information and context that will enable our audience to reach confident conclusions. We need to use straightforward language that helps people tune in, and avoid jargon that makes listeners turn off.

That doesn't mean that every presentation we make needs to be laden with formulas and extended explanations of probability theory; often the bottom line is all a situation requires. When you're addressing experts, you don't need to cover the introductory material. But if we suspect an audience needs some background to fully appreciate the results of an analysis, we should provide it. 

There are many approaches to communicating statistical results clearly. One of the easiest ways to present the full context of an analysis in plain language is to use the Assistant in Minitab. As many expert statisticians have told us, the Assistant doesn't just guide you through an analysis, it also explains the output thoroughly and without resorting to jargon.

And when statistics are clear, they're easier to trust.

 

Bad drawing by Ben Orlin, via mathwithbaddrawings.com

 


Taking a Stratified Sample in Minitab Statistical Software


The Centers for Medicare and Medicaid Services (CMS) updated their star ratings on July 27. It turns out the list of hospitals is a great way to look at how easy it is to get random samples from data within Minitab.

Roper Hospital in Charleston, South Carolina

Say, for example, that you wanted to look at the association between the government’s new star ratings and the safety rating scores provided by hospitalsafetyscore.org. The CMS score is about overall quality, which includes components that aren't explicitly about safety, such as the quality of the communication between patients and doctors.

The safety score judges patient safety, using components like how often patients begin antibiotics before surgery and whether the process by which doctors order medications is reliable.

The CMS score gives out 1 to 5 stars. The safety score gives out A through F grades. The two measures aren't supposed to be duplicates, but it would be interesting to know whether there's an association between being a safer hospital and being a higher-quality hospital.

The government, kindly, provides the ability to download all 4,788 rows of data in their star ratings, but hospitalsafetyscore.org prefers to provide information by location so that potential patients can quickly examine hospitals near them or find a particular hospital. To compare the star ratings and the safety scores, we need both values.

One solution would be to search hospitalsafetyscore.org for the names of all 4,788 hospitals in the government’s database and record all the scores we found. (Though even if we did this, we wouldn't find all of them. For example, hospitals in Maryland aren't required to provide the data hospitalsafetyscore.org uses.) However, searching 4,788 hospitals is time-consuming.

A faster solution is to study the relationship using a sample of the data. We’ll use the government’s star score data as our sampling frame.

 A simple random sample

It’s easy to get a simple random sample in Minitab. If you already have the government's star data in Minitab, you can try this (or, you can skip getting it from the government and use this Minitab worksheet version I created):

  1. Choose Calc > Random Data > Sample From Columns.
  2. In Number of Rows to Sample, enter 50.
  3. In From columns, enter c1-c29. That lets you get all of the information from a row of data into your new sample.
  4. In Store samples in, enter c30-c58. Click OK.
  5. Copy the column headers from the original data to the sample data.

Now you have a sample of 50 hospitals, chosen so that every row in the original data set was equally likely to be selected.
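
If you prefer scripting to menus, here's a rough equivalent of that simple random sample in Python with pandas. The CSV file name is my assumption for illustration, not part of the original workflow.

```python
import pandas as pd

# Load the CMS star-rating data (file name is hypothetical).
hospitals = pd.read_csv("hospital_star_ratings.csv")

# Draw a simple random sample of 50 rows; every row is equally likely
# to be chosen. random_state makes the draw reproducible.
sample = hospitals.sample(n=50, random_state=1)
print(sample.head())
```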

A stratified sample

Of course, every simple random sample that you draw might not give you something representative, especially if your sample is small. For example, in the government’s star rating, only 2.82% of hospitals achieved 5 stars (102 hospitals). Even worse, nearly 25% of the hospitals in the data don't have a star rating (1,171 hospitals with no star rating).

If we do a hypergeometric probability calculation on a sample of size 50, assuming 102 events in a population of 3617, we find that roughly 25% of the random samples we could take would have 0 hospitals that achieved 5 stars. A simple random sample without any 5-star hospitals could tell us about the general association, but wouldn’t give us much information about the expected safety ratings for hospitals that achieved a 5-star rank.
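
If you'd like to check that figure yourself, here's a quick sketch of the hypergeometric calculation in Python with SciPy, using the same counts given above.

```python
from scipy.stats import hypergeom

M = 3617   # rated hospitals in the sampling frame
n = 102    # hospitals that achieved 5 stars
N = 50     # sample size

# Probability that a simple random sample of 50 contains no 5-star hospitals
print(hypergeom.pmf(0, M, n, N))    # roughly 0.24

# The same probability for a sample of 100
print(hypergeom.pmf(0, M, n, 100))  # roughly 0.06
```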

One way to fix the problem would be to take a larger simple random sample. If you take a sample of size 100 instead of a sample of size 50, then the probability that you don’t get any 5-star hospitals is almost down to 5%. Another method would be to modify your sampling scheme to make sure that you get some hospitals from every ranking into your sample. Usually, you break your sampling frame down into different groups, or strata. Then you take a simple random sample from each stratum. At the end, you combine your multiple simple random samples to form your final sample.

The exact way that you determine how many observations to take from each stratum depends on your goals, but let’s say that for this case, we’re going to get 10 hospitals for each star rating. We start by dividing the data:

  1. Choose Data > Split Worksheet.
  2. In By variables, enter ‘Hospital overall rating’. Click OK.

Now, we have separate worksheets with the hospitals that achieved each number of stars. We repeat the simple random-sampling process on each worksheet so that we have a sample of 10 from each ranking.

Now we want to combine those samples from the different star rating data.

  1. Choose Data > Stack Worksheets.
  2. Move the worksheets with the star rating data from Available Worksheets to Worksheets to stack.
  3. Name the new worksheet and click OK.

If you’d like the worksheet to be just your final sample, you can go one step further.

  1. Choose Data > Copy > Columns to Columns.
  2. In Copy from Columns, enter c29-c58.
  3. Name the new worksheet.
  4. Click Subset the data.
  5. Select Rows that match and click Condition.
  6. In Condition, enter c42 <>’*’. Click OK in all 3 dialog boxes.

Now you have a worksheet with 50 hospitals, 10 for each star rating.
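
For comparison, here's a minimal Python sketch of the same stratified sampling idea, assuming the CMS export has been read into a pandas DataFrame; the file name is hypothetical, and the column name comes from the dialog above.

```python
import pandas as pd

# Load the CMS star-rating data (file name is hypothetical).
hospitals = pd.read_csv("hospital_star_ratings.csv")

# Take 10 hospitals from each star-rating stratum and combine them.
# Rows with no star rating are excluded, because groupby drops missing keys.
stratified = (
    hospitals
    .groupby("Hospital overall rating")
    .sample(n=10, random_state=1)
)

print(stratified["Hospital overall rating"].value_counts())
```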

Hospital Data

At hospitalsafetyscore.org, I was able to find safety ratings for 30 of the hospitals in my sample of hospitals with government star ratings. I have a little bit of concern because I was more likely to find safety ratings for hospitals with lower star ratings than for those with higher star ratings, but I did find at least 4 hospitals in each category. Because I'm interested in the relationship between the scores and not in evaluating individual hospitals, I can proceed with my smaller sample size to see if I can get a rough idea about the relationship.

My sample data suggest a relationship between the safety score and the star rating from the government. If we treat the variables as ordinal, the Spearman's rho that measures their correlation is about 0.73 and significantly different from 0. We would not expect perfect agreement because the two ratings are intended to measure different constructs. Still, in the stratified sample, we can see that no 1-star hospital achieved a safety score better than a C and that no 5-star hospital had a safety rating less than a B.
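
If you want to try the correlation calculation outside Minitab, here's a small Python sketch using SciPy's Spearman correlation. The coded values below are invented for illustration (safety grades coded A=5 through F=1); they are not my actual sample.

```python
from scipy.stats import spearmanr

# Hypothetical ordinal codings for ten hospitals:
# star rating 1-5, and safety grade coded A=5, B=4, C=3, D=2, F=1.
stars  = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
safety = [2, 3, 2, 3, 4, 4, 5, 4, 5, 5]

rho, p_value = spearmanr(stars, safety)
print(f"Spearman's rho = {rho:.2f}, p-value = {p_value:.4f}")
```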

As the overall rating from the government increases, so does the safety score.

Ready for more on Minitab? Read about the role Minitab played in helping Akron Children's Hospital reduce costs while improving patient care.

The image of Roper-Saint Francis Hospital in Charleston, South Carolina, is by ProfReader and is licensed under this Creative Commons License.

How to Calculate BX Life, Part 2b: Handling Triangular Matrix Data

I thought 3 posts would capture all the thoughts I had about B10 Life. That is, until this question appeared on the Minitab LinkedIn group:

pic1

In case you missed it, my first post, How to Calculate B10 Life with Statistical Software, explains what B10 life is and how Minitab calculates this value. My second post, How to Calculate BX Life, Part 2, shows how to compute any BX life in Minitab. But before I round out my BX life blog series with rationale for why BX life is one of the best measures for reliability, I thought I’d take this opportunity to address the LinkedIn question—as you might wonder the same thing.

B10 Life and Warranty Analysis

BX life can be a useful metric for establishing warranty periods for products. Why? Because it indicates the time at which X% of items in a population will fail. A manufacturer might base a warranty period on a product’s B10 life, for instance, with the goal of minimizing the number of customers whose product fails within the warranty period. Naturally, someone doing warranty analysis in Minitab will want to compute this value too! But when you look at raw reliability field data, which are recorded in the form of a triangular matrix, it’s not obvious how to compute B10 life!

Warranty Input in Triangular Matrices

It’s common to keep track of reliability field data in the form of number of items shipped and number of items returned from a particular shipment over time. And when several shipments are made at different dates and their corresponding returns noted, the recorded data are in the form of a triangular matrix.

Minitab has a tool that helps you convert shipping and warranty return data from matrix form into a standard reliability data form of failures. 

Convert your data from a matrix form for easy analysis!

To demonstrate, let’s start with a new example and new data. If you’d like to follow along and you’re using Minitab 17.3, navigate to Help > Sample Data and select the Compressor.MTW file.

pic1

Here is what the data looks like:

pic3

From here, you can use Minitab’s Pre-Process Warranty Data to reshape your data from triangular matrix format into interval censoring format. Select Stat > Reliability/Survival > Warranty Analysis > Pre-Process Warranty Data. For “Shipment (sale) column,” enter Ship. For “Return (failure) columns,” enter Month1-Month12. Click OK.

pic4

The Pre-Process step creates Start time, End time, and Frequencies columns in your worksheet! 

pic5

You can now use these columns to obtain BX life using Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Parametric Distribution Analysis. Enter Start time in “Start variables,” End time in “End variables,” and Frequencies in “Frequency columns (optional).” Also, make sure you have the appropriate assumed distribution selected. We’ll assume the Weibull distribution fits our data.

pic6

Click the Estimate button to enter percents to be estimated in addition to what’s provided in the default output (In our case, let’s ask for B15 Life—so enter a 15 in “Estimate percentiles for these additional percents”). 

pic7

When we OK out of these dialogs, Minitab performs the analysis. Among the output Minitab provides is our handy Table of Percentiles, including our value for B15 life—or the time at which 15% of the items in our population will fail.

pic8
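
As a side note, if you ever want to sanity-check a BX value by hand, the Weibull percentile has a closed form. Here's a small Python sketch; the shape and scale values are hypothetical stand-ins for whatever your distribution analysis estimates.

```python
import math

def bx_life(x_percent, shape, scale):
    """Time by which x_percent of the population is expected to fail,
    assuming a 2-parameter Weibull distribution."""
    p = x_percent / 100.0
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

# Hypothetical Weibull parameters; substitute the estimates from your analysis.
shape, scale = 1.8, 90.0
print(f"B10 life: {bx_life(10, shape, scale):.1f}")
print(f"B15 life: {bx_life(15, shape, scale):.1f}")
```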

And there you have it!

Collecting warranty data and doing warranty analysis in Minitab shouldn’t prevent you from using reliability tools and metrics, such as BX life. In fact, letting Minitab reshape your data through the Pre-Process Warranty Data tool only makes your life easier when you dive into your reliability analysis!

Now, I promise, we’re well on our way to rounding out this series of posts, and in my next installment we'll look at the reasons BX life is a good metric to have in your reliability tool belt.

To Infinity and Beyond with the Geometric Distribution


See if this sounds fair to you. I flip a coin.

pennies

Heads: You win $1.
Tails: You pay me $1.

You may not like games of chance, but you have to admit it seems like a fair game. At least, assuming the coin is a normal, balanced coin, and assuming I’m not a sleight-of-hand magician who can control the coin.

How about this next game?

You pay me $2 to play.
I flip a coin over and over until it comes up heads.
Your winnings are the total number of flips.

So if the first flip comes up heads, you only get back $1. That’s a net loss of $1. If it comes up tails on the first flip and heads on the second flip, you get back $2, and we’re even. If it comes up tails on the first two flips and then heads on the third flip, you get $3, for a net profit of $1. If it takes more flips than that, your profit is greater.

It’s not quite as obvious in this case, but this would be considered a fair game if each coin flip has an equal chance of heads or tails. That’s because the expected value (or mean) number of flips is 2. The total number of flips follows a geometric distribution with parameter p = ½, and the expected value is 1/p.

geometric distribution with p=0.5
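
If you'd like to convince yourself of that expected value, here's a quick Python sketch that simulates a large number of games. NumPy's geometric distribution counts the flip on which the first head appears, so its mean is 1/p.

```python
import numpy as np

rng = np.random.default_rng(1)

# Number of flips until the first head, for one million simulated games.
flips = rng.geometric(p=0.5, size=1_000_000)
print(flips.mean())  # should be close to 1/p = 2
```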

Now it gets really interesting. What about this next game?

You pay me $x dollars to play.
I flip a coin over and over until it comes up heads.
Your winnings start at $1, but double with every flip.

This is a lot like the previous game with two important differences. First, I haven’t told you how much you have to pay to play. I just called it x dollars. Second, the winnings grow faster with the number of flips. It starts off the same, with $1 for one flip and $2 for two flips. But then it goes to $4, then $8, then $16. If the first head comes up on the eighth flip, you win $128. And it just keeps getting better from there.

So what’s a fair price to play this game? Well, let’s consider the expected value of the winnings. It’s $∞. You read that right. It’s infinity dollars! So you shouldn’t be too worried about what x is, right? No matter what price I set, you should be eager to pay it. Right? Right?

I’m going to go out on a limb and guess that maybe you would not just let me name my price. Sure, you’d admit that it’s worth more than $2. But would you pay $10? $50? $100,000,000? If the fair price is the expected winnings, then any of these prices should be reasonable. But I’m guessing you would draw the line somewhere short of $100, and maybe even less than $10.

This fascinating little conundrum is known as the Saint Petersburg paradox. Wikipedia tells me it goes by that name because it was addressed in the Commentaries of the Imperial Academy of Science of Saint Petersburg back in 1738 by that pioneer of probability theory, Daniel Bernoulli.

The paradox is that while theory tells us that no price is too high to pay to play this game, nobody is willing to pay very much at all to play it.

What’s more, even if you decide what you're willing to pay, you won't find any casinos that even offer this game, because the ultimate outcome is just as unpredictable for the house as it is for the player.

The paradox has been discussed from various angles over the years. One reason I find it so interesting is that it forces me to think carefully about things that are easy to take for granted about the mean of a distribution.

Such as…

The mean is a measure of central tendency.

This is one of the first things we learn in statistics. The mean is in some sense a central value, which the data tends to vary around. It’s the balancing point of the distribution. But when the mean is infinite, this interpretation goes out the window. Now every possible value in the distribution is less than the mean. That’s not very central!

The sample mean approaches the population mean.

One of the most powerful results in statistics is the law of large numbers. Roughly speaking, it tells us that as your sample size grows, you can expect the sample average to approach the mean of the distribution you are sampling from. I think this is a good reason to treat the mean winnings as the fair price for playing the game. If you play repeatedly at the fair price your average profit approaches zero. But here’s the catch: the law of large numbers assumes the mean of the distribution is finite. So we lose one of the key justifications of treating the mean as the fair price when it’s infinite.

The central limit theorem.

Another extremely important result in statistics is the central limit theorem, which I wrote about in a previous blog post. It tells us that the average of a large sample has an approximate normal distribution centered at the population mean, with a standard deviation that shrinks as the sample size grows. But the central limit theorem requires not only a finite mean but a finite standard deviation. I’m sorry to tell you that if the mean of the distribution is infinite, then so is the standard deviation. So not only do we lack a finite mean that our average winnings can gravitate toward, we don’t have a nicely behaved standard deviation to narrow down the variability of our winnings.

Let’s end by using Minitab to simulate these two games where the payoff is tied to the number of flips until heads comes up. I generated 10,000 random values from the geometric distribution. The two graphs show the running average of the winnings in the two games. In the first case, we have expected winnings of 2, and we see the average stabilizes near 2 pretty quickly. 

Time Series Plot of Expected Game Winnings - $2

In the second case, we have infinite expected winnings, and the average does not stabilize.

Time Series Plot of Infinite Expected Game Winnings

If you'd like to do some simulation on this paradox yourself, here's how to do it in Minitab. First, use Calc > Make Patterned Data > Simple Set of Numbers... to make a column with the numbers 1 to 10,000. Next, open Calc > Random Data > Geometric... to create a separate column of 10,000 random data points from the geometric distribution, using .5 as the Event Probability. 

Now we can compute the running average of the random geometric data in C2 with Minitab's Calculator, using the PARS function. PARS is short for “partial sum.” In each row it stores the sum of the data up to and including that row. To get the running average of a game where the expected winnings are $2, divide the partial sums by C1, which just contains the row numbers:

calculator with formula for running average

The computation for the game with infinite mean is the same, except that the winnings double in value when C2 increases by 1. Therefore, we take the partial sums of 2^(C2 – 1) instead of just C2, and divide each by C1. That formula is entered in the calculator as shown below:

calculator with running average formula for game with infinite winnings

Finally, select Graph > Time Series Plot... and plot the running average of the games with expected winnings of $2 and $∞.
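
If you'd rather script the whole simulation, here's a rough Python equivalent of the two running-average plots, using NumPy and matplotlib. It mirrors the Minitab steps above rather than reproducing them exactly.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 10_000
games = np.arange(1, n + 1)

# Flips until the first head for each of 10,000 simulated games.
flips = rng.geometric(p=0.5, size=n)

# Game 1: winnings equal the number of flips (expected value 2).
# Game 2: winnings double with every flip, i.e. 2^(flips - 1).
running_avg_linear = np.cumsum(flips) / games
running_avg_double = np.cumsum(2.0 ** (flips - 1)) / games

plt.plot(games, running_avg_linear, label="winnings = flips")
plt.plot(games, running_avg_double, label="winnings = 2^(flips - 1)")
plt.xlabel("Games played")
plt.ylabel("Running average winnings")
plt.legend()
plt.show()
```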

So, how much would you pay to play this game? 

Data Not Normal? Try Letting It Be, with a Nonparametric Hypothesis Test


So the data you nurtured, that you worked so hard to format and make useful, failed the normality test.

not-normal

Time to face the truth: despite your best efforts, that data set is never going to measure up to the assumption you may have been trained to fervently look for.

Your data's lack of normality seems to make it poorly suited for analysis. Now what?

Take it easy. Don't get uptight. Just let your data be what they are, go to the Stat menu in Minitab Statistical Software, and choose "Nonparametrics."

nonparametrics menu

If you're stymied by your data's lack of normality, nonparametric statistics might help you find answers. And if the word "nonparametric" looks like five syllables' worth of trouble, don't be intimidated—it's just a big word that usually refers to "tests that don't assume your data follow a normal distribution."

In fact, nonparametric statistics don't assume your data follow any distribution at all. The following table lists common parametric tests, their equivalent nonparametric tests, and the main characteristics of each.

correspondence table for parametric and nonparametric tests

Nonparametric analyses free your data from the straitjacket of the normality assumption. So choosing a nonparametric analysis is sort of like removing your data from a stifling, conformist environment, and putting it into a judgment-free, groovy idyll, where your data set can just be what it is, with no hassles about its unique and beautiful shape. How cool is that, man? Can you dig it?

Of course, it's not quite that carefree. Just like the 1960s encompassed both Woodstock and Altamont, so nonparametric tests offer both compelling advantages and serious limitations.

Advantages of Nonparametric Tests

Both parametric and nonparametric tests draw inferences about populations based on samples, but parametric tests focus on population parameters like the mean and the standard deviation, and make various assumptions about your data—for example, that it follows a normal distribution, and that samples include a minimum number of data points.

In contrast, nonparametric tests are unaffected by the distribution of your data. Nonparametric tests also accommodate many conditions that parametric tests do not handle, including small sample sizes, ordered outcomes, and outliers.

Consequently, they can be used in a wider range of situations and with more types of data than traditional parametric tests. Many people also feel that nonparametric analyses are more intuitive.

Drawbacks of Nonparametric Tests

But nonparametric tests are not completely free from assumptions—they do require data to be an independent random sample, for example.

And nonparametric tests aren't a cure-all. For starters, they typically have less statistical power than their parametric equivalents. Power is the probability that you will correctly reject the null hypothesis when it is false. That means you have an increased chance of making a Type II error with these tests.

In practical terms, that means nonparametric tests are less likely to detect an effect or association when one really exists.

So if you want to draw conclusions with the same confidence level you'd get using an equivalent parametric test, you will need larger sample sizes. 

Nonparametric tests are not a one-size-fits-all solution for non-normal data, but they can yield good answers in situations where parametric statistics just won't work.
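
To make the pairing concrete, here's a small Python sketch that runs a parametric test and its nonparametric counterpart on the same two skewed samples; the data are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)

# Two small, skewed samples (hypothetical cycle times, in minutes).
group_a = rng.exponential(scale=5.0, size=12)
group_b = rng.exponential(scale=8.0, size=12)

# Parametric: Welch's 2-sample t-test (compares means).
t_stat, t_p = ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric: Mann-Whitney test (no distributional assumption).
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test p-value:  {t_p:.3f}")
print(f"Mann-Whitney p-value:  {u_p:.3f}")
```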

Is Parametric or Nonparametric the Right Choice for You?

I've briefly outlined differences between parametric and nonparametric hypothesis tests, looked at which tests are equivalent, and considered some of their advantages and disadvantages. If you're waiting for me to tell you which direction you should choose...well, all I can say is, "It depends..." But I can give you some established rules of thumb to consider when you're looking at the specifics of your situation.

Keep in mind that nonnormal data does not immediately disqualify your data for a parametric test. What's your sample size? As long as a certain minimum sample size is met, most parametric tests are robust to departures from the normality assumption. For example, the Assistant in Minitab (which uses Welch's t-test) points out that while the 2-sample t-test is based on the assumption that the data are normally distributed, this assumption is not critical when the sample sizes are at least 15. And Bonett's 2-sample standard deviation test performs well for nonnormal data even when sample sizes are as small as 20.

In addition, while they may not require normal data, many nonparametric tests have other assumptions that you can’t disregard. For example, the Kruskal-Wallis test assumes your samples come from populations that have similar shapes and equal variances. And the 1-sample Wilcoxon test does not assume a particular population distribution, but it does assume the distribution is symmetrical. 

In most cases, your choice between parametric and nonparametric tests ultimately comes down to sample size, and whether the center of your data's distribution is better reflected by the mean or the median.

  • If the mean accurately represents the center of your distribution and your sample size is large enough, a parametric test offers you better accuracy and more power. 
  • If your sample size is small, you'll likely need to go with a nonparametric test. But if the median better represents the center of your distribution, a nonparametric test may be a better option even for a large sample.

 

What the Heck Are Sums of Squares in Regression?


In regression, "sums of squares" are used to represent variation. In this post, we’ll use some sample data to walk through these calculations.

squares

The sample data used in this post is available within Minitab by choosing Help > Sample Data, or File > Open Worksheet > Look in Minitab Sample Data folder (depending on your version of Minitab). The dataset is called ResearcherSalary.MTW, and contains data on salaries for researchers in a pharmaceutical company.

For this example we will use the data in C1, the salary, as Y or the response variable and C4, the years of experience as X or the predictor variable.

First, we can run our data through Minitab to see the results: Stat > Regression > Fitted Line Plot. The salary is the Y variable, and the years of experience is our X variable. The regression output will tell us about the relationship between years of experience and salary after we complete the dialog box as shown below, and then click OK:

fitted line plot dialog

In the window above, I’ve also clicked the Storage button and selected the box next to Coefficients to store the coefficients from the regression equation in the worksheet. When we click OK, Minitab gives us two pieces of output:

fitted line plot and output

On the left side above we see the regression equation and the ANOVA (Analysis of Variance) table, and on the right side we see a graph that shows us the relationship between years of experience on the horizontal axis and salary on the vertical axis. Both the right and left side of the output above are conveying the same information. We can clearly see from the graph that as the years of experience increase, the salary increases, too (so years of experience and salary are positively correlated).  For this post, we’ll focus on the SS (Sums of Squares) column in the Analysis of Variance table.

Calculating the Regression Sum of Squares

We see a SS value of 5086.02 in the Regression line of the ANOVA table above. That value represents the amount of variation in the salary that is attributable to the number of years of experience, based on this sample. Here's where that number comes from. 

  1. Calculate the average response value (the salary). In Minitab, I’m using Stat > Basic Statistics > Store Descriptive Statistics:

dialog boxes

In addition to entering the Salary as the variable, I’ve clicked Statistics to make sure only Mean is selected, and I’ve also clicked Options and checked the box next to Store a row of output for each row of input. As a result, Minitab will store a value of 82.9514 (the average salary) in C5 35 times:

data

  2. Next, we will use the regression equation that Minitab gave us to calculate the fitted values. The fitted values are the salaries that our regression equation would predict, given the number of years of experience.

Our regression equation is Salary = 60.70 + 2.169*Years, so for every year of experience, we expect the salary to increase by 2.169. 

The first row in the Years column in our sample data is 11, so if we use 11 in our equation we get 60.70 + 2.169*11 = 84.559.  So with 11 years of experience our regression equation tells us the expected salary is about $84,000. 

Rather than calculating this for every row in our worksheet manually, we can use Minitab’s calculator: Calc > Calculator (I used the stored coefficients in the worksheet to include more decimals in the regression equation that I’ve typed into the calculator):

calculator

After clicking OK in the window above, Minitab will store the predicted salary value for every year in column C6. NOTE: In the regression graph we obtained, the red regression line represents the values we’ve just calculated in C6.

  3. Now that we have the average salary in C5 and the predicted values from our equation in C6, we can calculate the Sums of Squares for the Regression (the 5086.02). We’ll use Calc > Calculator again, and this time we will subtract the average salary from the predicted values, square those differences, and then add all of those squared differences together:

calculator

We square all the values because some of the predicted values from our equation are lower than the average, so those predicted values would be negative. If we sum together both positive and negative values, they will cancel each other out. But because we square the values, all observations will be taken into account.

We have just calculated the Sum of Squares for the regression by summing the squared values.  Our results should match what we’d seen in the regression output previously:

output

Calculating the Error Sum of Squares

The Error Sum of Squares is the variation in the salary that is not explained by number of years of experience. For example, the additional variation in the salary could be due to the person’s gender, number of publications, or other variables that are not part of this model. Any variation that is not explained by the predictors in the model becomes part of the error term.

  1. To calculate the error sum of squares we will use the calculator (Calc > Calculator) again to subtract the fitted values (the salaries predicted by our regression equation) from the observed response (the actual salaries):                

calculator

In C9, Minitab will store the differences between the actual salaries and what our equation predicted.

  2. Because we’re calculating sums of squares again, we’re going to square all the values we stored in C9, and then add them up to come up with the sum of squares for error:

calculator

When we click OK in the calculator window above, we see that our calculated sum of squares for error matches Minitab’s output:

output

Finally, the Sum of Squares total is calculated by adding the Regression and Error SS together: 5086.02 + 1022.61 = 6108.63.
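
If you'd like to verify these relationships outside Minitab, here's a compact Python sketch that fits a simple regression and computes the three sums of squares; the x and y values are made up and are not the ResearcherSalary worksheet data.

```python
import numpy as np

# Hypothetical years of experience (x) and salary in $1000s (y).
x = np.array([11, 5, 14, 8, 20, 3, 17, 9])
y = np.array([85, 71, 92, 78, 104, 66, 98, 81])

# Fit the simple linear regression y = b0 + b1*x.
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x

ss_regression = np.sum((fitted - y.mean()) ** 2)  # explained variation
ss_error      = np.sum((y - fitted) ** 2)         # unexplained variation
ss_total      = ss_regression + ss_error          # equals np.sum((y - y.mean()) ** 2)

print(ss_regression, ss_error, ss_total)
```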

I hope you’ve enjoyed this post, and that it helps demystify what sums of squares are.  If you’d like to read more about regression, you may like some of Jim Frost’s regression tutorials.

Sunny Day for A Statistician vs. Dark Day for A Householder with Solar Panels


In 2011 we had solar panels fitted on our property. In the last few months we have noticed a few problems with the inverter (the equipment that converts the electricity generated by the panels from DC to AC, and manages the transfer of unused electricity to the power company). It was shutting down at various times throughout the day, typically when it was very sunny, resulting in no electricity being generated.

solar panels

I contacted the inverter manufacturer for some help to diagnose the problem. They asked me to download their monitoring app, called Sunny Portal. I did this and started a communication process with the inverter via Bluetooth, which not only showed me the error code but also delivered a time series of the electricity generated by the hour since the panels were installed.

I thought I had gone to statistician heaven! By using this data, I could establish if this problem was significantly reducing the amount of electricity generated and, consequently, reducing the amount of cash I was being paid for generating electricity. 

The Sunny Portal does have some basic bar charts to plot time series by month, day, and 5-minute interval; however, each chart automatically works out the scale according to the data, so it is difficult to compare time periods.

Top Minitab Tip: If you want to compare multiple charts measuring the same thing for different time periods or groups, make sure the Y-axis scales are the same. In many Minitab graphs and charts, if you select the Multiple Graphs button you will be given the option to select the same Y-axis scale.

Getting the Data into Minitab

I realized that I could output the data to text files, which meant I could use my statistical skills and Minitab to answer my questions. For each month between Sept 2011 and June 2016 I exported a file like the example shown below. For each day I have the date, the cumulative units generated since the inverter was commissioned, and the daily generation.

These were easily read into Minitab, using File > Open, specifying the first row of data as row 9, and changing the delimiter from comma to semicolon. 

I read all of these monthly files into individual Minitab worksheets and then used Data > Stack Worksheets to create a single worksheet that contained all the data.  

Creating and Reviewing the Time Series Plots

Using Graph > Time Series Plot, I created the following time series plots. To get each year in different colours, I double-clicked on an individual data point in the chart, chose the "Groups" tab in the Edit Symbols dialog box, and put Year as the grouping variable.

Looking at this plot, it was clear that the most electricity is generated in the summer months and least in the winter months, but it was not easy to identify if the amount of electricity generated had been declining. I needed to consider another analytical approach.

Since I have only noticed this problem in the last 6 months (Jan to June 2016), I decided to compare the electricity generated in the first 6 months of the year for the years 2012–2016. I did this using Assistant > Hypothesis Tests > One-Way ANOVA. The descriptive results were as follows:

Just looking at the summary statistics, I can clearly see that the average electric units generated per day for the first six months of 2016 is much lower, at 5.71 units, than it was in the previous years, which range between 8.15 in 2012 and 9.22 in 2014. However, by using the results from the one-way ANOVA, I can work out if 2016 is significantly worse than previous years.

From this chart you can see that, based on the analysis of my data, the probability of the electricity generated being the same for each year is less than 0.001, confirming that there are significant differences in each year’s average. By using the Means Comparison Chart, shown below, I can also see that 2016 is significantly lower than all the other years.
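
For readers who prefer to script their checks, here's a minimal one-way ANOVA sketch in Python with SciPy. The daily generation figures are invented placeholders, not my Sunny Portal export.

```python
from scipy.stats import f_oneway

# Hypothetical average daily generation (units/day) samples for Jan-June of each year.
y2012 = [7.9, 8.3, 8.1, 8.4, 8.0]
y2013 = [8.6, 8.9, 8.7, 9.1, 8.8]
y2014 = [9.1, 9.3, 9.2, 9.4, 9.0]
y2015 = [8.8, 9.0, 8.7, 9.1, 8.9]
y2016 = [5.5, 5.9, 5.6, 5.8, 5.7]

f_stat, p_value = f_oneway(y2012, y2013, y2014, y2015, y2016)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```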

However, you might be thinking that the first six months of 2016 in England were darker than in an average year, and that there has been significantly less UV light. This might be a fair point, so to check this I looked at data produced by the UK Met Office (www.metoffice.gov.uk/climate/uk/summaries/anomalygraphs). These charts, called anomaly graphs, compare the sunshine levels by month for particular years to the average sunshine levels for the previous decade.

The results for 2016 and 2012, the two worst years for average electricity generated per day, are as follows: 

When I compare the Met Office data for the amount of sunshine in the first six months of 2016 in England (red bar) with 2012, the second-worst year according to my summary statistics, I can see that only January and March were better in 2012. It should also be noted that you generate more electricity when there are more daylight hours, so a bad June has a bigger influence on electricity generated than a bad January, and June 2012 was worse than June 2016.

Consequently, I can see that the English weather cannot be blamed for the lower electricity generation figures and the fault is with my inverter. The next steps are to determine when this problem with the inverter started, and estimate what it has cost. 

After I shared my results, the helpdesk at the manufacturer identified the problem with the inverter: it had been set up with German power grid settings, and apparently the UK grid has more voltage fluctuation. The settings were changed on 15th July, and I'm looking forward to collecting more data and analyzing it in Minitab to determine whether this problem has been solved.

 

How to Pick the Right Statistical Software


If you’re in the market for statistical software, there are many considerations and more than a few options for you to evaluate.

questions to ask

Check out these seven questions to ask yourself before choosing statistical software—your answers should help guide you towards the best solution for your needs!

1. Who uses statistical software in your organization?

Are they expert statisticians, novices, or a mix of both? Will they be analyzing data day-in, day-out, or will some be doing statistics on a less frequent basis? Is data analysis a core part of their jobs, or is it just one of many different hats some users have to wear? What's their relationship with technology—do they like computers, or just use them because they have to? 

Figuring out who needs to use the software will help you match the options to their needs, so you can avoid choosing a package that does too much or too little.

If your users span a range of cultures and nationalities, be sure to see if the package you're considering is available in multiple languages.

2. What types of statistical analysis will they be doing?

The specific types of analysis you need to do could play a big part in determining the right statistical software for your organization. The American Statistical Association's software page lists highly specialized programs for econometrics, spatial statistics, data mining, statistical genetics, risk modeling, and more. However, if your company has employees who specialize in the finer points of these kinds of analyses, chances are good they have already identified and have access to the right software for their needs.

Most users will want a general statistical software package that offers the power and flexibility to do all of the most commonly used types of analysis, including regression, ANOVA, hypothesis testing, design of experiments, capability analysis, control charts, and more. If you're considering a general statistical software package, check its features list to make sure it does the kinds of analysis you need. Here is the complete feature list for Minitab Statistical Software. 

3. How easy is it to use the statistical software?

Data analysis is not simple or easy, and many statistical software packages don’t even try to make it any easier. This is not necessarily a bad thing, because "ease of use" is different for different users.

An expert statistician will know how to set up data correctly and will be comfortable entering statistical equations in a command-line interface—in fact, they may even feel slowed down by using a menu-based interface. On the other hand, a less experienced user may be intimidated or overwhelmed by a statistical software package designed primarily for use by experts. 

Since ease of use varies widely, look into what kinds of built-in guidance statistical software packages offer to see which would be easiest for the majority of your users.

4. What kind of support is offered?

If people in your organization will need help using statistical software to analyze their data, how will they get it? Does your company have expert statisticians who can provide assistance when it's needed, or is access to that kind of expertise limited? 

If you think people in your organization are going to contact the software's support team for assistance, it's smart to check around and see what kinds of assistance different software companies offer. Do they offer help with analysis problems, or only with installation and IT issues? Do they charge for it?

Look around in online user and customer forums to see what people say about the customer service they've received for different types of statistical software. Some software packages offer free technical support from experts in statistics and IT; others provide more limited, fee-based customer support; and some packages provide no support at all.

5. Where will the software be used?

Will you be doing data analysis in your office? At home? On the road? All of the above? Will people in your organization be using the software at different locations across the country, or even the world?  What are the license requirements for software packages in that situation? Does each machine need a separate copy of the software, or are shared licenses available?

Check on the options available for the packages you're considering. A good software provider will seek to understand your organization's unique needs and work with you to find the most cost-effective solution.

6. Are there special considerations for your industry?

Some professions have specialized data analysis needs due to regulations, industry requirements, or the unique nature of their business. For example, the pharmaceutical and medical device industry needs to meet FDA recommendations for testing, which may involve statistical techniques such as Design of Experiments.

Depending on the needs of your business, one or more of these highly specialized software packages may be appropriate. However, general statistical software packages with a full range of tools may provide the functionality your industry requires, so be sure to investigate and compare these packages with the more specialized, and often more expensive, programs used in some industries.

7. What do statistical software packages cost?

Last but not least, you will need to consider the cost of the software package, which can range from $0 for some open-source programs to many thousands of dollars per license for more specialized offerings.

It’s important to compare not just the unit-copy price of a software package (i.e., what it costs to install a single copy of the software on a single machine), but to find out what licensing options for statistical software are available for your situation. 

Have more questions?

If you have questions about data analysis software, please contact Minitab to discuss your unique situation in detail. We are happy to help you identify the needs of your organization and find a solution that will best fit them!


Is Alabama Going Undefeated this Year? Creating Simulations in Minitab


The college football season is here, and this raises a very important question:

Is Alabama going to be undefeated when they win the national championship, or will they lose a regular-season game along the way?

Alabama

Okay, so it's not a given that Alabama is going to win the championship this year, but when you've won 4 of the last 7, you're definitely the odds-on favorite.

However, what if we wanted to take a quantitative look at Alabama's chances of going undefeated instead of just giving hot takes like the one above? How could we determine a probability of Alabama winning a specific number of games this year?

The answer is easy: a Monte Carlo Simulation.

Monte Carlo simulations use repeated random sampling to simulate data for a given mathematical model and evaluate the outcome. Sounds like the perfect situation for Minitab Statistical Software. We're going to use a Monte Carlo simulation to have Alabama play their schedule 100,000 times! But we need to establish a few things before we get started.

The Transfer Equation

First, we need a model to use in our simulation. This can be a known formula from your specific area of expertise, or it could be a model created from a designed experiment (DOE) or regression analysis. In our situation, we already know the transfer equation. It's just the summation of the number of games that Alabama wins during the season: 

Game1 + Game2 + Game3 ... + Game12

The Variables

Next, we need to define the distribution and parameters for the variables in our equation. We have 12 variables, one for each game Alabama will play.

For each game, Alabama can either win or lose. So each variable comes from the binomial distribution because there are only two outcomes.

Now we just need to determine the probability Alabama has of winning each game. For that, I'll turn to Bill Connelly's S&P+ rankings. These rankings use play-by-play and drive data from every game to rank college football teams. But most importantly, these rankings can be used to generate win probabilities for individual games. And that's where the probability for our 12 binomial variables will come from.

Alabama probabilities

Generate the Random Data

Now that we have our variables, it's time to generate the random data for each one. We'll start with Alabama's opening game against USC, which is a binomial random variable with a probability of 0.71. To generate this data in Minitab, go to Calc > Random Data > Binomial. Then complete the dialog as follows.

Binomial Distribution

We're going to simulate this game 100,000 times, so that is the number of rows of data we want to generate. We want each row to represent a single game, so the number of trials is 1. And lastly, Alabama has a 71% chance of winning, so the event probability is 0.71. 

After we repeat this for the other 11 games, we'll have simulated Alabama's regular season 100,000 times! Now all that's left to do is to analyze the results!

Note: The probability for Alabama beating Chattanooga is 100%, but the probability for the binomial distribution has to be less than 1. So I used a value of 0.9999. Out of 100,000 games Chattanooga actually won twice! Hey, it's sports, anything can happen!

Analyze the Simulation

Remember that transfer equation we came up with at the beginning? Now that we have the data for all of our variables, it's time to use it! Go to Calc > Calculator, and set up the equation to store the results in a new column.

Calculator

I created a new column called "Alabama Wins" and entered the sum of the individual game columns in the expression. This will give me the number of wins Alabama will have for 100,000 different seasons! We can use a histogram to view the results.

Histogram

The most common outcome was a 10-win season, which Alabama did approximately 29.6% of the time. And the simulation suggests it doesn't look good for Alabama going undefeated. That only happens in 4.6% of the simulations. In fact, there is a better chance that Alabama wins 7 games than all 12! A 7-5 Alabama team sounds impossible. But this is sports, and as our simulation has just shown, anything can happen!
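
If you want to run a similar simulation without Minitab, here's a rough Python sketch. Only the USC (0.71) and Chattanooga (0.9999) probabilities come from the discussion above; the other ten values are placeholders for the S&P+ win probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
n_seasons = 100_000

# Win probability for each of the 12 games (mostly hypothetical values).
win_probs = [0.71, 0.9999, 0.80, 0.65, 0.75, 0.85,
             0.70, 0.90, 0.60, 0.88, 0.78, 0.72]

# One binomial draw per game per season: 1 = win, 0 = loss.
games = rng.binomial(n=1, p=win_probs, size=(n_seasons, len(win_probs)))
wins = games.sum(axis=1)

print("P(undefeated 12-0):", np.mean(wins == 12))
print("P(exactly 10 wins):", np.mean(wins == 10))
```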

Monte Carlo simulations can be applied to a wide variety of areas outside of sports too. If you want more, check out this article that illustrates how to use Minitab for Monte Carlo simulations using both a known engineering formula and a DOE equation.

 

Creating Value from Your Data


There may be huge potential benefits waiting in the data in your servers. These data may be used for many different purposes. Better data allows better decisions, of course. Banks, insurance firms, and telecom companies already own a large amount of data about their customers. These resources are useful for building a more personal relationship with each customer.

Some organizations already use data from agricultural fields to build complex and customized models based on a very extensive number of input variables (soil characteristics, weather, plant types, etc.) in order to improve crop yields. Airline companies and large hotel chains use dynamic pricing models to improve their yield management. Data is increasingly being referred to as the new “gold mine” of the 21st century.

A few factors underlie the rising prominence of data (and, therefore, data analysis):


Huge volumes of data

Data acquisition has never been easier (sensors in manufacturing plants, sensors in connected objects, data from internet usage and web clicks, from credit cards, loyalty cards, Customer Relationship Management databases, satellite images, etc.), and it can easily be stored at costs that are lower than ever before (huge storage capacity is now available in the cloud and elsewhere). The amount of data being collected is not only huge, it is growing very fast… in an exponential way.

Unprecedented velocity

Connected devices, like our smart phones, provide data in almost real time and it can be processed very quickly. It is now possible to react to any change…almost immediately.

Incredible variety

The data collected is not restricted to billing information; every source of data is potentially valuable for a business. Not only is numeric data being collected on a massive scale, but so is unstructured data such as videos and pictures, in a large variety of situations.

But the explosion of data available to us is prompting every business to wrestle with an extremely complicated problem:

How can we create value from these resources?

Very simple methods, such as counting the words used in queries submitted to company web sites, provide good insight into the general mood of your customers and how it is evolving. Simple statistical correlations are often used by web vendors to suggest an additional purchase just after a customer buys a product on the web. Very simple descriptive statistics are also useful.

Just imagine what could be achieved with advanced regression models or powerful multivariate statistical techniques, which can be applied easily with statistical software packages like Minitab.

A simple example of the benefits of analyzing an enormous database

Let's consider an example of how one company benefited from analyzing a very large database.

Many steps are needed (security and safety checks, cleaning the cabin, etc.) before a plane can depart. Since delays negatively impact customer perceptions and also affect productivity, airline companies routinely collect a very large amount of data related to flight delays and times required to perform tasks before departure. Some times are automatically collected, others are manually recorded.

A major worldwide airline company intended to use this data to identify the crucial milestones among a very large number of preparation steps, and which ones often triggered delays in departure times. The company used Minitab's stepwise regression analysis to quickly focus on the few variables that played a major role among a large number of potential inputs. Many variables turned out to be statistically significant, but two among them clearly seemed to make a major contribution (X6 and X10).

Analysis of Variance

Source    DF    Seq SS    Contribution    Adj SS     Adj MS    F-Value    P-Value
X6         1    337394          53.54%      2512     2512.2      29.21      0.000
X10        1    112911          17.92%     66357    66357.1     771.46      0.000

When huge databases are used, statistical analyses may become overly sensitive and detect even very small differences (due to the large sample and power of the analysis). P values often tend to be quite small (p < 0.05) for a large number of predictors.

However, in Minitab, if you click Results in the regression dialog box and select Expanded tables, the contribution from each variable gets displayed. X6 and X10, when considered together, contributed more than 80% of the overall variability (and had the largest F-values by far); the contributions from the remaining factors were much smaller. The airline then ran a residual analysis to cross-validate the final model.

In addition, a Principal Component Analysis (PCA, a multivariate technique) was performed in Minitab to describe the relations between the most important predictors and the response. Milestones were expected to be strongly correlated to the subsequent steps.

The graph above is a Loading Plot from a principal component analysis. Lines that go in the same direction and are close to one another indicate how the variables may be grouped. Variables are visually grouped together according to their statistical correlations and how closely they are related.

A group of nine variables turned out to be strongly correlated to the most important inputs (X6 and X10) and to the final delay times (Y). Delays at the X6 stage obviously affected the X7 and X8 stages (subsequent operations), and delays from X10 affected the subsequent X11 and X12 operations.
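
For anyone curious how a loading plot like this might be produced outside Minitab, here's a hedged Python sketch using scikit-learn. The data matrix is randomly generated, so it merely stands in for the airline's real turnaround measurements.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix of turnaround-step times: rows = flights,
# columns = preparation steps X1 ... X12 plus the final delay Y.
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 13))

# Standardize the variables, then fit a 2-component PCA.
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(scaled)

# The loadings show how each variable contributes to the components;
# variables whose arrows point in similar directions are correlated.
loadings = pca.components_.T
labels = [f"X{i}" for i in range(1, 13)] + ["Y"]
for (lx, ly), name in zip(loadings, labels):
    plt.arrow(0, 0, lx, ly, head_width=0.02)
    plt.annotate(name, (lx, ly))
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.xlabel("First component")
plt.ylabel("Second component")
plt.show()
```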

Conclusion

This analysis provided simple rules that this airline's crews can follow in order to avoid delays, making passengers' next flight more pleasant. 

The airline can repeat this analysis periodically to search for the next most important causes of delays. Such an approach can propel innovation and help organizations replace traditional and intuitive decision-making methods with data-driven ones.

What's more, the use of data to make things better is not restricted to the corporate world. More and more public administrations and non-governmental organizations are making large, open databases easily accessible to communities and to virtually anyone. 

Control Charts and Capability Analysis: How to Set Up Your Data

To assess if a process is stable and in statistical control, you can use a control chart. It lets you answer the question "is the process that you see today going to be similar to the process that you see tomorrow?" To assess and quantify how well your process falls within specification limits, you can use capability analysis.

Both of these tools are easy to use in Minitab, but you first need to set up your data properly. Here's how.

Chronological Order

Your data should be entered in the order in which it was collected. The first measurement you take should be recorded in row 1. Then the next measurement belongs in row 2, etc. Data should never be sorted (e.g., arranged from smallest to largest) when creating control charts or running capability analysis.

control chart data setup - 1

Data Collected in Subgroups

If you collect your data in subgroups (say you collect 5 parts every hour), then those 5 individual data points don't necessarily need to be in chronological order. However, in your Minitab worksheet, the first set of 5 data points collected needs to appear before the next set of 5 data points collected, and so on. You can then enter '5' for your subgroup size in the Stat > Control Charts, Stat > Quality Tools > Capability Analysis, and Assistant dialog boxes.

control chart data setup - data collected in subgroups

Missing Data

Suppose you intend to collect 5 data points every hour, but during one of the hours you collect only 4 data points. If your sample size falls short, you can represent the missing data point(s) with an asterisk (*).

control chart data setup - subgroup with missing measurement

Subgroup Indicator

Suppose you intend to collect 5 data points every hour, but during one of the hours you collect 6 data points. Rather than tossing out perfectly good data, you can create a subgroup indicator column to let Minitab know that the subgroup size varies. You can then enter this subgroup column in the Stat > Control Charts and Stat > Quality Tools > Capability Analysis dialog boxes.

control chart setup - subgroup indicator

Note that you can use a subgroup indicator column for any of the cases above. It’s only absolutely required when your subgroup size varies.

When creating control charts or running capability analysis, the order in which your data appears directly impacts the resulting charts and calculations. Therefore, it’s important to make sure you enter your data properly.
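
To make the layout concrete, here is a minimal sketch in Python (pandas assumed, not Minitab) of a worksheet with measurements in chronological order and a subgroup indicator column; the numbers are made up.

import pandas as pd

# Measurements entered in the order collected, plus a subgroup indicator.
# Hour 2 has only 4 parts and hour 3 has 6, so the indicator column lets the
# subgroup size vary without discarding data.
data = pd.DataFrame({
    "measurement": [9.8, 10.1, 10.0, 9.9, 10.2,        # hour 1 (5 parts)
                    10.3, 9.7, 10.0, 10.1,              # hour 2 (4 parts)
                    9.9, 10.0, 10.2, 10.1, 9.8, 10.0],  # hour 3 (6 parts)
    "subgroup":    [1]*5 + [2]*4 + [3]*6,
})

# Subgroup means, standard deviations, and sizes: the building blocks of an
# Xbar chart or capability analysis.
print(data.groupby("subgroup")["measurement"].agg(["mean", "std", "count"]))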

When to Use a Pareto Chart

I confess: I'm not a natural-born decision-maker. Some people—my wife, for example—can assess even very complex situations, consider the options, and confidently choose a way forward. Me? I get anxious about deciding what to eat for lunch. So you can imagine what it used to be like when I needed to confront a really big decision or problem. My approach, to paraphrase the Byrds, was "Re: everything, churn, churn, churn."

Thank heavens for Pareto charts.

What Is a Pareto Chart, and How Do You Use It?

A Pareto chart is a basic quality tool that helps you identify the most frequent defects, complaints, or any other factor you can count and categorize. The chart takes its name from Vilfredo Pareto, originator of the "80/20 rule," which postulates that, roughly speaking, 20 percent of the people own 80 percent of the wealth. Or, in quality terms, 80 percent of the losses come from 20 percent of the causes.

You can use a Pareto chart any time you have data that are broken down into categories, and you can count how often each category occurs. As children, most of us learned how to use this kind of data to make a bar chart:

bar chart

A Pareto chart is just a bar chart that arranges the bars (counts) from largest to smallest, from left to right. The categories or factors symbolized by the bigger bars on the left are more important than those on the right.

Pareto Chart

By ordering the bars from largest to smallest, a Pareto chart helps you visualize which factors comprise the 20 percent that are most critical—the "vital few"—and which are the "trivial many."

A cumulative percentage line helps you judge the added contribution of each category. If a Pareto effect exists, the cumulative line rises steeply for the first few defect types and then levels off. In cases where the bars are approximately the same height, the cumulative percentage line makes it easier to compare categories.

It's common sense to focus on the ‘vital few’ factors. In the quality improvement arena, Pareto charts help teams direct their efforts where they can make the biggest impact. By taking a big problem and breaking it down into smaller pieces, a Pareto chart reveals where our efforts will create the most improvement.

If a Pareto chart seems rather basic, well, it is. But like a simple machine, its very simplicity makes the Pareto chart applicable to a very wide range of situations, both within and beyond quality improvement.

Use a Pareto Chart Early in Your Quality Improvement Process

At the leadership or management level, Pareto charts can be used at the start of a new round of quality improvement to figure out what business problems are responsible for the most complaints or losses, and dedicate improvement resources to those. Collecting and examining data like that can often result in surprises and upend an organization's "conventional wisdom." For example, leaders at one company believed that the majority of customer complaints involved product defects. But when they saw the complaint data in a Pareto chart, it showed that many more people complained about shipping delays. Perhaps the impression that defects caused the most complaints arose because the relatively few people who received defective products tended to complain very loudly—but since more customers were affected by shipping delays, the company's energy was better devoted to solving that problem.

Use a Pareto Chart Later in Your Quality Improvement Process

Once a project has been identified, and a team assembled to improve the problem, a Pareto chart can help the team select the appropriate areas to focus on. This is important because most business problems are big and multifaceted. For instance, shipping delays may occur for a wide variety of reasons, from mechanical breakdowns and accidents to data-entry mistakes and supplier issues. If there are many possible causes a team could focus on, it's smart to collect data about which categories account for the biggest number of incidents. That way, the team can choose a direction based on the numbers and not the team's "gut feeling."

Use a Pareto Chart to Build Consensus

Pareto charts also can be very helpful in resolving conflicts, particularly if a project involves many moving parts or crosses over many different units or work functions. Team members may have sharp disagreements about how to proceed, either because they wish to defend their own departments or because they honestly believe they know where the problem lies. For example, a hospital project improvement team was stymied in reducing operating room delays because the anesthesiologists blamed the surgeons, while the surgeons blamed the anesthesiologists. When the project team collected data and displayed it in a Pareto chart, it turned out that neither group accounted for a large proportion of the delays, and the team was able to stop finger-pointing. Even if the chart had indicated that one group or the other was involved in a significantly greater proportion of incidents, helping the team members see which types of delays were most 'vital' could be used to build consensus.

Use Pareto Charts Outside of Quality Improvement Projects

Their simplicity also makes Pareto charts a valuable tool for making decisions beyond the world of quality improvement. By helping you visualize the relative importance of various categories, you can use them to prioritize customer needs, opportunities for training or investment—even your choices for lunch.

How to Create a Pareto Chart

Creating a Pareto chart is not difficult, even without statistical software. Of course, if you're using Minitab, the software will do all this for you automatically—create a Pareto chart by selecting Stat > Quality Tools > Pareto Chart... or by selecting Assistant > Graphical Analysis > Pareto Chart. You can collect raw data, in which each observation is recorded in a separate row of your worksheet, or summary data, in which you tally observation counts for each category.

1. Gather Raw Data about Your Problem

Be sure you collect a random sample that fully represents your process. For example, if you are counting the number of items returned to an electronics store in a given month, and you have multiple locations, you should not gather data from just one store and use it to make decisions about all locations. (If you want to compare the most important defects for different stores, you can show separate charts for each one side-by-side.)

2. Tally Your Data

Add up the observations in each of your categories.

3. Label your horizontal and vertical axes.

Make the widths of all your horizontal bars the same and label the categories in order from largest to smallest. On the vertical axis, use round numbers that slightly exceed your top category count, and include your measurement unit.

4. Draw your category bars.

Using your vertical axis, draw bars for each category that correspond to their respective counts. Keep the width of each bar the same.

5. Add cumulative counts and lines.

As a final step, you can list the cumulative counts along the horizontal axis and draw a cumulative line over the top of your bars. Each category's cumulative count is the count for that category PLUS the total count of the preceding categories. If you want to add a line, draw a right axis and label it from 0 to 100%, lined up with the grand total on the left axis. Above the right edge of each category, mark a point at the cumulative total, then connect the points.
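
If you'd like to script these steps rather than draw them by hand, here is a minimal sketch with pandas and matplotlib; the complaint categories and counts are made up for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Steps 2-3: tally the categories and order them from largest to smallest.
counts = pd.Series({"Shipping delay": 120, "Wrong item": 45, "Damaged box": 30,
                    "Billing error": 20, "Other": 10}).sort_values(ascending=False)

# Step 5: cumulative percentages for the line over the bars.
cum_pct = counts.cumsum() / counts.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(counts.index, counts.values)
ax1.set_ylabel("Count")

ax2 = ax1.twinx()                       # right axis labeled 0 to 100%
ax2.plot(counts.index, cum_pct.values, marker="o", color="firebrick")
ax2.set_ylim(0, 100)
ax2.set_ylabel("Cumulative percentage")

ax1.set_title("Pareto chart of customer complaints")
plt.show()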

Whatever Happened to…the Ozone Hole?

Today, September 16, is World Ozone Day. You don't hear much about the ozone layer anymore.

In fact, if you’re under 30, you might think this is just another trivial, obscure observance, along the lines of International Dot Day (yesterday) or National Apple Dumpling Day (tomorrow).

But there's a good reason that, almost 30 years ago, the United Nations designated today as a day to raise awareness of the ozone layer: unlike dots and apple dumplings, this fragile shield of gas in the stratosphere, which acts as a natural sunscreen against dangerous levels of UV radiation, is critical to sustaining life on our planet.

In this post, we'll join the efforts of educators around the globe who organize special activities on this day, by using Minitab to statistically analyze ozone-related data. You can follow along using the data in this Minitab project. If you don't already have it, you can download Minitab here and use it free for 30 days.

Orthogonal Regression: Can You Trust Your Data?

Before you analyze data, it's important to verify that your measuring system is accurate. Orthogonal regression, also known as Deming regression, is a tool used to evaluate whether two instruments or methods provide comparable measurements.

The following sample data are from the National Institute of Standards and Technology (NIST) website. The predictor variable (x) is the NIST measurement of ozone concentration. The response variable (y) is the measurement of ozone concentration using a customer's measuring device.

In Minitab, choose Stat > Regression > Orthogonal Regression. Enter C1 as the Response (Y) and NIST as the Predictor (X). Enter 1.5 as the Error variance ratio (Y/X) and click OK.

Note: The error variance ratio is based on historical data, not the sample data. Because the ratio is not available for these data, we'll use 1.5 purely for illustrative purposes. To learn more about this ratio, and how to estimate it, see the comments following this Minitab blog post.

Orthogonal Regression Analysis: Device versus NIST

The fitted line plot shows that the two sets of measurements appear almost identical. That's about as good as it gets.

Now look at the numerical output. If there's perfect correlation, and no bias, you'd expect to see a constant value of 0 and a slope of 1 in the regression equation. 

Error Variance Ratio (Device/NIST): 1.5

Regression Equation
Device = - 0.263 + 1.002 NIST

Coefficients

Predictor      Coef     SE Coef          Z         P       Approx 95% CI
Constant  -0.26338  0.232819    -1.1313  0.258  (-0.71969, 0.19294)
NIST       1.00212  0.000430  2331.6058  0.000  ( 1.00128, 1.00296)

To assess this, look at the 95% confidence intervals for the coefficients. The confidence interval for the constant includes 0. The confidence interval for the predictor variable (NIST) is extremely close to 1, but does not include 1. Technically, there is some bias, although it may be too small to be relevant. In cases like this, rely on your practical knowledge of the field to determine whether the amount of bias is important.

I'm no ozone expert, but given the sample measurements, I'd speculate that this tiny amount of bias is not critical. 
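
For the curious, the Deming estimator itself is short enough to sketch by hand. The function below assumes paired measurements in NumPy arrays and an error variance ratio (Y/X) supplied from historical data; the example numbers are made up, not the NIST data.

import numpy as np

def deming(x, y, delta=1.5):
    """Orthogonal (Deming) regression; delta = Var(Y error) / Var(X error)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx, syy = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    slope = (syy - delta * sxx +
             np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    return y.mean() - slope * x.mean(), slope     # intercept, slope

# Hypothetical paired readings from a reference method and a customer device
nist   = np.array([0.20, 0.50, 1.00, 1.50, 2.00, 2.50, 3.00])
device = np.array([0.15, 0.45, 0.95, 1.55, 1.95, 2.60, 2.95])
b0, b1 = deming(nist, device, delta=1.5)
print(f"Device = {b0:.3f} + {b1:.3f} NIST")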

Plotting the Size of the Ozone Hole 

Usually holes just get bigger over time. Like the holes in my socks and sweaters.   

But what about the size of the hole in the ozone layer above Antarctica? 

As part of the Ozone Hole Watch project, NASA scientists have been tracking the size of the ozone hole of the Southern Hemisphere for years. I copied the data into a Minitab project, and then used Graph > Time Series Plot > Multiple to plot both the mean ozone hole area and the maximum daily ozone hole area, by year. 

Time series plot

The plot shows why the ozone hole was such a big deal back in the 1980s. The size of the hole was increasing at extremely high rates, trending toward a potential environmental crisis. No wonder, then, that on September 16, 1987, the United Nations adopted the Montreal Protocol, an international agreement to reduce ozone-depleting substances such as chlorofluorocarbons. That agreement, eventually signed by nearly 200 nations, is credited with stabilizing the size of the ozone hole at the end of the 20th century, according to NASA and the World Meteorological Organization.

One-Way ANOVA: Seasonal Changes in the Ozone Layer

The ozone layer is not static, but varies by latitude, season, and stratospheric conditions. On average, the "typical" thickness of the ozone layer is about 300 Dobson units (DU). 

The Lauder Ozone worksheet in the Minitab project linked above contains random samples of total ozone column measurements taken in Lauder, New Zealand in 2013. For this analysis, the seasons are defined as Summer = Dec-Feb, Fall = Mar-May, Winter = June-August, and Spring = Sept-Nov. 

To evaluate whether there are statistically significant differences in mean ozone by season using Minitab, choose Stat > ANOVA > One-Way... In the dialog box, select Response data are in a separate column for each factor level. As Responses, enter Summer, Fall, Winter, Spring.  Click Options, and uncheck Assume equal variances. Click Comparisons and check Games-Howell. After you click OK in each dialog box, Minitab returns the following output.

interval plot ozone

ozone session window

At a 0.05 level of significance, the p-value (≈ 0.000) is less than alpha. Thus, we can conclude that there is a statistically significant difference in mean ozone thickness by season. The plot shows that the mean ozone is lowest in Summer and Fall, and highest in Spring. 

Look at the 95% confidence intervals (CI). Are any seasons likely to have a mean ozone thickness less than 300 DU? Greater than 300 DU?  Based on the pairwise comparisons chart, for which seasons does the mean ozone layer significantly differ?
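
To try a comparable test outside of Minitab, the sketch below uses SciPy with made-up Dobson-unit values. Note that scipy.stats.f_oneway is the classic equal-variance ANOVA, whereas the analysis above uses the Welch version (equal variances not assumed) and Games-Howell comparisons.

from scipy import stats

# Hypothetical ozone measurements (Dobson units) by season
summer = [272, 265, 280, 275, 268, 270]
fall   = [274, 268, 275, 270, 266, 271]
winter = [300, 310, 305, 295, 308, 302]
spring = [330, 325, 340, 335, 328, 332]

f_stat, p_value = stats.f_oneway(summer, fall, winter, spring)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p < alpha (0.05), conclude that at least one seasonal mean differs.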

The ozone layer is just one factor in the myriad complex relationships between human activity and the global environment. So these analyses are just the tip of the iceberg—one that's melting as we speak.

Problems Using Data Mining to Build Regression Models

Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.

We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.

The comment below is a real one that illustrates this point.

“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”

Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"

In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.

An Example of Using Data Mining to Build a Regression Model

My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated predictor variables explains most of the changes in a randomly generated response variable!

To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use Calc > Random Data > Normal to create your own worksheet with random data, or you can use this worksheet that I created for the data mining example below. (If you don’t have Minitab and want to try this out, get the free 30 day trial!)

Next, I’ll perform stepwise regression using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.

When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we're clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about the dangers of overfitting models.

I'm specifically addressing the problems of data mining in this post, so I don't want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for every 10 observations. We have 30 observations, so I'll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.

Stepwise regression output

Under step 3, we can see that all of the coefficient p-values are statistically significant. The R-squared value of 67.54% can either be good or mediocre depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at the adjusted and predicted R-squared values and neither one suggests a problem.

If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?

Scatterplot of two variables in regression model

This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive regression coefficient in the output. If we didn’t know better, we’d think that we have a good model!

This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.
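
If you'd like to reproduce this kind of demonstration without Minitab, here is a minimal sketch using NumPy and statsmodels: 99 columns of pure noise, 30 rows, and a greedy forward selection that keeps whichever candidate has the smallest p-value at each step. It illustrates the idea; it is not Minitab's stepwise algorithm.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(30, 100)),
                  columns=[f"C{i+1}" for i in range(100)])
y, candidates = df["C1"], list(df.columns[1:])

selected = []
for _ in range(3):                       # stop at three terms (~1 per 10 rows)
    best_p, best_var = 1.0, None
    for var in candidates:
        X = sm.add_constant(df[selected + [var]])
        p = sm.OLS(y, X).fit().pvalues[var]
        if p < best_p:
            best_p, best_var = p, var
    selected.append(best_var)
    candidates.remove(best_var)

final = sm.OLS(y, sm.add_constant(df[selected])).fit()
print(selected, f"R-squared = {final.rsquared:.1%}")
# Pure noise, yet the "significant" predictors make R-squared look respectable.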

The problems associated with using data mining are real, but how the heck do they happen? And, how do you avoid them? Read my next post to learn the answers to these questions!

Descriptive vs. Inferential Statistics: When Is a P-value Superfluous?

True or false: When comparing a parameter for two sets of measurements, you should always use a hypothesis test to determine whether the difference is statistically significant.

The answer? (drumroll...) True!

...and False!

To understand this paradoxical answer, you need to keep in mind the difference between samples, populations, and descriptive and inferential statistics. 

Descriptive Statistics and Populations

Consider the fictional countries of Glumpland and Dolmania.

Welcome to Glumpland!

The population of Glumpland is 8,442,012. The population of Dolmania is 6,977,201. For each country, the age of every citizen (to the nearest tenth) is recorded in a cell of a Minitab worksheet.

Using Stat > Basic Statistics > Display Descriptive Statistics, we can quickly calculate the mean age of each country.

desc stats

It looks like Dolmanians are, on average, more youthful than Glumplanders. But is this difference in means statistically significant?

To find out, we might be tempted to evaluate these data using a 2-sample t-test.

Except for one thing: there's absolutely no point in doing that.

That's because these calculated means are the means of the entire populations. So we already know that the population means differ.

Another example. Suppose a baseball player gets 213 hits in 680 at bats in 2015, and 178 hits in 532 at bats in 2016.

Would you need a 2-proportions test to determine whether the difference in batting averages (.313 vs .335) is statistically significant? Of course not.

You've already calculated the proportions using all the data for the entire two seasons. There's nothing more to extrapolate. And yet you often see a hypothesis test applied in this type of situation, in the mistaken belief that if there's no p-value, the results aren't "solid" or "statistical" enough.

But if you've collected every possible piece of data for a population, that's about as solid as you can get!

Inferential Statistics and Random Samples

Now suppose that draconian budget cuts have made it infeasible to track and record the age of every resident in Glumpland and Dolmania. What can they do?

Quite a lot, actually. They can apply inferential statistics, which is based on random sampling, to make reliable estimates without those millions of data values they don't have.

To see how it works, use Calc > Random Data > Sample from columns in Minitab. Randomly sample 50 values from the 8,442,012 values in column C1, which contains the ages of the entire population of Glumpland. Then use descriptive statistics to calculate the mean of the sample.

Here are the results for one random sample of 50:

Descriptive Statistics: GPLND (50)
Variable       Mean
GPLND(50)    52.37

The sample mean, 52.37, is slightly less than the true mean age of 53 for the entire population of Glumpland. What about another random sample of 50?

Descriptive Statistics: GPLND (50) 
Variable       Mean
GPLND(50)    54.11

Hmm. This sample mean of 54.11 slightly overshoots the true population mean of 53.

Even though the sample estimates are in the ballpark of the true population mean, we're seeing some variation. How much variation can we expect? Using descriptive statistics alone, we have no inkling of how "close" a sample estimate might be to the truth.  

Enter...the Confidence Interval

To quantify the precision of a sample estimate for the population, we can use a powerful tool in inferential statistics: the confidence interval.

Suppose you take random samples of size 5, 10, 20, 50, and 100 from Glumpland and Dolmania using Calc > Random Data > Sample from columns. Then use Graph > Interval Plot > Multiple Ys to display the  95% confidence intervals for the mean of each sample.

Here's what the interval plots look like for the random samples in my worksheet.

interval plot Glumpland

Interval plot Dolmania

Your plots will look different based on your random samples, but you should notice a similar pattern: The sample mean estimates (the blue dots) tend to vary more from the population mean as the sample sizes decrease. To compensate for this, the intervals "stretch out" more and more, to ensure the same 95% overall probability of "capturing" the true population mean.

The larger samples produce narrower intervals. In fact, using only 50-100 data values, we can closely estimate the mean of over 8.4 million values, and get a general sense of how precise the estimate is likely to be. That's the incredible power of random sampling and inferential statistics!
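
You can see the same pattern with a quick simulation. The sketch below (NumPy and SciPy assumed) draws samples of increasing size from a made-up population of ages and prints the 95% confidence interval for the mean; the population here is simulated, not the Glumpland worksheet.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
population = rng.normal(loc=53, scale=15, size=1_000_000)   # simulated ages

for n in (5, 10, 20, 50, 100):
    sample = rng.choice(population, size=n, replace=False)
    mean, sem = sample.mean(), stats.sem(sample)
    lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
    print(f"n={n:>3}  mean={mean:5.1f}  95% CI: ({lo:5.1f}, {hi:5.1f})")
# The intervals tighten noticeably as the sample size grows.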

To display side-by-side confidence intervals of the mean estimates for Glumpland and Dolmania, you can use an interval plot with groups.

interval plot side by side

Now, you might be tempted to use these results to infer whether there's a statistically significant difference in the mean age of the populations of Glumpland and Dolmania. But don't. Confidence intervals can be misleading for that purpose.

For that, we need another powerful tool of inferential statistics...

Enter...the Hypothesis Test and P-value

The 2-sample t-test is used to determine whether there is a statistically significant difference in the means of the populations from which the two random samples were drawn. The following table shows the t-test results for each pair of same-sized samples from Glumpland and Dolmania. As the sample size increases, notice what happens to the p-value and the confidence interval for the difference between the population means.

t tests

Again, the confidence intervals tend to get wider as the samples get smaller. With smaller samples, we're less certain of the precision of the estimate for the difference.

In fact, only for the two largest random samples (N=50 and N=100) is the p-value less than a 0.05 level of significance, allowing us to conclude that the mean ages of Glumplanders and Dolmanians are statistically different. For the three smallest samples (N=20, N=10, N=5), the p-value is greater than 0.05, and the confidence interval for each of these small samples includes 0. Therefore, we cannot conclude that there is a difference in the population means.

But remember, we already know that the true population means actually do differ by 5.4 years. We just can't statistically "prove" it with the small samples. That's why statisticians bristle when someone says, "The p-value is not less than 0.05. Therefore, there's no significant difference between the groups." There might very well be. So it's safer to say, especially with small samples, "we don't have enough evidence to conclude that there's a significant difference between the groups."

It's not just a matter of nit-picky semantics. It's simply the truth, as you can see when you take random samples of various sizes from the same known populations and test them for a difference.
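
Here is a minimal simulation of the same idea, using SciPy's 2-sample (Welch) t-test on two made-up populations whose means differ by 5.4 years, mirroring the fictional countries above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
glumpland = rng.normal(53.0, 15, size=1_000_000)   # simulated stand-ins
dolmania  = rng.normal(47.6, 15, size=1_000_000)

for n in (5, 10, 20, 50, 100):
    a = rng.choice(glumpland, size=n, replace=False)
    b = rng.choice(dolmania, size=n, replace=False)
    t_stat, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
    verdict = "difference detected" if p < 0.05 else "not enough evidence"
    print(f"n={n:>3}  p={p:.3f}  -> {verdict}")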

Wrap-up

If you have a random sample, you should always accompany estimates of statistical parameters with a confidence interval and p-value, whenever possible. Without them, there's no way to know whether you can safely extrapolate to the entire population. But if you already know every value of the population, you're good to go. You don't need a p-value, a t-test, or a CI—any more than you need a clue to determine what's inside a box, if you already know what's in it.


Validating Process Changes with Design of Experiments (DOE)

We’ve got a plethora of case studies showing how businesses from different industries solve problems and implement solutions with data analysis. Take a look for ideas about how you can use data analysis to ensure excellence at your business!

Boston Scientific, one of the world's leading developers of medical devices, is just one organization that has shared its story. A team at its Heredia, Costa Rica facility assessed and validated a packaging process, which led to a streamlined process and a cost-saving redesign of the packaging.

Below is a brief look at how they did it, but you can also take a look at the full case study at https://www.minitab.com/Case-Studies/Boston-Scientific/.

Their Challenge

Boston Scientific Heredia evaluates its operations regularly, to maintain process efficiency and contribute to affordable healthcare by reducing costs. At this facility, one packaging engineer led an effort to streamline packaging for guidewires—which are used during procedures such as catheter placement or endoscopic diagnoses—with the introduction of a new, smaller plastic pouch.

Using smaller and different packaging materials for their guidewires would substantially reduce material costs, but the company needed to prove that the new pouches would work with their sealing process, which creates a barrier that keeps the guidewires sterile.

How Data Analysis Helped

To ensure that the seal strength for the smaller pouches met or exceeded standards, they evaluated the process and identified several important factors, such as the temperature of the sealing system. They then used a statistical method called Design of Experiments (DOE) to determine how each of the variables affected the quality of the pouch seal.

The DOE revealed which factors were most critical. Below is a Minitab Pareto Chart that identified the factors that significantly affect seal strength: front temperature, rear temperature, and their respective two-way interaction.

https://www.minitab.com/uploadedImages/Content/Case_Studies/EffectsParetoforAveragePull.jpg

Armed with this knowledge, the team devised optimal process settings to ensure the new pouches had strong seals. To verify the effectiveness of the improved process, they used a statistical tool called capability analysis, which demonstrates whether or not a process meets specifications and can produce good results:

https://www.minitab.com/uploadedImages/Content/Case_Studies/ProcessCapabilityofHighSettings-SealStrength.jpg
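
The capability index behind a chart like this is simple to compute by hand. The sketch below uses made-up seal-strength numbers and specification limits (the actual Boston Scientific data are not public), and it uses the overall standard deviation, which Minitab labels Ppk; Cpk proper uses the within-subgroup standard deviation.

import numpy as np

strength = np.array([6.1, 5.9, 6.3, 6.0, 6.2, 5.8, 6.1, 6.4, 6.0, 6.2])  # hypothetical
lsl, usl = 5.0, 7.5                                                       # hypothetical specs

mu, sigma = strength.mean(), strength.std(ddof=1)
ppk = min(usl - mu, mu - lsl) / (3 * sigma)
print(f"Ppk = {ppk:.2f}")   # values above roughly 1.33 are commonly considered capable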

Results

The analysis showed that guidewires packaged using the new, optimal process settings met, and even exceeded, the minimum seal strength requirements.

With the new pouches, Boston Scientific has saved more than $330,000. “At the end of the day,” a key team member noted, “the more money we save, the more additional savings we can pass on to the people we serve.”

For another example of how Boston Scientific uses data analysis to ensure the safety and reliability of its products, read Pulling Its Weight: Tensile Testing Challenge Speeds Regulatory Approval for Boston Scientific, a story about how the company used Minitab Statistical Software to confirm the equivalency of its catheter’s pull-wire strength to previous testing results, and eliminate the need to perform test method validation by leveraging its existing tension testing standard.

How to Save a Failing Regression with PLS

Face it, you love regression analysis as much as I do. Regression is one of the most satisfying analyses in Minitab: get some predictors that should have a relationship to a response, go through a model selection process, interpret fit statistics like adjusted R-squared and predicted R-squared, and make predictions. Yes, regression really is quite wonderful.

Except when it’s not. Dark, seedy corners of the data world exist, lying in wait to make regression confusing or impossible. Good old ordinary least squares regression, to be specific.

For instance, sometimes you have a lot of detail in your data, but not a lot of data. Want to see what I mean?

  1. In Minitab, choose Help > Sample Data...
  2. Open Soybean.mtw.

The data set has 88 variables about soybeans, the results of near-infrared (NIR) spectroscopy at different wavelengths. But it contains only 60 measurements, and the data are arranged to hold out 6 measurements for validation runs.

A Limit on Coefficients

With ordinary least squares regression, you can estimate only as many coefficients as the data have samples. Thus, the traditional method that's satisfactory in most cases would let you estimate only 53 coefficients for variables, plus a constant coefficient.

This could leave you wondering about whether any of the other possible terms might have information that you need.

Multicollinearity

The NIR measurements are also highly collinear with each other. This multicollinearity complicates using statistical significance to choose among the variables to include in the model.

When the data have more variables than samples, especially when the predictor variables are highly collinear, it’s a good time to consider partial least squares regression.

How to Perform Partial Least Squares Regression

Try these steps if you want to follow along in Minitab Statistical Software using the soybean data:

  1. Choose Stat > Regression > Partial Least Squares.
  2. In Responses, enter Fat.
  3. In Model, enter ‘1’-‘88’.
  4. Click Options.
  5. Under Cross-Validation, select Leave-one-out. Click OK.
  6. Click Results.
  7. Check Coefficients. Click OK twice.

One of the great things about partial least squares regression is that it forms components and then does ordinary least squares regression with them. Thus the results include statistics that are familiar. For example, predicted R-squared is the criterion that Minitab uses to choose the number of components.


Minitab selects the model with the highest predicted R-squared.

Each of the 9 components in the model that maximizes the predicted R-squared value is a complex linear combination of all 88 of the variables. So although the ANOVA table shows that you're using only 9 degrees of freedom for the regression, the analysis uses information from all of the data.

The regression uses 9 degrees of freedom.
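
For comparison, here is how the same component-selection idea might look with scikit-learn's PLSRegression instead of Minitab, choosing the number of components by leave-one-out cross-validated R-squared; the CSV file name is a hypothetical export of the Soybean worksheet.

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

df = pd.read_csv("soybean_nir.csv")              # hypothetical export of Soybean.mtw
X, y = df.drop(columns="Fat"), df["Fat"]

best_r2, best_k = -np.inf, None
for k in range(1, 16):
    # Leave-one-out predictions for a k-component PLS model
    pred = cross_val_predict(PLSRegression(n_components=k), X, y, cv=LeaveOneOut())
    r2 = 1 - np.sum((y - pred.ravel()) ** 2) / np.sum((y - y.mean()) ** 2)
    if r2 > best_r2:
        best_r2, best_k = r2, k

print(f"Components chosen by cross-validation: {best_k} (R-squared = {best_r2:.3f})")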

 The full list of standardized coefficients shows the relative importance of each predictor in the model. (I’m only showing a portion here because the table is 88 rows long.)


Each variable has a standardized coefficient.

Ordinary least squares regression is a great tool that's allowed people to make lots of good decisions over the years. But there are times when it's not satisfying. Got too much detail in your data? Partial least squares regression could be the answer.

Want more partial least squares regression now? Check out how Unifi used partial least squares to improve their processes faster.

The image of the soybeans is by Tammy Green and is licensed for reuse under this Creative Commons License.

5 Powerful Insights from Noted Quality Leaders

If you were among the 300 people who attended the first-ever Minitab Insights conference in September, you already know how powerful it was. Attendees learned how practitioners from a wide range of industries use data analysis to address a variety of problems, find solutions, and improve business practices.

In the coming weeks and months, we will share more of the great insights and guidance shared by our speakers and attendees. But here are five helpful, challenging, and thought-provoking ideas and suggestions that we heard during the event.

You Can Get More Information from VOC Data.

Joel Smith of the Dr. Pepper Snapple Group used the assessment of different beers to show how applying the tools in Minitab can help a business move from raw Voice of the Customer (VOC) data to actionable insights. His presentation showed how to use graphical analysis and descriptive statistics to clean observational VOC data, and then how to use cluster analysis, principal component analysis, and regression analysis to make informed decisions about how to create a better product.  

Consider Multiple Ways to Show Results. 

Graphs are often part of a Minitab analysis, but a graph may not be the only way to visualize your results. Think about your audience and your communication goals when choosing and customizing your graphs, suggested Rip Stauffer, senior consultant at Management Science and Innovation. He showed examples of how the same information comes across very differently when presented in various charts, and when colors, thicknesses, and styles are selected carefully. Along the way, he also illustrated Minitab's flexibility in tailoring the appearance of a graph to fit your needs. 

Quality Methods Make Great Sales Tools.

We hear all the time about the impact of quality improvement methods on manufacturing. But what about using statistical analysis to boost sales? Andrew Mohler from global chemical company Buckman explained how training technical sales associates to use data analysis and Minitab has transformed the company's business. Empowering the sales team to help customers improve their processes has enabled the company to provide more value and to drive sales—boosting the bottom line.

Data-Driven Cultures Have Risks, Too.

In the quality improvement world, we tend to think that transforming an organization's culture so everyone understands the value of data analysis only brings benefits. But Richard Titus, a consultant and adjunct instructor at Lehigh University who has worked with Crayola, Ingersoll-Rand, and many other organizations, highlighted potential traps for organizations with a high level of statistical knowledge. These include trying to find data to fit favored answer(s); working as a "lone ranger" independent of a team; failing to map and measure processes; not selecting a primary metric to measure success; searching for a "silver bullet;" and trying to outsmart the process. 

When Subgroup Sizes Are Large, Use P' Charts.

T. C. Simpson and M. E. Rusak from Air Products illustrated how using a traditional P chart to monitor a transactional process can lead to problems if you have a large subgroup size. False alarms or failure to detect special-cause variation can result from overdispersion or underdispersion in your data when your subgroup sizes are large. You can avoid these risks with a Laney P' control chart, which uses calculations that account for large subgroups. Learn more about the Laney P' chart. 

Watch for more stories, tips, and ideas from the Minitab Insights conference in future issues of Minitab News, and on the Minitab Blog.

Why Shrewd Experts "Fail to Reject the Null" Every Time

I watched an old motorcycle flick from the 1960s the other night, and I was struck by the bikers' slang. They had a language all their own. Just like statisticians, whose manner of speaking often confounds those who aren't hep to the lingo of data analysis.

It got me thinking...what if there were an all-statistician biker gang? Call them the Nulls Angels. Imagine them in their colors, tearing across the countryside, analyzing data and asking the people they encounter on the road about whether they "fail to reject the null hypothesis."

If you point out how strange that phrase sounds, the Nulls Angels will know you're not cool...and not very aware of statistics.

Speaking purely as an editor, I acknowledge that "failing to reject the null hypothesis" is cringe-worthy. "Failing to reject" seems like an overly complicated equivalent to accept. At minimum, it's clunky phrasing.

But it turns out those rough-and-ready statisticians in the Nulls Angels have good reason to talk like that. From a statistical perspective, it's undeniably accurate—and replacing "failure to reject" with "accept" would just be wrong.

What Is the Null Hypothesis, Anyway?

Hypothesis tests include one- and two-sample t-tests, tests for association, tests for normality, and many more. (All of these tests are available under the Stat menu in Minitab statistical software. Or, if you want a little more statistical guidance, the Assistant can lead you through common hypothesis tests step-by-step.)

A hypothesis test examines two propositions: the null hypothesis (or H0 for short), and the alternative (H1). The alternative hypothesis is what we hope to support. We presume that the null hypothesis is true, unless the data provide sufficient evidence that it is not.

You've heard the phrase "Innocent until proven guilty." That means innocence is assumed until guilt is proven. In statistics, the null hypothesis is taken for granted until the alternative is proven true.

So Why Do We "Fail to Reject" the Null Hypothesis?

That brings up the issue of "proof."

The degree of statistical evidence we need in order to “prove” the alternative hypothesis is the confidence level. The confidence level is 1 minus our risk of committing a Type I error, which occurs when you incorrectly reject the null hypothesis when it's true. Statisticians call this risk alpha, and also refer to it as the significance level. The typical alpha of 0.05 corresponds to a 95% confidence level: we're accepting a 5% chance of rejecting the null even if it is true. (In life-or-death matters, we might lower the risk of a Type I error to 1% or less.)

Regardless of the alpha level we choose, any hypothesis test has only two possible outcomes:

  1. Reject the null hypothesis and conclude that the alternative hypothesis is true at the 95% confidence level (or whatever level you've selected).
     
  2. Fail to reject the null hypothesis and conclude that not enough evidence is available to suggest the null is false at the 95% confidence level.

We often use a p-value to decide if the data support the null hypothesis or not. If the test's p-value is less than our selected alpha level, we reject the null. Or, as statisticians say "When the p-value's low, the null must go."
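
In code, the decision rule is nothing more than a comparison of the p-value with alpha. Here is a minimal sketch with SciPy and made-up measurements.

from scipy import stats

sample = [5.2, 4.9, 5.1, 5.3, 5.0, 4.8, 5.2]     # hypothetical measurements
alpha = 0.05                                      # corresponds to 95% confidence

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)   # H0: population mean = 5.0
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")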

This still doesn't explain why a statistician won't "accept the null hypothesis." Here's the bottom line: failing to reject the null hypothesis does not mean the null hypothesis is true. That's because a hypothesis test does not determine which hypothesis is true, or even which is most likely: it only assesses whether evidence exists to reject the null hypothesis.

"My hypothesis is Null until proven Alternative, sir!"  "Null Until Proven Alternative"

Hark back to "innocent until proven guilty." As the data analyst, you are the judge. The hypothesis test is the trial, and the null hypothesis is the defendant. The alternative hypothesis is the prosecution, which needs to make its case beyond a reasonable doubt (say, with 95% certainty).

If the trial evidence does not show the defendant is guilty, neither has it proved that the defendant is innocent. However, based on the available evidence, you can't reject that possibility. So how would you announce your verdict?

"Not guilty."

That phrase is perfect: "Not guilty" doesn't say the defendant is innocent, because that has not been proven. It just says the prosecution couldn't convince the judge to abandon the assumption of innocence.

So "failure to reject the null" is the statistical equivalent of "not guilty." In a trial, the burden of proof falls to the prosecution. When analyzing data, the entire burden of proof falls to your sample data. "Not guilty" does not mean "innocent," and "failing to reject" the null hypothesis is quite distinct from "accepting" it. 

So if a group of marauding statisticians in their Nulls Angels leathers ever asks, keep yourself in their good graces and show that you know "failing to reject the null" is not "accepting the null."

5 More Powerful Insights from Noted Quality Leaders

We hosted our first-ever Minitab Insights conference in September, and if you were among the attendees, you already know the caliber of the speakers and the value of the information they shared. Experts from a wide range of industries offered a lot of great lessons about how they use data analysis to improve business practices and solve a variety of problems.

I blogged earlier about five key takeaways gleaned from the sessions at the Minitab Insights 2016 conference. But that was just the tip of the iceberg, and participants learned many more helpful things that are well worth sharing. So here are five more helpful, challenging, and thought-provoking ideas and suggestions that we heard during the event.

Improve Your Skills while Improving Yourself! 

Everyone has personal goals they'd like to achieve, such as getting fit, changing a habit, or writing a book. Rod Toro, deployment leader at Edward Jones, explained how challenging himself and his team to apply Lean and Six Sigma tools to their personal goals has helped them better understand the underlying principles of quality improvement, personalize their learning and gain deeper insights, and expand their ability to apply quality methods in a variety of circumstances and situations.

We Can't Claim the Null Hypothesis Is True.

Minitab technical training specialist Scott Kowalski reminded us that when we test a hypothesis with statistics, "failing to reject the null" does not prove that the null hypothesis is true. It only means we don't have enough evidence to reject it. We need to keep this in mind when we interpret our results, and to be careful how we explain our findings to others. We also need to be sure our hypotheses are clearly stated, and that we've selected the appropriate test for our task!

Outliers Won't Just Be Ignored, So You'd Better Investigate Them. 

We've all seen them in our data: those troublesome observations that just don't want to belong, lurking off in the margins, maybe with one or two other loners. It can be tempting to ignore or just delete those observations, but Larry Bartkus, senior distinguished engineer at Edwards Lifesciences, provided vivid illustrations of the drastic impact outliers can have on the results of an analysis. He also reminded us of the value of slowing down, questioning our assumptions, looking at the data in several ways, and trying to understand why our data is the way it is.

Attribute Agreement Analysis Is Just One Option.

When we need to assess how well an attribute measurement system performs, attribute agreement analysis is the go-to method—but Thomas Rust, reliability engineer at Autoliv, demonstrated that many more options are available. In encouraging quality practitioners to "break the attribute paradigm," Rust detailed four innovative ways to assess an attribute measurement system: measure an underlying variable; attribute measurement of a variable product; variable measurement of an attribute product; and attribute measurement of an attribute product.

Minitab Users Do Great Things.

More than anything else, what we took away from Minitab Insights 2016 was an even greater appreciation for the people who are using our software in innovative ways—to increase the quality of the products we use every day, to raise the level of service we receive from businesses and organizations, to increase the efficiency and safety of our healthcare providers, and so much more.

Watch for more stories and ideas from the Minitab Insights conference in future issues of Minitab News, and on the Minitab Blog.
