Mathematics/Statistics/Chi-Square Test

From Dev Wiki
Revision as of 06:55, 16 May 2020 by Brodriguez (talk | contribs) (Add missing section)

The Chi-Square test, alternatively called the χ² test, is used to measure two possible things:

  • Chi-Square Goodness of Fit test - Determines if a set of sample data matches a larger population.
  • Chi-Square Test for Independence - Determines if two variables are at all correlated.


At its core, all Chi-Square tests do the following:

  • Start with some claim.
  • Get a random sampling of data to test against the claim.
  • Use probability analysis to determine how likely our sample is to have occurred.
  • If our sample is "too unlikely", then we reject the initial claim.


General Notation

In general, the Chi-Square statistic is written χ² and represented by a formula. The notations are as follows:

  • O stands for "Observed/Actual".
  • E stands for "Expected".
  • Σ (summation) spans across some data set of size n.

With this in mind we have the following formulas:

Scientific Notation

Formal:

    χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ

Less Formal:

    χ² = Σ (Observed − Expected)² / Expected

Direct Notation

Formal:

    χ² = (O₁ − E₁)²/E₁ + (O₂ − E₂)²/E₂ + ... + (Oₙ − Eₙ)²/Eₙ

Less Formal:

    For each data point: subtract the expected value from the observed value, square the result, and divide by the expected value. Then add all of these together.

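The Chi-Square formula described in this section can be sketched in Python; this is a minimal hand-rolled illustration, not a library implementation:

```python
def chi_square(observed, expected):
    """Compute the Chi-Square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: observed counts vs. uniform expected counts.
print(chi_square([20, 20, 25, 30], [25, 25, 25, 25]))  # -> 3.0
```

A statistic of 0 means the observed data matches the expected data exactly; larger values mean larger deviation.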

Initial Variables

The following variables are used in both types of Chi-Square tests. Determining these is the first step before doing anything else.

Null Hypothesis

The Null Hypothesis is integral to the Chi-Square test. This is effectively a special way of saying "this is what we're testing for."

This null hypothesis is represented by H₀, and should essentially always be a declaration in support of our "expected" outcome.

Alternative Hypothesis

To counter H₀, we have an Alternative Hypothesis, represented as H₁. Often, this is as simple as "Our H₀ is not true."

P-Value

Finally, we have a P-Value, also known as α. In layman's terms, this can be thought of as a Significance Level. We use this at the end to determine if our result is significant or not.

Our P-Value is always between 0 and 1, and is generally chosen based on the expected distribution for our population.
Since a normal distribution is one of the most common distribution types, most of the time our P-Value = 0.05.
This is because, in a normal distribution, roughly 95% of values fall within two standard deviations of the mean, so we only care if we hit an instance outside of that.

Goodness of Fit Test

The Goodness of Fit Chi-Square test is used to evaluate if a smaller sample population matches a larger one.
A lot of times, we'll doubt if something works the way it claims, so we get a sample subset and compare results.
If our final result is below our P-Value, then it means our sample was unlikely enough to reject the initial claim.

Running the Test

Once we determine our initial variables, we can conduct our test. We plug our values into the above formulas and get a result.

We then calculate Degrees of Freedom (Df), which is just a fancy way of saying "number of possible outcomes, minus 1".
We use this Df value to look up a Chi-Square probability table (it's probably best to just google this) and find the appropriate row.
On this row, we find the rough equivalent of what our above formula gave us, and then note the value at the very top of this column.

Finally, we look back at our P-Value. If the table value is greater than our P-Value, we accept H₀ as valid. If the table value is less than our P-Value, we reject H₀ as invalid and instead accept our H₁.
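This decision procedure can be sketched in Python. The critical values below are the standard Chi-Square table entries for a significance level of 0.05; double-check them against a published table before relying on them:

```python
# Chi-Square critical values at significance level 0.05, keyed by Df.
# (Standard table entries; verify against a published Chi-Square table.)
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}

def decide(chi_square_stat, df):
    """Reject the null hypothesis if the statistic exceeds the critical value."""
    if chi_square_stat > CRITICAL_05[df]:
        return "reject H0, accept H1"
    return "fail to reject H0"

print(decide(3.0, 3))  # 3.0 is below 7.815 -> "fail to reject H0"
```

Comparing the statistic to the critical value is equivalent to comparing the table's probability column to the P-Value, which is what the article's table-lookup procedure describes.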

Example

Background

Hypothetically, let's say you're a student. A big, national test is coming up, and all the questions are multiple choice with answers of A, B, C, or D.
The publisher of the test claims that "all our tests are structured this way, and there is an equal chance of any letter being correct for any question."
In other words, one would expect each answer to come up as the correct answer exactly 25% of the time.

When you get the results back, you're not confident in what the publisher claimed. For your specific version of the test, the correct answers were dispersed as follows:

A: 20%
B: 20%
C: 25%
D: 30%

Note that, due to random chance, there is always going to be some variation from that original 25%. But you feel like these results are a bit extreme, so you investigate further.
One way to do so is the Chi-Square Goodness of Fit test.

Initial Variables

The first step is always to determine our H₀, H₁, and α.

H₀ can be "The publisher's claim is correct. Each possible answer has a 25% chance of occurring."
H₁ can be "The publisher's claim is incorrect and there is a bias towards one or more possible answers."
Since we expect a normal distribution, our α will be 0.05.

Using the Formula

We can consider our version of the test to be an adequate sample of a larger population. The "larger population" is considered to be "all tests and test versions created by the publisher".
To word it another way, our test answers can be the "actual" and the claim by the publisher can be the "expected". Thus we can proceed without any additional information.

In this case, assuming a test of 100 questions (so the percentages above map directly to counts), we have the following formula values:

    O = {20, 20, 25, 30}, E = 25 for each answer

    (20 − 25)² / 25 = 1
    (20 − 25)² / 25 = 1
    (25 − 25)² / 25 = 0
    (30 − 25)² / 25 = 1

    χ² = 1 + 1 + 0 + 1 = 3

We also have 4 possible outcomes (aka four different possible test answers), so our Degrees of Freedom is:

    Df = 4 − 1 = 3
Looking up an external Chi-Square table on Google, the Df = 3 row indicates that a value of 6.251 occurs at column 0.10.
In other words, 10% of the time, with Df = 3, a random sample of the population will produce a χ² value of 6.251 or higher.
Our computed χ² falls below 6.251, so a sample like ours occurs more than 10% of the time. Comparing this 0.10 value back to our P-Value of 0.05, we note that our computed table value is higher.

With this, we can conclude that the values of our test are actually not as extreme as we originally thought, and we fail to reject our null hypothesis H₀.
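The whole Goodness of Fit example can be put together in a short Python sketch (again assuming a hypothetical 100-question test, so the percentages map directly to counts):

```python
def chi_square(observed, expected):
    """Chi-Square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed correct-answer counts for A, B, C, D on an assumed
# 100-question test, vs. the publisher's claimed 25 each.
observed = [20, 20, 25, 30]
expected = [25, 25, 25, 25]

stat = chi_square(observed, expected)   # 3.0
df = len(observed) - 1                  # 4 outcomes -> Df = 3

# From a Chi-Square table: 6.251 sits at the 0.10 column for Df = 3,
# so a statistic of 3.0 occurs more than 10% of the time by chance.
print(stat, df)  # -> 3.0 3
```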


Test for Independence

The Test for Independence Chi-Square test is used to evaluate if two variables are correlated in some way.
In other words, given two variables, we want to know: "Does one of these variables appear to influence the other in some way?"

Running the Test

For this test, we will essentially always have the following:

  • is "there is no association between the variables".
  • is "there is an association between the variables".

Gathering Data

With that in mind, we start by gathering some sample data.
With this sample data, we record values for each possible outcome of each category and write it in the format of a table.
One variable will span the table rows and the other will span the table columns. For each row and column, we record totals as well.
All of this serves as our "actual/observed" values.

Next, we have to calculate our "expected values". With our H₀ in mind, our expected values should assume "no correlation".
So for each table cell, we take the row and column totals as percentages of the grand total, multiply them together, and multiply by the grand total. This provides the "expected" value for that cell: expected = (row total × column total) / grand total.
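This expected-value calculation can be sketched as a small, generic Python helper:

```python
def expected_table(observed):
    """Build the "expected" table assuming no association:
    each cell is (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Handedness (rows) vs. favorite color (columns), as in the example below.
observed = [[4, 5, 5, 11], [49, 35, 34, 31], [9, 6, 5, 6]]
print(expected_table(observed)[0])  # first row -> [7.75, 5.75, 5.5, 6.0]
```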

Testing Conditions

At this point, we want to test for the following conditions:

  • The sample was taken randomly from the population.
  • Our sample size can be no more than 10% of the total population.
  • All calculated "expected" value data points are at least 5. If not, you need to gather more samples until they are.

Note: The "observed/actual" data points can be lower than 5, as long as the "expected" data points calculate to 5 or higher.

Using the Test

Similar to the Goodness of Fit test, we use our Chi-Square formula, calculate our Degrees of Freedom, and reference an external Chi-Square table.

The only major difference in this test is how we calculate Degrees of Freedom. In this case, the formula is:

    Df = (number of rows − 1) × (number of columns − 1)
Example

Background

For a statistics class, you're supposed to conduct a survey to show your knowledge of Chi-Squared association.
So you decide to test if there's a correlation between "favorite color" and being "left handed or right handed".

After polling 200 different, random people, you get the following values:

"Actual/Observed" Values
Blue Green Purple Red Total
Left Handed 4 5 5 11 25
Right Handed 49 35 34 31 149
Ambidextrous 9 6 5 6 26
Total 62 46 44 48 200


Assuming there's no correlation, we can use the row and column sums to compute the following "Expected" table:

"Expected" Value Calculations
Blue Green Purple Red Total
Left Handed (62/200) * (25/200) * 200

0.31 * 0.125 * 200
(46/200) * (25/200) * 200

0.23 * 0.125 * 200
(44/200) * (25/200) * 200

0.22 * 0.125 * 200
(48/200) * (25/200) * 200

0.24 * 0.125 * 200
25α = 0.05 {\displaystyle \alpha =0.05} {\displaystyle \alpha =0.05}
Right Handed (62/200) * (149/200) * 200

0.31 * 0.745 * 200
(46/200) * (149/200) * 200

0.23 * 0.745 * 200
(44/200) * (149/200) * 200

0.22 * 0.745 * 200
(48/200) * (149/200) * 200

0.24 * 0.745 * 200
149
Ambidextrous (62/200) * (26/200) * 200

0.31 * 0.13 * 200
(46/200) * (26/200) * 200

0.23 * 0.13 * 200
(44/200) * (26/200) * 200

0.22 * 0.13 * 200
(48/200) * (26/200) * 200

0.24 * 0.13 * 200
26
Total 62 46 44 48 200


We can simplify this to:

"Expected" Values
Blue Green Purple Red Total
Left Handed 7.75 5.75 5.5 6 25
Right Handed 46.19 34.27 32.78 35.76 149
Ambidextrous 8.06 5.98 5.72 6.24 26
Total 62 46 44 48 200

Initial Variables

As with any standard Chi-Square test, we need to determine H₀, H₁, and α. The first two are basically given:

  • H₀ is "there is no association between the variables".
  • H₁ is "there is an association between the variables".

Then we once again expect a normal distribution, so we have α = 0.05.

Testing Conditions

At this point, we need to test our conditions, to make sure we have meaningful data for the test:

  • We know we sampled 200 random people on the street, so our sample set is random.
  • We know that there's no way 200 people is more than 10% of the total population.
  • Looking at our final expected table, no values are less than 5.

With this, we can proceed to run our Chi-Square test.

Using the Formula

At this point, we can use the standard Chi-Square formula, plugging in the "observed" and "expected" values from our tables:

    (4 − 7.75)²/7.75 + (5 − 5.75)²/5.75 + (5 − 5.5)²/5.5 + (11 − 6)²/6
    + (49 − 46.19)²/46.19 + (35 − 34.27)²/34.27 + (34 − 32.78)²/32.78 + (31 − 35.76)²/35.76
    + (9 − 8.06)²/8.06 + (6 − 5.98)²/5.98 + (5 − 5.72)²/5.72 + (6 − 6.24)²/6.24

    χ² ≈ 1.8145 + 0.0978 + 0.0455 + 4.1667 + 0.1709 + 0.0156 + 0.0454 + 0.6336 + 0.1096 + 0.0001 + 0.0906 + 0.0092 ≈ 7.1995

We calculate our Degrees of Freedom via Df = (rows − 1) * (cols − 1), so:

    Df = (3 − 1) * (4 − 1) = 6
Looking up an external Chi-Square table on Google, the Df = 6 row indicates that a value of 10.645 occurs at column 0.10.
In other words, 10% of the time, with Df = 6, a random sample of the population will produce a χ² value of 10.645 or higher.
This is much higher than our calculated value of about 7.1995, so we know we'll get a value like ours more than 10% of the time. Comparing back to our P-Value of 0.05, we note that our computed table value is much higher, so we fail to reject H₀.

We can safely conclude that this sample of 200 random people does not indicate any significant correlation between the variables.
Note that larger χ² values indicate stronger evidence of an association between the variables, while smaller values indicate less and less evidence.
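The full Test for Independence example can be sketched end-to-end in Python (12.592 is the standard Df = 6, α = 0.05 critical value from a Chi-Square table):

```python
def chi_square_independence(observed):
    """Chi-Square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected under H0
            stat += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Handedness (rows) vs. favorite color (columns).
observed = [[4, 5, 5, 11], [49, 35, 34, 31], [9, 6, 5, 6]]
stat, df = chi_square_independence(observed)
print(round(stat, 4), df)  # roughly 7.2, with Df = 6
print(stat < 12.592)       # below the 0.05 critical value -> True
```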