Mathematics/Statistics/Regression Lines
A regression line is effectively an attempt to fit some data to a straight line.
Basically, given some set of data points, you create a line that fits the data as closely as possible. This is accomplished by minimizing the total distance between all points and the line itself.
Since a regression line is a standard straight line, it can be represented by the standard {{ ic | <math>y = mx + b</math>}} equation.
== Correlation Coefficients ==
{{ Tip | This is often represented as '''r'''.}}
The '''Correlation Coefficient''' is a value that describes "how well can a straight line fit this data", and will always lie in the interval [-1, 1].
A value of exactly 1 indicates a perfect positive correlation between x and y. That is, as x increases, so does y, and as y increases, so does x.
A value of exactly -1 indicates a perfect negative correlation between x and y. That is, as x increases, y decreases, and as y increases, x decreases.
As values approach 0, the correlation becomes weaker and weaker, with 0 indicating no linear correlation between x and y at all.
The correlation coefficient is calculated with the following equation:
<math>r = \frac{1}{n - 1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{S_x}\right)\left(\frac{y_i-\bar{y}}{S_y}\right)</math>

where <math>\bar{x}</math> and <math>\bar{y}</math> are the means of x and y, and <math>S_x</math> and <math>S_y</math> are their standard deviations.
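As a concrete illustration, here is a minimal Python sketch of the formula above, using a small made-up dataset (all values and variable names are purely illustrative):

<syntaxhighlight lang="python">
# Minimal sketch of the correlation coefficient formula, using made-up data.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations (dividing by n - 1, to match the formula).
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# r = 1/(n-1) * sum of products of the standardized x and y values.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

print(r)  # close to 1: y grows roughly linearly with x
</syntaxhighlight>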
For further explanation, see [https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/scatterplots-and-correlation/v/calculating-correlation-coefficient-r this Khan Academy video].
== Residuals ==
Given a single data point, a '''residual''' is the vertical distance between that point and the regression line.
A positive residual value indicates that the data point is somewhere above the regression line. A negative residual value indicates that the data point is somewhere below the regression line. Larger absolute values indicate the point is farther away.
To calculate the residual for some point at <math>(p_x, p_y)</math>, we have the following equation for some <math>m</math> and <math>b</math> from our regression line:
<math>\mathrm{residual}_p = p_y - (m \cdot p_x + b)</math>
We can then take this a step further and combine all residuals in our dataset to get an overall view of "how closely the regression line matches our dataset".
<math>\sum_{i=1}^n (r_i)^2</math>

where <math>r_i</math> is the residual of the i-th data point.
Squaring means that negative residual values don't cancel out positive values. It also means that points farther away from the line end up with more weight than points closer to the line.
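As a rough sketch, here is how the residual formula and the squared-residual sum might look in Python; the candidate line and data points below are made up for illustration:

<syntaxhighlight lang="python">
# Minimal sketch: residuals and their squared sum for a line y = m*x + b.
# The slope, intercept, and data points are made-up illustrative values.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
m, b = 2.0, 0.0  # some candidate regression line

# residual_p = p_y - (m * p_x + b): positive above the line, negative below.
residuals = [p_y - (m * p_x + b) for p_x, p_y in points]

# Squaring keeps negative residuals from cancelling positive ones and
# gives far-away points more weight.
sum_of_squares = sum(res ** 2 for res in residuals)

print(residuals)
print(sum_of_squares)
</syntaxhighlight>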
== Least Squares Regression ==
'''Least Squares Regression''' is one of the more popular ways to fit a regression line: it minimizes the sum of squared residuals over our dataset. We can calculate this regression line with a few steps:
First, we note that the equation we want to calculate is ultimately some form of {{ ic | <math>y = mx + b</math>}}.
Our regression line will always go through the point <math>(\bar{x},\bar{y})</math>, which denotes the [[Statistics/Core_Measurements#Mean|means]] of our x and y. So it's safe to use that as the <math>x</math> and <math>y</math> in our equation. From there, we can move to calculating our <math>m</math> and <math>b</math>.
=== Calculating our M and B ===
Next, we can calculate the line's slope with the equation:
<math>m = r \frac{S_y}{S_x}</math>
Where
* <math>r</math> is the [[#Correlation Coefficients|correlation coefficient]] for our dataset.
* <math>S_y</math> is the [[Statistics/Core_Measurements#Standard Deviation|standard deviation]] of our y.
* <math>S_x</math> is the [[Statistics/Core_Measurements#Standard Deviation|standard deviation]] of our x.
We can then proceed to get our <math>b</math> with simple algebra:
<math>b = \bar{y} - m\bar{x}</math>
where
* <math>\bar{x}</math> is the [[Statistics/Core_Measurements#Mean|mean]] of our x.
* <math>\bar{y}</math> is the [[Statistics/Core_Measurements#Mean|mean]] of our y.
* <math>m</math> is the slope we just calculated above.
Putting all of this together, we can replace the m and b in our {{ ic | <math>y = mx + b</math>}}, which gives us the equation of our regression line.
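Putting the steps above into a short Python sketch (the dataset is the same made-up one used earlier; none of the names come from any particular library):

<syntaxhighlight lang="python">
# Minimal sketch of the least-squares steps above:
# m = r * (S_y / S_x) and b = y_bar - m * x_bar. Data is made up.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

m = r * (s_y / s_x)    # slope
b = y_bar - m * x_bar  # intercept

print(f"y = {m:.3f}x + {b:.3f}")
# Sanity check: the line passes through (x_bar, y_bar).
print(abs((m * x_bar + b) - y_bar) < 1e-9)  # True
</syntaxhighlight>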
For further explanation, see [https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/regression-library/v/calculating-the-equation-of-a-regression-line this Khan Academy video].