Detailed Explanation of Simple Linear Regression, Assessment, and Inference with ANOVA

A Step-by-Step Discussion with Worked Examples, Implemented Both Manually and in R
A linear relationship between two variables is very common. So, a lot of mathematical and statistical models have been developed to use this phenomenon and extract more information from the data. This article explains one of the most popular methods in statistics: Simple Linear Regression (SLR).
This Article Covers:

Development of a Simple Linear Regression model
Assessment of how well the model fits
Hypothesis testing using the ANOVA table
That’s a lot of material to learn in one day if you are reading this to learn. All the topics will be covered with a worked example. Please work through the example by yourself to understand it well.
Developing the SLR model should not be too hard; it’s pretty straightforward. Simply use the formulas to find your model, or use software. Both approaches are straightforward.

The assessment and the hypothesis-testing parts may be confusing if you are totally new to them. You may have to go over them a few times slowly. I will try to be precise and to the point.
Simple Linear Regression (SLR)
When a linear relation is observed between two quantitative variables, Simple Linear Regression can be used to take the explanation and assessment of that data further. Here is an example of a linear relationship between two variables:

The dots in this graph show a positive upward trend: if the hours of study increase, exam scores also increase. In other words, there is a positive correlation between the hours of study and the exam scores. From a graph like this, the strength and direction of the correlation between the two variables can be inferred. But it is not possible to quantify the correlation, that is, how much the exam score changes with each additional hour of study. If you could quantify that, it would be possible to forecast exam scores from the hours of study. That would be very useful, right?
Simple Linear Regression (SLR) does just that. It uses the old-school formula of the straight line that we all learned in school:

y = c + mx
Here,

y is the dependent variable,
x is the independent variable,
m is the slope and
c is the intercept
In the graph above, the exam score is the ‘y’ and the hours of study is the ‘x’. The exam score depends on the hours of study, so exam score is the dependent variable and hours of study is the independent variable.

The slope and intercept are to be determined using Simple Linear Regression.
Linear regression is all about fitting the best-fit line through the points and finding the intercept and the slope. If you can do that, you will be able to estimate the exam score whenever you have the hours-of-study data available. How accurate that estimate is will depend on some more information; we will get there slowly.
In statistics, beta0 and beta1 are the terms commonly used instead of c and m. So, the equation above looks like this:

y = beta0 + beta1*x
The red dotted line in the graph above should be as close as possible to the dots. The most common way of achieving that is the least squares regression method.
Regression Equations
The red dotted line in the graph above is called the least squares regression line. The line should be as close as possible to the dots. Its equation is:

y_hat = beta0 + beta1*x

Here y_hat is the estimated or predicted value of the dependent variable (the exam scores in the example above).

Remember, predicted values can be different from the original values of the dependent variable. In the graph above, the original data points are scattered, but the predicted or expected values from the equation above fall right on the red dotted line. So, there will be a difference between the original y and the predicted value y_hat.
The beta0 and beta1 can be calculated using the least squares regression formulas as follows:

beta1 = r * (Sy / Sx)

beta0 = y_bar - beta1 * x_bar

Here,

r is the sample correlation coefficient between the ‘x’ and ‘y’ variables
y_bar is the sample mean of the ‘y’ variable
x_bar is the sample mean of the ‘x’ variable
Sx is the sample standard deviation of the ‘x’ variable
Sy is the sample standard deviation of the ‘y’ variable
Example of Developing a Linear Regression Model
I hope the discussion above was clear. If not, that’s OK. Now we will work through an example that will make everything clear.

Here is the dataset to be used for this example:

This dataset contains the arm lengths and leg lengths of 30 people. The scatter plot looks like this:

Please feel free to download this dataset and follow along.

There is a linear trend here. Let’s see if we can develop a linear regression equation from the data that reasonably predicts the leg length from the arm length.
Here,

Arm length is the x-variable
Leg length is the y-variable
Let’s have a look at the formulas above. If we want to find the calculated values of y based on the arm length, we need to calculate beta0 and beta1.

Parameters required to calculate beta1: the correlation coefficient, the standard deviation of the arm lengths, and the standard deviation of the leg lengths.

Parameters required to calculate beta0: the mean of the leg lengths, beta1, and the mean of the arm lengths.
All the parameters can be calculated very easily from the dataset. I used R to calculate them; you can use any other language you are comfortable with.

First, read the dataset into RStudio:
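A minimal sketch of the read, assuming the downloaded file is saved as ‘arm_leg_data.csv’ in the working directory (the file name is my assumption; use whatever you named it):

```r
# Read the dataset into a data frame; the file name here is assumed
data <- read.csv("arm_leg_data.csv")
head(data)  # should show the 'arm' and 'leg' columns
```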
I already showed the whole dataset before. It has two columns, ‘arm’ and ‘leg’, which hold the length of the arms and the length of the legs of each person, respectively.

For the convenience of calculation, I will save the length of the arms and the length of the legs in separate variables:
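Something like this, continuing from the data frame read above:

```r
# Save each column in its own vector for convenience
arm <- data$arm
leg <- data$leg
```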
Here is how to find the mean and the standard deviation of the ‘arm’ and ‘leg’ columns:
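A sketch of those calls; the variable names are my own (‘s_leg’ reappears later in the adjusted R-squared calculation):

```r
x_bar <- mean(arm)  # sample mean of the arm lengths
y_bar <- mean(leg)  # sample mean of the leg lengths
s_arm <- sd(arm)    # sample standard deviation of the arm lengths
s_leg <- sd(leg)    # sample standard deviation of the leg lengths
```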
R also has a ‘cor’ function to calculate the correlation between two columns:
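For example, saving the result in a variable ‘r’, which the beta1 formula uses:

```r
r <- cor(arm, leg)  # Pearson correlation between the two columns
```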
Now we have all the information we need to calculate beta0 and beta1. Let’s use the formulas for beta0 and beta1 described before:
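A direct translation of the two formulas into R, using the quantities computed above:

```r
beta1 <- r * (s_leg / s_arm)    # slope: r * Sy / Sx
beta0 <- y_bar - beta1 * x_bar  # intercept: y_bar - beta1 * x_bar
```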
The beta1 and beta0 are 0.9721 and 1.9877 respectively.
I wanted to explain the process of working through a linear regression problem from scratch. Otherwise, R has the ‘lm’ function, to which you can simply pass the two variables; it outputs the slope (beta1) and the intercept (beta0).
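A sketch of that call. The model is saved in a variable ‘m’ because we reuse it later with the ‘anova’ function:

```r
# Fit the simple linear regression and keep the model object
m <- lm(leg ~ arm)
coef(m)  # intercept of about 1.9877 and slope of about 0.9721, per the values above
```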
Plugging in the values of slope and intercept, the linear regression equation for this dataset is:

y = 1.9877 + 0.9721x
If you know a person’s arm length, you can now estimate the length of his or her legs using this equation. For example, if the length of a person’s arms is 40.1, the length of that person’s legs is estimated to be:

y = 1.9877 + 0.9721*40.1

That comes to 40.97. This way, you can get the leg lengths of other people with different arm lengths as well.
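The same estimate can also be obtained from the fitted model; a quick sketch with ‘predict’:

```r
# Estimate the leg length for an arm length of 40.1
predict(m, newdata = data.frame(arm = 40.1))  # about 40.97
```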
But remember, this is just an estimate, a calculated value of the length of that person’s legs.

One caution though. When you use the arm length to calculate leg lengths, remember not to extrapolate: be aware of the range of the data you used in the model. For example, in this model we used arm lengths between 31 and 44.1 cm. Do not calculate the leg length for an arm length of 20 cm; that may not give you a correct estimate.
Interpreting the slope and intercept in plain language:

The slope of 0.9721 means that if the length of the arms changes by one unit, the length of the legs changes by 0.9721 units on average. Please focus on the word ‘average’. Not every person with an arm length of 40.1 will have a leg length of 40.97; it could be a little different. But our model suggests that on average it is 40.97. As you can see, not all the dots are on the red line. The red dotted line is nothing but the line of all the averages.

The intercept of 1.9877 means that if the length of the arms were zero, the length of the legs would still be 1.9877 on average. An arm length of zero is not possible, so in this case the intercept is only theoretical. But in other cases it can be meaningful. For example, think of the linear relationship between the hours of study and the exam score: even a student who did not study at all may still obtain some score.
How good is this estimate?

This is a good question, right? We can estimate, but how close is this estimate to the real length of that person’s legs? To explain that, we need to look at the regression line first.
Using the ‘abline’ function a regression line can be drawn in R:
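A sketch of the plotting code; the axis labels are my own:

```r
plot(arm, leg, xlab = "Arm length", ylab = "Leg length")  # scatter plot of the data
abline(m, col = "red", lty = 3)  # overlay the fitted line as a red dotted line
```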
Look at this picture. The original points (black dots) are scattered around, while the estimated points fall exactly on the red dotted line. So, for this dataset, the estimated leg length will often differ from the real leg length. It is therefore important to check how well the regression line fits the data.
To find that out, we need to really understand the y-values. For any given data point, there are three y-values to consider.

There is the real or observed y-value that we get from the dataset (in this example, the length of the legs). Let’s call each of these ‘y_i’.

There is the predicted y-value: the leg length calculated from the linear regression equation. Remember, it may differ from the original data point y_i. We will call it ‘y_ihat’ for this demonstration.

And there is the sample average of the y-variable, which we already calculated and saved in the variable ‘y_bar’.

For assessing how well the regression model fits the dataset, y_i, y_ihat, and y_bar will all be very important.
The distance between y_ihat and y_bar is called the regression component:

regression component = y_ihat - y_bar

The distance between the original y point y_i and the calculated y point y_ihat is called the residual component:

residual component = y_i - y_ihat
A rule of thumb: a regression line that fits the data well will have a regression component larger than the residual component across the data points. In contrast, a regression line that does not fit the data well will have the residual component larger than the regression component.

Makes sense, right? If the observed data points are very different from the calculated data points, then the regression line did not fit well. If all the data points fell on the regression line, the residual component would be zero or close to zero.
If we add the regression component and the residual component:

Total = (y_ihat - y_bar) + (y_i - y_ihat) = y_i - y_bar
How do we quantify this? You could simply subtract the mean ‘y’ (y_bar) from each observed y-value (y_i). But that will give you some positive and some negative values, and the negative and positive values will cancel each other out. So the total would not represent the real differences between the mean ‘y’ and the observed y-values.

The popular way to quantify this is to take the sum of squares. That way, there won’t be any negatives.
The total sum of squares, or ‘Total SS’, is:

Total SS = Σ (y_i - y_bar)^2

The regression sum of squares, or ‘Reg SS’, is:

Reg SS = Σ (y_ihat - y_bar)^2

The residual sum of squares, or ‘Res SS’, is:

Res SS = Σ (y_i - y_ihat)^2

Total SS can also be calculated as the sum of ‘Reg SS’ and ‘Res SS’:

Total SS = Reg SS + Res SS
Everything is ready! Now it’s time to calculate the R-squared value. R-squared is the measure of how well the regression line fits the data. Here is the formula:

R-squared = Reg SS / Total SS

If the R-squared value is 1, all the variation in the response variable (the y-variable) can be explained by the explanatory variable (the x-variable). On the contrary, if the R-squared value is 0, none of the variation in the response variable can be explained by the explanatory variable.
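Here is a sketch of that whole decomposition in R, using the fitted model ‘m’ and the variables defined earlier:

```r
y_ihat <- fitted(m)                # predicted leg lengths from the model
reg_ss <- sum((y_ihat - y_bar)^2)  # regression sum of squares
res_ss <- sum((leg - y_ihat)^2)    # residual sum of squares
tot_ss <- reg_ss + res_ss          # total sum of squares
r_sq   <- reg_ss / tot_ss          # R-squared
```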
ANOVA table

This is one of the most popular ways of assessing the fit of the model to the data.

Here is the general form of the ANOVA table. You already know some of the parameters used in the table; we will discuss the rest after it.
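Source | SS | df | MS | F | p-value
Regression | Reg SS | Reg df = k | Reg MS = Reg SS / k | F = Reg MS / Res MS | p
Residual | Res SS | Res df = n - k - 1 | Res MS = Res SS / (n - k - 1) | |
Total | Total SS | n - 1 | | |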
The relationships and parameters in this table are very important in regression analysis; they are what actually let us assess the model. We already learned the terms Reg SS, Res SS, and Total SS and how to calculate them.

‘Reg df’ in the table above is the degrees of freedom of the regression sum of squares. It equals the number of parameters estimated, excluding the intercept; call that number k. In Simple Linear Regression (SLR) it is 1; for multiple regression, k > 1.

‘Res df’ is the degrees of freedom of the residual sum of squares. It is calculated as the number of data points (n) minus k minus 1, i.e., n - k - 1. As mentioned before, for SLR k is always 1, so the Res df for SLR is n - 2.

The mean squares (MS) divide each sum of squares by its degrees of freedom, and the F-statistic is the ratio Reg MS / Res MS.
The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one actually observed, assuming the null hypothesis is true.
One more term needs to be mentioned here. If you calculate R-squared in R, it gives you two R-squared values. We already discussed the first one and its calculation. The other is the adjusted R-squared. Here is the formula:

R-squared-adj = 1 - (Res SS / (n - k - 1)) / Sy**2

Here, Sy is the standard deviation of the y-variable, so Sy**2 is its sample variance, Total SS / (n - 1). Like R-squared, it represents the proportion of the variance of the y-variable that can be explained by the model, adjusted for the number of estimated parameters.

For large n (n = the number of data points), the adjusted R-squared is approximately equal to the R-squared.
All the tables and equations are ready. Let’s assess the model we developed before!
Calculating the R-squared and the ANOVA Table to Assess the Model and Draw Inference from It

First, generate a table with all the parameters.

Feel free to download the Excel file from this link so you can see the implementation and the formulas:
Notice the end of the table: we calculated the ‘Total SS’ using its formula and also as the sum of ‘Reg SS’ and ‘Res SS’. Both values of ‘Total SS’ are almost the same (490.395 and 490.372), and we can use either of them. From this table:

Total SS = 490.372

Reg SS = 261.134

Res SS = 229.238
Calculate the R-squared and R-squared-adj:

R-squared = Reg SS / Total SS = 261.134 / 490.372 = 0.5324

R-squared-adj = 1 - 8.187 / (s_leg)**2 = 0.5159

Here, 8.187 is the residual mean square, Res SS / Res df = 229.238 / 28, and (s_leg)**2 is the sample variance of the leg lengths, Total SS / (n - 1) = 490.372 / 29 ≈ 16.909.

As expected, they are almost the same.
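The same numbers can be cross-checked in R with the quantities computed earlier:

```r
n <- length(leg)                  # 30 data points
res_ms   <- res_ss / (n - 2)      # residual mean square, about 8.187
r_sq_adj <- 1 - res_ms / s_leg^2  # adjusted R-squared, about 0.5159
```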
That means 51.59% of the variability in the length of the legs can be explained by the length of the arms. This R-squared value indicates a reasonably strong relationship between arm length and leg length.

But to affirm that there is a significant linear relationship between these two variables, a hypothesis test is necessary.
If you are totally new to hypothesis testing, you may wonder why we need to affirm that; after all, we already developed the model and calculated the correlation. But we studied only 30 samples and developed the model on those 30 samples. If we want to infer a conclusion about the whole population from it, we need hypothesis testing. Here is a detailed article on hypothesis testing concepts:
In this example, we will use the ANOVA table we described before for the hypothesis testing.

Hypothesis Test Example Using the ANOVA Table
There are two equivalent tests for assessing these hypotheses: 1) the t-test and 2) the F-test.

I chose to do it using the F-test. If you already know how to perform a t-test, feel free to go ahead with that. For me, both the F-test and the t-test take the same amount of work, so either one is fine. Here is how to perform an F-test.
F-test

There is a five-step process to this F-test. It is almost a general recipe, and you will be able to use this same process on many other problems.
Step 1:

Set up the hypotheses. We set two hypotheses at the beginning: the null hypothesis and the alternative hypothesis. Then, based on the evidence, we reject or fail to reject the null hypothesis.

Null hypothesis:

beta1 = 0

Remember from the linear regression equation that beta1 is the slope of the regression line. Setting the null hypothesis as beta1 = 0 means we assume there is no linear association between the arm length and the leg length.
Alternative hypothesis:

beta1 != 0

The alternative, beta1 not equal to zero, means there is a linear association between the arm length and the leg length.

We set the significance level alpha = 0.05, which corresponds to a 95% confidence level. If you need a refresher on the confidence interval concept, please check out this article:
Step 2:

Select the appropriate test statistic. Here we are selecting the F-statistic.
Step 3:

Define the decision rule, that is, decide when to reject the null hypothesis.

Since this is an F-test, we need to determine the appropriate critical value from the F-distribution. You can use a table to find the F value, but tables do not include every F value. I prefer using R; it’s very simple and easy. R has the ‘qf’ function, which takes the confidence level and the two degrees of freedom we already discussed: ‘Reg df’ and ‘Res df’.
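A sketch of the call for this model, where Reg df = 1 and Res df = 30 - 2 = 28:

```r
qf(0.95, df1 = 1, df2 = 28)  # critical F value, about 4.196
```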
So, if F is greater than or equal to 4.196, reject the null hypothesis; otherwise, do not reject it. This is our decision rule.
Step 4:

Calculate the test statistic.

I will show two ways. First, I will do it manually to show the steps; then I will simply use the ‘anova’ function in R. We already know ‘Reg SS’, ‘Res SS’, and the degrees of freedom. So, here is the ANOVA table:
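Source | SS | df | MS | F | p-value
Regression | 261.134 | 1 | 261.134 | 31.899 | (computed below)
Residual | 229.238 | 28 | 8.187 | |
Total | 490.372 | 29 | | |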
Please feel free to download the original Excel file where I did all these calculations.

Notice that I did not calculate the p-value in the table, because I want to show that calculation here. I will use R to calculate the p-value from the F-statistic:
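A sketch using ‘pf’, the F-distribution CDF in R:

```r
# Probability of an F value at least as large as the observed 31.899
1 - pf(31.899, df1 = 1, df2 = 28)
```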
Output:

4.742e-06
You can get the ANOVA table directly from the ‘anova’ function in R. The ‘anova’ function takes the fitted linear regression model. Remember, we fit the model at the beginning and saved it in the variable ‘m’; please go back and check. We pass that ‘m’ to the ‘anova’ function:
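A sketch of the call, with the expected output reconstructed as comments from the values calculated above (R’s exact rounding may differ slightly):

```r
anova(m)
# Expected output, based on the manual calculations:
#            Df  Sum Sq  Mean Sq  F value     Pr(>F)
# arm         1 261.134  261.134   31.899  4.742e-06
# Residuals  28 229.238    8.187
```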
Look at the output carefully. The ANOVA table reports Df (degrees of freedom), Sum Sq (the SS, sum of squares, from the table we calculated before), Mean Sq (MS, mean square), the F value, and the p-value. If you compare the values, they are essentially the same as our manual calculation.
Step 5:

Draw the conclusion. We defined the decision rule before: reject the null hypothesis if F ≥ 4.196. The F value is 31.899, so we reject the null hypothesis. That means we have enough evidence of a significant linear relationship between arm length and leg length at the alpha = 0.05 level. Our p-value is also less than alpha, which is equivalent evidence for rejecting the null hypothesis.
Conclusion

If you made it through all of that, congratulations! It is a lot of work. Simple linear regression is one of the simplest models, yet it is popular, and many other models build on it, so it is important to learn it very well and grasp the basic concepts. Hypothesis testing is also a common everyday task in statistics and data analytics. This article covered a lot of useful and widely used material. I hope it was helpful.

Feel free to follow me on Twitter and like my Facebook page.