REVIEW ARTICLE
Year : 2020  |  Volume : 6  |  Issue : 2  |  Page : 116-122

F-test of overall significance in regression analysis simplified


1 Department of Educational Planning and Management, Masinde Muliro University of Science and Technology, Kakamega, Kenya
2 Department of Physiotherapy, The Nairobi Hospital, Nairobi, Kenya

Date of Submission: 04-Mar-2020
Date of Decision: 30-Apr-2020
Date of Acceptance: 28-Jun-2020
Date of Web Publication: 27-Aug-2020

Correspondence Address:
Mr. Onchiri Sureiman
Department of Educational Planning and Management, Masinde Muliro University of Science and Technology, Kakamega
Kenya

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/jpcs.jpcs_18_20

  Abstract 


Regression analysis uses the relationship between known values and an unknown variable to estimate the unknown one. An estimate of the dependent variable is made for given values of the independent variables by expressing the relationship between the variables as a regression line. To determine how well the fitted regression line fits the data points, the F-test of overall significance is conducted. The test raises many issues and the mathematics involved is rigorous, especially when more than two variables are involved. This study describes in detail how the test can be conducted and then presents a simplified approach using an online calculator.

Keywords: F-test, hypothesis testing, online calculator, regression


How to cite this article:
Sureiman O, Mangera CM. F-test of overall significance in regression analysis simplified. J Pract Cardiovasc Sci 2020;6:116-22

How to cite this URL:
Sureiman O, Mangera CM. F-test of overall significance in regression analysis simplified. J Pract Cardiovasc Sci [serial online] 2020 [cited 2023 Jun 7];6:116-22. Available from: https://www.j-pcs.org/text.asp?2020/6/2/116/293527




  Introduction


The term “regression” was first used in 1877 by Sir Francis Galton, whose study showed that the height of children born to tall parents tends to move back, or “regress,” toward the mean height of the population. He used the word regression as the name of the process of predicting one variable from another variable.[1] The term “multiple regression” later came to describe the process by which several variables are used to predict another.[2] The F-test of overall significance in regression tests whether a linear regression model provides a better fit to a dataset than a model with no predictor variables.


  Assumptions Underlying F-Test of Overall Significance in Regression Analysis


The main assumptions include:

Linearity

Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.[3]

Normality

Linear regression analysis requires all variables to be multivariate normal.[4] This assumption can best be checked with a histogram or a Q-Q plot. There are also a variety of statistical tests for normality, including the Kolmogorov–Smirnov test, the Shapiro–Wilk test, the Jarque–Bera test, and the Anderson–Darling test.[5] When the data are not normally distributed, a nonlinear transformation (e.g., a log transformation) might fix the issue.

Multicollinearity

Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.[3] Multicollinearity may be tested with three central criteria:

  1. Tolerance – The tolerance measures the influence of one independent variable on all other independent variables; it is calculated from an initial linear regression analysis. Tolerance is defined as T = 1 − R2 for this first-step regression analysis.[6] With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is
  2. Correlation matrix – When computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than 1[7]
  3. Variance inflation factor (VIF) – The VIF of the linear regression is defined as VIF = 1/T. With VIF > 5, there is an indication that multicollinearity may be present; with VIF > 10, there is certainly multicollinearity among the variables.[3] The simplest way to address the problem is to remove independent variables with high VIF values.
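
For the two-predictor case used in this article's examples, the tolerance and VIF criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: with exactly two predictors, the R2 from regressing one predictor on the other equals their squared Pearson correlation. The data and the function names (pearson_r, tolerance_and_vif) are hypothetical and do not come from the study.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def tolerance_and_vif(x1, x2):
    """Tolerance T = 1 - R2 and VIF = 1/T for a model with two predictors.

    Assumption: with two predictors, R2 of one predictor regressed on the
    other is simply the squared correlation between them.
    """
    r = pearson_r(x1, x2)
    tolerance = 1.0 - r ** 2
    return tolerance, 1.0 / tolerance

# Hypothetical predictor values, for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

tolerance, vif = tolerance_and_vif(x1, x2)
print(f"tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
```

With more than two predictors, each predictor would need its own auxiliary regression on all the others, which this sketch does not attempt.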


Homoscedasticity

A scatter plot is a good way to check whether the data are homoscedastic (meaning the variance of the residuals is equal across the regression line). The Goldfeld–Quandt, Breusch–Pagan, Park, and White tests can also be used to test for heteroscedasticity.[8]


  How to Interpret the F-Statistic


The F-statistic is calculated as regression MS/residual MS. This statistic indicates whether the regression model provides a better fit to the data than a model that contains no independent variables. In essence, it tests whether the regression model as a whole is useful. If the P value is less than the significance level, there is sufficient evidence to conclude that the regression model fits the data better than the model with no predictor variables. This finding is good because it means that the predictor variables in the model actually improve the fit of the model. In general, if none of the predictor variables in the model is statistically significant, the overall F statistic is usually not statistically significant either, although exceptions are possible.


  Illustrative Examples on Determining F-Test of Overall Significance in Regression Analysis


This tutorial walks through examples of a regression analysis using two methods (manual and online calculator), providing an in-depth explanation of how to read and interpret the output of a regression table.

Example 1

A study was conducted to estimate the output (Y) of a physiotherapist from his/her score on an aptitude test (X1) and years of experience (X2) in a hospital; [Table 1] summarizes its findings.
Table 1: Test scores, experience, and output of physiotherapist


H0: Y = b0

H1: Y = b0 + b1X1 + b2X2


  Manual Computation of F-Test of Overall Significance in Regression Analysis


Obtaining the regression equation

The given data are reproduced in [Table 2]. [Table 2] also shows other inputs required for obtaining the regression equation.
Table 2: Obtaining regression equation




The general form of the multiple regression equation applicable in this case is:

Y = b0 + b1X1 + b2X2

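The normal equations required to find the values of b0, b1, and b2 can be written as follows, where n is the number of observations:

ΣY = n b0 + b1 ΣX1 + b2 ΣX2

ΣX1Y = b0 ΣX1 + b1 Σ(X1)² + b2 ΣX1X2

ΣX2Y = b0 ΣX2 + b1 ΣX1X2 + b2 Σ(X2)²
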
Accordingly, the three equations are:

255 = 10 b0 + 1354 b1 + 53 b2

37175 = 1354 b0 + 194128 b1 + 7347.5 b2

1552 = 53 b0 + 7347.5 b1 + 363 b2

Solving the three equations simultaneously, we obtain b0 = −13.824567, b1 = 0.212167, and b2 = 1.999461. Thus, the regression equation of Y on X1 and X2 is: YC = −13.824567 + 0.212167 X1 + 1.999461 X2.
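
The procedure of building and solving the normal equations can be sketched in plain Python as follows. This is a minimal illustration, not the authors' computation: the data set is hypothetical (constructed so that Y = 1 + 2 X1 + 3 X2 exactly), and the helper names normal_equations and solve_linear_system are assumptions introduced here.

```python
def normal_equations(y, x1, x2):
    """Build the 3x3 normal-equation system for Y = b0 + b1*X1 + b2*X2."""
    n = len(y)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    matrix = [
        [n,       sum(x1),     sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return matrix, rhs

def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 (no noise).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

matrix, rhs = normal_equations(y, x1, x2)
b0, b1, b2 = solve_linear_system(matrix, rhs)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
```

Because the hypothetical data contain no noise, the recovered coefficients match the generating equation; real data would leave residuals, which feed the variation calculations that follow.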

Calculation of R and F-ratios

To determine the R and F statistics, we need to calculate the total, explained, and unexplained variation as shown in [Table 3].
Table 3: Calculation of total, explained, and unexplained variation


Total variation (sum of squares total, SST) = 974.5

Explained variation (sum of squares regression, SSR) = 962.710

Unexplained variation (sum of squares error, SSE) = 11.791

R square (R2) = SSR/SST = 0.988, R = 0.994

Mean square regression (MSR) = SSR/2 = 481.355

Mean square error (MSE) = SSE/7 = 1.684

F = MSR/MSE = 285.775
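
The arithmetic linking these quantities can be checked with a short Python sketch. It takes SST and SSE as reported for Example 1 and assumes n = 10 observations (the coefficient of b0 in the first normal equation) and k = 2 predictors; the function name regression_summary is hypothetical. Small differences from the printed figures in the last decimal places are due to rounding of the inputs.

```python
from math import sqrt

def regression_summary(sst, sse, n, k):
    """Derive R2, R, adjusted R2, mean squares and F from sums of squares.

    sst: total sum of squares, sse: error (residual) sum of squares,
    n: number of observations, k: number of predictors.
    """
    ssr = sst - sse                  # explained variation
    df_regression = k
    df_residual = n - k - 1
    msr = ssr / df_regression        # mean square regression
    mse = sse / df_residual          # mean square error
    r2 = ssr / sst
    adj_r2 = 1.0 - mse / (sst / (n - 1))
    return {
        "ssr": ssr, "msr": msr, "mse": mse,
        "r2": r2, "r": sqrt(r2), "adj_r2": adj_r2,
        "f": msr / mse,
    }

# Values reported for Example 1; n and k as assumed in the lead-in.
summary = regression_summary(sst=974.5, sse=11.791, n=10, k=2)
for name, value in summary.items():
    print(f"{name:>7} = {value:.3f}")
```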

Goodness of fit

The F table value [Table 4] corresponding to degrees of freedom n1 = 2 and n2 = 7 at the 5% significance level is 4.74. Since 285.775 > 4.74, we reject the null hypothesis (Y = b0) and conclude that the model Y = b0 + b1X1 + b2X2 provides a better fit.
Table 4: Excerpts from significance points of the variance-ratio “F”
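
When a printed F table is not at hand, the critical value for a model with exactly two predictors can be obtained from a closed-form expression, sketched below in Python. This shortcut is an addition for illustration and applies only when the numerator degrees of freedom equal 2, in which case the upper-tail probability of F is (1 + 2F/n2)^(−n2/2); for other cases a table or a statistical library is needed. The function name is hypothetical.

```python
def f_critical_two_predictors(alpha, df_residual):
    """Upper-tail critical value of F(2, df_residual) at significance level alpha.

    Obtained by solving (1 + 2F/df_residual) ** (-df_residual/2) = alpha for F.
    Valid only for numerator degrees of freedom equal to 2.
    """
    return (df_residual / 2.0) * (alpha ** (-2.0 / df_residual) - 1.0)

critical_value = f_critical_two_predictors(alpha=0.05, df_residual=7)
f_observed = 285.775  # F ratio computed manually in Example 1

print(f"critical F(2, 7) at 5% = {critical_value:.2f}")
print("reject H0" if f_observed > critical_value else "do not reject H0")
```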


Validity checking

  1. Linearity: The relationship between Y and X1 is linear [Figure 1], as is the relationship between Y and X2 [Figure 2]
  2. Normality: The QQ plot [Figure 3] indicates that the normality assumption is met
  3. Multicollinearity: Tolerance = 1 − R2 = 1 − 0.987902 = 0.012098. Since 0.012098 > 0.01 but 0.012098 < 0.1, there might be multicollinearity in the data
  4. Homoscedasticity (homogeneity of variance): The data are homoscedastic since the spread of the residuals is roughly constant across the regression line [Figure 4] and [Figure 5].

Figure 1: Output against test score.
Figure 2: Output against experience.
Figure 3: Residuals: QQ plot.
Figure 4: X1 residuals plot.
Figure 5: X2 residuals plot.



  Using an Online Linear Regression Analysis Calculator (Simplified Method)


The F-test of overall significance in regression analysis can be done with online calculators, which are readily available on the internet. For a user-friendly online calculator, you may visit http://www.statskingdom.com/410multi_linear_regression.htm.

In the software, it is easy to conduct an F-test, and most of the assumption checks are built in. The calculator supports variable transformations and calculates the linear equation, R, P value, outliers, and the adjusted Fisher–Pearson coefficient of skewness. After checking the residuals' normality, multicollinearity, homoscedasticity, and a priori power, the program interprets the results. It then draws a histogram, a residuals QQ plot, a correlation matrix, a residuals x-plot, and a distribution chart. You may transform the variables, exclude any predictor, or run backward stepwise selection automatically based on the predictors' P values.

The basic step in using an online calculator is to enter your data correctly [Figure 6]. For instance, in the above example, we fill in the data in the columns of the online calculator and then click the calculate button.
Figure 6: Setting up the data in the table of an online calculator.



  Summary Output


The output of the F-test is summarized below by the regression equation, regression statistics [Figure 7], correlation matrix [Table 5], ANOVA [Table 6], coefficient table iteration 1 [Table 7], and residual graphs [Figure 8].
Figure 7: Regression statistics.
Table 5: Correlation matrix
Table 6: ANOVA table
Table 7: Coefficient table iteration 1 (adjusted R2 = 0.984)
Figure 8: Residual plots.


The regression equation is Y = −13.825 + 0.212 X1 + 1.999 X2


  Validity Checking


  1. Residual normality: Linear regression assumes normality of the residual errors. Shapiro–Wilk P = 0.664 [Figure 7], so normality is not rejected and the residuals are assumed to be normally distributed
  2. Homoscedasticity (homogeneity of variance): The White test P = 0.909 [Figure 7], so the variance is assumed to be homogeneous
  3. Multicollinearity (intercorrelations among the predictors): There is no multicollinearity concern, as all the VIF values are smaller than 2.5 [Table 6] and [Table 7]
  4. A priori power of the entire model (2 predictors): The a priori power should be calculated before running the regression. Although the power is low (0.134) [Figure 7], we reject the H0.



  Interpretation of the Output


Y and X relationship

R square (R2) equals 0.988. It means that the predictors (Xi) explain 98.8% of the variance of Y. Adjusted R square equals 0.984. The coefficient of multiple correlation (R) equals 0.994. It means that there is a very strong direct relationship between the predicted data (ŷ) and the observed data (y).

Goodness of fit

A right-tailed F test is used to check whether the entire regression model is statistically significant. From [Table 6], F(2, 7) = 285.802, P = 1.94764e-7. Since P < α (0.05), we reject the H0. The linear regression model, Y = b0 + b1X1 + b2X2, provides a better fit than the model without the independent variables, Y = b0.
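
The reported P value can be cross-checked without statistical software, because a model with two predictors has 2 numerator degrees of freedom and the right-tail probability then has a closed form. The Python sketch below is an illustrative addition, not part of the calculator's output; it assumes 7 residual degrees of freedom (n = 10 observations, 2 predictors), and the function name is hypothetical.

```python
def f_p_value_two_predictors(f_stat, df_residual):
    """Right-tail P value of an F(2, df_residual) statistic.

    Uses P(F > f) = (1 + 2f/df_residual) ** (-df_residual/2), which holds
    only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_stat / df_residual) ** (-df_residual / 2.0)

alpha = 0.05
p_value = f_p_value_two_predictors(f_stat=285.802, df_residual=7)

print(f"P = {p_value:.5e}")
print("reject H0" if p_value < alpha else "do not reject H0")
```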

As shown in [Table 7], the P value for X1 = 6.59e-7 and for X2 = 0.00000257. All the independent variables (Xi) are significant since P < α (0.05). The Y-intercept (b0): two-tailed, T = −7.701131, P = 0.000116139 [Table 7]. Hence, b0 is significantly different from zero.

Example 2

The data in [Table 8] are taken from a clinical trial to compare two hypotensive drugs used to lower the blood pressure during operations. The dependent variable, y, is the recovery time (in minutes) elapsing between the time at which the drug was discontinued and the time at which the systolic blood pressure had returned to 100 mmHg. The two predictors are quantity of drugs used in mg (x1) and mean level of systolic blood pressure during hypotension in mmHg (x2).
Table 8: Data on use of hypotensive drugs


H0: Y = b0

H1: Y = b0 + b1X1 + b2X2


  Using an Online Linear Regression Analysis Calculator (Simplified Method)


To analyze the relationship between recovery time and the two predictors, we run a multiple linear regression with quantity of drugs used and mean level of systolic blood pressure during hypotension as the predictor variables and recovery time as the response variable. The output of the F-test is summarized below by the regression equation, residual plots [Figure 9], correlation matrix [Table 9], ANOVA [Table 10], coefficient table iteration 1 [Table 11], and regression statistics [Figure 10].
Figure 9: Residual plots.
Table 9: Correlation matrix
Table 10: ANOVA table
Table 11: Coefficient table iteration 1 (adjusted R2 = 0.728)
Figure 10: Regression statistics.


The regression equation is predicted Y = 58.603 + 53.688 X1 − 2.091 X2.


  Validity Checking


  1. Residual normality: Linear regression assumes normality of the residual errors. Shapiro–Wilk P = 0.638 [Figure 10], so normality is not rejected and the residuals are assumed to be normally distributed
  2. Homoscedasticity (homogeneity of variance): The White test P value [Figure 10] equals 0.567 (F = 0.637), so the variance is assumed to be homogeneous
  3. Multicollinearity (intercorrelations among the predictors): There is no multicollinearity concern, as all the VIF values are smaller than 2.5 [Table 11]
  4. A priori power of the entire model (2 predictors): Although the power is low (0.106) [Figure 10], we reject the H0.


The power to detect the significance of each individual predictor is always lower than the power of the entire model.


  Interpretation of the Output


Y and X relationship

R square (R2) equals 0.806. It means that the predictors (Xi) explain 80.6% of the variance of Y. Adjusted R square equals 0.728. The coefficient of multiple correlation (R) equals 0.898. It means that there is a very strong direct relationship between the predicted data (ŷ) and the observed data (y).

Goodness of fit

A right-tailed F test is used to check whether the entire regression model is statistically significant. From [Table 10], F(2, 5) = 10.382, P = 0.0166. Since P < α (0.05), we reject the H0. The linear regression model, Y = b0 + b1X1 + b2X2, provides a better fit than the model without the independent variables, Y = b0.
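
The overall F statistic can also be recovered from R2 alone, which gives a quick consistency check on the calculator output. The Python sketch below is an illustrative addition: it assumes k = 2 predictors with n = 10 observations for Example 1 and n = 8 for Example 2 (the latter inferred from the 5 residual degrees of freedom), and the function name f_from_r_squared is hypothetical. The results agree with the reported F values up to rounding of R2.

```python
def f_from_r_squared(r2, n, k):
    """Overall F statistic from R2, n observations and k predictors.

    F = (R2 / k) / ((1 - R2) / (n - k - 1))
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# R2 values as reported in the article; n for Example 2 is an inferred assumption.
f_example_1 = f_from_r_squared(r2=0.987902, n=10, k=2)
f_example_2 = f_from_r_squared(r2=0.806, n=8, k=2)

print(f"Example 1: F(2, 7) = {f_example_1:.2f}")
print(f"Example 2: F(2, 5) = {f_example_2:.2f}")
```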

As shown in [Table 11], the P value for X1 = 0.0248 and for X2 = 0.0121. All the independent variables (Xi) are significant since the P values are < α (0.05). The Y-intercept (b0): two-tailed, T = 1.265, P = 0.262 [Table 11]. Hence, b0 is not significantly different from zero. It is still generally recommended not to force b0 to be zero.


  What an F-Test of Overall Significance Tells and What it Does Not


The F statistic is the ratio of the variance explained by the regression model (regression mean square) to the unexplained variance (residual mean square). It can be calculated more easily with an online calculator than manually. The F-test of overall significance tests whether all of the predictor variables are jointly significant, whereas the t-test for each individual predictor variable tests only whether that predictor is individually significant. It is therefore possible that no predictor variable is individually significant and yet the F-test indicates that the predictor variables are jointly significant.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.



 
  References

1. Kothari CN. Quantitative Techniques. 3rd ed. New Delhi: UBS Publishers' Distributors Pvt Ltd; 2007.
2. Vohra ND. Quantitative Techniques in Management. 3rd ed. New Delhi: Tata McGraw-Hill Publishing Company Limited; 2007.
3. Armitage P, Berry G, Matthews JN. Statistical Methods in Medical Research. 4th ed. Massachusetts: Blackwell Science; 2002.
4. Sullivan LS. Essentials of Biostatistics Workbook. 2nd ed. London: Jones and Bartlett Learning; 2003.
5. Ogunleye LI, Oyejola BA, Obisesan KO. Comparison of some common tests for normality. Int J Probabil Statist 2018;7:130-7.
6. Wetherill GB, Duncombe P, Kenward M, Kollerstrom J, Paul SR, Vowden BJ, et al. Regression Analysis with Applications. London: Chapman and Hall; 1986.
7. Harris M, Taylor G. Medical Statistics Made Easy. New York: Springer-Verlag; 2003.
8. Su H, Berenson ML. Comparing tests of homoscedasticity in simple linear regression. JSM Math Stat 2017;4:1017.

