Development of Data Imputation Methods for the Multiple Linear Regression

Publisher

Prince of Songkla University

Abstract

Multiple linear regression is a statistical technique that models the relationship between a response variable and several independent variables, and it can be used to predict or estimate response values. Missing data is a serious and frequent problem in data analysis: it causes loss of information and can produce results that differ greatly from reality. This research consists of two parts. The first part develops and compares the efficiency of eight imputation methods: hot deck imputation (HD), k-nearest neighbors imputation (KNN), stochastic regression imputation (SR), predictive mean matching imputation (PMM), random forest imputation (RF), stochastic regression random forest with equivalent weight imputation (SREW), k-nearest random forest with equivalent weight imputation (KREW), and k-nearest stochastic regression and random forest with equivalent weight imputation (KSREW). Simulations were run with sample sizes of 30, 60, 100, and 150 and missing percentages of 10%, 20%, 30%, and 40% on both the independent and response variables. Efficiency was compared using the average mean square error (AMSE). The results show that the proposed composite approaches outperformed the single ones, particularly the three-component method KSREW.

The second part creates functions for analyzing multiple linear regressions using the RStudio software. The mlrpro package is an intuitive regression analysis tool suitable for novice users. It can fit the regression model, select independent variables, validate the assumptions of multiple linear regression, transform data using the Box-Cox transformation, and determine the most suitable regression model. The package returns the regression coefficients, residuals, fitted values, and regression statistics such as the residual standard error, multiple R-squared, and F-statistic. It also provides visualization tools: the residuals plot, the normal Q-Q plot, and the lambda interval plot derived from the Box-Cox transformation.
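
To make the idea of an equal-weight composite imputation and an AMSE comparison concrete, the following is a minimal Python sketch. It is not the thesis code (which targets R) and it rests on several assumptions: a single independent variable, missing values in the response only, "equivalent weight" read as the plain arithmetic mean of the component imputations, and a two-component KNN plus SR combination chosen only because it needs no external libraries (the random forest component is omitted). All function names are hypothetical.

```python
import random
import statistics


def knn_impute(x_obs, y_obs, x_new, k=3):
    """Mean response of the k observed points whose x is closest to x_new."""
    nearest = sorted(zip(x_obs, y_obs), key=lambda p: abs(p[0] - x_new))[:k]
    return statistics.fmean(y for _, y in nearest)


def fit_simple_ols(x_obs, y_obs):
    """Return (intercept, slope, residual_sd) for the model y = a + b*x."""
    mx, my = statistics.fmean(x_obs), statistics.fmean(y_obs)
    sxx = sum((x - mx) ** 2 for x in x_obs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs))
    b = sxy / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(x_obs, y_obs)]
    sd = (sum(r * r for r in resid) / (len(x_obs) - 2)) ** 0.5
    return a, b, sd


def stochastic_regression_impute(a, b, sd, x_new, rng):
    """Regression prediction plus a normal draw scaled by the residual sd."""
    return a + b * x_new + rng.gauss(0.0, sd)


def equal_weight(*values):
    """Assumed reading of 'equivalent weight': a plain arithmetic mean."""
    return sum(values) / len(values)


def mse(true, imputed):
    return statistics.fmean((t - i) ** 2 for t, i in zip(true, imputed))


def run_once(n=60, missing_rate=0.2, seed=0):
    """Simulate one data set, delete responses at random, impute, score."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]  # made-up true model
    missing = set(rng.sample(range(n), int(n * missing_rate)))
    x_obs = [x[i] for i in range(n) if i not in missing]
    y_obs = [y[i] for i in range(n) if i not in missing]
    a, b, sd = fit_simple_ols(x_obs, y_obs)

    truth, knn, sr, combo = [], [], [], []
    for i in sorted(missing):
        k_val = knn_impute(x_obs, y_obs, x[i])
        s_val = stochastic_regression_impute(a, b, sd, x[i], rng)
        truth.append(y[i])
        knn.append(k_val)
        sr.append(s_val)
        combo.append(equal_weight(k_val, s_val))
    return {"KNN": mse(truth, knn), "SR": mse(truth, sr),
            "KNN+SR": mse(truth, combo)}


def average_mse(replications=200, **kwargs):
    """AMSE: the MSE of each method averaged over simulated replications."""
    runs = [run_once(seed=s, **kwargs) for s in range(replications)]
    return {name: statistics.fmean(r[name] for r in runs) for name in runs[0]}
```

Calling `average_mse(replications=200, n=60, missing_rate=0.2)` returns one AMSE value per method for that sample size and missing percentage. The numbers from this toy setup say nothing about the thesis results, which cover eight methods and missingness in both the independent and response variables.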

Description

Master of Science (Applied Statistics), 2022

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Thailand