A new hybrid method for data analysis when a significant percentage of data is missing

Document Type: Research Paper

Authors

1 Department of Statistics, Faculty of Mathematical Sciences, University of Guilan, Rasht, Iran

2 Department of Applied Mathematics, University of Guilan, Rasht, Iran

Abstract

This article compares the efficiency of different imputation methods for missing data. We use mean, median, Expectation-Maximization (EM), regression imputation (RI), and multiple imputation (MI) to replace missing values. We also employ three proposed combination methods, namely EM imputation with MI (EMMI), EM imputation with regression imputation (EMR), and regression imputation with MI (RMI). We compare these methods on an example study of Waterborne Container Trade by US Customs Port (2000-2017), applying each method at different missing percentages. Several criteria are used to compare the efficiency of the estimates, namely the mean, standard deviation (SD), and mean squared error (MSE). The results show that, in terms of MSE, the composite RMI imputation method outperforms the other methods in almost all situations. Nevertheless, when the missing percentage is small, the EMR imputation method performs better. In terms of the SD criterion, the MI method is better than the other methods, while the RMI method performs well when the missing percentage is large. When the missing percentage is in the range 40-50%, the EMR and RMI imputation methods give a better MSE.
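
The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It runs on synthetic data rather than the container trade series, and it covers only mean, median, and a simple linear regression imputation; EM, MI, and the combined EMMI, EMR, and RMI methods are omitted. It also assumes that values are missing completely at random and that MSE is computed over the imputed entries only. All function names are hypothetical.

```python
import numpy as np


def make_missing(y, frac, rng):
    """Return a copy of y with round(frac * n) entries set to NaN (assumed MCAR)."""
    y_miss = np.asarray(y, dtype=float).copy()
    n_missing = int(round(frac * y_miss.size))
    idx = rng.choice(y_miss.size, size=n_missing, replace=False)
    y_miss[idx] = np.nan
    return y_miss


def impute_mean(y_miss, x=None):
    """Replace NaNs with the mean of the observed values."""
    out = y_miss.copy()
    out[np.isnan(out)] = np.nanmean(y_miss)
    return out


def impute_median(y_miss, x=None):
    """Replace NaNs with the median of the observed values."""
    out = y_miss.copy()
    out[np.isnan(out)] = np.nanmedian(y_miss)
    return out


def impute_regression(y_miss, x):
    """Replace NaNs with predictions from a straight line fitted to observed pairs."""
    obs = ~np.isnan(y_miss)
    slope, intercept = np.polyfit(x[obs], y_miss[obs], deg=1)
    out = y_miss.copy()
    out[~obs] = intercept + slope * x[~obs]
    return out


def mse(y_true, y_hat):
    """Mean squared error between true and imputed values."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_hat)) ** 2))


def compare_methods(y, x, frac, rng):
    """Mask a fraction of y, impute with each method, and return MSE per method."""
    y_miss = make_missing(y, frac, rng)
    mask = np.isnan(y_miss)
    methods = {
        "mean": impute_mean,
        "median": impute_median,
        "regression": impute_regression,
    }
    return {name: mse(y[mask], f(y_miss, x)[mask]) for name, f in methods.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a yearly series; NOT the data used in the paper.
    x = np.arange(50, dtype=float)
    y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
    for frac in (0.1, 0.3, 0.5):
        print(frac, compare_methods(y, x, frac, rng))
```

Running the script prints one dictionary of MSE values per missing percentage, which mirrors how the abstract ranks methods at different levels of missingness. The ranking obtained on this synthetic trend says nothing about the results reported for the real data.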
