Abstract
Objectives
An important step in preparing data for statistical analysis is outlier detection and removal, yet no gold standard exists in current literature. The objective of this study is to identify the ideal decision test using the National Health and Nutrition Examination Survey (NHANES) 2017–2018 dietary data.
Methods
We conducted a secondary analysis of NHANES 24-h dietary recalls, accounting for the survey's multi-stage cluster design. Six outlier detection and removal strategies were assessed by evaluating each decision test's impact on Pearson's correlation coefficient among macronutrients. Furthermore, we assessed changes in the effect size estimates based on pre-defined sample sizes. The data were collected as part of the 2017–2018 24-h dietary recall among adult participants (N=4,893).
Results
Effect estimate changes for macronutrients varied from 6.5 % for protein to 39.3 % for alcohol across all decision tests. The largest proportion of outliers removed was 4.0 %, observed in the large sample size under the decision test that flags values >2 standard deviations from the mean. The smallest sample size, particularly for alcohol analysis, was most affected by the six decision tests when compared to no decision test.
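The >2 standard deviation decision test and its effect on a correlation estimate can be illustrated with a minimal sketch. It assumes unweighted data and ignores the survey's multi-stage design, so it is not the study's analysis; the function names and the toy intake values are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def within_k_sd(values, k=2.0):
    """True for each value lying within k sample SDs of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) <= k * sd for v in values]


def compare_sd_rule(x, y, k=2.0):
    """Drop pairs where either value is > k SDs from its mean, then
    report the correlation before and after removal."""
    keep_x, keep_y = within_k_sd(x, k), within_k_sd(y, k)
    kept = [(a, b) for a, b, kx, ky in zip(x, y, keep_x, keep_y) if kx and ky]
    r_before = pearson_r(x, y)
    r_after = pearson_r([a for a, _ in kept], [b for _, b in kept])
    n_removed = len(x) - len(kept)
    return {
        "n_removed": n_removed,
        "pct_removed": 100.0 * n_removed / len(x),
        "r_before": r_before,
        "r_after": r_after,
        # Relative change in the estimate; undefined if r_before is 0.
        "pct_change": (100.0 * (r_after - r_before) / abs(r_before)
                       if r_before != 0 else None),
    }


# Hypothetical intakes (g/day) with one implausible value in x.
protein = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
fat = [2, 4, 6, 8, 10, 12, 14, 16, 18, 0]
result = compare_sd_rule(protein, fat)
print(result)
```

In this toy example a single extreme pair is enough to reverse the sign of the correlation, which is why the proportion of data removed and the change in the estimate are reported together.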
Conclusions
This study, the first to use 2017–2018 NHANES dietary data for outlier evaluation, emphasizes the importance of selecting an appropriate decision test considering factors such as statistical power, sample size, normality assumptions, the proportion of data removed, effect estimate changes, and the consistency of estimates across sample sizes. We recommend the use of non-parametric tests for non-normally distributed variables of interest.
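As one illustration of a non-parametric decision test, the sketch below applies Tukey's fences (1.5 × interquartile range). The abstract does not name the non-parametric tests that were evaluated, so this choice, the function names, and the toy data are assumptions made for illustration only.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers_iqr(values, k=1.5):
    """True for each value falling outside the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v < low or v > high for v in values]


# Hypothetical right-skewed intakes (g/day), e.g. many zero reports.
alcohol = [0, 0, 0, 1, 2, 3, 4, 5, 60]
print(tukey_fences(alcohol))
print(flag_outliers_iqr(alcohol))
```

Because the fences are built from quartiles rather than the mean and standard deviation, a few extreme values do not widen the cut-offs, which is the usual argument for rank-based rules on skewed variables.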
Funding source: National Institute on Drug Abuse
Award Identifier / Grant number: K01DA044313
Funding source: National Institute of Environmental Health Sciences
Award Identifier / Grant number: P30 ES006096
Award Identifier / Grant number: R21ES032161
Acknowledgments
We thank Dr. Mary Beth Genter, Professor at the University of Cincinnati College of Medicine, Department of Environmental and Public Health Sciences, for her review and comments. All authors read and approved the final manuscript.
- Research ethics: This study was conducted according to the guidelines laid down in the Declaration of Helsinki and all procedures involving research study participants were approved by the National Center for Health Statistics Ethics Review Board. Written informed consent was obtained from all subjects/patients.
- Informed consent: Written informed consent was obtained from all subjects/patients.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: Ashley L. Merianos’ contribution was partly funded by grants K01DA044313 and R21ES032161 from the National Institutes of Health. Angelico Mendy’s contribution was partly funded by grant P30 ES006096 from the National Institutes of Health. Yuki Liu is employed by Intuitive Surgical and contributed to this research as a personal activity outside of work. The research and research results are not, in any way, associated with Stanford University. YL has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed. The funders had no role in the design, analysis or writing of this article.
- Data availability: Not applicable.
© 2023 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Research Articles
- Outliers in nutrient intake data for U.S. adults: national health and nutrition examination survey 2017–2018
- Using repeated antibody testing to minimize bias in estimates of prevalence and incidence of SARS-CoV-2 infection
- A compartmental model of the COVID-19 pandemic course in Germany
- Energy-efficient model “DenseNet201 based on deep convolutional neural network” using cloud platform for detection of COVID-19 infected patients
- Identification of time delays in COVID-19 data
- A country-specific COVID-19 model
- Incidence and trend of leishmaniasis and its related factors in Golestan province, northeastern Iran: time series analysis
- Application of machine learning tools for feature selection in the identification of prognostic markers in COVID-19
- A study of the impact of policy interventions on daily COVID scenario in India using interrupted time series analysis
- Measuring COVID-19 spreading speed through the mean time between infections indicator
- Performance evaluation of ResNet model for classification of tomato plant disease
- Energy-efficient model “Inception V3 based on deep convolutional neural network” using cloud platform for detection of COVID-19 infected patients