Testing normality

An introduction with sample size calculation in legal metrology
Katy Klauenberg and Clemens Elster

Published/Copyright: November 6, 2019

Abstract

In metrology, the normal distribution is often taken for granted, e.g. when evaluating the result of a measurement and its uncertainty, or when establishing the equivalence of measurements in key or supplementary comparisons. The correctness of this inference and of subsequent conclusions depends on the normality assumption, so validating this assumption is essential. Hypothesis testing is the formal statistical framework for doing so, and this introduction describes how statistical tests detect violations of a distributional assumption.

In the metrological context, we advise on how to select such a hypothesis test, how to set it up, how to perform it and which conclusion(s) can be drawn. In addition, we calculate the number of measurements needed to decide whether a process departs from a normal distribution and quantify how certain one can then be about this decision. These aspects are illustrated for the powerful Shapiro-Wilk test and by an example in legal metrology. For this application we recommend performing 330 measurements. We also briefly touch upon the issues of multiple testing and rounded measurements.

Summary

In metrology, a normal distribution is often assumed, e.g. when a measurement result and its uncertainty are determined, or when the equivalence of measurements is established in key or supplementary comparisons. The correctness of these evaluations and of the conclusions drawn from them depends on the assumption of normality, so validating this assumption is essential. Hypothesis tests provide the formal statistical framework for this, and this introduction presents how statistical tests detect violations of distributional assumptions.

In the metrological context, we describe how such a hypothesis test is selected, how it is performed and which conclusion(s) can be drawn. In addition, the number of measurements needed to decide whether a process deviates from the normal distribution is calculated, and it is quantified how certain one can then be of such a decision. These aspects are illustrated for the powerful Shapiro-Wilk test and by an example from legal metrology. For this application we recommend performing 330 measurements. Multiple hypotheses and rounded measurements are also briefly addressed.

Funding statement: Part of this work has been funded by a scientific cooperation between the Physikalisch-Technische Bundesanstalt (PTB) and the ‘Forum Netztechnik/Netzbetrieb’ (FNN) in the ‘Verband der Elektrotechnik, Elektronik und Informationstechnik e. V.’ (VDE).

About the authors

Katy Klauenberg

Katy Klauenberg is a researcher in the working group “Data analysis and measurement uncertainty” at Germany’s national metrology institute PTB. Her research interests include Bayesian statistics, regression problems, and sampling procedures. She provides training and support for the evaluation of measurement uncertainty and in legal metrology.

Clemens Elster

Clemens Elster currently leads PTB’s working group “Data analysis and measurement uncertainty.” His main topics of interest are statistical data analysis and evaluation of measurement uncertainty.

Appendix A Impact demonstration for the alternative t-distribution

For efficient attribute sampling of utility meters, the normal distribution of their measurement deviations is critical for consumer protection (see procedure 2 in [22] or procedure 4.3 in [12]). This appendix shows how to calculate the impact of different violations of the normal distribution on the legal requirement that 95 % of the measuring instruments shall fulfil conformance standards.

Let us assume that attribute sampling plans are applied which guarantee that $100q = 92\,\%$ of the meters measure correctly within a deviation of $\pm\Delta$. Then [23, theorem 1] showed that under normally distributed measurement deviations, this implies that at least $100p\,\%$ of the meters measure correctly within a deviation of $\pm\Delta\gamma_N$ when

$$\gamma_N = \frac{\Phi^{-1}((p+1)/2)}{\Phi^{-1}((0.92+1)/2)},$$

with $\Phi$ being the standard normal cumulative distribution function. E.g. for $p = 0.98$ we have $\gamma_N = 1.3288$. However, [13] show that under a $t_\nu$-distribution only at least $100p\,\%$ of the meters measure correctly within a deviation of $\pm\Delta\gamma_N$, with

$$\gamma_N = \frac{F_\nu^{-1}((p+1)/2)}{F_\nu^{-1}((0.92+1)/2)},$$

where $F_\nu$ is the cumulative distribution function of the standardized $t_\nu$-distribution. E.g. for $\gamma_N = 1.3288$ and $\nu = 6$ we have $p = 0.9687$. Under the assumption of a linear decrease of correctly functioning meters over time, an age of at least $t-1$ years and a prediction interval of at most $T+1$ years, this results in at least

$$1 - (1-p)\left(1 + \frac{T+1}{t-1}\right)$$

correctly measuring meters within $\pm\Delta\gamma_N$ until time $t+T$ (cf. eq. (2) in [22]). E.g. for $\gamma_N = 1.3288$, $\nu = 6$ and $T = t = 5$ years, we have 92.17 % reliability, instead of the required 95 %. Further examples are displayed in figure 2.
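
The numbers quoted in this appendix can be retraced with a few lines of R. The following is a minimal sketch, not the authors' code: the helper names `p_under_t` and `reliability` and the argument names `t_age` and `T_pred` are hypothetical, and the bound is read here as 1 − (1 − p)(1 + (T + 1)/(t − 1)), a reading that reproduces the quoted 92.17 %. Because the tightening factor is a ratio of quantiles, the scale of the standardized t-distribution cancels, so the ordinary `qt` and `pt` functions can be used.

```r
# Sketch of the Appendix A calculation (helper names are hypothetical).
q <- 0.92          # fraction guaranteed by the sampling plan within +/- Delta
p_normal <- 0.98   # fraction within +/- Delta * gamma_N under normality

# Tightening factor gamma_N under normally distributed deviations
# (the text quotes 1.3288 for p = 0.98)
gamma_N <- qnorm((p_normal + 1) / 2) / qnorm((q + 1) / 2)

# Fraction p within +/- Delta * gamma if the deviations follow a
# t-distribution with nu degrees of freedom; obtained by solving
# gamma = qt((p + 1) / 2, nu) / qt((q + 1) / 2, nu) for p
p_under_t <- function(gamma, nu, q = 0.92) {
  2 * pt(gamma * qt((q + 1) / 2, df = nu), df = nu) - 1
}

# Lower bound on the fraction of correctly measuring meters until t + T,
# assuming a linear decrease over time (reading of eq. (2) in [22])
reliability <- function(p, t_age, T_pred) {
  1 - (1 - p) * (1 + (T_pred + 1) / (t_age - 1))
}

p_t6 <- p_under_t(gamma_N, nu = 6)   # the text quotes 0.9687

round(gamma_N, 4)
round(p_t6, 4)
round(100 * reliability(p_normal, t_age = 5, T_pred = 5), 2)  # normal case
round(100 * reliability(p_t6, t_age = 5, T_pred = 5), 2)      # t_6 case
```

Changing `nu` in the call to `p_under_t` shows how quickly the guaranteed fraction drops as the tails of the distribution become heavier.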

Appendix B Implementing the Shapiro-Wilk test in software

To apply the Shapiro-Wilk test, one can either use an existing implementation in a standard programming language or programme the test from scratch. One ready-to-use implementation is available in the free statistical software R [28]. The following code reads in the data from figure 1 (right) after loading the required xlsx package, then outputs the Shapiro-Wilk test statistic with its p-value (third line of code) and returns the test decision (fourth line of code):

Shapiro-Wilk test R code
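library(xlsx)  # provides read.xlsx
Data <- read.xlsx("ShapiroTestBeispiel.xlsx", sheetIndex = 2)[, 2]  # second column of sheet 2
shapiro.test(Data)  # prints the statistic W and its p-value
ifelse(shapiro.test(Data)$p.value < 0.1, "reject H0", "no evidence")  # decision at level 0.1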

One should be aware that implementations in different programming languages may use different versions of the Shapiro-Wilk test. The R function shapiro.test implements [34] (which includes the algorithm described in [33]) for sample sizes from 3 to 5000. A similar routine exists in SAS [38].
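
Readers without the Excel file can try the same calls on simulated data. The following minimal sketch also estimates by simulation how often the test rejects at level 0.1 for n = 330, once under normality and once under a t-distribution with 6 degrees of freedom (the alternative used as an example in Appendix A). The seed, the number of replications `n_sim` and the helper name `rejection_rate` are assumptions chosen for illustration and are not taken from the article.

```r
set.seed(2019)   # arbitrary seed, for reproducibility only
n <- 330         # sample size recommended in the abstract
alpha <- 0.1     # significance level used in the R code of this appendix
n_sim <- 1000    # number of simulated samples (assumption)

# Fraction of simulated samples for which the Shapiro-Wilk test rejects H0;
# 'draw' is a function that returns a random sample of the requested size
rejection_rate <- function(draw) {
  mean(replicate(n_sim, shapiro.test(draw(n))$p.value < alpha))
}

# Under normality this estimates the type I error (should be close to alpha)
rate_normal <- rejection_rate(rnorm)

# Under the t_6 alternative this estimates the power of the test
rate_t6 <- rejection_rate(function(k) rt(k, df = 6))

c(normal = rate_normal, t6 = rate_t6)
```

Repeating the simulation for other values of `n` gives a rough impression of how the power grows with the number of measurements; it is a simulation estimate and therefore varies with the seed and with `n_sim`.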

When implementing the Shapiro-Wilk test from scratch, one needs to calculate the test statistic according to (step 4) in section 2.3 and compare it to the critical value (cf. section 3). For the test statistic, the coefficients $a_i$ are usually approximated by the polynomials in [33, sect. 2]. The critical value is usually calculated from a lognormal approximation, which was derived from simulations of the distribution of the test statistic in [33, sect. 4]. The Excel file available at the PTB website [21] contains the coefficients and critical values for the sample size $n = 330$ and type I errors $\alpha = (0.1, 0.1/2, \ldots, 0.1/50)$.
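
A from-scratch implementation along these lines might look as follows. This is a minimal sketch for sample sizes n ≥ 12 only, not the authors' implementation: the function name `shapiro_wilk_sketch` is hypothetical, and the polynomial constants are quoted as they appear in commonly used implementations of the approximation, so they should be checked against [33] before any serious use. The sketch treats ln(1 − W) as approximately normal under the null hypothesis, which yields both the critical value and a p-value.

```r
# Sketch: Shapiro-Wilk W with approximated coefficients and a critical
# value from the lognormal approximation, for n >= 12 only.
shapiro_wilk_sketch <- function(x, alpha = 0.1) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  stopifnot(n >= 12)

  # Approximate expected values of standard normal order statistics
  m <- qnorm((seq_len(n) - 3 / 8) / (n + 1 / 4))
  mm <- sum(m^2)
  u <- 1 / sqrt(n)

  # Polynomial approximations of the two outermost coefficients
  a_n <- m[n] / sqrt(mm) + 0.221157 * u - 0.147981 * u^2 -
    2.071190 * u^3 + 4.434685 * u^4 - 2.706056 * u^5
  a_n1 <- m[n - 1] / sqrt(mm) + 0.042981 * u - 0.293762 * u^2 -
    1.752461 * u^3 + 5.682633 * u^4 - 3.582633 * u^5

  # Remaining coefficients, scaled so that sum(a^2) equals 1
  phi <- (mm - 2 * m[n]^2 - 2 * m[n - 1]^2) / (1 - 2 * a_n^2 - 2 * a_n1^2)
  a <- m / sqrt(phi)
  a[c(1, 2, n - 1, n)] <- c(-a_n, -a_n1, a_n1, a_n)

  # Test statistic
  W <- sum(a * x)^2 / sum((x - mean(x))^2)

  # ln(1 - W) is treated as normal with mean mu and standard deviation sigma
  ln_n <- log(n)
  mu <- -1.5861 - 0.31082 * ln_n - 0.083751 * ln_n^2 + 0.0038915 * ln_n^3
  sigma <- exp(-0.4803 - 0.082676 * ln_n + 0.0030302 * ln_n^2)
  W_crit <- 1 - exp(mu + sigma * qnorm(1 - alpha))
  p_value <- pnorm(log(1 - W), mean = mu, sd = sigma, lower.tail = FALSE)

  list(W = W, W_crit = W_crit, p_value = p_value, reject = W < W_crit)
}

# Comparison with R's built-in routine on simulated data
set.seed(1)
x <- rnorm(330)
res <- shapiro_wilk_sketch(x, alpha = 0.1)
ref <- shapiro.test(x)
c(sketch = res$W, builtin = unname(ref$statistic))
```

Since `qnorm` is vectorized, calling the function with `alpha = 0.1 / (1:50)` returns one critical value per type I error in the grid mentioned above; small values of W, i.e. values below the critical value, lead to rejection of normality.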

References

1. Bundesgesetzblatt. Bundesanzeiger Verlag, 1(58):2010–2073, 2014.

2. ASTM E178-16a. Standard Practice for Dealing With Outlying Observations. ASTM International, West Conshohocken, 2016. DOI: 10.1520/E0178-16A.

3. JO Berger. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1):1–32, 2003. DOI: 10.1214/ss/1056397485.

4. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, and OIML. Evaluation of measurement data – Guide to the expression of uncertainty in measurement. Joint Committee for Guides in Metrology, JCGM 100, 2008.

5. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, and OIML. Evaluation of measurement data – Supplement 1 to the ‘Guide to the expression of uncertainty in measurement’ – Propagation of distributions using a Monte Carlo method. Joint Committee for Guides in Metrology, JCGM 101, 2008.

6. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, and OIML. Evaluation of measurement data – Supplement 2 to the ‘Guide to the expression of uncertainty in measurement’ – Extension to any number of output quantities. Joint Committee for Guides in Metrology, JCGM 102, 2011.

7. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, and OIML. Evaluation of measurement data – The role of measurement uncertainty in conformity assessment. Joint Committee for Guides in Metrology, JCGM 106, 2012.

8. WR Blischke and DNP Murthy. Reliability: Modeling, Prediction, and Optimization. Wiley Series in Probability and Statistics. John Wiley & Sons, 2011. ISBN 9781118150474.

9. RB D’Agostino. Tests for the Normal Distribution. In: Goodness-of-Fit Techniques, pages 405–419. CRC Press, Taylor & Francis Group, 1986. ISBN 9780824774875.

10. RB D’Agostino and MA Stephens, editors. Goodness-of-Fit Techniques. Statistics: A Series of Textbooks and Monographs. CRC Press, Taylor & Francis Group, 1986. ISBN 9780824774875.

11. GE D’Errico. Multiple hypothesis testing for metrology applications. Accreditation and Quality Assurance, 19(1):1–10, Feb 2014. ISSN 1432-0517. DOI: 10.1007/s00769-013-1025-4.

12. Deutschen Akademie für Metrologie DAM. Gesetzliches Messwesen – Verfahrensanweisung für Stichprobenverfahren zur Verlängerung der Eichfrist (GM-VA SPV). Rechtssammlung der DAM, Stand 20.03.2018, 2018.

13. C Elster and K Klauenberg. A quantile inequality for location-scale distributions. 2019. Draft available. DOI: 10.1016/j.spl.2020.108851.

14. PJ Farrell and K Rogers-Stewart. Comprehensive study of tests for normality and symmetry: extending the Spiegelhalter test. Journal of Statistical Computation and Simulation, 76(9):803–816, 2006. DOI: 10.1080/10629360500109023.

15. FF Gan and KJ Koehler. Goodness-of-fit tests based on p-p probability plots. Technometrics, 32(3):289–303, 1990. DOI: 10.1080/00401706.1990.10484682.

16. S Greenland, SJ Senn, KJ Rothman, JB Carlin, C Poole, SN Goodman and DG Altman. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4):337–350, 2016. DOI: 10.1007/s10654-016-0149-3.

17. R Hubbard. Alphabet soup: Blurring the distinctions between p’s and a’s in psychological research. Theory & Psychology, 14(3):295–327, 2004. DOI: 10.1177/0959354304043638.

18. R Hubbard and MJ Bayarri. Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. The American Statistician, 57(3):171–178, 2003. DOI: 10.1198/0003130031856.

19. ISO/TC 69 SC 6 Measurement methods and results. ISO 5479:1997 Statistical interpretation of data – Tests for departure from the normal distribution. International Organization for Standardization ISO, 1997.

20. RN Kacker, RU Datla, and AC Parr. Statistical analysis of CIPM key comparisons based on the ISO Guide. Metrologia, 41(4):340–352, Jul 2004. DOI: 10.1088/0026-1394/41/4/017.

21. K Klauenberg, October 2019. URL https://www.ptb.de/cms/nc/en/ptb/fachabteilungen/abt8/fb-84/ag-842/messunsicherheit-8420.html#c119255.

22. K Klauenberg and C Elster. How to ensure the future quality of utility meters. OIML Bulletin, LIX(3):16–23, July 2018.

23. K Klauenberg, R Kramer, C Kroner, J Rose, and C Elster. Reducing sample size by tightening test conditions. Quality & Reliability Engineering International, 34(3):333–346, 2018. DOI: 10.1002/qre.2256.

24. EL Lehmann and JP Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer New York, 3rd edition, 2006. ISBN 9780387276052.

25. NIST/SEMATECH. e-handbook of statistical methods. URL http://www.itl.nist.gov/div898/handbook/. Accessed May 16, 2019.

26. R Nuzzo. Scientific method: statistical errors. Nature News, 506(7487):150, 2014. DOI: 10.1038/506150a.

27. ES Pearson, RB D’Agostino, and KO Bowman. Tests for departure from normality: Comparison of powers. Biometrika, 64(2):231–246, 1977. DOI: 10.1093/biomet/64.2.231.

28. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. URL http://www.R-project.org/.

29. NM Razali and YB Wah. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of statistical modeling and analytics, 2(1):21–33, 2011.

30. GB Rossi. Probability in Metrology. In: Data modeling for metrology and testing in measurement science. Springer, 2008. DOI: 10.1007/978-0-8176-4804-6_2.

31. JP Royston. An extension of Shapiro and Wilk’s W test for normality to large samples. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(2):115–124, 1982. DOI: 10.2307/2347973.

32. JP Royston. Correcting the Shapiro-Wilk W for ties. Journal of Statistical Computation and Simulation, 31(4):237–249, 1989. DOI: 10.1080/00949658908811146.

33. P Royston. Approximating the Shapiro-Wilk W-test for non-normality. Statistics and Computing, 2(3):117–119, Sep 1992. ISSN 1573-1375. DOI: 10.1007/BF01891203.

34. P Royston. Remark AS R94: A Remark on Algorithm AS 181: The W-test for normality. Journal of the Royal Statistical Society. Series C (Applied Statistics), 44(4):547–551, 1995. ISSN 00359254, 14679876. DOI: 10.2307/2986146.

35. E Seier. Comparison of tests for univariate normality. In InterStat, number 1 in Statistics on the Internet, pages 1–17, January 2002.

36. SS Shapiro and MB Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965. DOI: 10.1093/biomet/52.3-4.591.

37. MA Stephens. Tests based on EDF statistics. In: Goodness-of-Fit Techniques, pages 97–193. CRC Press, Taylor & Francis Group, 1986. ISBN 9780824774875. DOI: 10.1201/9780203753064-4.

38. D Taeger and S Kuhnt. Statistical Hypothesis Testing with SAS and R. Wiley, 2014. ISBN 9781118762615. DOI: 10.1002/9781118762585.

39. OIML TC 1 Terminology. International Vocabulary of Terms in Legal Metrology, volume 1 of OIML V. International Organization of Legal Metrology (OIML), 2013 (e/f) edition, 2013.

40. RL Wasserstein. ASA statement on statistical significance and p-values. The American Statistician, 70(2):129–133, 2016. DOI: 10.1080/00031305.2016.1154108.

41. RL Wasserstein, AL Schirm, and NA Lazar, editors. The American Statistician, volume 73. American Statistical Association, 2019. DOI: 10.1080/00031305.2019.1583913.

42. G Wübbeler, O Bodnar, and C Elster. Bayesian hypothesis testing for key comparisons. Metrologia, 53(4):1131–1138, Jul 2016. DOI: 10.1088/0026-1394/53/4/1131.

Received: 2019-10-02
Accepted: 2019-10-11
Published Online: 2019-11-06
Published in Print: 2019-11-18

© 2019 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 16.12.2025 from https://www.degruyterbrill.com/document/doi/10.1515/teme-2019-0148/html?lang=en