Homogeneity test and sample size of response rates for AC1 in a stratified evaluation design

  • Jingwei Jia, Yuanbo Liu, Jikai Yang and Zhiming Li
Published/Copyright: April 30, 2025

Abstract

Gwet’s first-order agreement coefficient (AC1) is widely used to evaluate the consistency between raters. Because the two raters’ assessments may be dependent, this paper tests the equality of response rates and of the between-rater dependency via modified AC1’s in a stratified design, and estimates the sample size required for a given significance level. We first establish a probability model and estimate its unknown parameters. We then construct homogeneity tests for these AC1’s based on asymptotic statistics: likelihood ratio, score, and Wald-type. In numerical simulations, the performance of the statistics is investigated in terms of type I error rates (TIEs) and power, and suitable sample sizes are determined for a given power. The results show that the Wald-type statistic has robust TIEs and satisfactory power and is suitable for large samples (n ≥ 50). For a given power, the Wald-type test requires a smaller sample size when the number of strata is large, and higher power demands a larger sample size. Finally, two real examples illustrate these methods.
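For intuition, a minimal Python sketch is given below. It computes Gwet’s AC1 for two raters with binary ratings from a 2×2 count table and applies a generic inverse-variance Wald-type homogeneity test across strata. This is an illustration only, not the authors’ method: the per-stratum variances come from a simple multinomial bootstrap rather than the paper’s asymptotic estimators, the example counts are hypothetical, and the statistic is the textbook Wald form without the modified AC1’s modeling of between-rater dependency.

    import numpy as np
    from scipy.stats import chi2

    def gwet_ac1(table):
        # Gwet's AC1 for two raters with binary ratings.
        # `table` is a 2x2 count array: rows = rater A (yes/no),
        # columns = rater B (yes/no).
        n = table.sum()
        po = (table[0, 0] + table[1, 1]) / n                         # observed agreement
        pi_hat = (table[0, :].sum() + table[:, 0].sum()) / (2 * n)   # mean "yes" rate of the two raters
        pe = 2 * pi_hat * (1 - pi_hat)                               # AC1 chance-agreement term
        return (po - pe) / (1 - pe)

    def bootstrap_var(table, n_boot=2000, seed=0):
        # Multinomial bootstrap variance of AC1, a stand-in for the
        # paper's closed-form variance estimators.
        rng = np.random.default_rng(seed)
        n = int(table.sum())
        p = (table / n).ravel()
        reps = rng.multinomial(n, p, size=n_boot).reshape(n_boot, 2, 2)
        return np.var([gwet_ac1(t) for t in reps], ddof=1)

    def wald_homogeneity(estimates, variances):
        # Generic Wald-type test of H0: AC1 is equal across G strata.
        # Uses the inverse-variance weighted common estimate; under H0
        # the statistic is approximately chi-square with G-1 df.
        est, w = np.asarray(estimates), 1.0 / np.asarray(variances)
        common = np.sum(w * est) / np.sum(w)
        stat = np.sum(w * (est - common) ** 2)
        return stat, chi2.sf(stat, len(est) - 1)

    # Hypothetical counts for two strata (not the data of Barlow et al.
    # or Reed III analyzed in the paper):
    strata = [np.array([[40, 5], [6, 9]]), np.array([[30, 8], [7, 15]])]
    hats = [gwet_ac1(t) for t in strata]
    vars_ = [bootstrap_var(t, seed=g) for g, t in enumerate(strata)]
    stat, pval = wald_homogeneity(hats, vars_)
    print(f"AC1 per stratum: {hats}, Wald stat = {stat:.3f}, p = {pval:.3f}")

The chi-square approximation presumes reasonably large per-stratum samples, consistent with the abstract’s n ≥ 50 recommendation for the Wald-type test; in practice the paper’s asymptotic variance estimators would replace the bootstrap step.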


Corresponding author: Zhiming Li, College of Mathematics and System Science, Xinjiang University, Urumqi, China, E-mail: 

Funding source: 2025 Central Guidance for Local Science and Technology Development Fund

Award Identifier / Grant number: ZYYD2025ZY20

Funding source: Xinjiang University Undergraduate Training Program for Innovation and Entrepreneurship

Award Identifier / Grant number: XJU-SRT-23102

  1. Research ethics: Not applicable.

  2. Informed consent: Informed consent was obtained from all individuals included in this study, or their legal guardians or wards.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work is supported by the Central Guidance for Local Science and Technology Development Fund (Grant No. ZYYD2025ZY20), the National Natural Science Foundation of China (Grant No. 12061070), and the Science and Technology Department of Xinjiang Uygur Autonomous Region (Grant No. 2021D01E13).

  7. Data availability: Clinical data referred to are from Barlow et al. (1991) and Reed III (2000).

References

1. Scott, WA. Reliability of content analysis: the case of nominal scale coding. Public Opin Q 1955;19:321–5. https://doi.org/10.1086/266577.

2. Cohen, J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46. https://doi.org/10.1177/001316446002000104.

3. Cicchetti, DV, Feinstein, AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551–8. https://doi.org/10.1016/0895-4356(90)90159-m.

4. Holley, JW, Guilford, JP. A note on the G index of agreement. Educ Psychol Meas 1964;24:749–53. https://doi.org/10.1177/001316446402400402.

5. Aickin, M. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics 1990;46:293–302. https://doi.org/10.2307/2531434.

6. Andrés, AM, Marzo, PF. Delta: a new measure of agreement between two raters. Br J Math Stat Psychol 2004;57:1–19. https://doi.org/10.1348/000711004849268.

7. Gwet, KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61:29–48. https://doi.org/10.1348/000711006x126600.

8. Shankar, V, Bangdiwala, SI. Observer agreement paradoxes in 2×2 tables: comparison of agreement measures. BMC Med Res Methodol 2014;14:1–9. https://doi.org/10.1186/1471-2288-14-100.

9. Ohyama, T. Statistical inference of agreement coefficient between two raters with binary outcomes. Commun Stat Theor Methods 2020;49:2529–39. https://doi.org/10.1080/03610926.2019.1576894.

10. Vach, W, Gerke, O. Gwet’s AC1 is not a substitute for Cohen’s kappa – a comparison of basic properties. MethodsX 2023;10:102212. https://doi.org/10.1016/j.mex.2023.102212.

11. Honda, C, Ohyama, T. Homogeneity score test of AC1 statistics and estimation of common AC1 in multiple or stratified inter-rater agreement studies. BMC Med Res Methodol 2020;20:20. https://doi.org/10.1186/s12874-019-0887-5.

12. Giammarino, M, Mattiello, S, Battini, M, Quatto, P, Battaglini, LM, Vieira, ACL, et al. Evaluation of inter-observer reliability of animal welfare indicators: which is the best index to use? Animals 2021;11:1445. https://doi.org/10.3390/ani11051445.

13. Tan, KS, Yeh, Y-C, Adusumilli, PS, Travis, WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep 2024;5:100618. https://doi.org/10.1016/j.jtocrr.2023.100618.

14. Ganju, J, Zhou, K. The benefit of stratification in clinical trials revisited. Stat Med 2011;30:2881–9. https://doi.org/10.1002/sim.4351.

15. Barlow, W, Lai, M-Y, Azen, SP. A comparison of methods for calculating a stratified kappa. Stat Med 1991;10:1465–72. https://doi.org/10.1002/sim.4780100913.

16. Reed, JF III. Homogeneity of kappa statistics in multiple samples. Comput Methods Progr Biomed 2000;63:43–6. https://doi.org/10.1016/s0169-2607(00)00074-2.

17. Xu, M, Li, Z, Mou, K, Shuaib, KM. Homogeneity test of the first-order agreement coefficient in a stratified design. Entropy 2023;25:536. https://doi.org/10.3390/e25030536.

18. Vach, W. The dependence of Cohen’s kappa on the prevalence does not matter. J Clin Epidemiol 2005;58:655–61. https://doi.org/10.1016/j.jclinepi.2004.02.021.

19. Engle, RF. Chapter 13: Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. In: Handbook of Econometrics, vol. 2. North-Holland: Elsevier; 1984:775–826. https://doi.org/10.1016/S1573-4412(84)02005-5.

20. Mou, K, Li, Z, Ma, C. Asymptotic sample size for common test of relative risk ratios in stratified bilateral data. Mathematics 2023;11:4198. https://doi.org/10.3390/math11194198.

21. Tang, M-L, Tang, N-S, Rosner, B. Statistical inference for correlated data in ophthalmologic studies. Stat Med 2006;25:2771–83. https://doi.org/10.1002/sim.2425.

Received: 2024-09-03
Accepted: 2025-03-31
Published Online: 2025-04-30

© 2025 Walter de Gruyter GmbH, Berlin/Boston
