Homogeneity test and sample size of response rates for AC₁ in a stratified evaluation design

Jingwei Jia; Yuanbo Liu; Jikai Yang; Zhiming Li

doi:10.1515/ijb-2024-0080


Published/Copyright: April 30, 2025


From the journal The International Journal of Biostatistics Volume 21 Issue 1

Abstract

Gwet’s first-order agreement coefficient (AC₁) is widely used to evaluate the consistency between raters. Because the raters’ assessments may be related, this paper tests the equality of the response rates and of the dependency between two raters for modified AC₁’s in a stratified design, and estimates the sample size for a given significance level. We first establish a probability model and then estimate the unknown parameters. We then study homogeneity tests of these AC₁’s based on asymptotic methods, including likelihood ratio, score, and Wald-type statistics. In numerical simulations, we investigate the performance of the statistics in terms of type I error rates (TIEs) and power, and determine a suitable sample size for a given power. The results show that the Wald-type statistic has robust TIEs and satisfactory power and is suitable for large samples (n≥50). Under the same power, the sample size required by the Wald-type test is smaller when the number of strata is large. The higher the power, the larger the required sample size. Finally, two real examples are given to illustrate these methods.

Keywords: the first-order agreement coefficient; homogeneity test; asymptotic statistic; sample size; confidence interval
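
As a point of reference for the abstract, the following is a minimal sketch of the standard Gwet AC₁ for two raters and a binary outcome, computed separately in each stratum. It is not the modified AC₁ model or any of the test statistics studied in the paper. The function name `gwet_ac1`, the cell labels `n11`, `n12`, `n21`, `n22`, and the counts in `strata` are hypothetical and chosen only for illustration.

```python
def gwet_ac1(n11, n12, n21, n22):
    """Standard Gwet AC1 from a 2x2 table of counts for two raters.

    n11: both raters choose category 1; n22: both choose category 2;
    n12, n21: the two kinds of disagreement.
    """
    n = n11 + n12 + n21 + n22
    if n == 0:
        raise ValueError("table must contain at least one subject")
    # Observed agreement: proportion of subjects on the diagonal.
    p_a = (n11 + n22) / n
    # Average of the two raters' marginal proportions for category 1.
    pi = ((n11 + n12) + (n11 + n21)) / (2 * n)
    # Chance agreement under Gwet's definition; it never exceeds 0.5,
    # so the denominator below is always at least 0.5.
    p_e = 2 * pi * (1 - pi)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical counts for two strata (illustration only, not data from the paper).
strata = {
    "stratum 1": (40, 5, 5, 10),
    "stratum 2": (20, 10, 10, 20),
}

for name, table in strata.items():
    print(name, round(gwet_ac1(*table), 4))
```

A homogeneity test asks whether coefficients such as these differ across strata by more than sampling variation would explain; the sketch only produces the per-stratum point estimates.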

Corresponding author: Zhiming Li, College of Mathematics and System Science, Xinjiang University, Urumqi, China, E-mail: zmli@xju.edu.cn

Funding source: 2025 Central Guidance for Local Science and Technology Development Fund

Award Identifier / Grant number: ZYYD2025ZY20

Funding source: Xinjiang University Undergraduate Training Program for Innovation and Entrepreneurship

Award Identifier / Grant number: XJU-SRT-23102

Research ethics: Not applicable.
Informed consent: Informed consent was obtained from all individuals included in this study, or their legal guardians or wards.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors state no conflict of interest.
Research funding: This work is supported by the Central Guidance for Local Science and Technology Development Fund (Grant No. ZYYD2025ZY20), the National Natural Science Foundation of China (Grant No. 12061070), and the Science and Technology Department of Xinjiang Uygur Autonomous Region (Grant No. 2021D01E13).
Data availability: Clinical data referred to are from Barlow et al. (1991) and Reed III (2000).

References

1. Scott, WA. Reliability of content analysis: the case of nominal scale coding. Public Opin Q 1955;19:321–5. https://doi.org/10.1086/266577.

2. Cohen, J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46. https://doi.org/10.1177/001316446002000104.

3. Cicchetti, DV, Feinstein, AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551–8. https://doi.org/10.1016/0895-4356(90)90159-m.

4. Wilson Holley, J, Paul Guilford, J. A note on the G index of agreement. Educ Psychol Meas 1964;24:749–53. https://doi.org/10.1177/001316446402400402.

5. Aickin, M. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics 1990;46:293–302. https://doi.org/10.2307/2531434.

6. Andrés, AM, Marzo, PF. Delta: a new measure of agreement between two raters. Br J Math Stat Psychol 2004;57:1–19. https://doi.org/10.1348/000711004849268.

7. Li Gwet, K. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61:29–48. https://doi.org/10.1348/000711006x126600.

8. Shankar, V, Bangdiwala, SI. Observer agreement paradoxes in 2x2 tables: comparison of agreement measures. BMC Med Res Methodol 2014;14:1–9. https://doi.org/10.1186/1471-2288-14-100.

9. Ohyama, T. Statistical inference of agreement coefficient between two raters with binary outcomes. Commun Stat Theor Methods 2020;49:2529–39. https://doi.org/10.1080/03610926.2019.1576894.

10. Vach, W, Gerke, O. Gwet’s AC1 is not a substitute for Cohen’s kappa–a comparison of basic properties. MethodsX 2023;10:102212. https://doi.org/10.1016/j.mex.2023.102212.

11. Honda, C, Ohyama, T. Homogeneity score test of AC1 statistics and estimation of common AC1 in multiple or stratified inter-rater agreement studies. BMC Med Res Methodol 2020;20:20. https://doi.org/10.1186/s12874-019-0887-5.

12. Giammarino, M, Mattiello, S, Battini, M, Quatto, P, Battaglini, LM, Vieira, ACL, et al. Evaluation of inter-observer reliability of animal welfare indicators: which is the best index to use? Animals 2021;11:1445. https://doi.org/10.3390/ani11051445.

13. Tan, KS, Yeh, Y-C, Adusumilli, PS, Travis, WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep 2024;5:100618. https://doi.org/10.1016/j.jtocrr.2023.100618.

14. Ganju, J, Zhou, K. The benefit of stratification in clinical trials revisited. Stat Med 2011;30:2881–9. https://doi.org/10.1002/sim.4351.

15. Barlow, W, Lai, M-Y, Azen, SP. A comparison of methods for calculating a stratified kappa. Stat Med 1991;10:1465–72. https://doi.org/10.1002/sim.4780100913.

16. Reed, JFIII. Homogeneity of kappa statistics in multiple samples. Comput Methods Progr Biomed 2000;63:43–6. https://doi.org/10.1016/s0169-2607(00)00074-2.

17. Xu, M, Li, Z, Mou, K, Shuaib, KM. Homogeneity test of the first-order agreement coefficient in a stratified design. Entropy 2023;25:536. https://doi.org/10.3390/e25030536.

18. Vach, W. The dependence of Cohen’s kappa on the prevalence does not matter. J Clin Epidemiol 2005;58:655–61. https://doi.org/10.1016/j.jclinepi.2004.02.021.

19. Engle, RF. Chapter 13 Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Volume 2 of Handbook of Econometrics. North-Holland: Elsevier; 1984:775–826 pp. https://doi.org/10.1016/S1573-4412(84)02005-5.

20. Mou, K, Li, Z, Ma, C. Asymptotic sample size for common test of relative risk ratios in stratified bilateral data. Mathematics 2023;11:4198. https://doi.org/10.3390/math11194198.

21. Tang, M-L, Tang, N-S, Rosner, B. Statistical inference for correlated data in ophthalmologic studies. Stat Med 2006;25:2771–83. https://doi.org/10.1002/sim.2425.

Received: 2024-09-03

Accepted: 2025-03-31

Published Online: 2025-04-30
