Abstract
Gwet’s first-order agreement coefficient (AC1) is widely used to evaluate the consistency between raters. Allowing for dependence between the raters, this paper tests the equality of the response rates and of the between-rater dependency of modified AC1 coefficients in a stratified design, and estimates the sample size required at a given significance level. We first establish a probability model and estimate its unknown parameters. We then develop homogeneity tests for these AC1 coefficients based on asymptotic methods, namely likelihood ratio, score, and Wald-type statistics. In numerical simulations, the performance of the statistics is assessed in terms of type I error rates (TIEs) and power, and a suitable sample size is determined for a given power. The results show that the Wald-type statistic has robust TIEs and satisfactory power and is suitable for large samples (n ≥ 50). For the same power, the Wald-type test requires a smaller sample size when the number of strata is large. The higher the power, the larger the required sample size. Finally, two real examples illustrate the methods.
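For orientation, the following is a minimal sketch of the standard (unmodified) Gwet AC1 for two raters and a binary rating, computed separately in each stratum. It is not the modified coefficient or any of the test statistics studied in the paper. The function name `gwet_ac1`, the cell labels, and the stratum counts are hypothetical and chosen only for illustration.

```python
def gwet_ac1(n11, n10, n01, n00):
    """Standard Gwet AC1 for two raters and a binary rating (2x2 counts).

    n11: both raters positive; n10: only rater 1 positive;
    n01: only rater 2 positive; n00: both raters negative.
    """
    n = n11 + n10 + n01 + n00
    if n == 0:
        raise ValueError("empty table")
    p_a = (n11 + n00) / n                        # observed agreement
    pi = ((n11 + n10) + (n11 + n01)) / (2 * n)   # mean positive-rating rate
    p_e = 2 * pi * (1 - pi)                      # chance agreement under AC1
    return (p_a - p_e) / (1 - p_e)               # 1 - p_e >= 0.5, so never zero


# Hypothetical counts (n11, n10, n01, n00) for two strata; not data from the paper.
strata = {"stratum_1": (40, 5, 5, 10), "stratum_2": (20, 10, 10, 20)}
ac1_by_stratum = {name: gwet_ac1(*cells) for name, cells in strata.items()}
```

A homogeneity test asks whether per-stratum values such as these could plausibly share one common coefficient. The sketch only produces the point estimates that such a test would compare.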
Funding source: 2025 Central Guidance for Local Science and Technology Development Fund
Award Identifier / Grant number: ZYYD2025ZY20
Funding source: Xinjiang University Undergraduate Training Program for Innovation and Entrepreneurship
Award Identifier / Grant number: XJU-SRT-23102
- Research ethics: Not applicable.
- Informed consent: Informed consent was obtained from all individuals included in this study, or their legal guardians or wards.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: This work is supported by the Central Guidance for Local Science and Technology Development Fund (Grant No. ZYYD2025ZY20), the National Natural Science Foundation of China (Grant No. 12061070), and the Science and Technology Department of Xinjiang Uygur Autonomous Region (Grant No. 2021D01E13).
- Data availability: The clinical data referred to are from Barlow et al. (1991) and Reed III (2000).
References
1. Scott, WA. Reliability of content analysis: the case of nominal scale coding. Public Opin Q 1955;19:321–5. https://doi.org/10.1086/266577.
2. Cohen, J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46. https://doi.org/10.1177/001316446002000104.
3. Cicchetti, DV, Feinstein, AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551–8. https://doi.org/10.1016/0895-4356(90)90159-m.
4. Holley, JW, Guilford, JP. A note on the G index of agreement. Educ Psychol Meas 1964;24:749–53. https://doi.org/10.1177/001316446402400402.
5. Aickin, M. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics 1990;46:293–302. https://doi.org/10.2307/2531434.
6. Andrés, AM, Marzo, PF. Delta: a new measure of agreement between two raters. Br J Math Stat Psychol 2004;57:1–19. https://doi.org/10.1348/000711004849268.
7. Gwet, KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61:29–48. https://doi.org/10.1348/000711006x126600.
8. Shankar, V, Bangdiwala, SI. Observer agreement paradoxes in 2x2 tables: comparison of agreement measures. BMC Med Res Methodol 2014;14:1–9. https://doi.org/10.1186/1471-2288-14-100.
9. Ohyama, T. Statistical inference of agreement coefficient between two raters with binary outcomes. Commun Stat Theor Methods 2020;49:2529–39. https://doi.org/10.1080/03610926.2019.1576894.
10. Vach, W, Gerke, O. Gwet’s AC1 is not a substitute for Cohen’s kappa–a comparison of basic properties. MethodsX 2023;10:102212. https://doi.org/10.1016/j.mex.2023.102212.
11. Honda, C, Ohyama, T. Homogeneity score test of AC1 statistics and estimation of common AC1 in multiple or stratified inter-rater agreement studies. BMC Med Res Methodol 2020;20:20. https://doi.org/10.1186/s12874-019-0887-5.
12. Giammarino, M, Mattiello, S, Battini, M, Quatto, P, Battaglini, LM, Vieira, ACL, et al. Evaluation of inter-observer reliability of animal welfare indicators: which is the best index to use? Animals 2021;11:1445. https://doi.org/10.3390/ani11051445.
13. Tan, KS, Yeh, Y-C, Adusumilli, PS, Travis, WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep 2024;5:100618. https://doi.org/10.1016/j.jtocrr.2023.100618.
14. Ganju, J, Zhou, K. The benefit of stratification in clinical trials revisited. Stat Med 2011;30:2881–9. https://doi.org/10.1002/sim.4351.
15. Barlow, W, Lai, M-Y, Azen, SP. A comparison of methods for calculating a stratified kappa. Stat Med 1991;10:1465–72. https://doi.org/10.1002/sim.4780100913.
16. Reed, JF III. Homogeneity of kappa statistics in multiple samples. Comput Methods Progr Biomed 2000;63:43–6. https://doi.org/10.1016/s0169-2607(00)00074-2.
17. Xu, M, Li, Z, Mou, K, Shuaib, KM. Homogeneity test of the first-order agreement coefficient in a stratified design. Entropy 2023;25:536. https://doi.org/10.3390/e25030536.
18. Vach, W. The dependence of Cohen’s kappa on the prevalence does not matter. J Clin Epidemiol 2005;58:655–61. https://doi.org/10.1016/j.jclinepi.2004.02.021.
19. Engle, RF. Chapter 13 Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Volume 2 of Handbook of Econometrics. North-Holland: Elsevier; 1984:775–826 pp. https://doi.org/10.1016/S1573-4412(84)02005-5.
20. Mou, K, Li, Z, Ma, C. Asymptotic sample size for common test of relative risk ratios in stratified bilateral data. Mathematics 2023;11:4198. https://doi.org/10.3390/math11194198.
21. Tang, M-L, Tang, N-S, Rosner, B. Statistical inference for correlated data in ophthalmologic studies. Stat Med 2006;25:2771–83. https://doi.org/10.1002/sim.2425.
© 2025 Walter de Gruyter GmbH, Berlin/Boston