Scoring and Consequential Validity Evidence of Computer- and Paper-Based Writing Tests in Times of Change

  • María Eugenia Guapacha-Chamorro
  • Orlando Chaves-Varón

Published/Copyright: September 18, 2024

Abstract

Little is known about how assessment modality, i.e., computer-based (CB) versus paper-based (PB) tests, affects language teachers’ scores, perceptions, and preferences and, therefore, the validity and fairness of classroom writing assessments. The present mixed-methods study used Shaw and Weir’s (2007) sociocognitive writing test validation framework to examine the scoring and consequential validity evidence of CB and PB writing tests in EFL classroom assessment in higher education. To examine the scoring validity of CB and PB tests, original handwritten and word-processed texts by 38 EFL university students were transcribed into the opposite format and assessed by three language lecturers (N = 456 texts, 152 per teacher). The teachers’ perceptions of text quality and preferences for assessment modality provided the consequential validity evidence for both tests. Findings revealed that assessment modality affected teachers’ scores, perceptions, and preferences. The teachers awarded higher scores to original and transcribed handwritten texts, particularly for text organization and language use. The teachers’ perceptions of text quality differed from their ratings, and physical, psychological, and experiential factors influenced their preferences for assessment modality. The results have implications for the validity and fairness of CB and PB writing tests and for teachers’ assessment practices.
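
A minimal accounting of the figures above, assuming each of the 38 students produced one original text in each modality, every original was transcribed once into the opposite format, and each teacher rated the full set:

\[ \underbrace{38 \times 2}_{\text{originals}} + \underbrace{38 \times 2}_{\text{transcriptions}} = 152 \text{ texts per teacher}, \qquad 152 \times 3 = 456 \text{ assessments in total.} \]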

About the authors

María Eugenia Guapacha-Chamorro

María Eugenia Guapacha-Chamorro is Associate Professor of English (applied linguistics) at Universidad del Valle in Colombia. Her research efforts have focused on language assessment, writing assessment, EFL writing, and language teachers’ professional development.

Orlando Chaves-Varón

Orlando Chaves-Varón is Professor of English (applied linguistics) at Universidad del Valle in Colombia. His research efforts have focused on L2 writing and language teachers’ professional development.

Acknowledgments

The authors declare no competing interests. The study was approved by Universidad del Valle as part of one author’s doctoral thesis. All participants gave informed consent to take part in the study. We are grateful to the teachers who participated in this study.

References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Psychological Association.

Aydin, S. (2006). The effect of computers on the test and inter-rater reliability of writing tests of ESL learners. Turkish Online Journal of Educational Technology-TOJET, 5(1), 75-81. https://eric.ed.gov/?id=EJ1102486

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford University Press.

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford University Press.

Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86-107. https://doi.org/10.1016/j.asw.2007.07.001

Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74. https://doi.org/10.1080/15434300903464418

Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279-293. https://doi.org/10.1080/0969594X.2010.526585

Barkaoui, K., & Knouzi, I. (2018). The effects of writing mode and computer ability on L2 test-takers’ essay characteristics and scores. Assessing Writing, 36, 19-31. https://doi.org/10.1016/j.asw.2018.02.005

Breland, H., Lee, Y. W., & Muraki, E. (2005). Comparability of TOEFL CBT essay prompts: Response-mode analyses. Educational and Psychological Measurement, 65(4), 577-595. https://doi.org/10.1177/0013164404272504

Bridgeman, B., & Cooper, P. (1998). Comparability of scores on word-processed and handwritten essays on the Graduate Management Admissions Test. Research Report No. 143. http://files.eric.ed.gov/fulltext/ED421528.pdf

Brown, A. (2003). Legibility and the rating of second language writing: An investigation of the rating of handwritten and word-processed IELTS task two essays. In R. Tulloh (Ed.), International English Language Testing System (IELTS) research reports: 4 (pp. 131-151). IELTS. https://search.informit.com.au/documentSummary;dn=909088164666390;res=IELHSS

Brown, H. D., & Abeywickrama, P. (2019). Language assessment: Principles and classroom practices (3rd ed.). Pearson Longman.

Brunfaut, T., Harding, L., & Batty, A. O. (2018). Going online: The effect of mode of delivery on performances and perceptions on an English L2 writing test suite. Assessing Writing, 36, 3-18. https://doi.org/10.1016/j.asw.2018.02.003

Canz, T., Hoffmann, L., & Kania, R. (2020). Presentation-mode effects in large-scale writing assessments. Assessing Writing, 45, 100470. https://doi.org/10.1016/j.asw.2020.100470

Chapelle, C., & Voss, E. (2016). 20 years of technology and language assessment in Language Learning & Technology. Language Learning & Technology, 20(2), 116-128. http://llt.msu.edu/issues/june2016/chapellevoss.pdf

Cheng, L., & Sun, Y. (2015). Teachers’ grading decision making: Multiple influencing factors and methods. Language Assessment Quarterly, 12(2), 213-233. https://doi.org/10.1080/15434303.2015.1010726

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge University Press.

Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). Sage.

Crusan, D. (2010). Assessment in the second language writing classroom. The University of Michigan Press. https://doi.org/10.3998/mpub.770334

East, M. (2008). Dictionary use in foreign language writing exams: Impact and implications. John Benjamins. https://doi.org/10.1075/lllt.22

East, M. (2009). Evaluating the reliability of a detailed analytic scoring rubric for foreign language writing. Assessing Writing, 14(2), 88-115. https://doi.org/10.1016/j.asw.2009.04.001

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155-185. https://doi.org/10.1177/0265532207086780

Eckes, T., Müller-Karabil, A., & Zimmermann, S. (2016). Assessing writing. In D. Tsagari, & J. Banerjee (Eds.), Handbook of second language assessment (pp. 147-164). De Gruyter. https://doi.org/10.1515/9781614513827-012

Elder, C., Knoch, U., & Zhang, R. (2009). Diagnosing the support needs of second language writers: Does the time allowance matter? TESOL Quarterly, 43(2), 351-360. https://doi.org/10.1002/j.1545-7249.2009.tb00178.x

Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9(2), 113-132. https://doi.org/10.1080/15434303.2011.642041

Green, A., & Hawkey, R. (2012). Marking assessments: Rating scales and rubrics. In C. Coombe, P. Davidson, B. O’Sullivan, & S. Stoynoff (Eds.), The Cambridge guide to second language assessment (pp. 299-306). Cambridge University Press.

Green, A., & Maycock, L. (2004). Computer-based IELTS and paper-based versions of IELTS. Research Notes, 18, 3-6. https://www.cambridgeenglish.org/images/23135-research-notes-18.pdf

Guapacha-Chamorro, M. E. (2020). Investigating the comparative validity of computer- and paper-based writing tests and differences in impact on EFL test-takers and raters (Doctoral dissertation). https://researchspace.auckland.ac.nz/bitstream/handle/2292/53273/Chamorro-2020-thesis.pdf?sequence=4

Guapacha-Chamorro, M. E. (2022). Cognitive validity evidence of computer- and paper-based writing tests and differences in the impact on EFL test-takers in classroom assessment. Assessing Writing, 51, 100594. https://doi.org/10.1016/j.asw.2021.100594

Guapacha-Chamorro, M. E., & Chaves Varón, O. (2023). EFL writing studies in Colombia between 1990 and 2020: A qualitative research synthesis. Profile: Issues in Teachers’ Professional Development, 25(1), 247-267. https://doi.org/10.15446/profile.v25n1.94798

Hamp-Lyons, L. (2016). Farewell to holistic scoring? Assessing Writing, 27, A1-A2. https://doi.org/10.1016/j.asw.2015.12.002

He, T. H., Gou, W. J., Chien, Y. C., Chen, I. S. J., & Chang, S. M. (2013). Multi-faceted Rasch measurement and bias patterns in EFL writing performance assessment. Psychological Reports, 112(2), 469-485. https://doi.org/10.2466/03.11.PR0.112.2.469-485

Hyland, K. (2010). Teaching and researching writing (2nd ed.). Pearson.

Im, G. H., Shin, D., & Cheng, L. (2019). Critical review of validation models and practices in language testing: Their limitations and future directions for validation research. Language Testing in Asia, 9(14), 1-26. https://doi.org/10.1186/s40468-019-0089-4

Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Newbury House.

Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448-457. https://doi.org/10.1080/02796015.2013.12087465

Kim, H. R., Bowles, M., Yan, X., & Chung, S. J. (2018). Examining the comparability between paper- and computer-based versions of an integrated writing placement test. Assessing Writing, 36, 49-62. https://doi.org/10.1016/j.asw.2018.03.006

Knoch, U. (2016). Validation of writing assessment. In C. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 1-6). Blackwell. https://doi.org/10.1002/9781405198431.wbeal1480

Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163. https://doi.org/10.1016/j.jcm.2016.02.012

Landers, R. (2015). Computing intraclass correlations (ICC) as estimates of inter-rater reliability in SPSS. The Winnower, 2, 1-4. https://doi.org/10.15200/winn.143518.81744

Lee, H. K. (2004). A comparative study of ESL writers’ performance in a paper-based and a computer-delivered writing test. Assessing Writing, 9(1), 4-26. https://doi.org/10.1016/j.asw.2004.01.001

Lessien, E. (2013). The effects of typed versus handwritten essays on students’ scores on proficiency tests (Unpublished master’s thesis). Michigan State University, USA.

Li, J. (2006). The mediation of technology in ESL writing and its implications for writing assessment. Assessing Writing, 11, 5-21. https://doi.org/10.1016/j.asw.2005.09.001

Mahshanian, A., Eslami, A. R., & Ketabi, S. (2017). Raters’ fatigue and their comments during scoring writing essays: A case of Iranian EFL learners. Indonesian Journal of Applied Linguistics, 7(2), 302-314. https://doi.org/10.17509/ijal.v7i2.8347

Mahshanian, A., & Shahnazari, M. (2020). The effect of raters’ fatigue on scoring EFL writing tasks. Indonesian Journal of Applied Linguistics, 10(1), 1-13. https://doi.org/10.17509/ijal.v10i1.24956

Manalo, J. R., & Wolfe, E. W. (2000). The impact of composition medium on essay raters in foreign language testing. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA, April 24-28, 2000 (pp. 1-16). https://eric.ed.gov/?id=ED443836

McNamara, T. (2000). Language testing. Oxford University Press.

McNess, E., Arthur, L., & Crossley, M. (2015). “Ethnographic dazzle” and the construction of the “Other”: Revisiting dimensions of insider and outsider research for international and comparative education. Compare: A Journal of Comparative and International Education, 45(2), 295-316. https://doi.org/10.1080/03057925.2013.854616

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Macmillan.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256. https://doi.org/10.1177/026553229601300302

Milligan, L. (2016). Insider-outsider-inbetweener? Researcher positioning, participative methods and cross-cultural educational research. Compare: A Journal of Comparative and International Education, 46(2), 235-250. https://doi.org/10.1080/03057925.2014.928510

Mislevy, R. J., & Riconscente, M. (2005). Evidence-centered assessment design: Layers, concepts, and terminology. PADI Technical Report No. 9. SRI International and University of Maryland. http://padi.sri.com/downloads/TR9_ECD.pdf

Mohammadi, M., & Barzgaran, M. (2010). Comparability of computer-based and paper-based versions of writing section of PET in Iranian EFL context. The Journal of Applied Linguistics, 3(2), 144-167. https://jal.tabriz.iau.ir/article_523270_eb02bb135b05ea9834d50066fd1a3e7d.pdf

Pallant, J. (2016). SPSS survival manual (6th ed.). Allen & Unwin.

Phakiti, A., & Isaacs, T. (2021). Classroom assessment and validity: Psychometric and edumetric approaches. European Journal of Applied Linguistics and TEFL, 10(1), 3-24. https://discovery.ucl.ac.uk/id/eprint/10118328

Pitoniak, M. J., Young, J. W., Martiniello, M., King, T. C., Buteux, A., & Ginsburgh, M. (2009). Guidelines for the assessment of English language learners. Educational Testing Service.

Rahimi, M., & Zhang, L. J. (2018). Effects of task complexity and planning conditions on L2 argumentative writing production. Discourse Processes, 55(8), 726-742. https://doi.org/10.1080/0163853X.2017.1336042

Rahimi, M., & Zhang, L. J. (2019). Writing task complexity, students’ motivational beliefs, anxiety and their writing production in English as a second language. Reading and Writing, 32(3), 761-786. https://doi.org/10.1007/s11145-018-9887-9

Russell, M., & Tao, W. (2004). The influence of computer-print on rater scores. Practical Assessment, Research, and Evaluation, 9(1), 10. https://doi.org/10.7275/2efe-ts97

Shaw, S. (2003). Legibility and the rating of second language writing: The effect on examiners when assessing handwritten and word-processed scripts. Research Notes, 11(3), 7-10. https://www.cambridgeenglish.org/research-and-validation/publishedresearch/research-notes

Shaw, S., & Weir, C. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge University Press.

Slomp, D. (2016). An integrated design and appraisal framework for ethical writing assessment. The Journal of Writing Assessment, 9(1), 1-14. https://journalofwritingassessment.org/article.php?article=91

Stemler, S., & Tsai, J. (2008). Best practices in inter-rater reliability: Three common approaches. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 29-49). Sage. https://doi.org/10.4135/9781412995627.d5

Tate, T. P., Warschauer, M., & Abedi, J. (2016). The effects of prior computer use on computer-based writing: The 2011 NAEP writing assessment. Computers & Education, 101, 115-131. https://doi.org/10.1016/j.compedu.2016.06.001

Turner, C. E. (2013). Classroom assessment. In G. Fulcher, & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 65-78). Routledge. https://www.routledgehandbooks.com/doi/10.4324/9780203181287.ch4

Weigle, S. C. (2002). Assessing writing. Cambridge University Press. https://doi.org/10.1017/CBO9780511732997

Weigle, S. C. (2012). Assessing writing. In C. Coombe, P. Davidson, B. O’Sullivan, & S. Stoynoff (Eds.), The Cambridge guide to second language assessment (pp. 218-224). Cambridge University Press.

Weigle, S. C. (2016). Second language writing assessment. In R. M. Manchón, & P. K. Matsuda (Eds.), Handbook of second and foreign language writing (pp. 473-493). De Gruyter. https://doi.org/10.1515/9781614511335-025

Weir, C. (2005). Language testing and validation. Palgrave. https://doi.org/10.1057/9780230514577

Weir, C., Yan, J., O’Sullivan, B., & Bax, S. (2007). Does the computer make a difference? The reaction of candidates to a computer-based versus a traditional handwritten form of the IELTS Writing component: Effects and impact. International English Language Testing System (IELTS) Research Reports, 7, 1-37. https://search.informit.com.au/documentSummary;dn=078964976417848;res=IELHSS

Wind, S. A., & Guo, W. (2021). Beyond agreement: Exploring rater effects in large-scale mixed format assessments. Educational Assessment, 26(4), 264-283. https://doi.org/10.1080/10627197.2021.1962277

Wolfe, E. W., & Manalo, J. R. (2004). Composition medium comparability in a direct writing assessment of non-native English speakers. Language Learning & Technology, 8(1), 53-65. http://dx.doi.org/10125/25229

Xu, T. S., Zhang, L. J., & Gaffney, J. S. (2022). Examining the relative effects of task complexity and cognitive demands on students’ writing in a second language. Studies in Second Language Acquisition, 44(2), 483-506. https://doi.org/10.1017/S0272263121000310

Zhang, Q., & Min, G. (2019). Chinese writing composition among CFL learners: A comparison between handwriting and typewriting. Computers and Composition, 54, 102522. https://doi.org/10.1016/j.compcom.2019.102522

Zhi, M., & Huang, B. (2021). Investigating the authenticity of computer- and paper-based ESL writing tests. Assessing Writing, 50, 100548. https://doi.org/10.1016/j.asw.2021.100548

Published Online: 2024-09-18
Published in Print: 2024-09-25

© 2024 BFSU, FLTRP, Walter de Gruyter, Cultural and Education Section British Embassy
