
A Difficulty-Informed Approach to Developing Language Assessment Literacy for Classroom Purposes

  • Armin Berger


Published/Copyright: June 8, 2023

Abstract

For pre-service teacher education, it would be helpful to know how difficult certain aspects of language testing and assessment are for students. Such information could serve different pedagogical purposes including program design and lesson sequencing. This paper, situated in the context of pre-service teacher education at the University of Vienna, Austria, presents an approach to developing language assessment literacy (LAL) which takes account of the difficulty of language assessment. To obtain empirical difficulty estimates, the items of a LAL test for pre-service teachers of English were converted into statements of ability, knowledge, and understanding and then calibrated by means of multi-facet Rasch analysis based on performance data from 420 students. Qualitative content analysis was used to identify clusters among the calibrations, thus characterizing the difficulty continuum. The findings show a clear progression which can offer a basis for the design of teacher education programs. I first report on the process of generating the difficulty estimates. Then I describe the resulting difficulty continuum with particular emphasis on classroom-based assessment. Finally, I suggest design principles for difficulty-informed assessment courses, illustrating how they could be implemented in the teacher education program at the University of Vienna.

1 Introduction

Over the past two decades, language assessment literacy (LAL) has developed into a key focus within mainstream research in the field of language testing and assessment. Several models and heuristics have been proposed since the beginning of the millennium identifying some of the core components of LAL (Brindley, 2001; Davies, 2008; Fulcher, 2012; Inbar-Lourie, 2008). After the initial period with its predominantly componential perspective, the emphasis moved towards stakeholder groups and their differential needs. The central question was not so much what constitutes LAL in general but what type of assessment-related knowledge and skills specific stakeholder groups need in their contexts (Baker, 2016; Deygers & Malone, 2019; O’Loughlin, 2013; Pill & Harding, 2013; Taylor, 2013). Special attention has been paid to the needs of language teachers as one of the primary stakeholder groups (Berry et al., 2017; Gan & Lam, 2020; Vogt & Tsagari, 2014; Xu & Brown, 2017).

While the debate continues about what exactly constitutes LAL for language teachers, the literature has recently begun to focus more strongly on LAL development, levels of LAL, and learning trajectories. The new key questions are how LAL develops both in individual teachers and within specific groups of teachers, how they progress from lower to higher levels of proficiency, and how LAL programs are implemented. For institutions offering pre-service teacher education, in particular, it would be interesting to know which aspects of language assessment are relatively easy for students and may thus be acquired or developed earlier, and which aspects are relatively difficult and thus acquired or developed later. Such information about the difficulty of certain aspects of the construct could help teacher educators to customize their programs.

Extending the developmental perspective provided by previous research (Berger et al., 2023), this study sheds further light on the question as to how easy or difficult certain aspects of language assessment are for pre-service teachers of English. The focus is on differences in difficulty in the context of secondary-level teacher education in Austria. Using multi-facet Rasch analysis as an analytic tool, the study generated empirical difficulty estimates of assessment-related abilities, knowledge, and understanding (AKUs), based on post-hoc content referencing of a LAL theory test used in the teacher education program at the Department of English and American Studies, University of Vienna, Austria. Difficulty is thus construed as the probability of pre-service teachers providing a correct answer to an item representing an assessment-related AKU.

I begin this article by reviewing some of the literature on LAL development, with particular emphasis on language teachers. I then go on to describe the dataset used in this study, as well as the procedures, the main instrument, and the analysis. After presenting the findings, especially in relation to classroom-based assessment, I discuss some implications for LAL development in teacher education programs.

2 Developing Language Assessment Literacy in Teacher Education

With teachers spending a substantial amount of time diagnosing their learners’ strengths and weaknesses, tracking their achievement, monitoring their progress, and certifying their proficiency, assessment is an essential field of teacher activity. It is well documented that assessment and feedback are crucial factors in student learning (Black & Wiliam, 2010). Professional competence in language testing and assessment is therefore an essential requisite for all language teachers. Over the past two decades, there have been several attempts to determine what knowledge and skills language teachers require for assessing their students’ language competence. In an early approach, Davies (2008) emphasized the importance of skills (i.e., expertise in item writing, statistics, and test analysis), knowledge (i.e., language description and measurement), and principles (i.e., questions of validity, ethics, and professionalism). Inbar-Lourie (2008) also distinguished three dimensions for the content of LAL courses, the “what” (i.e., the description of the trait), the “how” (i.e., the assessment method and process), and the “why” (i.e., the reason and rationale for the assessment). In addition to the components of test design and development, large-scale standardized testing, classroom testing and washback, and validity and reliability, Fulcher (2012) emphasized more strongly than previous approaches the importance of understanding the context and social dimensions of language assessment.

What these approaches have in common is that they acknowledge how important it is for language teachers to understand the relationships between testing, assessment, teaching, and learning, but they differ in the relative importance they accord to these areas. Over time, there has been a shift in focus from testing to assessment and teaching and, more recently, to learning. Fulcher (2021), for example, extended his earlier definition of LAL from the perspective of learning-oriented assessment, emphasizing that the primary aim is to harness assessment for the purpose of enhancing learning. Linking learning-oriented assessment to validity theory, he contended that the validity of learning-oriented assessment hinges on its potential to change individual learners. Accordingly, the main concern of LAL in the framework of learning-oriented assessment is to equip teachers with what they “need to know to harness assessment in the service of change” (Fulcher, 2021, p. 35). Drawing on Hamp-Lyons (2017), he suggested seven elements of a theory of LAL for learning-oriented assessment, framed as practical skills: task design for effective learning, self- and peer assessment, timely feedback, effective teacher questioning, scaffolding of performance, lesson planning and classroom management for reflection, and management of affective impact on learners.

While the precise requirements for teachers are still under debate, recent discussions have focused more closely on LAL development. Although language testing and assessment is increasingly being anchored in second language teacher education around the globe, teachers often have an insufficient degree of LAL (Vogt & Tsagari, 2014; Xu & Brown, 2017), inadequate training for their future needs (Lam, 2015; Lee & Coniam, 2013), or misconceptions about language assessment (Berry et al., 2017). It is in this context, between the poles of increasing significance and insufficient levels of LAL, that meta-reflection about LAL development has gathered momentum. The literature providing a developmental perspective encompasses at least four interrelated aspects: (1) theoretical LAL levels, (2) empirical LAL progressions, (3) longitudinal LAL development, and (4) effective LAL pedagogies.

As regards the first aspect, the focus on stakeholder groups has brought with it a growing emphasis not just on the different types of knowledge/skills needed but also on different degrees of knowledge/skills depending on the stakeholders’ depth of involvement in certain aspects of language assessment. Several theoretical profiles and hierarchies have been suggested. While Fulcher’s (2012) earlier model essentially implied a hierarchy, with practical knowledge being the bedrock and principles and contexts featuring at higher levels, Taylor (2013) explicitly addressed the question of “differential assessment literacy” and LAL levels (p. 409). Drawing on the four levels of LAL suggested by Pill and Harding (2013), she illustrated the differential needs for test writers, university administrators, professional language testers, and classroom teachers, with the latter being hypothesized to require “multidimensional literacy” (i.e., the highest level of literacy) in relation to language pedagogy; “procedural and conceptual literacy” in relation to technical skills, local practices, sociocultural values, and personal beliefs/attitudes; but only “functional literacy” in terms of knowledge of theory, scores and decision making, and principles and concepts (Taylor, 2013, p. 410). Continuing in Taylor’s footsteps, Kremmel and Harding (2020) set out to empirically validate the hypothesized profiles by asking different stakeholder groups how knowledgeable or skilled they think they need to be within their professional roles in relation to a number of aspects of language assessment. While Kremmel and Harding also used a vertical scale with different levels, ranging from “not knowledgeable/skilled at all” to “extremely knowledgeable/skilled,” their main interest was not so much the question of levels but empirical dimensions of LAL.

Secondly, besides broad theoretical levels, the developmental view comprises the question of empirical LAL progressions which are granular enough to be useful for teacher education purposes, such as designing teaching-learning sequences for LAL courses, diagnosing students’ strengths and weaknesses, or providing focused feedback. While the work outlined above helpfully draws attention to the issue of the breadth and depth of assessment knowledge for different stakeholders, the level distinctions are mainly theoretical, and they are not really detailed enough to be operationally useful for professional development purposes. This concept of a detailed vertical progression was taken up in Berger et al.’s (2023) investigation of the perceived level of difficulty of a range of assessment-related AKUs. Using our adaptation of Kremmel and Harding’s (2020) Language Assessment Literacy Survey, for which we reformulated the original questionnaire items into statements of ability (I can ...), knowledge (I know ...), and understanding (I understand ...), we asked pre-service teachers of English in Austria to indicate on a five-point scale how well they thought they could do, know, or understand what is expressed in the statements after attending a one-semester assessment course. Scaling these data by means of multi-facet Rasch measurement, we obtained a continuum of difficulty and ability in relation to these statements. A qualitative content check of this scale helped us to extract salient features of LAL at five levels and condense the calibrations into five level descriptions, which we termed, in ascending order, emergent LAL, operational LAL, generative LAL, technical LAL, and instructional LAL. The same methodology was replicated in a follow-up study involving in-service teachers of English at all types of secondary-level schools throughout Austria (Berger & Heaney, 2022).

A third aspect of the developmental perspective is longitudinal LAL development. As Taylor (2013) admitted, not much is known yet about how LAL grows over time both in individual stakeholders and across stakeholder groups. There is hardly any research to date investigating how novices in language assessment might progress from a low level of proficiency to a higher level of proficiency, not just in terms of how they expand their knowledge base horizontally but also in terms of how they mature along a vertical dimension. This line of inquiry calls for longitudinal approaches (e.g., Xu, 2017), which are notoriously more difficult than other types of studies but necessary if we want to learn more about this under-researched area.

Finally, the developmental perspective covers LAL pedagogies, including both epistemological models underpinning program design and specific courses exemplifying implementation at the local level. The literature offers a number of descriptions of and reflections on existing language assessment courses for pre- and in-service teachers (e.g., Giraldo, 2021; Kremmel et al., 2018; O’Loughlin, 2006; Walters, 2010). Most of these courses are of a rather short duration, usually only one semester, with priority given to topics such as selecting, evaluating, and designing assessment instruments, often at the expense of attention to assessment principles such as ethics, transparency, and fairness (Giraldo, 2021). What the literature clearly lacks is work dedicated to the long-term effects of such courses. In fact, there is evidence to suggest that the impact is as short-lived as the courses themselves, and that their sustainability is questionable at best (Giraldo, 2021). More general models and heuristics for LAL development in teacher education also remain scarce (e.g., Berger, 2012; Yan & Fan, 2021). To conclude with Fulcher (2021, p. 45), “the next step in LAL research will be the investigation of successful LAL pedagogies for language teachers and other stakeholders.”

The present study integrates the quest for empirical LAL progressions with pedagogical considerations. It is an extension of previous work done to investigate the difficulty of assessment-related abilities (Berger et al., 2023), with the express purpose of providing an empirical basis for course design. Whilst the previous study offers a tentative progression of LAL in the Austrian context, it is limited in that the findings are based purely on pre-service teachers’ perceptions of the difficulty of assessment-related AKUs, as opposed to what they can actually do, know, or understand. This study, in contrast, is based on test performance data reflecting students’ knowledge and understanding. The following research questions are addressed:

  1. What is the empirical difficulty of the assessment-related AKUs for pre-service teachers of English at the University of Vienna?

  2. How can the continuum from easy to difficult AKUs be characterized?

3 Institutional Context of the Study

Students in the Bachelor of Education (BEd) program in the English Department at the University of Vienna take a compulsory one-semester course on language assessment. This introductory course is designed to give pre-service teachers of English a basic grounding in some of the key areas of language assessment (Department of English and American Studies, University of Vienna, 2022). Conceptually, within a three-dimensional framework characterizing the course content (focus on formal large-scale language testing and/or classroom-based language assessment), qualifications (focus on skills, knowledge, and/or principles), and degree of collaboration (focus on self-directed and/or collaborative learning), the course can be broadly described as a knowledge-skills-and-principles-integrated, collaborative classroom assessment course (Berger, 2012). Table 1 provides a brief overview of the course (for details, see Berger & Heaney, 2019).

Table 1

Overview of the Assessment Course at the University of Vienna

Course aims
1. To help future teachers of English to develop a basic understanding of fundamental concepts in language assessment
2. To help them understand how to implement these concepts in the selection, development, and evaluation of assessment instruments
3. To help them understand the links between assessment, testing, learning, and teaching
4. To help them to become confident enough to continue learning about language assessment on their own

Topics covered (in this order)
Assessment purpose, kinds of assessments, communicative language ability, construct definition, classroom tests, test specifications, assessment methods, principles of language assessment, test usefulness, assessing the four skills plus grammar and vocabulary, basic rater training for writing and speaking, learning-oriented assessment, feedback, grading

Project-based assignments
1. Developing a writing prompt for a classroom achievement test, piloting the prompt, analyzing the results, revising the prompt, providing formative feedback to the test takers
2. Selecting a grammar or vocabulary task for an achievement test, evaluating its usefulness, modifying some of the items if necessary

Assessment
1. Classwork and homework connected to the project assignments, rater training for writing, and a self-assessment task
2. A theory component
3. A final written assignment

Besides input and theory, the course offers a strong practical component involving two project-based assignments. In groups of four to six, students develop, pilot, and revise a writing prompt suitable for a classroom achievement test, based on some chapters of a textbook. The second assignment is to select an existing grammar or vocabulary task for the same achievement test, evaluate its usefulness for the given purpose, and modify some of the items if necessary. This iterative task development and selection process is accompanied by several loops of feedback and revision, involving different types of feedback, such as the use of exemplars of construct definitions, reflective dialogues on the draft prompts, or peer feedforward on the selected tasks. These project-based assignments provide students with the opportunity to go through parts of a test development cycle in a guided and reflective way, thereby applying and extending the knowledge that they have acquired during the semester.

As can be seen in Table 1, the final grade for this course is partly based on a theory component. While in the wake of the pandemic this theory component was changed to an online open-book test, in pre-coronavirus times, it took the form of a knowledge-based paper-and-pencil test, either taken in one sitting or divided into three smaller parts written at different points during the semester. The present study uses performance data from the paper-and-pencil version of the theory test.

4 Methods

4.1 Participants

The test performances of 420 Austrian pre-service teachers of English were analyzed. They were all undergraduate students enrolled in the BEd program at the Department of English and American Studies in Vienna. This cohort of students attended the compulsory assessment course between summer semester 2013 and winter semester 2019, when the paper-and-pencil version of the LAL theory test was in place.

4.2 Instruments and Procedures

The 28 test booklets available for this study comprised a total of 142 different test items. The number of items contained in each booklet varied between 10 and 30, depending on whether the test was split up into three parts (from 2013 to 2017) or whether it was written in one sitting (from 2017 to 2019). Three types of item formats were used. Firstly, selected-response items required students to choose the correct answers from a list of given options. For example, students were asked to indicate which categories apply to (parts of) the Austrian standardized school-leaving exam in English, with the response options being, inter alia, diagnostic test, achievement test, proficiency test, placement test, formative assessment, summative assessment, assessment for learning, assessment of learning, assessment as learning, norm-referenced test, criterion-referenced test, adaptive test, discrete-point test, and performance test. Secondly, limited-response items elicited a short answer, such as a term, a definition, an explanation, a list of advantages or disadvantages, suggested measures, possible washback effects, etc. For example, students were asked to explain the concept of triangulation in language assessment or to list measures to transform traditional tests into more pedagogically fulfilling learning experiences. The third group of items consisted of open-ended questions which required a slightly longer response, such as an outline of the relationship between formal standardized tests and alternative assessment in terms of practicality, reliability, washback, and authenticity, or a brief discussion of the disadvantages of the multiple-choice format based on an analysis of a given test item.

For this study, the content of each item was analyzed in a content-referencing procedure (McNamara, 1996), a technique sometimes used in the context of Rasch analysis of test data to generate criterion-referenced descriptions of performance. It involves a post hoc investigation of the content of items at various levels of difficulty with a view to describing the nature of achievement at those levels, thereby delineating a continuum of ability and difficulty (McNamara, 1996, p. 200). In the present study, each test item was investigated to identify the knowledge, ability, or understanding it operationalizes. Thus, the 142 items were converted into minimally meaningful statements of ability (can ...), knowledge (knows ...), and understanding (understands ...). For example, the selected-response item illustrated above was broken down into several statements including understands that the Austrian school-leaving exam in English is a proficiency test, understands that the Austrian school-leaving exam is not a diagnostic test, understands that the Austrian school-leaving exam is not an achievement test, etc., or the open-ended question mentioned above became can explain the relationship between formal standardized tests and alternative assessment in terms of practicality, can explain the relationship between formal standardized tests and alternative assessment in terms of reliability, etc. In this way, a total of 206 AKU statements were generated. The statements were then grouped into categories according to the main topics covered in the assessment course and according to the item constructs they represent. Table 2 provides an overview of these categories.

Table 2

Overview of Item Categories and Item Constructs

Item category (number of AKUs)
• Assessment concepts (51)
• Principles of language assessment (33)
• Designing classroom tests (29)
• Assessing speaking (21)
• Assessing grammar/vocabulary (20)
• Alternatives in assessment (15)
• Assessing listening (15)
• Assessing reading (14)
• Assessing writing (6)
• Grading and evaluation (2)
Total: 206

Item construct (number of AKUs)
• Test construct (58)
• Austrian school-leaving exam (SRDP) (23)
• Test method/response format (22)
• Performance-based assessment (14)
• Scoring and grading (13)
• Terminology (11)
• Alternative assessment (10)
• Validity (10)
• Washback (10)
• Test purpose (7)
• Reliability (6)
• Test usefulness (5)
• Formative assessment (4)
• Achievement testing (4)
• Multiple-choice format (3)
• Test specifications (2)
• Components of language competence (2)
• Triangulation (2)
Total: 206
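To make the mapping between test items, AKU statements, and the two sets of categories concrete, the sketch below shows one way such records could be represented for analysis. It is a minimal illustration only: the item identifier, field names, and example entries are invented for the sketch and are not taken from the actual test booklets.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AKUStatement:
    """One minimally meaningful statement of ability, knowledge, or understanding
    derived from a test item during post hoc content referencing."""
    text: str          # the AKU statement itself
    source_item: str   # identifier of the test item it was derived from (hypothetical)
    category: str      # course topic, as in the first list in Table 2
    construct: str     # item construct, as in the second list in Table 2

# Two illustrative statements derived from the selected-response item on the
# Austrian school-leaving exam (identifier and wording invented for the sketch):
aku_statements = [
    AKUStatement(
        text="understands that the Austrian school-leaving exam is a proficiency test",
        source_item="booklet_12/item_03",
        category="Assessment concepts",
        construct="Austrian school-leaving exam (SRDP)",
    ),
    AKUStatement(
        text="understands that the Austrian school-leaving exam is not an achievement test",
        source_item="booklet_12/item_03",
        category="Assessment concepts",
        construct="Austrian school-leaving exam (SRDP)",
    ),
]

# Counting statements per category yields the kind of overview shown in Table 2.
category_counts = Counter(s.category for s in aku_statements)
print(category_counts)  # Counter({'Assessment concepts': 2})
```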

Finally, the students’ responses relating to the AKU statements were scored as either correct or incorrect in the case of dichotomous items. With selected- and limited-response items, converting the students’ answers into a dichotomous score (1/0) was usually straightforward. Less frequently, open-ended questions involved a judgment on the researcher’s part. For example, the item Briefly describe the relationship between formal standardized tests and alternative assessment in terms of reliability required a decision about whether the answer was adequate or not (1/0). In the case of polytomous items, partial credit was awarded. For example, the item List four oral response modes for a speaking test was scored 0-4 depending on how many correct response modes were listed.
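The scoring logic described above can be illustrated with a short sketch. The partial-credit example assumes a hypothetical answer key for the "oral response modes" item; the actual key used in the course test may differ.

```python
def score_dichotomous(answer_is_adequate: bool) -> int:
    """Dichotomous scoring: 1 for a correct/adequate response, 0 otherwise."""
    return 1 if answer_is_adequate else 0

def score_partial_credit(listed_answers: list[str], key: set[str], max_score: int) -> int:
    """Partial-credit scoring: one point per correct element listed, capped at max_score."""
    correct = sum(1 for answer in listed_answers if answer.strip().lower() in key)
    return min(correct, max_score)

# Hypothetical key for the item "List four oral response modes for a speaking test".
response_mode_key = {"imitative", "responsive", "interactive", "extensive"}

print(score_dichotomous(True))  # 1
print(score_partial_credit(["responsive", "interactive", "monologue"],
                           response_mode_key, max_score=4))  # 2
```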

4.3 Analysis

Both research questions in this study were addressed in the context of Rasch measurement. In order to scale the AKU statements, a multi-facet Rasch analysis (Linacre, 1989) was conducted, using the computer program FACETS version 3.83.6 (Linacre, 2022). Multi-facet Rasch analysis has the advantage that it allows us to calibrate the relevant assessment parameters, placing them on a single common interval scale and generating difficulty estimates for each element of a facet (Bond & Fox, 2007). As such, it is particularly suitable for the purposes of this study. The AKU statements appear calibrated on a common scale, describing a continuum of difficulty/ability, and clusters of AKU statements at certain points on the continuum can be used to describe the nature of achievement at those levels. The two main facets for this study are candidates and items, as each response in the test is considered to be a function of the interaction between person ability and the difficulty of the item. Test booklet, item category, and item construct were defined as dummy facets, which are not intended for measuring main effects. As the tests contain a mix of dichotomous and polytomous items, several model statements were specified for the analysis. To handle different types of items, FACETS allows a multiple-model analysis, with the dichotomous model for items scored either right or wrong and the partial credit model for polytomous items.
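For readers unfamiliar with the model family, the following equations sketch the general form of the many-facet Rasch model as it applies to the two measured facets in this study; the notation is introduced here for illustration and is not reproduced from the FACETS output.

```latex
% Dichotomous items: log-odds of a correct response by candidate n on item i
\[
\log\frac{P_{ni1}}{P_{ni0}} = B_n - D_i
\]

% Partial-credit items: log-odds of scoring in category k rather than k-1
\[
\log\frac{P_{nik}}{P_{ni(k-1)}} = B_n - D_i - F_{ik}
\]
```

Here B_n is the ability of candidate n, D_i the difficulty of the item representing an AKU, and F_ik the k-th threshold of polytomous item i. Dummy facets such as test booklet, item category, and item construct are typically anchored at zero in FACETS so that they do not contribute to the measures. It also follows that a candidate located at the same logit position as a dichotomous item (B_n = D_i) has a success probability of exactly 0.5, which is how the facets map reported later (Figure 1) can be read.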

First, the program was run to clean the data. As the focus of this study was on items, not persons, candidates whose behavior departed markedly from the model’s expectations (i.e., elements with fit values outside the generally accepted range of 0.5 to 1.5) were removed from the dataset (Bond & Fox, 2007). A second FACETS analysis was then run to obtain the final difficulty estimates for each item.

While the first research question concerning the difficulty of assessment-related AKUs for Austrian pre-service teachers of English was approached quantitatively by generating difficulty estimates through FACETS, the second research question was addressed by qualitatively inspecting the item measurement report. In particular, content clusters among the calibrations were examined, as well as the nature and plausibility of the progression. To this end, cut-scores were set at equal intervals along the logit scale, and the resulting bands were observed qualitatively.
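As an illustration of the banding step, the sketch below divides calibrated measures into bands of equal width on the logit scale. It is a minimal sketch under the assumption that the bands span the observed range; the item labels and measures are invented for illustration and are not the calibrated AKUs of the study.

```python
def assign_bands(measures: dict[str, float], n_bands: int = 5) -> dict[str, int]:
    """Divide calibrated difficulty measures (in logits) into bands of equal width.

    Band 1 contains the most difficult items and band n_bands the easiest,
    mirroring the numbering used for the qualitative inspection of the
    item measurement report.
    """
    lo, hi = min(measures.values()), max(measures.values())
    width = (hi - lo) / n_bands
    bands = {}
    for item, measure in measures.items():
        # Index 0 = easiest slice; clamp so the single hardest item stays in the top slice.
        idx = min(int((measure - lo) / width), n_bands - 1)
        bands[item] = n_bands - idx  # invert so that band 1 = most difficult
    return bands

example = {"AKU_A": -4.15, "AKU_B": -0.14, "AKU_C": 5.30}
print(assign_bands(example))  # {'AKU_A': 5, 'AKU_B': 3, 'AKU_C': 1}
```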

5 Results

5.1 The Difficulty of the Assessment-Related AKUs

The results showed no overfitting elements, but 49 students had infit values higher than 1.5. Removing the misfitting candidates from the data set resulted in a final sample of n = 371. The facets map in Figure 1 provides an overview of the relationships between the elements of the facets, displaying, from left to right, the measurement scale in logit units, the relative ability of the candidates, and the relative difficulty of the AKUs. More able candidates are located at the top, while less able ones are located at the bottom. Similarly, difficult items appear at the top, whereas easier ones appear at the bottom. This visual representation allows a direct comparison of elements. For example, a student plotted at the same point on the logit scale as a dichotomous item has a 50% probability of succeeding with that item.

Figure 1. Facets Map Displaying the Calibrated Elements

Research question one (What is the empirical difficulty of the assessment-related AKUs for Austrian pre-service teachers of English?) can be answered by inspecting the measures of the calibrated AKU statements, illustrative examples of which are given in the Appendix. As can be seen, the item difficulty estimates range from 5.30 logits for the most difficult item (can analyze a given item in terms of Purpura’s [2004] model of grammatical and pragmatic knowledge [logical connection]) to -4.15 logits for the easiest item (understands that the Austrian school-leaving exam is not a form of assessment as learning), covering a total range of 9.45 logits, with a mean of -0.14 (SD = 1.67) and a mean SE of 0.42. The high reliability (0.92) and strata index (4.71), as well as the significant chi-square statistic, χ²(205, n = 371) = 4273.6, p < .001, show that the items differed significantly in terms of difficulty and that the analysis reliably separated the statements into five levels of difficulty.
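For readers who wish to relate the reported indices to one another, the standard Rasch separation statistics link reliability, separation, and strata as shown below; this is an assumption about how the reported figures were obtained, not a reproduction of the FACETS output.

```latex
\[
G = \sqrt{\frac{R}{1 - R}}, \qquad H = \frac{4G + 1}{3}
\]
```

With the reported reliability of R = 0.92, the separation is G ≈ 3.4 and the strata index H ≈ 4.9, which is consistent (allowing for rounding of the reliability) with the reported value of 4.71, i.e., roughly five statistically distinguishable levels of item difficulty.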

5.2 An Empirical Difficulty Continuum

To answer research question two (How can the continuum from easy to difficult AKUs be characterized?), the item measurement report was examined qualitatively. For ease of interpretation, the calibrated list of items was divided into five bands of equal width, where band five represents the lowest level and band one the highest level. Through a process of iterative reading, clusters of related items were identified. Three questions guided this process: (1) Which items with similar difficulty measures are thematically related to form a content cluster? (2) At which point on the scale do the related items start appearing? (3) In which band do most items forming a cluster occur? In this way, it was possible to characterize, in general terms, the individual levels as well as the overall progression. While a complete survey of these clusters is beyond the scope of this paper, the following paragraphs provide some examples by way of illustration.

At the bottom of the scale (band five), there is a large concentration of AKU statements about terminology and basic assessment concepts. It seems that students at an elementary LAL level can explain and illustrate fundamental concepts, provided that the explanations remain general and concrete (e.g., can explain the concept of discrete-point testing, can explain the concept of integrative testing). They can also compare standardized tests with alternatives in assessment (e.g., can explain the relationship between formal standardized tests and alternative assessment in terms of washback, practicality, and authenticity). Another cluster relates to the Austrian standardized school-leaving exam, or SRDP for short (e.g., understands that the Austrian school-leaving exam is not a diagnostic test, understands that the Austrian school-leaving exam is not ‘assessment for learning’). What is notable here is that only negative items relating to the school-leaving exam appear at the bottom level, while all positive items occur at the levels above, which shows that it is easier to understand what this exam is not, as opposed to what it is.

At the next level higher up (band four), there seems to be a concentration of AKU statements which are strongly connected to pedagogical aspects of assessment in the (Austrian) language classroom. Several items relating to washback, formative assessment, and alternative assessment cluster together (e.g., can explain the concept of washback, can give a concrete example of formative assessment). Not only can students explain the basic concepts and enumerate the purposes of formative assessment instruments (e.g., can list four purposes of conferences/interviews in the classroom, can list four purposes of journals in the classroom); slightly more advanced students can also suggest how to leverage summative tests for formative purposes (e.g., can list four measures to transform traditional tests into more pedagogically fulfilling learning experiences). This high density of items connected to the classroom context is not surprising, confirming earlier findings which suggested a link between the perceived training needs, the training received, and the perceived level of difficulty (Berger et al., 2023). In fact, those AKUs which are strongly associated with the daily practical work of language teachers, and which are therefore typically covered in teacher education programs, are not just perceived to be easier but turn out to be measurably easier than items less strongly connected to the day-to-day work of teachers.

There is also a concentration of items on test constructs and construct definition. Items refer to the abilities to distinguish between microskills and macroskills, to characterize language processes, and to identify variables that can potentially affect performance (e.g., can list four microskills of reading, can list four variables that can make listening difficult, understands the significance of the difference between controlled and automatic processing in second-language listening). Further aspects that start appearing more frequently at this level are simple AKUs relating to test methods (e.g., can give two advantages of extended production tasks over selected response tasks, can give three examples of open-ended speaking tasks) and simple items connected to test development (e.g., can give three reasons why it is important to have clear test specifications, can explain the concept of item moderation). As regards task analysis, items calibrated at this end of the scale are restricted to design features (e.g., can analyze a given writing prompt in relation to the design). Students at this level can also explain and illustrate fundamental assessment principles such as validity and reliability and relate them to classroom practice (e.g., can illustrate four types of factors which can contribute to a test’s [un]reliability, can list three measures that can help teachers to increase the reliability in classroom assessment).

In band three, many items about scoring and grading cluster together (e.g., can give three disadvantages of analytic scoring, can list four considerations for rating scale development), as well as items on construct description and analysis (e.g., can analyze a given item in terms of Purpura’s [2004] model of grammatical and pragmatic competence [grammatical form and meaning, sentential and suprasentential levels]).

In band two, two item clusters stand out. Firstly, there is a large concentration of analytical items, particularly in relation to analyzing the construct of a given grammar item in terms of Purpura’s (2004) components of grammatical and pragmatic knowledge, not just with regard to the basic distinctions between form and meaning or sentential and suprasentential level but in terms of the more specific components of grammatical and pragmatic knowledge. Secondly, this level is characterized by a focus on test usefulness and validation (e.g., knows that validation usually involves reasoning and empirical evidence, can explain Bachman and Palmer’s [1996] understanding of construct validity). There also seems to be a further focus on classroom-based assessment at this level, now with a new degree of complexity: several items refer to the ability to connect test qualities, such as construct representation and practicality, with questions of washback. Finally, at the top level (band one), the concentration of analytical items continues.

5.3 Progression in Classroom-Based Assessment

In addition to the content clusters illustrated above, the progression can also be characterized more specifically by zooming in on particular aspects of the construct. In order to obtain a more granular progression, it is instructive to group the items according to constructs and compare the relative difficulty of the respective AKUs. Table 3 illustrates such a comparison of items relating to aspects of the construct which are particularly relevant to the classroom context, namely formative assessment, alternative assessment, and washback.

Table 3

Calibrated AKUs Relating to Formative Assessment, Alternative Assessment, and Washback

Band 1
Washback:
• Can outline the relationship between practicality and washback (1.90)

Band 2
Formative assessment:
• Can list four measures that can help teachers to increase the reliability in classroom assessment (0.21)
Alternative assessment:
• Can list four differences between traditional and alternative assessment (0.65)
• Can give four advantages of portfolio assessment (0.63)
• Can give four disadvantages of portfolio assessment (-0.03)
Washback:
• Can outline the relationship between construct representation and washback (1.50)
• Can describe two negative washback effects of the Austrian standardized school-leaving exam (0.48)
• Can justify the idea that direct testing is more likely to have positive washback effects than indirect testing (0.41)
• Can describe two positive washback effects of the Austrian standardized school-leaving exam (0.16)

Band 3
Formative assessment:
• Can list four measures to transform traditional tests into more pedagogically fulfilling learning experiences (0.21)
• Can list four purposes of journals in the classroom (-0.47)
• Can list four purposes of conferences/interviews in the classroom (-0.80)
• Can give a concrete example of formative assessment (-1.35)
Alternative assessment:
• Can list four characteristics of alternative assessment (-0.09)
• Can give four guidelines for self- and peer assessment (-0.34)
• Can give four guidelines for portfolio assessment (-0.93)
Washback:
• Can justify the idea that criterion-referenced testing has the potential to encourage positive attitudes to language learning (-0.45)
• Can describe one negative washback effect of the Austrian standardized school-leaving exam (-0.85)
• Can describe one positive washback effect of the Austrian standardized school-leaving exam (-0.97)

Band 4
Formative assessment:
• Understands that the Austrian school-leaving exam is not a form of formative assessment (-2.32)
• Understands that the Austrian school-leaving exam is not ‘assessment for learning’ (-2.51)
Alternative assessment:
• Can explain the relationship between formal standardized tests and alternative assessment in terms of reliability (-1.95)
• Can explain the relationship between formal standardized tests and alternative assessment in terms of authenticity (-2.80)
• Understands that the Austrian school-leaving exam is not a form of alternative assessment (-3.04)
Washback:
• Can explain the concept of washback (-2.06)
• Can explain the relationship between formal standardized tests and alternative assessment in terms of washback (-3.10)

Band 5
Formative assessment:
• Understands that the Austrian school-leaving exam is not ‘assessment as learning’ (-4.15)
Alternative assessment:
• Can explain the relationship between formal standardized tests and alternative assessment in terms of practicality (-3.51)
• Understands that the Austrian school-leaving exam is not self-assessment (-5.36)

Note. Difficulty measures are given in logits.

The calibrations in Table 3 show a clear progression. With regard to alternative assessment, for example, the continuum moves from understanding what is not part of alternative assessment, to the ability to explain the relationship between formal standardized tests and alternative assessments, and on to knowing examples of purposes of alternative assessment instruments. Next on the continuum is an understanding of how alternative forms of assessment can be implemented. And finally, at the top end of the difficulty spectrum, there seems to be a fuller understanding of alternative assessment in terms of its characteristics and advantages/disadvantages.

6 Discussion

Due to the highly contextualized nature of LAL in general (Tsagari & Vogt, 2017; Xu & Brown, 2017) and the test instrument on which this study is based, in particular, the progressions presented here may not be generalizable to other contexts. It should be emphasized that the calibrations are not intended to represent a developmentally invariable learning trajectory. Nor do they reflect the natural order in which these AKUs are developed over time, let alone prescribe the order in which they should be developed. Building LAL is not like starting at the bottom of a ladder and then moving up to the top rung by rung. Such a view, albeit attractive, would be a gross simplification of the learning process, disregarding the many complexities and factors influencing LAL at both individual and contextual levels (e.g., Crusan et al., 2016; Yan et al., 2018).

Having said that, the findings can provide an empirical basis for the teaching of language testing and assessment, and as such they might be useful in similar settings. In particular, information about the difficulty of assessment-related AKUs can help teacher educators to make some basic pedagogical decisions, such as how to sequence the course content in a way that is not just intuitively plausible but empirically informed and consistent with the notion that developing LAL is a process of dealing with increasing difficulty and complexity, as opposed to merely covering a growing body of content. The following design principles could be useful for difficulty-informed LAL programs in pre-service teacher education:

  1. On the whole, LAL programs could progress from easier to more difficult AKUs.

  2. AKUs should ideally be developed in connection with practical work that has real-world relevance.

  3. Easier AKUs closely associated with the day-to-day work of language teachers could be a good starting point.

  4. Difficult AKUs could be integrated with easier ones in a cyclical and iterative process.

  5. Difficult AKUs could be developed through deep-level processing.

  6. Difficult AKUs may require more instruction time than easier ones.

  7. Possible bottlenecks to learning (i.e., difficult AKUs which might impede learning or prevent students from fully understanding a certain aspect of language assessment) should be identified and addressed at appropriate points.

These design principles can be applied to both individual teaching units and the syllabus at large. By way of illustration, a revised syllabus for the assessment course at the University of Vienna which is in line with these design principles could progress as follows: Initially, the focus could be on formative, learning-oriented, and alternative assessment. These topics are closely related to the day-to-day work of language teachers in Austria, and basic AKUs relating to these topics appear at the bottom end of the difficulty spectrum. The focus could then shift to achievement testing as the most common test purpose in the Austrian language classroom. Once the students have a reasonable level of understanding of the essential links between testing, assessment, teaching, and learning, the program could move on to test development, with practical and project-based assignments being at the heart of the learning process. Although in a classic test development cycle questions of test purpose, test construct, and test method would typically be addressed in that order, for pedagogical reasons the sequence could be reversed so that it is in accordance with the difficulty continuum and perhaps more realistically reflects teachers’ actual practice. Before covering task production, the starting point could be a task selection assignment in which students choose an existing task suitable for a classroom-based achievement test in the given context. This idea that task selection takes precedence over task production is also supported in Berger et al. (2023). Initially, the focus could be on the test method, response formats, and task design before moving on to recreating the construct and specifications in a process of backwards engineering. The students could receive different types of guidance and feedback along the way, for example in the form of a checklist for identifying the task characteristics and relevant test method facets, an exemplar of a construct definition for comparison with the construct they recreated, or a reflective dialogue about possible changes to the chosen task. This process could be adapted for the next, more difficult language skill(s), this time possibly involving task production, not just selection. For example, students could be asked to develop a writing prompt suitable for a classroom achievement test. Special attention should be paid to the question as to how to leverage the results of an achievement test for formative purposes, particularly with regard to the key role that feedback plays when it comes to enhancing learning.

At a later point, the focus could shift from achievement testing to proficiency testing, another important test purpose in the Austrian educational system. A contrastive analysis of achievement test tasks and proficiency test tasks could help students to understand that the Austrian school-leaving exam is not an achievement test, an AKU which was identified as a bottleneck to fully understanding test purposes in the Austrian educational context (because understanding that the Austrian school-leaving exam is not an achievement test turned out to be much more difficult than any other negatively worded AKU relating to the SRDP). Instead of covering the principles of assessment, test usefulness, and validation linearly in a separate thematic unit, they could be integrated with the practical assignments and treated recursively at various points along the way while discussing issues of successively increasing complexity. Ideally, all assignments are linked back to the bigger picture in terms of washback, consequential validity, and the learning orientation of classroom assessment, so as to consolidate students’ understanding of the links between testing, assessment, teaching, and learning.

The approach outlined here is difficulty-sensitive in that it takes account of differences in difficulty between assessment activities. It is one possible response to the challenge of placing theory alongside practical elements; it complies with the wish of many teachers to concentrate on practical assessment matters (Vogt & Tsagari, 2014); and yet it provides knowledge-skills-and-principles-integrated teacher education, as opposed to a-theoretical and merely practice-driven teacher training.

7 Conclusion

This article contributes to current discussions of LAL development by shedding some light on the difficulty of language assessment for pre-service teachers of English. While earlier conceptualizations of LAL at different levels often remain theoretical or rely heavily on questionnaire data and stakeholders’ perceptions of difficulty, this study provides empirical difficulty estimates based on a LAL theory test used in an assessment course in the English teacher education program at the University of Vienna. The items of that test were transformed into statements of AKU, which were then scaled on the basis of multi-facet Rasch measurement. The resulting calibrations were examined qualitatively for content clusters at certain points along the scale so as to characterize the progression in general terms. Broadly speaking, the difficulty continuum ranges from basic concepts and terminology to AKUs closely associated with practical classroom activities, pedagogical aspects of assessment, formative assessment, and alternative assessment; on to AKUs relating to test methods, simple test design, scoring, and grading; then from basic aspects of validity and reliability in relation to classroom practice on to test constructs, construct definition, and issues of validation; and finally analytical skills and the application of theoretical models. Seven design principles for difficulty-informed LAL courses in teacher education programs have been suggested and illustrated.

While the approach taken here complements previous work investigating the vertical dimension of LAL, the study has several limitations. A major conceptual limitation lies in the fact that the progression is based on an existing local test used in a particular setting. This test is designed to cover the theoretical part of a one-semester assessment course, with many items requiring memorization, recall, understanding, and analysis, usually in relation to factual, conceptual, or procedural knowledge. As the practical part of the course is assessed elsewhere, comparatively few items test the practical application of the knowledge, and there are no items which would require students to use their knowledge to create something new. Thus, there is an imbalance between knowledge, skills, and principles, the core components that are often considered to be constitutive of LAL (Davies, 2008; Inbar-Lourie, 2008), and therefore the difficulty continuum is necessarily, by its very nature, partial and incomplete. Further research could redress this imbalance by focusing more strongly on the difficulty of assessment-related practical skills. Another limitation is the narrow scope of the study. Being restricted to a specific teacher education context in Austria, the difficulty continuum presented here may in a large part reflect local teaching and learning traditions, and so the results are not directly generalizable beyond this sample. Further work needs to establish to what extent the difficulty continuum is similar in other contexts. Further work could also use different conceptualizations of difficulty and employ qualitative methods to counterbalance the measurement-driven approach taken here.

Notwithstanding these limitations, the findings of the study have practical implications. Progressions such as the one presented here can be useful for program design, provided that they are considered to be an open and dynamic rather than a normative framework. While the difficulty continuum should not be mistaken for a developmentally determined learning trajectory that students follow in a lockstep fashion, information about the difficulty of assessment-related AKUs can provide an empirical basis for important pedagogical decisions, such as the order in which the content should be addressed, how much instruction time should be devoted to any particular topic, the level of cognitive involvement conducive to learning, the degree of cyclicality in the teaching process, what bottlenecks could impede learning, and how learning could be instructionally scaffolded. As the question of difficulty is not context-independent, it is hoped that more such continua will be available in the future for different settings. Together with longitudinal studies on how LAL develops over time, such empirical investigations of difficulty can help us better understand not just how pre-service teachers expand their knowledge base horizontally by covering an ever-increasing number of assessment topics, but also how they mature and develop along a vertical dimension.

About the author

Armin Berger

Armin BERGER is a senior scientist and senior lecturer in the Department of English and American Studies at the University of Vienna, Austria, where he acts as academic coordinator of the English Language Competence program and teaches specialized classes in language testing and assessment in the teacher education program.

References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford University Press.

Baker, B. A. (2016). Language assessment literacy as professional competence: The case of Canadian admissions decision makers. Canadian Journal of Applied Linguistics, 19(1), 63-83. https://journals.lib.unb.ca/index.php/CJAL/article/view/23033

Berger, A. (2012). Creating language assessment literacy: A model for teacher education. In J. Hüttner, B. Mehlmauer-Larcher, S. Reichl, & B. Schiftner (Eds.), Theory and practice in EFL teacher education: Bridging the gap (pp. 57-82). Multilingual Matters. https://doi.org/10.21832/9781847695260-007

Berger, A., & Heaney, H. (2019). From local islands of knowledge to a shared, global understanding: A concept for improving assessment literacy in pre-service teachers of English at the University of Vienna, Austria. In C. Falkenhagen, H. Funk, M. Reinfried, & L. Volkmann (Eds.), Sprachen lernen integriert: Global, regional, lokal (pp. 187-200). Schneider-Verlag Hohengehren.

Berger, A., & Heaney, H. (2022, March 10). The difficulty of understanding the links between language assessment, teaching, and learning: An empirical continuum of assessment-related competencies [Paper presentation]. Language Testing Research Colloquium 2022, online. https://www.iltaonline.com/page/LTRC2022

Berger, A., Heaney, H., & Sigott, G. (2023). Dimensions of language assessment literacy and their difficulty for pre-service teachers of English in Austria. In C. Amerstorfer & M. von Blanckenburg (Eds.), Activating and engaging learners and teachers: Perspectives for English language education (pp. 159-181). Narr Francke Attempto.

Berry, V., Sheehan, S., & Munro, S. (2017). Exploring teachers’ language assessment literacy: A social constructivist approach to understanding effective practices. In E. Guitierrez Eugenio (Ed.), Learning and assessment: Making the connections—Proceedings of the ALTE 6th International Conference, 3-5 May 2017 (pp. 201-207).

Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 92(1), 81-90. https://doi.org/10.1177/003172171009200119

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Erlbaum.

Brindley, G. (2001). Language assessment and professional development. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara, & K. O’Loughlin (Eds.), Experimenting with uncertainty: Essays in honor of Alan Davies (pp. 126-136). Cambridge University Press.

Crusan, D., Plakans, L., & Gebril, A. (2016). Writing assessment literacy: Surveying second language teachers’ knowledge, beliefs, and practices. Assessing Writing, 28, 43-56. https://doi.org/10.1016/j.asw.2016.03.001

Davies, A. (2008). Textbook trends in teaching language testing. Language Testing, 25(3), 327-347. https://doi.org/10.1177/0265532208090156

Department of English and American Studies, University of Vienna. (2022). Assessment. Universität Wien. https://ufind.univie.ac.at/de/vvz_sub.html?path=272204&semester=2022S

Deygers, B., & Malone, M. E. (2019). Language assessment literacy in university admission policies, or the dialogue that isn’t. Language Testing, 36(3), 347-368. https://doi.org/10.1177/0265532219826390

Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9(2), 113-132. https://doi.org/10.1080/15434303.2011.642041

Fulcher, G. (2021). Language assessment literacy in a learning-oriented assessment framework. In A. Gebril (Ed.), Learning-oriented language assessment: Putting theory into practice (pp. 34-48). Routledge. https://doi.org/10.4324/9781003014102-4

Gan, L., & Lam, R. (2020). Understanding university English instructors’ assessment training needs in the Chinese context. Language Testing in Asia, 10(11), 1-18. https://doi.org/10.1186/s40468-020-00109-y

Giraldo, F. (2021). Language assessment literacy and teachers’ professional development: A review of the literature. Profile: Issues in Teachers’ Professional Development, 23(2), 265-279. https://doi.org/10.15446/profile.v23n2.90533

Hamp-Lyons, L. (2017). Language assessment literacy for language learning-oriented assessment. Papers in Language Testing and Assessment, 6(1), 88-111. https://doi.org/10.58379/LIXL1198

Inbar-Lourie, O. (2008). Constructing a language assessment knowledge base: A focus on language assessment courses. Language Testing, 25(3), 384-402. https://doi.org/10.1177/0265532208090158

Kremmel, B., Eberharter, K., Holzknecht, F., & Konrad, E. (2018). Fostering language assessment literacy through teacher involvement in high-stakes test development. In D. Xerri & P. Vella Briffa (Eds.), Teacher involvement in high-stakes language testing (pp. 173-194). Springer. https://doi.org/10.1007/978-3-319-77177-9_10

Kremmel, B., & Harding, L. (2020). Towards a comprehensive, empirical model of language assessment literacy across stakeholder groups: Developing the language assessment literacy survey. Language Assessment Quarterly, 17(1), 100-120. https://doi.org/10.1080/15434303.2019.1674855

Lam, R. (2015). Language assessment training in Hong Kong: Implications for language assessment literacy. Language Testing, 32(2), 169-197. https://doi.org/10.1177/0265532214554321

Lee, I., & Coniam, D. (2013). Introducing assessment for learning for EFL writing in an assessment of learning examination-driven system in Hong Kong. Journal of Second Language Writing, 22(1), 34-50. https://doi.org/10.1016/j.jslw.2012.11.003

Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.

Linacre, J. M. (2022). Facets computer program for many-facet Rasch measurement (Version 3.83.6). https://www.winsteps.com/facets.htm

McNamara, T. (1996). Measuring second language performance. Pearson.

O’Loughlin, K. (2006). Learning about second language assessment: Insights from a postgraduate student on-line subject forum. University of Sydney Papers in TESOL, 1, 71-85.

O’Loughlin, K. (2013). Developing the assessment literacy of university proficiency test users. Language Testing, 30(3), 363-380. https://doi.org/10.1177/0265532213480336

Pill, J., & Harding, L. (2013). Defining the language assessment literacy gap: Evidence from a parliamentary inquiry. Language Testing, 30(3), 381-402. https://doi.org/10.1177/0265532213480337

Purpura, J. (2004). Assessing grammar. Cambridge University Press. https://doi.org/10.1017/CBO9780511733086

Taylor, L. (2013). Communicating the theory, practice and principles of language testing to test stakeholders: Some reflections. Language Testing, 30(3), 403-412. https://doi.org/10.1177/0265532213480338

Tsagari, D., & Vogt, K. (2017). Assessment literacy of foreign language teachers around Europe: Research, challenges and future prospects. Papers in Language Testing and Assessment, 6(1), 41-63. https://doi.org/10.58379/UHIX9883

Vogt, K., & Tsagari, D. (2014). Assessment literacy of foreign language teachers: Findings of a European study. Language Assessment Quarterly, 11(4), 374-402. https://doi.org/10.1080/15434303.2014.960046

Walters, F. S. (2010). Cultivating assessment literacy: Standards evaluation through language-test specification reverse engineering. Language Assessment Quarterly, 7(4), 317-342. https://doi.org/10.1080/15434303.2010.516042

Xu, H. (2017). Exploring novice EFL teachers’ classroom assessment literacy development: A three-year longitudinal study. The Asia-Pacific Education Researcher, 26, 219-226. https://doi.org/10.1007/s40299-017-0342-5

Xu, Y., & Brown, G. T. L. (2017). University English teacher assessment literacy: A survey-test report from China. Papers in Language Testing and Assessment, 6(1), 133-158. https://doi.org/10.58379/UZON5145

Yan, X., & Fan, J. (2021). “Am I qualified to be a language tester?”: Understanding the development of language assessment literacy across three stakeholder groups. Language Testing, 38(2), 219-246. https://doi.org/10.1177/0265532220929924

Yan, X., Zhang, C., & Fan, J. J. (2018). “Assessment knowledge is important, but ...”: How contextual and experiential factors mediate assessment practice and training needs of language teachers. System, 74, 158-168. https://doi.org/10.1016/j.system.2018.03.00310.1016/j.system.2018.03.003Search in Google Scholar

Appendix: Extract of the Item Measurement Report (Illustrative Items Listed in Descending Order of Difficulty)

Item ID | Item | Measure (Logits) | Model Standard Error | Infit Mean Square
136 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (logical connection – item 4) | 5.30 | 1.02 | 1.02
135 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (cohesive meaning – item 4) | 3.82 | 0.53 | 1.07
82 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (referencing – item 1) | 3.72 | 0.75 | 1.02
----------------------------------------
130 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (modality – item 3) | 2.71 | 0.36 | 0.97
184 | Knows that validation usually involves reasoning | 2.53 | 0.56 | 0.85
93 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (cohesive meaning – item 2) | 2.30 | 0.56 | 0.81
149 | Can explain Bachman and Palmer’s (1996) understanding of construct validity | 1.99 | 0.48 | 0.78
185 | Knows that validation usually involves empirical evidence | 1.60 | 0.66 | 0.59
----------------------------------------
144 | Can give three disadvantages of analytic scoring | 0.76 | 0.29 | 0.70
145 | Can list four considerations for rating scale development | 0.63 | 0.15 | 0.79
128 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (sentential level – item 3) | 0.53 | 0.26 | 1.04
91 | Can analyze a given item in terms of Purpura’s (2004) model of grammatical and pragmatic knowledge (grammatical meaning – item 2) | 0.40 | 0.55 | 1.16
----------------------------------------
84 | Can list four microskills of reading | 0.35 | 0.10 | 1.16
54 | Can illustrate four types of factors which can contribute to a test’s (un)reliability | 0.27 | 0.06 | 1.15
197 | Can explain the concept of item moderation | 0.23 | 0.71 | 0.61
67 | Can list four measures that can help teachers to increase the reliability in classroom assessment | 0.21 | 0.07 | 0.88
104 | Can list four measures to transform traditional tests into more pedagogically fulfilling learning experiences | 0.21 | 0.07 | 1.10
188 | Can list four purposes of journals in the classroom | -0.47 | 0.40 | 0.41
178 | Can list four variables that can make listening difficult | -0.54 | 0.50 | 0.80
139 | Can give three examples of open-ended speaking tasks | -1.00 | 0.32 | 1.20
61 | Can give a concrete example of formative assessment | -1.35 | 0.27 | 0.97
88 | Can analyze a given writing prompt in relation to the design (prompt 1) | -1.62 | 0.40 | 0.92
151 | Can explain the concept of washback | -2.06 | 0.74 | 1.03
----------------------------------------
12 | Understands that the Austrian school-leaving exam is not a diagnostic test | -2.32 | 0.42 | 1.01
23 | Understands that the Austrian school-leaving exam is not assessment for learning | -2.51 | 0.46 | 1.02
102 | Can explain the relationship between formal standardized tests and alternative assessment in terms of washback | -3.10 | 0.58 | 0.99
63 | Can explain the concept of discrete-point testing | -3.29 | 1.01 | 0.97
64 | Can explain the concept of integrative testing | -3.29 | 1.01 | 0.97
100 | Can explain the relationship between formal standardized tests and alternative assessment in terms of practicality | -3.51 | 0.71 | 0.98
26 | Understands that the Austrian school-leaving exam is not a form of assessment as learning | -4.15 | 1.00 | 1.01
M (n = 206) | | -0.14 | 0.42 | 0.99
SD | | 1.67 | 0.35 | 0.16

Strata: 4.71; Reliability: 0.92
Fixed (all same) chi-square: 4273.6; df: 205; Significance (probability): 0.00

Note. 30 out of 206 items shown. The dashed lines indicate cut-off points at equal intervals.
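For readers less familiar with Rasch output, the following equations offer a minimal sketch of how the columns and summary statistics above can be read. They assume the standard dichotomous Rasch model and the conventional separation statistics reported by the Facets program (Linacre, 1989, 2022); the study’s actual many-facet specification is not reproduced here, and the numeric check at the end is illustrative only.

```latex
% Dichotomous Rasch model: the probability that person n succeeds on item i depends
% only on the difference between person ability B_n and item difficulty D_i, both in
% logits (the scale of the "Measure" column, where higher values mean harder items).
P(X_{ni} = 1) \;=\; \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}

% Summary statistics at the foot of the table, where G is the item separation index
% (the adjusted SD of the item measures divided by their root-mean-square standard error):
\text{Reliability} \;=\; \frac{G^2}{1 + G^2}, \qquad
\text{Strata} \;=\; \frac{4G + 1}{3}

% Illustrative check with the rounded values reported above: a reliability of .92
% gives G = \sqrt{.92/.08} \approx 3.4 and strata \approx 4.9; the reported 4.71 is
% slightly lower because it is computed from the unrounded reliability estimate.
```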

Published Online: 2023-06-08
Published in Print: 2023-06-27

© 2023 FLTRP, Walter de Gruyter, Cultural and Education Section British Embassy

This work is licensed under the Creative Commons Attribution 4.0 International License.
