November 2016

Paula Winke & Susan Gass, Michigan State University, East Lansing, Michigan, USA

Idea units have been adopted by second language (L2) researchers to measure reading, writing, and listening comprehension, including L2 video comprehension. While the term idea units has been defined in several ways—for example, as individual, simple sentences; basic semantic propositions; or phrases (Kroll, 1977)—there is no general consensus on the methods to be undertaken in operationalizing idea units for scoring. In this article, we (a) overview select empirical studies that employed idea-unit scoring, (b) discuss methodological issues, and (c) suggest guidelines for researchers and teachers who use them to measure comprehension.

Overview of Empirical Studies

In an early paper on L2 summary recall, Johns and Mayes (1990) had 80 ESL students read an English text on pollution (approximately 600 words) and write a 100-word summary. The students kept the original text while crafting their summary. The authors segmented the original text into 77 possible idea units as defined by Kroll (1977). The students’ texts were classified into (a) correct replications or (b) distortions. This study is often cited in the methods sections of academic articles when authors are describing their recall protocol scoring, but the original empirical work by Johns and Mayes (1990) lacks information on how they scored the summaries. Because the students were able to keep the original text while they summarized, the task may not have been a true measure of reading comprehension (cf. discussion by Riley & Lee, 1996).

Two decades later, Ableeva and Lantolf (2011) published a paper on whether dynamic assessment promotes French L2 listening comprehension. Seven language learners listened to six video recordings of speakers of French talking about food and restaurants; three recordings were listened to before and three after different assessment types. The researchers measured the effects of the assessment type by gain scores (change) from pretesting to posttesting on idea units recalled. They used pausal unit analysis, which, they reported, is counting idea units. They followed the scoring guidelines by Riley and Lee (1996).

Based on Riley and Lee’s (1996) work, Ableeva and Lantolf (2011) first segmented the transcripts of the videos into “syntactically related units” (p. 140). Unlike Johns and Mayes (1990), Ableeva and Lantolf (2011) had three independent researchers do the segmenting, and then the researchers compared their work. They discussed differences and achieved a consensus on the segmentation. A second group of researchers weighted the segments (or idea units) into main ideas, supporting ideas, and details. Researchers then analyzed the oral recalls of the learners to derive the total number of idea units accurately produced and marked whether the idea units recalled were main ideas, supporting ideas, or details. Paraphrases were counted, but distortions (untrue ideas, facts, or details) were not. Logical inferences were considered distortions and were not counted. Ableeva and Lantolf (2011) used only the main-idea scores in their paper.

Methodological Issues

We decided to use idea-unit scoring in one of our ongoing research projects on the use of captions for second language learning (Winke, Gass, & Sydorenko, 2013). In mapping out the procedures for the recall protocols and for the idea-unit scoring, we noted that previous authors had not stated whether grammar or spelling mistakes were allowed. Previous research also did not report the directions given to the test-takers, so we drafted these ourselves. We had no guidance on how to compute hierarchical scoring. Should we award more points for main ideas and fewer for supporting ideas and details, or should we give fewer points for commonly identified main ideas and more points for abstruse supporting ideas and rarely recalled details? We also noted that none of the researchers who used idea-unit scoring reported the type of reliability that was run (when a reliability statistic was reported).

We decided to tell the learners before they watched the video (a short, commercially produced video about bears) about the upcoming test. They could take notes while watching, although they could not use the notes when recalling. We decided to have students type their recalls on computers using the target language or their native language, as they wanted. We scored idea units regardless of spelling, grammatical mistakes, or language (we translated non-English responses into English). The directions were as follows:

In the space below, please type (in English or in your native language) everything you understood and recall from the video. (Type out/retell the story from the video.) Please provide as many details as you can. There is no time limit.

We segmented the transcripts of our videos as modeled by Ableeva and Lantolf (2011). We then organized the idea-unit segments onto an easy-to-use, one-page score sheet. When we first began scoring, we noted that the learners sometimes wrote correct things that were not exactly represented in the segments. For example, in the video, there were idiomatic expressions or understatements that conveyed concrete meanings but that could be rephrased more bluntly. These were the logical inferences that Ableeva and Lantolf (2011) did not count, but we decided that we would count them (see, for example, item 15 in Figure 1). We went through an iterative process of amending and verifying the coding sheet by first using it on a subset (about 10%) of the recalls before we started coding. We added correct alternative interpretations to the scoring sheet in italics. We put the main ideas in bold and supplemental ideas and details in roman. If all ideas (main, supplemental, and details) were conveyed, the student received a full point. If fewer than all were conveyed, the student received a half-point. We also had on the scoring sheet a comment area for notes on scoring decisions. These notes were helpful when we needed to negotiate score assignments or amend the scoring sheet again (and then rescore previously scored recalls, as needed). See Figure 1 for an example of a portion of the final scoring sheet (the first 15 of 36 idea units).

Figure 1. Idea-unit scoring sheet sample. Main ideas are in bold and supplemental ideas and details in roman.
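
For readers who keep their score sheets electronically, the full-point and half-point rule described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the component labels, and the sample data are hypothetical, and the zero score for a unit with nothing recalled is our reading of the rule rather than something the score sheet states outright.

```python
def score_idea_unit(required, conveyed):
    """Score one idea unit from the components a rater judged as conveyed.

    required: labels for every component of the unit (main idea,
              supplemental ideas, details) listed on the score sheet.
    conveyed: labels the rater marked as present in the learner's recall.
    """
    required = set(required)
    matched = required & set(conveyed)
    if not matched:
        return 0.0   # assumption: nothing recalled earns no credit
    if matched == required:
        return 1.0   # all ideas conveyed: full point
    return 0.5       # fewer than all conveyed: half-point


def score_recall(score_sheet, rater_marks):
    """Sum unit scores; both arguments map unit number -> component labels."""
    return sum(
        score_idea_unit(components, rater_marks.get(unit, ()))
        for unit, components in score_sheet.items()
    )


# Hypothetical two-unit score sheet and one rater's marks for one recall.
sheet = {1: ["main", "detail_a"], 2: ["main", "supplemental", "detail_a"]}
marks = {1: ["main", "detail_a"], 2: ["main"]}
total = score_recall(sheet, marks)
```

Because spelling, grammar, and response language were ignored in our scoring, the judgment of whether a component was conveyed stays with the human rater; the sketch only applies the point rule to those judgments.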

To calculate interrater reliability, we input the two raters’ scores (A and B) into an Excel spreadsheet (Figure 2). We then calculated a correlation (Pearson Product Moment) coefficient (r = .98), percent agreement (which averaged at 93%, with a 1 and 1 assignment being 100% agreement, a .5 and 1 being 50% agreement, and 0 and 1 being 0% agreement), and Cronbach’s alpha, with all 36 items resulting in an alpha of .94; when using only 34 of the items that provided variance (that is, by eliminating items that no one got or that everyone got), the alpha increased to .96.

Figure 2. Scoring sample sheet, for estimating interrater reliability.
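
The same three estimates can also be computed outside a spreadsheet. The sketch below is a minimal Python illustration, not a record of our Excel formulas: the function names and sample scores are hypothetical, the agreement formula (1 minus the absolute difference between the two scores) is an assumption that reproduces the 100%, 50%, and 0% cases described above, and the alpha function uses the standard formula while leaving the data layout (what counts as a column) to the analyst.

```python
from statistics import mean, variance


def pearson_r(a, b):
    """Pearson product-moment correlation between two lists of scores."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5


def percent_agreement(a, b):
    """Average agreement in percent for scores of 0, .5, or 1.

    Assumption: agreement on one item is 1 - |a - b|, so 1 and 1 gives
    100%, .5 and 1 gives 50%, and 0 and 1 gives 0%.
    """
    return 100 * mean([1 - abs(x - y) for x, y in zip(a, b)])


def cronbach_alpha(columns):
    """Standard Cronbach's alpha; each column holds one component's scores.

    Raises an error if the summed scores have no variance.
    """
    k = len(columns)
    totals = [sum(row) for row in zip(*columns)]
    column_variance = sum(variance(col) for col in columns)
    return k / (k - 1) * (1 - column_variance / variance(totals))


# Hypothetical scores from raters A and B on six idea units.
rater_a = [1, 0.5, 0, 1, 0.5, 1]
rater_b = [1, 1, 0, 1, 0.5, 0.5]
r = pearson_r(rater_a, rater_b)
agreement = percent_agreement(rater_a, rater_b)
alpha = cronbach_alpha([rater_a, rater_b])  # raters as columns: an assumption
```

Items that every learner or no learner recalled contribute no variance, which is why dropping them can change alpha, as it did in our data.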

For research purposes, the key to idea-unit scoring is in the segmenting of the original input (which should be done collaboratively between two or more researchers) and also in creating a scoring rubric or sheet through an iterative process. Researchers decide the parameters of what will be included as correct responses. Researchers can also calculate various reliability estimates and should report them. Multiple estimates are needed because each one is only an estimate, and the true reliability should be seen as anywhere between those derived.


Comprehension is difficult to measure because it is an internal, cognitive process. Thus, comprehension is measured indirectly. A question that follows is whether good comprehension should necessarily entail a good memory of what one comprehended. And is comprehension of little worth if one cannot convey what he or she comprehended through good speaking or writing skills? Or is comprehension more of an online process that should not overlap (in measurement) with memory skills?

We believe idea-unit scoring is a preferable method of measuring listening comprehension, even if the scoring is difficult, because test-takers must rely on their language skills, memory, processing strategies, and background knowledge to convey what they comprehended. Thus, the construct of comprehension conveyed by idea-unit scoring is multicomponential and skill integrated, and it is tied to the learners’ overall knowledge base. It also represents an authentic and communicatively oriented task; recalling and reporting is akin to something one might do in real life. Teachers can embrace idea-unit scoring for classroom-based comprehension assessment because it is authentic and informative. And most important, teachers can learn a lot about their students through recall scoring, which makes it ideal for formative, classroom-based assessment purposes.


References

Ableeva, R., & Lantolf, J. (2011). Mediated dialogue and the microgenesis of second language listening comprehension. Assessment in Education: Principles, Policy & Practice, 18(2), 133–149.

Johns, A. M., & Mayes, P. (1990). An analysis of summary protocols of university ESL students. Applied Linguistics, 11(3), 253–271.

Kroll, B. (1977). Combining ideas in written and spoken English: A look at subordination and coordination. In E. O. Keenan & T. L. Bennett (Eds.), Discourse across time and space. Southern California Occasional Papers in Linguistics, No. 5. Los Angeles, CA: University of Southern California.

Riley, G., & Lee, J. F. (1996). A comparison of recall and summary protocols as measures of second language reading comprehension. Language Testing, 13(2), 173–189.

Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by foreign language learners: An eye-tracking study. The Modern Language Journal, 97(1), 254–275.

Paula Winke is an associate professor of second language studies at Michigan State University, where she teaches in the TESOL MA and Second Language Studies PhD Programs. Her research interests include language assessment and task-based language teaching. She is the 2012 recipient of the TESOL International Association Distinguished Research Award.

Susan Gass is university distinguished professor of second language studies (SLS) at Michigan State University, where she serves as director of the English Language Center and of the SLS Program. She has published widely in the field of second language acquisition, including books on second language acquisition and research methods.
