June 2011
Karen Price, Boston University, Boston, Massachusetts, USA, kprice@tiac.net

Understanding the process of second language acquisition entails understanding the process of interaction. Observing language in action while taking notes may yield only superficial insights and too little detail for substantive discussion or analysis. Even observers with extensive training may lose one-half to two-thirds of the data in real-time coding (Kieren & Munro, 1985). Therefore, teachers and researchers may choose to capture classroom interactions with audio or video and then transcribe these events for subsequent review, discussion, and analysis of learner language.

There are many different ways to transcribe language. A transcript is simply an approach to the notation of language, a “selective process reflecting theoretical goals” (Ochs, 1979). The researcher’s decisions regarding what to transcribe and how to transcribe it reflect the orientation of the researcher and constrain the analysis and interpretation of the language episode (Lapadat & Lindsay, 1999). For example, broad transcriptions document only sounds and words that are important to meaning. They often denote pauses and use standard spelling. Narrow transcriptions, by contrast, document phonological features, using diacritics to show, for example, how a word is pronounced in different accents of English.


Free software can assist in the manual transcription of recordings by enabling the transcriber to control the speed of the audio/video playback. Two options are Windows Media Player and Audacity, which allow the user to control the speed of playback or to start and stop the audio with assigned keys on the computer keyboard. A third option, Express Scribe, also lets the user control playback with a foot pedal.


To date, attempts to automate the transcription of audio/video recordings through voice recognition have not been particularly successful. Errors in automated transcription are numerous even in high-fidelity recordings with only a single speaker. However, software such as Adobe Premiere with Adobe After Effects, although it produces many errors in the automated transcription, can save time: it allows the user to control the playback easily while correcting the transcript, and it automatically inserts the correct timecode for each edited word or words. Manual synchronization of words and timecode is thus avoided because the edited words continue to correspond to their precise occurrence in the video; the software automatically codes and inserts them as both words and metadata. Thus, searching for a word or string will cue the video to every instance of it in one video or in several videos designated by the user.
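
The idea of searching a word-level, timecoded transcript can be illustrated with a minimal Python sketch. It is not how Adobe Premiere or any other product stores its metadata; the `TimedWord` record, the `find_phrase` function, and the sample transcript are all hypothetical, and the sketch assumes each corrected word carries a single start time in seconds.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """One transcript word with its timecode (hypothetical structure)."""
    text: str     # the word as corrected by the transcriber
    start: float  # seconds from the start of the recording


def find_phrase(words, phrase):
    """Return the start time of every occurrence of `phrase` in `words`."""
    target = [w.lower() for w in phrase.split()]
    if not target:
        return []
    # Compare case-insensitively and ignore trailing punctuation.
    tokens = [w.text.lower().strip(".,?!") for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i].start)
    return hits


# Invented sample data for illustration only.
transcript = [
    TimedWord("I", 0.0), TimedWord("go", 0.4), TimedWord("to", 0.7),
    TimedWord("school", 0.9), TimedWord("yesterday.", 1.5),
    TimedWord("I", 3.2), TimedWord("go", 3.5), TimedWord("home", 3.9),
]

print(find_phrase(transcript, "I go"))  # [0.0, 3.2]
```

Each returned value is a point to which a player could be cued. Searching several recordings would amount to repeating the same lookup over each transcript.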


The following free software applications are widely used by applied linguists to annotate media files and link them to transcripts.

Anvil, originally designed for gesture research, is a video annotation tool that can import data from phonetic tools such as Praat. It can display waveform and pitch contour and offers frame-accurate, multilayered annotation.

CLAN-CA, developed in the context of the CHILDES and TalkBank projects, allows users to link audio and video documents, pictures, and notes to a transcript. The software aids in the transcription, coding, analysis, and sharing of transcripts of conversations linked to either audio or video media.

EXMARaLDA, an acronym of Extensible Markup Language for Discourse Annotation, is a system of data formats and tools for the computer-assisted transcription and annotation of spoken language, and for the construction and analysis of spoken-language corpora.

In addition to facilitating manual transcription, applications such as these provide numerous ways for users to annotate a broad or narrow language transcript, easily synchronizing a researcher’s notes with the words and phrases of the language transcript. For example, even though verbal behavior is situated in a larger, interactional context, nonverbal behavior is often not noted in transcriptions. Research papers are rife with illustrations of transcripts that note only the language spoken, sometimes leading to conflicting conclusions, depending upon the interpretation of the intent of a given utterance. The possibility that one of the speakers was pointing to direct an interlocutor’s attention, that another shrugged her shoulders in response, or that others were rolling their eyes may result in different interpretations of the language interaction. The notation of nonverbal behavior that triggers an utterance may yield information critical to the understanding of the language episode that is not apparent from a transcript of only the spoken language.


Although not specifically designed for language research, CAQDAS (computer-assisted qualitative data analysis software) applications offer extensive suites of integrated tools. Atlas.ti and Transana are two such applications; each provides users with many ways to identify, arrange, and rearrange pertinent clips, assign keywords to clips, and create complex collections of interrelated clips.

Like other applications, Transana helps researchers annotate, segment, and code audio and video data, automatically synchronizing the transcript with the audio/video data and annotations. In addition, researchers in different locations can use Transana to annotate the same video file simultaneously, using the same or different coding schemes, without overwriting the coding scheme or annotations of other researchers. While one researcher may code a segment for speech acts, another may associate the same segment with data imported from Praat, software designed for the analysis of speech. A researcher interested in assessing language complexity might code AS-units, a unit for measuring spoken language, defined as a single speaker’s utterance consisting of an independent clause or subclausal unit, together with any subordinate clause(s) associated with it (Foster, Tonkyn, & Wigglesworth, 2000). A researcher focused on assessing fluency, however, may be more interested in coding various types of hesitation phenomena such as false starts, repetitions, reformulations, and replacements.

Coded segments can appear in a variety of visual displays, such as color-coded bars along the entire timeline, so that users can visualize patterns in the data. Users can click on components of the visual display to play the video or audio segments while displaying the associated transcript or notes from one researcher or multiple researchers. A segment can also be retrieved through a lexical search of the text associated with it (i.e., transcript, annotations, lexical codes).
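
As a rough illustration of how independently coded segments might coexist and be retrieved, the following Python sketch stores each researcher's code as a separate record over a shared time span. It does not reflect Transana's or Atlas.ti's actual data model; the `CodedSegment` record, both functions, the researcher labels, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedSegment:
    """One researcher's code applied to a span of the recording (hypothetical)."""
    start: float      # seconds
    end: float        # seconds
    researcher: str
    code: str
    note: str = ""


def segments_with_code(segments, code):
    """Return all segments carrying exactly this code, in time order."""
    return sorted((s for s in segments if s.code == code), key=lambda s: s.start)


def segments_matching(segments, term):
    """Case-insensitive lexical search over codes and notes."""
    term = term.lower()
    return [s for s in segments
            if term in s.code.lower() or term in s.note.lower()]


# Invented sample data: two researchers code the same span differently,
# and neither record replaces the other.
segments = [
    CodedSegment(12.0, 15.5, "researcher_a", "speech act: request"),
    CodedSegment(12.0, 15.5, "researcher_b", "false start", "restarts after 'I was'"),
    CodedSegment(40.2, 43.0, "researcher_b", "repetition", "repeats 'the the'"),
]

for s in segments_matching(segments, "start"):
    print(s.researcher, s.start, s.end, s.code)
```

Keeping one record per researcher per code is the design choice that lets several coding schemes overlap on the same clip; a timeline display would simply draw one bar per record.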


Just as the selective process of language transcription reflects theoretical goals, so does the process of determining what type of technology to use. Rather than assume that all language interactions should be recorded with the same technology in the same way, researchers should address the question: “What type of recording and what type of transcription might be most useful for my research purposes?”

For example, the author designed a system (Price, 1992) to record real-time data without any transcription of the language. The software simply identified individual speakers, when they spoke, and information about their gender, age, and language background. Through visual displays and automated quantitative analysis, teachers and students could, in real time, consult summaries of wait-time between speakers, directions of communication among speakers, volume of utterances, and amount of talk-time for an individual speaker. Summaries could be displayed by gender, age group, or native-language group. Even without transcription of language, querying this data served as a catalyst for change among the students and teachers. In one class, for example, students were intrigued to see significant differences in wait-time between the Asian and Latin-American students. Although the differences were measured in milliseconds, they prompted discussion of the role of wait-time in turn-taking and participation in conversation.
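
Summaries of this kind can be computed from speaker-turn records alone. The Python sketch below is not the system described in Price (1992); the `Turn` record, the function names, the group labels, and the sample values are hypothetical. It also assumes one particular definition of wait-time: the silence between the end of one speaker's turn and the start of the next speaker's turn, attributed to the group of the speaker who takes the floor.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Turn:
    """One stretch of talk by one speaker (hypothetical structure)."""
    speaker: str
    group: str    # e.g. a native-language group label
    start: float  # seconds
    end: float    # seconds


def talk_time_by_speaker(turns):
    """Total seconds of talk for each speaker."""
    totals = defaultdict(float)
    for t in turns:
        totals[t.speaker] += t.end - t.start
    return dict(totals)


def mean_wait_time_by_group(turns):
    """Mean silence before a change of speaker, keyed by the new speaker's group."""
    ordered = sorted(turns, key=lambda t: t.start)
    gaps = defaultdict(list)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker:
            # Overlapping talk counts as zero wait-time.
            gaps[cur.group].append(max(0.0, cur.start - prev.end))
    return {g: sum(v) / len(v) for g, v in gaps.items()}


# Invented sample data for illustration only.
turns = [
    Turn("S1", "A", 0.0, 2.0),
    Turn("S2", "B", 2.5, 4.0),
    Turn("S1", "A", 5.5, 7.0),
    Turn("S3", "B", 7.25, 8.0),
]

print(talk_time_by_speaker(turns))     # {'S1': 3.5, 'S2': 1.5, 'S3': 0.75}
print(mean_wait_time_by_group(turns))  # {'B': 0.375, 'A': 1.5}
```

Grouping by gender or age group would only require adding those fields to the record and changing the grouping key.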

Explicitly or implicitly, each technology and medium of recording influences what can be transcribed and, to some extent, how the data is transcribed. Rather than assume that one process is always preferable to another, researchers must be attentive to how their choices and variables may frame the conclusions. As Konopásek (2008) argued, these technologies should not be considered “mere tools for coding and retrieving, but also as complex virtual environments for embodied and practice-based knowledge making” (p. 9).


Karen Price has authored more than 20 articles and developed early prototypes of technology now commonly used, such as lexical searching of video. She enjoys conducting workshops and consulting in developing countries as well as for entities that have included Microsoft, Annenberg, USAID, U.S. State Dept., AmidEast, Fulbright, and Kodak. She is currently a visiting scholar at Boston University, where she conducts research and teaches graduate courses on SLA and CALL.

