Welcome!


About the project


The Student-Transcribed Corpus of Spoken American English is a collection of student-made, high-quality speech transcripts and their corresponding audio files. The corpus records speech by native speakers of American English from a number of different settings, such as interviews, conference talks and private vlogs. The data can be accessed for free and the query results can be downloaded as a .csv file via the online search interface.

Give it a try. For example, you could search the corpus for expressions associated predominantly with spoken rather than written English.

The corpus can be used in teaching, for research, or just to play around with for fun.
Enjoy :)




Features


The corpus boasts a number of interesting features. The most important ones are the following:

  1. The corpus comes with audio files (.mp3) for all individual sentence tokens. Hence, the user can immediately listen to every sentence returned by a search query. An example is shown below:

    So , but yeah , English is my first language , only language that I speak.

    Few corpora implement such a feature. (Another example is the French Corpus Oral de Français de Suisse Romande hosted by the University of Neuchâtel, Switzerland.) This makes the corpus particularly well suited for work on spoken language features.

  2. The transcript files come with rich meta-data, keeping track of text name, transcriber, transcription date, audio source and more. They are also coded for social information of every speaker, such as regional, age, social, gender and ethnic variables, as well as for situational variables such as speech situation and genre. For more information, see the corpus documentation.

  3. The transcripts are of very high quality.

    • They follow a comprehensive transcription manual. This ensures consistency across all the transcripts.

    • The guidelines are very strict and regulate major difficulties (e.g., sentence tokenization, fillers and disfluencies), specific conventions (e.g., numerals, interjections, contractions) and even more minor aspects of transcription (e.g., punctuation, capitalization). The transcription conventions are linguistically motivated, which facilitates the retrieval of the speech material.

    • All transcripts were created by humans and were thoroughly corrected. This results in high accuracy and, in particular, higher accuracy than could be achieved with automatic subtitling or speech recognition software.

    • Finally, the transcripts implement a single tier for orthography. All other pieces of information, such as pauses and intonation, corrections, or speaker identification, are relegated to separate annotation schemes. This makes the transcripts easy for human users to read, since few unconventional markers interfere with the text.

    To learn more about the transcription guidelines, see the corpus transcription.

  4. The transcripts have been fully part-of-speech tagged and lemmatized. This was done automatically, so the tagging accuracy is unknown and may be relatively low. However, the annotated files (XML-formatted) can easily be corrected and extended, and could serve as a basis for further annotation, for example for syntactic functions or intonational patterns. To learn more about the current simple tagging, see the corpus annotation.

Size


How large is the corpus? As of today, 2022/12/03, the corpus:

  • includes a total of 152,304 word tokens (including disfluencies and punctuation),
  • transcribed from 12.4 hours of continuous speech,
  • contained in 9,955 syntactic sentence tokens,
  • produced by 60 different speakers,
  • in 95 distinct files or speech situations.
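
To put these figures in perspective, a few averages can be derived from them. The short Python sketch below is only an illustration: the constant and function names are made up for this example, and the numbers are copied from the list above. Note that the token count includes disfluencies and punctuation, so the per-minute figure is not a words-per-minute speech rate.

```python
# Figures copied from the size list above (as of 2022/12/03).
# All names here are illustrative; they are not part of the corpus tools.
TOKENS = 152_304      # word tokens, including disfluencies and punctuation
HOURS = 12.4          # hours of continuous speech
SENTENCES = 9_955     # syntactic sentence tokens
SPEAKERS = 60         # different speakers
FILES = 95            # distinct files or speech situations


def averages():
    """Return simple ratios derived from the published corpus totals."""
    return {
        "tokens_per_sentence": TOKENS / SENTENCES,
        "tokens_per_minute": TOKENS / (HOURS * 60),
        "tokens_per_speaker": TOKENS / SPEAKERS,
        "sentences_per_file": SENTENCES / FILES,
    }


if __name__ == "__main__":
    for name, value in averages().items():
        print(f"{name}: {value:.1f}")
```

These are rough averages over the whole corpus; individual files and speakers may differ considerably.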

Stay updated


To find out what's new, check out the SpokenCorpus.org News page.

Disclaimer


The site may have programming errors. This webpage originated as a student project during an introductory course on corpus linguistics. No grant money, external help, or professional programming was available to facilitate its construction. Therefore, please excuse potential bugs, limited functionality, or other shortcomings.