Annotation

Current State

POS-tags and lemmas

All corpus files have been automatically tagged and lemmatized. Users can search part-of-speech tags with an underscore (e.g., _DT will find all determiners) and lemmas with an at-sign (e.g., @TAKE will find all word forms of the lemma 'take'). For details, see the search interface.
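As a rough illustration of the two search prefixes, the sketch below matches a single search term against word records. It is not the search interface's actual implementation: the `Word` record, the `matches` function, the sample words, and the case-handling rules are all assumptions made for this example.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word record: surface form, POS tag, lemma in all caps."""
    form: str
    pos: str
    lemma: str


def matches(term: str, word: Word) -> bool:
    """Return True if a single search term matches a word record.

    Assumed rules: '_TAG' compares the POS tag, '@LEMMA' compares the
    lemma, and anything else compares the word form case-insensitively.
    """
    if term.startswith("_"):
        return word.pos == term[1:]
    if term.startswith("@"):
        return word.lemma == term[1:].upper()
    return word.form.lower() == term.lower()


# Invented sample data, for illustration only.
words = [
    Word("the", "DT", "THE"),
    Word("took", "VVD", "TAKE"),
    Word("taking", "VVG", "TAKE"),
    Word("a", "DT", "A"),
]

determiners = [w.form for w in words if matches("_DT", w)]
take_forms = [w.form for w in words if matches("@TAKE", w)]
```

Here `determiners` collects the forms tagged DT and `take_forms` collects every form whose lemma is TAKE.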

Warning

Because the annotation was done automatically, and because the texts are spoken (the transcripts follow several orthographic conventions that the tagger was not trained on), the accuracy of the automatic tagging may not be very high. Searches using part-of-speech tags or lemmas may therefore suffer from both precision and recall errors. The performance of the automatic annotation has not been evaluated.

Tokenization into sentence units

The corpus is fully tokenized into syntactic units. These sentence tokens are structured very carefully according to explicit guidelines (see the corpus transcription). The resulting units correspond to the notion of 'sentence' in a consistent and principled way and are better motivated linguistically than in many other comparable corpora (for alternatives, see the links section). Proper tokenization of this kind could be a helpful first step towards adding syntactic parsing in the future.

Tagset

The tagset used is the Penn tagset as implemented in Schmid's TreeTagger. The following table lists all tags and the parts of speech they stand for.

POS Tag	Description	Example
CC	coordinating conjunction	and
CD	cardinal number	1, third
CDZ	possessive pronoun	one’s
DT	determiner	the
EX	existential there	there is
FW	foreign word	d’oeuvre
IN	preposition, subordinating conjunction	in, of, like
IN/that	that as subordinator	that
JJ	adjective	green
JJR	adjective, comparative	greener
JJS	adjective, superlative	greenest
LS	list marker	1)
MD	modal	could, will
NN	noun, singular or mass	table
NNS	noun plural	tables
NNSZ	possessive noun plural	people’s, women’s
NNZ	possessive noun, singular or mass	year’s, world’s
NP	proper noun, singular	John
NPS	proper noun, plural	Vikings
NPSZ	possessive proper noun, plural	Boys’, Workers’
NPZ	possessive proper noun, singular	Britain’s, God’s
PDT	predeterminer	both the boys
PP	personal pronoun	I, he, it
PPZ	possessive pronoun	my, his
RB	adverb	however, usually, naturally, here, good
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
SENT	Sentence-break punctuation	. ! ?
SYM	Symbol	/ [ = *
TO	infinitive ‘to’	to go
UH	interjection	uhhuhhuhh
VB	verb be, base form	be
VBD	verb be, past tense	was, were
VBG	verb be, gerund/present participle	being
VBN	verb be, past participle	been
VBP	verb be, present, non-3rd person	am, are
VBZ	verb be, 3rd person sing. present	is
VH	verb have, base form	have
VHD	verb have, past tense	had
VHG	verb have, gerund/present participle	having
VHN	verb have, past participle	had
VHP	verb have, present, non-3rd person	have
VHZ	verb have, 3rd person sing. present	has
VV	verb, base form	take
VVD	verb, past tense	took
VVG	verb, gerund/present participle	taking
VVN	verb, past participle	taken
VVP	verb, present, not 3rd person	take
VVZ	verb, 3rd person sing. present	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WPZ	possessive wh-pronoun	whose
WRB	wh-adverb	where, when
Z	possessive ending	’s
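The verb tags in the table follow a regular pattern: the second letter separates be (B), have (H), and all other verbs (V), and an optional third letter marks the form. The sketch below decodes a verb tag on that basis. The function name `describe_verb_tag` and the two lookup tables are hypothetical helpers written for this illustration; they are not part of TreeTagger or of the corpus tools.

```python
# Second letter of a verb tag: which verb class it belongs to.
VERB_CLASS = {"B": "be", "H": "have", "V": "lexical"}

# Optional third letter of a verb tag: which form it marks.
VERB_FORM = {
    "": "base form",
    "D": "past tense",
    "G": "gerund/present participle",
    "N": "past participle",
    "P": "present, non-3rd person",
    "Z": "3rd person singular present",
}


def describe_verb_tag(tag):
    """Return (verb class, form) for a verb tag, or None for any other tag."""
    if not (2 <= len(tag) <= 3) or tag[0] != "V" or tag[1] not in VERB_CLASS:
        return None
    form = VERB_FORM.get(tag[2:])
    if form is None:
        return None
    return VERB_CLASS[tag[1]], form


vhz = describe_verb_tag("VHZ")
vv = describe_verb_tag("VV")
nn = describe_verb_tag("NN")
```

A helper of this kind could be used, for instance, to group search results by verb class without listing all eighteen verb tags by hand.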

Encoding

All corpus files have been formatted in XML. The two most important labelled spans are <w>...</w> for 'word' and <token>...</token> for 'sentence token.'
The word label is used for all transcribed material, including words, punctuation marks, and disfluencies. Every word element carries two attributes: pos, whose value is a tag from the Penn tagset, and lemma, which gives the lemma written in all caps.
The token labels divide the text into sentence tokens. Each token element carries two attributes: id, which assigns the token a unique number, and time, which holds the time stamp at which the corresponding material can be heard in the audio file.
The image below illustrates the structure of the XML corpus files.

Excerpt from one of the XML-encoded corpus files.
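As a minimal sketch of how files in this encoding might be read, the example below parses a small invented sample with Python's standard library. Only the element names (`w`, `token`) and attribute names (`pos`, `lemma`, `id`, `time`) come from the description above. The `<text>` root element, the nesting of word elements inside token elements, the time stamp format, and the sample sentence are assumptions and may differ from the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Invented sample: the root element, nesting, and time format are assumed.
SAMPLE = """<text>
  <token id="1" time="00:00:03">
    <w pos="PP" lemma="I">I</w>
    <w pos="VVD" lemma="TAKE">took</w>
    <w pos="PP" lemma="IT">it</w>
    <w pos="SENT" lemma=".">.</w>
  </token>
</text>"""


def read_tokens(xml_text):
    """Return one dict per sentence token with its id, time, and words."""
    root = ET.fromstring(xml_text)
    tokens = []
    for token in root.iter("token"):
        # Each word becomes a (form, pos, lemma) tuple.
        words = [
            (w.text or "", w.get("pos"), w.get("lemma"))
            for w in token.iter("w")
        ]
        tokens.append(
            {"id": token.get("id"), "time": token.get("time"), "words": words}
        )
    return tokens


tokens = read_tokens(SAMPLE)
```

Each entry in `tokens` then holds the sentence token's id and time stamp together with its words, which is enough to filter by tag or lemma in a few lines.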

The Student-Transcribed Corpus
of Spoken American English

www.SpokenCorpus.org
