Annotation
Current State
POS-tags and lemmas
All corpus files have been automatically tagged and lemmatized. Users can search part-of-speech tags with an underscore (e.g., _DT will find all determiners) and lemmas with an at-sign (e.g., @TAKE will find all word forms of the lemma 'take'). For details, see the search interface.
Warning
Since the annotation was done automatically and since the texts are spoken - the transcripts involving several orhtographic conventions that the tragger was not trained on - the accuracy of the automatic tagging may not be very high. Searches using part-of-speech tags or lemmas may results in a precision and recall errors. The performance of the automatic annotation has not been evaluated.
Tokenization into sentence units
The corpus is fully tokenized into syntactic units. These sentence tokens are structured very carefully according to explicit guidelines (see the corpus transcription). The resulting units correspond to the notion of 'sentence' in a consistent and principled way and are better motivated linguistically than in many other comparable corpora (for alternatives, see the links section). Proper tokenization of this kind could be a helpful first step towards adding syntactic parsing in the future.
Tagset
The tagset used is the Penn tagset as implemented in Schmid's TreeTagger. The following table lists all tags and which part of speech they stand for.
POS Tag | Description | Example |
CC | coordinating conjunction | and |
CD | cardinal number | 1, third |
CDZ | possesive pronoun | one’s |
DT | determiner | the |
EX | existential there | there is |
FW | foreign word | d’hoevre |
IN | preposition, subordinating conjunction | in, of, like |
IN/that | that as subordinator | that |
JJ | adjective | green |
JJR | adjective, comparative | greener |
JJS | adjective, superlative | greenest |
LS | list marker | 1) |
MD | modal | could, will |
NN | noun, singular or mass | table |
NNS | noun plural | tables |
NNSZ | possessive noun plural | people’s, women’s |
NNZ | possessive noun, singular or mass | year’s, world’s |
NP | proper noun, singular | John |
NPS | proper noun, plural | Vikings |
NPSZ | possesive proper noun, plural | Boys’, Workers’ |
NPZ | possesive noun, singular | Britain’s, God’s |
PDT | predeterminer | both the boys |
PP | personal pronoun | I, he, it |
PPZ | possessive pronoun | my, his |
RB | adverb | however, usually, naturally, here, good |
RBR | adverb, comparative | better |
RBS | adverb, superlative | best |
RP | particle | give up |
SENT | Sentence-break punctuation | . ! ? |
SYM | Symbol | / [ = * |
TO | infinitive ‘to’ | togo |
UH | interjection | uhhuhhuhh |
VB | verb be, base form | be |
VBD | verb be, past tense | was, were |
VBG | verb be, gerund/present participle | being |
VBN | verb be, past participle | been |
VBP | verb be, present, non-3d person | am, are |
VBZ | verb be, 3rd person sing. present | is |
VH | verb have, base form | have |
VHD | verb have, past tense | had |
VHG | verb have, gerund/present participle | having |
VHN | verb have, past participle | had |
VHP | verb have, sing. present, non-3d | have |
VHZ | verb have, 3rd person sing. present | has |
VV | verb, base form | take |
VVD | verb, past tense | took |
VVG | verb, gerund/present participle | taking |
VVN | verb, past participle | taken |
VVP | verb, present, not 3rd person | take |
VVZ | verb, 3rd person sing. present | takes |
WDT | wh-determiner | which |
WP | wh-pronoun | who, what |
WPZ | possessive wh-pronoun | whose |
WRB | wh-abverb | where, when |
Z | possessive ending | ‘s |
Encoding
All corpus files have been formatted in XML. The two most important labelled spans are <w>...</w> for 'word' and <token>...</token> for 'sentence token.'
The word label is used for all transcribed material, including words, punctuations signs as well as disfluencies. All word tags contain two attributes, pos, valued with a tag from the Penn tagset, and lemma, including the lemma written in all caps.
The token labels tokenize the text into sentence tokens. They contain two attributes, id, assigning to each token a unique number, and time, containing the time stamp where the corresponding material can be heard in the audio file.
The image below illustrates the structure of the XML corpus files.

Excerpt from one of the XML-encoded corpus files.