Annotation


Current State

POS-tags and lemmas


All corpus files have been automatically tagged and lemmatized. Users can search part-of-speech tags with an underscore (e.g., _DT will find all determiners) and lemmas with an at-sign (e.g., @TAKE will find all word forms of the lemma 'take'). For details, see the search interface.
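The two prefixes can be illustrated with a minimal sketch. It is not the search interface's own code: the function name `classify_query_term` and the treatment of unprefixed terms as literal word forms are assumptions made for illustration only.

```python
from typing import Tuple


def classify_query_term(term: str) -> Tuple[str, str]:
    """Return (kind, value) for one search term.

    Assumes only the two prefixes described above: "_" marks a
    part-of-speech tag and "@" marks a lemma. Any other term is
    treated here as a literal word form (an assumption).
    """
    if term.startswith("_") and len(term) > 1:
        return ("pos", term[1:])
    if term.startswith("@") and len(term) > 1:
        # Lemmas are written in all caps in the corpus files.
        return ("lemma", term[1:].upper())
    return ("word", term)


print(classify_query_term("_DT"))    # ('pos', 'DT')
print(classify_query_term("@TAKE"))  # ('lemma', 'TAKE')
print(classify_query_term("take"))   # ('word', 'take')
```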


Warning


Because the annotation was done automatically and the texts are spoken (the transcripts use several orthographic conventions that the tagger was not trained on), the accuracy of the automatic tagging may not be very high. Searches using part-of-speech tags or lemmas may therefore suffer from precision and recall errors. The performance of the automatic annotation has not been evaluated.
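Users who depend on tag-based or lemma-based searches may want to check a sample of hits by hand. The sketch below shows how precision and recall could be computed from such a manual check. The function name and the counts in the example are hypothetical and are not measurements of this corpus.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Compute precision and recall from hand-counted search results.

    true_pos:  hits that really belong to the target category
    false_pos: hits that were returned but do not belong
    false_neg: relevant items the search failed to return
    """
    retrieved = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts from a manually checked sample.
precision, recall = precision_recall(true_pos=80, false_pos=20, false_neg=20)
print(precision, recall)  # 0.8 0.8
```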


Tokenization into sentence units


The corpus is fully tokenized into syntactic units. These sentence tokens are structured very carefully according to explicit guidelines (see the corpus transcription). The resulting units correspond to the notion of 'sentence' in a consistent and principled way and are better motivated linguistically than in many other comparable corpora (for alternatives, see the links section). Proper tokenization of this kind could be a helpful first step towards adding syntactic parsing in the future.


Tagset

The tagset used is the Penn tagset as implemented in Schmid's TreeTagger. The following table lists all tags and which part of speech they stand for.


POS Tag  Description                               Example
CC       coordinating conjunction                  and
CD       cardinal number                           1, third
CDZ      possessive pronoun                        one’s
DT       determiner                                the
EX       existential there                         there is
FW       foreign word                              d’oeuvre
IN       preposition, subordinating conjunction    in, of, like
IN/that  that as subordinator                      that
JJ       adjective                                 green
JJR      adjective, comparative                    greener
JJS      adjective, superlative                    greenest
LS       list marker                               1)
MD       modal                                     could, will
NN       noun, singular or mass                    table
NNS      noun, plural                              tables
NNSZ     possessive noun, plural                   people’s, women’s
NNZ      possessive noun, singular or mass         year’s, world’s
NP       proper noun, singular                     John
NPS      proper noun, plural                       Vikings
NPSZ     possessive proper noun, plural            Boys’, Workers’
NPZ      possessive proper noun, singular          Britain’s, God’s
PDT      predeterminer                             both the boys
PP       personal pronoun                          I, he, it
PPZ      possessive pronoun                        my, his
RB       adverb                                    however, usually, naturally, here, good
RBR      adverb, comparative                       better
RBS      adverb, superlative                       best
RP       particle                                  give up
SENT     sentence-break punctuation                . ! ?
SYM      symbol                                    / [ = *
TO       infinitive ‘to’                           to go
UH       interjection                              uhhuhhuhh
VB       verb be, base form                        be
VBD      verb be, past tense                       was, were
VBG      verb be, gerund/present participle        being
VBN      verb be, past participle                  been
VBP      verb be, present, non-3rd person          am, are
VBZ      verb be, 3rd person sing. present         is
VH       verb have, base form                      have
VHD      verb have, past tense                     had
VHG      verb have, gerund/present participle      having
VHN      verb have, past participle                had
VHP      verb have, sing. present, non-3rd person  have
VHZ      verb have, 3rd person sing. present       has
VV       verb, base form                           take
VVD      verb, past tense                          took
VVG      verb, gerund/present participle           taking
VVN      verb, past participle                     taken
VVP      verb, present, not 3rd person             take
VVZ      verb, 3rd person sing. present            takes
WDT      wh-determiner                             which
WP       wh-pronoun                                who, what
WPZ      possessive wh-pronoun                     whose
WRB      wh-adverb                                 where, when
Z        possessive ending                         ’s
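One point from the table matters when matching tags by pattern: a final Z marks a possessive in tags such as NNZ or PPZ, but in VBZ, VHZ and VVZ it marks the third person singular present. The following sketch, with the hypothetical helper name `describe_tag`, therefore lists the possessive tags explicitly instead of testing for a trailing Z. It also groups the verb tags by their two-letter prefix; the label "lexical" for the VV tags is an assumption, since the table only calls them "verb".

```python
# Possessive tags, copied from the table above.
POSSESSIVE_TAGS = {"CDZ", "NNSZ", "NNZ", "NPSZ", "NPZ", "PPZ", "WPZ", "Z"}

# Verb tags fall into three families by their first two letters.
VERB_FAMILIES = {"VB": "be", "VH": "have", "VV": "lexical"}


def describe_tag(tag: str) -> dict:
    """Report whether a tag is possessive and which verb family it belongs to."""
    return {
        "possessive": tag in POSSESSIVE_TAGS,
        # None for every tag that is not a verb tag.
        "verb_family": VERB_FAMILIES.get(tag[:2]),
    }


print(describe_tag("NNZ"))  # possessive, not a verb
print(describe_tag("VBZ"))  # ends in Z, but is a form of 'be'
```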

Encoding

All corpus files are formatted in XML. The two most important elements are <w>...</w> for 'word' and <token>...</token> for 'sentence token'.
The <w> element is used for all transcribed material, including words, punctuation marks, and disfluencies. Every <w> element has two attributes: pos, whose value is a tag from the Penn tagset, and lemma, which gives the lemma written in all caps.
The <token> elements divide the text into sentence tokens. Each has two attributes: id, which assigns the token a unique number, and time, which holds the time stamp at which the corresponding material can be heard in the audio file.
The image below illustrates the structure of the XML corpus files.


XML example

Excerpt from one of the XML-encoded corpus files.
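The element and attribute names described in this section can be read with Python's standard library, as in the minimal sketch below. The sample string is invented for illustration and is not taken from the corpus: the root element name `text`, the time stamp format, and the nesting of <w> inside <token> are assumptions, as is the helper name `read_tokens`.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the names w, token, pos, lemma, id and time
# come from the description above.
SAMPLE = """\
<text>
  <token id="1" time="00:00:12">
    <w pos="PP" lemma="I">I</w>
    <w pos="VVD" lemma="TAKE">took</w>
    <w pos="PP" lemma="IT">it</w>
    <w pos="SENT" lemma=".">.</w>
  </token>
</text>
"""


def read_tokens(xml_text: str) -> list:
    """Return one dict per sentence token with its id, time and words."""
    root = ET.fromstring(xml_text)
    tokens = []
    for token in root.iter("token"):
        words = [(w.text, w.get("pos"), w.get("lemma")) for w in token.iter("w")]
        tokens.append(
            {"id": token.get("id"), "time": token.get("time"), "words": words}
        )
    return tokens


# List the word forms of the lemma TAKE in each sentence token.
for tok in read_tokens(SAMPLE):
    forms = [form for form, pos, lemma in tok["words"] if lemma == "TAKE"]
    print(tok["id"], tok["time"], forms)
```

If the real files differ from these assumptions, for example by using another root element, only the sample string needs to change, because `iter` searches the whole tree.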