Corpus Linguistics: Annotation

Goals of this lecture
Focus on annotation:
1. what makes a good annotation scheme;
2. what standards exist;
3. what markup languages exist.

Corpora and annotation
- Unannotated corpora:
  - simple plain text
  - the linguistic information is implicit
  - e.g. no explicit representation of "man" as a noun
- Annotated corpora:
  - no longer just text
  - real repositories of linguistic information
  - the relevant linguistic information is now explicit

Types of corpora
- Corpora are often defined according to the kind of annotation they contain:
  - part-of-speech annotation (tagging)
    - annotation of morphosyntactic categories (BNC)
  - parsed corpora (treebanks)
    - annotation of syntactic structure (Penn Treebank, SMULTRON)
  - anaphora
    - annotation of pronominal coreferents in context (GNOME corpus)

How is it done?
- Depends on the type of annotation being carried out.
- Many kinds of annotation are done manually.
- Some kinds of annotation, especially POS tagging, can be done semi-automatically:
  - many POS taggers are available
  - start with a manually tagged sample of text
  - train the tagger on the sample
  - the tagger is then applied to new data and tries to "guess" the POS of new words
  - this is not an error-free process! Current state of the art achieves about 96-97% accuracy.

BNC example
- "Explosives found on Hampstead Heath."
- A sentence tag marks the start of a new sentence; each word is wrapped in a tag giving its part of speech:
  - Explosives -- plural noun
  - found -- past-tense verb
  - on -- preposition
  - Hampstead -- proper noun
  - Heath -- proper noun
  - . -- punctuation

The Penn Treebank parsed corpus
  (S (NP-SBJ-1 Chris)
     (VP wants
         (S (NP-SBJ *-1)
            (VP to
                (VP throw
                    (NP the ball))))))
- Predicate-argument structure: wants(Chris, throw(Chris, ball))
- The empty embedded subject (*-1) is linked to NP subject no. 1.

The GNOME anaphora corpus
- Example text: "Dermovate Cream is a strong and rapidly effective treatment"
- Noun phrases such as "Dermovate Cream" are marked up as units that later pronouns can refer back to.

Part 1: Annotation principles, standards and guidelines

Annotation Principles (Leech 1993)
1. Recoverability:
   - it should be possible to remove the annotation and extract the raw text
2. Extractability:
   - it should be possible to extract the annotation itself and store it separately
3. Transparency of guidelines:
   - the annotation should be based on explicit guidelines which are available to the end user

Annotation Principles (II)
4. Transparency of method:
   - it should be clear who annotated what (often many people are involved in a project)
   - typically, projects also report some statistical measure of inter-annotator agreement
   - the extent to which different annotators agree reflects:
     - how good the guidelines are
     - how theory-neutral the annotation is

Annotation Principles (III)
5. Fallibility:
   - the annotation scheme is not infallible; the user should be made aware of this
   - e.g. the BNC documentation actually reports on errors in the POS tagging
6. Theory-neutrality:
   - as far as possible, the annotation should not be based on narrow theoretical principles
   - e.g. a treebank with syntactic information is usually parsed with a simple, context-free grammar
   - using something more specific, like Chomsky's Principles and Parameters framework, would make it useful to a narrower community

Annotation Principles (IV)
7. Standards:
   - no single annotation scheme has the right to be considered an a priori standard
   - e.g. there are many different formats for annotating part-of-speech information or syntactic structure
   - however, there are published standards which specify a minimum for the format and the amount of information to include

Comments on Leech (1993)
- Rather than standards, these are "desiderata" for annotation schemes.
- They don't really specify the form or content of an annotation scheme.
- However, there have been concerted efforts to define real standards to which corpora should conform.
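Principle 4 above mentions statistical measures of inter-annotator agreement. One commonly reported measure is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration in Python: the two annotators' tag sequences and the tag labels are made up for the example, not taken from any real project.

```python
from collections import Counter

def cohen_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    # Observed agreement: proportion of items given identical labels.
    po = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Chance agreement: computed from each annotator's label distribution.
    dist1, dist2 = Counter(ann1), Counter(ann2)
    pe = sum((dist1[label] / n) * (dist2[label] / n) for label in set(dist1) | set(dist2))
    return (po - pe) / (1 - pe)

# Two (hypothetical) annotators POS-tagging the same six tokens.
annotator_a = ["DET", "NN", "VB", "DET", "NN", "NN"]
annotator_b = ["DET", "NN", "VB", "DET", "JJ", "NN"]
print(round(cohen_kappa(annotator_a, annotator_b), 2))   # prints 0.76
```

The closer kappa is to 1, the more the annotators agree beyond what chance alone would produce; low values usually point back at unclear guidelines or theory-laden categories.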
The concept of a markup language
- A markup language provides a way of specifying meta-data about a document.
- Why "language"?
  - it specifies a basic "vocabulary" of elements;
  - it specifies a syntax for well-formed expressions.

The "SGML" family of markup languages
- SGML (Standard Generalised Markup Language): one of the first truly standardised formalisms.
- Basic idea:
  - create a tag which has some "meaning"
    - e.g. <w> means "word", <p> means "paragraph"
  - wrap portions of a document with start/end tags
    - e.g. <w>chair</w>
    - end tags can often be omitted: <w>chair
  - the "meaning" of the tag must be specified
  - tags can have attributes
    - e.g. <w pos="noun">, where the attribute records extra information about the element
  - tags can be nested inside each other

Descendants of SGML: HTML
- HTML: "Hypertext Markup Language"
  - developed by the World-Wide Web Consortium (W3C)
  - based on the SGML tagging principle
  - defines a basic representation language for document layout
  - used by web browsers: when you visit a page, your browser "interprets" the HTML and renders the layout visually
  - fixed set of tags such as:
    - <p>: paragraph
    - <img>: image
    - etc.
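As a small illustration of a program (rather than a browser) "interpreting" HTML markup, the sketch below uses Python's built-in html.parser module to walk a fragment and list its tags, attributes and text. The fragment itself is invented for the example.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Print each start tag (with its attributes) and each piece of text."""
    def handle_starttag(self, tag, attrs):
        print("tag :", tag, dict(attrs))
    def handle_data(self, data):
        if data.strip():
            print("text:", data.strip())

# A made-up fragment using the fixed HTML tags mentioned above.
TagLister().feed('<p>A short paragraph with an <img src="cat.png"> image.</p>')
```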

Descendants of SGML: XML
- XML: Extensible Markup Language
  - developed by the World-Wide Web Consortium (W3C)
  - nowadays ubiquitous; has largely replaced SGML as the markup language of choice
  - stricter syntax than SGML: end tags can't be omitted
  - less complex than SGML in other ways
  - unlike HTML, specifies only a syntax; the actual tags can be anything, depending on the application

XML documents are trees
- An XML document has a single root element, with further elements nested inside it, e.g.:
    <document>
      <paragraph>
        <sentence>
          <word>...</word>
          <word>...</word>
        </sentence>
        <sentence>...</sentence>
      </paragraph>
      <paragraph>...</paragraph>
      <paragraph>...</paragraph>
    </document>
- [Tree diagram: a DOCUMENT node with PARAGRAPH children, each containing SENTENCE nodes, which in turn contain WORD nodes]

Meta-data in XML
- What properties does a book have?
  - author, ISBN, publisher, number of pages, genre (fiction, ...), etc.
- Example:
    <book type="fiction">
      <author gender="...">John Smith</author>
      <publisher>CUP</publisher>
      <title>Lost in Translation</title>
      ...
    </book>
- This contains "data" such as John Smith, CUP, Lost in Translation, ...
- Tags have attributes (e.g. gender for author, type for book).
- It contains meta-data (data about the data) in the form of tags.
- Easy for a machine to know which pieces of information are about what.
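To illustrate that last point, here is a minimal sketch of how a program can pull specific pieces of information out of such markup. It assumes Python's standard xml.etree.ElementTree module and re-uses the book example above (simplified); it is not tied to any particular corpus format.

```python
import xml.etree.ElementTree as ET

xml_data = """
<book type="fiction">
  <author>John Smith</author>
  <publisher>CUP</publisher>
  <title>Lost in Translation</title>
</book>
"""

book = ET.fromstring(xml_data)      # parse the string into an element tree
print(book.get("type"))             # attribute value: fiction
print(book.find("author").text)     # element content: John Smith
for child in book:                  # the tree structure can be walked directly
    print(child.tag, "->", child.text)
```

Because the tags say what each piece of data is, the program needs no guesswork to find the author or the publisher; with unannotated plain text it would.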

The Text Encoding Initiative (TEI)
- Sponsored by the main academic bodies with an interest in machine-readable textual markup.
- Aims:
  - provide standardised formats for annotation
  - allow interchange of data: if corpus X is annotated according to TEI standards, then it is easy to:
    - develop tools to "read" the annotation
    - make the annotation comprehensible to others
- NB: The TEI does not specify the content, i.e. what the annotation should contain. It does specify how it should be done, i.e. the form.

The "document" according to TEI
- A document (e.g. a corpus text) consists of:
  - a header
    - information about the text such as author, date, source, etc.
  - the text itself
    - including annotation of textual elements, such as paragraphs, words, etc.
    - encoded using tags and entity references
  - a Document Type Declaration (DTD)
    - a formal representation which tells a computer program what elements the text contains, and what they mean

In graphics...
- [Diagram: the document as a HEADER, a TEXT made up of elements, and a DTD]
- Usually, the DTD is a separate document, explaining what each annotated element means.

Example: Structure of a BNC document (fragment)
    <header>
      ... (description of the file) ...
      ... (source of the text, including publisher) ...
    </header>
    <text>
      ... (the actual text + annotation) ...
    </text>
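As a sketch of what a DTD looks like in practice, the fragment below declares the document/paragraph/sentence/word structure used in the earlier tree example and checks a small document against it. It assumes the third-party lxml library; the element names are the illustrative ones from above, not the actual BNC or TEI declarations.

```python
from io import StringIO
from lxml import etree

# A miniature DTD: a document holds paragraphs, which hold sentences of words.
dtd = etree.DTD(StringIO("""
<!ELEMENT document (paragraph+)>
<!ELEMENT paragraph (sentence+)>
<!ELEMENT sentence (word+)>
<!ELEMENT word (#PCDATA)>
"""))

doc = etree.fromstring(
    "<document><paragraph><sentence>"
    "<word>the</word><word>chair</word>"
    "</sentence></paragraph></document>"
)
print(dtd.validate(doc))   # True: the document matches the declared structure
```

A corpus tool can use the DTD in exactly this way: to confirm that a text it has been handed really is structured as the annotation scheme promises.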
Markup language
- The TEI uses SGML.
- Tags in SGML (and TEI):
  - always use angle brackets
  - indicate start and end: <tag> text </tag>
    - the end tag is often omitted if not required
  - used for text elements: paragraph, word, sentence, ...

Markup language (cont'd)
- The TEI also specifies a format for entity references:
  - an entity reference is a kind of abbreviation for some detailed formatting or linguistic information
- Format: enclosed using & and ;
- Examples:
  - &eacute; -> represents the letter e with an acute accent, i.e. é
  - man&nn1; -> represents the information that "man" is a noun in the singular
- Interpretation of entity references:
  - each entity reference used in the text is defined in detail in the document header

Example: tags and references in a BNC document (fragment)
    <s n=...>          sentence element with a number
    <w ...>there       word elements, each with a part-of-speech code
    <w ...>are
    <w ...>between
    <w ...>40–60,000   word element containing an entity reference (the dash)
    <w ...>people
    ...

Beyond format: Content guidelines
- EAGLES:
  - "Expert Advisory Groups on Language Engineering Standards"
  - EU-sponsored teams of experts who drew up guidelines on many aspects of language engineering, including corpus annotation.
- Aims:
  - "best-practice" recommendations on what to annotate, at all levels (textual, part-of-speech, etc.)
  - cover a wide variety of languages
  - guidelines on corpora are TEI-conformant.
- Main document: the Corpus Encoding Standard (CES). Assumes SGML as the markup language.
- Later development: XCES, the CES using XML as the markup language.

Part 2: Levels of corpus annotation

Textual/Extra-textual level
- Information about the text, its origins, etc.
  - cf. the earlier example of the BNC header
  - cf. McEnery & Wilson's examples from other corpora
- Extra-textual information can be very detailed, e.g. it can include the gender of the author.
- Textual information can include things like questions, abbreviations and their expansions, etc.

Orthographic level
- Problems with different alphabets, accents, etc.
  - Maltese: ħ, ġ, ż, ċ; German: umlauts; Russian: the Cyrillic alphabet
- TEI recommends the use of entity references:
  - characters such as ù and ġ are written as entity references (e.g. &ugrave; for ù)
  - it also recommends sticking to the basic ("English") ISO 646 character set
- More recently, the Unicode standard provides a single, unified representation of all characters in (hopefully) all alphabets and writing systems as they are, without needing any special graphics capabilities:
  - every character is mapped to a unique numeric code
  - all codes are readable by a computer
- TEI also recommends representing changes of typography (boldface, italics, ...) using start/end tags.
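As a small illustration of the last two points, the sketch below prints the Unicode code point behind each character and decodes an HTML/SGML-style entity reference, using only Python's standard library. The characters are simply the Maltese examples from the slide above.

```python
import html
import unicodedata

# Every character is mapped to a unique numeric code point in Unicode.
for ch in "ħġżċ":
    print(ch, hex(ord(ch)), unicodedata.name(ch))

# Entity references abbreviate such characters in SGML/HTML-style markup.
print(html.unescape("caf&eacute;"))   # -> café
```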
The challenges of spoken data
- Speech does not contain "sentences" but "utterances".
- Transcription of spoken data entails decisions about:
  - whether to assume sentence-based transcription or intonation units
  - what to do about pauses, false starts, coughing, ...
  - what to do about interruptions and overlapping speech
  - whether to add punctuation
- Example: the London-Lund corpus uses intonation units for speech, with no punctuation.

Spoken data in the BNC
- The transcribed words ("You got ta Radio Two with that .") are wrapped in markup recording who is speaking and what happens during speech:
  - an utterance tag with a speaker-ID attribute
  - a sentence tag within the utterance
  - a tag for non-verbal actions during speech
  - pause tags marked with their duration
  - a tag for unclear, non-transcribed speech
- Many other tags mark non-linguistic phenomena.

Levels of linguistic annotation
- part-of-speech (word level)
- lemmatisation (word level)
- parsing (phrase and sentence level)
- semantics (multi-level)
  - semantic relationships between words and phrases
  - semantic features of words
- discourse features (supra-sentence level)
- phonetic transcription
- prosody

Part-of-speech tagging
- Purpose: label every token with information about its part of speech.
- Requirements: a tagset which lists all the relevant labels.

Part-of-speech tagsets
- Tagging schemes can be very granular. Maltese examples:
  - VV1SR: verb, main, 1st person, singular, perf.
    - imxejt -- "I walked"
  - VA1SP: verb, auxiliary, 1st person, singular, past
    - kont miexi -- "I was walking"
  - NNSM-PS1S: noun, common, singular, masculine + possessive pronoun, singular, 1st person
    - missier-i -- "my father"

How POS taggers tend to work
1. Start with a manually annotated portion of text (usually several thousand words):
   - the/DET man/NN1 walked/VV
2. Extract a lexicon and some probabilities from it:
   - e.g. the probability that a word is NN given that the previous word is DET
   - used for tagging new (previously unseen) words
3. Run the tagger on new data.
(A minimal sketch of these steps is given at the end of this section.)

Challenges in POS tagging
- Recall that the process is usually semi-automatic.
- Granularity vs. correctness:
  - the finer the distinctions, the greater the likelihood of error
  - manual correction is extremely time-consuming

EAGLES recommendations on POS tagging
- Set of obligatory features for all languages:
  - noun, verb, interjection, unique, residual, etc.
- Set of recommended features:
  - noun: number, gender, case, type
- Set of optional features:
  - generic: apply to "all" languages (e.g. noun = count or mass)
  - language-specific: e.g. Danish has a suffixed definite article, so it has a "definiteness" feature for nouns
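To make the workflow in "How POS taggers tend to work" concrete, here is a minimal sketch in plain Python of steps 1-3: it builds a unigram lexicon from a tiny hand-tagged sample and uses it to guess tags for new tokens. The sample sentences and tag labels are illustrative only; a real tagger is trained on thousands of words and also uses contextual probabilities such as P(NN | previous tag = DET).

```python
from collections import Counter, defaultdict

# Step 1: a manually tagged sample (toy-sized here; real samples run to thousands of words).
tagged_sample = [
    ("the", "DET"), ("man", "NN1"), ("walked", "VV"),
    ("the", "DET"), ("dog", "NN1"), ("barked", "VV"),
    ("a", "DET"), ("man", "NN1"), ("saw", "VV"), ("the", "DET"), ("dog", "NN1"),
]

# Step 2: extract a lexicon: for each word, count how often it received each tag.
lexicon = defaultdict(Counter)
for word, tag_label in tagged_sample:
    lexicon[word][tag_label] += 1
fallback_tag = Counter(t for _, t in tagged_sample).most_common(1)[0][0]

# Step 3: run the "tagger" on new data, guessing the most frequent tag for known
# words and falling back to the overall most common tag for unseen ones.
def tag(tokens):
    return [(w, lexicon[w].most_common(1)[0][0] if w in lexicon else fallback_tag)
            for w in tokens]

print(tag(["the", "dog", "walked"]))   # [('the', 'DET'), ('dog', 'NN1'), ('walked', 'VV')]
print(tag(["cats", "walked"]))         # the unseen word "cats" gets the fallback tag
```

The fallback guess for unseen words is exactly where errors creep in, which is why the process is semi-automatic and why reported accuracies stay below 100%.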