on.corpora – classes for interpreting annotation

class on.corpora.subcorpus(a_ontonotes, physical_root_dir, cursor=None, prefix=[], suffix=[], lang=None, source=None, genre=None, strict_directory_structure=False, extensions=['parse', 'prop', 'sense', 'parallel', 'coref', 'name', 'speaker'], max_files='', old_id='')

A subcorpus represents an arbitrary collection of documents.

Initializing

The best way to deal with subcorpora is not to initialize them yourself at all. Create an ontonotes object with the config file, then ask it about its subcorpora. See on.ontonotes.

The following may be too much detail for your purposes.

When you __init__ a subcorpus, that’s only telling it which documents to include. It doesn’t actually load any of them, just makes a list. The load_banks() method does the actual file reading.

Which collection of documents a subcorpus represents depends on how you load it. The main way to do this is to use the constructor of on.ontonotes.

Loading the subcorpus directly through its constructor is complex, but provides slightly more flexibility. You first need to determine how closely your current directory structure matches the one that ontonotes ships with. If you left it in the format:

.../data/<lang>/annotations/<genre>/<source>/<section>/<files>

Then all you need to do is initialize on.corpora.subcorpus with:

a_subcorpus = on.corpora.subcorpus(a_ontonotes, data_location)

where data_location is as much of the data as you want to load, perhaps .../data/english/annotations/nw/wsj/03.

If you’re not using the original directory structure, you need to specify lang, genre, and source (ex: 'english', 'nw', and 'wsj') so that ids can be correctly determined.
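For example, a minimal sketch for a flat directory of wsj files (my_data_location is a hypothetical path; adjust it to your layout):

a_subcorpus = on.corpora.subcorpus(a_ontonotes, my_data_location,
                                   lang='english', genre='nw', source='wsj')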

If you want to load some of the data under a directory node but not all, prefix and suffix let you choose to load only some files. All documents have a four digit numeric ID that identifies them given their language, genre, and source. As in, the document .../data/english/annotations/nw/wsj/00/wsj_0012 (which has multiple files (.parse, .name, .sense, ...)) has id 0012. Prefix and suffix are lists of strings that have to match these IDs. If you set prefix to ['0', '11', '313'] then the only documents considered will be those with ids starting with '0', '11' or '313'. Similarly with suffix. So:

prefix = ['00', '01'], suffix = ['1', '2', '3', '4']

means we’ll load (for cnn):

cnn_0001
cnn_0002
...
cnn_0004
cnn_0011
...
cnn_0094
cnn_0101
...
cnn_0194

but no files whose ids do not end in 1, 2, 3, or 4, or whose ids start with anything other than '00' or '01'.
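Putting it together, the call for the example above would look something like this (a sketch; data_location as before):

a_subcorpus = on.corpora.subcorpus(a_ontonotes, data_location,
                                   prefix=['00', '01'],
                                   suffix=['1', '2', '3', '4'])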

Using

A subcorpus that’s been fully initialized always contains a treebank, and generally contains other banks. To access a bank you can use [] syntax. For example, to access the sense bank, you could do:

a_sense_bank = a_subcorpus['sense']

If you iterate over a subcorpus you get the names of all the loaded banks in turn. So you could do something like:

for a_bank_name, a_bank in a_subcorpus.iteritems():
    print 'I found a %s bank and it had %d %s_documents' % (
         a_bank_name, len(a_bank), a_bank_name)

load_banks(config)

Load the individual bank data for the subcorpus to memory

Once a subcorpus is initialized we know what documents it represents (as in cnn_0013), but we've not loaded the actual files (as in cnn_0013.parse, cnn_0013.sense, ...). We often only want to load some of these, so specify which extensions (prop, parse, coref) you want with the corpus.banks config variable.

This code will, for each bank, load the files and then enrich the treebank with appropriate links. For example, enriching the treebank with sense data sets the on.corpora.tree.tree.on_sense attribute of every tree leaf that’s been sense tagged. Once all enrichment has happened, one can go through the trees and be able to access all the annotation.

(Minor exception: some name and coreference data is not currently fully aligned with the tree and is inaccessible in this manner)
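As a minimal sketch, assuming corpus.banks includes at least 'parse' and 'sense', loading and then reading annotation off the trees might look like:

a_subcorpus.load_banks(a_config)
for a_tree_document in a_subcorpus['parse']:
   for a_tree in a_tree_document:
      for a_leaf in a_tree.leaves():
         if a_leaf.on_sense: # set only for sense tagged leaves
            print a_leaf.word, a_leaf.on_sense.sense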

write_to_db(a_cursor, only_these_banks=[])

Write the subcorpus and all files and banks within to the database.

Generally it's better to use on.ontonotes.write_to_db() as that will write the type tables as well. If you instead write individual subcorpora to the database, perhaps for reasons of memory usage, you need to call on.ontonotes.write_type_tables_to_db() after the last time you call write_to_db().

Parameters:
  • a_cursor – The output of on.ontonotes.get_db_cursor()
  • only_these_banks – if set, load only these extensions to the db
copy()

make a duplicate of this subcorpus that represents the same documents

Note: if you had already loaded some banks these are absent in the copy.

backed_by()

Returns either ‘db’ or ‘fs’

We can be pulling our data from the database or the filesystem depending on how we were created. Note that even if we’re reading from the file system, if the db is available we use it for sense inventory and frame lookups.

__getitem__(key)

The standard way to access individual banks is with [] notation.

The keys are extensions. To iterate over multiple banks in parallel, do something like:

for a_tree_doc, a_sense_doc in zip(a_subcorpus['parse'], a_subcorpus['sense']):
   pass

Note that this will not work if some parses do not have sense documents.
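A more defensive sketch pairs documents explicitly using abstract_bank.get_document() (see on.corpora.abstract_bank below); how a missing document is reported (a falsy value or an exception) is an assumption here:

a_sense_bank = a_subcorpus['sense']
for a_tree_document in a_subcorpus['parse']:
   a_sense_document = a_sense_bank.get_document(a_tree_document)
   if not a_sense_document: # assumed falsy when there is no sense document
      continue
   # ... work with the aligned pair ...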

all_banks(standard_extension)

The way to get a list of all banks of a type.

For example, if you have:

cnn_0000.no_traces_parse
cnn_0000.auto_traces_parse
cnn_0000.parse

If you want to iterate over all trees in all treebanks, you could do:

for a_treebank in a_subcorpus.all_banks('parse'):
   for a_tree_document in a_treebank:
      for a_tree in a_tree_document:
         pass
class on.corpora.abstract_bank(a_subcorpus, tag, extension)

A superclass for all bank classes

All banks support the following pseudocode usage:

if load_from_files:
   a_some_bank = some_bank(a_subcorpus, tag[, optional arguments])
else: # load from db
   a_some_bank = some_bank.from_db(a_subcorpus, tag, a_cursor[, opt_args])

# only if a_some_bank is not a treebank
a_some_bank.enrich_treebank(a_treebank[, opt_args])

for a_some_document in a_some_bank:
   # get the corresponding another document
   another_document = another_bank.get_document(a_some_document)

if write_to_files:
   a_some_document.dump_view(None, out_dir)
else: # write to db
   a_some_document.write_to_db(a_cursor)

class on.corpora.document_bank(a_treebank, tag, lang_id, genre, source)
class on.corpora.file(base_dir, file_id, subcorpus_id)

A file. Currently synonymous with document.

class on.corpora.document(a_tree_document, lang_id, genre, source)

The text of a document. In current usage there is only ever one document per file, but there could in theory be more than one.

class on.corpora.sentence(a_tree)

Represents a sentence; a list of tokens. Generally working with on.corpora.tree.tree objects is easier.

class on.corpora.token(a_leaf)

A token. Just a word and a part of speech

tree – Syntactic Parse Annotation

Correspondences:

Database Tables | Python Objects | File Elements
treebank | treebank | All .parse files for an on.corpora.subcorpus
None | tree_document | A .parse file
tree | tree | An S-expression in a .parse file
syntactic_link | syntactic_link | The numbers after '-' and '=' in trees
lemma | lemma | .lemma files (Arabic only)
class on.corpora.tree.treebank(a_subcorpus, tag, cursor=None, extension='parse', file_input_extension=None)

The treebank class represents a collection of tree_document classes and provides methods for manipulating trees. Further, because annotation in other banks was generally done relative to these parse trees, much of the code works relative to the trees. For example, the on.corpora.document data, their on.corpora.sentence data, and their on.corpora.token data are all derived from the trees.

Attributes:

banks

A hash from standard extensions (coref, name, ...) to bank instances

class on.corpora.tree.tree_document(document_id, parse_list, sentence_id_list, headline_flag_list, paragraph_id_list, absolute_file_path, a_treebank, subcorpus_id, a_cursor=None, extension='parse')

Contained by: treebank

Contains: tree (roots)

Attributes:

The following two attributes are set during enrichment with parallel banks. For sentence level annotation, see tree.translations and tree.originals.

translations

A list of tree_document instances in other subcorpora that represent translations of this document

original

A single tree_document instance representing the original document that this one was translated from. It doesn’t make sense to have more than one of these.

Methods:

sentence_tokens_as_lists(make_sgml_safe=False, strip_traces=False)

all the words in this document broken into lists by sentence

So ‘Good morning. My name is John.’ becomes:

[['Good', 'morning', '.'],
 ['My', 'name', 'is', 'John', '.']]

This doesn’t actually return a list, but instead a generator. To get this as a (very large) list of lists, just call it as list(a_tree_document.sentence_tokens_as_lists())

If ‘make_sgml_safe’ is True, on.common.util.make_sgml_safe() is called for each word.

If ‘strip_traces’ is True, trace leaves are not included in the output.
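For example, to print the document one space-separated sentence per line (a small sketch using only the options documented above):

for a_token_list in a_tree_document.sentence_tokens_as_lists(strip_traces=True):
   print ' '.join(a_token_list)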

class on.corpora.tree.tree(tag, word=None, document_tag='gold')

root trees, internal nodes, and leaves are all trees.

Contained by: tree_document if a root tree, tree otherwise

Contains: None if a leaf, tree otherwise

Attributes:

Always available:

parent

The parent of this tree. If None then we are the root

lemma

Applicable only to Arabic leaves. The morphological lemma of the word as a string. In Chinese the word is the lemma, so use word. In English the best you can do is use either the lemma attribute of on_sense or the lemma attribute of proposition.

See Also: lemma_object

lemma_object

Applicable only to Arabic leaves. There is a lot more information in the .lemma file for each leaf than just the lemma string, so if available a lemma instance is here.

word

Applicable only to leaves. The text of the word corresponding to this leaf. To extract all the words for a tree, see get_word_string().

For Arabic, the word of a tree is always the vocalized unicode representation. For other representations, see the get_word() method.

tag

Every node in the tree has a tag, which represents part of speech, phrase type, or function type information. For example, the leaf (NNS cabbages) has a tag of NNS while the subtree (NP (NNS cabbages)) has a tag of NP.

children

A list of the child nodes of this tree. For leaves this will be the empty list.

reference_leaves

A list of trace leaves in this tree that point to this subtree

identity_subtree

The subtree in this tree that this trace leaf points to

Available only after enrichment:

The following attributes represent the annotation. They are set during the enrichment process, which happens automatically unless you are invoking things manually at a low level. You must, of course, specify that a bank is to be loaded for its annotations to be available. For example, if the configuration variable corpus.banks is set to "parse sense", then leaves will have on_sense attributes but not proposition attributes.

Each annotation variable specifies its level, the bank that sets it, and the class whose instance it is set to. Leaf level annotation applies only to leaves, tree level annotation only to sentences, and subtree annotation to any subtree in between, including leaves.

Order is not significant in any of the lists.

on_sense

Leaf level, sense bank, on_sense

proposition

Subtree level, proposition bank, proposition

This is attached to the same subtree as the primary predicate node of the proposition.

predicate_node_list

Subtree level, proposition bank, list of predicate_node

argument_node_list

Subtree level, proposition bank, list of argument_node

Subtree level, proposition bank, list of link_node

named_entity

Subtree level, name bank, name_entity

start_named_entity_list

Leaf level, name bank, list of name_entity

Name entities whose initial word is this leaf.

end_named_entity_list

Leaf level, name bank, list of name_entity

Name entities whose final word is this leaf.

coreference_link

Subtree level, coreference bank, coreference_link

coreference_chain

Subtree level, coreference bank, coreference_chain

The coreference chain that coreference_link belongs to.

Leaf level, coreference bank, list of coreference_link

Coreference links whose initial word is this leaf.

Leaf level, coreference bank, list of coreference_link

Coreference links whose final word is this leaf.

coref_section

Tree level, coreference bank, string

The Broadcast Conversation documents, because they are very long, were divided into sections for coreference annotation. We tried to break them up at natural places, those where the show changed topic, to minimize the chance of cross-section coreference. The annotators then did standard coreference annotation on each section separately as if it were its own document. Post annotation, we merged all sections into one .coref file, with each section as a TEXT span. So you can have a pair of references to John Smith in section 1 and another pair of references in section 2, but they form two separate chains. That is, every coreference chain is within only one coreference section.

translations

Tree level, parallel_bank, list of tree

Trees (in other subcorpora) that are translations of this tree

originals

Tree level, parallel_bank, list of tree

Trees (in other subcorpora) that are originals of this tree

speaker_sentence

Tree level, speaker bank, speaker_sentence

Methods:

is_noun()

Is the part of speech of this leaf NN or NNS assuming the Penn Treebank’s tagset?

is_verb()

Is the part of speech of this leaf VB or VBX for some X, assuming the Penn Treebank's tagset?

is_aux(prop=False)

Does this leaf represent an auxiliary verb?

Note: only makes sense for English.

All we do is say that a leaf is auxiliary if:

  • it is a verb
  • the next leaf (skipping adverbs) is also a verb

This does not deal with all cases. For example, in the sentence 'Have you eaten breakfast?', the initial 'have' is an auxiliary verb, but we report that it is not. There should be no false positives, but we don't get all the cases.

If the argument 'prop' is true, then we use a less restrictive definition of auxiliary that corresponds more closely to what is legal for proptaggers to tag. That is, if we have a verb following a verb, and the second verb is under an NP, don't count the first verb as aux.

is_leaf()

does this tree node represent a leaf?

is_trace()

does this tree node represent a trace?

is_root()

check whether the tree is a root

is_punct()

is this leaf punctuation?

is_conj()

is this leaf a conjunction?

is_trace_indexed()

if we’re a trace, do we have a defined reference index?

is_trace_origin()

if we’re a trace, do we have a defined identity index?

get_root()

Return the root of this tree, which may be ourself.

get_subtree(a_id)

return the subtree with the specified id

get_leaf_by_word_index(a_word_index)

given a word index, return the leaf at that index

get_leaf_by_token_index(a_token_index)

given a token index, return the leaf at that index

get_subtree_by_span(start, end)

given start and end of a span, return the highest subtree that represents it

The arguments start and end may either be leaves or token indices.

Returns None if there is no matching subtree.
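For example (a sketch; 0 and 3 are arbitrary token indices):

a_subtree = a_tree.get_subtree_by_span(0, 3)
if a_subtree is not None:
   print a_subtree.get_word_string()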

pretty_print(offset='', buckwalter=False, vocalized=True)

return a string representing this tree in a human readable format

get_word(buckwalter=False, vocalized=True, clean_speaker_names=True, interpret_html_escapes=True, indexed_traces=True)
get_word_string(buckwalter=False, vocalized=True)

return the words for this tree, separated by spaces.

get_trace_adjusted_word_string(buckwalter=False, vocalized=True)

The same as get_word_string() but without including traces

get_plain_sentence()

display this sentence with as close to normal typographical conventions as we can.

Note that the return value of this function does not follow ontonotes tokenization.

pointer(indexing='token')

(document_id, tree_index, sentence_index)

Return a triple in the format of the pointers in a sense or prop file.

fix_trace_index_locations()

Reconcile the two forms of trace index notation; set up syntactic link pointers

All trees distributed with ontonotes have been through this process.

There are two forms used in the annotation, one by Ann Taylor, the other by the LDC.

terminology note: we’re using ‘word’ and ‘tag’ as in (tag word) and (tag (tag word))

In the original Penn Treebank system, a trace is a link between exactly two tree nodes, with the target index on the tag of the parent and the reference index on the word of the trace. If there is more than one node in a trace, they're chained, with something like:

(NP-1 (NP target)) ...
   (NP-2 (-NONE- *-1)) ...
     (NP (-NONE- *-2)) ...

In the LDC system a trace index can apply to arbitrarily many nodes and all indices are on the tag of the parent. So this same situation would be notated as:

(NP-1 (NP target)) ...
  (NP-1 (-NONE- *)) ...
    (NP-1 (-NONE- *)) ...

We're leaving everything in the original Penn Treebank format alone, but changing the LDC format to a hybrid mode where there can be multiple nodes in a trace chain, but the reference indices are on the words:

(NP-1 (NP target)) ...
  (NP (-NONE- *-1)) ...
    (NP (-NONE- *-1)) ...

There are a few tricky details:

  • We have to be able to tell by looking at a tree which format it’s in

    • if there are ever words with trace indices, this means we're using the original Penn Treebank format
  • We might not be able to tell what the target is

    • The ideal case has one or more -NONE- tags on traces and exactly one without -NONE-
    • If there is more than one without -NONE-, we need to pick one to be the target. Choose the leftmost.
    • If there is none without -NONE-, we also need to pick one to be the target. Again choose the leftmost.

We also need to deal with gapping. This is for sentences like:

'Mary likes Bach and Susan, Beethoven'

These are notated as:

(S (S (NP-SBJ=1 Mary)
      (VP likes
         (NP=2 Bach)))
    and
   (S (NP-SBJ=1 Susan)
       ,
       (NP=2 Beethoven)))

in the LDC version, and as:

(S (S (NP-SBJ-1 Mary)
      (VP likes
         (NP-2 Bach)))
    and
   (S (NP-SBJ=1 Susan)
       ,
       (NP=2 Beethoven)))

in the original Penn Treebank version.

We convert them all to the original Penn Treebank version, with the target index attached with a single hyphen and the reference indices attached with equals signs.

There can also be trees with both gapping and normal traces, as in:

(NP fears
  (SBAR that
    (S (S (NP-SBJ=2 the Thatcher government)
          (VP may (VP be (PP-PRD=3 in (NP turmoil)))))
        and
       (S (NP-SBJ-1=2 (NP Britain 's) Labor Party)
          (VP=3 positioned
            (S (NP-SBJ-1 *)
               (VP to (VP regain (NP (NP control)
                                 (PP of (NP the government)))))))))))

So we need to deal with things like 'NP-SBJ-1=2' properly.

Also set up leaf.reference_leaves and leaf.identity_subtree
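Once this has run, trace links can be followed directly; a small sketch using the reference_leaves and identity_subtree attributes described above (assuming identity_subtree is unset/falsy for unindexed traces):

for a_leaf in a_tree.leaves():
   if a_leaf.is_trace() and a_leaf.identity_subtree:
      print a_leaf.word, '->', a_leaf.identity_subtree.get_word_string()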

get_sentence_index()

the index of the sentence (zero-indexed) that this tree represents

get_word_index(sloppy=False)

the index of the word in the sentence not counting traces

sloppy is one of:
  • False: be strict
  • 'next': if self is a trace, take the next non-trace leaf
  • 'prev': if self is a trace, take the previous non-trace leaf

get_token_index()

the index of the word in the sentence including traces

get_height()

how many times removed this node is from its initial leaf.

Examples:

  • height of (NNS cabbages) is 0
  • height of (NP (NNS cabbages)) is 1
  • height of (PP (IN of) (NP (NNS cabbages))) is also 1 because (IN of) is its initial leaf

Used by propositions

leaves(regen_cache=False)

generate the leaves under this subtree

subtrees(regen_cache=False)

generate the subtrees under this subtree, including this one

order is always top to bottom; if A contains B then index(A) < index(B)

__getitem__(x)

get a leaf, list of leaves, or subtree of this tree

The semantics of this when used with a slice are tricky. For many purposes you would do better to use a more specific accessor instead, such as leaves(), get_leaf_by_token_index(), or get_subtree_by_span().

This function is nice, though, especially for interactive use. The logic is:

  • if the argument is a single index, return that leaf, or index error if it does not exist
  • otherwise think of the tree as a list of leaves. The argument is then interpreted just as list.__getitem__ does, with full slice support.
  • if such interpretation leads to a list of leaves that is a proper subtree of this one, return that subtree
    • note that if a subtree has a single child, two such subtrees can match. If more than one matches, we take the highest one.
  • if all the slice can be interpreted to represent is an arbitrary list of leaves, return that list.

For example, consider the following tree:

(TOP (S (PP-MNR (IN Like)
        (NP (JJ many)
            (NNP Heartland)
            (NNS states)))
        (, ,)
        (NP-SBJ (NNP Iowa))
        (VP (VBZ has)
            (VP (VBN had)
                (NP (NP (NN trouble))
                    (S-NOM (NP-SBJ (-NONE- *PRO*))
                           (VP (VBG keeping)
                               (NP (JJ young)
                                   (NNS people))
                               (ADVP-LOC (ADVP (RB down)
                                               (PP (IN on)
                                                   (NP (DT the)
                                                       (NN farm))))
                                         (CC or)
                                         (ADVP (RB anywhere)
                                               (PP (IN within)
                                                   (NP (NN state)
                                                       (NNS lines))))))))))
        (. .)))

The simplest thing we can do is look at individual leaves, such as tree[2]:

(NNP Heartland)

Note that leaves act as subtrees, so if we index into a leaf, as with tree[2][0], we get it back again:

(NNP Heartland)

If we look at a valid subtree, like with tree[0:4], we see:

(PP-MNR (IN Like)
        (NP (JJ many)
            (NNP Heartland)
            (NNS states)))

If our indexes do not fall exactly on subtree bounds, we instead get a list of leaves:

[(IN Like),
 (JJ many),
 (NNP Heartland),
 (NNS states),
 (, ,)]

Extended slices are supported, though they’re probably not very useful. For example, we can make a list of the even leaves of the tree in reverse order with tree[::-2]:

[(. .),
 (NN state),
 (RB anywhere),
 (NN farm),
 (IN on),
 (NNS people),
 (VBG keeping),
 (NN trouble),
 (VBZ has),
 (, ,),
 (NNP Heartland),
 (IN Like)]
get_other_leaf(index)

Get leaves relative to this one. An index of zero is this leaf, negative one would be the previous leaf, etc. If the leaf does not exist, we return None

class on.corpora.tree.lemma(input_string, b_transliteration, comment, index, offset, unvocalized_string, vocalized_string, vocalized_input, pos, gloss, lemma, coarse_sense, leaf_id)

Arabic trees have extra lemma information

Links between tree nodes

Example:

(TOP (SBARQ (WHNP-1 (WHADJP (WRB How)
                            (JJ many))
                    (NNS ups)
                    (CC and)
                    (NNS downs))
            (SQ (MD can)
                (NP-SBJ (CD one)
                        (NN woman))
                (VP (VB have)
                    (NP (-NONE- *T*-1))))
            (. /?)))

The node (-NONE- *T*-1) is a syntactic link back to (WHNP-1 (WHADJP (WRB How) (JJ many)) (NNS ups) (CC and) (NNS downs)).

Links have an identity subtree (How many ups and downs) and a reference subtree (-NONE- *T*-1) and are generally thought of as a link from the reference back to the identity.

class on.corpora.tree.compound_function_tag(a_function_tag_string, subtree)
exception on.corpora.tree.tree_exception

proposition – Proposition Annotation

Correspondences:

Database Tables | Python Objects | File Elements
proposition_bank | proposition_bank | All .prop files in an on.corpora.subcorpus
None | proposition_document | A single .prop file
proposition | proposition | A line in a .prop file, with everything after the ----- an "argument field"
None | predicate_analogue | REL argument fields (should only be one)
None | argument_analogue | ARG argument fields
None | link_analogue | LINK argument fields
predicate | predicate | Asterisk-separated components of a predicate_analogue. Each part is coreferential.
argument | argument | Asterisk-separated components of an argument_analogue. Each part is coreferential.
proposition_link | link | Asterisk-separated components of a link_analogue. Each part is coreferential.
predicate_node | predicate_node | Comma-separated components of predicates. The parts together make up the predicate.
argument_node | argument_node | Comma-separated components of arguments. The parts together make up the argument.
link_node | link_node | Comma-separated components of links. The parts together make up the link.
None | frame_set | An xml frame file (FF)
pb_sense_type | on.corpora.sense.pb_sense_type | Field six of a prop line and a FF's frameset/predicate/roleset element's id attribute
pb_sense_type_argument_type | argument_composition | For a FF's frameset/predicate element, a mapping between roleset.id and roleset/role.n
tree | on.corpora.tree.tree | The first three fields of a prop line

This may be better seen with an example. The prop line:

bc/cnn/00/cnn_0000@all@cnn@bc@en@on 191 3 gold say-v say.01 ----- 1:1-ARGM-DIS 2:1-ARG0 3:0-rel 4:1*6:1,8:1-ARG1

breaks up as:

Python Object | File Text
proposition | bc/cnn/00/cnn_0000@all@cnn@bc@en@on 191 3 gold say-v say.01 ----- 1:1-ARGM-DIS 2:1-ARG0 3:0-rel 4:1*6:1,8:1-ARG1
on.corpora.tree.tree | bc/cnn/00/cnn_0000@all@cnn@bc@en@on 191 3
predicate_analogue | 3:0-rel
predicate | 3:0
predicate_node | 3:0
argument_analogue | each of 1:1-ARGM-DIS, 2:1-ARG0, and 4:1*6:1,8:1-ARG1
argument | each of 1:1, 2:1, 4:1, and 6:1,8:1
argument_node | each of 1:1, 2:1, 4:1, 6:1, and 8:1

Similarly, the prop line:

bc/cnn/00/cnn_0000@all@cnn@bc@en@on 309 5 gold go-v go.15 ----- 2:0*1:1-ARG1 4:1-ARGM-ADV 5:0,6:1-rel

breaks up as:

Python Object | File Text
proposition | bc/cnn/00/cnn_0000@all@cnn@bc@en@on 309 5 gold go-v go.15 ----- 2:0*1:1-ARG1 4:1-ARGM-ADV 5:0,6:1-rel
on.corpora.tree.tree | bc/cnn/00/cnn_0000@all@cnn@bc@en@on 309 5
predicate_analogue | 5:0,6:1-rel
predicate | 5:0,6:1
predicate_node | each of 5:0 and 6:1
argument_analogue | each of 2:0*1:1-ARG1 and 4:1-ARGM-ADV
argument | each of 2:0, 1:1 and 4:1
argument_node | each of 2:0, 1:1, and 4:1
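In code, the same hierarchy can be walked top-down. A hedged sketch, assuming that iterating a proposition_document yields proposition instances (per the containment notes below) and using the enc_self attribute documented under abstract_proposition_bit:

for a_proposition in a_proposition_document:
   print a_proposition.lemma, a_proposition.pb_sense_num
   print a_proposition.predicate.enc_self # the REL field, e.g. '3:0-rel'
   for an_argument_analogue in a_proposition.argument_analogues:
      print an_argument_analogue.enc_self # e.g. '4:1*6:1,8:1-ARG1'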

Classes:

class on.corpora.proposition.proposition_bank(a_subcorpus, tag, a_cursor=None, extension='prop', a_frame_set_hash=None)

Extends: on.corpora.abstract_bank

Contains: proposition_document

class on.corpora.proposition.proposition_document(document_id, extension='prop')

Contained by: proposition_bank

Contains: proposition

class on.corpora.proposition.proposition(encoded_prop, subcorpus_id, document_id, a_proposition_bank=None, tag=None)

a proposition annotation; a line in a .prop file

Contained by: proposition_document

Contains: predicate_analogue , argument_analogue , and link_analogue (in that order)

Attributes:

lemma

Which frame_set this leaf was annotated against

pb_sense_num

Which sense in the frame_set the arguments are relative to

predicate

A predicate_analogue instance

quality

  gold: double annotated, adjudicated, release format

type

  v: standard proposition
  n: nominalization proposition
argument_analogues

A list of argument_analogue

A list of link_analogue

document_id
enc_prop

This proposition and all it contains, encoded as a string. Lines in the .prop files are in this format.

Methods:

write_to_db(cursor)

write this proposition and all its components to the database

__getitem__(idx)

return first the predicate_analogue, then the argument analogues, then the link analogues

class on.corpora.proposition.abstract_proposition_bit(a_parent)

any subcomponent of a proposition after the '-----'.

Attributes:

id
index_in_parent
lemma
pb_sense_num
proposition
document_id
enc_self

Encode whatever we represent as a string, generally by combining the encoded representations of sub-components

Class Hierarchy:

class on.corpora.proposition.abstract_holder(sep, a_parent)

represents any proposition bit that holds other proposition bits

Extends abstract_proposition_bit

class on.corpora.proposition.abstract_analogue(a_parent, a_analogue_type)

represents argument_analogue, predicate_analogue, link_analogue

Example: 0:1,3:2*2:0-ARGM

Extends: abstract_holder

Represents:

This class is used for the space separated portions of a proposition after the '-----'.

All children are coreferential, and usually all but one are traces.

class on.corpora.proposition.abstract_node_holder(a_parent)

represents argument, predicate, link

Example: 0:1,3:2

Extends: abstract_holder

Represents:

This class is used for any bit of a proposition which has representation A,B where A and B are nodes

class on.corpora.proposition.abstract_node(sentence_index, token_index, height, parent)

represents argument_node, predicate_node, and link_node

Example: 0:1

Extends: abstract_proposition_bit

Represents:

This class is used for any bit of a proposition which has representation A:B

Attributes:

sentence_index

which tree we’re in

token_index

which leaf in the tree we are

height

how far up from the leaves we are (a leaf is height 0)

parent

an abstract_node_holder to add yourself to

subtree

which on.corpora.tree.tree we’re aligned to. None until enrichment.

is_ich_node

True only for argument nodes. True when proposition taggers would separate this node from others with a ‘;‘ in the encoded form. That is, True if the subtree we are attached to is indexed to an *ICH* leaf or we have an *ICH* leaf among our leaves, False otherwise.

errcomms

This is a list, by default the empty list. If errors are found in loading this proposition, strings that can be passed to on.common.log.reject() or on.common.log.adjust() are appended to it along with comments, like:

errcomms.append(['reason', ['explanation', 'details', ...]])

Initially, a node is created with sentence and token indices. During enrichment we gain a reference to an on.corpora.tree.tree instance. After enrichment, requests for sentence and token indices are forwarded to the subtree.

class on.corpora.proposition.predicate_analogue(enc_predicates, a_type, sentence_index, token_index, a_proposition)

The REL-tagged field of a proposition.

Extends: abstract_analogue

Contained by: proposition

Contains: predicate

class on.corpora.proposition.predicate(enc_predicate, sentence_index, token_index, a_predicate_analogue)

Extends: abstract_node_holder

Contained by: predicate_analogue

Contains: predicate_node

class on.corpora.proposition.predicate_node(sentence_index, token_index, height, a_predicate, primary=False)

represents the different nodes of a multi-word predicate

Extends: abstract_node

Contained by: predicate

Attributes:

a_predicate

on.corpora.proposition.predicate

sentence_index

which tree in the document do we belong to

token_index

token index of this node within the predicate’s tree

height

how far up in the tree from the leaf at token_index we need to go to get the subtree this node represents

primary

are we the primary predicate?

class on.corpora.proposition.argument_analogue(enc_argument_analogue, a_proposition)

Extends: abstract_analogue

Contained by: proposition

Contains: argument

class on.corpora.proposition.argument(enc_argument, a_argument_analogue)

Extends: abstract_node_holder

Contained by: argument_analogue

Contains: argument_node

class on.corpora.proposition.argument_node(sentence_index, token_index, height, a_argument)

Extends: abstract_node

Contained by: argument

class on.corpora.proposition.link_analogue(...)

Extends: abstract_analogue

Contained by: proposition

Contains: link

class on.corpora.proposition.link(...)

Extends: abstract_node_holder

Contained by: link_analogue

Contains: link_node

Attributes:

associated_argument

the argument_analogue this link is providing additional detail for

class on.corpora.proposition.link_node(...)

Extends: abstract_node

Contained by: link

class on.corpora.proposition.frame_set(a_xml_string, a_subcorpus=None, lang_id=None)

information for interpreting a proposition annotation

sense – Word Sense Annotation

Word sense annotation consists of specifying which sense a word is being used in. In the .sense file format, a word sense would be annotated as:

This tells us that word 9 of sentence 6 in broadcast news document cnn_0001 has the lemma “fire”, is a noun, and has sense 4. The sense numbers, such as 4, are defined in the sense inventory files. Looking up sense 4 of fire-n in data/english/metadata/sense-inventories/fire-n.xml, we see:

<sense n="4" type="Event" name="the discharge of a gun" group="1">
  <commentary>
    FIRE[+event][+physical][+discharge][+gun]
    The event of a gun going off.
  </commentary>
  <examples>
    Hold your fire until you see the whites of their eyes.
    He ran straight into enemy fire.
    The marines came under heavy fire when they stormed the hill.
  </examples>
  <mappings><wn version="2.1">2</wn><omega></omega><pb></pb></mappings>
  <SENSE_META clarity=""/>
</sense>

Just knowing that word 9 of sentence 6 in some document has some sense is not very useful on its own. We need to match this data with the document it was annotated against. The python code can do this for you. First, load the data you're interested in to memory with on.corpora.tools.load_to_memory. Then we can iterate over all the leaves to look for cases where a leaf was tagged with a noun sense of "fire":

fire_n_leaves = []
for a_subcorpus in a_ontonotes:
   for a_tree_document in a_subcorpus["tree"]:
      for a_tree in a_tree_document:
         for a_leaf in a_tree.leaves():
            if a_leaf.on_sense: # whether the leaf is sense tagged
               if a_leaf.on_sense.lemma == "fire" and a_leaf.on_sense.pos == "n":
                  fire_n_leaves.append(a_leaf)

Now say we want to print the sentences for each tagged example of “fire-n”:

from collections import defaultdict
from on.corpora.sense import on_sense_type

# first we collect all the sentences for each sense of fire
sense_to_sentences = defaultdict(list)
for a_leaf in fire_n_leaves:
   a_sense = a_leaf.on_sense.sense
   a_sentence = a_leaf.get_root().get_word_string()
   sense_to_sentences[a_sense].append(a_sentence)

# then we print them
for a_sense, sentences in sense_to_sentences.iteritems():
   a_sense_name = on_sense_type.get_name("fire", "n", a_sense)

   print "Sense %s: %s" % (a_sense, a_sense_name)
   for a_sentence in sentences:
      print "  ", a_sentence

   print ""

Correspondences:

Database Tables | Python Objects | File Elements
sense_bank | sense_bank | All .sense files in an on.corpora.subcorpus
None | sense_tagged_document | A single .sense file
on_sense | on_sense | A line in a .sense file
None | sense_inventory | A sense inventory xml file (SI)
on_sense_type | on_sense_type | Fields four and six of a sense line and the inventory/sense element of a SI
on_sense_lemma_type | on_sense_lemma_type | The inventory/ita element of a SI
wn_sense_type | wn_sense_type | The inventory/sense/mappings/wn element of a SI
pb_sense_type | pb_sense_type | The inventory/sense/mappings/pb element of a SI
tree | on.corpora.tree.tree | The first three fields of a sense line

Classes:

class on.corpora.sense.sense_bank(a_subcorpus, tag, a_cursor=None, extension='sense', a_sense_inv_hash=None, a_frame_set_hash=None, indexing='word')

Extends: on.corpora.abstract_bank

Contains: sense_tagged_document

class on.corpora.sense.sense_tagged_document(sense_tagged_document_string, document_id, a_sense_bank, a_cursor=None, preserve_ita=False, indexing='word')

Contained by: sense_bank

Contains: on_sense

class on.corpora.sense.on_sense(document_id, tree_index, word_index, lemma, pos, ann_1_sense, ann_2_sense, adj_sense, sense, adjudicated_flag, a_cursor=None, indexing='word')

A sense annotation; a line in a .sense file.

Contained by: sense_tagged_document

Attributes:

lemma

Together with the pos, a reference to a sense_inventory.

pos

Either n or v. Indicates whether this leaf was annotated by people who primarily tagged nouns or verbs. This should agree with on.corpora.tree.tree.is_noun() and is_verb() methods for English and Arabic, but not Chinese.

sense

Which sense in the sense_inventory the annotators gave this leaf.

class on.corpora.sense.on_sense_type(lemma, pos, group, sense_num, name, sense_type)

Information to interpret on_sense annotations

Contained by: sense_inventory

Attributes:

lemma
sense_num
pos

Either ‘n’ or ‘v’, depending on whether this is a noun sense or a verb sense.

wn_sense_types

list of wn_sense_type instances

pb_sense_types

list of pb_sense_type instances (frame senses)

sense_type

the type of the sense, such as ‘Event’

Methods:

classmethod get_name(a_lemma, a_pos, a_sense)

given a lemma, pos, and sense number, return the name from the sense inventory

class on.corpora.sense.on_sense_lemma_type(a_on_sense)

computes and holds ita statistics for a lemma/pos combination

class on.corpora.sense.sense_inventory(a_fname, a_xml_string, a_lang_id, a_frame_set_hash={})

Contains: on_sense_type

class on.corpora.sense.pb_sense_type(lemma, num)

A frame sense

Contained by: on.corpora.proposition.frame_set, on_sense_type

class on.corpora.sense.wn_sense_type(lemma, wn_sense_num, pos, wn_version)

a wordnet sense, for mapping ontonotes senses to wordnet senses

Contained by: on_sense_type

coreference – Coreferential Entity Annotation

Coreference annotation consists of indicating which mentions in a text refer to the same entity. The .coref file format looks like this:

Correspondences:

Database Tables | Python Objects | File Elements
coreference_bank | coreference_bank | All .coref files in an on.corpora.subcorpus
None | coreference_document | A .coref file (a DOC span)
tree.coreference_section | on.corpora.tree.tree.coref_section | An annotation section of a .coref file (a TEXT span)
tree | on.corpora.tree.tree | A line in a .coref file
coreference_chain | coreference_chain | All COREF spans with a given ID
coreference_chain.type | coreference_chain.type | The TYPE field of a coreference link (the same for all links in a chain)
coreference_chain.speaker | coreference_chain.speaker | The SPEAKER field of a coreference chain (the same for all links in a chain)
coreference_link | coreference_link | A single COREF span
coreference_link.type | coreference_link.type | The SUBTYPE field of a coreference link

Note that coreference section information is stored very differently in the files than in the database and python objects. For more details see the on.corpora.tree.tree.coref_section documentation.
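For example, to group a document's chains by section (a sketch; it assumes iterating a coreference_document yields coreference_chain instances, per the containment notes below):

from collections import defaultdict

section_to_chains = defaultdict(list)
for a_coreference_chain in a_coreference_document:
   section_to_chains[a_coreference_chain.section].append(a_coreference_chain)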

Classes:

class on.corpora.coreference.coreference_bank(a_subcorpus, tag, a_cursor=None, extension='coref', indexing='token', messy_muc_input='false')

Contains: coreference_document

class on.corpora.coreference.coreference_document(enc_doc_string, document_id, extension='coref', indexing='token', a_cursor=None, adjudicated=True, messy_muc_input=False)

Contained by: coreference_bank

Contains: coreference_chain

class on.corpora.coreference.coreference_chain(type, identifier, section, document_id, a_cursor=None, speaker='')

Contained by: coreference_document

Contains: coreference_link

Attributes:

identifier

Which coref chain this is. This value is unique to this document, though not across documents.

type

Whether we represent an APPOS reference or an IDENT one.

section

Which section of the coreference document we belong in. See on.corpora.tree.tree.coref_section for more details.

document_id

The id of the document that we belong to

A list of coreference_link instances. Better to use [] or iteration on the chain than to use this list directly, though.

speaker

A string or the empty string. For coref chains that are coreferent with one of the speakers in the document, this will be set to the speaker’s name. To see which speakers are responsible for which sentences, either use the .speaker file or look at the on.corpora.tree.speaker_sentence attribute of trees. During the coreference annotation process the human annotators had access to the name of the speaker for each line.

Note that the speaker attribute does not represent the person who spoke this sentence.

class on.corpora.coreference.coreference_link(...)

A coreference annotation

Contained by: coreference_chain

Attributes:

string
start_token_index
end_token_index
start_word_index
end_word_index
sentence_index
start_leaf

An on.corpora.tree.tree instance. None until enrichment

end_leaf

An on.corpora.tree.tree instance. None until enrichment

subtree

An on.corpora.tree.tree instance. None until enrichment. After enrichment, if we could not align this span with any node in the tree, it remains None.

subtree_id

After enrichment, evaluates to subtree.id. This value is written to the database, and so is available before enrichment when one is loading from the database.

type

All coreference chains with type IDENT have coreference links with type IDENT. If the coreference chain has type APPOS (appositive) then one coreference link will be the HEAD while the other links will be ATTRIB.

coreference_chain

What on.corpora.coreference.coreference_chain contains this link.

start_char_offset

In the case of a token like 'Japan-China' we want to be able to tag 'Japan' and 'China' separately. We do this by specifying a character offset from the beginning and end to describe how much of the token span we care about. So in this case, to tag only 'China' we would set start_char_offset to 6. To tag only 'Japan' we would set end_char_offset to 6. If these offsets are 0, we use whole tokens.

These correspond to the ‘S_OFF’ and ‘E_OFF’ attributes in the coref files.

For the most complex cases, something like ‘Hong Kong-Zhuhai-Macau’, we specify both the start and the end offsets. The coref structure looks like:

<COREF>Hong <COREF><COREF>Kong-Zhuhai-Macau</COREF></COREF></COREF>

And the offsets are E_OFF=13 for Hong Kong, S_OFF=5 and E_OFF=6 for ‘Zhuhai’, and S_OFF=12 for ‘Macau’
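The offset arithmetic, as I read the description above, amounts to trimming characters from each end of the token span (a tiny sketch):

a_token = 'Japan-China'
print a_token[6 : len(a_token) - 0] # 'China' (S_OFF=6)
print a_token[0 : len(a_token) - 6] # 'Japan' (E_OFF=6)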

end_char_offset

See coreference_link.start_char_offset

Before enrichment, generally either the token indices or the word indices will be set but not both. After enrichment, both sets of indices will work and will delegate their responses to start_leaf or end_leaf as appropriate.

name – Name-Entity Annotation

Correspondences:

Database Tables | Python Objects | File Elements
name_bank | name_bank | All .name files in an on.corpora.subcorpus
None | name_tagged_document | A .name file
tree | on.corpora.tree.tree | A line in a .name file
name_entity | name_entity | A single ENAMEX, TIMEX, or NUMEX span
None | name_entity_set | All name_entity instances for one on.corpora.tree.tree

class on.corpora.name.name_bank(a_subcorpus, tag, a_cursor=None, extension='name', indexing='word')

Contains name_tagged_document

class on.corpora.name.name_tagged_document(document_string, document_id, extension='name', indexing='word', a_cursor=None)

Contained by: name_bank

Contains: name_entity_set

class on.corpora.name.name_entity(sentence_index, document_id, type, start_index, end_index, string, indexing='word', start_char_offset=0, end_char_offset=0)

A name annotation

Contained by: name_entity_set

Attributes:

string
start_token_index
end_token_index
start_word_index
end_word_index
sentence_index
start_leaf

An on.corpora.tree.tree instance. None until enrichment

end_leaf

An on.corpora.tree.tree instance. None until enrichment

subtree

An on.corpora.tree.tree instance. None until enrichment. After enrichment, if we could not align this span with any node in the tree, it remains None.

subtree_id

After enrichment, evaluates to subtree.id. This value is written to the database, and so is available before enrichment when one is loading from the database.

type

The type of this named entity, such as PERSON or NORP.

Before enrichment, generally either the token indices or the word indices will be set but not both. After enrichment, both sets of indices will work and will delegate their responses to start_leaf or end_leaf as appropriate.
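After enrichment, then, name annotation can also be read straight off the leaves; a sketch using the start_named_entity_list attribute documented in the tree section:

for a_leaf in a_tree.leaves():
   for a_name_entity in a_leaf.start_named_entity_list:
      print a_name_entity.type, a_name_entity.string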

class on.corpora.name.name_entity_set(a_document_id)

all the name entities for a single sentence

Contained by: name_tagged_document

Contains: name_entity

ontology – Ontology Annotation

class on.corpora.ontology.ontology(a_id, a_upper_model, a_sense_pool_collection, a_cursor=None)
class on.corpora.ontology.upper_model(a_id, a_um_string, a_cursor=None)
class on.corpora.ontology.sense_pool(a_sense_pool_id, a_sense_pool_string, a_cursor=None)
class on.corpora.ontology.sense_pool_collection(a_id, root_dir, a_cursor=None)
class on.corpora.ontology.concept(a_concept_string, a_cursor=None)
class on.corpora.ontology.feature(a_feature)
exception on.corpora.ontology.no_such_parent_concept_error
exception on.corpora.ontology.no_such_parent_sense_pool_error

speaker – Speaker Metadata for Broadcast Conversation Documents

Speaker metadata is additional information collected at the sentence level about speakers before annotation. The data is stored in .speaker files:

$ head data/english/annotations/bc/cnn/00/cnn_0000.speaker
0.0584225900682 12.399083739    speaker1        male    native
0.0584225900682 12.399083739    speaker1        male    native
0.0584225900682 12.399083739    speaker1        male    native
0.0584225900682 12.399083739    speaker1        male    native
0.0584225900682 12.399083739    speaker1        male    native
12.3271665044   21.6321665044   paula_zahn      female  native
12.3271665044   21.6321665044   paula_zahn      female  native
12.3271665044   21.6321665044   paula_zahn      female  native
12.3271665044   21.6321665044   paula_zahn      female  native
12.3271665044   27.7053583252   paula_zahn      female  native

There is one .speaker line for each tree in the document, so above is the speaker metadata for the first 10 trees in cnn_0000. The columns are start_time, stop_time, name, gender, and competence. These values are available in attributes of speaker_sentence with those names.
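Once the speaker bank has been loaded and the treebank enriched, the same values are reachable from the trees; a sketch using the tree.speaker_sentence attribute from the tree section (assumed unset/falsy for trees without speaker annotation):

for a_tree in a_tree_document:
   if a_tree.speaker_sentence:
      a_ss = a_tree.speaker_sentence
      print a_ss.name, a_ss.gender, a_ss.start_time, a_ss.stop_time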

You might notice that the start and stop times don’t make sense. How can speaker1 say five things where each begins at time 0.05 and ends at time 12.4? When speakers said a group of parsable statements in quick succession, start and stop times were usually only recorded for the group. I’m going to refer to these groups as ‘annotation groups’. An annotation group is roughly analogous to a sentence (by which I mean a single tree); it represents a sequence of words that the annotator doing the transcription grouped together.

Another place this is confusing is with paula_zahn’s final sentence. It has the same start time as her previous four sentences, but a different end time. This is because that tree contains words from two different annotation groups. When this happens, the .speaker line will use the start_time of the initial group and the end_time of the final group. When this happens with the other columns (speakers completing each other’s sentences) we list all values separated by commas, but this is rare. One example would be tree 41 in english bc msnbc document 0006 where George Bush completes one of Andrea Mitchel’s sentences. With ‘CODE’ statements added to make it clear where the breaks between speakers go, the tree looks like:

( (NP (CODE <176.038501264:182.072501264:Andrea_Mitchel:42>)
      (NP (NNS Insights) (CC and) (NN analysis))
      (PP (IN from)
          (NP (NP (NP (NNP Bill) (NNP Bennett))
                  (NP (NP (NN radio) (NN host))
                      (CC and)
                      (NP (NP (NN author))
                          (PP (IN of)
                              (NP-TTL (NP (NNP America))
                                      (NP (DT The) (NNP Last) (NNP Best) (NNP Hope)))))))
              (CODE <182.072501264:185.713501264:Andrea_Mitchel:43>)
              (NP (NP (NNP John) (NNP Harwood))
                  (PP (IN of)
                      (NP (NP (DT The)
                              (NML (NNP Wall) (NNP Street))
                              (NNP Journal))
                          (CC and)
                          (NP (NNP CNBC)))))
              (CODE <185.713501264:188.098501264:Andrea_Mitchel:44>)
              (NP (NP (NNP Dana) (NNP Priest))
                  (PP (IN of)
                      (NP (DT The) (NNP Washington) (NNP Post))))
              (CODE <188.098501264:190.355501264:George_W_Bush:45>)
              (CC And)
              (NP (NP (NNP William) (NNP Safire))
                  (PP (IN of)
                      (NP (DT The)
                          (NML (NNP New) (NNP York))
                          (NNP Times))))))
      (. /.)))

This gives a speaker file that looks like:

$ cat data/english/annotations/bc/msnbc/00/msnbc_0006.speaker
...
160.816917439   163.569917439   Andrea_Mitchel  female  native
163.569917439   173.243917439   George_W_Bush   male    native
173.243917439   176.038501264   Andrea_Mitchel  female  native
176.038501264   190.355501264   Andrea_Mitchel,Andrea_Mitchel,Andrea_Mitchel,George_W_Bush      female,female,female,male       native
194.102780118   204.535780118   George_W_Bush   male    native
204.535780118   212.240780118   George_W_Bush   male    native
...

Note that the information about when in the statement George Bush took over for Andrea Mitchel is not retained.

This happens 14 times in the english bc data and not at all in the chinese.

Correspondences:

Database Tables | Python Objects | File Elements
None | speaker_bank | All .speaker files in an on.corpora.subcorpus
None | speaker_document | A .speaker file
speaker_sentence | speaker_sentence | A line in a .speaker file

class on.corpora.speaker.speaker_bank(a_subcorpus, tag, a_cursor=None, extension='speaker')

Contains: speaker_document

class on.corpora.speaker.speaker_document(document_id, extension='speaker')

Contained by: speaker_bank

Contains: speaker_sentence

class on.corpora.speaker.speaker_sentence(line_number, document_id, start_time, stop_time, name, gender, competence)

Contained by: speaker_document

Attributes:

start_time

What time this utterance or series of utterances began. If some speaker says three things in quick succession, we may have parsed these as three separate trees but timing information could have only been recorded for the three as a block.

stop_time

What time this utterance or series of utterances ended. The same caveat as with start_time applies.

name

The name of the speaker. This might be something like ‘speaker_1’ if the data was not entered.

gender

The gender of the speaker. Generally ‘male’ or ‘female’.

competence

The competency of the speaker in the language. Generally ‘native’.

parallel – Alignment Metadata for Parallel Texts

Correspondences:

Database Tables | Python Objects | File Elements
None | parallel_bank | All .parallel files in an on.corpora.subcorpus
parallel_document | parallel_document | The second line (original/translation line) in a .parallel file
parallel_sentence | parallel_sentence | All lines in a .parallel file after the first two (map lines)

class on.corpora.parallel.parallel_bank(a_subcorpus, tag, a_cursor=None, extension='parallel')
class on.corpora.parallel.parallel_document(id_original, id_translation, extension='parallel')
class on.corpora.parallel.parallel_sentence(id_original, id_translation)