Towards Robust Linguistic Analysis using OntoNotes
This webpage is a supplement to the following paper. It is our intention to provide the data sets, system outputs and models for all the layers in OntoNotes across all languages so as to help future researchers perform consistent comparisons.
-
Towards Robust Linguistic Analysis using OntoNotes
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, Zhi Zhong
Proceedings of the Seventeenth Conference on Computational Natural Language Learning,
Sofia, Bulgaria, August 2013
The page is currently under construction. As of now it provides the English OntoNotes v5.0 data in the format used by the CoNLL-2011/2012 shared tasks and by the experiments in the above paper. Unlike the CoNLL shared tasks, which used only a portion of the OntoNotes data, it includes all the files in the OntoNotes v5.0 release. Since not all files in the OntoNotes release have been annotated with all layers, the corresponding columns for those layers are filled with a default value. The files are provided in three separate tarballs containing the train, development and test data partitions respectively, following the partition scheme used in the CoNLL-2011/2012 tasks as well as in the above paper.
Training, Development and Test Partitions
This section provides links to the v12 release on GitHub, along with the list of files in each of the Training, Development and Test partitions for each of the three languages: English, Chinese and Arabic (the Chinese and Arabic versions are forthcoming).
- Data
- v12 release
- This is a corrected version that addresses important issues in the original release, as identified by Oscar Täckström, Kuzman Ganchev and Dipanjan Das. It supersedes the version used in the above paper, which is in the process of being updated in the ACL Anthology.
Steps for assembling the data
- Unpack the OntoNotes release -- LDC2013T19.tgz obtained from LDC
You will find that the data files are organized in the following directory tree, first by language and then by genre.
$ tar zxvf LDC2013T19.tgz
$ tree -L 3 -d ontonotes-release-5.0/data/files/data
ontonotes-release-5.0/
└── data
    └── files
        └── data
            ├── arabic
            │   └── annotations
            │       └── nw
            ├── chinese
            │   └── annotations
            │       ├── bc
            │       ├── bn
            │       ├── mz
            │       ├── nw
            │       ├── tc
            │       └── wb
            └── english
                └── annotations
                    ├── bc
                    ├── bn
                    ├── mz
                    ├── nw
                    ├── pt
                    ├── tc
                    └── wb
- Create the CoNLL format files for the separate training, development and test tar.gz files above.
Once you untar the training, development and test archives, you will see that the files are in the following directory tree:
conll-formatted-ontonotes-5.0/
└── data
    ├── development
    │   └── data
    │       └── english
    │           └── annotations
    │               ├── bc
    │               ├── bn
    │               ├── mz
    │               ├── nw
    │               ├── pt
    │               ├── tc
    │               └── wb
    ├── test
    │   └── data
    │       └── english
    │           └── annotations
    │               ├── bc
    │               ├── bn
    │               ├── mz
    │               ├── nw
    │               ├── pt
    │               ├── tc
    │               └── wb
    └── train
        └── data
            └── english
                └── annotations
                    ├── bc
                    ├── bn
                    ├── mz
                    ├── nw
                    ├── pt
                    ├── tc
                    └── wb
The data directory in each of the tar.gz files from this webpage corresponds to the second data directory (...files/data) in the OntoNotes release. Each leaf directory contains files of the form:
[source]_[four-digit-number].[extension]
with extension of the form:
[extension] := [version]_[quality]_[type]
[version] := v[number]
[quality] := gold|auto
[type] := skel|conll
In the above tarballs, all files are of gold quality and will be converted to *_conll files after assembly.
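As an illustration of this naming scheme, here is a minimal Python sketch that splits such a file name into its components. The regular expression, the helper name parse_ontonotes_filename, and the example file name are assumptions made for this illustration; they are not part of the distributed scripts.

import re

# Hypothetical helper for the naming scheme described above:
#   [source]_[four-digit-number].[version]_[quality]_[type]
FILENAME_RE = re.compile(
    r"^(?P<source>.+)_(?P<number>\d{4})\."   # source and four-digit number
    r"(?P<version>v\d+)_"                    # version, e.g. v0
    r"(?P<quality>gold|auto)_"               # gold or auto quality
    r"(?P<type>skel|conll)$"                 # skeleton or conll file
)

def parse_ontonotes_filename(name):
    """Return a dict of the name components, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None

# Example (the file name here is illustrative only):
# parse_ontonotes_filename("cnn_0001.v0_gold_skel")
# -> {'source': 'cnn', 'number': '0001', 'version': 'v0', 'quality': 'gold', 'type': 'skel'}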
- Download and run the scripts
Download the scripts from the following location
Scripts:
Following is the list of all scripts:
conll-formatted-ontonotes-5.0
└── scripts
    ├── skeleton2conll.py
    └── skeleton2conll.sh
You can now generate the *_conll files from each corresponding *_skel file. A *_skel file is very similar to a *_conll file: it contains the information on all the layers of annotation except the underlying words. Owing to copyright restrictions on the underlying text, we have to use this workaround. The skeleton2conll.sh shell script is a wrapper for the skeleton2conll.py script; it takes a *_skel file as input and generates the corresponding *_conll file. Getting the words back from the trees is non-trivial for some genres because we have eliminated the disfluencies marked by the phrase type EDITED in the Treebank. The usage of this script is described with an example below:
Usage:
skeleton2conll.sh -D [path/to/ontonotes-v5.0-release/data/files/data]
                     [path/to/conll-formatted-ontonotes-5.0]
Description:
[path/to/ontonotes-v5.0-release/data/files/data]:
    Location of the "data" directory under the ontonotes-v5.0-release obtained
    from uncompressing the release from LDC.
[path/to/conll-formatted-ontonotes-5.0]:
    The top-level directory of the package downloaded from this webpage,
    inside which the *_skel files that need to be converted to *_conll files are located.
Example:
The following will create *_conll files for all the *_skel files in the
conll-formatted-ontonotes-5.0/ directory.
skeleton2conll.sh -D /nfs/.../ontonotes-release-5.0/data/files/data
/nfs/.../conll-formatted-ontonotes-5.0/
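To make the conversion concrete, below is a minimal sketch of the idea behind skeleton-to-conll assembly, assuming the tokens have already been recovered, in order, from the corresponding Treebank parses (with the EDITED disfluencies already removed). This is not the distributed skeleton2conll.py; the function name merge_skel_with_tokens and the single-space joining of columns are simplifying assumptions for illustration only.

def merge_skel_with_tokens(skel_lines, tokens):
    """Replace the [WORD] placeholder (Column 4) in each skeleton row with the
    next token recovered from the Treebank parse for that document."""
    token_iter = iter(tokens)
    merged = []
    for line in skel_lines:
        stripped = line.rstrip("\n")
        # Keep comment lines (e.g. document boundaries) and blank sentence
        # separators unchanged.
        if not stripped or stripped.startswith("#"):
            merged.append(stripped)
            continue
        columns = stripped.split()
        if columns[3] == "[WORD]":        # Column 4 holds the placeholder
            columns[3] = next(token_iter)
        merged.append(" ".join(columns))  # simplification: the real files may pad columns for alignment
    return merged

In the actual release, the tokens come from the parse trees in the OntoNotes Treebank files, which is why the -D option must point at the data directory of the LDC release.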
*_conll File Format
The *_conll files contain data in a tabular structure similar to that used by previous CoNLL shared tasks. We use a [tag]-based extension naming approach, where a [tag] is applied to the .conll file to name it, say .[tag]_conll. The [tag] itself can have multiple components and serves to highlight the characteristics of that .conll file. For example, the two tags that we use in the data are "v0_gold" and "v0_auto". Each has two parts separated by an underscore. The first part has the same value, "v0", in both cases and indicates the version of the file. The second part takes two values, "gold" and "auto": "gold" indicates that the annotation in that file is of hand-annotated and adjudicated quality, whereas "auto" means it was produced using a combination of automatic tools.

The contents of each of these files comprise a set of columns. Most columns represent a linear annotation on a sentence, for example a part-of-speech annotation, which assigns one part of speech per word and therefore uses one column for that layer. The exception is predicate argument structure, as introduced in the CoNLL-2005 shared task, where multiple columns are read in sync with the predicate column: each such column records the role that every other word in the sentence plays with respect to one predicate, so the number of columns for this layer varies with the number of predicates. For convenience, we have kept the coreference information in the very last column and the predicate argument structure information in a variable number of columns preceding it.
The columns in the *_conll file represent the following:
Column | Type | Description |
1 | Document ID | This is a variation on the document filename |
2 | Part number | Some files are divided into multiple parts numbered as 000, 001, 002, ... etc. |
3 | Word number | This is the word index of the word in that sentence. |
4 | Word itself | This is the token as segmented/tokenized in the Treebank. Initially the *_skel file contains the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release. |
5 | Part-of-Speech | This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the ones that have some sense or proposition annotation are marked with an XX tag, and the verbs are marked with just a VERB tag. |
6 | Parse bit | This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column (see the sketch below this table). When the parse information is missing, the first word of a sentence is tagged as "(TOP*", the last word is tagged as "*)", and all intermediate words are tagged with a "*". |
7 | Predicate lemma | The predicate lemma is mentioned for the rows for which we have semantic role information or word sense information. All other rows are marked with a "-". |
8 | Predicate Frameset ID | This is the PropBank frameset ID of the predicate in Column 7. |
9 | Word sense | This is the word sense of the word in Column 4. |
10 | Speaker/Author | This is the speaker or author name, where available. It is mostly present in the Broadcast Conversation and Web Log data. When not available, the rows are marked with a "-". |
11 | Named Entities | This column identifies the spans representing various named entities. For documents that do not have named entity annotation, each line is marked with an "*". |
12:N | Predicate Arguments | There is one column of predicate argument structure information for each predicate mentioned in Column 7. If there are no predicates tagged in a sentence, this is a single column with all rows marked with an "*". |
N | Coreference | Coreference chain information encoded in a parenthesis structure. For documents that do not have coreference annotations, each line is represented with a "-". |
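As noted in the Parse bit row above, the full parse of a sentence can be rebuilt by substituting each asterisk with the "([pos] [word])" leaf and concatenating the parse bits. The sketch below illustrates that rule, together with a simple way of grouping *_conll rows into sentences; the function names and the whitespace splitting of columns are assumptions made for this illustration, not part of the distributed scripts.

def read_sentences(conll_path):
    """Yield one sentence at a time as a list of column lists."""
    sentence = []
    with open(conll_path) as handle:
        for line in handle:
            line = line.strip()
            # Blank lines separate sentences; "#" lines mark document boundaries.
            if not line or line.startswith("#"):
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(line.split())
    if sentence:
        yield sentence

def reconstruct_parse(sentence):
    """Rebuild the bracketed parse from Columns 4 (word), 5 (POS) and 6 (parse bit)."""
    pieces = []
    for columns in sentence:
        word, pos, parse_bit = columns[3], columns[4], columns[5]
        # Substitute the asterisk with the "(pos word)" leaf and concatenate.
        pieces.append(parse_bit.replace("*", f"({pos} {word})"))
    return "".join(pieces)

For example, a row whose parse bit is "(NP*)" with part of speech "NNP" and word "John" contributes "(NP(NNP John))" to the reconstructed parse.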