See:
- Dealing with config file, command line options, etc:
- Buckwalter Arabic encoding:
- buckwalter2unicode()
- unicode2buckwalter()
- devocalize_buckwalter()
- DB:
- SGML (.name and .coref files):
- File System:
- Other:
Functions:
- on.common.util.buckwalter2unicode(b_word, sgml_safety=True)
Given a string in Buckwalter ASCII-encoded Arabic, return the Unicode version.
- on.common.util.unicode2buckwalter(u_word, sgml_safe=False, devocalize=False)
Given a Unicode word, return the Buckwalter ASCII-encoded version.
If sgml_safe is set, run the output through make_sgml_safe() before returning.
If devocalize is set, delete the vocalization marks a, u, i, and o before returning.
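The round trip and the devocalize step can be illustrated with a minimal sketch. It is not the library's implementation: the table below is a deliberately partial Buckwalter mapping (just enough for one demo word), and the names b2u_sketch and u2b_sketch are hypothetical.

```python
# Partial Buckwalter table: only the symbols needed for the demo word.
# The full scheme covers the whole Arabic alphabet and more diacritics.
BUCKWALTER_TO_UNICODE = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "k": "\u0643",  # kaf
    "a": "\u064E",  # fatha
    "u": "\u064F",  # damma
    "i": "\u0650",  # kasra
    "o": "\u0652",  # sukun
}
UNICODE_TO_BUCKWALTER = {v: k for k, v in BUCKWALTER_TO_UNICODE.items()}


def b2u_sketch(b_word):
    """Map each Buckwalter character to Unicode; pass unknown ones through."""
    return "".join(BUCKWALTER_TO_UNICODE.get(ch, ch) for ch in b_word)


def u2b_sketch(u_word, devocalize=False):
    """Map each Unicode character back to Buckwalter ASCII."""
    b_word = "".join(UNICODE_TO_BUCKWALTER.get(ch, ch) for ch in u_word)
    if devocalize:
        # Drop the vocalization marks a, u, i, o, as the docs describe.
        b_word = "".join(ch for ch in b_word if ch not in "auio")
    return b_word


unicode_word = b2u_sketch("kitAb")
```

With this table, u2b_sketch(unicode_word) gives back "kitAb", and passing devocalize=True yields "ktAb".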
- on.common.util.register_config(section, value, allowed_values=[], doc=None, required=False, section_required=False, allow_multiple=False)
Make a decorator so that functions can specify which config options they take.
Usage is:
    @register_config('corpus', 'load', doc='specify which data to load to the db in the format lang-genre-source')
    def load_banks(config):
        ...
The special value ‘__dynamic’ means that some config values are created dynamically, so we can’t verify whether a config argument is correct simply by checking whether it’s on the list. Documentation is also generated to this effect.
If allowed_values is non-empty, check that the setting the user chose is on the list.
If allow_multiple is True, then when checking that only allowed values are given, the user’s setting is first split on whitespace and each component is tested.
If required is True, then if the section exists it must specify this value. If the section does not exist, it is free to ignore this value. See section_required.
If section_required is True, issue an error if section is not defined by the user. Often wanted in combination with required.
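The interaction between required, section_required, allowed_values, and allow_multiple is easier to see in code. The following is a sketch under assumptions, not the library's implementation: REGISTRY, register_config_sketch, and check_config are hypothetical names, the config is modelled as a plain nested dict, the allowed values are made up, and treating ‘__dynamic’ as a member of allowed_values is a guess.

```python
REGISTRY = {}  # hypothetical storage: (section, value) -> rule dict


def register_config_sketch(section, value, allowed_values=(), doc=None,
                           required=False, section_required=False,
                           allow_multiple=False):
    """Record one config option and return the decorated function unchanged."""
    def decorator(func):
        REGISTRY[(section, value)] = dict(
            allowed_values=list(allowed_values), doc=doc, required=required,
            section_required=section_required, allow_multiple=allow_multiple)
        return func
    return decorator


def check_config(config):
    """Return a list of problems for a {section: {key: setting}} dict."""
    problems = []
    for (section, value), rule in sorted(REGISTRY.items()):
        if section not in config:
            if rule["section_required"]:
                problems.append("missing section [%s]" % section)
            continue  # an absent section is free to ignore required values
        if value not in config[section]:
            if rule["required"]:
                problems.append("missing %s.%s" % (section, value))
            continue
        allowed = rule["allowed_values"]
        # Assumption: '__dynamic' appears in allowed_values and disables checks.
        if not allowed or "__dynamic" in allowed:
            continue
        setting = config[section][value]
        parts = setting.split() if rule["allow_multiple"] else [setting]
        for part in parts:
            if part not in allowed:
                problems.append("bad value %r for %s.%s" % (part, section, value))
    return problems


# The allowed values here are illustrative only.
@register_config_sketch("corpus", "load",
                        allowed_values=["english-mz", "chinese-nw"],
                        required=True, section_required=True,
                        allow_multiple=True)
def load_banks(config):
    return config["corpus"]["load"].split()
```

Calling check_config({"corpus": {}}) reports the missing value, while check_config({}) reports only the missing section, which mirrors the distinction between required and section_required.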
- on.common.util.insert_ignoring_dups(inserter, a_cursor, *values)
Insert values into the db, ignoring duplicates.
The inserter can be a string, a class, or a class instance:
    string: taken to be an SQL insert statement
    class: use its sql_insert_statement field, then proceed as with string
    instance: get its __class__ and proceed as with class
So any of the following are good:
    insert_ignoring_dups(self, a_cursor, id, tag)
    insert_ignoring_dups(cls, a_cursor, id, tag)
    insert_ignoring_dups(self.__class__.weirdly_named_sql_insert_statement, a_cursor, id, tag)
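The string/class/instance dispatch can be sketched as follows. This is an illustration only: it uses an in-memory sqlite3 database rather than the MySQL backend the library targets, the Tag class and its table are hypothetical, and catching IntegrityError is an assumed way of "ignoring duplicates".

```python
import sqlite3


def insert_ignoring_dups_sketch(inserter, a_cursor, *values):
    """Resolve the insert statement from a string, class, or instance."""
    if isinstance(inserter, str):
        statement = inserter
    elif isinstance(inserter, type):
        statement = inserter.sql_insert_statement
    else:
        statement = inserter.__class__.sql_insert_statement
    try:
        a_cursor.execute(statement, values)
    except sqlite3.IntegrityError:
        pass  # duplicate key: skip the row silently


class Tag(object):
    # Hypothetical class carrying its own insert statement.
    sql_insert_statement = "INSERT INTO tag (id, tag) VALUES (?, ?)"


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tag (id TEXT PRIMARY KEY, tag TEXT)")

insert_ignoring_dups_sketch(Tag, cur, "t1", "NP")      # class
insert_ignoring_dups_sketch(Tag(), cur, "t1", "NP")    # instance, duplicate row
insert_ignoring_dups_sketch(Tag.sql_insert_statement, cur, "t2", "VP")  # string

rows = cur.execute("SELECT id, tag FROM tag ORDER BY id").fetchall()
```

After the three calls the table holds two rows, because the second call repeats the primary key of the first and is skipped.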
- on.common.util.matches_an_affix(s, affixes)
Does the given id match the affixes?
affixes is a pair (prefixes, suffixes).
Given either a four-digit string or a document id, return whether at least one of the prefixes and at least one of the suffixes match it.
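The "at least one prefix and at least one suffix" rule can be sketched for plain strings. This is a hypothetical helper, not the library's code; in particular it does not attempt whatever handling the real function applies to full document ids.

```python
def matches_an_affix_sketch(s, affixes):
    """True if s starts with some prefix AND ends with some suffix."""
    prefixes, suffixes = affixes
    return (any(s.startswith(p) for p in prefixes) and
            any(s.endswith(x) for x in suffixes))


hit = matches_an_affix_sketch("0012", (["00", "01"], ["12", "13"]))
miss = matches_an_affix_sketch("0012", (["01"], ["12"]))
```

Here hit is True because "0012" starts with "00" and ends with "12", while miss is False because no prefix matches.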
- on.common.util.output_file_name(doc_id, doc_type, out_dir='')
Determine what file to write an X_document to.
    doc_id: a document id
    doc_type: the type of the document, like a suffix (parse, prop, name, ...)
    out_dir: if set, make the output as a child of out_dir
- on.common.util.get_lemma(a_leaf, verb2morph, noun2morph, fail_on_not_found=False)
Return the lemma for a_leaf’s word.
If we have appropriate word2morph hashes, look the word up there. Otherwise just return the word. Functionally, for Chinese we use the word itself and for English we have the hashes. When we get to doing Arabic we’ll need to add a case.
If fail_on_not_found is set, return “” instead of a_leaf.word if we don’t have a mapping for this lemma.
- on.common.util.load_config(cfg_name=None, config_append=[])
Load a configuration file to memory.
The given configuration file name can be a full path, in which case we simply read that configuration file. Otherwise, if you give ‘myconfig’ or something similar, we look in the current directory and the home directory. We also look to see if files with this name and extension ‘.conf’ exist. So for ‘myconfig’ we would look in the following places:
- ./myconfig
- ./myconfig.conf
- [home]/.myconfig
- [home]/.myconfig.conf
Once we find the configuration, we load it. We also extend ConfigParser to support [] notation. So you could look up key k in section s with config[s,k]. See FancyConfigParser() .
If config_append is set we use parse_cfg_args() and add any values it creates to the config object. These values override any previous ones.
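The search order described above can be sketched as a small helper. candidate_config_paths is a hypothetical name, and treating "full path" as "absolute path" is an assumption; the real function may resolve names differently.

```python
import os


def candidate_config_paths(cfg_name, home=None):
    """List the places a short config name would be looked up, in order."""
    if os.path.isabs(cfg_name):
        return [cfg_name]  # assumption: a full path is used as given
    if home is None:
        home = os.path.expanduser("~")
    return [
        os.path.join(".", cfg_name),
        os.path.join(".", cfg_name + ".conf"),
        os.path.join(home, "." + cfg_name),
        os.path.join(home, "." + cfg_name + ".conf"),
    ]


paths = candidate_config_paths("myconfig", home=os.path.join(os.sep, "home", "user"))
```

A loader built on this would read the first candidate that exists, then apply any config_append overrides on top.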
- on.common.util.mkdirs(long_path)
Make the given path exist. If the path already exists, raise an exception.
- on.common.util.load_options(parser=None, argv=[], positional_args=True)
Parse sys.argv, possibly exiting if there are mistakes.
If you set parser to a ConfigParser object, then you have control over the usage string and you can prepopulate it with options you intend to use. But don’t set a --config / -c option; load_options uses that to find a configuration file to load.
If a parser was passed in, we return (config, parser, [args]). Otherwise we return (config, [args]). Args is only included if positional_args is True and there are positional arguments.
See load_config() for details on the --config option.
- on.common.util.parse_cfg_args(arg_list)
Parse command-line style config settings to a dictionary.
If you want to override configuration file values on the command line, or set ones that were not set, this should make it simpler. Given a list in the format [section.key=value, ...], return a dictionary in the form {(section, key): value, ...}.
So we might have:
    ['corpus.load=english-mz', 'corpus.data_in=/home/user/corpora/ontonotes/data/']
We would then return the dictionary:
    {('corpus', 'load'): 'english-mz', ('corpus', 'data_in'): '/home/user/corpora/ontonotes/data/'}
See also load_config() and load_options().
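The conversion from section.key=value strings to a dictionary can be sketched in a few lines. parse_cfg_args_sketch is a hypothetical stand-in; how the real function treats malformed arguments is not documented here, so the sketch does no validation.

```python
def parse_cfg_args_sketch(arg_list):
    """Turn ['section.key=value', ...] into {(section, key): value, ...}."""
    settings = {}
    for arg in arg_list:
        # Split on the first '=' so values may themselves contain '='.
        name, _, value = arg.partition("=")
        # Split on the first '.' so the section is everything before it.
        section, _, key = name.partition(".")
        settings[(section, key)] = value
    return settings


parsed = parse_cfg_args_sketch([
    "corpus.load=english-mz",
    "corpus.data_in=/home/user/corpora/ontonotes/data/",
])
```

Because later entries overwrite earlier ones in the dictionary, repeating a setting on the command line keeps the last value given.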
- on.common.util.listdir(dirname)
List a directory’s children, sorted and without hidden files.
Basically os.listdir(), sorted and without hidden (in the Unix sense: starting with a ‘.’) files.
- on.common.util.listdir_full(dirname)
A full-path-to-file version of on.common.util.listdir().
- on.common.util.listdir_both(dirname)
Return a list of (short_path, full_path) tuples.
Identical to zip(listdir(dirname), listdir_full(dirname)).
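The three listing helpers are thin wrappers around os.listdir(), which a short sketch makes concrete. The _sketch names are hypothetical, and the demo directory and file names are made up for illustration.

```python
import os
import shutil
import tempfile


def listdir_sketch(dirname):
    """os.listdir(), sorted, without names starting with '.'."""
    return sorted(n for n in os.listdir(dirname) if not n.startswith("."))


def listdir_full_sketch(dirname):
    """Same entries, each joined onto dirname."""
    return [os.path.join(dirname, n) for n in listdir_sketch(dirname)]


def listdir_both_sketch(dirname):
    """Pairs of (short_path, full_path)."""
    return list(zip(listdir_sketch(dirname), listdir_full_sketch(dirname)))


# Demo on a throwaway directory with one hidden file.
demo_dir = tempfile.mkdtemp()
for name in ("b.parse", "a.parse", ".hidden"):
    open(os.path.join(demo_dir, name), "w").close()
demo_names = listdir_sketch(demo_dir)
demo_both = listdir_both_sketch(demo_dir)
shutil.rmtree(demo_dir)
```

The hidden file is filtered out and the remaining names come back sorted, so demo_names lists a.parse before b.parse.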
- exception on.common.util.NotInConfigError
Because people might want to use a dictionary in place of a ConfigParser object, use NotInConfigError as the error to catch for a config[section, value] call. For example:
    try:
        load_data(config['Data', 'data_location'])
    except on.common.util.NotInConfigError:
        print('Loading data failed. Sorry.')
- class on.common.util.bunch(**kwargs)
A simple class for short-term holding of related variables.
Change code like:
    def foo_some(a_ontonotes, b_ontonotes):
        a_sense_bank = ...
        a_ontonotes.foo(a_sense_bank)
        a_...
        a_...
        b_sense_bank = ...
        b_ontonotes.foo(b_sense_bank)
        b_...
        b_...
        big_func(a_bar, b_bar)
To:
    def foo_some(a_ontonotes, b_ontonotes):
        a = bunch(ontonotes=a_ontonotes)
        b = bunch(ontonotes=b_ontonotes)
        for v in [a, b]:
            v.sense_bank = ...
            v.ontonotes.foo(v.sense_bank)
            v. ...
            v. ...
        big_func(a.bar, b.bar)
Or:
    def foo_some(a_ontonotes, b_ontonotes):
        def foo_one(v):
            v.sense_bank = ...
            v.ontonotes.foo(v.sense_bank)
            v. ...
            v. ...
            return v
        big_func(foo_one(bunch(ontonotes=a_ontonotes)).bar,
                 foo_one(bunch(ontonotes=b_ontonotes)).bar)
Basically it lets you group similar things. It’s adding hierarchy to the local variables. It’s a hash table with more convenient syntax.
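A class with this behaviour takes only a few lines, which may help show why it is "a hash table with more convenient syntax". This is a common idiom and only a guess at how the library defines it; bunch_sketch and the sample values are hypothetical.

```python
class bunch_sketch(object):
    """Hold arbitrary keyword arguments as attributes."""

    def __init__(self, **kwargs):
        # The instance dictionary is the hash table; attributes are its keys.
        self.__dict__.update(kwargs)


a = bunch_sketch(ontonotes="a_ontonotes")
a.sense_bank = {"run": ["run.01"]}  # new fields can be added at any time
```

vars(a) exposes the underlying dictionary, so the grouped values can still be iterated like an ordinary hash.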
- on.common.util.is_db_ref(a_hash)
Is this hash a reference to the database?
If a hash (sense inventories, frames, etc.) is equal to {'DB': a_cursor}, that means we should look for our information in the database instead of using the hash itself.
- on.common.util.make_db_ref(a_cursor)
Create a hash substitute that means ‘go look in the db instead’.
See is_db_ref().
- on.common.util.is_not_loaded(a_hash)
Do we have no intention of loading the data a_hash is supposed to contain?
If a hash has a single key ‘NotLoaded’, that means we don’t intend to load that hash and we shouldn’t complain about data inconsistency involving the hash. So if we’re loading senses and the sense_inventory_hash is_not_loaded(), then we shouldn’t drop senses for being references against lemmas that don’t exist.
- on.common.util.make_not_loaded()
Create a hash substitute that means ‘act as if you had this information’.
See is_not_loaded().
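Both pairs of helpers describe sentinel dictionaries, which can be sketched as below. The _sketch names are hypothetical, the value stored under ‘NotLoaded’ is an assumption, and the DB check here looks only at the key rather than comparing against a specific cursor.

```python
def make_db_ref_sketch(a_cursor):
    """Stand-in hash meaning 'go look in the db instead'."""
    return {"DB": a_cursor}


def is_db_ref_sketch(a_hash):
    """True if the hash is exactly a one-key 'DB' sentinel."""
    return list(a_hash.keys()) == ["DB"]


def make_not_loaded_sketch():
    """Stand-in hash meaning 'act as if you had this information'."""
    return {"NotLoaded": True}  # the stored value is an assumption


def is_not_loaded_sketch(a_hash):
    """True if the hash has the single key 'NotLoaded'."""
    return list(a_hash.keys()) == ["NotLoaded"]


fake_cursor = object()  # placeholder for a real database cursor
db_ref = make_db_ref_sketch(fake_cursor)
```

A loader can then branch on these checks before treating the hash as real data, for example skipping consistency checks when is_not_loaded_sketch() is true.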
- on.common.util.esc(*varargs)
Given a number of arguments, return escaped (for MySQL) versions of each of them.
- on.common.util.make_sgml_safe(s, reverse=False, keep_turn=True)
Return a version of the string that can be put in an SGML document.
This means changing angle brackets and ampersands to ‘-LAB-’, ‘-RAB-’, and ‘-AMP-’. Needed for creating .name and .coref files.
If keep_turn is set, <TURN> in the input is turned into [TURN] rather than -LAB-TURN-RAB-.
- on.common.util.make_sgml_unsafe(s)
Return a version of the string that has real <, >, and &.
Convert the ‘escaped’ versions of dangerous characters back to their normal ASCII form. Needed for reading .name and .coref files, as well as any other SGML files like the frames and the sense inventories and pools.
See make_sgml_safe().
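The escaping scheme can be sketched with plain string replacement. This is an illustration, not the library's code: the _sketch names are hypothetical, interpreting reverse=True as "undo the escaping" is an assumption, and the sketch does not convert [TURN] back to <TURN>.

```python
_SGML_ESCAPES = [("&", "-AMP-"), ("<", "-LAB-"), (">", "-RAB-")]


def make_sgml_unsafe_sketch(s):
    """Restore real &, <, and > from their escaped forms."""
    for raw, escaped in _SGML_ESCAPES:
        s = s.replace(escaped, raw)
    return s


def make_sgml_safe_sketch(s, reverse=False, keep_turn=True):
    """Replace &, <, and > with -AMP-, -LAB-, and -RAB-."""
    if reverse:
        return make_sgml_unsafe_sketch(s)  # assumed meaning of reverse
    if keep_turn:
        # Rewrite <TURN> first so its brackets are not escaped below.
        s = s.replace("<TURN>", "[TURN]")
    for raw, escaped in _SGML_ESCAPES:
        s = s.replace(raw, escaped)
    return s


safe = make_sgml_safe_sketch("AT&T <b>")
```

Since none of the escaped forms contains &, <, or >, the replacements do not interfere with each other and the order within the loop does not matter.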
- class on.common.util.FancyConfigParser(defaults=None)
Make a config parser with support for config[section, value].
Raises FancyConfigParserError on improper usage.
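The config[section, value] notation amounts to accepting a tuple in __getitem__. The sketch below shows one way to do that on top of Python 3's configparser; the class and exception names are hypothetical, and the real class may be built quite differently.

```python
import configparser


class NotInConfigSketchError(Exception):
    """Hypothetical stand-in for the lookup error described in these docs."""


class FancyConfigParserSketch(configparser.ConfigParser):
    def __getitem__(self, key):
        # A (section, option) tuple is looked up directly; anything else
        # falls back to the standard section-proxy behaviour.
        if isinstance(key, tuple):
            section, option = key
            try:
                return self.get(section, option)
            except (configparser.NoSectionError, configparser.NoOptionError):
                raise NotInConfigSketchError(key)
        return super().__getitem__(key)


config = FancyConfigParserSketch()
config.read_string("[corpus]\nload = english-mz\n")
loaded = config["corpus", "load"]
```

Raising a single dedicated error for both a missing section and a missing option keeps the calling code to one except clause.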
See:
Functions:
- on.common.log.error(error_string, terminate_program=True, current_frame=False)
Print error messages to stderr, optionally calling sys.exit.
- on.common.log.warning(warning_string, verbosity=0)
Print the warning string, depending on the value of on.common.log.VERBOSITY.
- on.common.log.info(text, newline=True)
Write the text to standard error, followed by a newline.
- on.common.log.debug(debug_object, debug_flag, verbosity=0)
- on.common.log.status(*args)
Write each argument to stderr, space separated, with a trailing newline.
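The behaviour described for status() can be sketched in a few lines. status_sketch is a hypothetical stand-in, and its stream parameter is an addition made here so the output can be captured; the documented function takes only *args.

```python
import io
import sys


def status_sketch(*args, stream=None):
    """Write the arguments space separated, then a newline."""
    if stream is None:
        stream = sys.stderr
    stream.write(" ".join(str(a) for a in args) + "\n")


buf = io.StringIO()
status_sketch("loaded", 3, "documents", stream=buf)
```

Converting each argument with str() lets callers mix strings and numbers without formatting them first.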