src.data_sources module¶

Implementation classes for model data query/fetch functionality, selected by the user via --data_manager; see Model data sources and Data layer: Overview.

class src.data_sources.SampleDataFile(first_arg=None, *args, **kwargs)[source]¶

Bases: object

Dataclass describing catalog entries for sample model data files.

sample_dataset: str = sentinel.Mandatory¶

frequency: src.util.datelabel.DateFrequency = sentinel.Mandatory¶

variable: str = sentinel.Mandatory¶

remote_path: str = sentinel.Mandatory¶

__init__(sample_dataset: str = sentinel.Mandatory, frequency: src.util.datelabel.DateFrequency = sentinel.Mandatory, variable: str = sentinel.Mandatory, remote_path: str = sentinel.Mandatory) → None ¶: Initialize self. See help(type(self)) for accurate signature.

__post_init__(*args, **kwargs)¶

classmethod from_string(str_, *args)¶: Create an object instance from a string representation str_. Used by regex_dataclass() for parsing field values and automatic type coercion.
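
As a rough illustration of the from_string() pattern, the sketch below parses a relative path into dataclass fields with a regular expression. It is not the framework's regex_dataclass() implementation: the class name SampleFileSketch, the regex, and the assumed directory layout are all hypothetical.

```python
import dataclasses
import re


@dataclasses.dataclass
class SampleFileSketch:
    """Hypothetical stand-in for a regex-parsed catalog entry."""
    sample_dataset: str
    frequency: str
    variable: str
    remote_path: str

    # Assumed layout: <dataset>/<frequency>/<dataset>.<variable>.<frequency>.nc
    # (unannotated, so the dataclass machinery does not treat it as a field).
    _regex = re.compile(
        r"(?P<sample_dataset>\S+)/(?P<frequency>\w+)/"
        r"(?P=sample_dataset)\.(?P<variable>\w+)\.(?P=frequency)\.nc"
    )

    @classmethod
    def from_string(cls, str_):
        """Build an instance from a relative path, or raise ValueError."""
        match = cls._regex.fullmatch(str_)
        if match is None:
            raise ValueError(f"Path {str_!r} does not match the expected pattern.")
        return cls(remote_path=str_, **match.groupdict())


entry = SampleFileSketch.from_string("CASE.001/day/CASE.001.pr.day.nc")
```

Paths that do not match the pattern are rejected rather than partially parsed, which mirrors the documented behavior that only matching paths end up in the catalog.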

class src.data_sources.SampleDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, sample_dataset: str = '')[source]¶

Bases: src.data_manager.DataSourceAttributesBase

Data-source-specific attributes for the DataSource providing sample model data.

convention: str = 'CMIP'¶

sample_dataset: str = ''¶

__post_init__(log=<Logger>)[source]¶: Validate user input.

CASENAME = sentinel.Mandatory¶

CASE_ROOT_DIR = ''¶

FIRSTYR = sentinel.Mandatory¶

LASTYR = sentinel.Mandatory¶

__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, sample_dataset: str = '') → None ¶: Initialize self. See help(type(self)) for accurate signature.

log = <Logger src.data_manager (WARNING)>¶

class src.data_sources.SampleLocalFileDataSource(*args, **kwargs)[source]¶

Bases: src.data_manager.SingleLocalFileDataSource

DataSource for handling POD sample model data stored on a local filesystem.

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)¶

property CATALOG_DIR¶: Placeholder class used in the definition of the abstract_attribute() decorator.

__post_init__()¶

property active¶

property all_columns¶

check_group_daterange(group_df, expt_key=None, log=<Logger>)¶: Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.
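
The checks described above can be sketched with plain date tuples. This is a minimal illustration, not the framework's code: the function name check_dateranges is hypothetical, and it assumes daily resolution with inclusive end dates, whereas the framework works with its own date range objects.

```python
import datetime as dt


def check_dateranges(file_ranges, query_range):
    """Return file ranges sorted by start date, or raise ValueError.

    Each range is a (start, end) tuple of dates with an inclusive end.
    """
    ordered = sorted(file_ranges)
    if not ordered:
        raise ValueError("No files found.")
    one_day = dt.timedelta(days=1)
    # Contiguity: each file must start the day after the previous one ends.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start != prev_end + one_day:
            raise ValueError(f"Gap or overlap between {prev_end} and {next_start}.")
    # Coverage: the combined range must span the query range.
    if ordered[0][0] > query_range[0] or ordered[-1][1] < query_range[1]:
        raise ValueError("Files do not span the query date range.")
    return ordered
```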

child_deactivation_handler(child, child_exc)¶: When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.

child_status_update(exc=None)¶

close_log_file(log=True)¶

data_key(value, expt_key=None, status=None)¶: Constructor for an instance of DataKeyBase that’s used by this DataSource.

deactivate(exc, level=None)¶

property df¶: Synonym for the DataFrame containing the catalog.

property failed¶

fetch_data()¶

fetch_dataset(var, d_key)¶: Fetches data corresponding to data_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).

property full_name¶

generate_catalog()¶: Crawl the directory hierarchy via iter_files() and return the set of found files as rows in a Pandas DataFrame.

get_expt_key(scope, obj, parent_id=None)¶

Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.

If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.

init_extra_log_handlers()¶

init_log(log_dir, fmt=None)¶

is_fetch_necessary(d_key, var=None)¶

iter_children(child_type=None, status=None, status_neq=None)¶

Generator iterating over child objects associated with this object.

Parameters

status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.
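
The precedence between status and status_neq can be sketched as follows. The Status enum and its member names are hypothetical stand-ins for ObjectStatus, and the children are simple namespace objects rather than framework classes.

```python
import enum
from types import SimpleNamespace


class Status(enum.Enum):
    """Hypothetical stand-in for ObjectStatus."""
    ACTIVE = 1
    FAILED = 2


def iter_children(children, status=None, status_neq=None):
    """Yield children filtered by status; status takes precedence over status_neq."""
    for child in children:
        if status is not None:
            if child.status == status:
                yield child
        elif status_neq is None or child.status != status_neq:
            yield child


children = [
    SimpleNamespace(name="a", status=Status.ACTIVE),
    SimpleNamespace(name="b", status=Status.FAILED),
]
```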

iter_files()¶: Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.

iter_vars(active=None, pod_active=None)¶

Iterator over all VarlistEntries (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.

Parameters

active –
bool or None, default None. Selects the subset of VarlistEntries returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries
  (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive
  VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.

iter_vars_only(active=None)¶: Convenience wrapper for iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.

name: str = sentinel.Mandatory¶

post_fetch_hook(vars)¶: Called after fetching each batch of query results.

post_query_and_fetch_hook()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

post_query_hook(vars)¶: Called after select_experiment(), after each query of a new batch of variables.

pre_fetch_hook(vars)¶: Called before fetching each batch of query results.

pre_query_and_fetch_hook()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

pre_query_hook(vars)¶: Called before querying the presence of a new batch of variables.

preprocess_data()¶: Hook to run the preprocessing function on all variables.

query_and_fetch_cleanup(signum=None, frame=None)¶: Called if framework is terminated abnormally. Not called during normal exit.

query_data()¶

query_dataset(var)¶: Verify that only a single file was found from each experiment.

property remote_data_col¶: Name of the column in the catalog containing the path to the remote data file.

request_data()¶: Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.
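
One plausible reading of how the hooks documented in this class nest around request_data() is sketched below. The class HookOrderSketch is hypothetical and only records call order; the actual method also handles experiment selection, preprocessing and fallback to alternate variables, none of which is modeled here.

```python
class HookOrderSketch:
    """Hypothetical recorder showing an assumed ordering of the hook methods."""

    def __init__(self):
        self.calls = []

    def _hook(self, name, batch=None):
        # Record the hook name, tagged with the batch it was called for.
        self.calls.append(name if batch is None else f"{name}({batch})")

    def request_data(self, batches):
        self._hook("pre_query_and_fetch_hook")       # once, before the loop
        for batch in batches:
            self._hook("pre_query_hook", batch)
            # ... query the catalog for this batch of variables ...
            self._hook("post_query_hook", batch)
            self._hook("pre_fetch_hook", batch)
            # ... fetch the files found by the query ...
            self._hook("post_fetch_hook", batch)
        self._hook("post_query_and_fetch_hook")      # once, after the loop
        return self.calls
```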

resolve_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

resolve_pod_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

resolve_var_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

select_data()¶

set_experiment()¶: Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.

set_expt_key(obj, expt_key)¶

setup()¶

setup_fetch()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

setup_pod(pod)¶

Update POD with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)

Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.

setup_query()¶: Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.

setup_var(pod, v)¶

Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)

Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.

status: src.core.ObjectStatus = 1¶

tear_down_fetch()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

tear_down_query()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

variable_dest_path(pod, var)¶: Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.

class src.data_sources.MetadataRewriteParser(data_mgr, pod)[source]¶

Bases: src.xr_parser.DefaultDatasetParser

After loading and parsing the metadata on dataset ds but before applying the preprocessing functions, update attrs on ds with the new metadata values that were specified in ExplicitFileDataSource’s config file.

__init__(data_mgr, pod)[source]¶

Constructor.

Parameters

data_mgr – DataSource instance calling the preprocessor.
pod (Diagnostic) – POD whose variables are being preprocessed.

setup(data_mgr, pod)[source]¶

Make a lookup table to map VarlistEntry IDs to the set of metadata that we need to alter.

If the user has provided the name of the variable used by the data files (via the var_name attribute), set that as the translated variable name. Otherwise, variables are untranslated, and we use the heuristics in xr_parser.DefaultDatasetParser.guess_dependent_var() to determine the name.

check_calendar(ds)¶

Checks the ‘calendar’ attribute has been set correctly for time-dependent data (assumes CF conventions).

Sets the “calendar” attr on the time coordinate, if it exists, in order to be read by the calendar property defined in the cf_xarray accessor.

check_ds_attrs(var, ds)¶

Final checking of xarray Dataset attribute dicts before starting functions in src.preprocessor.

Only checks attributes on the dependent variable var and its coordinates: any other netCDF variables in the file are ignored.

check_metadata(ds_var, *attr_names)¶: Wrapper for normalize_attr(), specialized to the case of getting a variable’s standard_name.

compare_attr(our_attr_tuple, ds_attr_tuple, comparison_func=None, fill_ours=True, fill_ds=False, overwrite_ours=None)¶

Worker function to compare two attributes (on our_var, the framework’s record, and on ds, the “ground truth” of the dataset) and update one in the event of disagreement.

This handles the special cases where the attribute isn’t defined on our_var or ds.

Parameters

our_attr_tuple – tuple specifying the attribute on our_var
ds_attr_tuple – tuple specifying the same attribute on ds
comparison_func – function of two arguments to use to compare the attributes; defaults to __eq__.
fill_ours (bool) – If the attr on our_var is missing, fill it in with the value from ds.
fill_ds (bool) – If the attr on ds is missing, fill it in with the value from our_var.
overwrite_ours (bool) –
Action to take if both attrs are defined but have different values:
- None (default): Update our_var if fill_ours is True,
  but in any case raise a MetadataEvent.
- True: Change our_var to match ds.
- False: Change ds to match our_var.
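
The decision table above can be sketched with two plain dicts standing in for our_var and ds. This is an illustration under stated assumptions, not the framework's method: the function signature is simplified, and a returned list of notes stands in for raising a MetadataEvent.

```python
_MISSING = object()  # local marker for "attribute not defined"


def compare_attr(ours, ds, name, fill_ours=True, fill_ds=False, overwrite_ours=None):
    """Reconcile attribute `name` between dicts `ours` and `ds`, in place.

    Returns a list of notes (a stand-in for MetadataEvents).
    """
    notes = []
    our_val = ours.get(name, _MISSING)
    ds_val = ds.get(name, _MISSING)
    if our_val is _MISSING and ds_val is _MISSING:
        return notes
    if our_val is _MISSING:
        if fill_ours:
            ours[name] = ds_val
        return notes
    if ds_val is _MISSING:
        if fill_ds:
            ds[name] = our_val
        return notes
    if our_val == ds_val:
        return notes
    # Both are defined but disagree.
    if overwrite_ours is None:
        if fill_ours:
            ours[name] = ds_val
        notes.append(f"{name}: expected {our_val!r}, found {ds_val!r}")
    elif overwrite_ours:
        ours[name] = ds_val
    else:
        ds[name] = our_val
    return notes
```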

static get_unmapped_names(ds)¶

Get a dict whose keys are variable or attribute names referred to by variables in the Dataset ds, but not present in the dataset itself.

Returns: Values of the dict are sets of names of variables in the dataset that referred to the missing name (keys).
Return type: (dict)
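
The shape of the returned dict can be illustrated with a plain mapping of variable names to attribute dicts. Which attributes the real method inspects is not stated here; the choice of the CF "bounds" and "coordinates" attributes below is an assumption, as is the simplified function signature.

```python
import collections


def get_unmapped_names(variables, ref_attrs=("bounds", "coordinates")):
    """Map each missing name to the set of variables that refer to it.

    `variables` maps variable names to their attribute dicts.
    """
    missing = collections.defaultdict(set)
    for var_name, attrs in variables.items():
        for attr in ref_attrs:
            # These attributes hold whitespace-separated variable names.
            for ref in str(attrs.get(attr, "")).split():
                if ref not in variables:
                    missing[ref].add(var_name)
    return dict(missing)


variables = {
    "tas": {"coordinates": "height lat"},
    "lat": {"bounds": "lat_bnds"},
    "lon": {"bounds": "lon_bnds"},
    "lon_bnds": {},
}
unmapped = get_unmapped_names(variables)
```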

guess_attr(attr_desc, attr_name, options, default=None, comparison_func=None)¶

Select and return the element of options equal to attr_name. If none are equal, try a case-insensitive string match.

Parameters

attr_desc (str) – Description of the attribute (only used for log messages.)
attr_name (str) – Expected name of the attribute.
options (iterable of str) – Attribute names that are present in the data.
default (str, default None) – If supplied, default value to return if no match.
comparison_func (optional, default None) – String comparison function to use.

Raises

KeyError – if no element of options can be coerced to match attr_name.

Returns

Element of options matching attr_name.
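
A minimal sketch of the matching order (exact match first, then case-insensitive) follows. The signature is simplified by dropping attr_desc and comparison_func, and treating multiple case-insensitive matches as "no match" is an assumption of this sketch.

```python
def guess_attr(attr_name, options, default=None):
    """Return the element of options matching attr_name, or raise KeyError."""
    options = list(options)
    if attr_name in options:
        return attr_name
    # Fall back to a case-insensitive comparison.
    matches = [opt for opt in options if opt.lower() == attr_name.lower()]
    if len(matches) == 1:
        return matches[0]
    if default is not None:
        return default
    raise KeyError(attr_name)
```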

normalize_attr(new_attr_d, d, key_name, key_startswith=None)¶

Sets the value in dict d corresponding to the key key_name.

If key_name is in d, no changes are made. If key_name is not in d, we check possible nonstandard representations of the key (case-insensitive match via guess_attr() and whether the key starts with the string key_startswith.) If no match is found for key_name, its value is set to the sentinel value ATTR_NOT_FOUND.

Parameters

new_attr_d (dict) – dict to store all found attributes. We don’t change attributes on d here, since that can interfere with xarray.decode_cf(), but instead modify this dict in place and pass it to restore_attrs() so they can be set once that’s done.
d (dict) – dict of Dataset attributes, whose keys are to be searched for key_name.
key_name (str) – Expected name of the key.
key_startswith (optional, str) – If provided and if key_name isn’t found in d, a key starting with this string will be accepted instead.

normalize_calendar(attr_d)¶: Finds the calendar attribute, if present, and normalizes it to one of the values in the CF standard before xarray.decode_cf() decodes the time axis.

normalize_dependent_var(var, ds)¶: Use heuristics to determine the name of the dependent variable from among all the variables in the Dataset ds, if the name doesn’t match the value we expect in our_var.

normalize_metadata(var, ds)¶: Normalize name, standard_name and units attributes after decode_cf and cf_xarray setup steps and metadata dict has been restored, since those methods don’t touch these metadata attributes.

normalize_pre_decode(ds)¶: Initial munging of xarray Dataset attribute dicts, before any parsing by xarray.decode_cf() or the cf_xarray accessor.

normalize_standard_name(new_attr_d, attr_d)¶: Method for munging standard_name attribute prior to parsing.

normalize_unit(new_attr_d, attr_d)¶: Hook to convert unit strings to values that are correctly parsed by cfunits/UDUnits2. Currently we handle the case where “mb” is interpreted as “millibarn”, a unit of area (see UDUnits mailing list.) New cases of incorrectly parsed unit strings can be added here as they are discovered.
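
The "mb" case can be handled with a small lookup table, as in the sketch below. The function name and the table are hypothetical; only the "mb" entry comes from the description above, and the replacement string "millibar" is this sketch's choice.

```python
# Unit strings known to be misparsed, mapped to unambiguous replacements.
UNIT_FIXES = {"mb": "millibar"}


def normalize_unit_string(unit_str):
    """Return a corrected unit string, or the input unchanged if no fix applies."""
    return UNIT_FIXES.get(unit_str.strip(), unit_str)
```

New cases would be added as further entries in the table.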

parse(var, ds)¶

Calls the above metadata parsing functions in the intended order; intended to be called immediately after the Dataset ds is opened.

Note

decode_cf=False should be passed to the xarray open_dataset method, since that parsing is done here instead.

1. Call normalize_pre_decode() to do basic cleaning of metadata attributes.
2. Call xarray’s decode_cf, using cftime to decode CF-compliant date/time axes.
3. Assign axis labels to dimension coordinates using cf_xarray.
4. Verify that calendar is set correctly (check_calendar()).
5. Reconcile metadata in var and ds (reconcile_* methods).
6. Verify that the name, standard_name and units for the variable and its coordinates are set correctly (check_* methods).

Parameters

var (VarlistEntry) – VarlistEntry describing metadata we expect to find in ds.
ds (Dataset) – xarray Dataset of locally downloaded model data.

Returns

ds, with data unchanged but metadata normalized to expected values. Except in specific cases, attributes of var are updated to reflect the ‘ground truth’ of data in ds.

reconcile_attr(our_var, ds_var, our_attr_name, ds_attr_name=None, **kwargs)¶: Compare attribute of a DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var).

reconcile_coord_bounds(our_coord, ds, ds_coord_name)¶: Reconcile standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for the bounds on the dimension coordinate our_coord.

reconcile_dimension_coords(our_var, ds)¶

Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all dimension coordinates used by our_var.

Parameters

our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.

reconcile_name(our_var, ds_var_name, overwrite_ours=None)¶: Reconcile the name of the variable between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).

reconcile_names(our_var, ds, ds_var_name, overwrite_ours=None)¶

Reconcile the name and standard_name attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var).

Parameters

our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.
ds_var_name (str) – Name of the variable in ds we expect to correspond to our_var.
overwrite_ours (bool, default False) – If True, always update the name of our_var to what’s found in ds.

reconcile_scalar_coords(our_var, ds)¶

Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all scalar coordinates used by our_var.

Parameters

our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.

reconcile_scalar_value_and_units(our_var, ds_var)¶: Compare scalar coordinate value of a DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var). If there’s a discrepancy, log an error but change the entry in our_var.

reconcile_time_units(our_var, ds_var)¶

Special case of reconcile_units() for the time variable. In normal operation we don’t know (or need to know) the calendar or reference date (for time units of the form ‘days since 1970-01-01’), so it’s OK to set these from the dataset.

Parameters

our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds_var – xarray DataArray.

reconcile_units(our_var, ds_var)¶

Reconcile the units attribute between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).

Parameters

our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds_var – xarray DataArray.

reconcile_variable(var, ds)¶: Top-level method for the MDTF-specific dataset validation: attempts to reconcile name, standard_name and units attributes for the variable and coordinates in translated_var (our expectation, based on the DataSource’s naming convention) with attributes actually present in the Dataset ds.

restore_attrs_backup(ds)¶: xarray.decode_cf() and other functions appear to un-set some of the attributes defined in the netCDF file. Restore them from the backups made in munge_ds_attrs(), but only if the attribute was deleted.

class src.data_sources.MetadataRewritePreprocessor(*args, **kwargs)[source]¶

Bases: src.preprocessor.DaskMultiFilePreprocessor

Subclass DaskMultiFilePreprocessor in order to look up and apply edits to metadata that are stored in ExplicitFileDataSourceConfigEntry objects in the config_by_id attribute of ExplicitFileDataSource.

__init__(data_mgr, pod)¶: Initialize self. See help(type(self)) for accurate signature.

clean_nc_var_encoding(var, name, ds_obj)¶

Clean up the attrs and encoding dicts of obj prior to writing to a netCDF file, as a workaround for the following known issues:

- Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.
- Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.
- ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.
- xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.
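
Two of these workarounds (dropping sentinel-valued attributes and attributes duplicated in the encoding) can be sketched on plain dicts. The sentinel object and the function name are hypothetical, and raising ValueError when an attribute disagrees with its encoding is an assumption of this sketch; the method's actual handling of that case is not described here.

```python
ATTR_NOT_FOUND = object()  # hypothetical stand-in for the framework's sentinel


def clean_attrs(attrs, encoding):
    """Return a copy of attrs that is safe to write alongside encoding."""
    cleaned = {}
    for key, value in attrs.items():
        if value is ATTR_NOT_FOUND:
            continue  # placeholder for a missing attribute: do not write it
        if key in encoding:
            if encoding[key] != value:
                raise ValueError(f"Attribute {key!r} conflicts with its encoding.")
            continue  # same name and value as the encoding: drop the duplicate
        cleaned[key] = value
    return cleaned
```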

clean_output_attrs(var, ds)¶: Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

edit_request(data_mgr, pod)¶: Edit POD’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

load_ds(var)¶: Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

log_history_attr(var, ds)¶: Update history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. Out of simplicity, events are written in chronological rather than reverse chronological order.
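
Appending records in chronological order can be sketched as below, with a plain dict standing in for the Dataset's attrs. The function name is hypothetical, and joining entries with newlines is an assumption of this sketch.

```python
def append_history(attrs, records):
    """Append log records to attrs['history'], oldest first."""
    lines = [attrs["history"]] if attrs.get("history") else []
    lines.extend(records)
    attrs["history"] = "\n".join(lines)
    return attrs
```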

property open_dataset_kwargs¶: Arguments passed to xarray open_dataset() and open_mfdataset().

process(var)¶: Top-level wrapper for doing all preprocessing of data files.

process_ds(var, ds)¶: Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

read_dataset(var)¶: Open multi-file Dataset specified by the local_data attribute of var, wrapping xarray open_mfdataset().

read_one_file(var, path_list)¶

property save_dataset_kwargs¶: Arguments passed to xarray to_netcdf().

setup(data_mgr, pod)¶: Method to do additional configuration immediately before process() is called on each variable for pod.

write_dataset(var, ds)¶: Writes processed Dataset ds to location specified by dest_path attribute of var, using xarray to_netcdf().

write_ds(var, ds)¶: Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().

class src.data_sources.GlobbedDataFile(first_arg=None, *args, **kwargs)[source]¶

Bases: object

Applies a trivial regex to the paths returned by the glob.

dummy_group: str = sentinel.Mandatory¶

remote_path: str = sentinel.Mandatory¶

__init__(dummy_group: str = sentinel.Mandatory, remote_path: str = sentinel.Mandatory) → None ¶: Initialize self. See help(type(self)) for accurate signature.

__post_init__(*args, **kwargs)¶

classmethod from_string(str_, *args)¶: Create an object instance from a string representation str_. Used by regex_dataclass() for parsing field values and automatic type coercion.

class src.data_sources.ExplicitFileDataSourceConfigEntry(glob_id: src.util.basic.MDTF_ID = None, pod_name: str = sentinel.Mandatory, name: str = sentinel.Mandatory, glob: str = sentinel.Mandatory, var_name: str = '', metadata: dict = <factory>, _has_user_metadata: bool = None)[source]¶

Bases: object

glob_id: src.util.basic.MDTF_ID = None¶

pod_name: str = sentinel.Mandatory¶

name: str = sentinel.Mandatory¶

glob: str = sentinel.Mandatory¶

var_name: str = ''¶

metadata: dict¶

__post_init__()[source]¶

property full_name¶

classmethod from_struct(pod_name, var_name, v_data)[source]¶

to_file_glob_tuple()[source]¶

__init__(glob_id: src.util.basic.MDTF_ID = None, pod_name: str = sentinel.Mandatory, name: str = sentinel.Mandatory, glob: str = sentinel.Mandatory, var_name: str = '', metadata: dict = <factory>, _has_user_metadata: bool = None) → None ¶: Initialize self. See help(type(self)) for accurate signature.

class src.data_sources.ExplicitFileDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = '', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, config_file: str = None)[source]¶

Bases: src.data_manager.DataSourceAttributesBase

config_file: str = None¶

__post_init__(log=<Logger>)[source]¶: Validate user input.

CASENAME = sentinel.Mandatory¶

CASE_ROOT_DIR = ''¶

FIRSTYR = sentinel.Mandatory¶

LASTYR = sentinel.Mandatory¶

__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = '', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, config_file: str = None) → None ¶: Initialize self. See help(type(self)) for accurate signature.

convention = ''¶

log = <Logger src.data_manager (WARNING)>¶

class src.data_sources.ExplicitFileDataSource(*args, **kwargs)[source]¶

Bases: src.data_manager.OnTheFlyGlobQueryMixin, src.data_manager.LocalFetchMixin, src.data_manager.DataframeQueryDataSourceBase

DataSource for handling data in a regular directory hierarchy on a locally mounted filesystem. Assumes data for each variable may be split into several files according to date, with the dates present in their filenames.

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)¶

expt_key_cols = ()¶

expt_cols = ()¶

property CATALOG_DIR¶: Placeholder class used in the definition of the abstract_attribute() decorator.

parse_config(config_d)[source]¶: Parse contents of JSON config file into a list of ExplicitFileDataSourceConfigEntry objects.

iter_globs()[source]¶: Iterator returning FileGlobTuple instances. The generated catalog contains the union of the files found by each of the globs.
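
The "union of the files found by each of the globs" can be sketched with the standard library. The function name and arguments are hypothetical, and FileGlobTuple is not modeled; each pattern is simply resolved relative to a root directory.

```python
import glob
import os


def files_from_globs(root_dir, patterns):
    """Return the sorted union of paths matched by any of the glob patterns."""
    found = set()
    for pattern in patterns:
        # A set drops files matched by more than one pattern.
        found.update(glob.glob(os.path.join(root_dir, pattern)))
    return sorted(found)
```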

__post_init__()¶

property active¶

property all_columns¶

check_group_daterange(group_df, expt_key=None, log=<Logger>)¶: Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.

child_deactivation_handler(child, child_exc)¶: When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.

child_status_update(exc=None)¶

close_log_file(log=True)¶

data_key(value, expt_key=None, status=None)¶: Constructor for an instance of DataKeyBase that’s used by this DataSource.

deactivate(exc, level=None)¶

property df¶: Synonym for the DataFrame containing the catalog.

property failed¶

fetch_data()¶

fetch_dataset(var, d_key)¶: Fetches data corresponding to data_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).

property full_name¶

generate_catalog()¶: Build the catalog from the files returned from the set of globs provided by rel_path_globs().

get_expt_key(scope, obj, parent_id=None)¶

Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.

If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.

init_extra_log_handlers()¶

init_log(log_dir, fmt=None)¶

is_fetch_necessary(d_key, var=None)¶

iter_children(child_type=None, status=None, status_neq=None)¶

Generator iterating over child objects associated with this object.

Parameters

status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.

iter_files(path_glob)¶: Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.

iter_vars(active=None, pod_active=None)¶

Iterator over all VarlistEntries (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.

Parameters

active –
bool or None, default None. Selects the subset of VarlistEntries returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries
  (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive
  VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.

iter_vars_only(active=None)¶: Convenience wrapper for iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.

name: str = sentinel.Mandatory¶

post_fetch_hook(vars)¶: Called after fetching each batch of query results.

post_query_and_fetch_hook()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

post_query_hook(vars)¶: Called after select_experiment(), after each query of a new batch of variables.

pre_fetch_hook(vars)¶: Called before fetching each batch of query results.

pre_query_and_fetch_hook()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

pre_query_hook(vars)¶: Called before querying the presence of a new batch of variables.

preprocess_data()¶: Hook to run the preprocessing function on all variables.

query_and_fetch_cleanup(signum=None, frame=None)¶: Called if framework is terminated abnormally. Not called during normal exit.

query_data()¶

query_dataset(var)¶: Find all rows of the catalog matching relevant attributes of the DataSource and of the variable (VarlistEntry). Group these by experiments, and for each experiment make the corresponding DataFrameDataKey and store it in var’s data attribute. Specifically, the data attribute is a dict mapping experiments (labeled by experiment_keys) to data found for that variable by this query (labeled by the DataKeys).

property remote_data_col¶: Name of the column in the catalog containing the path to the remote data file.

request_data()¶: Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.

resolve_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

resolve_pod_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

resolve_var_expt(expt_df, obj)¶: Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.

select_data()¶

set_experiment()¶: Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.

set_expt_key(obj, expt_key)¶

setup()¶

setup_fetch()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

setup_pod(pod)¶

Update POD with information that only becomes available after DataManager and Diagnostic have been configured (i.e., known only at runtime, not from settings.jsonc).

Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.

setup_query()¶: Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.

setup_var(pod, v)¶

Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (i.e., known only at runtime, not from settings.jsonc).

Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.

status: src.core.ObjectStatus = 1¶

tear_down_fetch()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

tear_down_query()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

variable_dest_path(pod, var)¶: Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.

class src.data_sources.CMIP6DataSourceAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, activity_id: str = '', institution_id: str = '', source_id: str = '', experiment_id: str = '', variant_label: str = '', grid_label: str = '', version_date: str = '', model: dataclasses.InitVar = '', experiment: dataclasses.InitVar = '')[source]¶

Bases: src.data_manager.DataSourceAttributesBase

convention: str = 'CMIP'¶

activity_id: str = ''¶

institution_id: str = ''¶

source_id: str = ''¶

experiment_id: str = ''¶

variant_label: str = ''¶

grid_label: str = ''¶

version_date: str = ''¶

model: dataclasses.InitVar = ''¶

experiment: dataclasses.InitVar = ''¶

CATALOG_DIR: str¶

__post_init__(log=<Logger>, model=None, experiment=None)[source]¶

CASENAME = sentinel.Mandatory¶

CASE_ROOT_DIR = ''¶

FIRSTYR = sentinel.Mandatory¶

LASTYR = sentinel.Mandatory¶

__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, activity_id: str = '', institution_id: str = '', source_id: str = '', experiment_id: str = '', variant_label: str = '', grid_label: str = '', version_date: str = '', model: dataclasses.InitVar = '', experiment: dataclasses.InitVar = '') → None ¶: Initialize self. See help(type(self)) for accurate signature.

log = <Logger src.data_manager (WARNING)>¶

class src.data_sources.CMIP6ExperimentSelectionMixin[source]¶

Bases: object

Encapsulate attributes and logic used for CMIP6 experiment disambiguation so that it can be reused in DataSources with different parents (e.g., different FetchMixins for different data fetch protocols).

Assumes inheritance from DataframeQueryDataSourceBase – should enforce this.

property CATALOG_DIR¶

resolve_expt(df, obj)[source]¶

Disambiguate experiment attributes that must be the same for all variables in this case:

- If variant_id (realization, forcing, etc.) not specified by user, choose the lowest-numbered variant
- If version_date not set by user, choose the most recent revision
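
A minimal sketch of these two case-level tiebreakers, using plain dicts in place of the DataFrame the method receives. The helper names, the column names variant_label and version_date, and the reading of "lowest-numbered" as a numeric comparison of the r/i/p/f indices are assumptions; the framework's heuristic may differ in detail.

```python
import re

# CMIP6 variant labels look like "r1i1p1f1"; compare their indices as integers.
_VARIANT_RE = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def variant_sort_key(label):
    """Return the (r, i, p, f) indices of a variant label as a tuple of ints."""
    match = _VARIANT_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized variant label: {label!r}")
    return tuple(int(index) for index in match.groups())


def pick_case_experiment(rows, variant_label="", version_date=""):
    """Filter rows down to one variant and one version (hypothetical helper)."""
    rows = list(rows)
    if not rows:
        raise ValueError("no candidate experiments")
    if not variant_label:
        # Numeric comparison, so "r2i1p1f1" sorts before "r10i1p1f1".
        variant_label = min((r["variant_label"] for r in rows), key=variant_sort_key)
    rows = [r for r in rows if r["variant_label"] == variant_label]
    if not version_date:
        # Fixed-width YYYYMMDD strings sort chronologically as plain strings.
        version_date = max((r["version_date"] for r in rows), default="")
    return [r for r in rows if r["version_date"] == version_date]
```

A plain string sort would rank "r10i1p1f1" ahead of "r2i1p1f1", which is why the sketch compares the indices as integers.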

resolve_pod_expt(df, obj)[source]¶

Disambiguate experiment attributes that must be the same for all variables for each POD:

- Prefer regridded to native-grid data (questionable)
- If multiple regriddings available, pick the lowest-numbered one
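
The POD-level preference can be sketched as below, assuming CMIP6 grid labels of the form "gn" (native grid) and "gr", "gr1", "gr2", ... (regridded). Treating a bare "gr" as regridding number 0 is an assumption of this sketch, as is the helper name pick_grid_label.

```python
import re

# Matches "gr", "gr1", "gr2", ...; other labels (e.g. "gn") are not regriddings here.
_REGRID_RE = re.compile(r"gr(\d*)")


def pick_grid_label(labels):
    """Prefer the lowest-numbered regridded label; otherwise fall back."""
    labels = set(labels)
    if not labels:
        raise ValueError("no grid labels to choose from")
    regridded = {}
    for label in labels:
        match = _REGRID_RE.fullmatch(label)
        if match:
            # A bare "gr" has no number; this sketch ranks it as 0.
            regridded[label] = int(match.group(1) or 0)
    if regridded:
        return min(regridded, key=regridded.get)
    # No regridded data available: fall back to the first remaining label.
    return min(labels)
```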

resolve_var_expt(df, obj)[source]¶

Disambiguate arbitrary experiment attributes on a per-variable basis:

- If the same variable appears in multiple MIP tables, select the first MIP table in alphabetical order.
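
A short sketch of the per-variable rule, with a hypothetical helper pick_mip_table. It uses Python's default string ordering, which compares code points, so digits sort before uppercase letters and uppercase before lowercase. Whether the framework's "alphabetical order" is case-sensitive in the same way is not stated here.

```python
def pick_mip_table(table_ids):
    """Return the first MIP table name in Python's default string order."""
    table_ids = list(table_ids)
    if not table_ids:
        raise ValueError("no MIP tables to choose from")
    # Code-point order: "3hr" < "Amon" < "day".
    return min(table_ids)
```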

__init__()¶: Initialize self. See help(type(self)) for accurate signature.

class src.data_sources.CMIP6LocalFileDataSource(*args, **kwargs)[source]¶

Bases: src.data_sources.CMIP6ExperimentSelectionMixin, src.data_manager.LocalFileDataSource

DataSource for handling model data named following the CMIP6 DRS and stored on a local filesystem.

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col='date_range')¶

property CATALOG_DIR¶: Placeholder class used in the definition of the abstract_attribute() decorator.

__post_init__()¶

property active¶

property all_columns¶

check_group_daterange(group_df, expt_key=None, log=<Logger>)¶: Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.
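
The three checks named above (sort, contiguity, coverage) can be sketched with standard-library dates. The framework has its own date classes in src.util.datelabel. This sketch assumes inclusive (start, end) pairs at daily resolution, and check_dateranges is a hypothetical helper name.

```python
from datetime import date, timedelta


def check_dateranges(file_ranges, query_range):
    """Return the ranges sorted by date, or raise ValueError if a check fails.

    file_ranges: iterable of inclusive (start, end) date pairs, one per file.
    query_range: inclusive (start, end) date pair requested by the user.
    """
    ordered = sorted(file_ranges)
    if not ordered:
        raise ValueError("no files found")
    # Contiguity: each file must start the day after the previous one ends.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start != prev_end + timedelta(days=1):
            raise ValueError(f"gap or overlap between {prev_end} and {next_start}")
    # Coverage: the files together must span the query date range.
    if ordered[0][0] > query_range[0] or ordered[-1][1] < query_range[1]:
        raise ValueError("files do not span the query date range")
    return ordered
```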

child_deactivation_handler(child, child_exc)¶: When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.

child_status_update(exc=None)¶

close_log_file(log=True)¶

data_key(value, expt_key=None, status=None)¶: Constructor for an instance of DataKeyBase that’s used by this DataSource.

deactivate(exc, level=None)¶

property df¶: Synonym for the DataFrame containing the catalog.

property failed¶

fetch_data()¶

fetch_dataset(var, d_key)¶: Fetches data corresponding to d_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).

property full_name¶

generate_catalog()¶: Crawl the directory hierarchy via iter_files() and return the set of found files as rows in a Pandas DataFrame.

get_expt_key(scope, obj, parent_id=None)¶

Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.

If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.

init_extra_log_handlers()¶

init_log(log_dir, fmt=None)¶

is_fetch_necessary(d_key, var=None)¶

iter_children(child_type=None, status=None, status_neq=None)¶

Generator iterating over child objects associated with this object.

Parameters

status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.
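
The precedence between status and status_neq can be illustrated with a standalone generator. The Status enum and its member names are invented stand-ins for src.core.ObjectStatus, Child is a hypothetical class, and the child_type parameter is omitted because its behavior is not documented above.

```python
import enum
from dataclasses import dataclass


class Status(enum.IntEnum):
    """Hypothetical stand-in for src.core.ObjectStatus; member names are invented."""
    ACTIVE = 1
    FAILED = 2


@dataclass
class Child:
    name: str
    status: Status


def iter_children(children, status=None, status_neq=None):
    """Yield children filtered by status; status takes precedence over status_neq."""
    for child in children:
        if status is not None:
            if child.status == status:
                yield child
        elif status_neq is not None:
            if child.status != status_neq:
                yield child
        else:
            yield child
```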

iter_files()¶: Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.
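
A rough sketch of the crawl-and-match pattern, assuming a hypothetical directory layout of dataset/frequency/variable.nc. The regex FILE_REGEX and the use of plain dicts are illustrative only; the framework yields instances of the class's _FileRegexClass instead.

```python
import os
import re

# Hypothetical layout: <dataset>/<frequency>/<variable>.nc
FILE_REGEX = re.compile(
    r"(?P<dataset>[^/]+)/(?P<frequency>[^/]+)/(?P<variable>[^/.]+)\.nc"
)


def iter_files(catalog_dir, regex=FILE_REGEX):
    """Yield the regex match groups for each matching file under catalog_dir."""
    for root, _dirs, files in os.walk(catalog_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), catalog_dir)
            # Normalize separators so the same regex works on any platform.
            match = regex.fullmatch(rel_path.replace(os.sep, "/"))
            if match:
                yield match.groupdict()
```

Each yielded dict corresponds to one catalog row, with the match groups as columns; non-matching files are skipped silently.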

iter_vars(active=None, pod_active=None)¶

Iterator over all VarlistEntries (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.

Parameters

active – bool or None, default None. Selects the subset of VarlistEntries returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.

iter_vars_only(active=None)¶: Convenience wrapper for iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.

name: str = sentinel.Mandatory¶

post_fetch_hook(vars)¶: Called after fetching each batch of query results.

post_query_and_fetch_hook()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

post_query_hook(vars)¶: Called after select_experiment(), after each query of a new batch of variables.

pre_fetch_hook(vars)¶: Called before fetching each batch of query results.

pre_query_and_fetch_hook()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

pre_query_hook(vars)¶: Called before querying the presence of a new batch of variables.

preprocess_data()¶: Hook to run the preprocessing function on all variables.

query_and_fetch_cleanup(signum=None, frame=None)¶: Called if framework is terminated abnormally. Not called during normal exit.

query_data()¶

query_dataset(var)¶: Find all rows of the catalog matching relevant attributes of the DataSource and of the variable (VarlistEntry). Group these by experiments, and for each experiment make the corresponding DataFrameDataKey and store it in var’s data attribute. Specifically, the data attribute is a dict mapping experiments (labeled by experiment_keys) to data found for that variable by this query (labeled by the DataKeys).

property remote_data_col¶: Name of the column in the catalog containing the path to the remote data file.

request_data()¶: Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.

resolve_expt(df, obj)¶

Disambiguate experiment attributes that must be the same for all variables in this case:

- If variant_id (realization, forcing, etc.) not specified by user, choose the lowest-numbered variant
- If version_date not set by user, choose the most recent revision

resolve_pod_expt(df, obj)¶

Disambiguate experiment attributes that must be the same for all variables for each POD:

- Prefer regridded to native-grid data (questionable)
- If multiple regriddings available, pick the lowest-numbered one

resolve_var_expt(df, obj)¶

Disambiguate arbitrary experiment attributes on a per-variable basis:

- If the same variable appears in multiple MIP tables, select the first MIP table in alphabetical order.

select_data()¶

set_experiment()¶: Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.

set_expt_key(obj, expt_key)¶

setup()¶

setup_fetch()¶: Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.

setup_pod(pod)¶

Update POD with information that only becomes available after DataManager and Diagnostic have been configured (i.e., known only at runtime, not from settings.jsonc).

Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.

setup_query()¶: Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.

setup_var(pod, v)¶

Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (i.e., known only at runtime, not from settings.jsonc).

Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.

status: src.core.ObjectStatus = 1¶

tear_down_fetch()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

tear_down_query()¶: Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.

variable_dest_path(pod, var)¶: Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.
