src.data_sources module¶
Implementation classes for model data query/fetch functionality, selected by
the user via --data_manager; see Model data sources and
Data layer: Overview.
-
class
src.data_sources.SampleDataFile(first_arg=None, *args, **kwargs)[source]¶ Bases:
object
Dataclass describing catalog entries for sample model data files.
-
frequency: src.util.datelabel.DateFrequency = sentinel.Mandatory¶
-
__init__(sample_dataset: str = sentinel.Mandatory, frequency: src.util.datelabel.DateFrequency = sentinel.Mandatory, variable: str = sentinel.Mandatory, remote_path: str = sentinel.Mandatory) → None¶ Initialize self. See help(type(self)) for accurate signature.
-
__post_init__(*args, **kwargs)¶
-
classmethod
from_string(str_, *args)¶ Create an object instance from a string representation str_. Used by
regex_dataclass() for parsing field values and automatic type coercion.
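The from_string() pattern can be illustrated with a minimal sketch. It does not use the framework's regex_dataclass(); the class name SampleFileSketch, the assumed path layout and the regex are all hypothetical and chosen only for illustration.

```python
import dataclasses
import re


@dataclasses.dataclass
class SampleFileSketch:
    """Hypothetical stand-in for a catalog entry; not the framework's class."""
    sample_dataset: str
    frequency: str
    variable: str
    remote_path: str

    # Assumed layout: <sample_dataset>/<frequency>/<sample_dataset>.<variable>.<frequency>.nc
    # (un-annotated, so the dataclass machinery does not treat it as a field)
    _regex = re.compile(
        r"(?P<sample_dataset>[^/]+)/(?P<frequency>\w+)/"
        r"(?P=sample_dataset)\.(?P<variable>\w+)\.(?P=frequency)\.nc"
    )

    @classmethod
    def from_string(cls, str_):
        """Build an instance from a relative path, or raise ValueError."""
        match = cls._regex.fullmatch(str_)
        if match is None:
            raise ValueError(f"Can't parse {str_!r}")
        # Named groups map directly onto the dataclass fields.
        return cls(remote_path=str_, **match.groupdict())
```

Paths that do not match the regex raise ValueError, which lets a caller skip them when building a catalog.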
-
-
class
src.data_sources.SampleDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, sample_dataset: str = '')[source]¶ Bases:
src.data_manager.DataSourceAttributesBase
Data-source-specific attributes for the DataSource providing sample model data.
-
CASENAME= sentinel.Mandatory¶
-
CASE_ROOT_DIR= ''¶
-
FIRSTYR= sentinel.Mandatory¶
-
LASTYR= sentinel.Mandatory¶
-
__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, sample_dataset: str = '') → None¶ Initialize self. See help(type(self)) for accurate signature.
-
log= <Logger src.data_manager (WARNING)>¶
-
-
class
src.data_sources.SampleLocalFileDataSource(*args, **kwargs)[source]¶ Bases:
src.data_manager.SingleLocalFileDataSource
DataSource for handling POD sample model data stored on a local filesystem.
-
col_spec= DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)¶
-
property
CATALOG_DIR¶ Placeholder class used in the definition of the
abstract_attribute() decorator.
-
__post_init__()¶
-
property
active¶
-
property
all_columns¶
-
check_group_daterange(group_df, expt_key=None, log=<Logger>)¶ Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.
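The contiguity and coverage checks described above can be sketched as follows. This is a simplified illustration, assuming each file's date range is an inclusive pair of integer years; the framework uses its own date range classes, and the function name check_date_ranges is hypothetical.

```python
def check_date_ranges(file_ranges, query_range):
    """Sort (start_year, end_year) pairs and validate them.

    Raises ValueError if the files leave a gap (or overlap), or if together
    they do not span query_range. Returns the sorted list of ranges.
    """
    ordered = sorted(file_ranges)
    if not ordered:
        raise ValueError("No files found.")
    # Each file must start in the year right after the previous file ends.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start != prev_end + 1:
            raise ValueError(
                f"Files are not contiguous between {prev_end} and {next_start}."
            )
    # The combined range must cover the requested analysis period.
    if ordered[0][0] > query_range[0] or ordered[-1][1] < query_range[1]:
        raise ValueError("Files don't span the query date range.")
    return ordered
```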
-
child_deactivation_handler(child, child_exc)¶ When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.
-
child_status_update(exc=None)¶
-
close_log_file(log=True)¶
-
data_key(value, expt_key=None, status=None)¶ Constructor for an instance of
DataKeyBase that’s used by this DataSource.
-
deactivate(exc, level=None)¶
-
property
df¶ Synonym for the DataFrame containing the catalog.
-
property
failed¶
-
fetch_data()¶
-
fetch_dataset(var, d_key)¶ Fetches data corresponding to data_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).
-
property
full_name¶
-
generate_catalog()¶ Crawl the directory hierarchy via
iter_files() and return the set of found files as rows in a Pandas DataFrame.
-
get_expt_key(scope, obj, parent_id=None)¶ Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.
If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.
-
init_extra_log_handlers()¶
-
init_log(log_dir, fmt=None)¶
-
is_fetch_necessary(d_key, var=None)¶
-
iter_children(child_type=None, status=None, status_neq=None)¶ Generator iterating over child objects associated with this object.
- Parameters
status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.
-
iter_files()¶ Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.
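A generator of this kind can be sketched with the standard library alone. The version below is an illustration only: iter_matching_files is a hypothetical name, the regex is supplied by the caller, and plain dicts of match groups stand in for _FileRegexClass instances.

```python
import os
import re


def iter_matching_files(catalog_dir, regex):
    """Yield the named match groups for each file under catalog_dir whose
    relative path fully matches regex; non-matching files are skipped."""
    pattern = re.compile(regex)
    for root, _, files in os.walk(catalog_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), catalog_dir)
            # Normalize separators so one regex works on any platform.
            rel_path = rel_path.replace(os.sep, "/")
            match = pattern.fullmatch(rel_path)
            if match:
                yield match.groupdict()
```

Collecting the yielded dicts into a list gives the rows from which a catalog table could be built.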
-
iter_vars(active=None, pod_active=None)¶ Iterator over all VarlistEntrys (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.
- Parameters
active – bool or None, default None. Selects the subset of VarlistEntrys which are returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.
-
iter_vars_only(active=None)¶ Convenience wrapper for
iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.
-
post_fetch_hook(vars)¶ Called after fetching each batch of query results.
-
post_query_and_fetch_hook()¶ Called once, after the iterative
request_data() process ends. Use to, e.g., close database or remote filesystem connections.
-
post_query_hook(vars)¶ Called after select_experiment(), after each query of a new batch of variables.
-
pre_fetch_hook(vars)¶ Called before fetching each batch of query results.
-
pre_query_and_fetch_hook()¶ Called once, before the iterative
request_data() process starts. Use to, e.g., initialize database or remote filesystem connections.
-
pre_query_hook(vars)¶ Called before querying the presence of a new batch of variables.
-
preprocess_data()¶ Hook to run the preprocessing function on all variables.
-
query_and_fetch_cleanup(signum=None, frame=None)¶ Called if framework is terminated abnormally. Not called during normal exit.
-
query_data()¶
-
query_dataset(var)¶ Verify that only a single file was found from each experiment.
-
property
remote_data_col¶ Name of the column in the catalog containing the path to the remote data file.
-
request_data()¶ Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.
-
resolve_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
resolve_pod_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
resolve_var_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
select_data()¶
-
set_experiment()¶ Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.
-
set_expt_key(obj, expt_key)¶
-
setup()¶
-
setup_fetch()¶ Called once, before the iterative
request_data() process starts. Use to, e.g., initialize database or remote filesystem connections.
-
setup_pod(pod)¶ Update POD with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.
-
setup_query()¶ Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.
-
setup_var(pod, v)¶ Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.
-
status: src.core.ObjectStatus = 1¶
-
tear_down_fetch()¶ Called once, after the iterative
request_data() process ends. Use to, e.g., close database or remote filesystem connections.
-
tear_down_query()¶ Called once, after the iterative
request_data() process ends. Use to, e.g., close database or remote filesystem connections.
-
variable_dest_path(pod, var)¶ Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.
-
-
class
src.data_sources.MetadataRewriteParser(data_mgr, pod)[source]¶ Bases:
src.xr_parser.DefaultDatasetParser
After loading and parsing the metadata on dataset ds but before applying the preprocessing functions, update attrs on ds with the new metadata values that were specified in ExplicitFileDataSource’s config file.
-
__init__(data_mgr, pod)[source]¶ Constructor.
- Parameters
data_mgr – DataSource instance calling the preprocessor.
pod (Diagnostic) – POD whose variables are being preprocessed.
-
setup(data_mgr, pod)[source]¶ Make a lookup table to map VarlistEntry IDs to the set of metadata that we need to alter.
If the user has provided the name of the variable used by the data files (via the var_name attribute), set that as the translated variable name. Otherwise, variables are untranslated, and we use the heuristics in xr_parser.DefaultDatasetParser.guess_dependent_var() to determine the name.
-
check_calendar(ds)¶ Checks the ‘calendar’ attribute has been set correctly for time-dependent data (assumes CF conventions).
Sets the “calendar” attr on the time coordinate, if it exists, in order to be read by the calendar property defined in the cf_xarray accessor.
-
check_ds_attrs(var, ds)¶ Final checking of xarray Dataset attribute dicts before starting functions in
src.preprocessor.
Only checks attributes on the dependent variable var and its coordinates: any other netCDF variables in the file are ignored.
-
check_metadata(ds_var, *attr_names)¶ Wrapper for
normalize_attr(), specialized to the case of getting a variable’s standard_name.
-
compare_attr(our_attr_tuple, ds_attr_tuple, comparison_func=None, fill_ours=True, fill_ds=False, overwrite_ours=None)¶ Worker function to compare two attributes (on our_var, the framework’s record, and on ds, the “ground truth” of the dataset) and update one in the event of disagreement.
This handles the special cases where the attribute isn’t defined on our_var or ds.
- Parameters
our_attr_tuple – tuple specifying the attribute on our_var
ds_attr_tuple – tuple specifying the same attribute on ds
comparison_func – function of two arguments to use to compare the attributes; defaults to __eq__.
fill_ours (bool) – If the attr on our_var is missing, fill it in with the value from ds.
fill_ds (bool) – If the attr on ds is missing, fill it in with the value from our_var.
overwrite_ours (bool) – Action to take if both attrs are defined but have different values:
- None (default): Update our_var if fill_ours is True, but in any case raise a MetadataEvent.
- True: Change our_var to match ds.
- False: Change ds to match our_var.
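The decision table above can be sketched with plain dicts standing in for our_var and ds. This is an illustration of the documented behavior under stated assumptions, not the framework's implementation: compare_attr_sketch and MetadataMismatch are hypothetical names (the latter stands in for MetadataEvent), and a missing attribute is assumed to be an absent key.

```python
class MetadataMismatch(Exception):
    """Hypothetical stand-in for the framework's MetadataEvent."""


def compare_attr_sketch(ours, ds, key, fill_ours=True, fill_ds=False,
                        overwrite_ours=None):
    """Reconcile ours[key] with ds[key], modifying the dicts in place."""
    our_val, ds_val = ours.get(key), ds.get(key)
    # Special cases: the attribute is missing on one or both sides.
    if our_val is None or ds_val is None:
        if our_val is None and ds_val is not None and fill_ours:
            ours[key] = ds_val
        elif ds_val is None and our_val is not None and fill_ds:
            ds[key] = our_val
        return
    if our_val == ds_val:
        return
    # Both defined but different: behavior depends on overwrite_ours.
    if overwrite_ours is None:
        if fill_ours:
            ours[key] = ds_val
        raise MetadataMismatch(f"{key}: expected {our_val!r}, found {ds_val!r}")
    if overwrite_ours:
        ours[key] = ds_val
    else:
        ds[key] = our_val
```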
-
static
get_unmapped_names(ds)¶ Get a dict whose keys are variable or attribute names referred to by variables in the Dataset ds, but not present in the dataset itself.
- Returns
Values of the dict are sets of names of variables in the dataset that referred to the missing name (keys).
- Return type
(dict)
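The shape of the returned dict can be illustrated with a small sketch. It assumes the dataset is represented as a plain dict mapping variable names to attribute dicts, and that references to other variables appear as space-separated names in the bounds and coordinates attributes; both the representation and the choice of attributes are assumptions made for illustration.

```python
import collections


def get_unmapped_names_sketch(variables, ref_attrs=("bounds", "coordinates")):
    """Map each missing name to the set of variables that refer to it."""
    missing = collections.defaultdict(set)
    for var_name, attrs in variables.items():
        for attr in ref_attrs:
            # An attribute may list several names separated by spaces.
            for ref in attrs.get(attr, "").split():
                if ref not in variables:
                    missing[ref].add(var_name)
    return dict(missing)
```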
-
guess_attr(attr_desc, attr_name, options, default=None, comparison_func=None)¶ Select and return the element of options equal to attr_name. If none are equal, try a case-insensitive string match.
- Parameters
attr_desc (str) – Description of the attribute (only used for log messages.)
attr_name (str) – Expected name of the attribute.
options (iterable of str) – Attribute names that are present in the data.
default (str, default None) – If supplied, default value to return if no match.
comparison_func (optional, default None) – String comparison function to use.
- Raises
KeyError – if no element of options can be coerced to match attr_name.
- Returns
Element of options matching attr_name.
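The two-stage lookup (exact match, then case-insensitive match) might look like the sketch below. The name guess_attr_sketch is hypothetical, logging and comparison_func are omitted, and treating several case-insensitive matches as no match is an assumption.

```python
def guess_attr_sketch(attr_name, options, default=None):
    """Return the element of options matching attr_name."""
    options = list(options)
    # Stage 1: exact match.
    if attr_name in options:
        return attr_name
    # Stage 2: case-insensitive match, accepted only if unambiguous.
    candidates = [opt for opt in options if opt.lower() == attr_name.lower()]
    if len(candidates) == 1:
        return candidates[0]
    if default is not None:
        return default
    raise KeyError(attr_name)
```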
-
normalize_attr(new_attr_d, d, key_name, key_startswith=None)¶ Sets the value in dict d corresponding to the key key_name.
If key_name is in d, no changes are made. If key_name is not in d, we check possible nonstandard representations of the key (case-insensitive match via guess_attr() and whether the key starts with the string key_startswith.) If no match is found for key_name, its value is set to the sentinel value ATTR_NOT_FOUND.
- Parameters
new_attr_d (dict) – dict to store all found attributes. We don’t change attributes on d here, since that can interfere with xarray.decode_cf(), but instead modify this dict in place and pass it to restore_attrs() so they can be set once that’s done.
d (dict) – dict of Dataset attributes, whose keys are to be searched for key_name.
key_name (str) – Expected name of the key.
key_startswith (optional, str) – If provided and if key_name isn’t found in d, a key starting with this string will be accepted instead.
-
normalize_calendar(attr_d)¶ Finds the calendar attribute, if present, and normalizes it to one of the values in the CF standard before xarray.decode_cf() decodes the time axis.
-
normalize_dependent_var(var, ds)¶ Use heuristics to determine the name of the dependent variable from among all the variables in the Dataset ds, if the name doesn’t match the value we expect in our_var.
-
normalize_metadata(var, ds)¶ Normalize name, standard_name and units attributes after decode_cf and cf_xarray setup steps and metadata dict has been restored, since those methods don’t touch these metadata attributes.
-
normalize_pre_decode(ds)¶ Initial munging of xarray Dataset attribute dicts, before any parsing by xarray.decode_cf() or the cf_xarray accessor.
-
normalize_standard_name(new_attr_d, attr_d)¶ Method for munging standard_name attribute prior to parsing.
-
normalize_unit(new_attr_d, attr_d)¶ Hook to convert unit strings to values that are correctly parsed by cfunits/UDUnits2. Currently we handle the case where “mb” is interpreted as “millibarn”, a unit of area (see UDUnits mailing list.) New cases of incorrectly parsed unit strings can be added here as they are discovered.
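A unit-string fix of this kind can be sketched as a token-wise substitution. The lookup table, the function name normalize_unit_sketch and the choice of "millibar" as the replacement spelling are assumptions for illustration; the framework's actual hook operates on attribute dicts.

```python
import re

# Assumed table of unit tokens that are commonly mis-parsed.
_UNIT_FIXES = {"mb": "millibar"}


def normalize_unit_sketch(unit_str):
    """Replace whole alphabetic tokens listed in _UNIT_FIXES."""
    def _fix(match):
        token = match.group(0)
        return _UNIT_FIXES.get(token, token)
    # Only complete alphabetic tokens are candidates, so "mbar" is untouched.
    return re.sub(r"[A-Za-z]+", _fix, unit_str.strip())
```

New problem cases would be handled by adding entries to the table.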
-
parse(var, ds)¶ Calls the above metadata parsing functions in the intended order; intended to be called immediately after the Dataset ds is opened.
Note
decode_cf=False should be passed to the xarray open_dataset method, since that parsing is done here instead.
- Call normalize_pre_decode() to do basic cleaning of metadata attributes.
- Call xarray’s decode_cf, using cftime to decode CF-compliant date/time axes.
- Assign axis labels to dimension coordinates using cf_xarray.
- Verify that calendar is set correctly (check_calendar()).
- Reconcile metadata in var and ds (reconcile_* methods).
- Verify that the name, standard_name and units for the variable and its coordinates are set correctly (check_* methods).
- Parameters
var (VarlistEntry) – VarlistEntry describing metadata we expect to find in ds.
ds (Dataset) – xarray Dataset of locally downloaded model data.
- Returns
ds, with data unchanged but metadata normalized to expected values. Except in specific cases, attributes of var are updated to reflect the ‘ground truth’ of data in ds.
-
reconcile_attr(our_var, ds_var, our_attr_name, ds_attr_name=None, **kwargs)¶ Compare attribute of a
DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var).
-
reconcile_coord_bounds(our_coord, ds, ds_coord_name)¶ Reconcile standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for the bounds on the dimension coordinate our_coord.
-
reconcile_dimension_coords(our_var, ds)¶ Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all dimension coordinates used by our_var.
- Parameters
our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.
-
reconcile_name(our_var, ds_var_name, overwrite_ours=None)¶ Reconcile the name of the variable between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).
-
reconcile_names(our_var, ds, ds_var_name, overwrite_ours=None)¶ Reconcile the name and standard_name attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var).
- Parameters
our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.
ds_var_name (str) – Name of the variable in ds we expect to correspond to our_var.
overwrite_ours (bool, default False) – If True, always update the name of our_var to what’s found in ds.
-
reconcile_scalar_coords(our_var, ds)¶ Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all scalar coordinates used by our_var.
- Parameters
our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds – xarray Dataset.
-
reconcile_scalar_value_and_units(our_var, ds_var)¶ Compare scalar coordinate value of a
DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var). If there’s a discrepancy, log an error but change the entry in our_var.
-
reconcile_time_units(our_var, ds_var)¶ Special case of
reconcile_units() for the time variable. In normal operation we don’t know (or need to know) the calendar or reference date (for time units of the form ‘days since 1970-01-01’), so it’s OK to set these from the dataset.
- Parameters
our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds_var – xarray DataArray.
-
reconcile_units(our_var, ds_var)¶ Reconcile the units attribute between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).
- Parameters
our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.
ds_var – xarray DataArray.
-
reconcile_variable(var, ds)¶ Top-level method for the MDTF-specific dataset validation: attempts to reconcile name, standard_name and units attributes for the variable and coordinates in translated_var (our expectation, based on the DataSource’s naming convention) with attributes actually present in the Dataset ds.
-
restore_attrs_backup(ds)¶ xarray.decode_cf() and other functions appear to un-set some of the attributes defined in the netCDF file. Restore them from the backups made in
munge_ds_attrs(), but only if the attribute was deleted.
-
-
class
src.data_sources.MetadataRewritePreprocessor(*args, **kwargs)[source]¶ Bases:
src.preprocessor.DaskMultiFilePreprocessor
Subclass DaskMultiFilePreprocessor in order to look up and apply edits to metadata that are stored in ExplicitFileDataSourceConfigEntry objects in the config_by_id attribute of ExplicitFileDataSource.
-
__init__(data_mgr, pod)¶ Initialize self. See help(type(self)) for accurate signature.
-
clean_nc_var_encoding(var, name, ds_obj)¶ Clean up the attrs and encoding dicts of obj prior to writing to a netCDF file, as a workaround for the following known issues:
- Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.
- Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.
- ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.
- xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.
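The four workarounds listed above can be sketched on plain dicts. This is an illustration under assumptions, not the framework's code: clean_attrs_sketch and the is_coordinate flag are hypothetical, ATTR_NOT_FOUND here is a local stand-in for the framework's sentinel, and reporting unequal values in a returned list is one possible way to act on the equality check.

```python
import math

ATTR_NOT_FOUND = object()  # hypothetical stand-in for the framework's sentinel


def clean_attrs_sketch(attrs, encoding, is_coordinate=False):
    """Return cleaned copies of (attrs, encoding) and a list of conflicting keys."""
    # Drop attributes that were never found in the file.
    attrs = {k: v for k, v in attrs.items() if v is not ATTR_NOT_FOUND}
    encoding = dict(encoding)

    # Unset _FillValue on coordinates, or when its value is NaN.
    fill = attrs.get("_FillValue", encoding.get("_FillValue"))
    fill_is_nan = isinstance(fill, float) and math.isnan(fill)
    if is_coordinate or fill_is_nan:
        attrs.pop("_FillValue", None)
        # In xarray, an encoding of _FillValue=None means "write no fill value".
        encoding["_FillValue"] = None

    # Remove attributes that share a name with an encoding key,
    # recording those whose values disagree.
    conflicts = []
    for key in list(attrs):
        if key in encoding:
            if attrs[key] != encoding[key]:
                conflicts.append(key)
            del attrs[key]
    return attrs, encoding, conflicts
```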
-
clean_output_attrs(var, ds)¶ Call
clean_nc_var_encoding() on all sets of attributes in the Dataset ds.
-
edit_request(data_mgr, pod)¶ Edit POD’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.
-
load_ds(var)¶ Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class
read_dataset().
-
log_history_attr(var, ds)¶ Update
history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.
-
property
open_dataset_kwargs¶ Arguments passed to xarray open_dataset() and open_mfdataset().
-
process(var)¶ Top-level wrapper for doing all preprocessing of data files.
-
process_ds(var, ds)¶ Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.
-
read_dataset(var)¶ Open multi-file Dataset specified by the
local_data attribute of var, wrapping xarray open_mfdataset().
-
read_one_file(var, path_list)¶
-
property
save_dataset_kwargs¶ Arguments passed to xarray to_netcdf().
-
setup(data_mgr, pod)¶ Method to do additional configuration immediately before
process() is called on each variable for pod.
-
write_dataset(var, ds)¶ Writes processed Dataset ds to location specified by
dest_path attribute of var, using xarray to_netcdf().
-
write_ds(var, ds)¶ Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class
write_dataset().
-
-
class
src.data_sources.GlobbedDataFile(first_arg=None, *args, **kwargs)[source]¶ Bases:
object
Applies a trivial regex to the paths returned by the glob.
-
__init__(dummy_group: str = sentinel.Mandatory, remote_path: str = sentinel.Mandatory) → None¶ Initialize self. See help(type(self)) for accurate signature.
-
__post_init__(*args, **kwargs)¶
-
classmethod
from_string(str_, *args)¶ Create an object instance from a string representation str_. Used by
regex_dataclass() for parsing field values and automatic type coercion.
-
-
class
src.data_sources.ExplicitFileDataSourceConfigEntry(glob_id: src.util.basic.MDTF_ID = None, pod_name: str = sentinel.Mandatory, name: str = sentinel.Mandatory, glob: str = sentinel.Mandatory, var_name: str = '', metadata: dict = <factory>, _has_user_metadata: bool = None)[source]¶ Bases:
object
-
glob_id: src.util.basic.MDTF_ID = None¶
-
property
full_name¶
-
-
class
src.data_sources.ExplicitFileDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = '', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, config_file: str = None)[source]¶ Bases:
src.data_manager.DataSourceAttributesBase
-
CASENAME= sentinel.Mandatory¶
-
CASE_ROOT_DIR= ''¶
-
FIRSTYR= sentinel.Mandatory¶
-
LASTYR= sentinel.Mandatory¶
-
__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = '', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, config_file: str = None) → None¶ Initialize self. See help(type(self)) for accurate signature.
-
convention= ''¶
-
log= <Logger src.data_manager (WARNING)>¶
-
-
class
src.data_sources.ExplicitFileDataSource(*args, **kwargs)[source]¶ Bases:
src.data_manager.OnTheFlyGlobQueryMixin, src.data_manager.LocalFetchMixin, src.data_manager.DataframeQueryDataSourceBase
DataSource for dealing with data in a regular directory hierarchy on a locally mounted filesystem. Assumes data for each variable may be split into several files according to date, with the dates present in their filenames.
-
col_spec= DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)¶
-
expt_key_cols= ()¶
-
expt_cols= ()¶
-
property
CATALOG_DIR¶ Placeholder class used in the definition of the
abstract_attribute() decorator.
-
parse_config(config_d)[source]¶ Parse contents of JSON config file into a list of ExplicitFileDataSourceConfigEntry objects.
-
iter_globs()[source]¶ Iterator returning
FileGlobTuple instances. The generated catalog contains the union of the files found by each of the globs.
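Taking the union of several globs can be sketched with the standard glob module. The function name files_from_globs is hypothetical, and plain path strings stand in for the FileGlobTuple instances and catalog rows used by the framework.

```python
import glob
import os


def files_from_globs(root, patterns):
    """Return the sorted union of paths under root matched by any pattern."""
    found = set()
    for pattern in patterns:
        # A set removes files that are matched by more than one glob.
        found.update(glob.glob(os.path.join(root, pattern)))
    return sorted(found)
```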
-
__post_init__()¶
-
property
active¶
-
property
all_columns¶
-
check_group_daterange(group_df, expt_key=None, log=<Logger>)¶ Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.
-
child_deactivation_handler(child, child_exc)¶ When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.
-
child_status_update(exc=None)¶
-
close_log_file(log=True)¶
-
data_key(value, expt_key=None, status=None)¶ Constructor for an instance of
DataKeyBase that’s used by this DataSource.
-
deactivate(exc, level=None)¶
-
property
df¶ Synonym for the DataFrame containing the catalog.
-
property
failed¶
-
fetch_data()¶
-
fetch_dataset(var, d_key)¶ Fetches data corresponding to data_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).
-
property
full_name¶
-
generate_catalog()¶ Build the catalog from the files returned from the set of globs provided by
rel_path_globs().
-
get_expt_key(scope, obj, parent_id=None)¶ Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.
If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.
-
init_extra_log_handlers()¶
-
init_log(log_dir, fmt=None)¶
-
is_fetch_necessary(d_key, var=None)¶
-
iter_children(child_type=None, status=None, status_neq=None)¶ Generator iterating over child objects associated with this object.
- Parameters
status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.
-
iter_files(path_glob)¶ Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.
-
iter_vars(active=None, pod_active=None)¶ Iterator over all VarlistEntrys (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.
- Parameters
active – bool or None, default None. Selects the subset of VarlistEntrys which are returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.
-
iter_vars_only(active=None)¶ Convenience wrapper for
iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.
-
post_fetch_hook(vars)¶ Called after fetching each batch of query results.
-
post_query_and_fetch_hook()¶ Called once, after the iterative
request_data() process ends. Use to, e.g., close database or remote filesystem connections.
-
post_query_hook(vars)¶ Called after select_experiment(), after each query of a new batch of variables.
-
pre_fetch_hook(vars)¶ Called before fetching each batch of query results.
-
pre_query_and_fetch_hook()¶ Called once, before the iterative
request_data() process starts. Use to, e.g., initialize database or remote filesystem connections.
-
pre_query_hook(vars)¶ Called before querying the presence of a new batch of variables.
-
preprocess_data()¶ Hook to run the preprocessing function on all variables.
-
query_and_fetch_cleanup(signum=None, frame=None)¶ Called if framework is terminated abnormally. Not called during normal exit.
-
query_data()¶
-
query_dataset(var)¶ Find all rows of the catalog matching relevant attributes of the DataSource and of the variable (VarlistEntry). Group these by experiments, and for each experiment make the corresponding DataFrameDataKey and store it in var’s data attribute. Specifically, the data attribute is a dict mapping experiments (labeled by experiment_keys) to data found for that variable by this query (labeled by the DataKeys).
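The grouping step can be sketched without pandas. In this simplified illustration the catalog is a list of dicts, an experiment key is a tuple of the values in the experiment columns, and a tuple of paths stands in for a DataFrameDataKey; the function name and the column names are assumptions.

```python
import collections


def group_by_experiment(rows, expt_cols, path_col="path"):
    """Map each experiment key to the sorted paths of the rows belonging to it."""
    groups = collections.defaultdict(list)
    for row in rows:
        # The experiment key is built from the values of the experiment columns.
        expt_key = tuple(row[col] for col in expt_cols)
        groups[expt_key].append(row[path_col])
    return {key: tuple(sorted(paths)) for key, paths in groups.items()}
```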
-
property
remote_data_col¶ Name of the column in the catalog containing the path to the remote data file.
-
request_data()¶ Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.
-
resolve_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
resolve_pod_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
resolve_var_expt(expt_df, obj)¶ Tiebreaker logic to resolve redundancies in experiments, to be specified by child classes.
-
select_data()¶
-
set_experiment()¶ Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.
-
set_expt_key(obj, expt_key)¶
-
setup()¶
-
setup_fetch()¶ Called once, before the iterative
request_data() process starts. Use to, e.g., initialize database or remote filesystem connections.
-
setup_pod(pod)¶ Update POD with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.
-
setup_query()¶ Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.
-
setup_var(pod, v)¶ Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.
-
status: src.core.ObjectStatus = 1¶
-
tear_down_fetch()¶ Called once, after the iterative
request_data() process ends. Use to, e.g., close database or remote filesystem connections.
-
tear_down_query()¶ Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.
-
variable_dest_path(pod, var)¶ Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.
-
-
class
src.data_sources.CMIP6DataSourceAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, activity_id: str = '', institution_id: str = '', source_id: str = '', experiment_id: str = '', variant_label: str = '', grid_label: str = '', version_date: str = '', model: dataclasses.InitVar = '', experiment: dataclasses.InitVar = '')[source]¶ Bases:
src.data_manager.DataSourceAttributesBase
-
model: dataclasses.InitVar = ''¶
-
experiment: dataclasses.InitVar = ''¶
-
CASENAME= sentinel.Mandatory¶
-
CASE_ROOT_DIR= ''¶
-
FIRSTYR= sentinel.Mandatory¶
-
LASTYR= sentinel.Mandatory¶
-
__init__(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, activity_id: str = '', institution_id: str = '', source_id: str = '', experiment_id: str = '', variant_label: str = '', grid_label: str = '', version_date: str = '', model: dataclasses.InitVar = '', experiment: dataclasses.InitVar = '') → None¶ Initialize self. See help(type(self)) for accurate signature.
-
log= <Logger src.data_manager (WARNING)>¶
-
-
class
src.data_sources.CMIP6ExperimentSelectionMixin[source]¶ Bases:
object
Encapsulate attributes and logic used for CMIP6 experiment disambiguation so that it can be reused in DataSources with different parents (eg. different FetchMixins for different data fetch protocols.)
Assumes inheritance from DataframeQueryDataSourceBase – should enforce this.
-
property
CATALOG_DIR¶
-
resolve_expt(df, obj)[source]¶ Disambiguate experiment attributes that must be the same for all variables in this case:
- If variant_id (realization, forcing, etc.) is not specified by the user, choose the lowest-numbered variant.
- If version_date is not set by the user, choose the most recent revision.
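The two tiebreakers above can be illustrated with a minimal sketch. It is not the framework's implementation: it assumes candidate experiments are plain dicts rather than DataFrame rows, that variants are labelled in the CMIP6 style (for example r2i1p1f1) under a variant_label key, and that version_date strings share one format so that string comparison orders them by date. The names variant_sort_key and pick_experiment are hypothetical.

```python
import re


def variant_sort_key(variant_label):
    """Turn a label such as 'r2i1p1f1' into (2, 1, 1, 1) for numeric comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", variant_label))


def pick_experiment(rows, variant_label="", version_date=""):
    """Pick one candidate row (hypothetical helper, not part of src.data_sources).

    Empty strings mean "not specified by the user", mirroring the defaults
    of the attributes dataclass.
    """
    if variant_label:
        rows = [r for r in rows if r["variant_label"] == variant_label]
    if version_date:
        rows = [r for r in rows if r["version_date"] == version_date]
    if not rows:
        raise ValueError("no experiment matches the requested attributes")
    # Tiebreaker 1: the lowest-numbered variant.
    lowest = min(rows, key=lambda r: variant_sort_key(r["variant_label"]))
    rows = [r for r in rows if r["variant_label"] == lowest["variant_label"]]
    # Tiebreaker 2: the most recent revision of that variant.
    return max(rows, key=lambda r: r["version_date"])
```

Comparing the parsed integers rather than the raw strings matters here: as strings, "r10i1p1f1" sorts before "r2i1p1f1".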
-
resolve_pod_expt(df, obj)[source]¶ Disambiguate experiment attributes that must be the same for all variables for each POD:
- Prefer regridded to native-grid data (questionable).
- If multiple regriddings are available, pick the lowest-numbered one.
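A rough sketch of this preference, under stated assumptions: grid labels follow the CMIP6 style in which "gn" is the native grid and "gr", "gr1", "gr2", ... are regridded versions, and an unnumbered "gr" is treated as number 0. Both the helper names and the handling of "gr" are illustrative choices, not taken from the framework.

```python
import re


def grid_sort_key(grid_label):
    """Sort regridded labels first, ordered by their number (hypothetical helper)."""
    match = re.fullmatch(r"gr(\d*)", grid_label)
    if match:
        # Assumption: a bare "gr" counts as regridding number 0.
        return (0, int(match.group(1) or 0))
    # Everything else (eg. the native grid "gn") ranks after regridded data.
    return (1, 0)


def pick_grid(grid_labels):
    """Return the preferred grid label from the available ones."""
    return min(grid_labels, key=grid_sort_key)
```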
-
resolve_var_expt(df, obj)[source]¶ Disambiguate arbitrary experiment attributes on a per-variable basis:
- If the same variable appears in multiple MIP tables, select the first MIP table in alphabetical order.
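As a small illustration of this rule, the sketch below picks the alphabetically first table name from a list. Whether the framework compares names case-sensitively is not stated here, so the case_sensitive switch and the function name first_mip_table are assumptions made for the example.

```python
def first_mip_table(table_ids, case_sensitive=True):
    """Return the first MIP table name in alphabetical order (hypothetical helper)."""
    if case_sensitive:
        # Plain string comparison: uppercase letters sort before lowercase.
        return min(table_ids)
    return min(table_ids, key=str.lower)
```

With plain string comparison a name such as "Eday" sorts ahead of "day"; a case-insensitive comparison reverses that.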
-
__init__()¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
src.data_sources.CMIP6LocalFileDataSource(*args, **kwargs)[source]¶ Bases:
src.data_sources.CMIP6ExperimentSelectionMixin, src.data_manager.LocalFileDataSource
DataSource for handling model data named following the CMIP6 DRS and stored on a local filesystem.
-
col_spec= DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col='date_range')¶
-
property
CATALOG_DIR¶ Placeholder class used in the definition of the abstract_attribute() decorator.
-
__post_init__()¶
-
property
active¶
-
property
all_columns¶
-
check_group_daterange(group_df, expt_key=None, log=<Logger>)¶ Sort the files found for each experiment by date, verify that the date ranges contained in the files are contiguous in time and that the date range of the files spans the query date range.
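The check described above can be sketched in a few lines. This is an illustration under assumptions, not the method's actual logic: each file is represented by an inclusive (start, end) pair of datetime.date objects, and "contiguous" is taken to mean that each file starts the day after the previous one ends. The name check_date_coverage is hypothetical.

```python
from datetime import date, timedelta


def check_date_coverage(file_ranges, query_start, query_end):
    """Sort (start, end) date pairs and verify they cover the query range.

    Returns the sorted ranges, or raises ValueError on a gap, an overlap,
    or incomplete coverage.
    """
    if not file_ranges:
        raise ValueError("no files found for this experiment")
    ordered = sorted(file_ranges)
    # Compare each file with its successor to detect gaps or overlaps.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start != prev_end + timedelta(days=1):
            raise ValueError(f"files are not contiguous between {prev_end} and {next_start}")
    # The combined range must contain the whole query range.
    if ordered[0][0] > query_start or ordered[-1][1] < query_end:
        raise ValueError("files do not span the query date range")
    return ordered
```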
-
child_deactivation_handler(child, child_exc)¶ When a DataKey (child) has been deactivated during query or fetch, log a message on all VarlistEntries using it, and deactivate any VarlistEntries with no remaining viable DataKeys.
-
child_status_update(exc=None)¶
-
close_log_file(log=True)¶
-
data_key(value, expt_key=None, status=None)¶ Constructor for an instance of DataKeyBase that’s used by this DataSource.
-
deactivate(exc, level=None)¶
-
property
df¶ Synonym for the DataFrame containing the catalog.
-
property
failed¶
-
fetch_data()¶
-
fetch_dataset(var, d_key)¶ Fetches data corresponding to data_key. Populates its local_data attribute with a list of identifiers for successfully fetched data (paths to locally downloaded copies of data).
-
property
full_name¶
-
generate_catalog()¶ Crawl the directory hierarchy via iter_files() and return the set of found files as rows in a Pandas DataFrame.
-
get_expt_key(scope, obj, parent_id=None)¶ Set experiment attributes with case, pod or variable scope. Given obj, construct a DataFrame of experiment attributes that are found in the queried data for all variables in obj.
If more than one choice of experiment is possible, call DataSource-specific heuristics in resolve_func to choose between them.
-
init_extra_log_handlers()¶
-
init_log(log_dir, fmt=None)¶
-
is_fetch_necessary(d_key, var=None)¶
-
iter_children(child_type=None, status=None, status_neq=None)¶ Generator iterating over child objects associated with this object.
- Parameters
status – None or ObjectStatus, default None. If None, iterates over all child objects, regardless of status. If an ObjectStatus value is passed, only iterates over child objects with that status.
status_neq – None or ObjectStatus, default None. If set, iterates over child objects which don’t have the given status. If status is set, this setting is ignored.
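The precedence between status and status_neq can be shown with a short sketch. The Status enum below is a hypothetical stand-in for ObjectStatus (its member names are invented for the example), and iter_by_status is not the framework's method; it only mirrors the filtering rules stated in the parameter descriptions.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for ObjectStatus; member names are illustrative."""
    ACTIVE = 1
    INACTIVE = 2
    FAILED = 3


def iter_by_status(children, status=None, status_neq=None):
    """Yield children filtered by status, with status taking precedence."""
    for child in children:
        if status is not None:
            # status is set, so status_neq is ignored entirely.
            if child.status is status:
                yield child
        elif status_neq is not None:
            if child.status is not status_neq:
                yield child
        else:
            # Neither filter is set: yield every child.
            yield child
```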
-
iter_files()¶ Generator that yields instances of _FileRegexClass generated from relative paths of files in CATALOG_DIR. Only paths that match the regex in _FileRegexClass are returned.
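A minimal sketch of this kind of generator follows, assuming a made-up directory layout of the form <table_id>/<variable_id>_<date_range>.nc. The regex, the field names, and the function iter_matching_files are all hypothetical; the real _FileRegexClass defines its own pattern and yields dataclass instances rather than dicts.

```python
import os
import re

# Hypothetical pattern; the named groups become the catalog columns.
FILE_REGEX = re.compile(
    r"^(?P<table_id>[^/]+)/(?P<variable_id>[A-Za-z0-9]+)_(?P<date_range>\d+-\d+)\.nc$"
)


def iter_matching_files(root_dir, pattern=FILE_REGEX):
    """Yield one dict of regex fields per file under root_dir that matches."""
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root_dir)
            # Normalize separators so the same regex works on every platform.
            rel_path = rel_path.replace(os.sep, "/")
            match = pattern.match(rel_path)
            if match:
                yield {"path": rel_path, **match.groupdict()}
```

Collecting the yielded dicts into a list gives one record per file, which is the shape a catalog table needs: one row per file, one column per match group.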
-
iter_vars(active=None, pod_active=None)¶ Iterator over all VarlistEntrys (grandchildren) associated with this case. Returns PodVarTuples (namedtuples) of the Diagnostic and VarlistEntry objects corresponding to the POD and its variable, respectively.
- Parameters
active – bool or None, default None. Selects subset of VarlistEntrys which are returned in the namedtuples:
- active = True: only iterate over currently active VarlistEntries.
- active = False: only iterate over inactive VarlistEntries (VarlistEntries which have either failed or are currently unused alternate variables).
- active = None: iterate over both active and inactive VarlistEntries.
pod_active – bool or None, default None. Same as active, but filtering the PODs that are selected.
-
iter_vars_only(active=None)¶ Convenience wrapper for iter_vars() that returns only the VarlistEntry objects (grandchildren) from all PODs in this DataSource.
-
post_fetch_hook(vars)¶ Called after fetching each batch of query results.
-
post_query_and_fetch_hook()¶ Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.
-
post_query_hook(vars)¶ Called after select_experiment(), after each query of a new batch of variables.
-
pre_fetch_hook(vars)¶ Called before fetching each batch of query results.
-
pre_query_and_fetch_hook()¶ Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.
-
pre_query_hook(vars)¶ Called before querying the presence of a new batch of variables.
-
preprocess_data()¶ Hook to run the preprocessing function on all variables.
-
query_and_fetch_cleanup(signum=None, frame=None)¶ Called if framework is terminated abnormally. Not called during normal exit.
-
query_data()¶
-
query_dataset(var)¶ Find all rows of the catalog matching relevant attributes of the DataSource and of the variable (VarlistEntry). Group these by experiments, and for each experiment make the corresponding DataFrameDataKey and store it in var’s data attribute. Specifically, the data attribute is a dict mapping experiments (labeled by experiment_keys) to data found for that variable by this query (labeled by the DataKeys).
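The shape of the resulting mapping can be sketched as follows, assuming catalog rows are plain dicts and that an experiment key is simply the tuple of values in a chosen set of columns. The function name group_rows_by_experiment, the "path" column, and the use of path lists in place of DataFrameDataKey objects are simplifications made for illustration.

```python
from collections import defaultdict


def group_rows_by_experiment(rows, expt_cols):
    """Map each experiment key to the file paths found for it (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # The experiment key is the tuple of values in the experiment columns.
        expt_key = tuple(row[col] for col in expt_cols)
        groups[expt_key].append(row["path"])
    return dict(groups)
```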
-
property
remote_data_col¶ Name of the column in the catalog containing the path to the remote data file.
-
request_data()¶ Top-level method to iteratively query, fetch and preprocess all data requested by PODs, switching to alternate requested data as needed.
-
resolve_expt(df, obj)¶ Disambiguate experiment attributes that must be the same for all variables in this case:
- If variant_id (realization, forcing, etc.) is not specified by the user, choose the lowest-numbered variant.
- If version_date is not set by the user, choose the most recent revision.
-
resolve_pod_expt(df, obj)¶ Disambiguate experiment attributes that must be the same for all variables for each POD:
- Prefer regridded to native-grid data (questionable).
- If multiple regriddings are available, pick the lowest-numbered one.
-
resolve_var_expt(df, obj)¶ Disambiguate arbitrary experiment attributes on a per-variable basis:
- If the same variable appears in multiple MIP tables, select the first MIP table in alphabetical order.
-
select_data()¶
-
set_experiment()¶ Ensure that all data we’re about to fetch comes from the same experiment. If data from multiple experiments was returned by the query that just finished, either employ data source-specific heuristics to select one or return an error.
-
set_expt_key(obj, expt_key)¶
-
setup()¶
-
setup_fetch()¶ Called once, before the iterative request_data() process starts. Use to, eg, initialize database or remote filesystem connections.
-
setup_pod(pod)¶ Update POD with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into Diagnostic’s init, at the cost of dependency inversion.
-
setup_query()¶ Generate an intake_esm catalog of files found in CATALOG_DIR. Attributes of files listed in the catalog (columns of the DataFrame) are taken from the match groups (fields) of the class’s _FileRegexClass.
-
setup_var(pod, v)¶ Update VarlistEntry fields with information that only becomes available after DataManager and Diagnostic have been configured (ie, only known at runtime, not from settings.jsonc.)
Could arguably be moved into VarlistEntry’s init, at the cost of dependency inversion.
-
status: src.core.ObjectStatus = 1¶
-
tear_down_fetch()¶ Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.
-
tear_down_query()¶ Called once, after the iterative request_data() process ends. Use to, eg, close database or remote filesystem connections.
-
variable_dest_path(pod, var)¶ Returns the absolute path of the POD’s preprocessed, local copy of the file containing the requested dataset. Files not following this convention won’t be found by the POD.
-