src.data_sources module

Concrete classes for the model data query/fetch functionality implemented in src/data_manager.py. The user selects one of them via --data_manager.

class src.data_sources.SampleDataFile(first_arg=None, *args, **kwargs)[source]

Bases: object

Dataclass describing catalog entries for sample model data files.

sample_dataset: str = sentinel.Mandatory
frequency: src.util.datelabel.DateFrequency = sentinel.Mandatory
variable: str = sentinel.Mandatory
remote_path: str = sentinel.Mandatory
_is_regex_dataclass = True
_pattern = {}
classmethod from_string(str_, *args)
class src.data_sources.SampleDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = sentinel.Mandatory, log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, sample_dataset: str = '')[source]

Bases: src.data_manager.DataSourceAttributesBase

Data-source-specific attributes for the DataSource providing sample model data.

sample_dataset: str = ''
_set_case_root_dir(log=<Logger src.data_sources (WARNING)>)[source]

Additional logic to set CASE_ROOT_DIR from MODEL_DATA_ROOT.

class src.data_sources.SampleLocalFileDataSource(*args, **kwargs)[source]

Bases: src.data_manager.SingleLocalFileDataSource

DataSource for handling POD sample model data stored on a local filesystem.

_FileRegexClass

alias of SampleDataFile

_AttributesClass

alias of SampleDataAttributes

_DiagnosticClass

alias of src.diagnostic.Diagnostic

_PreprocessorClass

alias of src.preprocessor.SampleDataPreprocessor

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)
_query_attrs_synonyms = {'name': 'variable'}
property CATALOG_DIR

Placeholder class used in the definition of the abstract_attribute() decorator.

_abc_impl = <_abc_data object>
class src.data_sources.MetadataRewriteParser(data_mgr, pod)[source]

Bases: src.xr_parser.DefaultDatasetParser

After loading and parsing the metadata on dataset ds but before applying the preprocessing functions, update attrs on ds with the new metadata values that were specified in ExplicitFileDataSource’s config file.

setup(data_mgr, pod)[source]

Make a lookup table to map VarlistEntry IDs to the set of metadata that we need to alter.

If the user has provided the name of the variable used by the data files (via the var_name attribute), set that as the translated variable name. Otherwise, variables are untranslated, and we use the heuristics in xr_parser.DefaultDatasetParser.guess_dependent_var() to determine the name.

_post_normalize_hook(var, ds)[source]

After loading the metadata on dataset ds but before reconciling it with the record, update attrs with the new metadata values that were specified in ExplicitFileDataSource’s config file.

Normal operation is to set the changed attrs on the VarlistEntry translation, and then have these overwrite attrs in ds in the inherited xr_parser.DefaultDatasetParser.reconcile_variable() method. If the user set the --disable-preprocessor flag, this is skipped, so instead we set the attrs directly on ds.

class src.data_sources.MetadataRewritePreprocessor(*args, **kwargs)[source]

Bases: src.preprocessor.DaskMultiFilePreprocessor

Subclass DaskMultiFilePreprocessor in order to look up and apply edits to metadata that are stored in ExplicitFileDataSourceConfigEntry objects in the config_by_id attribute of ExplicitFileDataSource.

_file_preproc_functions = []
_XarrayParserClass

alias of MetadataRewriteParser

property _functions

Determine which preprocessor functions are applicable to the current package run, defaulting to all of them.

Returns

tuple of classes (inheriting from PreprocessorFunctionBase) listing the preprocessing functions to be called, in order.

_abc_impl = <_abc_data object>
class src.data_sources.GlobbedDataFile(first_arg=None, *args, **kwargs)[source]

Bases: object

Applies a trivial regex to the paths returned by the glob.

dummy_group: str = sentinel.Mandatory
remote_path: str = sentinel.Mandatory
_is_regex_dataclass = True
_pattern = {}
classmethod from_string(str_, *args)
class src.data_sources.ExplicitFileDataSourceConfigEntry(glob_id: src.util.basic.MDTF_ID = None, pod_name: str = sentinel.Mandatory, name: str = sentinel.Mandatory, glob: str = sentinel.Mandatory, var_name: str = '', metadata: dict = <factory>, _has_user_metadata: bool = None)[source]

Bases: object

glob_id: src.util.basic.MDTF_ID = None
pod_name: str = sentinel.Mandatory
name: str = sentinel.Mandatory
glob: str = sentinel.Mandatory
var_name: str = ''
metadata: dict
_has_user_metadata: bool = None
property full_name
classmethod from_struct(pod_name, var_name, v_data)[source]
to_file_glob_tuple()[source]
class src.data_sources.ExplicitFileDataAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = sentinel.Mandatory, log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, config_file: str = None)[source]

Bases: src.data_manager.DataSourceAttributesBase

config_file: str = None
class src.data_sources.ExplicitFileDataSource(*args, **kwargs)[source]

Bases: src.data_manager.OnTheFlyGlobQueryMixin, src.data_manager.LocalFetchMixin, src.data_manager.DataframeQueryDataSourceBase

DataSource for dealing with data in a regular directory hierarchy on a locally mounted filesystem. Assumes data for each variable may be split into several files according to date, with the dates present in their filenames.

_FileRegexClass

alias of GlobbedDataFile

_AttributesClass

alias of ExplicitFileDataAttributes

_DiagnosticClass

alias of src.diagnostic.Diagnostic

_PreprocessorClass

alias of MetadataRewritePreprocessor

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col=None)
expt_key_cols = ()
expt_cols = ()
property CATALOG_DIR

Placeholder class used in the definition of the abstract_attribute() decorator.

parse_config(config_d)[source]

Parse contents of JSON config file into a list of ExplicitFileDataSourceConfigEntry objects.
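As an illustration only, the parsing step could look like the sketch below. The ConfigEntry class is a hypothetical stand-in for ExplicitFileDataSourceConfigEntry, and the config layout (a mapping of POD name to variable name to either a glob string or a dict with "files", "var_name" and "metadata" keys) is an assumption, not something this page documents.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ConfigEntry:
    """Hypothetical stand-in for ExplicitFileDataSourceConfigEntry."""
    pod_name: str
    name: str
    glob: str
    var_name: str = ""
    metadata: dict = field(default_factory=dict)

    @classmethod
    def from_struct(cls, pod_name, var_name, v_data):
        # Assumed layout: a bare string is the glob itself; a dict may
        # additionally carry a data-file variable name and metadata edits.
        if isinstance(v_data, dict):
            return cls(
                pod_name=pod_name,
                name=var_name,
                glob=v_data.get("files", ""),
                var_name=v_data.get("var_name", ""),
                metadata=v_data.get("metadata", {}),
            )
        return cls(pod_name=pod_name, name=var_name, glob=v_data)


def parse_config(config_d):
    """Flatten {pod: {variable: data}} into a list of ConfigEntry objects."""
    return [
        ConfigEntry.from_struct(pod_name, var_name, v_data)
        for pod_name, pod_vars in config_d.items()
        for var_name, v_data in pod_vars.items()
    ]


# Example config; the POD name, globs and metadata are made up.
config_text = """
{
  "example_pod": {
    "pr": "mon/*.pr.mon.nc",
    "tas": {"files": "mon/*.tas.mon.nc", "var_name": "T2M",
            "metadata": {"units": "K"}}
  }
}
"""
entries = parse_config(json.loads(config_text))
```

Each entry keeps the POD and variable it came from, which matches the (pod_name, var_name, v_data) arguments of from_struct() listed above.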

iter_globs()[source]

Iterator returning FileGlobTuple instances. The generated catalog contains the union of the files found by each of the globs.
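The "union of the files found by each of the globs" can be sketched with the standard library alone. The helper name union_of_globs and the file names are hypothetical; the actual method yields FileGlobTuple instances and leaves the globbing to the data manager.

```python
import glob
import os
import tempfile


def union_of_globs(root, patterns):
    """Return the sorted union of files matched by each glob under root."""
    found = set()
    for pattern in patterns:
        # Each pattern is evaluated independently; the set removes
        # duplicates when two patterns match the same file.
        found.update(glob.glob(os.path.join(root, pattern)))
    return sorted(found)


# Build a small throwaway directory to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("pr_1990.nc", "pr_1991.nc", "tas_1990.nc"):
        open(os.path.join(tmp, fname), "w").close()
    matches = union_of_globs(tmp, ["pr_*.nc", "*_1990.nc"])
    names = [os.path.basename(p) for p in matches]
```

Here pr_1990.nc is matched by both patterns but appears only once in names.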

_abc_impl = <_abc_data object>
class src.data_sources.CMIP6DataSourceAttributes(CASENAME: str = sentinel.Mandatory, FIRSTYR: str = sentinel.Mandatory, LASTYR: str = sentinel.Mandatory, CASE_ROOT_DIR: str = '', convention: str = 'CMIP', log: dataclasses.InitVar = <Logger src.data_manager (WARNING)>, activity_id: str = '', institution_id: str = '', source_id: str = '', experiment_id: str = '', variant_label: str = '', grid_label: str = '', version_date: str = '', model: dataclasses.InitVar = '', experiment: dataclasses.InitVar = '')[source]

Bases: src.data_manager.DataSourceAttributesBase

convention: str = 'CMIP'
activity_id: str = ''
institution_id: str = ''
source_id: str = ''
experiment_id: str = ''
variant_label: str = ''
grid_label: str = ''
version_date: str = ''
model: dataclasses.InitVar = ''
experiment: dataclasses.InitVar = ''
CATALOG_DIR: str
class src.data_sources.CMIP6ExperimentSelectionMixin[source]

Bases: object

Encapsulate attributes and logic used for CMIP6 experiment disambiguation so that it can be reused in DataSources with different parents (e.g., different FetchMixins for different data fetch protocols).

Assumes inheritance from DataframeQueryDataSourceBase; this assumption should be enforced.

_query_attrs_synonyms = {'name': 'variable_id'}
property CATALOG_DIR
_query_group_hook(group_df)[source]

Eliminate regional (Antarctic/Greenland) and spatially averaged data from consideration for data fetch, since no POD currently makes use of data of this type.

static _filter_column(df, col_name, func, obj_name)[source]
_filter_column_min(df, obj_name, *col_names)[source]
_filter_column_max(df, obj_name, *col_names)[source]
resolve_expt(df, obj)[source]

Disambiguate experiment attributes that must be the same for all variables in this case:

  • If variant_id (realization, forcing, etc.) is not specified by the user, choose the lowest-numbered variant

  • If version_date is not set by the user, choose the most recent revision
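The two rules above can be sketched on plain dictionaries. This is not the package's implementation, which operates on a DataFrame through the _filter_column_min() and _filter_column_max() helpers. The function names are hypothetical, and two assumptions are made: variants are ordered by their (r, i, p, f) indices compared as integers, and version dates are fixed-width YYYYMMDD strings so that string comparison gives chronological order.

```python
import re

# CMIP6-style variant labels such as "r1i1p1f1".
_VARIANT_RE = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)$")


def variant_key(label):
    """Sort key: the four variant indices as integers (assumed ordering)."""
    match = _VARIANT_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized variant label: {label!r}")
    return tuple(int(x) for x in match.groups())


def pick_experiment(rows, variant_label="", version_date=""):
    """Fill in whichever of the two attributes the user left unset."""
    if not variant_label:
        # Lowest-numbered variant; numeric comparison puts r2 before r10.
        variant_label = min((r["variant_label"] for r in rows), key=variant_key)
    rows = [r for r in rows if r["variant_label"] == variant_label]
    if not version_date:
        # Most recent revision among the rows for the chosen variant.
        version_date = max(r["version_date"] for r in rows)
    return variant_label, version_date


# Made-up catalog rows.
catalog = [
    {"variant_label": "r2i1p1f1", "version_date": "20200101"},
    {"variant_label": "r1i1p1f1", "version_date": "20190308"},
    {"variant_label": "r1i1p1f1", "version_date": "20191115"},
    {"variant_label": "r10i1p1f1", "version_date": "20210101"},
]
chosen = pick_experiment(catalog)
```

With these rows, chosen is ("r1i1p1f1", "20191115"): the newer 2021 revision is ignored because it belongs to a higher-numbered variant.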

resolve_pod_expt(df, obj)[source]

Disambiguate experiment attributes that must be the same for all variables for each POD:

  • Prefer regridded to native-grid data (questionable)

  • If multiple regriddings available, pick the lowest-numbered one
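A minimal sketch of this preference, assuming CMIP6-style grid labels where "gn" denotes the native grid and "gr", "gr1", "gr2", ... denote regridded data. Treating a bare "gr" as the lowest-numbered regridding is an assumption of this sketch, and the function names are hypothetical.

```python
import re


def grid_sort_key(grid_label):
    """Rank regridded labels ahead of everything else, then by number."""
    match = re.fullmatch(r"gr(\d*)", grid_label)
    if match:
        # Regridded data: rank 0; a bare "gr" is treated as number 0.
        return (0, int(match.group(1) or 0))
    # Native-grid (or any other) label: ranked after all regriddings.
    return (1, 0)


def pick_grid(labels):
    """Return the preferred grid label from the available ones."""
    return min(labels, key=grid_sort_key)


preferred = pick_grid(["gn", "gr2", "gr1"])
```

When only native-grid data is available, pick_grid(["gn"]) falls back to "gn".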

resolve_var_expt(df, obj)[source]

Disambiguate arbitrary experiment attributes on a per-variable basis:

  • If the same variable appears in multiple MIP tables, select the first MIP table in alphabetical order.
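The per-variable rule amounts to a minimum over table names within each variable. The sketch below uses plain (variable_id, table_id) pairs and Python's default, case-sensitive string comparison; whether the package's notion of alphabetical order is case-sensitive is not stated here, so that choice is an assumption, as is the function name.

```python
def first_table_per_variable(rows):
    """Map each variable to its alphabetically first MIP table."""
    chosen = {}
    for variable_id, table_id in rows:
        # Keep the smallest table name seen so far for this variable.
        if variable_id not in chosen or table_id < chosen[variable_id]:
            chosen[variable_id] = table_id
    return chosen


# Made-up query results: "tas" is available in two tables.
tables = first_table_per_variable(
    [("tas", "day"), ("tas", "Amon"), ("pr", "day")]
)
```

In this example tables["tas"] is "Amon", while "pr" keeps its only table, "day".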

class src.data_sources.CMIP6LocalFileDataSource(*args, **kwargs)[source]

Bases: src.data_sources.CMIP6ExperimentSelectionMixin, src.data_manager.LocalFileDataSource

DataSource for handling model data named following the CMIP6 DRS and stored on a local filesystem.

_FileRegexClass

alias of src.cmip6.CMIP6_DRSPath

_DirectoryRegex = {}
_AttributesClass

alias of CMIP6DataSourceAttributes

_DiagnosticClass

alias of src.diagnostic.Diagnostic

_PreprocessorClass

alias of src.preprocessor.DefaultPreprocessor

col_spec = DataframeQueryColumnSpec(expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, pod_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, var_expt_cols=<src.data_manager.DataFrameQueryColumnGroup object>, remote_data_col=None, daterange_col='date_range')
_convention = 'CMIP'
_abc_impl = <_abc_data object>