Operations API

This page provides an overview of OpenClimateGIS’s operations interface. A Class Reference and Function Reference are also available. OpenClimateGIS’s combined functionality is generally accessed using the operations interface (OcgOperations). Potential keyword arguments are described below in Keyword Arguments. Please see Examples for an overview of using operations with real data.

Default Coordinate System

The default coordinate system for OpenClimateGIS is Spherical. The representative PROJ.4 string is {'a': 6370997, 'no_defs': True, 'b': 6370997, 'proj': 'longlat', 'towgs84': '0,0,0,0,0,0,0'}. Prior to OpenClimateGIS v2.x, the WGS84 datum was used by default. This default coordinate system also applies to bounding boxes. If a bounding box coordinate system differs from the default, this must be converted to a dictionary representation:

>>> import shapely
>>> import ocgis
>>> bbox = [-120, 30, -130, 40]
>>> bbox = shapely.geometry.box(*bbox)
>>> geom = {'geom': bbox, 'crs': ocgis.crs.WGS84()}

The correct coordinate system will always be read from file by the driver. The default coordinate system used by OpenClimateGIS may be changed using env.DEFAULT_COORDSYS.

OcgOperations Reference

The basic operations syntax is:

>>> import ocgis
>>> ops = ocgis.OcgOperations(**kwargs)
>>> res = ops.execute()
Some notable features of the operations object:
  1. The only required keyword argument is dataset.
  2. All keyword arguments are exposed as public attributes which may be arbitrarily set using standard syntax:
>>> import ocgis
>>> rd = ocgis.RequestDataset(uri='/path/to/some/dataset', variable='foo')
>>> ops = OcgOperations(dataset=rd)
>>> ops.aggregate = True
  1. Operations may be run in parallel using MPI. See Parallel Operations for guidance.
  2. The object SHOULD NOT be reused following an execution as the software may add/modify attribute contents. Instantiate a new object following an execution or copy the object appropriately.

Keyword Arguments

Additional information on arguments are found in their respective sections.

abstraction

Note

OpenClimateGIS uses the bounds attribute of a NetCDF variable to construct 'polygon' representations of regular grids. If no bounds attribute is found, the software defaults to the 'point' geometry abstraction.

Value Description
'polygon' (default) Represent cells as shapely.geometry.Polygon objects.
'point' Represent cells as shapely.geometry.Point objects.

add_auxiliary_files

If True (the default), create a new directory and add metadata and other informational files in addition to the converted file. If False, write the target file only to dir_output and do not create a new directory.

aggregate

Value Description
True Selected geometries are combined into a single geometry (see Aggregate (Union)).
False (default) Selected geometries are not combined.

agg_selection

Value Description
True Aggregate (union) geom to a single geometry.
False (default) Leave geom as is.

The purpose of this data manipulation is to ease the method required to aggregate (union) geometries into arbitrary regions. A simple example would be unioning the U.S. state boundaries of Utah, Nevada, Arizona, and New Mexico into a single polygon representing a “Southwestern Region”.

allow_empty

Value Description
True Allow the empty set for geometries not geographically coincident with a source geometry.
False (default) Raise EmptyDataNotAllowed if the empty set is encountered.

calc

See the Computation page for more details.

calc_grouping

There are three forms for this argument:

  1. Date Part Grouping: Any combination of 'day', 'month', and 'year'.
>>> calc_grouping = ['day']
>>> calc_grouping = ['month','year']
>>> calc_grouping = ['day','year']

Temporal aggregation splits date/time coordinates into parts and groups them according to unique combinations of those parts. If data is grouped by month, then all of the January times would be in one group with all of the August times in another. If a grouping of month and year are applied, then all of the January 2000 times would be in a group with all of the January 2001 times and so on.

Any temporal aggregation applied to a dataset should be consistent with the input data’s temporal resolution. For example, aggregating by day, month, and year on daily input dataset is not a reasonable aggregation as the data selected for aggregation will have a sample size of one (i.e. one day per aggregation group).

  1. Summarize Over All: The string 'all' indicates the entire time domain should be summarized.
>>> calc_grouping = 'all'
  1. Seasonal Groups: A sequence of integer sequences. Element sequences must be mutually exclusive (i.e. no repeated integers). Representative times for the climatology are chosen as the center month in a sequence (i.e. January in the sequence [12,1,2]).

Month integers map as expected (1=January, 2=February, etc.). The example below constructs a single season composed of March, April, and May. Note the nested lists.

>>> calc_grouping = [[3, 4, 5]]

The next example consumes all the months in a year.

>>> calc_grouping = [[12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

Unique, time sequential seasons are possible with the 'unique' flag:

>>> calc_grouping = [[12, 1, 2], 'unique']
A unique season has at least one value associated with each month in the season. If a month is missing, the season will be dropped. The season specification above returns a calculation based on values with date coordinates in:
  • Dec 1900, Jan 1901, Feb 1901
  • Dec 1901, Jan 1902, Feb 1902

It is also possible to group the seasons by year.

>>> calc_grouping = [[12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], 'year']
For example, this returns a calculation based on values with date coordinates in:
  • 1900: Dec, Jan, Feb
  • 1901: Dec, Jan, Feb
  • 1902: Dec, Jan, Feb

calc_raw

Value Description
True If ocgis.OcgOperations.aggregate is True, perform computations on raw, unaggregated values.
False (default) Use aggregated values during computation.

callback

A callback function that may be used for custom messaging. This function integrates with the log handler and will receive messages at or above the logging.INFO level.

>>> def callback(percent, message):
>>>     print(percent, message)

conform_units_to

Destination units for conversion. If this parameter is set, then the cf_units module must be installed. Setting this parameter will override conformed units set on dataset objects.

dataset

This is the only required parameter. All elements of dataset will be processed.

A dataset is the target file(s) or object(s) containing data to process. A dataset may be:
  1. A file on the local machine or network location accessible by the software (use RequestDataset).
  2. A URL to an unsecured OpenDAP dataset (use RequestDataset).
  3. An OpenClimateGIS field object (use Field). If a Field object is used, be aware operations may modify the object inplace.
>>> # A keyword argument dictionary can be used in place of an actual request object.
>>> dataset = {'uri': '/path/to/my/data.nc'}
>>> # Use variable auto-discovery.
>>> from ocgis import RequestDataset
>>> dataset = RequestDataset(uri='/path/to/my/data.nc'}
>>> # Specify the target variable directly.
>>> dataset = RequestDataset(uri='/path/to/my/data.nc', variable='tas')

In version v2.x the RequestDatasetCollection was removed. Use sequences of request dataset or field objects in their place.

dir_output

This sets the output folder for any disk formats. If this is None and :attr:ocgis.env.DIR_OUTPUT is None, then output will be written to the current working directory.

geom

Warning

Unless aggregate or agg_selection is True, subsetting with multiple geometries to netCDF will raise an error.

If a geometry(s) is provided, it is used to subset every RequestDataset or Field object. Supplying a value of None (the default) results in the return of the entire spatial domain.

There are a number of ways to parameterize the geom keyword argument:

  1. Bounding Box

This is a list of floats corresponding to: [min_x, min_y, max_x, max_y]. See Default Coordinate System for guidance on coordinate system defaults and usages.

>>> geom = [-120.4, 30.0, -110.3, 41.4]
  1. Point

This is a list of floats corresponding to: [longitude, latitude]. See Default Coordinate System for guidance on coordinate system defaults and usages.

>>> geom = [-120.4, 36.5]
  1. Using GeomCabinetIterator
>>> from ocgis import GeomCabinetIterator
>>> geom = GeomCabinetIterator('state_boundaries', geom_select_uid=[16])
  1. Using a GeomCabinet key
>>> geom = 'state_boundaries'
  1. Custom Sequence of Shapely Geometry Dictionaries

The 'crs' key is optional. If it is not included, WGS84 is assumed. The 'properties' key is also optional. See Default Coordinate System for guidance on coordinate system defaults and usages.

>>> geom = [{'geom': Point(x,y), 'properties': {'UGID': 23, 'NAME': 'geometry23'}, 'crs': CoordinateReferenceSystem(epsg=4326)} ,...]
  1. Path to a GIS file
>>> geom = '/path/to/shapefile.shp'

geom_select_sql_where

Warning

Single quotes must be used inside double quotes!

If provided, this string will be used as part of a SQL WHERE clause to select geometries from the source. See the section titled “WHERE” for documentation on supported statements: http://www.gdal.org/ogr_sql.html. This works only for geometries read from file.

>>> geom_select_sql_where = "STATE_NAME = 'Wisconsin'"
>>> geom_select_sql_where = "STATE_NAME in ('Wisconsin', 'Nebraska')"
>>> geom_select_sql_where = "POPULATION > 1500"

geom_select_uid

Select specific geometries from the target shapefile chosen using geom. The integer sequence selects matching UGID values from the shapefiles. For more information on adding new shapefiles or the requirements of input shapefiles, please see the section titled Shapefile Data.

>>> geom_select_uid = [1, 2, 3]
>>> geom_select_uid = [4, 55]
>>> geom_select_uid = [1]

As clarification, suppose there is a shapefile called basins.shp (this assumes the folder containing the shapefile has been set as the value for ocgis.env.DIR_GEOMCABINET) with the following attribute table:

UGID Name
1 Basin A
2 Basin B
3 Basin C

If the goal is to subset the data by the boundary of “Basin A” and write the resulting data to netCDF, a call to OpenClimateGIS operations looks like:

>>> import ocgis
>>> rd = ocgis.RequestDataset(uri='/path/to/data.nc', variable='tas')
>>> path = ocgis.OcgOperations(dataset=rd, geom='basins', geom_select_uid=[1], output_format='nc').execute()

geom_uid

All subset geometries must have a unique identifier. The unique identifier allows subsetted data to be linked to the selection geometry. Passing a string value to geom_uid will overload the default unique identifier ocgis.env.DEFAULT_GEOM_UID. If no unique identifier is available, a one-based unique identifier will be generated having a name with value ocgis.env.DEFAULT_GEOM_UID.

interpolate_spatial_bounds

If True, attempt to interpolate bounds coordinates if they are absent. This will also extrapolate exterior bounds to avoid losing spatial coverage.

melted

If False, variable names will be individual column headers (non-melted). If True, variable names will be placed into a single column.

A non-melted format:

TIME TAS TASMAX
1 30.3 40.3
2 32.2 41.7
3 31.7 40.9

A melted format:

TIME NAME VALUE
1 TAS 30.3
2 TAS 32.2
3 TAS 31.7
1 TASMAX 40.3
2 TASMAX 41.7
3 TASMAX 40.9

optimized_bbox_subset

If True, only perform the bounding box subset ignoring other subsetting procedures like masking within the bounding coordinates. Using this option should result in lower memory requirements and shorter processing times for subsets. Note this assumes the bounding box aligns appropriately with the target grid.

output_crs

By default, the output coordinate reference system (CRS) is the CRS of the input RequestDataset object. If multiple RequestDataset objects are part of an OcgOperations call, then output_crs must be provided if the input CRS values of the RequestDataset objects differ. The value for output_crs is an instance of CRS.

>>> import ocgis
>>> output_crs = ocgis.crs.Spherical()

General PROJ.4 and EPSG codes are supported by coordinate systems. For example, to output data on the GCS North American 1983 coordinate system, you can configure the coordinate system like:

>>> from ocgis import crs
>>> output_crs = crs.CRS(epgs=4269)

You could also use a PROJ.4 string:

>>> from ocgis import crs
>>> output_crs = crs.CRS(proj4='+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs')

output_format

Value Description
'ocgis' (default) Return a SpatialCollection with keys matching ugid (see geom). Also see Spatial Collections for more information on this output format.
'csv' A CSV file representation of the data.
'csv-shp' In addition to a CSV representation, shapefiles with primary key links to the CSV are provided.
'nc' A NetCDF4-CF file. See NetCDF Output for additional information on the structure of the NetCDF format.
'geojson' A GeoJSON representation of the data.
'shp' A shapefile representation of the data.

output_format_options

A dictionary of converter-specific options. Options for each converter are listed in the table below.

Output Format Option Description
'nc' data_model The netCDF data model: http://unidata.github.io/netcdf4-python/#netCDF4.Dataset.
variable_kwargs Dictionary of keyword parameters to use for netCDF variable creation. See: http://unidata.github.io/netcdf4-python/#netCDF4.Variable.
unlimited_to_fixedsize If True, convert the unlimited dimension to fixed size. Only applies to time and level dimensions.
geom_dim The name of the dimension storing aggregated (unioned) outputs. Only applies when aggregate is True.
>>> output_format_options = {'data_model': 'NETCDF4_CLASSIC'}
>>> options = {'variable_kwargs': {'zlib': True, 'complevel': 4}}

prefix

The prefix provides the name of the output folder (if add_auxiliary_files=True) and the filename prefix for any file output created by OpenClimateGIS.

>>> prefix = 'fn_start'

regrid_destination

Please see ESMPy Regridding for an overview and limitations.

If provided, all RequestDataset objects in dataset will be regridded to match the grid provided in the argument’s object. This argument must be a RequestDataset or Field.

>>> regrid_destination = ocgis.RequestDataset(uri='/path/to/destination.nc')

regrid_options

A dictionary with regridding options. Please see the documentation for regrid_field(). Dictionary elements of regrid_options correspond to the keyword arguments of this function.

>>> import ESMF
>>> regrid_options = {'regrid_method': ESMF.RegridMethod.CONSERVE}

search_radius_mult

This is a scalar float value multiplied by the target data’s resolution to determine the buffer radius for the point. This is only applicable when subsetting against gridded datasets.

Note

Prior to v2.x, this was a float value by default. This was changed to None in current versions. Hence, point geometries will be used for subsetting and not a buffered point.

select_nearest

If True, the nearest geometry to the centroid of the current selection geometry is returned.

slice

This is a list of integers, None, or lists of integers. The values composing the list will be converted to slice objects. For example, to return the first ten time steps:

>>> slc = [None, [0, 10], None, None, None]

The index locations in the above list correspond to:

Index Description
0 Realization / Ensemble Member
1 Time
2 Level
3 Row
4 Column

To select the last time step:

>>> slice = [None, -1, None, None, None]

snippet

Note

The entire spatial domain is returned unless geom is specified.

Note

Only applies for pure subsetting for limiting computations use time_range and/or time_region.

Value Description
True Return only the first time point and the first level slice (if applicable).
False (default) Return all data.

spatial_operation

Value Description
"intersects" (default) Source geometries touching or overlapping selection geometries are returned (see Intersects (Select)).
"clip" A full geometric intersection is performed between source and selection geometries. New geometries may be created. (see Clip (Intersection))

spatial_reorder

If True, reorder wrapped coordinates such that the longitude values are in ascending order. Reordering assumes the first row of longitude coordinates are representative of the other longitude coordinate rows. Bounds and corners will be removed in the event of a reorder. Only applies to spherical coordinate systems.

If False (the default), do not attempt to reorder wrapped spherical longitude coordinates.

Note

If aggregate=True, spatial reordering is not possible.

spatial_wrapping

Allows control of the wrapped state for all input fields. Only field objects with a wrappable coordinate system are affected. Wrapping operations are applied before all other operations.

Value Description
None (default) Do not attempt a wrap or unwrap operation.
"wrap" Wrap spherical coordinates to the -180 to 180 longitudinal domain.
"unwrap" Unwrap spherical coordinate to the 0 to 360 longitudinal domain.

time_range

Upper and lower bounds for the time dimension subset composed of a two-element sequence of datetime.datetime-like objects. If None, return all time points. Using this argument will overload all RequestDataset time_range values.

time_region

A dictionary with keys of ‘month’ and/or ‘year’ and values as sequences corresponding to target month and/or year values. Empty region selection for a key may be set to None. Using this argument will overload all RequestDataset time_region values.

>>> time_region = {'month': [6, 7], 'year': [2010, 2011]}
>>> time_region = {'year': [2010]}

time_subset_func

Subset the time dimension by an arbitrary function. The functions must take one argument and one keyword. The argument is a vector of datetime.datetime-like objects. The keyword argument should be called “bounds” and may be None. If the bounds value is not None, it should expect a n-by-2 array of datetime objects. The function must return an integer sequence suitable for indexing. For example:

>>> def subset_func(value, bounds=None):
>>>     indices = []
>>>     for ii, v in enumerate(value):
>>>         if v.month == 6:
>>>             indices.append(ii)
>>>     return indices

Note

The subset function is applied following time_region and time_range.

vector_wrap

Note

Only applicable for spherical, geographic coordinate systems.

Value Description
True (default) For vector geometry outputs (e.g. shp), ensure output longitudinal domain is -180 to 180.
False Maintain the RequestDataset’s longitudinal domain.

Environment

These are global parameters used by OpenClimateGIS. For those familiar with arcpy programming, this behaves similarly to the arcpy.env module. Any ocgis.env variable be overloaded with system environment variables by setting OCGIS_<variable-name>.

env.DEFAULT_GEOM_UID = 'UGID'
The default unique geometry identifier to search for in geometry datasets. This is also the name of the created unique identifier if none exists in the target.
env.DIR_DATA = None
Directory(s) to search through to find data. If specified, this should be a sequence of directories. It may also be a single directory location. Note that the search may take considerable time if a very high level directory is chosen. If this variable is set, it is only necessary to specify the filename(s) when creating a RequestDataset.
env.DIR_OUTPUT = None (defaults to current working directory)
The directory where output data is written. OpenClimateGIS creates directories inside which output data is stored unless add_auxiliary_files is False. If None, it defaults to the current working directory.
env.DIR_GEOMCABINET = <path-to-directory>
Location of the geometry directory (e.g. a directory containing shapefiles) for use by GeomCabinet. Formerly called DIR_SHPCABINET.
env.MELTED = False
If True, use a melted tabular format with all variable values collected in a single column.
env.OVERWRITE = False

Warning

Use with caution.

Set to True to overwrite existing output folders. This will remove the folder if it exists!

env.PREFIX = 'ocgis_output'
The default prefix to apply to output files. This is also the output folder name.
env.SUPPRESS_WARNINGS = True
If True, suppress all OpenClimateGIS warning messages to standard out. Warning messages will still be logged.
env.USE_CFUNITS = True
If True, use cfunits for any unit transformations. This will be automatically set to False if cfunits is not available for import.
env.USE_MEMORY_OPTIMIZATIONS = False
If True, some methods will attempt to minimize their memory usage at the expense of computational time.
env.USE_SPATIAL_INDEX = True
If True, use rtree to create spatial indices for spatial operations. This will be automatically set to False if rtree is not available for import.
env.VERBOSE = False
Indicate if additional output information should be printed to terminal.
env.DEFAULT_COORDSYS = ocgis.crs.Spherical
The default coordinate system used by OpenClimateGIS.
env.USE_NETCDF4_MPI = None
If None, detect if it is possible to use netCDF4-python’s MPI asynchronous write capability. Use it if available. If True, do asynchronous writes with netCDF4-python. Set to False to use synchronous writes always.

Inspecting Data

See Inspection for guidance on inspecting datasets.

Spatial Collections

See the Advanced Subsetting example for SpatialCollection usage. Spatial collections are returned by default from OcgOperations.

Shapefile Data

Shapefiles may be added to the directory mapped by the environment variable env.DIR_GEOMCABINET.

The shapefile’s geom key is the name of the shapefile. It must have an alphanumeric name with no spaces with the only allowable special character being underscores “_”.