EO Datasets 3

EO Datasets aims to be the easiest way to write, validate and convert dataset imagery and metadata for the Open Data Cube.

Write a Dataset

Here’s a simple example of creating a dataset with one measurement (called “blue”) from an existing image:

from datetime import datetime
from pathlib import Path

from eodatasets3 import DatasetAssembler

# An existing input image (placeholder path for this example).
blue_geotiff_path = Path('/some/path/to/blue.tif')

collection = Path('/some/output/collection/path')
with DatasetAssembler(collection) as p:
   p.product_family = "blues"

   # Date of acquisition (UTC if no timezone).
   p.datetime = datetime(2019, 7, 4, 13, 7, 5)
   # When the data was processed/created.
   p.processed_now() # Right now!
   # (If not newly created, set the date on the field: `p.processed = ...`)

   # Write our measurement from the given path, calling it 'blue'.
   p.write_measurement("blue", blue_geotiff_path)

   # Add a jpg thumbnail using our only measurement for the r/g/b bands.
   p.write_thumbnail("blue", "blue", "blue")

   # Complete the dataset.
   p.done()

Note that until you call done(), nothing will exist in the dataset’s final output location. It is stored in a hidden temporary folder in the output directory, and renamed by done() once complete and valid.

Custom STAC-like properties can also be set directly on .properties:

p.properties['fmask:cloud_cover'] = 34.0

And known properties are automatically normalised:

p.platform = "LANDSAT_8"  # to: 'landsat-8'
p.processed = "2016-03-04 14:23:30Z"  # into a date.
p.maturity = "FINAL"  # lowercased
p.properties["eo:off_nadir"] = "34"  # into a number

Including provenance

Most of our datasets are processed from an existing (input) dataset and have the same spatial information. We can add them as source datasets to record provenance, and the assembler can optionally copy any common metadata automatically:

collection = Path('/some/output/collection/path')
with DatasetAssembler(collection) as p:
   # We add a source dataset, asking to inherit the common properties
   # (eg. platform, instrument, datetime)
   p.add_source_path(level1_ls8_dataset_path, auto_inherit_properties=True)

   # Set our product information.
   # It's a GA product of "numerus-unus" ("the number one").
   p.producer = "ga.gov.au"
   p.product_family = "blues"
   p.dataset_version = "3.0.0"

   ...

In these situations, we often write our new pixels as a numpy array, inheriting the existing grid spatial information (gridspec) from our input dataset:

# Write a measurement from a numpy array, using the source dataset's grid spec.
# (`l1_ls8_dataset` here is the source's loaded metadata document (DatasetDoc),
#  not the path used above.)
p.write_measurement_numpy(
   "ones",
   numpy.ones((60, 60), numpy.int16),
   GridSpec.from_dataset_doc(l1_ls8_dataset),
   nodata=-999,
)

Writing only metadata

The above examples copy the imagery, converting them to valid COG files in a new location. But sometimes you want to leave the imagery as-is and just generate a metadata file for Open Data Cube. We can do this by using eodatasets3.DatasetAssembler.note_measurement() instead of eodatasets3.DatasetAssembler.write_measurement(), to note the path of the current image:

usgs_level1 = Path('datasets/LC08_L1TP_090084_20160121_20170405_01_T1')

with DatasetAssembler(
  dataset_location=usgs_level1
) as p:
  p.product_family = "level1"
  p.datetime = datetime(2019, 7, 4, 13, 7, 5)

  # Note the measurement in the metadata. (instead of ``write``)
  p.note_measurement('red',
     usgs_level1 / 'LC08_L1TP_090084_20160121_20170405_01_T1_B4.TIF'
  )

  # Or relative to the dataset
  # (this will work unchanged on non-filesystem locations, such as ``s3://`` or tar files)
  p.note_measurement('blue',
     'LC08_L1TP_090084_20160121_20170405_01_T1_B2.TIF',
     relative_to_dataset_location=True
  )

  ...

Note that the assembler will throw an error if a path lives outside the dataset (location), as it would then have to be recorded as an absolute rather than relative path. Relative paths are considered best practice for Open Data Cube.

You can allow absolute paths with an argument at assembler construction (eodatasets3.DatasetAssembler.__init__()):

with DatasetAssembler(
   dataset_location=usgs_level1,
   allow_absolute_paths=True,
):
   ...

API / Class

class eodatasets3.DatasetAssembler(collection_location=None, dataset_location=None, metadata_path=None, dataset_id=None, if_exists=<IfExists.ThrowError: 2>, allow_absolute_paths=False, naming_conventions='default')[source]
__init__(collection_location=None, dataset_location=None, metadata_path=None, dataset_id=None, if_exists=<IfExists.ThrowError: 2>, allow_absolute_paths=False, naming_conventions='default')[source]

Assemble a dataset with ODC metadata, writing metadata and (optionally) its imagery as COGs.

There are three optional paths that can be specified; at least one of them must be given.

  • A collection path is the root folder where datasets will live (in sub-[sub]-folders).

  • Each dataset has its own dataset location, as stored in an Open Data Cube index. All paths inside the metadata are relative to this location.

  • An output metadata document location.

If you’re writing data, you typically only need to specify the collection path, and the others will be automatically generated using the naming conventions.

If you’re only writing a metadata file (for existing data), you only need to specify a metadata path.

If you’re storing data using an exotic URI scheme, such as a ‘tar://’ URL path, you will need to specify this as your dataset location.
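For example, a minimal sketch of the three styles of construction (all paths and the tar URL below are hypothetical placeholders):

# 1. Writing data into a collection: dataset and metadata paths are derived
#    from the naming conventions.
with DatasetAssembler(collection_location=Path('/collections/ls8-blues')) as p:
    ...

# 2. Writing only a metadata document for existing data.
with DatasetAssembler(metadata_path=Path('/data/scene1/odc-metadata.yaml')) as p:
    ...

# 3. Data stored at a non-filesystem location (eg. inside a tar file; URL form illustrative only).
with DatasetAssembler(dataset_location='tar:///data/scene1.tar!/') as p:
    ...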

Parameters
  • collection_location (Optional[Path]) – Optional base directory where the collection of datasets should live. Subfolders will be created according to the naming convention.

  • dataset_location (Union[Path, str, None]) – Optional location for this specific dataset. Otherwise it will be generated according to the collection path and naming conventions.

  • metadata_path (Optional[Path]) – Optional metadata document output path. Otherwise it will be generated according to the collection path and naming conventions.

  • dataset_id (Optional[UUID]) – Optional UUID for this dataset, otherwise a random one will be created. Use this if you have a stable way of generating your own IDs.

  • if_exists (IfExists) – What to do if the output dataset already exists? By default, throw an error.

  • allow_absolute_paths (bool) – Allow metadata paths to refer to files outside the dataset location. This means they will have to be absolute paths, and will not be portable. (default: False)

  • naming_conventions (str) – Naming conventions to use. Supports default or dea. The latter has stricter metadata requirements (try it and see – it will tell you what’s missing).

Return type

None

add_accessory_file(name, path)[source]

Record a reference to an additional file, such as native metadata, thumbnails, or checksums: anything other than ODC measurements.

By convention, the name should be prefixed with its category, such as ‘metadata:’ or ‘thumbnail:’.

Parameters
  • name (str) – identifying name, eg ‘metadata:mtl’

  • path (Path) – local path to file.
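A minimal sketch (the accessory name and file path are hypothetical):

p.add_accessory_file(
    'metadata:landsat_mtl',
    Path('LC08_L1TP_090084_20160121_20170405_01_T1_MTL.txt'),
)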

add_source_dataset(dataset, classifier=None, auto_inherit_properties=False)[source]

Record a source dataset using its metadata document.

It can optionally copy common properties from the source dataset (platform, instrument etc).

(see self.INHERITABLE_PROPERTIES for the list of fields that are inheritable)

Parameters
  • dataset (DatasetDoc) –

  • auto_inherit_properties (bool) – Whether to copy any common properties from the dataset

  • classifier (Optional[str]) –

    How to classify the kind of source dataset. This will automatically be filled with the family of the dataset if available (eg. “level1”).

    You want to set this if you have two datasets of the same type that are used for different purposes, such as a second level1 dataset that was used for QA (but is not this same scene).

See add_source_path() if you have a filepath reference instead of a document.
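A minimal sketch, assuming source_doc and qa_doc are already-loaded DatasetDoc objects (both names, and the qa classifier, are hypothetical):

# Inherit common properties (platform, instrument, datetime ...) from the main source.
p.add_source_dataset(source_doc, auto_inherit_properties=True)

# A second source of the same family, distinguished by a classifier.
p.add_source_dataset(qa_doc, classifier='qa')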

add_source_path(*paths, classifier=None, auto_inherit_properties=False)[source]

Record a source dataset using the path to its metadata document.

Parameters

paths (Path) –

See other parameters in DatasetAssembler.add_source_dataset()

cancel()[source]

Cancel the package, cleaning up temporary files.

This works like DatasetAssembler.close(), but is intentional, so no warning will be raised for forgetting to complete the package first.

close()[source]

Clean up any temporary files, even if the dataset has not been written.
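For example, a sketch of explicitly discarding a package when processing fails (the error handling shown is illustrative; collection and blue_geotiff_path are as in the earlier examples):

p = DatasetAssembler(collection)
try:
    p.product_family = "blues"
    p.write_measurement("blue", blue_geotiff_path)
    p.done()
except Exception:
    # Remove the temporary folder without the "incomplete package" warning.
    p.cancel()
    raise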

done(validate_correctness=True, sort_measurements=True)[source]

Write the dataset and move it into place.

It will be validated, metadata will be written, and if all is correct, it will be moved to the output location.

The final move is done atomically, so the dataset will only exist in the output location if it is complete.

Parameters
  • validate_correctness (bool) – Run the eo3-validator on the resulting metadata.

  • sort_measurements (bool) – Order measurements alphabetically (instead of insertion order).

Raises

IncompleteDatasetError – If any critical metadata is incomplete.

Return type

Tuple[UUID, Path]

Returns

The id and final path to the dataset metadata file.
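For example, capturing the return value:

dataset_id, metadata_path = p.done()
print(f"Wrote dataset {dataset_id} to {metadata_path}")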

extend_user_metadata(section_name, doc)[source]

Record extra metadata from the processing of the dataset.

It can be any document suitable for yaml/json serialisation, and will be written into the sidecar “proc-info” metadata.

This is typically used for recording processing parameters or environment information.

Parameters
  • section_name (str) – Should be unique to your product, and identify the kind of document, eg ‘brdf_ancillary’

  • doc (Dict[str, Any]) – Document
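A minimal sketch (the section name and document contents are hypothetical):

p.extend_user_metadata(
    'brdf_ancillary',
    {'band_1': {'value': 0.123, 'source': 'MODIS'}},
)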

iter_measurement_paths()[source]

Not recommended: this will likely change soon.

Iterate through the list of measurement names that have been written, and their current (temporary) paths.

TODO: Perhaps we want to return a real measurement structure here as it’s not very extensible.

Return type

Generator[Tuple[GridSpec, str, Path], None, None]

property label

An optional displayable string to identify this dataset.

These are often used when presenting a list of datasets, such as in search results or a filesystem folder. They are unstructured, but should be more humane than showing a list of UUIDs.

By convention they have no spaces, due to their usage in filenames.

Eg. ga_ls5t_ard_3-0-0_092084_2009-12-17_final or USGS’s LT05_L1TP_092084_20091217_20161017_01_T1

A label will be auto-generated using the naming-conventions, but you can manually override it by setting this property.

Return type

Optional[str]

note_measurement(name, path, expand_valid_data=True, relative_to_dataset_location=False)[source]

Reference a measurement from its existing file path.

(No data is copied, but geospatial information is read from it.)

Parameters
  • name

  • path (Union[Path, str]) –

  • expand_valid_data

  • relative_to_dataset_location

note_software_version(name, url, version)[source]

Record the version of some software used to produce the dataset.

Parameters
  • name (str) – a short human-readable name for the software. eg “datacube-core”

  • url (str) – A URL where the software is found, such as the git repository.

  • version (str) – the version string, eg. “1.0.0b1”
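For example (the version value shown is hypothetical):

p.note_software_version(
    'datacube-core',
    'https://github.com/opendatacube/datacube-core',
    '1.8.3',
)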

write_measurement(name, path, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)[source]

Write a measurement by copying it from a file path.

Assumes the file is gdal-readable.

Parameters
  • name (str) – Identifier for the measurement eg 'blue'.

  • path (Union[Path, str]) –

  • overviews (Iterable[int]) – Set of overview sizes to write

  • overview_resampling (Resampling) – rasterio Resampling method to use

  • expand_valid_data (bool) – Include this measurement in the valid-data geometry of the metadata.

  • file_id (Optional[str]) – Optionally, how to identify this in the filename instead of using the name. (DEA has measurements called blue, but their written filenames must be band04 by convention.)

write_measurement_numpy(name, array, grid_spec, nodata=None, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)[source]

Write a measurement from a numpy array and grid spec.

The most common case is to copy the grid spec from your input dataset, assuming you haven’t reprojected.

Example:

p.write_measurement_numpy(
    "blue",
    new_array,
    GridSpec.from_dataset_doc(source_dataset),
    nodata=-999,
)

See write_measurement() for other parameters.

Parameters
  • array (ndarray) –

  • grid_spec (GridSpec) –

  • nodata

write_measurement_rio(name, ds, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)[source]

Write a measurement by reading it from an open rasterio dataset.

Parameters

ds (DatasetReader) – An open rasterio dataset

See write_measurement() for other parameters.
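A minimal sketch (the input path is a hypothetical placeholder):

import rasterio

with rasterio.open('/some/path/to/blue.tif') as ds:
    p.write_measurement_rio('blue', ds)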

write_measurements_odc_xarray(dataset, nodata, overviews=(8, 16, 32), overview_resampling=<Resampling.average: 5>, expand_valid_data=True, file_id=None)[source]

Write measurements from an ODC xarray.Dataset

The main requirement is that the Dataset contains a CRS attribute and X/Y or lat/long dimensions and coordinates. These are used to create an ODC GeoBox.

Parameters

dataset (Dataset) – an xarray dataset (as returned by dc.load() and other methods)

See write_measurement() for other parameters.
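A sketch assuming data loaded from an Open Data Cube instance (the product name and query below are hypothetical):

import datacube

dc = datacube.Datacube()
data = dc.load(
    product='ls8_nbar_scene',
    latitude=(-35.3, -35.2),
    longitude=(149.0, 149.2),
    time=('2019-07-01', '2019-07-10'),
)
p.write_measurements_odc_xarray(data, nodata=-999)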

write_thumbnail(red, green, blue, resampling=<Resampling.average: 5>, static_stretch=None, percentile_stretch=(2, 98), scale_factor=10, kind=None)[source]

Write a thumbnail for the dataset using the given measurements (specified by name) as r/g/b.

(the measurements must already have been written.)

A linear stretch is performed on the colour. By default this is a dynamic 2% stretch (the 2% and 98% percentile values of the input). The static_stretch parameter will override this with a static range of values.

Parameters
  • red (str) – Name of measurement to put in red band

  • green (str) – Name of measurement to put in green band

  • blue (str) – Name of measurement to put in blue band

  • kind (Optional[str]) – If you have multiple thumbnails, you can specify the ‘kind’ name to distinguish them (it will be put in the filename). Eg. GA’s ARD has two thumbnails, one of kind nbar and one of nbart.

  • scale_factor (int) – How many multiples smaller to make the thumbnail.

  • percentile_stretch (Tuple[int, int]) – Upper/lower percentiles to stretch by

  • resampling (Resampling) – rasterio rasterio.enums.Resampling method to use.

  • static_stretch (Optional[Tuple[int, int]]) – Use a static upper/lower value to stretch by instead of dynamic stretch.
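A sketch of writing two thumbnails distinguished by kind (the measurement names, kind labels and stretch range are hypothetical, and the measurements are assumed to have been written already):

# Default dynamic 2% stretch.
p.write_thumbnail("red", "green", "blue", kind="nbar")

# A false-colour thumbnail with a fixed stretch range.
p.write_thumbnail("swir1", "nir", "green", kind="false_colour", static_stretch=(0, 3000))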