nats.solutions/posts/netcdf-geotransformation.md

14 KiB

layout title published_date categories tags
post.liquid Correcting an incorrectly georeferenced raster dataset 2024-06-11 12:00:00 +0000
notebook
gis

Objective

To extend the fccsmap Python package to support looking up FCCS fuelbed information within Canada.

Context

fccsmap is a Python package developed by Pacific North West Airfire used to look up FCCS fuelbed data within continental United States and Alaska. It does this by loading a georeferenced raster dataset and checking what value sits at the relevant coordinates.

Prior work includes Miles Epstein adding the dataset fccs_canada.nc to fccsmap and wiring it up for use. However, fccsmap fails to use the dataset; when it's loaded, dependencies complain that it lacks a geotransform matrix. It does indeed have a geotransform matrix, however, it doesn't appear to be recognized by any GIS software I've used. Namely, the dataset has the following property in its header:

Band1#transform={1000,0,-2341249.999999993,0,-1000,3833423.97736665}

Clearly, fccsmap fails to recognize the geotransformation of the dataset, while it does recognize it for the American datasets. Interestingly, PanopolyJ fails to recognize the geotransformations for both the American and Canadian datasets. Meanwhile, QGIS does recognize the geotransformation, however, the coordinates it reports are well off where they should be (the southern tip of Vancouver Island is listed as being almost 30 degrees east of where it's expected to be). Notably, the American FCCS datasets do not have this issue

Definitions and ideas

  • Raster data: data in "raster" form; think raster images. Data is layed out in a grid of "pixels," where each pixel contains or references some data.
  • Georeferencing and geotransformation: the process of applying a matrix transformation (the "geotransformation") to the raster grid in order to make the data correspond to geographic coordinates. Raster data is essentially a matrix, and we can apply matrix transformations to it accordingly.

Approach 1: Upgrading to a later version of GDAL

When using fccs_canada.nc, you get the following error:

>>> ca_lookup = FccsLookUp(is_canada=True)
>>> ca_lookup.filename
'/usr/local/lib/python3.8/dist-packages/fccsmap/data/fccs_canada.nc'
>>> ca_lookup.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
/usr/local/lib/python3.8/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/fccsmap/lookup.py", line 179, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.8/dist-packages/fccsmap/lookup.py", line 305, in _look_up
    stats = zonal_stats(s, self.gridfile_specifier,
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/main.py", line 31, in zonal_stats
    return list(gen_zonal_stats(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/main.py", line 156, in gen_zonal_stats
    fsrc = rast.read(bounds=geom_bounds)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 289, in read
    win = bounds_window(bounds, self.affine)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 150, in bounds_window
    row_start, col_start = rowcol(w, n, affine)
  File "/usr/local/lib/python3.8/dist-packages/rasterstats/io.py", line 141, in rowcol
    r = int(op((y - affine.f) / affine.e))
OverflowError: cannot convert float infinity to integer

This error is reminiscent of this issue in the rioxarray package, which the developers identified as being a bug with gdal, fixed in this pull request. It's possible updating to gdal version 3.6.1 would fix the issue. I am indeed using an outdated version of GDAL.

To test this, I decided to install the latest version of fccsmap using the latest version of GDAL in a Docker container, using the following Dockerfile:

FROM ubuntu:22.04

RUN cd /root

# installing tzdata, which is a dependency of one of the following packages,
# needs a little configuration to be run non-interactively. This manually
# sets the timezone data it expects
RUN ln -fs /usr/share/zoneinfo/America/Vancouver /etc/localtime

# Setting DEBIAN_FRONTEND here prevents undesired interactive prompts when
# installing certain packages (namely tzdata)
RUN apt-get update && DEBIAN_FRONTEND=noninteractive \
    apt-get install -y \
    python3 python3-dev python3-pip python3-venv\
    libnetcdf-dev libproj-dev libgdal-dev \
    python3-numpy python3-gdal \
    libxml2-dev libxslt1-dev

# Create and activate python virtual environment
RUN python3 -m venv /root/venv
ENV PATH="/root/venv/bin/activate:$PATH"

# Install pip dependencies
RUN pip install numpy
RUN export CPLUS_INCLUDE_PATH=/usr/include/gdal
RUN export CPLUS_INCLUDE_PATH=/usr/include/gdal
RUN pip install gdal==`gdal-config --version`

RUN pip install --no-binary gdal --extra-index https://pypi.airfire.org/simple fccsmap==4.1.2

Unfortunately, using the latest version of GDAL didn't resolve the issue, and it failed with a similar error.

Approach 2: Identifying why the geotransformation isn't recognized

Why isn't GDAL recognizing the geotransformation? I suspect it's because the netCDF driver is searching for it somewhere it's not. To find out, we can trace the callback stack to see where specifically it's looking. Recall from earlier:

>>> ca_lookup.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
/usr/local/lib/python3.8/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.

Following the chain of dependencies, we find this line which seems to be throwing the warning:

# rasterio_v1.3/__init__.py
dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

The DatasetReader class can in turn be found in rasterio/io.py, which extends DatasetReaderBase in rasterio/_io.pxd. DatasetReaderBase extends DatasetBase and this is pretty much where our journey in Pythonland ends because from here on out everything's defined in the C++ GDAL library.

We can get a bit closer by searching the codebase for NotGeoreferencedWarning, which takes us to rasterio/_base.pyx and the read_transform method:

def read_transform(self):
    """Return the stored GDAL GeoTransform"""
    cdef double gt[6]

    if self._hds == NULL:
        raise ValueError("Null dataset")
    err = GDALGetGeoTransform(self._hds, gt)
    if err == GDALError.failure and not self._has_gcps_or_rpcs():
        warnings.warn(
            ("Dataset has no geotransform, gcps, or rpcs. "
            "The identity matrix will be returned."),
            NotGeoreferencedWarning)

    return [gt[i] for i in range(6)]

Some earlier investigation suggests that bypassing this issue by georeferencing the data with GCPs isn't an option, but maybe it's worth looking into RPCs if this rabbit hole doesn't take us anywhere

The key here is GDALGetGeoTransform, which is in fact a reference to a C function; one that takes a representation of a dataset and a pointer to a location in memory where the geotransform matrix will be loaded. It's defined in the gdal.h header

This is effectively a dead end for me; tracing the GDAL codebase adds a lot of complexity to this project, particularly now that I have a few other leads that might work better

Approach 3: Manually configuring the GeoTransform

Approach 2 taught me a lot about how rasterio and GDAL work under the hood. fccsmap exposes the GDAL dataset reader, and I might be able to leverage that to dynamically configure the dataset. Having done that, I might be able to export it as well.

import rioxarray
import rasterio
from fccsmap.lookup import FccsLookUp
old_ds_filename = FccsLookUp(is_canada=True)._filename
old_ds = rasterio.open(old_ds_filename)
## /usr/local/lib/python3.10/dist-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
##   dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)

# One-liner:
# new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, count=old_ds.count, nodata=0, dtype='float32', crs=old_ds.crs, transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))

new_ds = rasterio.open('./modified_fccs_canada.nc', 'w',
  driver='netCDF',
  dtype='float32',
  width=old_ds.width,
  height=old_ds.height,
  crs=old_ds.crs,
  transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665),
  count=old_ds.count,
  nodata=0)
## Traceback (most recent call last):
##   File "<stdin>", line 1, in <module>
##   File "/usr/local/lib/python3.10/dist-packages/rasterio/env.py", line 451, in wrapper
##     return f(*args, **kwds)
##   File "/usr/local/lib/python3.10/dist-packages/rasterio/__init__.py", line 230, in open
##     raise RasterioIOError(
## rasterio.errors.RasterioIOError: Blacklisted: file cannot be opened by driver 'netCDF' in 'w' mode

rasterio raises such a RasterioIOError when the driver doesn't support writing as described in the Writing Datasets section of the rasterio documentation. The implications of this aren't immediately obvious; if you leave off the driver='netCDF' line, this code works fine. Presumably, in this case, it's not writing in NetCDF format. Instead, it's probably writing in GeoTIFF, which is better supported (new lead: can we use GTIFF instead of NetCDF?).

To quote the documentation:

Some other formats that are writable by GDAL can also be written by Rasterio. These use an IndirectRasterUpdater which does not create directly but uses a temporary in-memory dataset and GDALCreateCopy() to produce the final output.

The Raster drivers page in the GDAL documentation lists the NetCDF driver as one supporting "Creation," though it doesn't elaborate on what that means. The documentation for the driver itself claims the NetCDF driver supports both reading and writing.

If you run the same command, dropping the driver kwarg, you get:

new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, crs=old_ds.crs, count=old_ds.count, nodata=0, dtype='float32', transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))
new_ds.driver
## 'netCDF'

Maybe this is a bug? We're going to run with this for now.

From here, we can get the raster data of fccs_canada.nc and write it to our new dataset:

raster = rioxarray.open_rasterio(old_ds)
new_ds.write(raster)
new_ds.close()

l = FccsLookUp(fccs_fuelload_file='./modified_fccs_canada.nc')
l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

This has brought us back to where we started. However, I lucked out and made a discovery after landing here. I remembered that this error is otherwise a result of looking up a value outside the domain of the dataset. For example, let's take the known working American continental dataset:

us = FccsLookUp()
us.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

This fails because the request is being made for a point in Canada—a point in space outside continental US.

Previously, I'd noticed that when I loaded the dataset in QGIS, it seemed like the coordinates more eastern than they should have been. In light of this, I tried:

l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -80]]})
{'fuelbeds': OrderedDict([('0.0', {'percent': 100.0, 'grid_cells': 480249})]), 'units': 'm^2', 'sampled_grid_cells': 480249, 'sampled_area': 240128428465.43958}

Of course, the next thing that came to mind was to check this on the existinng fccs_canada.nc dataset:

ca = FccsLookUp(is_canada=True)
ca.look_up({'type': 'MultiPoint', 'coordinates':[[51, -80]]})
{'fuelbeds': OrderedDict([('0.0', {'percent': 100.0, 'grid_cells': 480249})]), 'units': 'm^2', 'sampled_grid_cells': 480249, 'sampled_area': 240128428465.43958}
## Traceback (most recent call last):
##   [...]
## OverflowError: cannot convert float infinity to integer

In the end, finicking with the dataset was not in fact a red herring. We've marshalled the dataset into working with fccsmap. For completeness, here's the code to generate it:

import rioxarray
import rasterio
from fccsmap.lookup import FccsLookUp
old_ds_filename = FccsLookUp(is_canada=True)._filename
old_ds = rasterio.open(old_ds_filename)
new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, crs=old_ds.crs, count=old_ds.count, nodata=0, dtype='float32', transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))
raster = rioxarray.open_rasterio(old_ds)
new_ds.write(raster)
old_ds.close()
new_ds.close()

Remaining question: what will it take to fix the offset? Resolving this and patching it into fccsmap will completely resolve the problem.