[DRAFT 2023-10-01]
Overview
GDAL implements read/write memory caching for raster I/O. Caching
operates on raster blocks and offers potential for substantial
performance improvement when accessing pixel data across block
boundaries. In an analytical context where processing may be
row-oriented, this means that I/O can be efficient even when rows
intersect many tiles of a raster arranged in square blocks (as opposed
to blocks arranged as one whole row). Consideration of the caching
mechanism is helpful for scaling I/O to large datasets that need to be
processed in many chunks. This article will describe the operation of
the caching mechanism, and relative performance when accessing data by
row or by tile in relation to different raster block arrangements.
Implications for configuring cache memory size with the
GDAL_CACHEMAX
setting will be described. Focus here is on
reading pixel data, but similar concepts apply to writing as well.
Relative performance
A dataset containing 16-bit integer elevation at 30-m pixel
resolution for the conterminous US was obtained from LANDFIRE. The version is “LF 2020
[LF 2.2.0]” which is available as an 8.4 GB download. The download
includes raster overviews (as .ovr), but the elevation raster itself is
a 6.8 GB GeoTIFF file using LZW compression on 128 x 128 tiles. The
direct download link for LF 2020 elevation is:
https://landfire.gov/bulk/downloadfile.php?FNAME=US_Topo_2020-LF2020_Elev_220_CONUS.zip&TYPE=landfire
Tests were run on a laptop with Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, 8 GB RAM and SSD, Ubuntu 22.04.2, R 4.3.0, gdalraster 1.2.1 (dev), GDAL 3.6.2.
Open the elevation dataset and get parameters:
library(gdalraster)
f = "LC20_Elev_220.tif"
ds = new(GDALRaster, f, read_only = TRUE)
ncols = ds$getRasterXSize()
nrows = ds$getRasterYSize()
print(paste("Size is", ncols, "x", nrows)) # 1.587394e+10 pixels
#> [1] "Size is 156335 x 101538"
ds$getMetadata(band=0, domain="IMAGE_STRUCTURE")
#> [1] "COMPRESSION=LZW" "INTERLEAVE=BAND"
ds$getBlockSize(band=1)
#> [1] 128 128
ds$getDataTypeName(band=1)
#> [1] "Int16"
The first test reads all pixels in the raster by row. For a tile size of 128 x 128 pixels, each row intersects 1222 raster blocks (156335 / 128 = 1221.4, the last block is incomplete). This test reflects performance implications of GDAL read-ahead caching:
process_row = function(row) {
r = ds$read(band=1, xoff=0, yoff=row,
xsize=ncols, ysize=1,
out_xsize=ncols, out_ysize=1)
## process pixels, possibly write output...
return()
}
## Test 1
## original tiled raster, reading by row (across block boundaries)
system.time( lapply(0:(nrows-1), process_row) )
#> user system elapsed
#> 228.269 11.319 242.195
ds$close()
For comparison, we will read the same data from a raster arranged
with blocks as whole rows (efficient for row-level access).
gdalraster::createCopy()
copies a raster dataset with
optional changes to the format. The extent, number of bands, data type,
projection, and geotransform are all copied from the source raster:
f2 = "LC20_Elev_220_striped.tif"
opt = c("COMPRESS=LZW", "TILED=NO", "BLOCKYSIZE=1", "BIGTIFF=YES")
gdalraster::createCopy(format="GTiff", dst_filename=f2, src_filename=f,
options=opt)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
ds2 = new(GDALRaster, f2, read_only = TRUE)
ds2$getBlockSize(band=1)
#> [1] 156335 1
ds2$close()
This creates a “striped” tif with raster blocks arranged for
row-level access (TILED=NO
is the default creation option
for the GTiff
format driver). The resulting file is larger at 10.6 GB vs. 6.8 GB,
since compression is not efficient for strips vs. tiles. Substituting
the new file (f2
) in the test above gives the following
time to read all pixels by row:
## Test 2
## striped tif, reading on block boundaries (rows)
## cache retrieval not involved
system.time( lapply(0:(nrows-1), process_row) )
#> user system elapsed
#> 230.650 5.112 236.370
A final test reads by block on the original tiled raster as distributed by LANDFIRE. To read by square blocks we need to calculate the row/column offsets and x/y sizes for each tile, including the incomplete tiles along the right and bottom edges. Package stars provides a helper function that we will use for this. stars represents raster data using R’s 1-based indexing and column-major array order (raster rows as array columns in R). Accounting for those differences from native GDAL indexing, we get a matrix of block boundaries using:
library(stars)
r = read_stars(f, proxy = TRUE)
nrows = nrow(r)
ncols = ncol(r)
blocks = st_tile(nrows, ncols, 128, 128)
blocks[, 1:2] = blocks[, 1:2] - 1L
nrow(blocks)
#> [1] 970268
In terms of expected efficiency, reading the tiled raster by block is
similar to reading the striped raster by row (reading on block
boundaries, no retrieval from cache). The difference is that the striped
tif contains fewer but larger blocks (101538 blocks, 156335 pixels per
block), while the tiled tif contains order of magnitude more blocks that
are each smaller (970268 blocks, 16384 pixels per block). This test
reads all pixels by tile from the original LANDFIRE elevation file
(f
):
## Test 3
## original tiled raster, reading on block boundaries (tiles)
## cache retrieval not involved
system.time({
for (i in seq_len(nrow(blocks))) {
ds$read(1, blocks[i, 1], blocks[i, 2], blocks[i, 3], blocks[i, 4],
blocks[i, 3], blocks[i, 4])
}
})
#> user system elapsed
#> 237.920 7.165 251.200
Description of cache operation
GDAL block caching enables reading a large tiled raster efficiently
by row (1.6e+10 total pixels in the test dataset). The default size
limit of the memory cache is 5% of usable physical RAM. Each row of the
LANDFIRE tiled raster intersects 1222 blocks of size 128 x 128. Each
intersected block is read from file, decoded from LZW compression, and
placed in cache memory. The data for each successive read()
that intersects the same block are retrieved from cache. Caching all of
the intersected blocks requires 128 x 128 x 2 bytes = 32768 bytes per
block, 32768 x 1222 = 40042496 bytes, approximately 40 MB. All of the
decoded block data for a row can be held in cache in this case, meaning
that only 1 out of every 128 row-level read()
involves
retrieval from file and decoding of compressed blocks. The other 127/128
are provided from cache. Memory is recovered when a request for a new
cache block would put cache memory use over the established limit (least
recently used blocks are flushed from cache to accommodate adding new
blocks).
The code below uses the function
gdalraster::get_cache_used()
to demonstrate the
behavior:
## run in a new R session
library(gdalraster)
f = "LC20_Elev_220.tif"
ds = new(GDALRaster, f, read_only = TRUE)
ncols = ds$getRasterXSize()
nrows = ds$getRasterYSize()
## Default cache size is approximately 400 MB in this case (5% of 8 GB RAM)
## Read enough data to reach cache max
rows_read = 0
cache_use = get_cache_used()
for (row in 0:1536) {
rowdata = ds$read(1, 0, row, ncols, 1, ncols, 1)
rows_read = c(rows_read, row+1)
cache_use = c(cache_use, get_cache_used())
}
get_cache_used()
#> [1] 401
ds$close()
get_cache_used()
#> [1] 0
plot(rows_read, cache_use, type="S",
xlab="rows read", ylab="cache in use (MB)",
col="blue", lwd=2, xaxt="n")
axis(1, at = seq(0, 1536, by = 128), las=2)
While the examples above focus on reading a tiled raster by row, similar considerations would apply to processing large row-oriented rasters in 2-D chunks.
Configuring cache size
The cache size limit can be set with the GDAL_CACHEMAX
configuration option, e.g.,
## set to a specific size in MB
gdalraster::set_config_option("GDAL_CACHEMAX", "1000")
## or percent of physical RAM
gdalraster::set_config_option("GDAL_CACHEMAX", "20%")
Note that the size limit of the block cache is set upon first use
(first I/O). Setting GDAL_CACHEMAX
after that point will
not resize the cache. It is a per-session setting. If
GDAL_CACHEMAX
has not been configured upon first use of the
cache, then the default cache size will be in effect for the current
session.
I/O that involves block caching with large datasets may require
setting GDAL_CACHEMAX
larger than the default. If the
LANDFIRE elevation raster were tiled at 256 x 256, then each block would
require 65536 x 2 = 131072 bytes for 16-bit data. The cache size needed
to hold all intersected blocks for a row would be approximately 160 MB
(and likewise, 640 MB for 512 x 512 tiles). Similarly, the cache size
could be configured for the case of multiple large rasters that need to
be read (or written) simultaneously for processing.
The cache is flushed upon dataset closing to recover memory. The
behavior described above assumes that the GDAL dataset is opened once,
and required I/O completed before closing the dataset. This would
normally be the case when using the GDAL API via gdalraster
(GDALRaster-class
encapsulates a GDALDataset
object and its associated GDALRasterBand
objects in the
underlying API).
It is also worth noting that without the block caching mechanism, it is not possible to read the tiled elevation raster by row in reasonable time. This can be checked by repeating Test 1 above with the cache disabled:
## Test 4
## original tiled raster, reading by row (across block boundaries)
## cache disabled for testing
## run in a new R session
library(gdalraster)
## for testing only
set_config_option("GDAL_CACHEMAX", "0")
f = "LC20_Elev_220.tif"
ds = new(GDALRaster, f, read_only = TRUE)
ncols = ds$getRasterXSize()
nrows = ds$getRasterYSize()
process_row = function(row) {
r = ds$read(band=1, xoff=0, yoff=row,
xsize=ncols, ysize=1,
out_xsize=ncols, out_ysize=1)
return()
}
system.time( lapply(0:(nrows-1), process_row) )
#> ^C
#> Timing stopped at: 3650 42.97 3694 # killed with ctrl-c
ds$close()