---
title: "LPJmL Data"
author: "Jannes Breier"
date: "`r Sys.Date()`"
output:
  html_document: default
  # pdf_document: default
  md_document:
    variant: gfm
knit: (function(inputFile, encoding){
  rmarkdown::render(inputFile, encoding = encoding,
  output_format = "all") })
vignette: >
  %\VignetteIndexEntry{LPJmL Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

**LPJmL Data** `r if(!knitr::is_latex_output()){"&#128190;"}` is an lpjmlkit
module built around the data class `LPJmLData`. It facilitates reading and
processing LPJmL inputs and outputs by combining the raw data with the
available meta data (from meta files, header files or manual specification),
which avoids a large overhead for the user. It further enables easy subsetting,
transformations and basic statistics of the data and allows export to common
data formats.  
Example data files can be downloaded from <https://doi.org/10.5281/zenodo.12915168>.

&nbsp;

## Overview

LPJmL Data first requires reading LPJmL input or output data into R with the
`read_io()` function (1). The returned object is of class `LPJmLData` (2), for
which basic statistics can be calculated (3) and whose inner data can be
modified (4) or exported (5) to common data formats.

#### **1. `r if (!knitr::is_latex_output()) {"&#128214;"}` Read function read_io**\
  \
  `read_io()` is a generic function to read LPJmL input and output files. It
  currently supports three different file formats, "meta", "clm" and "raw":

* **"meta"** - *Easy to use and strongly recommended*.  
  Set `"output_metafile" : true` in your LPJmL run configuration to
  generate output files in "meta" format. LPJmL input files can also be created
  in "meta" format.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  # Read monthly runoff with meta file.
  runoff <- read_io("./output/runoff.bin.json")
  ```

* **"clm"** - *Use if "meta" is not available or in combination*.  
  Most LPJmL input files use "clm" format. To write output files in "clm" format
  set `"fmt" : "clm"` in your LPJmL run configuration. Some optional meta
  data (e.g. `band_names`) need to be specified manually while the basic
  information about file structure is derived automatically from the file header.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  # Read monthly runoff data with header.
  runoff <- read_io("./output/runoff.clm",
                    # If the clm version is lower than 4, set nbands and nstep
                    # manually so that the month dimension is recognized correctly.
                    nbands = 1,
                    nstep = 12,
                    # Useful additional information that is not needed to read
                    # the data.
                    variable = "runoff",
                    descr = "monthly runoff",
                    unit = "mm/month")
  ```

* **"raw"** - *Not recommended for use (with lpjmlkit)*.  
  By default, LPJmL output files are written as "raw" files
  (`"fmt" : "raw"` in your LPJmL run configuration). These files include no meta
  data about their structure or contents and should therefore be combined with
  the `"output_metafile" : true` setting to generate a corresponding "meta" file.
  Otherwise, all meta data need to be specified by the user. Historically, some
  LPJmL input files use "raw" format.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  # Read monthly runoff raw data.
  runoff <- read_io("./output/runoff.bin",
                    # Specify all meta data if they differ from the function
                    # default values.
                    ...)
  ```

&nbsp;
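
Because "raw" files carry no header, a quick plausibility check before supplying meta data manually is to compare the file size on disk with the size implied by the meta data. The following is a minimal base R sketch, assuming that a "raw" file is a flat sequence of values (one per cell, band, time step and year) and that the "float" data type occupies 4 bytes. The helper `expected_raw_size()` is hypothetical and not part of lpjmlkit; the numbers are taken from the monthly runoff meta data printed in section 2.

```r
# Hypothetical helper (not part of lpjmlkit): number of bytes a headerless
# "raw" file should have, assuming it stores one value per cell, band, time
# step and year and nothing else.
expected_raw_size <- function(ncell, nbands, nstep, nyear, bytes_per_value) {
  ncell * nbands * nstep * nyear * bytes_per_value
}

# Monthly runoff: 67420 cells, 1 band, 12 steps per year, 111 years,
# datatype "float" (assumed to be 4 bytes per value).
expected <- expected_raw_size(67420, 1, 12, 111, 4)
expected
# [1] 359213760

# Compare with the size on disk before reading (path is illustrative):
# file.size("./output/runoff.bin") == expected
```

If the two numbers differ, at least one of the manually specified meta data values is likely wrong.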

#### **2. `r if (!knitr::is_latex_output()) {"&#128193;"}` Data class LPJmLData**\
  \
  `read_io()` returns an object of the R6 class `LPJmLData` with two main
  attributes, `$data` and `$meta`:

* **\$data** An array of class `base::array` - returns the data array with the
  default dimensions "cell", "time" and "band".
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  runoff$data
  #     , , band = 1
  #
  #          time
  # cell       1901-01-31    1901-02-28    1901-03-31    1901-04-30
  #   0      2.427786e+02  1.265680e+02  2.279087e+02  2.027685e+02
  #   1      4.189225e-14  1.032507e-16  0.000000e+00  0.000000e+00
  #   2      3.860512e-14  0.000000e+00  0.000000e+00  0.000000e+00
  #   3      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
  #   4      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
  #   5      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
  ```
* **\$meta** Meta data of class `LPJmLMetaData` - returns the corresponding meta
  data (e.g. `runoff$meta$unit`)
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  runoff$meta
  # $sim_name "lu_cf"
  # $source "LPJmL C Version 5.3.001"
  # $history "./LPJmL_internal/bin/lpjml ./configurations/config_lu_cf.json"
  # $variable "runoff"
  # $descr "monthly runoff"
  # $unit "mm/month"
  # $nbands 1
  # $nyear 111
  # $firstyear 1901
  # $lastyear 2011
  # $nstep 12
  # $timestep 1
  # $ncell 67420
  # $firstcell 0
  # $cellsize_lon 0.5
  # $cellsize_lat 0.5
  # $datatype "float"
  # $scalar 1
  # $order "cellseq"
  # $bigendian FALSE
  # $format "raw"
  # $filename "runoff.bin"
  ```
&nbsp;
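
Since `$data` is an ordinary base array with named dimensions, the usual base R indexing applies to it. The sketch below uses a small toy array as a stand-in for `runoff$data`; the values are invented for illustration and do not come from an LPJmL run.

```r
# Toy stand-in for runoff$data: 3 cells, 2 time steps, 1 band.
# Values are invented for illustration only.
toy <- array(
  c(1, 2, 3, 4, 5, 6),
  dim = c(3, 2, 1),
  dimnames = list(cell = c("0", "1", "2"),
                  time = c("1901-01-31", "1901-02-28"),
                  band = "1")
)

# Index by dimension names (character) ...
by_name <- toy["1", "1901-02-28", "1"]
# ... or by position (R indices start at 1, cell names start at 0).
by_index <- toy[2, 2, 1]

identical(unname(by_name), unname(by_index))
# [1] TRUE
```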

#### **3. `r if (!knitr::is_latex_output()) {"&#128200;"}` Basic statistics of LPJmLData objects**\
  \
  To get an overview of the data, `LPJmLData` supports the base functions
  `length()`, `dim()`, `dimnames()`, `summary()` and `plot()`.
  *More methods can be added in the future.*

```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Self print; also via print(runoff).
runoff
# $meta |>
#   .$sim_name "lu_cf"
#   .$variable "runoff"
#   .$descr "monthly runoff"
#   .$unit "mm/month"
#   .$nbands 1
#   .$nyear 111
#   .$nstep 12
#   .$timestep 1
#   .$ncell 67420
#   .$cellsize_lon 0.5
#   .$cellsize_lat 0.5
# Note: not printing all meta data, use $meta to get all.
# $data |>
#   dimnames() |>
#     .$cell  "0" "1" "2" "3" ... "67419"
#     .$time  "1901-01-31" "1901-02-28" "1901-03-31" "1901-04-30" ... "2011-12-31"
#     .$band  "1"
# $summary()
#        1
#  Min.   :   0.0000
#  1st Qu.:   0.0619
#  Median :   4.4320
#  Mean   :  28.7658
#  3rd Qu.:  27.5627
#  Max.   :2840.9602
# Note: summary is not weighted by grid area.

# Return the dimension lengths of the $data array; dimnames() is also available.
dim(runoff)
#  cell  time  band
# 67420  1332     1

# Plot as maps or time series, depending on the dimensions.
plot(runoff)
```

&nbsp;
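
The printed summary is not weighted by grid area. On a regular longitude/latitude grid, cell area shrinks approximately with the cosine of latitude, so an area-weighted mean can be approximated in base R. The sketch below uses invented values and latitudes; in practice the latitudes would come from the grid (see `add_grid()` in section 4), and the cosine weighting is an approximation, not lpjmlkit functionality.

```r
# Invented example: one value per cell plus each cell's centre latitude.
values <- c(10, 20, 30)
lat <- c(0.25, 45.25, 75.25)

# On a regular lon/lat grid, cell area is approximately proportional to
# cos(latitude), so cos(lat) can serve as a relative area weight.
weights <- cos(lat * pi / 180)

unweighted <- mean(values)
weighted <- weighted.mean(values, w = weights)

unweighted
# [1] 20
weighted < unweighted  # high-latitude cells count less
# [1] TRUE
```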

#### **4. `r if (!knitr::is_latex_output()) {"&#9999;"}` Modify LPJmLData objects**\
  \
  Each LPJmLData object comes with a bundle of methods to modify its state: 
  `add_grid()`, `transform()` and `subset()`.

* **`r if (!knitr::is_latex_output()) {"&#128205;"}` `add_grid()`** Adds a
  **$grid** attribute (as an LPJmLData object) to the object, providing the spatial
  reference (longitude and latitude) for each cell.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Object-oriented (R6 class) notation (assigning grid directly to runoff)
runoff$add_grid()

# Common R notation (overwriting the original object)
runoff <- add_grid(runoff)

# Use the read_io arguments if a grid file cannot be detected automatically.
runoff <- add_grid(runoff, "./output/grid.clm")
  ```

* **`r if (!knitr::is_latex_output()) {"&#128257;"}` `transform()`** the `$data` dimensions.  
  Transforms the spatial dimension from "cell" to "lon" (longitude) and "lat"
  (latitude) or the temporal dimension "time" into separate "year", "month", and
  "day" dimensions. Combinations and back transformations are also possible.
  Transformation into the format "lon_lat" requires a `$grid` attribute (see `add_grid()` above).
  Transformations do not change the contents of the data, only their structure.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Transform into lon and lat dimensions. If add_grid has not been executed
#   before it is called implicitly.
runoff <- transform(runoff, to = "lon_lat")
runoff
# [...]
# $data |>
#   dimnames() |>
#     .$lat  "-55.75" "-55.25" "-54.75" "-54.25" ... "83.75"
#     .$lon  "-179.75" "-179.25" "-178.75" "-178.25" ... "179.75"
#     .$time  "1901-01-31" "1901-02-28" "1901-03-31" "1901-04-30" ... "2011-12-31"
#     .$band  "1"
# [...]

# Transform into year and month dimensions (day not available for monthly
#   runoff)
runoff <- transform(runoff, to = "year_month_day")
runoff
# [...]
# $data |>
#   dimnames() |>
#     .$lat  "-55.75" "-55.25" "-54.75" "-54.25" ... "83.75"
#     .$lon  "-179.75" "-179.25" "-178.75" "-178.25" ... "179.75"
#     .$month  "1" "2" "3" "4" ... "12"
#     .$year  "1901" "1902" "1903" "1904" ... "2011"
#     .$band  "1"
# [...]

# Transform back to original dimensions.
runoff <- transform(runoff, to = c("cell", "time"))
runoff
# [...]
# $data |>
#   dimnames() |>
#     .$cell  "0" "1" "2" "3" ... "67419"
#     .$time  "1901-01-31" "1901-02-28" "1901-03-31" "1901-04-30" ... "2011-12-31"
#     .$band  "1"
# [...]
  ```

* **`r if (!knitr::is_latex_output()) {"&#9986;"}` `subset()`** the `$data`.  
  Use `$data` dimensions as keys and names or indices as values to subset `$data`.
  `$meta` data are adjusted according to the subset. Applying a subset changes
  the contents of the data and cannot be reversed.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Subset by dimnames (character string).
runoff <- subset(runoff, time = "1991-05-31")
runoff
# $meta |>
#   .$nyear 1
#   .$ncell 67420
#   .$subset TRUE
# [...]
# Note: not printing all meta data, use $meta to get all.
# $data |>
#   dimnames() |>
#     .$cell  "0" "1" "2" "3" ... "67419"
#     .$time  "1991-05-31"
#     .$band  "1"
# [...]

# Subset by indices
runoff <- subset(runoff, cell = 28697:28700)
runoff
# $meta |>
#   .$nyear 1
#   .$ncell 4
#   .$subset TRUE
# [...]
# Note: not printing all meta data, use $meta to get all.
# $data |>
#   dimnames() |>
#     .$cell  "28696" "28697" "28698" "28699"
#     .$time  "1991-05-31"
#     .$band  "1"
# [...]
  ```
&nbsp;
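
Two points from this section can be illustrated on a plain array: a transformation only restructures the dimensions, and subsetting by index uses 1-based positions while cell names start at 0. The following base R sketch mimics this behaviour with invented values; it is not the lpjmlkit implementation.

```r
# Toy array: 4 cells x 24 monthly time steps (2 years), invented values.
n_cell <- 4
n_year <- 2
toy <- array(seq_len(n_cell * 12 * n_year),
             dim = c(n_cell, 12 * n_year),
             dimnames = list(cell = as.character(0:(n_cell - 1)),
                             time = NULL))

# Splitting "time" into "month" and "year" only changes the dim attribute;
# the underlying values and their order stay the same.
reshaped <- toy
dim(reshaped) <- c(n_cell, 12, n_year)
dimnames(reshaped) <- list(cell = as.character(0:(n_cell - 1)),
                           month = as.character(1:12),
                           year = c("1901", "1902"))

identical(as.vector(toy), as.vector(reshaped))
# [1] TRUE

# March 1902 is time step 15 in the original layout.
toy["2", 15] == reshaped["2", "3", "1902"]
# [1] TRUE

# Index vs. name: position 2 along "cell" is the cell named "1".
dimnames(toy[2:3, , drop = FALSE])$cell
# [1] "1" "2"
```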

#### **5. `r if (!knitr::is_latex_output()) {"&#128230;"}` Export LPJmLData objects**\
  \
  Finally, LPJmLData objects can be exported into common R data formats:
  `array`, `tibble`, `raster` and `terra`.  
  *More export methods can be added in the future.*

* **`as_array()`** Export `$data` as an array. In addition to simply returning the
  `$data` element of an `LPJmLData` object, `as_array()` provides functionality to
  subset and aggregate `$data`. Subsetting is conducted before aggregation.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  # Export as an array with subset of first 6 time steps and aggregation along
  #   the dimension cell (mean).
  as_array(runoff,
           subset = list(time = 1:6),
           aggregate = list(cell = mean))
  #             band
  # time                1
  #   1901-01-31 19.49611
  #   1901-02-28 20.28368
  #   1901-03-31 27.93595
  #   1901-04-30 36.90505
  #   1901-05-31 39.38885
  #   1901-06-30 32.80252
  ```

* **`as_tibble()`** Export `$data` as a
  [`tibble`](https://tibble.tidyverse.org/) object, providing the same
  additional subsetting and aggregation functionality as `as_array()`.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
  # Export as a tibble with subset of first 6 time steps
  as_tibble(runoff, subset = list(time = 1:6))
  # # A tibble: 404,520 × 4
  #    cell  time       band  value
  #    <fct> <fct>      <fct> <dbl>
  #  1 0     1901-01-31 1      184.
  #  2 1     1901-01-31 1        0
  #  3 2     1901-01-31 1        0
  #  4 3     1901-01-31 1        0
  #  5 4     1901-01-31 1        0
  #  6 5     1901-01-31 1        0
  #  7 6     1901-01-31 1        0
  #  8 7     1901-01-31 1        0
  #  9 8     1901-01-31 1        0
  # 10 9     1901-01-31 1        0
  # # … with 404,510 more rows
  ```

* **`r if (!knitr::is_latex_output()) {"&#127760;"}` `as_raster()`** / **`as_terra()`** Export `$data` as a
  [`raster`](https://rspatial.github.io/raster/reference/raster-package.html)
  or a [`terra`](https://rspatial.org/) object (successor of raster), providing
  the same additional subsetting and aggregation functionality as `as_array()`.
  `as_raster()` returns a `RasterLayer` for a single data field and a
  `RasterBrick` if the result contains more than one band or more than one
  time step.
  ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Export the first time step as a RasterLayer object from the raster package.
as_raster(runoff, subset = list(time = 1))
# class      : RasterLayer
# dimensions : 280, 720, 201600  (nrow, ncol, ncell)
# resolution : 0.5, 0.5  (x, y)
# extent     : -180, 180, -56, 84  (xmin, xmax, ymin, ymax)
# crs        : +proj=longlat +datum=WGS84 +no_defs # nolint:commented_code_linter.
# source     : memory
# names      : runoff
# values     : -1.682581e-13, 671.8747  (min, max)

# Export the first time step as a terra SpatRaster object.
as_terra(runoff, subset = list(time = 1))
# class       : SpatRaster
# dimensions  : 280, 720, 1  (nrow, ncol, nlyr)
# resolution  : 0.5, 0.5  (x, y)
# extent      : -180, 180, -56, 84  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# source      : memory
# name        :        runoff
# min value   : -1.682581e-13
# max value   :  6.718747e+02
# unit        :      mm/month # nolint:commented_code_linter.

# Export the first 4 time steps as a RasterBrick object.
as_raster(runoff, subset = list(time = 1:4))
# class      : RasterBrick
# dimensions : 280, 720, 201600, 4  (nrow, ncol, ncell, nlayers)
# resolution : 0.5, 0.5  (x, y)
# extent     : -180, 180, -56, 84  (xmin, xmax, ymin, ymax)
# crs        : +proj=longlat +datum=WGS84 +no_defs # nolint:commented_code_linter.
# source     : memory
# names      :   X1901.01.31,   X1901.02.28,   X1901.03.31,   X1901.04.30
# min values : -1.682581e-13, -1.750495e-13, -2.918900e-13, -1.516298e-13
# max values :      671.8747,      785.2363,      828.2853,      987.4359

# Export the first 4 time steps as a terra SpatRaster object.
as_terra(runoff, subset = list(time = 1:4))
# class       : SpatRaster
# dimensions  : 280, 720, 4  (nrow, ncol, nlyr)
# resolution  : 0.5, 0.5  (x, y)
# extent      : -180, 180, -56, 84  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# source      : memory
# names       :    1901-01-31,    1901-02-28,    1901-03-31,    1901-04-30
# min values  : -1.682581e-13, -1.750495e-13, -2.918900e-13, -1.516298e-13
# max values  :  6.718747e+02,  7.852363e+02,  8.282853e+02,  9.874359e+02
# unit        :      mm/month,      mm/month,      mm/month,      mm/month
# time (days) : 1901-01-31 to 1901-04-30
  ```
&nbsp;
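
The combination of `subset` and `aggregate` in the export functions can be pictured with base R. The sketch below assumes that `aggregate = list(cell = mean)` collapses the "cell" dimension by applying the function over the remaining dimensions; it uses a toy array with invented values and is not the lpjmlkit implementation.

```r
# Toy array: 3 cells x 8 time steps x 1 band, invented values.
toy <- array(as.numeric(1:24),
             dim = c(3, 8, 1),
             dimnames = list(cell = c("0", "1", "2"),
                             time = paste0("t", 1:8),
                             band = "1"))

# Step 1: subset the first 6 time steps (subsetting happens first).
sub <- toy[, 1:6, , drop = FALSE]

# Step 2: aggregate along "cell" by keeping the other dimensions as margins.
agg <- apply(sub, c("time", "band"), mean)

dim(agg)
# [1] 6 1
agg["t1", "1"]
# [1] 2
```

Because the subset is applied first, time steps 7 and 8 never enter the aggregation.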

#### **Miscellaneous** \
  \
  More helpful functionality included with LPJmL Data:

  * `read_meta()` to read meta information from meta and header files as 
    `LPJmLMetaData` objects. `LPJmLMetaData` are usually attached to an
    `LPJmLData` object but can also be used to gain information about an LPJmL
    input or output file without reading the data.

  * `LPJmLMetaData` objects can be converted with `as_list()` and `as_header()`,
    e.g. to create header objects or write header files.
  
  * `read_header()`, `write_header()`, `get_headersize()`, `get_datatype()`
    provide low-level interaction with LPJmL input and output files primarily in
    "clm" format.


&nbsp;
&nbsp;

## Usage

```{r setup, echo = TRUE, eval = FALSE, highlight = TRUE}
library(lpjmlkit)
```

#### **1. Example** *Global Trend in net primary productivity (NPP) over the years*

&nbsp;

```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
npp <- read_io(filename = "./output/npp.bin.json",
               subset = list(year = as.character(1970:2011)))

# Transform "time" into "year" and "month" dimensions.
npp$transform(to = "year_month_day")

# Plot time series with aggregated cell and month dimensions. Note that spatial
# aggregation across cells is not area-weighted.
plot(npp,
     aggregate = list(cell = mean, month = sum))

# Also available as data array.
global_npp_trend <- as_array(npp,
                             aggregate = list(cell = mean, month = sum))


```


#### **2. Example** *Runoff in northern hemisphere during summertime*

&nbsp;

```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
runoff <- read_io(filename = "./output/runoff.bin.json",
                  subset = list(year = as.character(2002:2011)))

# Use the native pipe operator |> (requires R >= 4.1; with older R versions,
#   use %>% from the magrittr package instead).
runoff |>
  # Transform the time and space dimensions ...
  transform(to = c("year_month_day", "lon_lat")) |>
  # ... to subset summer months as well as northern hemisphere (positive)
  #   latitudes.
  subset(month = 6:9,
         lat = as.character(seq(0.25, 83.75, by = 0.5))) |>
  # For plotting, sum up the summer months and take the average over the years.
  plot(aggregate = list(year = mean, month = sum))

```


#### **3. Example** *Gross primary productivity (GPP) per latitude*

&nbsp;

```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
gpp <- read_io(filename = "./output/gpp.bin.json",
               subset = list(year = as.character(2002:2011)))

# Transform into lon_lat format.
gpp$transform(to = "lon_lat")

# Plot GPP per latitude.
plot(gpp, aggregate = list(time = mean, lon = mean))

```


#### **4. Example** *CFT fractions for area around Potsdam*

&nbsp;

```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
# Coordinates for cells around Potsdam.
coordinates <- tibble::tibble(lat = as.character(c(52.25, 52.400922, 53.25)),
                              lon = as.character(c(12.75, 13.03638, 12.75)))

# Complete pipe notation, from reading to plotting data.
read_io(
  filename = "./cftfrac.bin.json",
  subset = list(year = as.character(2000:2018))
) |>
  transform(to = "lon_lat") |>
  # Special case for subsetting of lat and lon pairs
  subset(coords = coordinates) |>
  # Mean across spatial dimensions
  plot(aggregate = list(lon = mean, lat = mean))

```
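
The second coordinate pair in this example (52.400922, 13.03638) is not a cell centre of the 0.5° grid. The sketch below shows one way to map arbitrary coordinates to the centre of the enclosing grid cell, assuming a regular grid with cell centres at .25 and .75 as in the `lon` and `lat` dimnames shown in section 4. The helper `snap_to_cell_center()` is hypothetical and not part of lpjmlkit, and whether `subset()` performs such a matching internally is not covered here.

```r
# Hypothetical helper (not part of lpjmlkit): centre of the regular grid cell
# that contains a coordinate, assuming cell edges at multiples of `res`.
snap_to_cell_center <- function(x, res = 0.5) {
  floor(x / res) * res + res / 2
}

lat <- snap_to_cell_center(c(52.25, 52.400922, 53.25))
lon <- snap_to_cell_center(c(12.75, 13.03638, 12.75))

as.character(lat)
# [1] "52.25" "52.25" "53.25"
as.character(lon)
# [1] "12.75" "13.25" "12.75"
```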

&nbsp;
&nbsp;

### Notes & tips

1. `LPJmLData` and `LPJmLMetaData` objects are closed environments, each an
  instance of an R6 class, that function as data containers.  
  Do not copy R6 objects by plain assignment:  
    ```{r, echo = TRUE, eval = FALSE, highlight = TRUE}
my_copy <- lpjml_data
# Instead use:
my_copy <- lpjml_data$clone(deep = TRUE)
    ```
  Otherwise, `my_copy` and `lpjml_data` point to the same environment, and any
  subsetting or transformation methods applied to `my_copy` will also affect
  `lpjml_data`.

2. Do not try to manually overwrite either the `$data` or any `$meta` data
  attributes within `LPJmLData` objects. It is either not possible or can
  compromise the integrity of the object. Methods surrounded by double underscores
  (`$.__<method>__`) or attributes surrounded by underscores (`$._<attribute>_`)
  are intended only for low-level package development and should not be used
  for regular data handling.

3. When performance is important, choose R6 method notation
  `runoff$transform(to = "lon_lat")` over common R notation
  `transform(runoff, to = "lon_lat")`.

4. The "meta" format is only supported by recent LPJmL versions. When comparing
  output from older versions (before LPJmL 5.3) with LPJmL 5.3 output, it can
  be useful to combine meta files (`"output_metafile" : true`) with the header
  file format (`"fmt": "clm"`), which has been supported since LPJmL version 4,
  to simplify processing pipelines.
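
The copy behaviour described in note 1 follows from R6 objects being environments, which R handles by reference. The sketch below uses a plain environment as a stand-in for an `LPJmLData` object, so it runs without lpjmlkit or R6; the manual copy via `as.environment()` only illustrates the idea behind `$clone(deep = TRUE)` and is not how R6 implements cloning.

```r
# A plain environment stands in for an R6 object here.
original <- new.env()
original$unit <- "mm/month"

# Plain assignment creates a second name for the same environment ...
alias <- original
alias$unit <- "changed"
original$unit
# [1] "changed"

# ... whereas an explicit copy is independent of the original.
copy <- as.environment(as.list(original, all.names = TRUE))
copy$unit <- "mm/month"
original$unit
# [1] "changed"
```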