Using and adding IAMC data checks

Purpose and Functionality

The iamc R package is a collection of R tools provided by the Integrated Assessment Modeling Consortium (IAMC) for data consistency checks. It can be used to make sure that a data set is in line with rules defined by a given project. These rules can be for instance a naming convention or unit convention for variables, but also qualitative measures such as lower or upper bounds for certain variables. Besides that the data can be validated against reference data.

Components

Below follows a brief overview about the essential components of this package.

Input data

Input data that should be checked can be provided as a file path to a reporting file,a quitte object or an object which can be converted to quitte using as.quitte.

Configuration

The above-mentioned set of rules are defined in the configuration file. It describes the structure and values allowed for the input data (e.g. allowed variable names, units, min, max etc.). The configuration is given as .csv file and should be stored in iamc/inst/extdata. The default configuration is that of the CDLINKS convention. As will be shown below any other configuration can be specified.

Pre-checks

To allow for checking the basic structure of the data the user can define pre-checks. These checks are performed on the raw data and filter it according to the defined pre-checks, meaning that data that does not pass the pre-checks will be removed before continuing with the main checks. This ensures that the filtered data can actually be processed by the main checks without breaking the system. To define your own pre-check please add it as a R-script to the folder iamc/R/ using a name that starts with “precheck” (e.g. “preCheckMyNewCheck.R”“). Any R script starting with”precheck” will automatically be detected and executed.

Checks

The main checks investigate the values of the uploaded data, e.g. if values lie within a given range etc … Same as for pre-checks: to define your own check please add a R-script to the folder iamc/R/ using a name that starts with “check” (e.g. “CheckMyNewCheck.R”“). Any R script starting with”check” will automatically be detected and executed.

Validation (in PDF file only)

This package offers the option to validate your data against historical and other reference data. The results of the validation can (only!) be viewed in a pdf file. This file summarizes the results of the pre-checks, checks, and the validation. If the user does not activate the generation of this pdf file the validation will not be performed but only the results of the pre-checks and checks will be printed on the screen. A collection of global historical and other reference data is provided by the package and used as default reference data for the validation. Call iamReferenceData() to view it.

A simple example

The package comes with a small example data set example_REMIND supposed to be checked against a configuration ‘example_CFG’. This is a strongly reduced example for demonstration purposes. If no configuration is explicitly assigned the default CDLINKS convention is taken. This file can be found in the iamc/inst/extdata folder of the package. To show the functionality of the package some mistakes have been introduced into the example data set. The following command detects these mistakes by triggering all pre-checks and checks that are stored in “iamc/R/”:

library(iamc)
out <- iamCheck(example_REMIND, cfg="example_CFG")
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> ℹ Please use `tibble::as_tibble()` instead.
#> ℹ The deprecated feature was likely used in the iamc package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#>    Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#>    PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 1 values lie below allowed minimum.
#>    REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> checkUnits: 1 variables are reported in the wrong unit.
#>    GDP|PPP | unit: billion US$2005/yr

This returns information about the checks performed and the problems found in the data. In this example the pre-checks are reporting information about variables: e.g. in “preCheckIllegalVariables” we specified a test, that checks if the input data in “example_REMIND” has variables not listed in the configuration file “example_CFG”, thus are illegal. The test “preCheckMissingVariables” indentifies all variables listed in the configuration-file but missing in the input data. The checks “checkMin” and “checkMax” test if the values of the input data correspond to the minimal and maximal values specified in the configuration files.

Information about the tests is written to standard output, but also invisibly returned by the function. That means, if you want to further process the products of iamCheck you can store its return value in a variable and further work with that.

Customizing the project settings

Explicitly setting cfg="example_CFG" in the example above makes sure that the data is analyzed based on a set of rules specified in “example_CFG.csv” and not based on the default “CDLINKS.csv”. Thus, to use specific project settings you have to explicitly assign a configuration to “cfg”. The file must be given in “csv” format and stored in “iamc/inst/extdata”. Another way is to load existing cfg project settings and customize them. In the following example we take the existing example_CFG config and introduce the new variable Population_OtherName that is not in the data by renaming the existing variable Population. The second example defines a new maximum value of 200 for the variable ‘FE’:

# first example
# load cfg
cfg <- iamProjectConfig("example_CFG")
# modify cfg
cfg$variable[cfg$variable=="Population"] <- "Population_OtherName"
# run check with new cfg
iamCheck(example_REMIND, cfg=cfg)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 2 non-standard variables, which will be deleted.
#>    Population
#>    Emi|CO2
#> preCheckMissingVariables: your data is missing 2 standard variables.
#>    Population_OtherName
#>    PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 1 values lie below allowed minimum.
#>    REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> checkUnits: 1 variables are reported in the wrong unit.
#>    GDP|PPP | unit: billion US$2005/yr

# second example
# load cfg
cfg <- iamProjectConfig("example_CFG")
# modify cfg
cfg$max[cfg$variable=="FE"] <- 200
# run check with new cfg
iamCheck(example_REMIND, cfg=cfg)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#>    Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#>    PE
#> checkMax: 19 values lie above allowed maximum.
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2010
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2015
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2020
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2025
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2030
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2035
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2040
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2045
#>    REMIND | r7323c_SSP2-26-rem-5 | GLO | FE | 2050
#>    ... 9 more rows (set verbose=TRUE to get a full list)
#> checkMin: 1 values lie below allowed minimum.
#>    REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> checkUnits: 1 variables are reported in the wrong unit.
#>    GDP|PPP | unit: billion US$2005/yr

Validate data against historical and other reference data

As described above you can validate your data against historical and other reference data. A collection of global historical and other reference data is provided by the package and used as default for the validation (refData=“IAMC”). These reference data can be inspected by calling iamReferenceData(). Please note: The validation is only activated, if a name for the resulting output pdf is specified by the user, e.g. “pdf=summary.pdf”. Besides the results of the pre-cheks and checks the pdf shows detailed results of the validation: it graphically compares the data and the reference data by showing time series and traffic light symbols that indicate how well your data matches the refernce data in terms of absolute values, the gradient and the trend.

# provide a name for the pdf so that the validation is run and written to the pdf
iamCheck(example_REMIND, pdf="summary.pdf", cfg="example_CFG")

Adding own checks

It is possible and encouraged to add own checks. All check functions must have a name starting with “check” or “precheck” to be automatically detected and executed. The checks may only use parameters available in the parameter list of iamCheck, which are currently

x (the provided input data to be tested as quitte object)
cfg (the project configuration) and
refData (the reference data, by default “IAMC”).

Your check function needs to return a list of two objects:

a character “message” which is the message showing up at the end of the test with the place holder “%#” for the number of elements for which the test failed, and
a character vector “failed” identifying the objects (e.g. variable names as character) for which the test failed.

By default iamCheck will only look for checks which are part of the package, but with the argument globalenv = TRUE it will also search the global environment for functions.

To add for instance a 42 check which makes sure that values lie between 42 and 2*42, you can write a function check42 following the rules mentioned above and run iamCheck with globalenv = TRUE:

library(iamc)

check42 <- function(x) {
  # perform the check
  f <- subset(x,value>42.0 & value <2*42)
  # generate character vector of objects that fail the check
  failed <- unique(paste(f$model, f$scenario, f$region, f$variable, f$period, sep=" | "))
  # return the message and the character vector
  return(list(message="%# variables are between 42 and 2*42",
              failed=failed))
} 

iamCheck(example_REMIND, cfg="example_CFG", globalenv=TRUE)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#>    Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#>    PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 1 values lie below allowed minimum.
#>    REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> checkUnits: 1 variables are reported in the wrong unit.
#>    GDP|PPP | unit: billion US$2005/yr
#> check42: 88 variables are between 42 and 2*42
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2090
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2100
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2110
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2130
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2150
#>    REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | USA | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2010
#>    REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2010
#>    ... 78 more rows (set verbose=TRUE to get a full list)

# run again with modified cfg to trigger unit warning
example2 <-example_REMIND
levels(example2$unit)[2] <- "weird_unit"

iamCheck(example2, cfg="example_CFG", globalenv=TRUE)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#>    Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#>    PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 1 values lie below allowed minimum.
#>    REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> checkUnits: 2 variables are reported in the wrong unit.
#>    GDP|PPP | unit: billion US$2005/yr
#>    Population | unit: weird_unit
#> check42: 88 variables are between 42 and 2*42
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2090
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2100
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2110
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2130
#>    REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2150
#>    REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | USA | FE | 2005
#>    REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2010
#>    REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2010
#>    ... 78 more rows (set verbose=TRUE to get a full list)

As soon as the new check is working it would be nice if you could add it to the package so that others can use it as well.