NOTE: this package is identical to
the https://github.com/IAMconsortium/iamc R package, and
serves as temporary vessel for checks developed at PIK’s RD3. The
functionality is exactly the same, thus below iamc
means
also piamModelTests
.
The iamc R package is a collection of R tools provided by the Integrated Assessment Modeling Consortium (IAMC) for data consistency checks. It can be used to make sure that a data set is in line with rules defined by a given project. These rules can be a naming convention or unit convention for variables, but also qualitative measures such as lower or upper bounds for certain variables. Besides that the data can be validated against reference data.
Below follows a brief overview about the essential components of this package.
Input data that should be checked can be provided as a file path to a
reporting file, a quitte object, or an
object which can be converted to quitte using
as.quitte
.
The above-mentioned set of rules are defined in the configuration file. It describes the structure and values allowed for the input data (e.g. allowed variable names, units, min, max etc.). The configuration is given as .csv file and should be stored in iamc/inst/extdata. The default configuration is that of the CDLINKS convention. As will be shown below any other configuration can be specified.
To allow for checking the basic structure of the data the user can define pre-checks. These checks are performed on the raw data and filter it according to the defined pre-checks, meaning that data that does not pass the pre-checks will be removed before continuing with the main checks. This ensures that the filtered data can actually be processed by the main checks without breaking the system. To define your own pre-check please add it as a R-script to the folder iamc/R/ using a name that starts with “precheck” (e.g. “preCheckMyNewCheck.R”). Any R script whose name starts with “precheck” will automatically be detected and executed.
The main checks investigate the values of the uploaded data, e.g. if values lie within a given range etc … Same as for pre-checks: to define your own check please add a R-script to the folder iamc/R/ using a name that starts with “check” (e.g. “CheckMyNewCheck.R”). Any R script whose name starts with “check” will automatically be detected and executed.
This package offers the option to validate your data against historical and other reference data. The results of the validation can (only!) be viewed in a pdf file. This file summarizes the results of the pre-checks, checks, and the validation. If the user does not activate the generation of this pdf file the validation will not be performed but only the results of the pre-checks and checks will be printed on the screen. A collection of global historical and other reference data is provided by the package and used as default reference data for the validation. Call iamReferenceData() to view it.
The package comes with a small example data set
example_REMIND
supposed to be checked against a
configuration ‘example_CFG’. This is a strongly reduced example for
demonstration purposes. If no configuration is explicitly assigned the
default CDLINKS convention is taken. This file can be found in the
iamc/inst/extdata folder of the package. To show the functionality of
the package some mistakes have been introduced into the example data
set. The following command detects these mistakes by triggering all
pre-checks and checks that are stored in “iamc/R/”:
library(piamModelTests)
out <- iamCheck(example_REMIND, cfg="example_CFG")
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#> Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#> PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 216 values lie below allowed minimum.
#> REMIND | r7323c_SSP2-26-rem-5 | AFR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | IND | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | LAM | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | MEA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | OAS | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | ROW | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | RUS | FE | 2005
#> ... 206 more rows (set verbose=TRUE to get a full list)
#> checkSums: 0 sums do not add up or are missing in the data or the cfg (shown here are the group totals)
#> checkUnits: 1 variables are reported in the wrong unit.
#> GDP|PPP | unit: billion US$2005/yr
This returns information about the checks performed and the problems found in the data. In this example the pre-checks are reporting information about variables: e.g. in “preCheckIllegalVariables” we specified a test, that checks if the input data in “example_REMIND” has variables not listed in the configuration file “example_CFG”, thus are illegal. The test “preCheckMissingVariables” indentifies all variables listed in the configuration-file but missing in the input data. The checks “checkMin” and “checkMax” test if the values of the input data correspond to the minimal and maximal values specified in the configuration files.
Information about the tests is written to standard output, but also
invisibly returned by the function. That means, if you want to further
process the products of iamCheck
you can store its return
value in a variable and further work with that.
Explicitly setting cfg="example_CFG"
in the example
above makes sure that the data is analyzed based on a set of rules
specified in “example_CFG.csv” and not based on the default
“CDLINKS.csv”. Thus, to use specific project settings you have to
explicitly assign a configuration to “cfg”. The file must be given in
“csv” format and stored in “iamc/inst/extdata”. Another way is to load
existing cfg project settings and customize them. In the following
example we take the existing example_CFG config and introduce the new
variable Population_OtherName
that is not in the data by
renaming the existing variable Population
. The second
example defines a new maximum value of 200 for the variable ‘FE’:
# first example
# load cfg
cfg <- iamProjectConfig("example_CFG")
# modify cfg
cfg$variable[cfg$variable=="Population"] <- "Population_OtherName"
# run check with new cfg
iamCheck(example_REMIND, cfg=cfg)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 2 non-standard variables, which will be deleted.
#> Population
#> Emi|CO2
#> preCheckMissingVariables: your data is missing 2 standard variables.
#> Population_OtherName
#> PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 0 values lie below allowed minimum.
#> checkSums: 0 sums do not add up or are missing in the data or the cfg (shown here are the group totals)
#> checkUnits: 1 variables are reported in the wrong unit.
#> GDP|PPP | unit: billion US$2005/yr
# second example
# load cfg
cfg <- iamProjectConfig("example_CFG")
# modify cfg
cfg$max[cfg$variable=="FE"] <- 200
# run check with new cfg
iamCheck(example_REMIND, cfg=cfg)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#> Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#> PE
#> checkMax: 228 values lie above allowed maximum.
#> REMIND | r7323c_SSP2-26-rem-5 | AFR | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | IND | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | LAM | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | MEA | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | OAS | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | ROW | GDP|PPP | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | RUS | GDP|PPP | 2005
#> ... 218 more rows (set verbose=TRUE to get a full list)
#> checkMin: 216 values lie below allowed minimum.
#> REMIND | r7323c_SSP2-26-rem-5 | AFR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | IND | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | LAM | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | MEA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | OAS | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | ROW | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | RUS | FE | 2005
#> ... 206 more rows (set verbose=TRUE to get a full list)
#> checkSums: 0 sums do not add up or are missing in the data or the cfg (shown here are the group totals)
#> checkUnits: 1 variables are reported in the wrong unit.
#> GDP|PPP | unit: billion US$2005/yr
As described above you can validate your data against historical and other reference data. A collection of global historical and other reference data is provided by the package and used as default for the validation (refData=“IAMC”). These reference data can be inspected by calling iamReferenceData(). Please note: The validation is only activated, if a name for the resulting output pdf is specified by the user, e.g. “pdf=summary.pdf”. Besides the results of the pre-cheks and checks the pdf shows detailed results of the validation: it graphically compares the data and the reference data by showing time series and traffic light symbols that indicate how well your data matches the refernce data in terms of absolute values, the gradient and the trend.
It is possible and encouraged to add own checks. All check functions must have a name starting with “check” or “precheck” to be automatically detected and executed. The checks may only use parameters available in the parameter list of iamCheck, which are currently
Your check function needs to return a list of two objects:
By default iamCheck will only look for checks which are part of the
package, but with the argument globalenv = TRUE
it will
also search the global environment for functions.
To add for instance a 42 check which makes sure that values lie
between 42 and 2*42, you can write a function check42
following the rules mentioned above and run iamCheck
with
globalenv = TRUE
:
library(piamModelTests)
check42 <- function(x) {
# perform the check
f <- subset(x,value>42.0 & value <2*42)
# generate character vector of objects that fail the check
failed <- unique(paste(f$model, f$scenario, f$region, f$variable, f$period, sep=" | "))
# return the message and the character vector
return(list(message="%# variables are between 42 and 2*42",
failed=failed))
}
iamCheck(example_REMIND, cfg="example_CFG", globalenv=TRUE)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#> Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#> PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 216 values lie below allowed minimum.
#> REMIND | r7323c_SSP2-26-rem-5 | AFR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | IND | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | LAM | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | MEA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | OAS | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | ROW | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | RUS | FE | 2005
#> ... 206 more rows (set verbose=TRUE to get a full list)
#> checkSums: 0 sums do not add up or are missing in the data or the cfg (shown here are the group totals)
#> checkUnits: 1 variables are reported in the wrong unit.
#> GDP|PPP | unit: billion US$2005/yr
#> check42: 88 variables are between 42 and 2*42
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2090
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2100
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2110
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2130
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2150
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | USA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2010
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2010
#> ... 78 more rows (set verbose=TRUE to get a full list)
# run again with modified cfg to trigger unit warning
example2 <-example_REMIND
levels(example2$unit)[2] <- "weird_unit"
iamCheck(example2, cfg="example_CFG", globalenv=TRUE)
#> preCheckColumns: your data contains 0 non-standard columns, which will be ignored.
#> preCheckDuplicates: data contains 0 duplicate entries. Skip recurring entries.
#> preCheckIllegalVariables: your data contains 1 non-standard variables, which will be deleted.
#> Emi|CO2
#> preCheckMissingVariables: your data is missing 1 standard variables.
#> PE
#> checkMax: 0 values lie above allowed maximum.
#> checkMin: 216 values lie below allowed minimum.
#> REMIND | r7323c_SSP2-26-rem-5 | AFR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | IND | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | LAM | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | MEA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | OAS | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | ROW | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | RUS | FE | 2005
#> ... 206 more rows (set verbose=TRUE to get a full list)
#> checkSums: 0 sums do not add up or are missing in the data or the cfg (shown here are the group totals)
#> checkUnits: 2 variables are reported in the wrong unit.
#> GDP|PPP | unit: billion US$2005/yr
#> Population | unit: weird_unit
#> check42: 88 variables are between 42 and 2*42
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2090
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2100
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2110
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2130
#> REMIND | r7323c_SSP2-26-rem-5 | JPN | Population | 2150
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | USA | FE | 2005
#> REMIND | r7323c_SSP2-26-rem-5 | CHN | FE | 2010
#> REMIND | r7323c_SSP2-26-rem-5 | EUR | FE | 2010
#> ... 78 more rows (set verbose=TRUE to get a full list)
As soon as the new check is working it would be nice if you could add it to the package so that others can use it as well.