--- title: "Data caching in madrat" author: "Jan Philipp Dietrich" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data caching in madrat} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` A central feature of the madrat framework is its ability to load data from cache rather than recompute it when the calculation have run already before. Here, we explain what user should know about the caching to avoid unwanted behavior. ## Basics By default every read- or calc-function creates a cache file from its computations and stores it in the cachefolder. Where this folder is located can be checked via ```{r, echo = TRUE, eval=TRUE} library(madrat, quietly = TRUE) getConfig("cachefolder", verbose = FALSE) ``` When running data processing via `retrieveData` it currently offers two types of cache folders: `cachetype = "def"` will use a shared cachefolder in which all processes write their cache files by default. RetrieveData will check in this folder for fitting cache files and read them if available. Whether they are fitting or not will depend on their `fingerprint` which is explained further down. With `cachetype = "rev"` `retrieveData` will create a new, revision-specific cachefolder and set `setConfig(forcecache = TRUE)` (default is FALSE). Via this approach calculations will start with new cache files at all but created cache files will be read if a calculation is repeated. The forcecache option will in this case make sure that any available cache file which fits the function call is read in, independent of whether the content of the cache file might be outdated or not. ## Fingerprint In order to estimate whether a calculation should be rerun or whether the data can be read from cache madrat creates fingerprints for each function. If the fingerprint of the current function call agrees with the fingerprint of the corresponding cache file the cache is assumed up-to-date and read in. If they disagree, the cache file is assumed to be potentially outdated and ignored (except for `forcecache = TRUE` in which case it would be read in anyways). The fingerprint is created by looking at the dependency graph of a function which can be retrieved via `getDependencies`: ```{r, echo = TRUE, eval=TRUE} getDependencies("calcTauTotal", packages = "madrat") ``` The dependency graph lists all calls of `calc`, `read` and `tool` functions a function depends on (not only calls in the function itself, but also calls in the functions which have been called in order to run the function). The fingerprinting function creates hashes of all these functions and all source folders involved in this process and combines them to one single hash which is the fingerprint of that specific function: ```{r, echo = TRUE, eval=TRUE} setConfig(verbosity = 3) fp <- madrat:::fingerprint("calcTauTotal", details = TRUE, packages = "madrat") ``` As a hash has the characteristic to change when its input changes, an unchanged hash means that also the respective function or source folder did not change. Hence, an identical fingerprint means that the involved functions and source data did not change. So, if the fingerprint of the cache file agrees with the fingerprint calculated for the calculation it is quite likely that the data contained in the cache file also agrees with the output of the calculation one would run it again. The reason why it is only quite likely but not certain is that not all parts of the calculation are covered: The dependency graph only considers madrat-style functions, e.g. functions not starting with `download`, `read`, `correct`, `convert`, `calc`, or `tool` will not be considered. In many cases this should be ok, considering that external functions used in the calculation will likely keep their behavior over time, but there might be instances in which this assumption is violated (e.g. if parts of the calculation are outsourced in a function not following these conventions). On the other hand, the dependency graph might also include dependencies which only exist on the paper, as it does only scan for calls of the corresponding functions in the code, but cannot interpret which calls are actually be computed for a given calculation, e.g. there could be if clauses in a calc-function selecting different source data types. The dependency graph will show a dependency to all sources even if only one of these sources might be used at the end. ## Customize fingerprinting To make sure that the fingerprint is appropriately reflecting the current status of a calculation there are a few possibilities to steer its behavior: 1. Use madrat-style functions for all calculation that should get monitored by the fingerprinting algorithm (e.g. if part of the calculation is outsourced, call this new function `tool..` to have it monitored.) 2. Adjust the fingerprinting via control flags for all other cases. ## Control flags Control flags can be used to manually include or exclude functions in the fingerprinting. Control flags are comments in the functions which are put in quotes and start with `!#`. They can look like: ```{r, echo = TRUE, eval=FALSE} "!# @monitor madrat:::sysdata magclass:::ncells" "!# @ignore madrat:::toolAggregate" ``` Each line contains a control flag starting with the flag name (here `monitor` or `ignore`) and afterwards with the arguments of this control flag. The `monitor` flag specifies calls which should get monitored in addition to the ones anyway monitored (in the example the `sysdata` object in `madrat` and the `ncells` functions of the `magclass` package are additionally being monitored). The `ignore` flag specifies which calls should not be monitored even so `getDepenendencies` says otherwise. While the `ignore` statement has to be mentioned explicitly for each function, the `monitor` statement will be passed on automatically to all subsequent functions (e.g. if a read function has a monitor statement all calc-functions used that read function will also monitor the additional calls of that statement, but in the same example the ignore statement would only be used for the read function itself). In particular the `ignore` statement has to be handled with care as a wrong information here might lead to outdated cache files being read in. So, only use it if really necessary and if you know exactly what you are doing. ## Examples ```{r, echo = TRUE, eval=TRUE} setConfig(globalenv = TRUE) readData <- function() return(1) readData2 <- function() return(2) calcExample <- function() { a <- readSource("Data") return(a) } calcExample2 <- function() { a <- readSource("Data") if (FALSE) a <- readSource("Data2") return(a) } ``` In this example are two source data sets and two calculation functions. `calcExample` only depends on `readData` while `calcExample2` depends on both data sources. ```{r, echo = TRUE, eval=TRUE} fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat") fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat") ``` Looking at the fingerprints this is reflected in the hash components of each fingerprint (please NOTE that the source folders are not hashed in this example as they do not exist yet. If they exist they would show up here as well as hash components). One can see, that the hash for `readData` is the same in both fingerprints but as the other hashes differ also the resulting fingerprint for both calculations is different. ```{r, echo = TRUE, eval=TRUE} readData <- function() return(99) fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat") fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat") ``` Changing the `readData` function changes the hash of this function and thereby also the fingerprints of both calc functions even so the hash of the calc functions itself did not change. ```{r, echo = TRUE, eval=TRUE} readData2 <- function() { "!# @monitor madrat:::toolAggregate" return(99) } fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat") fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat") ``` Adding a monitor control flag in `readData` also add this hash component to all subsequent fingerprint calculations. ```{r, echo = TRUE, eval=TRUE} calcExample2 <- function() { "!# @ignore readData2" a <- readSource("Data") if (FALSE) a <- readSource("Data2") return(a) } calcExample3 <- function() { a <- calcOutput("Example2") return(a) } fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat") fp3 <- madrat:::fingerprint("calcExample3", details = TRUE, packages = "madrat") ``` The `ignore` flag in `calcExample2` excludes `readData2` from the fingerprint calculation. But in contrast to the `monitor` statement this information is not forwarded to `calcExample3`. Hence, the latter does not only monitor `madrat:::toolAggregate` but also `readData2`! ## forcecache Before the introduction of fingerprinting forcing the use of cache files was the default approach. However, in the new setup the argument `forcecache = TRUE` should only be used under very specific circumstances, as it does not guarantee that the data agrees with the code of the corresponding package. In particular production runs should always use `forcecache = FALSE`. A scenario in which `forcecache = TRUE` might still make sense are development cases in which up-to-date inputs are not required for proper function development. In these cases development can be speed up by using potentially outdated cache files as a starting point to avoid lengthy calculations of parts irrelevant for the current development stage. If you are unsure what to use, always go with `forcecache = FALSE`.