--- title: "REMIND/IAM Data Analysis Using quitte" author: "Michaja Pehl" date: "`r as.Date(file.mtime('./quitte-data-analysis.Rmd'))`" output: rmarkdown::html_vignette: toc: yes df_print: tibble vignette: > %\VignetteIndexEntry{REMIND/IAM Data Analysis Using quitte} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 7 ) ``` The `quitte` package is a grab-bag of utility functions to work with REMIND/IAM results (or any kind of data) in a `.mif`-like format. It builds on the capabilities of the `dplyr`, `tidyr`, and `ggplot2` packages. # Load required packages ```{r load necessary packages} library(tidyverse) library(quitte) ``` Generally, when working with quitte, you will need the `dplyr` and `tidyr` packages. `tidyverse` is a "collection" package loading these two, as well as the `ggplot2` package and a couple of others. # Data Wrangling As `quitte` builds on the capabilities and practices of the _tidy data_ packages, it is highly recommended to read [this excellent tutorial introducing `dplyr` and `tidyr` ](https://datacarpentry.org/R-ecology-lesson/03-dplyr.html). You should be comfortable with the concept of piping (`%>%`), as well as the `filter()`, `select()`, `group_by()`, `mutate()`, and `summarise()` functions. Understanding the join functions (especially `inner_join()` and `full_join()` will help immensely). # Load Data > **Note:** We first load some example data for following along with the > sections below. If anything is unclear, just keep reading, it will be > explained further down. `read.quitte()` reads `.mif`-files and converts them into five-dimensional long format data frames with columns `model`, `scenario`, `region`, `variable`/`unit`, `period`, and `value`. First, select all the `.mif` files you want to load: ```{r select data files} base.path <- 'some/directory/' data.files <- c( 'REMIND_generic_r7552c_1p5C_Def-rem-5.mif', 'REMIND_generic_r7552c_1p5C_UBA_Sust-rem-5.mif', 'REMIND_generic_r7552c_2C_Def-rem-5.mif', 'REMIND_generic_r7552c_2C_UBA_Sustlife-rem-5.mif', 'REMIND_generic_r7552c_REF_Def05-rem-5.mif', # 'REMIND_generic_r7552c_REF_Def05-rem-5_tab.mif', NULL) ``` > **Tip!** Keep your data input formatted in a way that is easy to read and to > modify. Separate the directory from the path if there are many files in the > same directory. Having each path (or other item) on a separate line (and > ending with `NULL`) makes it easy to comment out individual files during > development or debugging. Then load all the files: ```{r load data fake, eval = FALSE} data <- read.quitte(file.path(base.path, data.files)) ``` , we can also load it like this: This is what you will do normally. Since the example data is included with the `quitte` package as a data object in case you just want to try something or give an example for something, we can also load it like this: ```{r quitte example data} (data <- quitte_example_data) ``` # Trim Data If you know which data you will need you can trim down the data frame early, reducing memory needs and increasing processing speed. `inline.data.frame()` is a wrapper function for entering tabular data in an easy-to-read fashion. ```{r set up scenario selection} scenarios <- inline.data.frame( 'scenario; scen.name', 'r7552c_REF_Def05-rem-5; Reference', # 'r7552c_1p5C_Def-rem-5; 1.5°C-def', 'r7552c_2C_Def-rem-5; 2°C', 'r7552c_1p5C_UBA_Sust-rem-5; 1.5°C-sust', # 'r7552c_2C_UBA_Sustlife-rem-5; 2°C-sust', NULL) tmax <- 2100 ``` ```{r trim data} data <- data %>% filter(scenario %in% scenarios$scenario, tmax >= period) ``` # Use Pretty Names `replace_column()` can be used to replace a column with hard-to-read or ugly entries with more pleasant ones. This is quite useful for preparing data for plots or tables. `order.levels()` arranges factor levels (e.g. scenario names) in a specific order, thus controlling the order with which items appear in plots. ```{r replace scenario names} (data <- data %>% replace_column(scenarios, scenario, scen.name) %>% order.levels(scenario = scenarios$scen.name)) ``` ## Piping The pipe operator `%>%` is a shortcut to string any number of operations together. So instead of writing a long code worm like ```{r code worm, eval = FALSE} ungroup(mutate(group_by(select(filter(data, 'Reference' == scenario), region, period, value), region), value = cumsum(value))) ``` that you would have to read from the inside out, you can write each function call on a separate line, greatly improving readability as data "flows" from top to bottom, each function being applied in succession. ```{r readable code, eval = FALSE} data %>% filter('Reference' == scenario) %>% select(region, period, value) %>% group_by(region) %>% mutate(value = cumsum(value)) %>% ungroup() ``` ## Filtering `filter()` is a more useful and versatile variant of the data frame extraction operator `[`. Instead of ```{r extraction, eval = FALSE} data[data$period == 2030 & df$scenario != 'Reference',] ``` use ```{r filter, eval = FALSE} data %>% filter(period == 2030, scenario != 'Reference') ``` again, improving readability and decreasing proneness to error. Predicates (the test) are combined via logical AND, you can use as many of them as you want, and there are lots of helper functions available in the `dplyr` package. ## Selecting `select()` selects columns from a data frame. So if you only need columns `scenario` and `region`, do ```{r select A and B, eval = FALSE} data %>% select(scenario, region) ``` If you need all but `model`, do ```{r select all but C, eval = FALSE} data %>% select(-model) ``` You can rename columns on the way ```{r select and rename, eval = FALSE} data %>% select(scenario, year = period) ``` ## Grouping `group_by()` subdivides a data frame into groups, such that so-called _window_ functions operate only on the items within a group. If you, for example, want to calculate the cumulated energy production over time for different scenarios and regions, you need to group by those dimensions first. ```{r grouping, eval = FALSE} data %>% filter(variable == 'PE') %>% group_by(scenario, region) %>% mutate(value = cumsum(value)) %>% ungroup() ``` ## Mutating `mutate()` is the low-level function for modifying existing or creating new columns. Its output has the same number of rows as its input. ```{r mutating, eval = FALSE} data %>% filter(variable == 'PE') %>% group_by(scenario, region) %>% mutate(value = cumsum(value)) %>% ungroup() ``` ## Summarising `summarise()` is the low-level function for aggregating variables. Its output has _one_ row for every group. All ungrouped columns are dropped. ```{r summarising, eval = FALSE} data %>% group_by(scenario) %>% summarise(value = mean(value)) %>% ungroup() ``` # Sum Things Up `sum_total()` sums values up across regions ```{r sum regions} data %>% filter(scenario == 'Reference', variable == 'Consumption', period == 2050, region %in% c('CHN', 'IND', 'JPN', 'OAS')) %>% sum_total(group = region, name = 'all Asia') ``` or across other dimensions ```{r sum variables} data %>% filter(scenario == '2°C', region == 'RUS', period == 2035, variable %in% paste0('SE|Heat|', c('Coal', 'Gas'))) %>% sum_total(variable, name = 'SE|Heat|fossil') ``` # Add New Variables `calc_addVariable()` calculates new variables from existing ones, using generic formulas. Variable names that are not valid R identifiers -- basically everything inside a `.mif`-file, need to be escaped using backticks ("`"). ```{r add per-capita FE use} data %>% filter(region == 'EUR', period == 2025, variable %in% c('PE', 'Population')) %>% calc_addVariable( # EJ/yr / million / (3.6e-9 EJ/MWh) * 1e6 1/million = MWh/yr '`PE per capita`' = 'PE / Population / 3.6e-3', units = 'MWh/year') ``` > **Tip!** Note down all unit conversions you do where you do them, and include > the calculation. This will save you hours of debug-time in the future. > (Trust me.) It is also possible to only keep the newly calculated values for further manipulation, by settting `only.new = TRUE`. ```{r add PE carrier intensity of GDP} data %>% filter(region == 'OAS', period == 2070, scenario != 'Reference') %>% calc_addVariable( # EJ/yr / $bn/yr * 1e-15 kJ/EJ * 1e9 $/$bn = kJ/$ '`PE|Coal|GDP intensity`' = '`PE|Coal` / `GDP|PPP` * 1e6', '`PE|Oil|GDP intensity`' = '`PE|Oil` / `GDP|PPP` * 1e6', '`PE|Gas|GDP intensity`' = '`PE|Gas` / `GDP|PPP` * 1e6', '`PE|Nuclear|GDP intensity`' = '`PE|Nuclear` / `GDP|PPP` * 1e6', '`PE|Biomass|GDP intensity`' = '`PE|Biomass` / `GDP|PPP` * 1e6', '`PE|Hydro|GDP intensity`' = '`PE|Hydro` / `GDP|PPP` * 1e6', '`PE|Geothermal|GDP intensity`' = '`PE|Geothermal` / `GDP|PPP` * 1e6', '`PE|Wind|GDP intensity`' = '`PE|Wind` / `GDP|PPP` * 1e6', '`PE|Solar|GDP intensity`' = '`PE|Solar` / `GDP|PPP` * 1e6', units = 'kJ/$', only.new = TRUE) ``` `calc_growthrate()` calculates the growth rate (in `%/yr`) of a time series, appending ` [Growth Rate]` to the variable name: ``` data %>% calc_growthrate(only.new = TRUE, filter.function = "Consumption") ``` returns a data.frame with the variable `Consumption [Growth Rate]`. `filter.function` can be a vector of variable names or a function that filters `data`. # Net Present Value of a Time Series `calcCumulatedDiscount()` calculates the net present value of a time series. ```{r calc NPV} data %>% filter(scenario != 'Reference', region == 'World') %>% calcCumulatedDiscount(nameVar = 'Policy Cost|Consumption Loss') %>% select(-model, -region, -unit) %>% pivot_wider(names_from = 'scenario') ``` # Calculate Sample Quantiles `calc_quantiles()` makes it easy to produce sample quantiles corresponding given probabilities from data frames. ```{r calculate sample quantiles} (data.plot <- data %>% filter(region != 'World', period == 2040) %>% calc_addVariable( # MtCO2/$bn * 1e9 kg/Mt * 1e-9 $/$bn = kgCO2/$ '`CO2 intensity`' = '`Emi|CO2` / `GDP|PPP`', units = 'kGCO2/US$2005', only.new = TRUE) %>% group_by(model, scenario, variable, unit, period) %>% calc_quantiles() %>% pivot_wider(names_from = 'quantile')) ``` This is quite useful for plotting uncertainty ranges. ```{r plot_sample_quantities} ggplot() + geom_col( data = data %>% filter(region == 'World', period == 2040) %>% calc_addVariable( # MtCO2/$bn * 1e9 kg/Mt * 1e-9 $/$bn = kgCO2/$ '`CO2 intensity`' = '`Emi|CO2` / `GDP|PPP`', units = 'kGCO2/US$2005', only.new = TRUE), mapping = aes(x = scenario, y = value, fill = scenario)) + scale_fill_discrete(guide = 'none') + geom_boxplot( data = data.plot, mapping = aes( x = scenario, ymin = q0, lower = q25, middle = q50, upper = q75, ymax = q100), stat = 'identity', width = 0.5) ``` It can also be done for arbitrary quantiles ```{r calculate different sample quantiles} data %>% filter(region == 'World', period == 2040) %>% calc_addVariable( # MtCO2/$bn * 1e9 kg/Mt * 1e-9 $/$bn = kgCO2/$ '`CO2 intensity`' = '`Emi|CO2` / `GDP|PPP`', units = 'kGCO2/US$2005', only.new = TRUE) %>% group_by(model, scenario, variable, unit, period) %>% calc_quantiles(probs = c(low = 1/3, high = 2/3, 'very high' = 4/5)) %>% pivot_wider(names_from = 'quantile') ``` # Interpolate Missing Periods `interpolate_missing_periods()` adds missing periods to a data frame and interpolates missing values. It uses either _linear_ or _spline_ interpolation and can extend missing data before/after the first/last period in the original data. ```{r interpolate missing periods} data.example <- tribble( ~x, ~value, 2010, 0.1, 2015, 0.7, 2020, 0.9, 2035, 0.6, 2040, 0.3) data.plot <- bind_rows( data.example %>% interpolate_missing_periods( x = seq(2005, 2045, 1), expand.values = TRUE, method = 'linear') %>% rename(linear = value) %>% gather(interpolation, value, -x), data.example %>% interpolate_missing_periods( x = seq(2005, 2045, 1), expand.values = TRUE, method = 'spline') %>% rename(spline = value) %>% gather(interpolation, value, -x) ) ``` Be careful when using this, as the results may be nonsensical. ```{r plot_interpolations} ggplot(mapping = aes(x = x, y = value)) + geom_line(data = data.plot, mapping = aes(colour = interpolation)) + geom_point(data = data.example) + coord_cartesian(ylim = c(0, 1)) ``` In this case, values below 0 might be meaningless. So use spline extensions with care. # Plotting with `mip` Data in `quitte` format is easily passed to the plot functions in the `mip` package. ```{r area_plot_with_mip} data %>% filter('World' == region, grepl('^PE\\|', variable)) %>% mip::mipArea(x = .) + facet_wrap(~ scenario) ``` ```{r bar_plot_with_mip} data %>% filter(period %in% c(2010, 2030, 2050, 2100), grepl('^PE\\|', variable)) %>% mip::mipBarYearData(x = .) + facet_wrap(~ region, scales = 'free_y') ``` > **Note:** You don't have to call the `mip`-functions via the double colon > (`::`) operator, but would load the package instead with `library(mip)`. This > is only needed in this vignette, since `quitte` can't import the `mip` > package.) # Plotting 'correct' Bars If bar plots are used naively (as in: without extra care), the bars for periods after 2055 are too narrow and either in the wrong places ```{r bar plot setup} colours_PE <- c('Coal' = '#0C0C0C', 'Oil' = '#663A00', 'Gas' = '#E5E5B2', 'Nuclear' = '#FF33FF', 'Hydro' = '#191999', 'Biomass' = '#005900', 'Geothermal' = '#E51900', 'Wind' = '#337FFF', 'Solar' = '#FFCC00') data_plot <- quitte_example_data %>% filter('r7552c_1p5C_Def-rem-5' == scenario, 'EUR' == region, grepl('^PE\\|', variable), 2100 >= period) %>% mutate(variable = sub('^PE\\|', '', variable)) %>% order.levels(variable = rev(names(colours_PE))) ``` ```{r plot naive bars 2} ggplot() + geom_col( data = data_plot, mapping = aes(x = factor(period), y = value, fill = variable)) + scale_fill_manual(values = colours_PE, name = NULL) + labs(x = NULL, y = 'EJ/yr') ``` or spread out with large gaps in between. ```{r plot naive bars 1} ggplot() + geom_col( data = data_plot, mapping = aes(x = period, y = value, fill = variable)) + scale_fill_manual(values = colours_PE, name = NULL) + scale_x_continuous(breaks = unique(data_plot$period), name = NULL) + labs(y = 'EJ/yr') ``` In both cases, the visual representation of the data is distorted, since the are of the bars should correspond to the integral of the variable over time. (3 EJ over 10 years should result in a bar with twice the area than 3 EJ over 5 years.) To correct this, you can use the function `add_remind_timesteps_columns`, that adds two columns to the data frame, `xpos` and `width`, which can be used to create a 'correct' bar plot: ```{r correct barplot, warning = FALSE} ggplot() + geom_col( data = data_plot %>% add_remind_timesteps_columns(gaps = 0.1), mapping = aes(x = xpos, width = width, y = value, fill = variable)) + scale_fill_manual(values = colours_PE, name = NULL) + scale_x_continuous(breaks = unique(data_plot$period), name = NULL) + labs(y = 'EJ/yr') ``` This is also available through the utility function `ggplot_bar_remind_vts`: ```{r using ggplot_bar_remind_vts, warning = FALSE} periods <- unique(data_plot$period) periods <- periods[which(!periods %% 10)] ggplot_bar_remind_vts(data_plot) + scale_fill_manual(values = colours_PE, name = NULL) + scale_x_continuous(breaks = periods, name = NULL) + labs(y = 'EJ/yr') ```