Big data sets can affect the performance of R Shiny applications. This is why it is important to think about the architecture of the application, including the way data is stored and retrieved. An interesting approach is to store the data in an SQLite database instead of R data files (e.g. *.rdata/*.rds) and to load only the subset of data needed to generate the visualizations.
The standard approach
We write a lot of Shiny applications to visualize manufacturing information in different ways. Because queries can be complex and return large tables, we typically execute them overnight. When someone opens the dashboard the next day, it shows the data from the previous day. We call this approach “near real-time”. Traditionally, we saved these data in R data files (rds/rdata). RDS stores a single data object (e.g. a data frame or list), whereas RData stores multiple data objects.
saveRDS(df, file = "data.rds")
df <- readRDS("data.rds")

save(df_1, df_2, df_3, file = "data.rdata")
load("data.rdata")
With increasing amounts of data and growing numbers of users, this approach inevitably leads to performance problems, since the entire data set is loaded into memory every time the dashboard is opened.
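To make the problem concrete, here is a minimal sketch of what this pattern looks like in a Shiny app (the file name and the timestamp column are hypothetical): the full extract is read for every session, even though only a small slice of it is displayed.

library(shiny)
library(dplyr)

ui <- fluidPage(tableOutput("recent"))

server <- function(input, output, session) {
  df <- readRDS("nightly_extract.rds")            # loads the complete data set for every session

  output$recent <- renderTable({
    df %>% filter(timestamp >= Sys.Date() - 90)   # only the last quarter is shown
  })
}

shinyApp(ui, server)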
Using SQLite
Often, it is not necessary to load the entire data set into memory. For example, in a time series plot we usually only show the past quarter or year, not the entire history, so we only need to load a subset of the data. This is where SQLite comes in. If we save the data in an SQLite database table, we can use SQL to query only the rows we need to prepare the visualizations (plots, tables, info boxes, etc.).
First, we install and load the necessary packages (RSQLite and DBI). Then we save the data to an SQLite database. In the example below, I show how to create a fairly large data set (1 million rows) by sampling from mtcars with replacement.
library(DBI)

data(mtcars)
rows <- sample(x = rownames(mtcars), size = 1E6, replace = TRUE)
df <- mtcars[rows, ]

# Connect to SQLite database
con <- dbConnect(RSQLite::SQLite(), "test.sqlite")

# Write table with 1 million rows
dbWriteTable(con, "mtcars_1m", df, overwrite = TRUE)

# Disconnect from database
dbDisconnect(con)
Now we can send arbitrary queries to the database using SQL. For instance, the code below retrieves 100 rows from the table and saves the data into a variable.
con <- dbConnect(RSQLite::SQLite(), "test.sqlite")
df_top_100 <- dbGetQuery(con, "SELECT * FROM mtcars_1m LIMIT 100")
dbDisconnect(con)
The main takeaway message here is that we only load a small subset into memory.
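In a Shiny app, the same idea can be wrapped in a reactive so that each visualization pulls only the rows it needs. Below is a minimal sketch against the mtcars_1m table created above; the input and output names are made up, and the query uses the params argument of dbGetQuery() so that user input is passed safely.

library(shiny)
library(DBI)

ui <- fluidPage(
  sliderInput("min_hp", "Minimum horsepower", min = 50, max = 350, value = 150),
  tableOutput("preview")
)

server <- function(input, output, session) {
  con <- dbConnect(RSQLite::SQLite(), "test.sqlite")
  session$onSessionEnded(function() dbDisconnect(con))   # close the connection with the session

  filtered <- reactive({
    # only the matching rows are transferred from disk into memory
    dbGetQuery(con,
               "SELECT * FROM mtcars_1m WHERE hp >= ? LIMIT 1000",
               params = list(input$min_hp))
  })

  output$preview <- renderTable(head(filtered(), 20))
}

shinyApp(ui, server)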
Performance considerations
Let’s take a look at the performance. To this end, we first create another large data set of ten million rows and ten columns, containing random numerical values. We will see later that the type of data we use makes a big difference.
library(dplyr)

x <- rnorm(1e8) %>%
  matrix(ncol = 10) %>%
  as.data.frame() %>%
  `colnames<-`(paste("column", 1:10, sep = "_"))
Writing the entire data set
Next, we write the entire data set to RDS and to SQLite and compare the performance. On my system, writing to SQLite is almost twice as fast as writing to RDS.

# Write RDS
ping_rds <- Sys.time()
saveRDS(x, "random_data.rds")
pong_rds <- Sys.time()

# Write SQLite
ping_sqlite <- Sys.time()
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE)
DBI::dbDisconnect(con)
pong_sqlite <- Sys.time()

data.frame(Type = c("RDS", "SQLite"),
           Time = c(pong_rds - ping_rds, pong_sqlite - ping_sqlite),
           stringsAsFactors = FALSE) %>%
  print
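As an aside, the Sys.time() bracketing works, but base R's system.time() expresses the same measurement more compactly (the microbenchmark package would be an option for repeated runs). A sketch of the equivalent timing:

# Equivalent timing with system.time(); "elapsed" is wall-clock time in seconds
time_rds <- system.time(saveRDS(x, "random_data.rds"))["elapsed"]

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
time_sqlite <- system.time(DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE))["elapsed"]
DBI::dbDisconnect(con)

data.frame(Type = c("RDS", "SQLite"), Seconds = c(time_rds, time_sqlite))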
Reading the entire data set
When we read the entire data set, SQLite is more than four times slower than RDS.

# Read RDS
x <- readRDS("random_data.rds")

# Read SQLite
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x <- DBI::dbReadTable(con, "random_table")
DBI::dbDisconnect(con)
Reading a subset
Finally, we read a subset of the data using a filter. Again, RDS with dplyr::filter is faster than SQLite. However, the object returned by SQLite is significantly smaller and its size depends only on the number of rows that match the filter, whereas with RDS the entire data set has to be loaded first: in this case ~800 MB vs. ~27 MB for the SQLite result (measured with object.size(x)).

# Read RDS subset
x <- readRDS("random_data.rds") %>%
  filter(column_1 < -0.9, column_2 > 0.9)

# Read SQLite subset
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
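To reproduce the memory figures quoted above, object.size() reports how much memory each result occupies:

# Compare the in-memory size of the full RDS data set with the SQLite subset
format(object.size(readRDS("random_data.rds")), units = "MB")   # ~800 MB in this example

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x_subset <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
format(object.size(x_subset), units = "MB")                     # ~27 MB in this example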
Using an index to improve performance
SQLite allows the creation of indices, which can improve performance significantly. I created an index on the first two columns, since those are the columns I filter on in my example; which columns to index depends on the queries you run, and an index can also span all columns. The index is clearly a game changer in this case: it reduces the query time to below one second. Creating an index takes time, because the database needs to sort all records, but it only has to be done once.

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")

# dbExecute() is used for statements that do not return a result
DBI::dbExecute(con, "CREATE INDEX col12 ON random_table(column_1, column_2);")

x <- DBI::dbGetQuery(con, "SELECT column_1, column_2 FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
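To check whether SQLite actually uses the index, we can ask for the query plan with EXPLAIN QUERY PLAN; the plan should mention the col12 index rather than a full table scan:

# The "detail" column of the plan should report something like
# "SEARCH random_table USING COVERING INDEX col12 (...)"
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN SELECT column_1, column_2 FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)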
Non-numerical data
It turns out that the performance depends heavily on the type of data. I replaced the numerical data with character values, and in this case SQLite performs much better than RDS.

# Replace the numerical data with random letters (sample() draws uniformly from LETTERS)
x <- sample(LETTERS, size = 1e8, replace = TRUE) %>%
  matrix(ncol = 10) %>%
  as.data.frame() %>%
  `colnames<-`(paste("column", 1:10, sep = "_"))

# Overwrite the table with the character data before querying it
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE)
x <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 = 'A' AND column_2 = 'B'")
DBI::dbDisconnect(con)
Typically, we have mixed data sets, where columns are of numerical, character and factor types. I leave it up to the reader to experiment with mixed data sets.
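As a starting point for such an experiment, a mixed data set could be constructed as follows (a sketch; note that RSQLite stores factor columns as plain text, so factor levels would need to be restored after reading):

# Sketch of a mixed data set: numeric, character and factor columns
n <- 1e6
mixed <- data.frame(
  value    = rnorm(n),                                             # numeric
  label    = sample(LETTERS, size = n, replace = TRUE),            # character
  category = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

con <- DBI::dbConnect(RSQLite::SQLite(), "mixed_data.sqlite")
DBI::dbWriteTable(con, "mixed_table", mixed, overwrite = TRUE)     # factor stored as TEXT
DBI::dbDisconnect(con)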
Summary
To my surprise, with a random numerical data set, RDS reads data faster than SQLite. However, with mixed data sets, or when we filter the data and use an index, SQLite performs on a different level. Moreover, SQLite helps to save memory on the server, and it may also reduce network bandwidth consumption, depending on where the data is hosted. To summarize, using SQLite, or any other type of database, can help improve the performance of a dashboard when resources are limited or a large number of users has to be served.
This post first appeared on rshiny.blog.