Big data sets can affect the performance of R Shiny applications. This is why it is important to think about the architecture of the application, including the way data is stored and retrieved. An interesting approach is to store the data in an SQLite database instead of R data files (e.g. *.rdata/*.rds) and to load only the subset of data needed to generate the visualizations.
The standard approach
We write a lot of Shiny applications to visualize manufacturing information in different ways. Because queries can be complex and return large tables, we typically execute them overnight. When someone opens the dashboard the next day, it shows the data from the previous day. We call this approach “near real-time”. Traditionally, we saved these data in R data files (rds/rdata). RDS stores a single data object (e.g. a data frame or list), whereas RData stores multiple data objects.
saveRDS(df, file = "data.rds")
df <- readRDS("data.rds")

save(df_1, df_2, df_3, file = "data.rdata")
load("data.rdata")
With increasing amounts of data and growing numbers of users, this approach inevitably leads to performance problems, since the entire data set is loaded into memory every time the dashboard is opened.
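To make the problem concrete, here is a minimal sketch of what this pattern looks like in a Shiny app (the file name and the timestamp column are hypothetical): the full extract is read for every session, even though only a small slice of it is displayed.

library(shiny)
library(dplyr)

ui <- fluidPage(tableOutput("recent"))

server <- function(input, output, session) {
  df <- readRDS("nightly_extract.rds")            # loads the complete data set for every session

  output$recent <- renderTable({
    df %>% filter(timestamp >= Sys.Date() - 90)   # only the last quarter is shown
  })
}

shinyApp(ui, server)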
Using SQLite
Often, it is not necessary to load the entire data set into memory. For example, in a time series plot we usually only show the past quarter or year, not the entire history, so we only need to load a subset of the data. This is where SQLite comes in. If we save the data in an SQLite database table, we can use SQL to query only the rows we need to prepare the visualizations (plots, tables, info boxes, etc.).
First, we install and load the necessary packages (RSQLite and DBI). Then we save the data to an SQLite database. In the example below, I show how to create a fairly large data set (1 million rows) by sampling from mtcars with replacement.
library(DBI)

data(mtcars)
rows <- sample(x = rownames(mtcars), size = 1E6, replace = TRUE)
df <- mtcars[rows, ]

# Connect to SQLite database
con <- dbConnect(RSQLite::SQLite(), "test.sqlite")

# Write table with 1 million rows
dbWriteTable(con, "mtcars_1m", df, overwrite = TRUE)

# Disconnect from database
dbDisconnect(con)
Now we can send arbitrary queries to the database using SQL. For instance, the code below retrieves 100 rows from the table and saves the data into a variable.
con <- dbConnect(RSQLite::SQLite(), "test.sqlite")
df_top_100 <- dbGetQuery(con, "SELECT * FROM mtcars_1m LIMIT 100")
dbDisconnect(con)
The main takeaway message here is that we only load a small subset into memory.
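In a Shiny app, the same idea can be wrapped in a reactive so that each visualization pulls only the rows it needs. Below is a minimal sketch against the mtcars_1m table created above; the input and output names are made up, and the query uses the params argument of dbGetQuery() so that user input is passed safely.

library(shiny)
library(DBI)

ui <- fluidPage(
  sliderInput("min_hp", "Minimum horsepower", min = 50, max = 350, value = 150),
  tableOutput("preview")
)

server <- function(input, output, session) {
  con <- dbConnect(RSQLite::SQLite(), "test.sqlite")
  session$onSessionEnded(function() dbDisconnect(con))   # close the connection with the session

  filtered <- reactive({
    # only the matching rows are transferred from disk into memory
    dbGetQuery(con,
               "SELECT * FROM mtcars_1m WHERE hp >= ? LIMIT 1000",
               params = list(input$min_hp))
  })

  output$preview <- renderTable(head(filtered(), 20))
}

shinyApp(ui, server)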
Performance considerations
Let’s take a look at the performance. To this end, we first create another large data set of ten million rows and ten columns, containing random numerical values. We will see later that the type of data we use makes a big difference.
library(dplyr)

x <- rnorm(1e8) %>%
  matrix(ncol = 10) %>%
  as.data.frame() %>%
  `colnames<-`(paste("column", 1:10, sep = "_"))
Writing the entire data set
Next, we write the entire data set to RDS and to SQLite and compare the performance. On my system, writing to SQLite is almost twice as fast as writing to RDS.

# Write RDS
ping_rds <- Sys.time()
saveRDS(x, "random_data.rds")
pong_rds <- Sys.time()

# Write SQLite
ping_sqlite <- Sys.time()
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE)
DBI::dbDisconnect(con)
pong_sqlite <- Sys.time()

data.frame(Type = c("RDS", "SQLite"),
           Time = c(pong_rds - ping_rds, pong_sqlite - ping_sqlite),
           stringsAsFactors = FALSE) %>%
  print
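As an aside, the Sys.time() bracketing works, but base R's system.time() expresses the same measurement more compactly (the microbenchmark package would be an option for repeated runs). A sketch of the equivalent timing:

# Equivalent timing with system.time(); "elapsed" is wall-clock time in seconds
time_rds <- system.time(saveRDS(x, "random_data.rds"))["elapsed"]

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
time_sqlite <- system.time(DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE))["elapsed"]
DBI::dbDisconnect(con)

data.frame(Type = c("RDS", "SQLite"), Seconds = c(time_rds, time_sqlite))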
Reading the entire data set
When we read the entire data set, SQLite is more than four times slower than RDS.

# Read RDS
x <- readRDS("random_data.rds")

# Read SQLite
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x <- DBI::dbReadTable(con, "random_table")
DBI::dbDisconnect(con)
Reading a subset
Finally, we read a subset of the data using a filter. Again, RDS with dplyr::filter is faster than SQLite. However, the object returned by SQLite is significantly smaller and its size depends only on the number of rows that match the filter, whereas with RDS the entire data set has to be loaded first: in this case ~800 MB vs. ~27 MB for the SQLite result (measured with object.size(x)).

# Read RDS subset
x <- readRDS("random_data.rds") %>%
  filter(column_1 < -0.9, column_2 > 0.9)

# Read SQLite subset
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
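To reproduce the memory figures quoted above, object.size() reports how much memory each result occupies:

# Compare the in-memory size of the full RDS data set with the SQLite subset
format(object.size(readRDS("random_data.rds")), units = "MB")   # ~800 MB in this example

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
x_subset <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
format(object.size(x_subset), units = "MB")                     # ~27 MB in this example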
Using an index to improve performance
SQLite allows the creation of indices, which can improve performance significantly. I created an index on the first two columns, since those are the columns I filter on in my example; which columns to index depends on the queries you run, and an index can also span all columns. The index is clearly a game changer in this case: it reduces the query time to below one second. Creating an index takes time, because the database needs to sort all records, but it only has to be done once.

con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")

# dbExecute() is used for statements that do not return a result
DBI::dbExecute(con, "CREATE INDEX col12 ON random_table(column_1, column_2);")

x <- DBI::dbGetQuery(con, "SELECT column_1, column_2 FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)
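To check whether SQLite actually uses the index, we can ask for the query plan with EXPLAIN QUERY PLAN; the plan should mention the col12 index rather than a full table scan:

# The "detail" column of the plan should report something like
# "SEARCH random_table USING COVERING INDEX col12 (...)"
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN SELECT column_1, column_2 FROM random_table WHERE column_1 < -0.9 AND column_2 > 0.9")
DBI::dbDisconnect(con)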
Non-numerical data
It turns out that the performance depends heavily on the type of data. I replaced the numerical data with character values, and in this case SQLite performs much better than RDS.

# Replace the numerical data with random letters (sample() draws uniformly from LETTERS)
x <- sample(LETTERS, size = 1e8, replace = TRUE) %>%
  matrix(ncol = 10) %>%
  as.data.frame() %>%
  `colnames<-`(paste("column", 1:10, sep = "_"))

# Overwrite the table with the character data before querying it
con <- DBI::dbConnect(RSQLite::SQLite(), "random_data.sqlite")
DBI::dbWriteTable(con, "random_table", x, overwrite = TRUE)
x <- DBI::dbGetQuery(con, "SELECT * FROM random_table WHERE column_1 = 'A' AND column_2 = 'B'")
DBI::dbDisconnect(con)
Typically, we have mixed data sets, where columns are of numerical, character and factor types. I leave it up to the reader to experiment with mixed data sets.
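As a starting point for such an experiment, a mixed data set could be constructed as follows (a sketch; note that RSQLite stores factor columns as plain text, so factor levels would need to be restored after reading):

# Sketch of a mixed data set: numeric, character and factor columns
n <- 1e6
mixed <- data.frame(
  value    = rnorm(n),                                             # numeric
  label    = sample(LETTERS, size = n, replace = TRUE),            # character
  category = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

con <- DBI::dbConnect(RSQLite::SQLite(), "mixed_data.sqlite")
DBI::dbWriteTable(con, "mixed_table", mixed, overwrite = TRUE)     # factor stored as TEXT
DBI::dbDisconnect(con)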
Summary
To my surprise, with a random numerical data set, RDS reads data faster than SQLite. However, with mixed data sets, or when we filter the data and use an index, SQLite performs on a different level. Moreover, SQLite helps to save memory on the server, and it may also reduce network bandwidth consumption, depending on where the data is hosted. To summarize, using SQLite, or any other type of database, can help improve the performance of a dashboard when resources are limited or a large number of users has to be served.
This post first appeared on rshiny.blog.