
ColByCol R package

This package is intended for reading big datasets into R. It can be useful under certain conditions, particularly for files with many columns.

Reading big datasets into R

By default, R reads data into memory. This creates problems when dealing with large datasets. Many solutions have been proposed for dealing with such datasets; uploading your data into a DBMS is one of them.

However, the read.table function remains the main data import function in R. It is memory inefficient: according to some estimates, it requires three times as much memory as the size of a dataset in order to read it into R.

The reason for this inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, read.table needs to read whole lines, process them individually to break them into tokens, and then transpose these tokens into column-oriented data structures.
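The following toy sketch, which is only an illustration and not read.table's actual implementation, shows the kind of work a row-oriented reader performs on a handful of records:
lines  <- c( "1.5,foo", "2.7,bar", "3.1,baz" )                    # rows of records, as in a text file
tokens <- strsplit( lines, split = ",", fixed = TRUE )            # one vector of tokens per row
cols   <- lapply( 1:2, function( i ) sapply( tokens, "[", i ) )   # transpose tokens into columns
df     <- data.frame( x = as.numeric( cols[[1]] ), y = cols[[2]] )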

Short tutorial

First, we will create a huge file on disk.
n.rows    <- 10000
n.cols    <- 100
n.levels  <- 200

## numeric and integer columns
df.double <- data.frame( matrix( rnorm( n.rows * n.cols ), n.rows, n.cols ) )
df.integ  <- data.frame( matrix( sample( 1:(n.rows * n.cols), n.rows * n.cols, replace = T ), n.rows, n.cols ) )

## character columns: n.levels random strings of 1 to 20 letters,
## sampled with replacement to fill each column
df.txt    <- sample( 1:20, n.levels, replace = T )
df.txt    <- sapply( df.txt, function(i) paste( sample( letters, i, replace = T ), collapse = "" ) )
df.txt    <- replicate( n.cols, sample( df.txt, n.rows, replace = T ), simplify = F )
df.txt    <- as.data.frame( df.txt )
colnames( df.txt ) <- paste( "txt", 1:n.cols, sep = "." )

df.all    <- cbind( df.double, df.integ, df.txt )

## write the data once with a header, then append 25 more copies
write.table( df.all, file = "bigsize.txt", col.names = T, row.names = F, sep = "," )
replicate( 25, write.table( df.all, file = "bigsize.txt", col.names = F, row.names = F, sep = ",", append = T ) )
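You may want to check the size of the file just written; base R's file.info reports it in bytes:
file.info( "bigsize.txt" )$size / 2^20    # file size in megabytes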

The file thus created is about 1GB in size. You can tune n.rows, n.cols, and n.levels to get file sizes suitable for your system. If you are brave enough, you can then try to load the file in the standard way, i.e.,
can.i   <- read.table( "bigsize.txt", header = T, sep = "," )         # in fact, I cannot
but I advise you not to. Instead, you can use the colbycol package as follows:
library( colbycol )
i.can <- cbc.read.table( "bigsize.txt", header = T, sep = "," )
After a few minutes, your i.can object of class colbycol will be available. What you do with it after that is up to you, but you can try, for instance,
summary( i.can )
colnames( i.can )

sapply( 1:100, function( x ) summary( cbc.get.col( i.can, x ) )  )

# my.df <- as.data.frame( i.can )                   # perhaps too much

my.df <- as.data.frame( i.can, columns = 1:10 )
my.df <- as.data.frame( i.can, columns = 1:10, rows = 1:300 )
my.df <- as.data.frame( i.can, rows = 1:300 )
It is not even necessary to preprocess all the rows and columns from the original text file. For instance, the commands
i.can <- cbc.read.table( "bigsize.txt", just.read = 1:100, header = T, sep = "," )
i.can <- cbc.read.table( "bigsize.txt", sample.pct = 0.5, header = T, sep = "," )
will read just columns 1:100 of the text file and a random sample of approximately 50% of its rows, respectively.

ColByCol approach

The ColByCol approach is memory efficient. Using Java code, it reads the input text file and splits it into several text files, each holding an individual column of the original dataset. These files are then read individually into R, thus avoiding R's memory bottleneck.
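The following base R function is just a sketch of this idea, not the package's implementation (which streams the file from Java instead of loading it whole):
split.columns <- function( infile, sep = "," ) {
  rows   <- strsplit( readLines( infile ), split = sep, fixed = TRUE )  # all rows, tokenized
  n.cols <- length( rows[[1]] )
  files  <- paste( "column", 1:n.cols, "txt", sep = "." )
  for( i in 1:n.cols )                                  # write each column to its own file
    writeLines( sapply( rows, "[", i ), files[i] )
  files
}

## each per-column file can then be read back independently, e.g.
## as.numeric( readLines( "column.1.txt" )[-1] )        # drop the header line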

The approach works best for big files divided into many columns, especially when these columns can be transformed into memory-efficient types and data structures: R's representation of numbers (in some cases) and factors encoding character vectors with repeated values occupy much less space than their character representation.
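For instance, a factor stores a single integer code per element plus each distinct level once, and object.size shows the difference:
x <- sample( c( "alpha", "beta", "gamma" ), 1e6, replace = TRUE )
print( object.size( x ), units = "MB" )             # as a character vector
print( object.size( factor( x ) ), units = "MB" )   # as a factor: integer codes plus 3 levels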

Package ColByCol has been successfully used to read multi-GB datasets on a 2GB laptop.

Further package info

The package can be found on CRAN and installed using R's install.packages function.
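That is, from a fresh R session:
install.packages( "colbycol" )
library( colbycol )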

The project summary page can be found at R-Forge. You may also be interested in the package's page at crantastic.

The package has been developed by Carlos J. Gil Bellosta.