
Big Data Analysis using ff and ffbase

Native R stores everything in RAM (see the R memory management documentation for details). Depending on the hardware configuration, R objects can take up only about 2-4 GB of memory, the classic ceiling on 32-bit systems. Beyond the available memory, R returns “Error: cannot allocate vector of size ……”, which leaves us handicapped when working with big data in R.
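To get a feel for how quickly in-memory objects grow, here is a minimal base R sketch. The vector length is an arbitrary choice for illustration and the names are hypothetical.

```r
# A plain numeric vector lives entirely in RAM: 8 bytes per element plus a small header.
x <- numeric(1e6)                       # one million doubles (arbitrary size)
size_bytes <- as.numeric(object.size(x))

# Show the footprint in megabytes; a billion elements would need roughly 1000 times more.
print(object.size(x), units = "MB")
```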

[Figure: Data storage with a standard R object]

Because R is open source, a group of scholars continuously strives to create R packages that help us work effectively with big data.

The ff package, developed by Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel and Walter Zucchini and maintained by Jens Oehlschlägel, is designed to overcome this limitation. Instead of keeping the data in memory, it stores it as native binary flat files on other media such as a hard disk, CD or DVD. It also allows you to work on several very large data files simultaneously. It reads a data file in chunks and writes each chunk to the external drive.

[Figure: Data storage with ff]

Reading a CSV file using the ff package

>options(fftempdir = "[Provide path where you want to store binary files]")
>file_chunks <- read.csv.ffdf(file="big_data.csv", header=TRUE, sep=",", VERBOSE=TRUE, next.rows=500000, colClasses=NA)

This reads the big_data CSV file chunk by chunk, with the chunk size given by next.rows. Each chunk is read and written as binary files to the external medium, and only a pointer to those files is kept in RAM. The step repeats until the CSV file has no chunks left.
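The chunked read can be tried on a small file first. The sketch below is only an illustration: the CSV, its column names (id, value), the temporary directory and the chunk sizes are all made up, and the columns are numeric only to sidestep ff's lack of a character type.

```r
library(ff)

# Hypothetical stand-in for big_data.csv: 1000 numeric rows in a temporary file.
csv_path <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:1000, value = (1:1000) / 10),
                 csv_path, row.names = FALSE)

# Directory for the binary flat files (assumption: a temp dir; use a roomy disk in practice).
ff_dir <- file.path(tempdir(), "ff_files")
dir.create(ff_dir, showWarnings = FALSE)
options(fftempdir = ff_dir)

# Read 100 rows first (used to work out column types), then 400 rows per chunk.
file_chunks <- read.csv.ffdf(file = csv_path, header = TRUE,
                             first.rows = 100, next.rows = 400,
                             VERBOSE = TRUE)

dim(file_chunks)     # rows and columns of the ffdf
class(file_chunks)   # the data lives on disk; R holds an ffdf handle
```

With VERBOSE=TRUE, ff reports each chunk as it is read, which is a convenient way to check that next.rows is having the intended effect.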

[Figure: Functioning of ff package]

In the same way, we can write CSV or other flat files in chunks. ff reads the data chunk by chunk from the HDD or other external media and writes each chunk to CSV or another supported format.

>write.csv.ffdf(file_chunks, "file_name.csv")
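A small round trip makes the chunked write easier to verify. This is a sketch under assumptions: the data, column names, output path and chunk size of 250 rows are hypothetical, chosen only so the example runs quickly.

```r
library(ff)

# Hypothetical ffdf built from a small in-memory data frame.
ffd <- as.ffdf(data.frame(id = 1:1000, value = (1:1000) / 10))

# Write it back out in chunks of 250 rows to a temporary CSV file.
out_path <- tempfile(fileext = ".csv")
write.csv.ffdf(ffd, file = out_path, first.rows = 250, next.rows = 250)

# Read the result with plain R to confirm every chunk arrived.
check <- utils::read.csv(out_path)
nrow(check)
```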

Together with the ffbase package, ff lets us perform all sorts of operations such as joins, aggregations, and slicing and dicing.

>Merged_data = merge(ffobject1, ffobject2, by.x=c("Col1", "Col2"), by.y=c("Col1", "Col2"), trace=TRUE)

The merge function for ff objects, provided by ffbase, works much like merge does for data frames, but it supports inner and left joins only.
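The difference between the two supported joins shows up on a tiny example. The tables and the amount and region_id columns below are hypothetical; only merge with by.x, by.y and all.x is taken from ffbase.

```r
library(ff)
library(ffbase)

# Two hypothetical tables sharing the composite key (Col1, Col2).
ffobject1 <- as.ffdf(data.frame(Col1 = c(1L, 2L, 3L, 4L),
                                Col2 = c(10L, 20L, 30L, 40L),
                                amount = c(5.5, 6.5, 7.5, 8.5)))
ffobject2 <- as.ffdf(data.frame(Col1 = c(1L, 2L, 3L),
                                Col2 = c(10L, 20L, 30L),
                                region_id = c(100L, 200L, 300L)))

# Inner join: only keys present in both tables survive.
inner <- merge(ffobject1, ffobject2,
               by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"))

# Left join: every row of ffobject1 is kept; unmatched rows get NA.
left <- merge(ffobject1, ffobject2,
              by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"),
              all.x = TRUE)

inner_df <- as.data.frame(inner)
left_df  <- as.data.frame(left)
```

Here the key (4, 40) has no partner in ffobject2, so it is dropped by the inner join and carries a missing region_id in the left join.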

>library(doBy)
>AggregatedData = ffdfdply(ffobject, split=as.character(ffobject$Col1), FUN=function(x) summaryBy(Col3+Col4+Col5 ~ Col1, data=x, FUN=sum))

To perform the aggregation, I used the summaryBy function from the doBy package. In the ffdfdply call above, the data is split on a key column. If the key is a combination of two or more fields, we can generate a single key column using the ikey function:

>ffobject$KeyColumn <- ikey(ffobject[c("Col1", "Col2", "Col3")])
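The two steps can be combined: build a composite key with ikey, then split on it in ffdfdply. The sketch below uses made-up data and swaps summaryBy for base R's aggregate so that doBy is not needed; it is an illustration under those assumptions, not the only way to do it.

```r
library(ff)
library(ffbase)

# Hypothetical data: the grouping key is the combination of Col1 and Col2.
ffobject <- as.ffdf(data.frame(
  Col1 = factor(c("a", "a", "b", "b", "a", "b")),
  Col2 = c(1L, 1L, 1L, 2L, 2L, 2L),
  Col3 = c(10, 20, 30, 40, 50, 60)
))

# One integer key per distinct (Col1, Col2) combination.
ffobject$KeyColumn <- ikey(ffobject[c("Col1", "Col2")])

# Split on the key so all rows of a group land in the same chunk,
# then sum Col3 per group inside each chunk (aggregate stands in for summaryBy).
AggregatedData <- ffdfdply(
  ffobject,
  split = ffobject$KeyColumn,
  FUN = function(x) aggregate(Col3 ~ Col1 + Col2, data = x, FUN = sum),
  trace = FALSE
)

result <- as.data.frame(AggregatedData)
```

A single chunk handed to FUN can contain several key values, which is why the grouping formula is repeated inside FUN rather than relying on the split alone.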

For all its advantages, such as working with big data and depending less on RAM, ff has a few limitations:

1. Sometimes we have to compromise on speed when performing complex operations on a huge data set.
2. Development is harder with ff than with native R objects.
3. We need to keep track of the flat files stored on disk; otherwise the HDD or external media will be left with little or no space.
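For the third point, ff objects can be removed together with their backing files once they are no longer needed. A minimal sketch with a throwaway ff vector follows; the object is hypothetical, and the same idea should apply to an ffdf.

```r
library(ff)

# A throwaway ff vector; its data sits in a binary file under fftempdir.
x <- ff(1:10)
backing_file <- filename(x)
existed_before <- file.exists(backing_file)

# delete() closes the object and removes its file from disk;
# rm() then drops the now-useless handle from the R session.
delete(x)
rm(x)
exists_after <- file.exists(backing_file)

# Anything still listed here is occupying disk space.
list.files(getOption("fftempdir"))
```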


