RSS

Category Archives: Data Table

Working with Data Table


Data.table inherits from data.frame. It is a data.frame too. A data.table can be passed to any package that only accepts data.frame and that package can use data.frame syntax on the data.table.

Data table syntactically different from data frame like j in data.table is fundamentally different from j in data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work.

You can read FAQ for more information.

If you are beginner, I would advised you to stick with data frame. There are some cases where data table shines like processing big data and few package required data table to process the data

Lets’s say, we have 2 vectors

age=12:20

height=c(48, 47, 70, 50, 62, 60, 62, 72, 55) and we want to combine them and form a table like structure, we can do through data frame students = data.frame(age, height) if we want to rename the data frame fields, students = data.frame(stu_age=age, stu_height=height) We can convert the existing data frame to data table using conversion process > require(data.table) > students_table = data.table(students) Or, we can directly create the data table students = data.table(stu_age=age, stu_height=height) student_data_table

Slicing and Dicing of data

Slicing means filtering rows from the data set and dicing means select set of columns from the data set. If you have a database background, slicing is “Select * from students where age>15” and dicing is “Select stu_age from students“. We can do slicing and dicing in data frame very easily.

Slicing

> students[stu_age>=18,]
slice_data_table

Dicing

> students[,stu_age]
[1] 12 13 14 15 16 17 18 19 20

This display the age field of student data frame. Its SQL equivalent are “Select stu_age from students

Slicing and Dicing

> students[stu_age>=15,stu_age]
[1] 15 16 17 18 19 20

> students[stu_age>=15,stu_age,stu_height]

slice and dice data table

Sorting

We can also order data frame fields > students[order(stu_height),]

order data table

It SQL equivalent are “Select * from students order by stu_height

Distinct

> unique(students$stu_height)
[1] 48 47 42 50 62 60 72 55

It SQL equivalent are “Select distinct stu_height from students

Data frame dimensions

To check the number of rows and columns in the data frame, we have nrow, ncol, dim, length

> ncol(students) – Number of columns or fields
[1] 2

> nrow(students) – Number of rows in student data table
[1] 9

> dim(students) – Number of rows and columns in student data table
[1] 9 2

> length(students) – Number of columns or fields
[1] 2

Grouping of Data (Split, Apply and combine data)

> require(plyr)
> ir=data.table(iris)
> ddply(ir, .(Species), summarize, means=mean(Sepal.Length), sd=round(sd(Sepal.Width),2))

ddply summarize

ddply is another powerful feature of R. It splits the data (as field specified in the bold letters), apply the grouping function and combine all the data. It works like “select Species, avg(cast([Sepal Length] as numeric(10,3))) mean, STDEV(cast([Sepal Width] as numeric(10,5))) sd from iris group by Species”. To understand this query, you have some knowledge of group by clause.

Summarize will not display entire data, as the name suggests it only display summary of data.

Same ddply statement with Transform function will display the entire data.

> ddply(ir, .(Species), transform, means=mean(Sepal.Length), sd=round(sd(Sepal.Width),2))

ddply transform

It return complete data frame with the calculated result. If we want to produce the same result through SQL, it would be like “select *, avg(cast([Sepal Length] as numeric(10,3))) over (partition by Species) mean, stdev(cast([Sepal Width] as numeric(10,3))) over (partition by Species) sd from iris“. To understand this SQL query, you should have knowledge of advanced SQL.

Data Table Merge / Join

R provide us to MERGE function to join two data tables. We can do inner join, outer joins, full join and cartesian product through merge. Its syntax is

merge(DT1, DT2, by=[by], all = FALSE, all.x = all, all.y = all,sort = TRUE….)

DT1, DT2 – data tables to be used for merging

by – specifications of the columns used for merging. We can specify multiple columns also like c(“Col1”, “Col2”).

all – logical; Should be either TRUE or FALSE.

all.x – logical; if TRUE, it works like an left join. The default is FALSE, it works like inner join.

all.y – logical; if TRUE, it works like an right join. The default is FALSE, it works like inner join.

sort – logical. Default is True.

Lets try to understand with the help of example. Suppose, we have two data frame

> stu=data.table(roll_no=c(3,1,2,5,4), names=c(“Peter”,”Jack”,”David”,”James”,”John”))
> marks=data.table(roll_no=c(4,2,3,6,1), Maths=c(89,92,76,67,90), Science=c(98,92,88,91,92))

Inner Join

> x=merge(stu, marks, by=”roll_no”, all=F, sort=F)
> x=merge(stu, marks, by=”roll_no”, all.x=F, sort=F)
> x=merge(stu, marks, by=”roll_no”, all.y=F, sort=F)

merge unsorted

Output of above three statements are same and all are work like inner join. Its SQL equivalent are “Select s.roll_no, names, Maths, Science from stu s inner join marks m on s.roll_no = m.roll

Left Join >merge(stu, marks, by=”roll_no”, all.x=T)

merge left join

In this case, all.x=TRUE, that means include entire data of table exists on the left side and coerced NA for non matching columns of table exists on the right hand side. Its SQL equivalent are “Select s.roll_no, names, Maths, Science from stu s left join marks m on s.roll_no = m.roll_no

Right Join

> merge(stu, marks, by=”roll_no”, all.y=T)

merge full join

In this case, all.y=TRUE, that means include entire data of table exists on the right side and coerced NA for non matching columns of table exists on the left hand side. Its SQL equivalent are “Select s.roll_no, names, Maths, Science from stu s right join marks m on stu.roll_no = marks.roll_noFull join

> merge(stu, marks, by=”roll_no”, all=T)

merge cartesian product

It include entire data of both the tables and coerced NA where data not exists.  Its SQL equivalent are “Select s.roll_no, names, Maths, Science from stu s full join marks m on s.roll_no = m.roll_no 

 
Leave a comment

Posted by on August 25, 2013 in Data Table

 

Tags: , , , , , , , , , , , , , , , , , , , ,