RSS

Category Archives: Data Frame

Working with Data Frame


Data frames are the primary data structure in R. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. As we shall see, a “case” is not necessarily the same as an experimental subject or unit, although they are often the same. Technically, in R a data frame is a list of column vectors, although there is only one reason why you might need to remember such an arcane thing. Unlike an array, the data you store in the columns of a data frame can be of various types. I.e., one column might be a numerical variable, another might be a factor, and a third might be a character variable. All columns have to be the same length (contain the same number of data items).

Let’s say, we have 2 vectors

age=12:20
height=c(48, 47, 70, 50, 62, 60, 62, 72, 55)

and we want to combine them and form a table like structure, we can do through data frame

students = data.frame(age, height)

if we want to rename the data frame fields,

students = data.frame(stu_age=age, stu_height=height)

student_data_frame

Slicing and Dicing of data

Slicing means filtering rows from the data set and dicing means select set of columns from the data set. If you have a database background, slicing is “Select * from students where age>15” and dicing is “Select age from students“. We can do slicing and dicing through data frame very easily.

Slicing

> students[students$stu_age>=18,]

slice_data_frame

To access the data frame column, we use $ symbol to separate data frame and its field. Its SQL equivalent are “Select * from students where stu_age >= 18

Dicing

> students[,c(“stu_age”)]
[1] 12 13 14 15 16 17 18 19 20

This display the age field of student data frame. Its SQL equivalent are “Select stu_age from students

Slicing and Dicing

> students[students$stu_age>=15, c(“stu_age”)]
[1] 15 16 17 18 19 20

> students[students$stu_age>=15, c(“stu_age”,”stu_height”)]

slice and dice data frame

Sorting

We can also order data frame fields

> students[order(students$stu_height),]

order data frame

Its SQL equivalent are “Select * from students order by stu_height

Distinct

> unique(students$stu_height)
[1] 48 47 42 50 62 60 72 55

Its SQL equivalent are “Select distinct stu_height from students

Data frame dimensions

To check the number of rows and columns in the data frame, we have nrow, ncol, dim, length

> ncol(students) – Number of columns or fields
[1] 2
> nrow(students) – Number of rows in student data frame
[1] 9
> dim(students) – Number of rows and columns in student data frame
[1] 9 2
> length(students) – Number of columns or fields
[1] 2

Grouping of Data (Split, Apply and combine data)

> require(plyr)
> data(iris)
> ddply(iris, .(Species), summarize, means=mean(Sepal.Length), sd=round(sd(Sepal.Width),2))

ddply summarize

ddply is another powerful feature of R. It splits the data (as field specified in the bold letters), apply the grouping function and combine the data. Its SQL equivalent are “select Species, avg(cast([Sepal Length] as numeric(10,3))) mean, STDEV(cast([Sepal Width] as numeric(10,5))) sd from iris group by Species”. To understand this query, you have some knowledge of group by clause.

Summarize will not display entire data, as the name suggests it only display summary of data.

Same ddply statement with Transform function will display the entire data.

> ddply(iris, .(Species), transform, mean=mean(Sepal.Length), sd=round(sd(Sepal.Width),2))

ddply transform

It returns complete data frame with the calculated result. If we want to produce the same result through SQL,  Its SQL equivalent are “select *, avg(cast([Sepal Length] as numeric(10,3))) over (partition by Species) mean, stdev(cast([Sepal Width] as numeric(10,3))) over (partition by Species) sd from iris“. To understand this SQL query, you should have knowledge of advanced SQL.

Data Frame Merge / Join

R provide us to MERGE function to join two data frames. We can do inner join, outer joins, full join and cartesian product through merge. Its syntax is

merge(DF1, DF2, by.x = [by], by.y = [by], all = FALSE, all.x = all, all.y = all,sort = TRUE….)

DF1, DF2 – data frames to be used for merging

by, by.x, by.y – specifications of the columns used for merging. We can specify multiple columns also like c(“Col1”, “Col2”). If we specify by=NULL, result we got cartesian product of both the data frames. Columns which we want to used for matching the two data frames should be specified in by.x and by.y.

all – logical; Should be either TRUE or FALSE.

all.x – logical; if TRUE, it works like an left join. The default is FALSE, it works like inner join.

all.y – logical; if TRUE, it works like an right join. The default is FALSE, it works like inner join.

sort – logical. Default is True.

Lets try to understand with the help of example. Suppose, we have two data frame

> stu=data.frame(roll_no=c(3,1,2,5,4), names=c(“Peter”,”Jack”,”David”,”James”,”John”))
> marks=data.frame(roll=c(4,2,3,6,1), Maths=c(89,92,76,67,90), Science=c(98,92,88,91,92))

Inner Join

> x=merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all=F, sort=F)
> x=merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all.x=F, sort=F)
> x=merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all.y=F, sort=F)

merge unsorted

Output of above three statements are same and all are work like inner join. Its SQL equivalent are “Select roll_no, names, Maths, Science from stu inner join marks on stu.roll_no = marks.roll

Left Join

>merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all.x=T)

merge left join

In this case, all.x=TRUE, that means include entire data of table exists on the left side and coerced NA for non matching columns of table exists on the right hand side. Its SQL equivalent are “Select roll_no, names, Maths, Science from stu left join marks on stu.roll_no = marks.roll

Right Join

> merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all.y=T)

merge right join

In this case, all.y=TRUE, that means include entire data of table exists on the right side and coerced NA for non matching columns of table exists on the left hand side. Its SQL equivalent are “Select roll_no, names, Maths, Science from stu right join marks on stu.roll_no = marks.roll

Full join

> merge(stu, marks, by.x=”roll_no”, by.y=”roll”, all=T)

merge full join

It include entire data of both the tables and coerced NA where data not exists.  Its SQL equivalent are “Select roll_no, names, Maths, Science from stu full join marks on stu.roll_no = marks.roll

Cartesain Product

> merge(stu, marks, by=NULL)

merge cartesian product

It will do cartesian product, if we specify by=NULL. Its SQL equivalent are “Select roll_no, names, roll, Maths, Science from stu. marks”

 
Leave a comment

Posted by on August 22, 2013 in Data Frame, R Objects

 

Tags: , , , , , , , , , , , , , , , , , , , , ,