How to join (merge) data frames (inner, outer, left, right)
Given two data frames:
    df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
    df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

    df1
    #  CustomerId Product
    #           1 Toaster
    #           2 Toaster
    #           3 Toaster
    #           4   Radio
    #           5   Radio
    #           6   Radio

    df2
    #  CustomerId   State
    #           2 Alabama
    #           4 Alabama
    #           6    Ohio
How can I do database-style, i.e., SQL-style, joins? That is, how do I get:
- An inner join of df1 and df2:
Return only the rows in which the left table has matching keys in the right table.
- An outer join of df1 and df2:
Return all rows from both tables, joining records from the left that have matching keys in the right table.
- A left outer join (or simply left join) of df1 and df2:
Return all rows from the left table, and any rows with matching keys from the right table.
- A right outer join of df1 and df2:
Return all rows from the right table, and any rows with matching keys from the left table.
How can I do a SQL style select statement?
By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you are matching only on the fields you want. You can also use the by.x and by.y parameters if the matching variables have different names in the two data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
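As a rough check of the merge variants above, the following sketch rebuilds the question's two data frames and compares the number of rows each kind of merge returns. The result names (inner, outer, and so on) are only illustrative.

```r
# Data frames from the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

inner <- merge(df1, df2, by = "CustomerId")                # matching keys only
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # all rows from both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all rows from df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all rows from df2
cross <- merge(df1, df2, by = NULL)                        # every pairing of rows

# Row counts per join type
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
# inner outer  left right cross
#     3     6     6     3    18
```

Rows of df1 without a match in df2 (customers 1, 3 and 5) get NA in the State column of the left and outer results.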
You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
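A minimal sketch of both cases follows. The data frames customers_a, customers_b, orders and shipments are hypothetical and exist only to illustrate the by, by.x and by.y arguments.

```r
# Hypothetical frames whose key columns have different names
customers_a <- data.frame(CustomerId_in_df1 = 1:6,
                          Product = c(rep("Toaster", 3), rep("Radio", 3)))
customers_b <- data.frame(CustomerId_in_df2 = c(2, 4, 6),
                          State = c("Alabama", "Alabama", "Ohio"))

# Inner join on differently named keys; the result keeps the name from by.x
m <- merge(customers_a, customers_b,
           by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2")
names(m)
# "CustomerId_in_df1" "Product" "State"

# Hypothetical frames that share a two-column key
orders <- data.frame(CustomerId = c(1, 1, 2),
                     OrderId = c(10, 11, 10),
                     Product = c("Toaster", "Radio", "Radio"))
shipments <- data.frame(CustomerId = c(1, 2, 2),
                        OrderId = c(11, 10, 12),
                        Status = c("shipped", "pending", "shipped"))

# A row matches only when both CustomerId and OrderId agree
multi <- merge(orders, shipments, by = c("CustomerId", "OrderId"))
nrow(multi)
# 2
```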
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
    library(sqldf)

    ## inner join
    df3 <- sqldf("SELECT CustomerId, Product, State
                  FROM df1
                  JOIN df2 USING(CustomerID)")

    ## left join (substitute 'right' for right join)
    df4 <- sqldf("SELECT CustomerId, Product, State
                  FROM df1
                  LEFT JOIN df2 USING(CustomerID)")
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
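For comparison, here is a sketch of the same two joins written with an explicit ON clause instead of USING, assuming the sqldf package is installed and uses its default SQLite backend. The names inner_sql and left_sql are illustrative.

```r
library(sqldf)

# Data frames from the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Inner join: only customers present in both frames
inner_sql <- sqldf("SELECT df1.CustomerId, Product, State
                    FROM df1
                    JOIN df2 ON df1.CustomerId = df2.CustomerId")

# Left join: every row of df1, with NA for State where df2 has no match
left_sql <- sqldf("SELECT df1.CustomerId, Product, State
                   FROM df1
                   LEFT JOIN df2 ON df1.CustomerId = df2.CustomerId")

nrow(inner_sql)  # 3
nrow(left_sql)   # 6
```

The ON form is handy when the key columns are named differently in the two tables, where USING cannot be applied.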
There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
    library(data.table)

    dt1 <- data.table(df1, key = "CustomerId")
    dt2 <- data.table(df2, key = "CustomerId")

    joined.dt1.dt.2 <- dt1[dt2]
merge also works on data.tables (as it is generic and calls merge.data.table).

data.table documented on Stack Overflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
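As a sketch of how the data.table syntax maps onto the join types, assuming a data.table version recent enough to accept on = and nomatch = NULL (1.12.0 or later), and using tables that mirror the question's data with integer keys on both sides as a precaution:

```r
library(data.table)

dt1 <- data.table(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
dt2 <- data.table(CustomerId = c(2L, 4L, 6L),
                  State = c("Alabama", "Alabama", "Ohio"))

# X[Y] looks up each row of Y in X; nomatch = NULL drops rows of Y
# that have no match in X, which makes this an inner join
inner_dt <- dt1[dt2, on = "CustomerId", nomatch = NULL]

# Left join with dt1 on the left: look up every dt1 key in dt2,
# keeping all rows of dt1 (State is NA where there is no match)
left_dt <- dt2[dt1, on = "CustomerId"]

# merge() dispatches to the data.table method and takes the same
# all / all.x / all.y arguments as the data.frame method
outer_dt <- merge(dt1, dt2, by = "CustomerId", all = TRUE)

c(inner = nrow(inner_dt), left = nrow(left_dt), outer = nrow(outer_dt))
# inner  left outer
#     3     6     6
```

Note that a plain dt1[dt2] keeps every row of dt2, so it only coincides with an inner join when all keys in dt2 are present in dt1, as they are in the question's data.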
Yet another option is the join function found in the plyr package:

    library(plyr)

    join(df1, df2, type = "inner")
    #   CustomerId Product   State
    # 1          2 Toaster Alabama
    # 2          4   Radio Alabama
    # 3          6   Radio    Ohio

From ?join: unlike merge, join preserves the order of x no matter what join type is used.
You can do joins as well using Hadley Wickham's awesome dplyr package.
    library(dplyr)

    # make sure that the CustomerId columns are both of type numeric;
    # they are not when using the code provided in the question, and dplyr will complain
    df1$CustomerId <- as.numeric(df1$CustomerId)
    df2$CustomerId <- as.numeric(df2$CustomerId)
Mutating joins: add columns to df1 using matches in df2
    # inner
    inner_join(df1, df2)

    # left outer
    left_join(df1, df2)

    # right outer
    right_join(df1, df2)

    # alternate right outer
    left_join(df2, df1)

    # full join
    full_join(df1, df2)
Filtering joins: filter out rows in df1, don't modify columns
    semi_join(df1, df2) # keeps only observations in df1 that match in df2
    anti_join(df1, df2) # drops all observations in df1 that match in df2
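To make the difference between mutating and filtering joins concrete, here is a small sketch that assumes dplyr is installed and passes by = "CustomerId" explicitly rather than relying on the common-column default. The result names are illustrative.

```r
library(dplyr)

# Data from the question, with both keys stored as numeric
df1 <- data.frame(CustomerId = as.numeric(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Mutating joins add the State column from df2
inner_res <- inner_join(df1, df2, by = "CustomerId")  # 3 rows
left_res  <- left_join(df1, df2, by = "CustomerId")   # 6 rows
right_res <- right_join(df1, df2, by = "CustomerId")  # 3 rows
full_res  <- full_join(df1, df2, by = "CustomerId")   # 6 rows

# Filtering joins keep only df1's columns
semi_res <- semi_join(df1, df2, by = "CustomerId")    # customers 2, 4, 6
anti_res <- anti_join(df1, df2, by = "CustomerId")    # customers 1, 3, 5

names(semi_res)
# "CustomerId" "Product"
```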
There are some good examples of doing this over at the R Wiki. I'll steal a couple here:
Since your keys are named the same, the short way to do an inner join is merge():

    merge(df1, df2)

A full outer join (all records from both tables) can be created with the "all" keyword:

    merge(df1, df2, all = TRUE)

A left outer join of df1 and df2:

    merge(df1, df2, all.x = TRUE)

A right outer join of df1 and df2:

    merge(df1, df2, all.y = TRUE)
you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)
A left outer join with df1 on the left using a subscript method would be:
    df1[, "State"] <- df2[match(df1[, "CustomerId"], df2[, "CustomerId"]), "State"]
The other combinations of outer joins can be created by munging the left outer join subscript example. (Yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
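One way to sketch a subscript-style left join in base R is with match(), which returns, for each key in df1, the position of the first row in df2 with the same key (or NA if there is none). The names idx, left_sub and left_merge are illustrative, and the approach assumes the keys in df2 are unique.

```r
# Data frames from the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Row position in df2 for each df1 key; NA where df2 has no such key
idx <- match(df1$CustomerId, df2$CustomerId)
idx
# NA  1 NA  2 NA  3

# Left join by subscripting: pull State from df2 in df1's row order
left_sub <- df1
left_sub$State <- df2$State[idx]

# The same result via merge(), for comparison
left_merge <- merge(df1, df2, by = "CustomerId", all.x = TRUE)
identical(as.character(left_sub$State), as.character(left_merge$State))
# TRUE
```

If df2 contained repeated keys, match() would pick only the first occurrence, whereas merge() would return one row per match.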
New in 2014:
Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing, etc.), you should definitely take a look at dplyr, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain database types. It even offers quite an elaborate SQL interface, including a function to translate (most) R code directly into SQL.
The four joining-related functions in the dplyr package are (to quote):
inner_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, and all columns from x and y
left_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x, and all columns from x and y
semi_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, keeping just columns from x.
anti_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are not matching values in y, keeping just columns from x
It's all here in great detail.
Selecting columns can be done with select(df, "column"). If that's not SQL-ish enough for you, then there's the sql() function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly, sql("SELECT * FROM hflights") will select all the columns from the "hflights" dplyr table (a "tbl").
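On plain data frames, a SQL-style SELECT with a WHERE clause can be approximated with dplyr's filter() and select(). The sketch below assumes dplyr is installed; the names joined and alabama are illustrative.

```r
library(dplyr)

# Data from the question, with both keys stored as numeric
df1 <- data.frame(CustomerId = as.numeric(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

joined <- inner_join(df1, df2, by = "CustomerId")

# Roughly: SELECT CustomerId, State FROM joined WHERE State = 'Alabama'
alabama <- joined %>%
  filter(State == "Alabama") %>%
  select(CustomerId, State)

alabama$CustomerId
# 2 4
```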