
## Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

### Question

I have the following 2 data.frames:

``````a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
``````

I want to find the rows that a1 has and a2 doesn't.

Is there a built-in function for this type of operation?

(P.S.: I did write a solution for it; I am simply curious whether someone has already written more polished code.)

Here is my solution:

``````a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2 <- function(a1, a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
``````
2010/07/03

### Accepted Answer

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package `compare`:

``````library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c
``````

The function `compare` gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

``````difference <-
  data.frame(lapply(1:ncol(a1), function(i) setdiff(a1[, i], comparison$tM[, i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e
``````
2016/07/14

In dplyr:

``````setdiff(a1,a2)
``````

Basically, `setdiff(bigFrame, smallFrame)` gets you the extra records in the first table.

In the SQLverse this is called a left excluding join (an anti join).

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the `setdiff()` code when using the OP's data:

``````> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e
``````

Or even `anti_join(a1,a2)` will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
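
As a hedged illustration of the two calls above, the sketch below runs both on the question's data. It assumes the `dplyr` package is installed; the `by` argument and the result names `extra_setdiff` and `extra_anti` are added here for illustration and are not part of the original answer.

```r
library(dplyr)

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Rows of a1 that do not appear in a2 (both frames need the same columns)
extra_setdiff <- dplyr::setdiff(a1, a2)

# Rows of a1 with no match in a2 on the key columns named in `by`
extra_anti <- dplyr::anti_join(a1, a2, by = c("a", "b"))

print(extra_setdiff)
print(extra_anti)
```

Because `anti_join()` only compares the columns listed in `by`, it can also be used when the two data frames carry different extra columns.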

2017/04/19

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)
``````

Missing values in `included_a1` mark the rows that are missing from `a1`; likewise, missing values in `included_a2` mark the rows that are missing from `a2`.
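
To make the last step concrete, here is a minimal sketch of how the indicator columns could be read back after the merge. It uses the question's data; the names `only_in_a1` and `only_in_a2` are hypothetical and chosen here for illustration.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Flag the origin of each row, then do a full outer merge on the shared columns
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all = TRUE)

# A row found only in a1 has NA in the indicator that came from a2, and vice versa
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
only_in_a2 <- res[is.na(res$included_a1), c("a", "b")]

print(only_in_a1)
```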

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get for free all the error checking that is necessary for a good solution.
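
The second problem can be reproduced with a small example. The data below is hypothetical and chosen so that two different rows collapse to the same string when the fields are pasted without a separator, as in the question's function. Pasting with a separator that is unlikely to occur in the data (here `"\r"`) avoids this particular collision, though it is only a sketch and not a full replacement for `merge`.

```r
# Hypothetical data, chosen so that two different rows collapse to the same key
x <- data.frame(a = c("1", "11"), b = c("11", "1"), stringsAsFactors = FALSE)
y <- data.frame(a = "11", b = "1", stringsAsFactors = FALSE)

# Keys built as in the question: fields are pasted with no separator
naive_x <- apply(x, 1, paste, collapse = "")
naive_y <- apply(y, 1, paste, collapse = "")
naive_match <- naive_x %in% naive_y

# Keys built with a separator between the fields
safe_x <- do.call(paste, c(unname(as.list(x)), sep = "\r"))
safe_y <- do.call(paste, c(unname(as.list(y)), sep = "\r"))
safe_match <- safe_x %in% safe_y

print(naive_match)  # both rows of x are reported as present in y
print(safe_match)   # only the second row of x is reported as present in y
```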

2010/07/03

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

``````> df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")

> df_compare$comparison_df
  row chng_type a b
1   4         + 4 d
2   5         + 5 e
``````

A more complicated example:

``````library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
  grp chng_type                id1 id2  hp cyl  qsec
1   1         -  Hornet Sportabout Dus 175   8 17.02
2   2         +         Datsun 710 Dat 181   4 33.00
3   2         -         Datsun 710 Dat  93   4 18.61
4   3         +         Duster 360 Dus 245   8 15.84
5   7         +          Merc 240D Mer  62   4 20.00
6   8         -            Valiant Val 105   6 20.22
``````

The package also has an html_output command for quick checking

`df_compare$html_output`

2016/03/08

You could use the `daff` package (which wraps the `daff.js` library using the `V8` package):

``````library(daff)

diff_data(data_ref = a2,
          data = a1)
``````

produces the following difference object:

``````Daff Comparison: ‘a2’ vs. ‘a1’
First 6 and last 6 patch lines:
@@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e
``````

The tabular diff format is described here and should be pretty self-explanatory. The lines with `+++` in the first column `@@` are the ones which are new in `a1` and not present in `a2`.

The difference object can be used to `patch_data()`, to store the difference for documentation purposes using `write_diff()` or to visualize the difference using `render_diff()`:

``````render_diff(
  diff_data(data_ref = a2,
            data = a1)
)
``````

generates a neat HTML output.

2020/07/24

I adapted the `merge` function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.

Another solution is to use the library `prob`.

``````#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x (assigned to res so that the code below can find it)
        res <- x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}

XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}
``````
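
The core of the function above is the key-matching step: paste the key columns of both data frames into strings, then use `match()` to find rows of `x` with no counterpart in `y`. A stripped-down sketch of just that idea, without the `by.x`/`by.y` and row-name handling, might look like the following. The helper name `rows_not_in` is hypothetical and not part of the answer's code.

```r
# Hypothetical helper: rows of x whose key columns have no match in y
rows_not_in <- function(x, y, by = intersect(names(x), names(y))) {
  # Stack the key columns of both frames so they are converted to text consistently
  key_cols <- as.list(rbind(x[, by, drop = FALSE], y[, by, drop = FALSE]))
  names(key_cols) <- NULL
  keys <- do.call(paste, c(key_cols, sep = "\r"))
  kx <- keys[seq_len(nrow(x))]
  ky <- keys[nrow(x) + seq_len(nrow(y))]
  # match() returns 0 for keys of x that do not occur among the keys of y
  x[match(kx, ky, 0L) == 0L, , drop = FALSE]
}

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
res <- rows_not_in(a1, a2)
print(res)
```
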
2018/05/24

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow