Introduction
SQL joins are vital to relational database operations, which allow users to combine data from multiple tables based on shared keys.
In R, the dplyr
package provides an intuitive syntax for performing these operations.
Connect to the Database
In this example, Iām using:
- AdventureWorks2017
- SQL Server 2019
Code
# Connect to the Database
con <- DBI::dbConnect(odbc::odbc(), "AdventureWorks", timeout = 15)
Data
Sales Detail
Code
sales_detail <- dbGetQuery(con,
"SELECT *
FROM Sales.SalesOrderDetail"
)
Joins
Inner Join
Code
sales_header |>
inner_join(sales_detail,
by = "SalesOrderID") |>
dim()
Left Join
Code
sales_header |>
left_join(sales_detail,
by = "SalesOrderID") |>
dim()
Right Join
Code
sales_header |>
right_join(sales_detail,
by = "SalesOrderID") |>
dim()
Code
sales_header |>
full_join(sales_detail, by = "SalesOrderID") |>
dim()
Full Join
Code
sales_header |>
full_join(sales_detail,
by = "SalesOrderID") |>
dim()
Cross Join
Code
# Not Executed
sales_header |>
mutate(dummy = 1) |>
full_join(sales_detail |> mutate(dummy = 1), by = "dummy") |>
select(-dummy) |> dim()
This full_join
generated 3817239405 rows, which is 1.67 billion rows more than R
can handle.
$3817239405-2147483647 = 1669755758$
Semi Join
Code
sales_header |>
semi_join(sales_detail,
by = "SalesOrderID")|>
dim()
Anti Join
Code
# Rows in header not in detail
sales_header |>
anti_join(sales_detail,
by = "SalesOrderID") |>
nrow()
Code
# Rows in detail not in header
sales_detail |>
anti_join(sales_header,
by = "SalesOrderID") |>
nrow()
Disconnect DB
Code
dbDisconnect(con) # Disconnect db