We use a join when we want to blend fields from one data source with another to create a new dataset with fields from both original data sources side by side. We can create joins when there is a relationship between one or more fields in both data sources. In practice, this means that at least one field in the first data source contains values that also appear in a field in the second source. For Excel users, a join serves a similar purpose to a lookup function but is far more expansive and flexible.
Let’s look at an example. In the two data sources below (order data and customer data) we can see that the Customer ID field appears in both sets of data and that they contain values that appear in both sources such as 4002. Therefore there is a relationship between the two sources through the “Customer_ID” fields that we can call the relational fields. In this case, joining the two data sources on the Customer ID field would result in matches between four out of the five records in each source.
Order_No | Order_Date | Customer_ID | ProductID |
---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 |
14716 | 01/04/17 | 3979 | 218178 |
23441 | 21/11/17 | 3382 | 218178 |
88155 | 19/11/17 | 3552 | 355904 |
42078 | 10/05/17 | 1654 | 227774 |
Customer_ID | Age | Gender | Location |
---|---|---|---|
4002 | 60 | M | Non-US |
1681 | 37 | M | US |
3382 | 20 | F | US |
3552 | 40 | F | Non-US |
3979 | 52 | F | US |
We now know that we can join these two sources to create a new data set but the records contained in this new data will depend on the type of join we use. Let’s look at join types now.
An inner join returns a data set that includes ONLY records where the values in the relational field(s) match. In the example below we have used an inner join to blend our order and customer data sources. As we saw above, there are four records from each source with matching customer IDs and therefore an inner join between the two data sources creates the following data set:
Order_No | Order_Date | Customer_ID | ProductID | Customer_ID | Age | Gender | Location |
---|---|---|---|---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 | 4002 | 60 | M | Non-US |
23441 | 21/11/17 | 3382 | 218178 | 3382 | 20 | F | US |
88155 | 19/11/17 | 3552 | 355904 | 3552 | 40 | F | Non-US |
14716 | 01/04/17 | 3979 | 218178 | 3979 | 52 | F | US |
Outer joins have three variation; left, right and outer.
A full outer join creates a data set that contains all records from both data sources but where there is aisn’t a matching value in the relational field(s) in either of the then fields from this record will return as Nulls. As we can see below customer_ID 1681 only appears in table 2 and as such table 1 fields for this record are Null. The same is true for customer_ID 1654 in table 1.
Order_No | Order_Date | Customer_ID | ProductID | Customer_ID | Age | Gender | Location |
---|---|---|---|---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 | 4002 | 60 | M | Non-US |
23441 | 21/11/17 | 3382 | 218178 | 3382 | 20 | F | US |
88155 | 19/11/17 | 3552 | 355904 | 3552 | 40 | F | Non-US |
14716 | 01/04/17 | 3979 | 218178 | 3979 | 52 | F | US |
42078 | 10/05/17 | 1654 | 227774 | NULL | NULL | NULL | NULL |
NULL | NULL | NULL | NULL | 1681 | 37 | M | US |
If we take the order data source as the left and join this to the customer data source as the right data source then performing a left outer join means that we take all the records from order data regardless of whether there is a matching record in the customer data source. Any records that appear in the order data without a match in the customer data will appear with NULLs for the customer data fields in the joined data set. In the below example customer_ID 1654 doesn’t appear in the customer data, therefore performing a left outer join creates the following table with NULLs in the customer data fields for the records that includes the non-matching value in the relational field.
Order_No | Order_Date | Customer_ID | ProductID | Customer_ID | Age | Gender | Location |
---|---|---|---|---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 | 4002 | 60 | M | Non-US |
23441 | 21/11/17 | 3382 | 218178 | 3382 | 20 | F | US |
88155 | 19/11/17 | 3552 | 355904 | 3552 | 40 | F | Non-US |
14716 | 01/04/17 | 3979 | 218178 | 3979 | 52 | F | US |
42078 | 10/05/17 | 1654 | 227774 | NULL | NULL | NULL | NULL |
A right outer join does the same as the left outer join but with the logic reversed in that all records from the right data sources are retained and non-matching records from the left table are discarded as shown below.
Order_No | Order_Date | Customer_ID | ProductID | Customer_ID | Age | Gender | Location |
---|---|---|---|---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 | 4002 | 60 | M | Non-US |
23441 | 21/11/17 | 3382 | 218178 | 3382 | 20 | F | US |
88155 | 19/11/17 | 3552 | 355904 | 3552 | 40 | F | Non-US |
14716 | 01/04/17 | 3979 | 218178 | 3979 | 52 | F | US |
NULL | NULL | NULL | NULL | 1681 | 37 | M | US |
The joins we have seen so far create new data sets by joining row-wise (side-by-side). A union joins the data column-wise and basically appends any rows from one data set to another where there are common fields. In our example, if we performed a union the only common field is Customer_ID and we would create the following data set.
Order_No | Order_Date | Customer_ID | ProductID | Age | Gender | Location |
---|---|---|---|---|---|---|
01259 | 22/11/17 | 4002 | 376401 | NULL | NULL | NULL |
14716 | 01/04/17 | 3979 | 218178 | NULL | NULL | NULL |
23441 | 21/11/17 | 3382 | 218178 | NULL | NULL | NULL |
88155 | 19/11/17 | 3552 | 355904 | NULL | NULL | NULL |
42078 | 10/05/17 | 1654 | 227774 | NULL | NULL | NULL |
NULL | NULL | 4002 | NULL | 60 | M | Non-US |
NULL | NULL | 1681 | NULL | 37 | M | US |
NULL | NULL | 3382 | NULL | 20 | F | US |
NULL | NULL | 3552 | NULL | 40 | F | Non-US |
NULL | NULL | 3979 | NULL | 52 | F | US |