To add new columns to dataframes with Pandas we have a couple of options depending on how simple or complex the calculations are for the new columns.
The simple method involves us declaring the new column name and the value or calculation to use. This can be a constant of any pandas data type or a value calculated from other columns in the data frame.
For more complex column creation such as creating columns using functions, we can use the apply operation.
When we use the apply function and the axis=1 parameter we effectively pass each row of a DataFrame into the function that we declare in the parameters. If we build a custom function then we can use any combination of existing columns to create a new column and the logic inside the function can be as complex as required.
As an extension to the apply method we can also use Python’s lambda operation in place of a regular function as we can in any Python script. This is especially useful if the logic in our function only requires the use of 1 existing column to calculate value in our new column.
In our data frame we have information about what was ordered and about the different costs and discounts associated with each order and product but a lot of the key financial and operational metrics are missing such as:
Let’s create some new columns that calculate these metrics.
We have the discount rate but we don’t know what that is in monetary terms. Let’s create a new column called Discount_Value that gives us this:
Now that we know the discount value that was applied to each order, we can calculate the gross sale value for each product that was ordered. This is the money the business received from the customer and the calculation needed here is simply the retail price minus the discount value.
This gives us some important sales information but doesn’t give us the full picture as we also need to know the net sales value which is the sales value once returns have been accounted for. In other words if a customer has kept the product then the net sales value is equal to the gross sales value but if the product was returned then the net sales value is zero as the business would have refunded the sale value back to the customer.
Whenever a customer kept a product the reason column contains a “Not Returned” string.
So to get the net sales value we are going to have to define a function that returns a zero when the reason columns is equal to “Not Returned” but otherwise returns the gross sales value.
Finally we use the apply operation to pass each row of our data frame to the function for it derive and then return the net sales value.
We can use similar logic to the function we created to calculate net sales value to create a column that tells us the number of net units units sold (the number of units after returns have been subtracted). As we only have one product for each row this will either be a 1 or 0 but having this column will allow us to sum the net number of products sold when we aggregate our data. Let’s use lambda to create this column:
For our final final few columns we need to know some information about how much profit (sale value after costs to the business has been calculated) was made for each product the business sold and how many products were returned.
Note: We need to multiply by net units sold to make sure that we see zero profit for returned products. We could have equally used the apply method and a function to more explicitly calculate this.