Using Pandas to create a conditional column by selecting multiple columns in two different dataframes
Sorry for the long title but I wanted to make sure that the problem statement is clearly represented in the title. I tried searching for a solution to this problem online on stackoverflow and other forums but couldn’t find exactly what I wanted. Therefore I thought I’d write a short blog post about it so that it can help me and anyone else who’d face such a task in the future.
I was recently doing some data transformations and faced a situation where I had to select multiple columns from two different dataframes and check for certain conditions in both of them and if the condition satisfied, I wanted to append one column from one dataframe to another. It may sound straightforward but trust me, it isn’t!
Let me try and explain the situation with an example.
Consider you have a shipping company. For every shipment you make, you track certain details about it. The data you collect might include many fields such as the shipping date, the id of the product being shipped, if the product was actually shipped or the consignment was cancelled etc. This can be represented in a dataframe in pandas as below:
Now let’s consider that your shipments are planned in batches (which are tracked using a batch id) and these batches have a given start date and a given end date. Let’s suppose that one product is delivered in one batch. This batch information has been stored in another table. In this case, the data for the batches can be represented in another dataframe as below:
Great! So now our task is to get batch ids from the second dataframe and append it as a new column to the first dataframe provided certain conditions are fulfilled. Here we want to append batch ids based on two conditions. If the shipping date lies in between the range of the start and end date of a batch and if the product ids in both the tables match, we say that this shipment belongs to the given batch.
To acheive this, we first read in the csv files containing data from two different tables. We store the data from the shipment details file as a dataframe called df and store batch details as batch_df dataframe. The code for the comparison is as shown below:
We make use of the apply function in pandas and pass a function as a parameter to it. This function, add_batch_id() in turn uses the apply function on the other dataframe and contains a lambda function inside it which checks for the conditions. The resultant dataframe is stored as batch_df. All none values are dropped and if there is no match the dataframe becomes empty. We use if-else statement to check if batch_df has become empty and if it is empty, we fill it with NaN values, if not we convert the dataframe to a string and return it back to be appended as a new column.
The resultant dataframe after applying the above function look like this:
Yay! It works! We’ve managed to map our batch ids with their respective shipments. I hope this helps you to solve your problem. I had to spend sometime before finding a solution and I hope this can save you some of your precious time!
I’ve uploaded the code used in this example on my github as well. You can find it here: https://github.com/imaadmkhan1/pandas_multi_column_selection/