Operations Research and Machine Learning

Budget @HOME. Data Preparation




Not only Machine Learning but also Operations Research projects quite often start with a data preparation step. In this part, we explore our bank account data, prepare it, and deal with transaction classification.

This sunburst chart from the plotly library is a sneak preview of the visualization part of this post series. Our task for now is to prepare the data, so that such visualizations and data explorations through plotting libraries become possible.

Transaction Data Preparation

As mentioned before, I will use randomly generated data. There is already a prepared data generator that draws transactions from a simple distribution:

transactions = generate_transaction_data()
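
The actual generator lives in the repository; as a rough idea, it could look something like the sketch below (shop names, the amount distribution, and the date range are illustrative assumptions, not the repository implementation):

import numpy as np
import pandas as pd

def generate_transaction_data(n: int = 400, seed: int = 42) -> pd.DataFrame:
    # Illustrative sketch only; the repository version may differ.
    rng = np.random.default_rng(seed)
    shops = ["Lidl", "Rewe", "Edeka", "Aldi"]
    names = rng.choice(shops, size=n)
    # spread the transactions over 2019
    dates = pd.Timestamp("2019-01-01") + pd.to_timedelta(
        rng.integers(0, 365, size=n), unit="D")
    return pd.DataFrame({
        "Beneficiary / Originator": [f"{s} .." for s in names],
        "Payment Details": names,
        "Debit": -rng.exponential(scale=20.0, size=n).round(2),
        "Credit": np.nan,
        "Booking date": dates.strftime("%m/%d/%Y"),
        "Currency": "EUR",
    })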

Note: when I do this for myself, I of course use my real bank account data. Nothing stops you from doing the same, but the data preparation step may then vary significantly.

We can start with just getting a rough idea of what we are dealing with:

transactions.head(5)
Beneficiary / Originator   Payment Details    Debit   Credit   Booking date   Currency
Edeka ..                   Edeka             -55.46      nan     09/12/2019        EUR
Rewe ..                    Rewe               -0.14      nan     09/28/2019        EUR
Edeka ..                   Edeka              -0.43      nan     12/09/2019        EUR
Lidl ..                    Lidl               -5.30      nan     12/22/2019        EUR
Edeka ..                   Edeka             -16.38      nan     05/16/2019        EUR

Let’s now check the types.

transactions.dtypes

We see that “Booking date” is an object, not a date. We will need to change this.

Beneficiary / Originator     object
Payment Details              object
Debit                       float64
Credit                      float64
Booking date                 object
Currency                     object
dtype: object

Does Debit meet Credit, i.e. does income cover expenses?

transactions[['Debit','Credit']].sum()

Seems to be ok:

Debit    -33908.58
Credit    39000.00
dtype: float64

So the very first step is to treat the date column correctly. In addition, we may want to extract day, month, year, and weekday from it in order to conveniently aggregate data later:

transactions['Booking date'] = pd.to_datetime(transactions['Booking date'])
transactions['day'] = transactions['Booking date'].dt.day
transactions['month'] = transactions['Booking date'].dt.month
transactions['year'] = transactions['Booking date'].dt.year
transactions['weekday'] = transactions['Booking date'].dt.weekday
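
Since the raw dates follow the mm/dd/YYYY pattern, it can be a bit safer to pass the format explicitly, so pandas does not have to guess between day-first and month-first (an alternative to the first line above):

transactions['Booking date'] = pd.to_datetime(
    transactions['Booking date'], format='%m/%d/%Y')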

Now we can conveniently check expenses by month:

transactions.groupby(['year','month'])[['Debit','Credit']].sum()
year   month      Debit   Credit
2019       1   -2494.86     3000
2019       2   -2375.05     3000
2019       3   -2882.58     3000
2019       4   -2641.56     3000
2019       5   -3202.15     3000
2019       6   -2260.02     3000
2019       7   -2481.98     3000
2019       8   -2561.80     3000
2019       9   -2578.52     3000
2019      10   -3339.24     3000
2019      11   -2520.05     3000
2019      12   -3270.77     3000
2020       1   -1300.00     3000
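
Since Debit is negative and Credit positive, the monthly net balance is simply their sum; a quick sanity check on top of the table above could look like this:

monthly = transactions.groupby(['year', 'month'])[['Debit', 'Credit']].sum()
monthly['net'] = monthly['Debit'] + monthly['Credit']  # positive means savings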

Inferring transaction type

For further analysis, we need to differentiate transaction types, i.e. which expense is related to grocery shopping, which to leisure, and so on. In principle, one could train a classifier for this problem, but that would require labeled data. The easiest way forward is a rule-based approach to classification. E.g. we know that if Payment Details contains names such as “Lidl”, “Edeka”, or “Rewe” (typical supermarkets in Germany), this is most likely grocery shopping. If you find “Booking.com” or “Lufthansa”, it has something to do with traveling. You can always adapt this rule-based approach to your specific data.

To keep it simple, let’s focus on just 6 transaction types:

types = ['grocery', 'fashion', 'shopping', 'travel', 'rent', 'unknown']

And let’s say a transaction is of type “grocery” whenever it mentions one of the shops below:

types_mapping = {}
types_mapping['grocery'] = [
    "lidl",
    "rewe",
    "edeka",
    "aldi",
]
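
The remaining types can be filled in the same way. Booking.com and Lufthansa were mentioned above as travel keywords; the rest below are placeholders you would adapt to your own statements:

types_mapping['travel'] = ["booking.com", "lufthansa"]
types_mapping['fashion'] = ["zalando"]   # placeholder keyword
types_mapping['shopping'] = ["amazon"]   # placeholder keyword
types_mapping['rent'] = ["rent"]         # placeholder keyword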

Now we need to go row by row and see which type we should assign to the observed transaction. If we have no idea, we assign the “unknown” value. I decided to do it with the help of pandas apply and the simple function below:

from typing import Dict, List

def assign_type(row, types_mapping: Dict[str, List[str]], info_columns: List[str]):
    matching = []

    for c in info_columns:
        if not pd.isna(row[c]):
            # lowercase the cell so that keywords like "lidl" also match "Lidl .."
            value = row[c].lower()
            matching.extend([s for s in types_mapping
                if any(xs in value for xs in types_mapping[s])])
    # fallback class in case no keyword matched
    matching.append('unknown')
    return matching[0]

Here, we’ve noticed in the data that information about the shop can appear in either the “Payment Details” or the “Beneficiary / Originator” column. That’s why we look at several columns, contained in the “info_columns” list. If the value in a column is not nan, we check whether any of the type-specific keywords (e.g. the “Lidl” shop name) is contained in the column value. If yes, we add this class to the matching list. Potentially, there can be several matches for some transactions; that’s why we collect all of them and return only the first entry (simple conflict resolution). And we always append the “unknown” type in case no match was found.
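
As a toy check of this logic, a plain dict can stand in for a dataframe row:

sample = {"Beneficiary / Originator": "Lidl ..", "Payment Details": "Lidl"}
assign_type(sample, types_mapping,
            ["Beneficiary / Originator", "Payment Details"])  # -> 'grocery'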

What’s left is to prepare a partial (because apply expects a one-argument function) and create a new column in the transactions dataframe:

from functools import partial

f = partial(assign_type, types_mapping=types_mapping,
            info_columns=["Beneficiary / Originator", "Payment Details"])
transactions['type'] = transactions.apply(f, axis=1)
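
As a side note, row-wise apply can get slow on large dataframes. A vectorized variant of the same idea, joining the keywords of each type into a single regex for str.contains, could look like this (a sketch, not the approach used in the repository):

import re

combined = (transactions["Beneficiary / Originator"].fillna("")
            + " " + transactions["Payment Details"].fillna("")).str.lower()

transactions['type'] = 'unknown'
for t, keywords in types_mapping.items():
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = combined.str.contains(pattern) & (transactions['type'] == 'unknown')
    transactions.loc[mask, 'type'] = t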

Done. Now our dataframe contains the correct date data type, separate columns for day, month, year, and weekday, and a transaction type, which will help us later:

Beneficiary / Originator   Payment Details    Debit   Credit   Booking date   Currency   day   month   year   weekday   type
Edeka ..                   Edeka             -55.46      nan     2019-09-12        EUR    12       9   2019         3   grocery
Rewe ..                    Rewe               -0.14      nan     2019-09-28        EUR    28       9   2019         5   grocery
Edeka ..                   Edeka              -0.43      nan     2019-12-09        EUR     9      12   2019         0   grocery
Lidl ..                    Lidl               -5.30      nan     2019-12-22        EUR    22      12   2019         6   grocery
Edeka ..                   Edeka             -16.38      nan     2019-05-16        EUR    16       5   2019         3   grocery

Adding new mappings

When dealing with your own data, you can always refine the rules. Let’s check how many transactions end up as unknown:

transactions.groupby('type').size()

We see 40 unknown transactions:

type
fashion      22
grocery     291
rent         13
shopping     43
travel        4
unknown      40
dtype: int64

If we look at those, we can see that I indeed forgot several frequent transaction classes:

transactions[transactions['type']=='unknown']
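
A quick way to see which merchants dominate the unknowns is value_counts:

transactions.loc[transactions['type'] == 'unknown',
                 'Beneficiary / Originator'].value_counts().head(10)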

One of the missing shops in the grocery category is “ROSSMANN”:

types_mapping['grocery'].append('rossmann')  # append mutates in place; assigning its (None) return value would wipe the list

I also forgot about salary:

types_mapping['salary'] = ["salary"]
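
Since partial keeps a reference to types_mapping, the updated rules are picked up automatically; we only need to reapply the function and recount:

transactions['type'] = transactions.apply(f, axis=1)
transactions.groupby('type').size()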

After the rerun, we see that the “unknown” class is gone:

type
fashion      22
grocery     331
rent         13
salary       13
shopping     43
travel        4
dtype: int64

I wouldn’t expect this rule-based approach to identify all transaction types, as it involves quite a bit of manual work, but for your data you can achieve decent coverage within a few iterations.

How to use your own data?

If you want to use your own data, you need to transform it to the expected format with these columns:

Beneficiary / Originator     string with originator or beneficiary information
Payment Details              string with payment information
Debit                        float, expenses (negative values)
Credit                       float, incomes, e.g. salary
Booking date                 str, in the format mm/dd/YYYY
Currency                     str, optional
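
As a rough sketch, adapting a CSV export can be as simple as renaming columns; the file name and source column names below are hypothetical, adjust them to your bank’s export:

raw = pd.read_csv("my_bank_export.csv")   # hypothetical file name
transactions = raw.rename(columns={
    "Counterparty": "Beneficiary / Originator",   # hypothetical source names
    "Description": "Payment Details",
    "Amount out": "Debit",
    "Amount in": "Credit",
    "Date": "Booking date",
})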

Code

This code can be found in the repository as a notebook.

The whole series of posts

  1. Budget optimization intro
  2. Data preparation
  3. Data visualization
  4. Optimization model
  5. Make it work