Budget @HOME. Data Preparation
23 Nov 2020 | or@home, or, optimization, operations research, budget optimization, bank, data preparation

Not only Machine Learning but also Operations Research projects quite often start with a data preparation step. In this part, we explore our bank account data, prepare it, and deal with transaction classification.
This sunburst chart from the plotly library is a sneak preview of the visualization part of this post series. Our task for now is to prepare the data so that such visualizations and data explorations become possible through plotting libraries.
Transaction Data Preparation
As mentioned before, I will use randomly generated data. There is already a prepared data generator that draws transactions from a simple distribution:
transactions = generate_transaction_data()
Note: when I do this for myself, I of course use my real bank account data. Nothing stops you from doing the same, but the data preparation step can then vary significantly.
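If you go that route, the loading step depends entirely on your bank's export format. A minimal sketch, assuming a hypothetical CSV export ('bank_export.csv'); many European banks use ';' as the separator and ',' as the decimal mark:

import pandas as pd

# hypothetical filename and dialect -- check your bank's actual export format
transactions = pd.read_csv('bank_export.csv', sep=';', decimal=',')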
We can start with just getting a rough idea of what we are dealing with:
transactions.head(5)
| Beneficiary / Originator | Payment Details | Debit | Credit | Booking date | Currency |
|---|---|---|---|---|---|
| Edeka .. | Edeka | -55.46 | nan | 09/12/2019 | EUR |
| Rewe .. | Rewe | -0.14 | nan | 09/28/2019 | EUR |
| Edeka .. | Edeka | -0.43 | nan | 12/09/2019 | EUR |
| Lidl .. | Lidl | -5.3 | nan | 12/22/2019 | EUR |
| Edeka .. | Edeka | -16.38 | nan | 05/16/2019 | EUR |
Let’s now check the types.
transactions.dtypes
We see that “Booking date” is an object, not a date. We will need to change this.
Beneficiary / Originator object
Payment Details object
Debit float64
Credit float64
Booking date object
Currency object
dtype: object
Does Debit meet Credit?
transactions[['Debit','Credit']].sum()
Seems to be OK:
Debit -33908.58
Credit 39000.00
dtype: float64
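Since debits are negative and credits positive, summing both columns together gives the net amount saved over the whole period, a quick derived sanity check:

# 39000.00 - 33908.58 = 5091.42
net = transactions[['Debit', 'Credit']].sum().sum()
print(f'Net over the whole period: {net:.2f} EUR')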
So, the very first step is to treat the date column correctly. In addition, we may want to extract day, month, year, and weekday from it in order to conveniently aggregate the data later:
transactions['Booking date'] = pd.to_datetime(transactions['Booking date'])
transactions['day'] = transactions['Booking date'].dt.day
transactions['month'] = transactions['Booking date'].dt.month
transactions['year'] = transactions['Booking date'].dt.year
transactions['weekday'] = transactions['Booking date'].dt.weekday
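A small caveat: without an explicit format, pd.to_datetime has to guess between mm/dd and dd/mm orderings. Since the expected input format here is mm/dd/YYYY (see the last section), a safer variant of the first line above is to pass it explicitly:

# being explicit avoids any mm/dd vs dd/mm ambiguity
transactions['Booking date'] = pd.to_datetime(
    transactions['Booking date'], format='%m/%d/%Y')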
Now we can conveniently check expenses by month:
transactions.groupby(['year','month'])[['Debit','Credit']].sum()
| year | month | Debit | Credit |
|---|---|---|---|
| 2019 | 1 | -2494.86 | 3000 |
| 2019 | 2 | -2375.05 | 3000 |
| 2019 | 3 | -2882.58 | 3000 |
| 2019 | 4 | -2641.56 | 3000 |
| 2019 | 5 | -3202.15 | 3000 |
| 2019 | 6 | -2260.02 | 3000 |
| 2019 | 7 | -2481.98 | 3000 |
| 2019 | 8 | -2561.8 | 3000 |
| 2019 | 9 | -2578.52 | 3000 |
| 2019 | 10 | -3339.24 | 3000 |
| 2019 | 11 | -2520.05 | 3000 |
| 2019 | 12 | -3270.77 | 3000 |
| 2020 | 1 | -1300 | 3000 |
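The same grouping also gives the monthly net in one extra line, which will be handy later when setting budgets:

monthly = transactions.groupby(['year', 'month'])[['Debit', 'Credit']].sum()
monthly['net'] = monthly['Credit'] + monthly['Debit']  # Debit is negative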
Inferring transaction type
For further analysis, we need to differentiate transaction types, i.e. which expense is related to grocery shopping, which to leisure, and so on. In principle, one could train a classifier to solve this problem, but then we would need labeled data. The easiest way forward is a rule-based approach to classification. E.g. we know that if Payment Details contains names such as "Lidl", "Edeka" or "Rewe" (typical supermarkets in Germany), this is most likely grocery shopping. If you find "Booking.com" or "Lufthansa", this has something to do with traveling. You can always adapt this rule-based approach to your specific data.
To keep it simple, let’s focus on just 6 transaction types:
types = ['grocery', 'fashion', 'shopping', 'travel', 'rent', 'unknown']
And let's say a transaction is of type "grocery" whenever it mentions one of the shops below:
types_mapping = {}
types_mapping['grocery'] = [
"lidl",
"rewe",
"edeka",
"aldi",
]
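The other types are filled in the same way. "Booking.com" and "Lufthansa" were mentioned above as travel keywords; the remaining entries below are illustrative assumptions to adapt to your own data:

types_mapping['travel'] = ['booking.com', 'lufthansa']
# hypothetical keywords -- replace with what actually appears in your statements
types_mapping['fashion'] = ['zalando']
types_mapping['shopping'] = ['amazon']
types_mapping['rent'] = ['rent']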
Now we need to go row by row and see which type we should assign to the observed transaction. If we have no idea, we assign the "unknown" value. I decided to do it with the help of pandas apply and the simple function below:
from typing import Dict, List

def assign_type(row, types_mapping: Dict[str, List[str]], info_columns: List[str]):
    matching = []
    for c in info_columns:
        if not pd.isna(row[c]):
            # collect every type whose keywords occur in this column's value;
            # lowercasing makes the match case-insensitive ("Edeka" vs "edeka")
            matching.extend([s for s in types_mapping
                             if any(xs in row[c].lower() for xs in types_mapping[s])])
    matching.append('unknown')  # fallback in case nothing matched
    return matching[0]
Here, we've noticed in the data that information about the shop can appear either in "Payment Details" or in the "Beneficiary / Originator" column. That's why we look through several columns, given in the info_columns list. If a column's value is not NaN, we check whether any of the type-specific keywords (e.g. the shop name "Lidl") is contained in it. If yes, we add this class to matching. There can potentially be several matches; that's why we collect all of them and return only the first entry (a simple conflict resolution). And we always append the "unknown" type as a fallback in case no match was found.
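As a quick check on a hypothetical row (None standing in for a missing Payment Details value):

row = pd.Series({'Beneficiary / Originator': 'Lidl Filiale 123',
                 'Payment Details': None})
assign_type(row, types_mapping, ['Beneficiary / Originator', 'Payment Details'])
# -> 'grocery'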
What's left is to prepare a partial function (because apply expects a one-argument function) and create a new column in the transactions dataframe:
from functools import partial

f = partial(assign_type,
            types_mapping=types_mapping,
            info_columns=["Beneficiary / Originator", "Payment Details"])
transactions['type'] = transactions.apply(f, axis=1)
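As a side note, row-wise apply is perfectly fine for a few hundred transactions, but for larger exports a vectorized variant is faster. A sketch, not part of the original pipeline:

import re

combined = (transactions['Beneficiary / Originator'].fillna('') + ' '
            + transactions['Payment Details'].fillna('')).str.lower()
transactions['type'] = 'unknown'
for t, keywords in types_mapping.items():
    pattern = '|'.join(re.escape(k) for k in keywords)  # escapes the '.' in 'booking.com'
    mask = combined.str.contains(pattern)
    # first match in dict order wins, mirroring the conflict resolution above
    transactions.loc[mask & transactions['type'].eq('unknown'), 'type'] = t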
Done. Our dataframe now has the correct date data type, separate columns for day, month, year and weekday, and a transaction type, which should help us later:
| Beneficiary / Originator | Payment Details | Debit | Credit | Booking date | Currency | day | month | year | weekday | type |
|---|---|---|---|---|---|---|---|---|---|---|
| Edeka .. | Edeka | -55.46 | nan | 2019-09-12 00:00:00 | EUR | 12 | 9 | 2019 | 3 | grocery |
| Rewe .. | Rewe | -0.14 | nan | 2019-09-28 00:00:00 | EUR | 28 | 9 | 2019 | 5 | grocery |
| Edeka .. | Edeka | -0.43 | nan | 2019-12-09 00:00:00 | EUR | 9 | 12 | 2019 | 0 | grocery |
| Lidl .. | Lidl | -5.3 | nan | 2019-12-22 00:00:00 | EUR | 22 | 12 | 2019 | 6 | grocery |
| Edeka .. | Edeka | -16.38 | nan | 2019-05-16 00:00:00 | EUR | 16 | 5 | 2019 | 3 | grocery |
Adding new mappings
When dealing with your own data, you can always refine the rules. Let's check how many transactions are classified as unknown:
transactions.groupby('type').size()
We see 40 unknown transactions:
| type | count |
|---|---|
| fashion | 22 |
| grocery | 291 |
| rent | 13 |
| shopping | 43 |
| travel | 4 |
| unknown | 40 |
If we look at those, we can see that I indeed forgot several frequent transaction classes:
transactions[transactions['type']=='unknown']
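A quick way to spot missing keywords is to list the distinct counterparties among the unknown rows:

transactions.loc[transactions['type'] == 'unknown',
                 'Beneficiary / Originator'].unique()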
One of the missing shops in the grocery category is "ROSSMANN":
types_mapping['grocery'].append('rossmann')  # append modifies the list in place
I also forgot about salary:
types_mapping['salary'] = ["salary"]
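Rerunning simply means re-applying the same partial function and recounting:

transactions['type'] = transactions.apply(f, axis=1)
transactions.groupby('type').size()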
After the rerun, the "unknown" class is gone:
| type | count |
|---|---|
| fashion | 22 |
| grocery | 331 |
| rent | 13 |
| salary | 13 |
| shopping | 43 |
| travel | 4 |
I wouldn't expect this rule-based approach to identify all transaction types; it remains fairly manual work. But for your own data, you can reach decent coverage within a few iterations.
How to use your own data?
If you want to use your own data, you need to transform it into the expected format with the columns below (a rename sketch follows the table):
| Column | Type | Description |
|---|---|---|
| Beneficiary / Originator | str | originator or beneficiary information |
| Payment Details | str | payment information |
| Debit | float | expenses |
| Credit | float | incomes, e.g. salary |
| Booking date | str | in the format mm/dd/YYYY |
| Currency | str | optional |
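As a sketch, assuming your bank export uses different column names (the left-hand names below are hypothetical), a simple rename gets you to the expected schema:

# left: hypothetical names from your bank's export; right: expected names
column_map = {
    'counterparty': 'Beneficiary / Originator',
    'description': 'Payment Details',
    'amount_out': 'Debit',
    'amount_in': 'Credit',
    'date': 'Booking date',
    'currency': 'Currency',
}
my_transactions = my_export.rename(columns=column_map)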
Code
This code can be found in the repository as a notebook.