Just when I thought I could clean any column using the user interface in Power Query, I was stumped when needing to clean a list of 10-20K+ bank merchant name ID’s and turn them into readable business names.
Note: This blog ends with a sample file that walks through adding a find replace list, and running a function over a column of sample merchant IDs.
I realized that I not only had to identify and replace key words from the column, but also had to remove excess/residual data surrounding the original record. Essentially I am replacing the entire record based on key words or partial strings.
To tackle this, I took a years worth of monthly bank merchant records, where the business name is identified by name or abbreviation, and worked through these challenges:
1. I need to identify multiple strings that represent the same business name, and convert them to the common business names.
2. When the replacement is made, I need the column to ignore the various surrounding characters… payment type, City, or store number.
3. I need a method that allows me to easily manage a list of core business name changes in a two column find and replace table.
If you deal with hundreds or thousands of Amazon charges from a bank statement with various descriptors, I am sure you know exactly what I’m talking about.
To illustrate the issue, lets take the convenience store 7-Eleven for example. Based on the franchise/location, the merchant ID could use various spellings of the same name: 7/11, 7-ELEVEN, or 7ELEVENUSA. Now add various leading transaction codes, and trailing store numbers, and this becomes more complex to transform.
How would we efficiently adjust each instance to simply read 7-ELEVEN in every row?
To start, I experimented with single Replace Values to change each spelling variation to read “7-ELEVEN”. But what about the surrounding codes and numbers? Was it any better to have a field that now read 12347-ELEVEN#1112?
I searched the web for “Change multiple values in a single power query column”, and landed on this blog from John MacDougall from howtoexcel.org that was just the ticket. John used a Power Query function that iterates through a two column cross reference table. One by one, the function iterates through the different possible find and replace strings, row by row.
You establish a Find and Replace table that looks something like this:
This solved my first and third issue; but I still needed to address how to handle the surrounding characters in the string that were no longer needed.
In order to ignore the surrounding store numbers and transaction codes, you’ll notice above that I added the carrot (^) characters surrounding my ‘Replace’ value column. Using this as a delimiter I am able to isolate the key descriptor of the business, and establish division between all the data that surrounds my new replaced business name. Next I simply split ‘by each occurrence’ of my custom delimiter the ^ character, and am left with three columns.
All worked well for the the merchant codes that needed replacement, however I needed to now adjust for the merchant names that did not need to be changed.
To address the no-change merchant names, I needed a custom column that would simply check for null in the center (no split results) and replace that with the original text found in the first column.
With the above function and a simple delimiter, I was able to find and replace over 10,000-20,000 merchant ID’s with the appropriate business name.
John was kind enough to warn me that I might run into performance issues on larger datasets and to split columns when at all possible. I couldn’t solve this problem with split columns, but I do suggest for large data sets that if possible, you consider establishing a table merge to eliminate the need for the function to iterate over every cell in the column.