How to replace a value in PySpark dataframe using tuples as dictionary keys

Question:

I have a dataframe with a column that records bank names, but different values can refer to the same bank. The data looks something like this:

+---+--------------------+
| id|                name|
+---+--------------------+
|  1|     BANCO SANTANDER|
|  2|           SANTANDER|
|  3|BANCO SANTANDER S.A.|
|  4|           JP MORGAN|
|  5|     JP MORGAN CHASE|
|  6|            CITIBANK|
|  7|                CITI|
|  8|           CITIGROUP|
|  9|       HSBC HOLDINGS|
| 10|                HBSC|
+---+--------------------+

Since a single bank can have one or more variants to replace, and I have an extensive list of institutions to correct, I created a dict to avoid writing a long chain of case-when statements. The dict looks like this:

bank_dict = {
  ('JP MORGAN CHASE',):'JP MORGAN',
  ('CITI', 'CITIGROUP'):'CITIBANK',
  ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'):'SANTANDER',
  ('HSBC HOLDINGS',):'HSBC'
}

What I need to do is check whether the current value matches any of the values in a dict key and, if so, replace it with the corresponding dict value. The expected result would be the following:

+---+--------------------+---------+
| id|                name| new_name|
+---+--------------------+---------+
|  1|     BANCO SANTANDER|SANTANDER|
|  2|           SANTANDER|SANTANDER|
|  3|BANCO SANTANDER S.A.|SANTANDER|
|  4|           JP MORGAN|JP MORGAN|
|  5|     JP MORGAN CHASE|JP MORGAN|
|  6|            CITIBANK| CITIBANK|
|  7|                CITI| CITIBANK|
|  8|           CITIGROUP| CITIBANK|
|  9|       HSBC HOLDINGS|     HSBC|
| 10|                HBSC|     HBSC|
+---+--------------------+---------+

What do I need to do to make this work?

Asked By: Marlon Iwanaga

||

Answers:

With your current setup and a slight modification to your dictionary, you could use the following (note that this operates on a pandas DataFrame, not a PySpark one):

bank_dict = {
  ('JP MORGAN CHASE',):'JP MORGAN',
  ('CITI','CITIGROUP'):'CITIBANK',
  ('BANCO SANTANDER','BANCO SANTANDER S.A.','SANTANDER CREDIT CARDS'):'SANTANDER',
  ('HSBC HOLDINGS',):'HSBC'
}


df['new_name'] = df.loc[:, 'name']

for bank_tuple in bank_dict:
    simple_bank_name = bank_dict[bank_tuple]
    for bank_name in bank_tuple:
        mask = df['new_name'] == bank_name
        df.loc[mask, 'new_name'] = simple_bank_name
        

Note that I added a trailing comma to the dictionary keys that had only one value; otherwise they would be stored as strings and not tuples.

where df is defined as:

import pandas as pd

df = pd.DataFrame(
    {
     'name': {0: 'BANCO SANTANDER',
              1: 'SANTANDER',
              2: 'BANCO SANTANDER S.A.',
              3: 'JP MORGAN',
              4: 'JP MORGAN CHASE',
              5: 'CITIBANK',
              6: 'CITI',
              7: 'CITIGROUP',
              8: 'HSBC HOLDINGS',
              9: 'HBSC'}})
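
The note above about the trailing comma matters because Python treats parentheses around a single string as plain grouping. The following plain-Python sketch (no pandas or Spark needed; the variable names are illustrative only) shows the difference, and why a missing comma would silently turn the inner loop into a loop over characters and a membership test into a substring test:

```python
# Without the trailing comma, the parentheses are just grouping: this is a str.
not_a_tuple = ('HSBC HOLDINGS')
# With the trailing comma, this is a one-element tuple.
one_tuple = ('HSBC HOLDINGS',)

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Iterating over the str yields single characters, not bank names.
first_from_str = next(iter(not_a_tuple))
first_from_tuple = next(iter(one_tuple))

# `in` on a str is a substring test; on a tuple it is an exact-element test.
substring_hit = 'HSBC' in not_a_tuple
exact_hit = 'HSBC' in one_tuple
print(first_from_str, first_from_tuple, substring_hit, exact_hit)
```

With a plain string as the key, a loop such as "for bank_name in bank_tuple" would compare the column against single letters and never match a full bank name.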
Answered By: Isaac Rene

You can use a UDF, which makes it simpler to apply the lookup to a PySpark DataFrame:

from pyspark.sql import types as T
from pyspark.sql.functions import udf

# replace the name with the value in the dict
def replace_name(name):
    for k, v in bank_dict.items():
        if name in k:
            return v
    return name

udf_replace_name = udf(replace_name, T.StringType())

df = df.withColumn('new_name', udf_replace_name('name'))
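
Because replace_name is an ordinary Python function, its logic can be checked locally before it is wrapped in a UDF. A minimal sketch, assuming the bank_dict from the question; the sample list names and the variable results are illustrative only:

```python
bank_dict = {
    ('JP MORGAN CHASE',): 'JP MORGAN',
    ('CITI', 'CITIGROUP'): 'CITIBANK',
    ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'): 'SANTANDER',
    ('HSBC HOLDINGS',): 'HSBC',
}

def replace_name(name):
    # Return the canonical name if `name` is listed in any tuple key.
    for variants, canonical in bank_dict.items():
        if name in variants:
            return canonical
    # Unknown names (including None) are returned unchanged.
    return name

names = ['BANCO SANTANDER', 'SANTANDER', 'JP MORGAN CHASE',
         'CITI', 'HSBC HOLDINGS', 'HBSC', None]
results = [replace_name(n) for n in names]
print(results)
```

Spark passes None to a Python UDF for null values, so the function has to tolerate it; here None matches no tuple and is returned as-is.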

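Alternatively, use a pandas_udf, which receives a whole pandas Series of names per batch and must return a Series of strings:

import pandas as pd
from pyspark.sql.functions import pandas_udf, col

@pandas_udf('string')
def replace_name(names: pd.Series) -> pd.Series:
    # map each name to its replacement; unmatched names pass through unchanged
    lookup = lambda name: next((v for k, v in bank_dict.items() if name in k), name)
    return names.map(lookup)

df = df.withColumn('new_name', replace_name(col('name')))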
Answered By: Zakaria Hamane

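The DataFrame class has a replace method that accepts a dictionary as an argument. But first you need to restructure your dictionary so that each variant is its own key. Note that this overwrites the values in the name column rather than adding a new_name column.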

d = {x:v for k, v in bank_dict.items() for x in k}
bank_dict = {**d, **{v:v for v in d.values()}}
df.replace(bank_dict, subset=['name'])
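
To make the restructuring step concrete, here is a plain-Python sketch of what the two comprehensions produce. It runs without Spark; the names flat and lookup are illustrative and not part of the code above.

```python
bank_dict = {
    ('JP MORGAN CHASE',): 'JP MORGAN',
    ('CITI', 'CITIGROUP'): 'CITIBANK',
    ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'): 'SANTANDER',
    ('HSBC HOLDINGS',): 'HSBC',
}

# Step 1: one entry per variant, each pointing at its canonical name.
flat = {variant: canonical
        for variants, canonical in bank_dict.items()
        for variant in variants}

# Step 2: add identity entries so canonical names map to themselves.
lookup = {**flat, **{canonical: canonical for canonical in flat.values()}}

for variant, canonical in lookup.items():
    print(f'{variant!r} -> {canonical!r}')
```

Every key is now a plain string rather than a tuple, so each variant can be matched against a column value directly. A value that is absent from the mapping, such as the misspelled HBSC in row 10, is left unchanged.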

Full example:

df = spark.createDataFrame(
    [(1, 'BANCO SANTANDER'),
     (2, 'SANTANDER'),
     (3, 'BANCO SANTANDER S.A.'),
     (4, 'JP MORGAN'),
     (5, 'JP MORGAN CHASE'),
     (6, 'CITIBANK'),
     (7, 'CITI'),
     (8, 'CITIGROUP'),
     (9, 'HSBC HOLDINGS'),
     (10, 'HBSC')],
    ['id', 'name'])
bank_dict = {
  ('JP MORGAN CHASE',):'JP MORGAN',
  ('CITI', 'CITIGROUP'):'CITIBANK',
  ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'):'SANTANDER',
  ('HSBC HOLDINGS',):'HSBC'
}

d = {x:v for k, v in bank_dict.items() for x in k}
bank_dict = {**d, **{v:v for v in d.values()}}
df.replace(bank_dict, subset=['name']).show()
# +---+---------+
# | id|     name|
# +---+---------+
# |  1|SANTANDER|
# |  2|SANTANDER|
# |  3|SANTANDER|
# |  4|JP MORGAN|
# |  5|JP MORGAN|
# |  6| CITIBANK|
# |  7| CITIBANK|
# |  8| CITIBANK|
# |  9|     HSBC|
# | 10|     HBSC|
# +---+---------+
Answered By: ZygD