Join pandas dataframe by strg-pattern

Question:

I have a question about a merge of two pd.dataframes based on a strg-pattern in a column. There are some very helpful discussions on stackoverlow and I found an approach (Merge two dataframe if one string column is contained in another column in Pandas) that fits to my requirements very well.

This approach works perfect in my MWE.

# Target-df  
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

# print(df)
Company Certificate+Number Substance
MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN R1-CEP 2012-025 – Rev 02 Suxamethonium Chloride
SIEGFRIED LTD. Zofingen CH R2-CEP 1996-036 – Rev 02 Amitriptyline hydrochloride
SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN R0-CEP 2008-165 – Rev 00 Oxytetracycline hydrochloride
CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ R1-CEP 2002-193 – Rev 00 Ephedrine hydrochloride

Second, I have a huge df with information on cities, countries, country-codes etc. First, as a minimal example:

world_cities_min = pd.DataFrame({'Geoname ID':[1275339,
                                 '2657915',
                                 '1785286',
                                 '3061344', 
                                 ], 
                                  'City':['Mumbai',
                                          'Zofingen',
                                          'Zibo',
                                          'Zibo',
                                           ],
                                  'ASCII Name':['Mumbai',
                                                'Zofingen',
                                                'Zibo',
                                                'City', 
                                                 ], 
                                  'Country':['India',
                                             'Switzerland',
                                             'China',
                                             'Czech Republic', 
                                            ],
                                  'Alpha2':['IN',
                                            'CH',
                                            'CN',
                                            'CZ', 
                                            ], 
                               })
    
#print(world_cities_min.head(5))
Geoname ID City ASCII Name Country Alpha2
1275339 Mumbai Mumbai India IN
2657915 Zofingen Zofingen Switzerland CH
1785286 Zibo Zibo China CH
3061344 Zibo City Czech Republic CZ

Extract pattern to find city-names (according to the approach from source Merge two dataframe if one string column is contained in another column in Pandas

pat = '|'.join(r"b{}b".format(x) for x in world_cities_min['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

However, when I use the complete df of worldcities, I get the following error: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

# Once again, the original target-df  
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

Loading the complete df

url = 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/csv?lang=en&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B'


column_names = ['Geoname ID',
                'Name', 
                'ASCII Name',   
                'Alternate Names',
                'Feature Class',
                'Feature Code',
                'Country Code',
                'Country name EN',  
                'Country Code 2'    ,
                'Admin1 Code'   ,
                'Admin2 Code'   ,
                'Admin3 Code',  
                'Admin4 Code',  
                'Population',
                'Elevation',    
                'DIgital Elevation Model',  
                'Timezone', 
                'Modification date',    
                'LABEL EN', 
                'Coordinates'
                 ]
    
world_cities  = pd.read_csv(url,
                        header=1,
                        sep=';',
                          names=column_names,
                          usecols = [
                                    'Name', 
                                    'ASCII Name',   
                                    'Country Code'  ,
                                    'Country name EN',  
                                    'Coordinates'],
                            converters={
                                        },
                          )

… doing the same thing:

pat = '|'.join(r"b{}b".format(x) for x in world_cities_min['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

Leads to: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

May I ask you to help me troubleshoot? Where is the issue in the complete df, and how can I deal with it? My overall goal is to keep the City, Country Name and Alpha2 code as separate columns. Unfortunately, the information is present in df['Company'] without a unique str-pattern

Thank you very much for any advice.

Asked By: Paul G.

||

Answers:

Cause of the error

The larger world_cities dataframe contains characters in some city names that have a special meaning in regular expressions. For instance, some of these names contain parentheses (), which has a special meaning and are used to denote capturing groups. Have a look at the following ASCII names which are captured from the world_cities

                  Name      ASCII Name Country Code Country name EN         Coordinates
84328      Hamm (Sieg)     Hamm (Sieg)           DE         Germany   50.76531, 7.67761
63174    Obolo-Eke (1)   Obolo-Eke (1)           NG         Nigeria    6.88333, 7.63333
50291    Halle (Saale)   Halle (Saale)           DE         Germany  51.48158, 11.97947
126292  Seen (Kreis 3)  Seen (Kreis 3)           CH     Switzerland   47.47646, 8.76996
131692  Schwedt (Oder)  Schwedt (Oder)           DE         Germany  53.05963, 14.28154

Solution

import re 

# ensure null values are dropped
cities = world_cities['ASCII Name'].dropna()

# Escape the special regex reserved characters in city names
pat = r'b(%s)b' % '|'.join(map(re.escape, cities))

# extract the matching occurences of the regex pattern
df['ASCII name'] = df['Company'].str.extract(pat, expand=False)

Result

                                                     Company        Certificate+Number                      Substance ASCII name
0              MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN  R1-CEP 2012-025 - Rev 02         Suxamethonium Chloride     Mumbai
1                                 SIEGFRIED LTD. Zofingen CH  R2-CEP 1996-036 - Rev 02    Amitriptyline hydrochloride   Zofingen
2     SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN  R0-CEP 2008-165 - Rev 00  Oxytetracycline hydrochloride       Zibo
3  CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ  R1-CEP 2002-193 - Rev 00        Ephedrine hydrochloride       Zibo
Answered By: Shubham Sharma
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.