How to extract city from dataframe column (not same format)

Question:

i have this dataframe, and i want to extract cities in a separate column. You can also see, that the format is not same, and the city can be anywhere in the row. How can i extract only cities i a new column?
Prompt. Here we are talking about German cities. May be to find a dictionary, that shows all German cities and somehow compare with my dataset?

Here is dictionary of german cities: https://gist.github.com/embayer/772c442419999fa52ca1

Dataframe

Adresse
0   Karlstr 10, 10 B, 30,; 04916 Hamburg
1   München Dorfstr. 28-55, 22555
2   Marnstraße. Berlin 12, 45666 Berlin
3   Musterstr, 24855 Dresden
... ...
850 Muster Hausweg 11, Hannover, 56668
851 Mariestr. 4, 48669 Nürnberg
852 Hilden Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 Potsdam

Output

Cities
0   Hamburg
1   München
2   Berlin
3   Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam
Asked By: Kamran

||

Answers:

You can use regular expression to extract the city names, as they are indicated by **:

import re 
import pandas

df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**", "**München** Dorfstr. 28-55, 22555", "Marnstraße. Berlin 12, 45666 **Berlin**", "Musterstr, 24855 **Dresden**"]})

df['Cities'] = [re.findall(r".***(.*)**", address)[0] for address in df['Adresse']]

This results in:

df
                                    Adresse   Cities
0  Karlstr 10, 10 B, 30,; 04916 **Hamburg**  Hamburg
1         **München** Dorfstr. 28-55, 22555  München
2   Marnstraße. Berlin 12, 45666 **Berlin**   Berlin
3              Musterstr, 24855 **Dresden**  Dresden

Answered By: Zephyrus

You could extract in a list all the cities from the dictionary you provided ( I asssume it’s the ‘stadt’ key ), and then use str.findall in your column:

cities_ = [cities[n]['stadt'] for n in range(0,len(cities))]
df.Adresse.str.findall(r'|'.join(cities_))

>>>
0    [Karlstr, Hamburg]
1                    []
2                    []
3                    []
4                    []
5                    []
6                    []
7                    []
8                    []
Name: Adresse, dtype: object
Answered By: sophocles

You can simply use str.extract since all the names are between couple of stars.

df["cities"] = df["Adress"].str.extract(r'**(w+)**')
Answered By: Nathan Furnal
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.