Get value from next row into the previous row as a separate column
Question:
import pandas as pd
url = r'https://www.geonames.org/postal-codes/DE/BE/berlin.html'
table = pd.read_html(url)
table[2].to_excel('berlin_zipcodes.xlsx')
Take for example the first 2 rows:
52.517 is supposedly the longitude
13.387 is supposedly the latitude.
row[0] should have 52.517 as the value of the column "Longitude" and 13.387 as the value of the column "Latitude".
The screenshot was created in Excel, but I would like to automate the process with Python.
Answers:
You can try:
import pandas as pd
url = r'https://www.geonames.org/postal-codes/DE/BE/berlin.html'
table = pd.read_html(url)[2]
# identify rows with coordinates
m = table.pop('Unnamed: 0').isna()
# keep the other rows (copy so the column assignment below does not hit a view)
out = table[~m].copy()
# backfill the coordinates and split to new columns
out[['Longitude', 'Latitude']] = table['Place'].where(m).bfill()[~m].str.split('/', n=1, expand=True)
out.to_excel('berlin_zipcodes.xlsx')
Output:
Place Code Country Admin1 Admin2 Admin3 Admin4 Longitude Latitude
0 Berlin 10117 Germany Berlin NaN Berlin, Stadt Berlin 52.517 13.387
2 Berlin 10115 Germany Berlin NaN Berlin, Stadt Berlin 52.532 13.385
4 Berlin 10119 Germany Berlin NaN Berlin, Stadt Berlin 52.53 13.405
6 Berlin 10178 Germany Berlin NaN Berlin, Stadt Berlin 52.521 13.41
8 Berlin 10179 Germany Berlin NaN Berlin, Stadt Berlin 52.512 13.416
.. ... ... ... ... ... ... ... ... ...
378 Berlin 13583 Germany Berlin NaN Berlin, Stadt Berlin 52.544 13.182
380 Berlin 13589 Germany Berlin NaN Berlin, Stadt Berlin 52.557 13.168
382 Berlin 13159 Germany Berlin NaN Berlin, Stadt Berlin 52.623 13.398
384 Berlin 14131 Germany Berlin NaN Berlin, Stadt Berlin 52.517 13.4
386 Reinickendorf 13047 Germany Berlin NaN Berlin, Stadt Berlin 52.567 13.333
[194 rows x 9 columns]
Intermediates:
# this computes a boolean Series to select the rows with coordinates
m = table.pop('Unnamed: 0').isna()
# this masks the non-coordinate values in the "Place" column
# and backfills each coordinate string onto the preceding row
table['Place'].where(m).bfill()
# then we select the other rows
table['Place'].where(m).bfill()[~m]
# and split on "/" to get 2 new columns
table['Place'].where(m).bfill()[~m].str.split('/', n=1, expand=True)
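To try the steps above without fetching the page, here is a minimal sketch on a hand-made frame. The two-place `table` below is a hypothetical stand-in that only mimics the layout described in the question (a place row followed by a row whose "Unnamed: 0" is NaN and whose "Place" holds the "a/b" coordinate string); it is not the real scraped data. The final `astype(float)` is an addition, since `str.split` returns strings.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (assumed layout, not real data):
# every place row is followed by a coordinate row with NaN in "Unnamed: 0".
table = pd.DataFrame({
    "Unnamed: 0": [1.0, None, 2.0, None],
    "Place": ["Berlin", "52.517/13.387", "Berlin", "52.532/13.385"],
    "Code": ["10117", None, "10115", None],
})

# True on the coordinate rows; pop() also drops the helper column
m = table.pop("Unnamed: 0").isna()

# keep only the place rows, as an independent copy
out = table[~m].copy()

# hide place names, pull each coordinate string up one row,
# keep the place rows, then split "a/b" into two columns
coords = table["Place"].where(m).bfill()[~m].str.split("/", n=1, expand=True)

# str.split yields strings, so convert to floats before storing
out[["Longitude", "Latitude"]] = coords.astype(float)

print(out)
```

The column names follow the question. For Berlin, values around 52.5 are normally the latitude and values around 13.4 the longitude, so you may want to swap the two labels.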