Is there a more efficient way to apply this custom function to the entire dataset?
Question:
I have a dataset that looks like this with IP addresses (for security’s sake, these are all made up):
0 | 1 | 2 |
---|---|---|
100.0.200.0 | 160.60.30.0 | NaN |
NaN | 101.60.10.0 | 10.0.0.1 |
I want to apply a function that takes these IP addresses (where they exist) and returns each one with the fourth octet removed, so the result looks like this:
0 | 1 | 2 |
---|---|---|
100.0.200 | 160.60.30 | NaN |
NaN | 101.60.10 | 10.0.0 |
I have written the code below, which does the job, but it is very slow because it processes every cell one at a time in Python, and I want to be able to do this faster.
def sliceip(row):
    row = str(row)
    return row.rsplit(".", 1)[0]

def applysliceip(rowx):
    for i, item in enumerate(rowx):
        rowx[i] = sliceip(item)
    return rowx

# And I apply this to the entire dataframe as such:
split_IPs = IPs.apply(lambda row: applysliceip(row))
So my question is: is there a more Pythonic and faster way to accomplish the above and return the same output without using so much memory?
Answers:
A possible solution, which uses pandas.DataFrame.applymap and a regex to replace the final dot and the digits after it with an empty string:
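import re
# na_action='ignore' leaves NaN cells untouched instead of passing them to re.sub
df.applymap(lambda x: re.sub(r'\.\d+$', '', x), na_action='ignore')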
Output:
0 1 2
0 100.0.200 160.60.30 NaN
1 NaN 101.60.10 10.0.0
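For reference, here is a self-contained sketch of the cell-wise approach on the sample data from the question. The frame is rebuilt by hand and the helper name `strip_last_octet` is hypothetical. It assumes pandas 1.2 or later for `na_action`. Since `DataFrame.applymap` is deprecated in favour of `DataFrame.map` from pandas 2.1 onward, the sketch uses whichever one is available.

```python
import re

import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (made-up addresses).
df = pd.DataFrame(
    [["100.0.200.0", "160.60.30.0", np.nan],
     [np.nan, "101.60.10.0", "10.0.0.1"]],
    dtype=object,
)


def strip_last_octet(value):
    """Drop the final '.<digits>' group from one address string (hypothetical helper)."""
    return re.sub(r"\.\d+$", "", value)


# DataFrame.map replaced applymap in pandas 2.1; fall back on older versions.
cellwise = df.map if hasattr(df, "map") else df.applymap

# na_action="ignore" skips NaN cells, so the helper only ever sees strings.
result = cellwise(strip_last_octet, na_action="ignore")
print(result)
```

Without `na_action="ignore"`, the NaN cells (which are floats) would be handed to `re.sub`, which only accepts strings.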
A faster solution, based on numpy:
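import re
import numpy as np
import pandas as pd
# str(x) keeps re.sub from failing on NaN; np.where then restores the original NaN cells
v = np.vectorize(lambda x: re.sub(r'\.\d+$', '', str(x)))
pd.DataFrame(np.where(pd.notnull(df), v(df), df))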
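A self-contained sketch of the NumPy variant follows. It swaps the `np.where` mask for an explicit `isinstance` guard, so that only strings reach the regex. The sample frame and the name `strip_last_octet` are assumptions for illustration. The NumPy documentation describes `np.vectorize` as a convenience wrapper around a Python loop, so any speed-up over `applymap` is worth measuring on real data rather than assuming.

```python
import re

import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (made-up addresses).
df = pd.DataFrame(
    [["100.0.200.0", "160.60.30.0", np.nan],
     [np.nan, "101.60.10.0", "10.0.0.1"]],
    dtype=object,
)

# Compile once so the pattern is not re-parsed for every cell.
pattern = re.compile(r"\.\d+$")


def strip_last_octet(value):
    """Return the string without its last octet; pass non-strings (NaN) through."""
    return pattern.sub("", value) if isinstance(value, str) else value


# otypes=[object] keeps the mixed str/NaN results in an object array.
vectorized = np.vectorize(strip_last_octet, otypes=[object])

# Rebuild a frame with the original labels around the transformed values.
result = pd.DataFrame(
    vectorized(df.to_numpy(dtype=object)),
    index=df.index,
    columns=df.columns,
)
print(result)
```

Passing `index` and `columns` explicitly keeps the original labels, which a bare `pd.DataFrame(array)` would reset to default integers.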
You can use a regular expression to match and replace instead of using a custom function.
IPs.replace(r"(\d+\.\d+\.\d+)\.\d+", r"\1", regex=True)
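A short sketch of the regex replacement on the question's sample data, assuming the frame is named `IPs` as in the question. With `regex=True`, `DataFrame.replace` only rewrites string cells, so NaN values pass through unchanged. The `^` and `$` anchors are an addition here, so that only cells consisting of exactly four dot-separated numeric groups are rewritten.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (made-up addresses).
IPs = pd.DataFrame(
    [["100.0.200.0", "160.60.30.0", np.nan],
     [np.nan, "101.60.10.0", "10.0.0.1"]],
    dtype=object,
)

# Capture the first three octets and keep only that group.
split_IPs = IPs.replace(r"^(\d+\.\d+\.\d+)\.\d+$", r"\1", regex=True)
print(split_IPs)
```

`replace` returns a new frame by default, so `IPs` itself is left unmodified.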