Create new column based on whether values in one data frame are within ranges from second data frame
Question:
I have an output data frame that contains the predictions of where target sounds are in a set of recordings. The data frame has the sound.file name, start and end time. Here is an example of what my data looks like:
import numpy as np
import pandas as pd

preds = pd.DataFrame({
    'sound.file': np.random.choice(['A', 'B', 'C'], 20),
    'start': np.random.choice(10, 20),
})
preds['end'] = preds['start'] + np.random.choice([1, 2], 20)
I then have a reference data frame which contains the sound.files names and the actual start and end times of the target signals. The reference detections won’t be integers as they are the real timings of calls within the recording.
ref = pd.DataFrame({
    'sound.file': np.random.choice(['A', 'B', 'C'], 5),
    'start': np.random.uniform(0, 10, 5),  # low, high, size
})
ref['end'] = ref['start'] + np.random.uniform(1, 2, 5)
I want to add a column to the preds data frame that has either a 1 if a predicted signal overlaps with an actual signal from the same sound.file, or a 0 if it does not.
My output would look something like this:
preds['match'] = np.random.choice([0,1], 20)
preds
I can do this in R, and there are a couple of different ways to do it, for example like this. However, I am not familiar with Python, so I need some help.
Answers:
ANSWER TO YOUR POST
With the random dataframes preds and ref that I got running your code, here is one way to do it:
# Add interval as a column (e.g. start=1 and end=4 => actual={1, 2, 3, 4}) and groupby
# Note: range() only accepts integers, so this assumes integer start/end values
ref["actual"] = ref.apply(lambda x: set(range(x["start"], x["end"] + 1)), axis=1)
ref = ref.groupby("sound.file").agg({"actual": list}).reset_index()
# Add interval as a column
preds["predicted"] = preds.apply(lambda x: set(range(x["start"], x["end"] + 1)), axis=1)
# Add actual column to preds
preds = pd.merge(left=preds, right=ref, on="sound.file", how="left")
# Deal with NaN values (sound files that have no reference intervals)
preds["actual"] = preds["actual"].apply(lambda x: x if isinstance(x, list) else [set()])
# Check for overlaps
preds["match"] = preds.apply(
    lambda x: 1
    if any(x["predicted"].intersection(actual) for actual in x["actual"])
    else 0,
    axis=1,
)
# Cleanup
preds = preds.drop(columns=["predicted", "actual"])
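So that preds now has a match column holding 1 for each prediction that overlaps a reference interval from the same sound.file, and 0 otherwise.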
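The set intersection above is non-empty exactly when two closed intervals share at least one point, which comes down to two comparisons. The sketch below is only an illustration: intervals_overlap is a hypothetical helper name, not something from the question or from pandas, and it assumes start <= end for both intervals. The loop checks that it agrees with the set-based test on a few integer cases.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True when closed intervals [a_start, a_end] and [b_start, b_end] share a point.

    Hypothetical helper for illustration; assumes start <= end for each interval.
    """
    return a_start <= b_end and b_start <= a_end


# Integer case: same answer as the set-intersection check used above.
for a, b in [((1, 4), (4, 6)), ((1, 2), (3, 5)), ((2, 8), (3, 4))]:
    via_sets = bool(set(range(a[0], a[1] + 1)) & set(range(b[0], b[1] + 1)))
    assert intervals_overlap(*a, *b) == via_sets

# The comparison form also works for float times, where range() cannot be used.
print(intervals_overlap(1.2, 2.5, 2.4, 3.0))  # True
print(intervals_overlap(1.2, 2.0, 2.4, 3.0))  # False
```

Because it needs no enumeration of integer points, the same test carries over unchanged to real-valued call timings.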
BEYOND THE SCOPE OF YOUR POST
Here is how to deal with continuous intervals (float values).
# Setup
preds = pd.DataFrame(
    {
        "sound.file": np.random.choice(["A", "B", "C"], 20),
        "start": np.random.uniform(low=0, high=10, size=20),
    }
)
preds["end"] = preds["start"] + np.random.choice([1, 2], 20)
ref = pd.DataFrame(
    {
        "sound.file": np.random.choice(["A", "B", "C"], 5),
        "start": np.random.uniform(low=0, high=10, size=5),
    }
)
ref["end"] = ref["start"] + np.random.choice([1, 2], 5)
# Add interval as a column (e.g. start=1.2358 and end=4.4987 => actual=[1.2358, 4.4987])
# and groupby
ref["actual"] = ref[["start", "end"]].apply(lambda x: round(x, 4)).values.tolist()
ref = ref.groupby("sound.file").agg({"actual": sorted}).reset_index()
# Add actual column to preds
preds = pd.merge(left=preds, right=ref, on="sound.file", how="left")
# Deal with NaN values (sound files that have no reference intervals)
preds["actual"] = preds["actual"].apply(lambda x: x if isinstance(x, list) else [])
# Check for overlaps
preds["match"] = preds.apply(
    lambda x: 1
    # Prediction lies entirely inside a reference interval
    if any(
        [(x["start"] >= period[0]) & (x["end"] <= period[-1]) for period in x["actual"]]
    )
    # Prediction starts inside a reference interval
    | any(
        [
            (x["start"] >= period[0]) & (x["start"] <= period[-1])
            for period in x["actual"]
        ]
    )
    # Prediction ends inside a reference interval
    | any(
        [(x["end"] >= period[0]) & (x["end"] <= period[-1]) for period in x["actual"]]
    )
    # Prediction entirely covers a reference interval
    | any(
        [(x["start"] <= period[0]) & (x["end"] >= period[-1]) for period in x["actual"]]
    )
    else 0,
    axis=1,
)
# Cleanup
preds = preds.drop(columns="actual")
So that preds again ends up with a 0/1 match column, this time computed from float start and end times.
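The four any() clauses above all describe the same situation: a prediction and a reference interval overlap when each one starts no later than the other ends. As a rough alternative sketch (not the answer's original method), the check can be vectorized by merging the two frames on sound.file and comparing columns directly. The names add_match, pred_id, start_ref and end_ref are assumptions introduced here for illustration, and interval endpoints are treated as inclusive.

```python
import numpy as np
import pandas as pd


def add_match(preds: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of preds with a 0/1 'match' column (hypothetical helper)."""
    left = preds.copy()
    # Tag each prediction with its row position so pairs can be grouped back.
    left["pred_id"] = np.arange(len(left))
    # Pair every prediction with every reference interval from the same file.
    pairs = left.merge(ref, on="sound.file", how="left", suffixes=("", "_ref"))
    # Closed intervals overlap when each one starts no later than the other ends.
    # Files missing from ref give NaN bounds, and comparisons with NaN are False.
    overlaps = (pairs["start"] <= pairs["end_ref"]) & (pairs["start_ref"] <= pairs["end"])
    # A prediction matches if it overlaps at least one reference interval.
    matched = overlaps.groupby(pairs["pred_id"]).any()
    out = preds.copy()
    out["match"] = matched.to_numpy().astype(int)
    return out


# Small hand-made example (values chosen for illustration only).
demo_preds = pd.DataFrame(
    {
        "sound.file": ["A", "A", "B", "C"],
        "start": [0.0, 5.0, 1.0, 2.0],
        "end": [1.0, 6.0, 2.0, 3.0],
    }
)
demo_ref = pd.DataFrame(
    {
        "sound.file": ["A", "B", "B"],
        "start": [0.5, 3.0, 1.5],
        "end": [0.8, 4.0, 1.7],
    }
)
result = add_match(demo_preds, demo_ref)
print(result["match"].tolist())  # [1, 0, 1, 0]
```

The merge produces one row per prediction and reference pair, so memory grows with the number of reference intervals per file; for recordings with many calls, the row-wise apply above or an interval-based join may be a better fit.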