Create new column based on whether values in one data frame are within ranges from second data frame

Question

I have an output data frame that contains the predictions of where target sounds are in a set of recordings. The data frame has the sound.file name, start and end time. Here is an example of what my data looks like:

preds = pd.DataFrame({
    'sound.file':np.random.choice( ['A','B','C'], 20),
    'start':np.random.choice(10, 20),
    })
preds['end'] = preds['start'] + np.random.choice([1,2], 20)

I then have a reference data frame which contains the sound.files names and the actual start and end times of the target signals. The reference detections won’t be integers as they are the real timings of calls within the recording.

ref = pd.DataFrame({
    'sound.file':np.random.choice( ['A','B','C'], 5),
    'start':np.random.uniform(10, 5),
    })
ref['end'] = ref['start'] + np.random.uniform([1,2], 5)

I want to add a column to the preds data frame that has either a 1 if a predicted signal overlaps with an actual signal from the same sound.file or 0 if it does not.

My output would look something like this:

preds['match'] = np.random.choice([0,1], 20)
preds

I can do this is R and there a a couple of different ways to do it, for example like this. However, I am not familiar with python so need some help.

Asked By: TomCLewis

||

Source

Answer 1

ANSWER TO YOUR POST

With the following random dataframes that I got running your code:
preds:

ref:

Here is one way to do it:

# Add interval as a column (e.g. start=1 and end=4 => actual={1, 2, 3, 4}) and groupby
ref["actual"] = ref.apply(lambda x: set(range(x["start"], x["end"] + 1)), axis=1)
ref = ref.groupby("sound.file").agg({"actual": list}).reset_index()

# Add interval as a column
preds["predicted"] = preds.apply(lambda x: set(range(x["start"], x["end"] + 1)), axis=1)

# Add actual column to preds
preds = pd.merge(left=preds, right=ref, on="sound.file", how="left")

# Deal with NaN values
preds["actual"] = preds["actual"].apply(lambda x: [{}] if x is np.nan else x)

# Check for overlaps
preds["match"] = preds.apply(
    lambda x: 1
    if any([x["predicted"].intersection(actual) for actual in x["actual"]])
    else 0,
    axis=1,
)

# Cleanup
preds = preds.drop(columns=["predicted", "actual"])

So that preds:

BEYOND THE SCOPE OF YOUR POST

Here is how to deal with continuous intervals (float values).

# Setup

preds = pd.DataFrame(
    {
        "sound.file": np.random.choice(["A", "B", "C"], 20),
        "start": np.random.uniform(low=0, high=10, size=20),
    }
)
preds["end"] = preds["start"] + np.random.choice([1, 2], 20)

preds:

ref = pd.DataFrame(
    {
        "sound.file": np.random.choice(["A", "B", "C"], 5),
        "start": np.random.uniform(low=0, high=10, size=5),
    }
)
ref["end"] = ref["start"] + np.random.choice([1, 2], 5)

ref:

# Add interval as a column (e.g. start=1.2358 and end=4.4987 => actual=[1.2358, 4.4987]
# and groupby
ref["actual"] = ref[["start", "end"]].apply(lambda x: round(x, 4)).values.tolist()
ref = ref.groupby("sound.file").agg({"actual": sorted}).reset_index()

# Add actual column to preds
preds = pd.merge(left=preds, right=ref, on="sound.file", how="left")

# Deal with NaN values
preds["actual"] = preds["actual"].apply(lambda x: [[-1]] if x is np.nan else x)

# Check for overlaps
preds["match"] = preds.apply(
    lambda x: 1
    if any(
        [(x["start"] >= period[0]) & (x["end"] <= period[-1]) for period in x["actual"]]
    )
    | any(
        [
            (x["start"] >= period[0]) & (x["start"] <= period[-1])
            for period in x["actual"]
        ]
    )
    | any(
        [(x["end"] >= period[0]) & (x["end"] <= period[-1]) for period in x["actual"]]
    )
    | any(
        [(x["start"] <= period[0]) & (x["end"] >= period[-1]) for period in x["actual"]]
    )
    else 0,
    axis=1,
)

# Cleanup
preds = preds.drop(columns="actual")

So that preds:

Answered By: Laurent

Create new column based on whether values in one data frame are within ranges from second data frame

Question:

Answers:

ANSWER TO YOUR POST

BEYOND THE SCOPE OF YOUR POST