Slicing audio given video frames

Question

I have audio from a video that I’ve loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?

My intuition is to convert frames to time via:

start_time = frame_start / fps
end_time = frame_end / fps

then convert time to sample position with:

start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))

Is this correct? Or is there something I’m missing? I’m worried that there will be loss of information since I’m converting the samples into ints with floor.

Asked By: mehsheenman

Answer 1

Your solution is fine. Flooring shifts each cut point by less than one sample period. At a sample rate of 16000 Hz that is under 6.25e-05 seconds, and about 4.166e-05 seconds in the example here, which is orders of magnitude below what human ears are able to discern.

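import math

fps = 60            # video frame rate
frame_start = 121   # first frame of the segment
frame_end = 181     # last frame of the segment

sr = 16000          # audio sample rate in Hz

# Convert frame indices to timestamps in seconds
start_time = frame_start / fps
end_time = frame_end / fps

# Convert timestamps to sample indices, rounding down
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))

# Time lost by flooring the end index
print(end_time - end_sample / sr)  # 4.166666666671759e-05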
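
To check that the drift stays small for every frame and not just this one, here is a minimal sketch that measures it exactly. The helper names frame_to_sample and flooring_error are made up for illustration, and the sketch assumes an integer frame rate and sample rate so that it can use integer and fraction arithmetic instead of floats.

```python
from fractions import Fraction

def frame_to_sample(frame_idx, fps, sr):
    """Floor a frame index to a sample index using integer arithmetic only."""
    # Assumes fps and sr are integers; same result as floor(frame_idx / fps * sr).
    return frame_idx * sr // fps

def flooring_error(frame_idx, fps, sr):
    """Exact time in seconds between the floored sample and the true frame time."""
    return Fraction(frame_idx, fps) - Fraction(frame_to_sample(frame_idx, fps, sr), sr)

fps, sr = 60, 16000
worst = max(flooring_error(i, fps, sr) for i in range(10_000))

# The drift is never negative and always stays below one sample period (1 / sr).
print(float(worst), worst < Fraction(1, sr))
```

Because the drift is bounded by one sample period regardless of the frame index, it does not accumulate over a long video.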

Answered By: Ludvig J.

Answer 2

Let’s say you have:

import numpy as np

fs = 44100                # audio sampling frequency
vfr = 24                  # video frame rate
frame_start = 10          # index of first frame
frame_end = 10            # index of last frame (same frame here, so only the inclusive cut yields audio)
audio = np.arange(44100)  # audio in form of ndarray

You can then calculate the points in time at which to slice the audio:

time_start = frame_start / vfr
time_end = frame_end / vfr         # or (frame_end + 1) / vfr for inclusive cut

and then to which samples those points in time correspond:

sample_start_idx = int(time_start * fs)
sample_end_idx = int(time_end * fs)
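
The two indices can then be used directly as slice bounds. Below is a rough sketch of the whole conversion wrapped in one function. The name slice_audio and its inclusive flag are illustrative choices, and a plain Python list stands in for the audio so the snippet runs without extra packages. A 1-D ndarray slices the same way; for a tensor laid out as (channels, samples), which is what torchaudio.load returns by default, you would slice the last axis instead, e.g. audio[..., start:end].

```python
def slice_audio(audio, frame_start, frame_end, vfr, fs, inclusive=True):
    """Return the samples of `audio` that span frames frame_start..frame_end."""
    time_start = frame_start / vfr
    # With inclusive=True the cut runs to the end of the last frame.
    time_end = (frame_end + 1) / vfr if inclusive else frame_end / vfr
    sample_start_idx = int(time_start * fs)
    sample_end_idx = int(time_end * fs)
    return audio[sample_start_idx:sample_end_idx]

# Placeholder audio: one second of sample indices at 44.1 kHz.
audio = list(range(44100))

# A single frame with the inclusive cut gives roughly fs / vfr samples.
segment = slice_audio(audio, frame_start=10, frame_end=10, vfr=24, fs=44100)
print(len(segment))

# The same frame with the exclusive cut gives an empty slice.
empty = slice_audio(audio, 10, 10, 24, 44100, inclusive=False)
print(len(empty))
```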

It’s up to you whether to be more precise and account for the fact that the audio corresponding to a given frame should start half a frame before that frame and end half a frame after it.
In that case, use:

time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr
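
As a sketch of how the half-frame window turns into sample indices, the hypothetical helper below (centered_sample_range is not from any library) clamps the start at zero, as the clip call does, and additionally clamps the end to the length of the audio. That second clamp is an assumption on my part, added so the last frame cannot run past the end of the recording.

```python
def centered_sample_range(frame_start, frame_end, vfr, fs, num_samples):
    """Sample range covering half a frame before frame_start to half a frame after frame_end."""
    # Clamp at 0 so the first frame does not produce a negative start time.
    time_start = max((frame_start - 0.5) / vfr, 0.0)
    time_end = (frame_end + 0.5) / vfr
    start = int(time_start * fs)
    # Clamp to the audio length so the window never runs past the last sample.
    end = min(int(time_end * fs), num_samples)
    return start, end

# Frame 10 at 24 fps with 44.1 kHz audio, one second of audio available.
start, end = centered_sample_range(10, 10, vfr=24, fs=44100, num_samples=44100)
print(start, end, end - start)

# Frame 0: the window is cut off at the start of the audio.
first = centered_sample_range(0, 0, vfr=24, fs=44100, num_samples=44100)
print(first)
```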

Answered By: dankal444
