Slicing audio given video frames
Question:
I have audio from a video that I’ve loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?
My intuition is to convert frames to time via:
start_time = frame_start / fps
end_time = frame_end / fps
the convert time to sample position with:
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
Is this correct? Or is there something I’m missing? I’m worried that there will be loss of information since I’m converting the samples into ints with floor.
Answers:
Your solution is just fine. Assuming your sample rate is 16000, the flooring will cause a video/audio desynch on the order of 4.166e-05 seconds, which is orders of magnitude below what human ears are able to discern.
import math
fps = 60
frame_start = 121
frame_end = 181
sr=16000
start_time = frame_start / fps
end_time = frame_end / fps
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
print(end_time-end_sample/sr) # 4.166666666671759e-05
Let’s say you have
fs = 44100 # audio sampling frequency
vfr = 24 # video frame rate
frame_start = 10 # index of first frame
frame_end = 10 # index of last frame
audio = np.arange(44100) # audio in form of ndarray
you can calculate at which points in time you want to slice the audio
time_start = frame_start / vfr
time_end = frame_end / vfr # or (frame_end + 1) / vfr for inclusive cut
and then to which samples those points in time correspond:
sample_start_idx = int(time_start * fs)
sample_end_idx = int(time_end * fs)
Its up to you if you want to be super-precise and take into account the fact that audio corresponding to a given frame should rather be starting half a frame before a frame and end half a frame after.
In such a case use:
time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr
I have audio from a video that I’ve loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?
My intuition is to convert frames to time via:
start_time = frame_start / fps
end_time = frame_end / fps
the convert time to sample position with:
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
Is this correct? Or is there something I’m missing? I’m worried that there will be loss of information since I’m converting the samples into ints with floor.
Your solution is just fine. Assuming your sample rate is 16000, the flooring will cause a video/audio desynch on the order of 4.166e-05 seconds, which is orders of magnitude below what human ears are able to discern.
import math
fps = 60
frame_start = 121
frame_end = 181
sr=16000
start_time = frame_start / fps
end_time = frame_end / fps
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
print(end_time-end_sample/sr) # 4.166666666671759e-05
Let’s say you have
fs = 44100 # audio sampling frequency
vfr = 24 # video frame rate
frame_start = 10 # index of first frame
frame_end = 10 # index of last frame
audio = np.arange(44100) # audio in form of ndarray
you can calculate at which points in time you want to slice the audio
time_start = frame_start / vfr
time_end = frame_end / vfr # or (frame_end + 1) / vfr for inclusive cut
and then to which samples those points in time correspond:
sample_start_idx = int(time_start * fs)
sample_end_idx = int(time_end * fs)
Its up to you if you want to be super-precise and take into account the fact that audio corresponding to a given frame should rather be starting half a frame before a frame and end half a frame after.
In such a case use:
time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr