Adjust synchronizations of video and subtitles automatically by temporal distribution

I'm asking this as a meditation, because I don't expect any Perl (or other) code for that.

Questions:

Is it possible to filter speech frequencies in a video with significant accuracy to identify the passages where people talk?
Can the resulting pattern be used to synchronize a subtitle file, to match the gaps?

I'm looking for a low-tech solution offering a handful of plausible adjustments to choose from, not a speech recognition bazooka (like YT's auto-subtitles).

Background:

I often download foreign language movies and like to watch them with the original voices and subtitles to practice and learn vocabulary, but I am often obliged to download the subtitles separately and adjust their timing, because

they are shifted, because of trailers or of "what happened last time" intros
they are stretched, because of different frame rates
they need readjustment in the middle because scenes were cut out

There are already Perl modules to fix the first two cases for .srt files.

That is, if the parameters are known. But finding them can be tricky.
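To make the first two cases concrete: a shift and a stretch together are just the linear map t' = scale * t + shift applied to every timestamp. Below is a minimal Python sketch of that idea, not the code of any existing module. The names `retime`, `scale` and `shift_ms` are made up for illustration, and the frame rate example in the usage note (25 fps subtitles against a 23.976 fps video) is only an assumed typical case.

```python
import re

# Matches one SRT timestamp such as 00:05:15,300
TS = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")

def to_ms(h, m, s, ms):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def fmt(ms):
    """Format milliseconds as an SRT timestamp, clamping at zero."""
    ms = max(0, round(ms))
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def retime(srt_text, scale=1.0, shift_ms=0):
    """Apply t' = scale * t + shift_ms to every cue timing line."""
    out = []
    for line in srt_text.splitlines():
        if "-->" in line:  # only touch timing lines, never subtitle text
            line = TS.sub(
                lambda m: fmt(scale * to_ms(*m.groups()) + shift_ms), line
            )
        out.append(line)
    return "\n".join(out)
```

For example, `retime(text, shift_ms=20000)` would delay everything by 20 seconds, and `retime(text, scale=25 / 23.976)` would stretch subtitles timed for 25 fps to a 23.976 fps video. The third case (cut scenes) would need a different shift for each segment.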

FWIW: VLC offers an option for such synchros, but tends to freeze for a minute if the shift is in the area of 20 secs. No fun when trying out the best settings.

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery


Replies are listed 'Best First'.
Re: Adjust synchronizations of video and subtitles automatically by temporal distribution by afoken (Chancellor) on Jan 31, 2023 at 08:48 UTC
"Is it possible to filter speech frequencies in a video with significant accuracy to identify the passages where people talk?"

Telephony started with very bad microphones, transmitting barely anything outside the range 300 Hz to 3 kHz, but that was "good enough". Technical development improved the microphones, but analog telephony was and still is intentionally limited to that frequency range. Even when switching to ISDN, the sampling rate was only 8 kHz, limiting audio to about 3 kHz. Things changed only after migration to SIP, with "HD" audio codecs that allow higher frequencies, using more bandwidth and/or more available computing power.

So I would expect that a filter with that frequency range could be a usable indicator for speech.

Unfortunately, because the human ear is most sensitive in exactly this range, almost all audible warning signals also use that frequency range. So you will get some false positives. An FFT should be able to identify sharp peaks coming from all kinds of beepers, so that those peaks can be ignored.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
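A rough sketch of the two measurements suggested above, in Python: how much of a short frame's energy lies in the 300 Hz to 3 kHz band, and how "peaky" the spectrum is inside that band (a single beeper tone concentrates its energy in one bin, speech spreads it out). The function name `band_profile` and its return values are invented for this illustration. A naive DFT is used only to keep the example free of dependencies; real code would use an FFT library, and any cut-off for the peak-to-mean ratio would have to be found by experiment.

```python
import cmath
import math

def band_profile(frame, rate, lo=300.0, hi=3000.0):
    """Return (band_fraction, peakiness) for one frame of mono samples.

    band_fraction: share of spectral energy between lo and hi Hz.
    peakiness:     strongest bin in the band divided by the band's mean
                   bin power; large values hint at a pure tone (beeper).
    """
    n = len(frame)
    power = []  # (frequency in Hz, power) for each bin, DC and Nyquist skipped
    for k in range(1, n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        power.append((k * rate / n, abs(coeff) ** 2))

    total = sum(p for _, p in power) or 1.0  # avoid dividing by zero on silence
    band = [p for f, p in power if lo <= f <= hi]
    if not band:
        return 0.0, 0.0
    in_band = sum(band)
    mean = in_band / len(band)
    peakiness = max(band) / mean if mean else 0.0
    return in_band / total, peakiness
```

A frame would then count as "probably speech" when the band fraction is high and the peakiness stays low. Both thresholds are assumptions to be tuned, and music will still produce false positives.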
Re^2: Adjust synchronizations of video and subtitles automatically by temporal distribution by cavac (Parson) on Feb 02, 2023 at 14:39 UTC
I suspect you will also get some false positives with big budget movie musical scoring. For example, some tracks of the "Titanic"(*) movie use instruments that are supposed to sound like voices. Unless you want a lot of subtitles saying "aaaaaahhhh", more advanced filtering or access to a soundtrack without the music would be required.

(*) "Take Her to Sea, Mr. Murdoch" by James Horner

PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Re^2: Adjust synchronizations of video and subtitles automatically by temporal distribution by LanX (Saint) on Feb 01, 2023 at 01:25 UTC
Thanks.

Let's simplify this to a decision problem to have a start.

Let's suppose we have n SRT files with different timestamps, and one is a perfect match to a given soundtrack. Now we want to rank which ones fit best. (That's actually a real-life scenario.)

With SRT files I can easily tell sequences of non-speech gaps, like here 1.1 secs between 00:05:15,300 and 00:05:16,400:

    1
    00:05:00,400 --> 00:05:15,300
    This is an example of a subtitle.

    2
    00:05:16,400 --> 00:05:25,300
    This is an example of a subtitle - 2nd subtitle.

I could check how the gaps of those n SRTs overlap with "silent" passages in the soundtrack (e.g. an XOR metric) and rank the SRTs by proximity.

Question: how can I technically get the timestamps of silent passages of a soundtrack?

Let's define silent as falling under a certain volume threshold after filtering frequencies.

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery
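One way the ranking idea above might look in code, as a Python sketch under stated assumptions: the soundtrack has already been decoded (and band-filtered) into a list of mono samples in the range -1..1, "silent" means a windowed RMS below a fixed threshold, and the XOR metric is the total time during which exactly one of the two sources says "gap". All function names, the 100 ms window and the 0.02 threshold are illustrative choices, not established values. All times are integer milliseconds.

```python
import math
import re

CUE = re.compile(
    r"(\d+):(\d\d):(\d\d),(\d{3}) --> (\d+):(\d\d):(\d\d),(\d{3})"
)

def _ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def srt_gaps(srt_text):
    """Return (start_ms, end_ms) spans between consecutive cues."""
    cues = sorted(
        (_ms(*m.groups()[:4]), _ms(*m.groups()[4:]))
        for m in CUE.finditer(srt_text)
    )
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(cues, cues[1:])
        if next_start > prev_end
    ]

def silent_spans(samples, rate, window_ms=100, threshold=0.02):
    """Return (start_ms, end_ms) spans whose windowed RMS is below threshold."""
    size = max(1, rate * window_ms // 1000)
    spans, start = [], None
    for i in range(0, len(samples), size):
        chunk = samples[i:i + size]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        t = i * 1000 // rate
        if rms < threshold:
            if start is None:
                start = t           # a silent span begins here
        elif start is not None:
            spans.append((start, t))  # the silent span ended at this window
            start = None
    if start is not None:
        spans.append((start, len(samples) * 1000 // rate))
    return spans

def xor_mismatch(spans_a, spans_b, duration_ms, step_ms=100):
    """Milliseconds during which exactly one span list reports a gap."""
    def inside(spans, t):
        return any(a <= t < b for a, b in spans)
    return step_ms * sum(
        inside(spans_a, t) != inside(spans_b, t)
        for t in range(0, duration_ms, step_ms)
    )
```

Ranking would then be something like `min(candidates, key=lambda srt: xor_mismatch(srt_gaps(srt), silence, duration_ms))`, where `silence` comes from `silent_spans`. As an alternative to computing the envelope by hand, ffmpeg has a `silencedetect` audio filter that logs the start and end of silent passages, which could supply the same spans.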
Re: Adjust synchronizations of video and subtitles automatically by temporal distribution by bliako (Monsignor) on Jan 31, 2023 at 08:49 UTC
https://cmusphinx.github.io/wiki/longaudioalignment/ and then subs ?
