Working with audio/video is best left to C. This means that you probably want to look for a C video library with an XS interface.
Luckily, there's the FFmpeg module (available at a CPAN near you).
In fact, the SYNOPSIS section of its POD even has an example of capturing a frame from a stream.