in reply to web crawler infrastructure

"Do you know such a cpan module which can provide this functionality?"

There's no module on CPAN which matches all three criteria. There are also other considerations: for example, PDF files may simply be scanned images, meaning you'd have to OCR them to get the text. See WWW::Mechanize::Firefox, PDF::OCR2, Super Search.