jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. Is there a Perl module that can do the same thing as the Linux tool 'strings', i.e. extract information/ASCII text from Word documents, PDF files, etc.?

Replies are listed 'Best First'.
Re: Perl variant of linux tool strings
by Tanktalus (Canon) on Mar 23, 2005 at 20:34 UTC

    Depending on your needs, this may be doable with just a regular expression...

    while (/([[:print:]]{4,})/g) { print $1,$/; }
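
    A slightly fuller sketch of the same idea, for reading whole files in binary mode. The helper names extract_strings and strings_in_file are made up for illustration, and the choice to treat tab as printable is an assumption of this sketch, not something the regex above does. The minimum run length defaults to 4, as in the {4,} above.

```perl
use strict;
use warnings;

# Return every run of at least $min printable ASCII characters in $data.
# (Hypothetical helper; $min defaults to 4.)
sub extract_strings {
    my ($data, $min) = @_;
    $min = 4 unless defined $min;
    # \x20-\x7E is printable ASCII (space through tilde); tab is allowed
    # as well, which is a choice made for this sketch.
    return $data =~ /([\x20-\x7E\t]{$min,})/g;
}

# Read a file in binary mode and return its printable runs.
sub strings_in_file {
    my ($path, $min) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp the whole file at once
    my $data = <$fh> // '';
    close $fh;
    return extract_strings($data, $min);
}

# Usage: perl strings.pl file1 file2 ...
print "$_\n" for map { strings_in_file($_) } @ARGV;
```

    Slurping keeps the code short but holds the whole file in memory; for very large files a chunked read would be needed, with care for runs that straddle chunk boundaries.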

Re: Perl variant of linux tool strings
by duct_tape (Hermit) on Mar 23, 2005 at 20:46 UTC
      I would like to collect words from a PDF or Word document. So far Perl Power Tools does a very good job. Thanks!

        For collecting words from PDF documents, you can use the ps2ascii utility, which comes with Ghostscript. It runs the document through Ghostscript using a special device that outputs only ASCII text. Since Ghostscript can handle PDFs too, ps2ascii works fine on them (although I did have compatibility problems with some PDFs, depending on the generating program and the Ghostscript version).

        This doesn't work for Word documents, of course.
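
        The ps2ascii approach can also be driven from Perl. This is only a rough sketch: it assumes ps2ascii is on your PATH and writes the extracted text to standard output when given just an input file. The sub names pdf_text and words_from_text are invented for the example, and "word" here simply means a run of ASCII letters and apostrophes.

```perl
use strict;
use warnings;

# Split plain text into "words": runs of ASCII letters, optionally
# containing apostrophes (a deliberately simple definition).
sub words_from_text {
    my ($text) = @_;
    return $text =~ /([A-Za-z][A-Za-z']*)/g;
}

# Run ps2ascii on a PDF and return its output as one string.
# Assumes ps2ascii is installed and prints to STDOUT.
sub pdf_text {
    my ($pdf) = @_;
    # List-form pipe open avoids passing the filename through a shell.
    open my $ps, '-|', 'ps2ascii', $pdf
        or die "Cannot run ps2ascii: $!";
    local $/;                       # slurp all output
    my $text = <$ps> // '';
    close $ps or warn "ps2ascii exited with status $?\n";
    return $text;
}

# Usage: perl pdfwords.pl document.pdf
if (@ARGV) {
    print "$_\n" for words_from_text(pdf_text($ARGV[0]));
}
```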

Re: Perl variant of linux tool strings
by dragonchild (Archbishop) on Mar 23, 2005 at 20:35 UTC
    $ perl -n -e '@strings = /([\w\n\r\s]+)/g; print "@strings\n"' file1 file2 file3

    from the commandline ... or am I missing something?

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Thanks for the answers. What I want from this is the 'words' inside, for example, a PDF file.
        What's your real question? Do you want to parse a PDF file or do you want to extract the ASCII sequences from within a non-ASCII file?