Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

OK, here's the scenario: I have a number of files in various formats (from .doc to .p65) which I would like to extract the text from.
(Why not just open the file and cut and paste the text? Too much work when you have many.)

I tried to make a script which would take in the input file one line at a time, and look for words which contained funky (i.e. non-alphanumeric, non-common punctuation) characters. These words would then be replaced with null, and the new line dumped to an output file. Although this was successful in removing most of the garbage from the text file, it still left about 20%

Does anyone have any suggestions as to how this could be accomplished? It seems like such a simple task, but I only started with Perl a short while ago.

Re: Removing Junk from Files
by physi (Friar) on Apr 12, 2001 at 11:07 UTC
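    Hey, if you only want alphanumerics and common punctuation, here you are:
    open my $fh, '<', "something" or die "Cannot open: $!";
    while (<$fh>) {
        s/[^A-Za-z0-9\s!?,.;:'"]//g;   # drop everything not listed (\s keeps whitespace)
        print;
    }
    Add whatever characters you want to keep to the !?,.;:'" part.
    What does this do?
    It replaces every character that is not one of the chars in [] with nothing. The \s keeps spaces and newlines, which would otherwise be stripped along with the junk. This is probably easier than listing every unwanted char in the []; the ^ at the beginning does the '!=' job :-).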
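For anyone who wants to try the whitelist idea on a single string first, here is a minimal sketch. The sub names strip_junk and strip_junk_tr are made up for illustration, and keeping whitespace is an assumption about what you want. The second sub does roughly the same job with tr///, where c complements the list and d deletes the matches (it keeps space, tab, CR and LF, which is slightly narrower than \s).

```perl
use strict;
use warnings;

# Hypothetical helper: keep letters, digits, whitespace and common punctuation.
sub strip_junk {
    my ($line) = @_;
    (my $clean = $line) =~ s/[^A-Za-z0-9\s!?,.;:'"]//g;
    return $clean;
}

# Same whitelist with tr///: "c" complements the list, "d" deletes what matches.
sub strip_junk_tr {
    my ($line) = @_;
    (my $clean = $line) =~ tr/A-Za-z0-9 \t\r\n!?,.;:'"//cd;
    return $clean;
}

print strip_junk("Hello,\x01\x02 world!\xFF\n");   # prints "Hello, world!"
```

Note that a whitelist like this only removes junk characters; junk that happens to consist of ordinary letters survives.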

    Hope this will help.

    ----------------------------------- --the good, the bad and the physi-- -----------------------------------
      s/[^\w\s!?,.;:'"]//g; would do too (\w also keeps the underscore)... I think \w honors the locale under "use locale"

      Greetz
      Beatnik
      ... Quidquid perl dictum sit, altum viditur.
Re: Removing Junk from Files
by RhetTbull (Curate) on Apr 12, 2001 at 19:12 UTC
    if you're on a *nix system:

    strings filename

    Yeah, it's not perl, but it was made for *exactly* this sort of thing and works quite well. Regards,
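If you'd rather stay in Perl, here is a rough sketch of the same idea: pull out every run of printable characters above some minimum length. The names printable_runs and dump_strings are hypothetical. The minimum of 4 mirrors the usual default of strings. Treating only ASCII space through tilde plus tab as printable is an assumption, so accented text would be cut apart.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of at least $min printable ASCII
# characters (space through tilde, plus tab) found in a chunk of raw bytes.
# Call it in list context to get the runs back.
sub printable_runs {
    my ($bytes, $min) = @_;
    $min = 4 unless defined $min;
    return $bytes =~ /([\x20-\x7E\t]{$min,})/g;
}

# Hypothetical driver: slurp a file in binary mode and print one run per line.
sub dump_strings {
    my ($path, $min) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp mode: read the whole file at once
    my $data = <$fh>;
    close $fh;
    print "$_\n" for printable_runs($data, $min);
}

dump_strings($_) for @ARGV;
```

Save it under any name you like and pass the files to scan as arguments. It slurps each file into memory, which is fine for documents but not for huge files.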

    --RT

      Sure! In case you cannot find a Win32 port (I anticipate you are on a Win32 box), get strings from PPT (Perl Power Tools).

      HTH
      --
      idnopheq
      Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.

Re: Removing Junk from Files
by idnopheq (Chaplain) on Apr 12, 2001 at 19:19 UTC
    Well, for your WinWord .doc files, try RTF::Parser (it reads RTF, so the documents need to be saved as RTF first). Some of your other file formats may well have parser modules available.

    For quick and dirty, sometimes I'll save a file in html via the app and then html2txt it for the output. Lazy? Yes! But, I did automate the process via Win32::OLE.

    Anywho, I'm changing employers today (hurrah!), so I can't provide my script just now.

    HTH
    --
    idnopheq
    Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.

    UPDATE: Check out iguane's WORD TO TEXT SIMPLY for the OLE stuff I mentioned.