wcw has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

My knowledge of Perl is quite limited. Here's my problem: I have been able to parse 20K+ messages representing the archives of a forum at Yahoogroups by using mboxparser. To facilitate loading the individual elements of each message into MySQL, they are delimited by ctrl-A instead of \n. Each record is then delimited by ctrl-B.

Each email message, however, has tag-line advertising that I want to remove. The tag lines begin with: "------ Yahoo"

Can someone point me to a script or give me guidance on how to search a file for the term "------ Yahoo" and delete it and all text following, up to the ctrl-B delimiter?

With kind regards for your assistance,
Bill
  • Comment on How do I delete from a delimiter to the end of a file?

Replies are listed 'Best First'.
Re: How do I delete from a delimiter to the end of a file?
by ikegami (Patriarch) on Aug 25, 2008 at 00:29 UTC
    $file =~ s/------ Yahoo.*?(?=\cB)//sg;

    Update: Added "s" modifier.
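
A minimal sketch of how that substitution behaves on a slurped archive, assuming ctrl-A between fields and ctrl-B after each record as described in the question. The sample data is invented for illustration. It also shows why the "s" modifier matters: without it, "." does not match a newline, so a tag line followed by further lines of ad text is left in place.

```perl
use strict;
use warnings;

# Hypothetical sample: two records, fields separated by ctrl-A,
# each record terminated by ctrl-B. The ad text spans two lines.
my $file = "From: a\cASubject: hi\cABody one\n------ Yahoo! Groups Sponsor\nad text\cB"
         . "From: b\cASubject: yo\cABody two\cB";

# With /s, "." also matches "\n", so the lazy .*? can run up to the
# position just before the next ctrl-B (kept by the lookahead).
(my $with_s = $file) =~ s/------ Yahoo.*?(?=\cB)//sg;

# Without /s, "." stops at the first newline and never reaches a ctrl-B,
# so the match fails and nothing is removed.
(my $without_s = $file) =~ s/------ Yahoo.*?(?=\cB)//g;

print "with /s:    ", ($with_s    eq $file ? "unchanged" : "tag removed"), "\n";
print "without /s: ", ($without_s eq $file ? "unchanged" : "tag removed"), "\n";
```

Because the lookahead does not consume the ctrl-B, the record delimiters survive and the result can still be split into records afterwards.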

      .*? usually comes with a speed penalty (unless the optimizer eliminates running .*? at all), as Perl needs to do bookkeeping for possible backtracking.

      I'd write it as:

      s/------ Yahoo[^\cb]*+//g;      # Keeps the ^B
      s/------ Yahoo[^\cb]*+\cB//;    # Removes the ^B as well.
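
A small sketch of the difference between the two forms, using invented sample records. Possessive quantifiers such as `*+` need Perl 5.10 or later. The second substitution is given a `/g` here (an addition, not in the post) so that every record is processed; note that removing the ctrl-B also merges the records together.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (*+) need Perl 5.10 or later

# Hypothetical sample: two ctrl-B terminated records, each with a tag line.
my $records = "Body one\n------ Yahoo ad\cBBody two\n------ Yahoo ad\cB";

# Negated character class: consume everything that is not ctrl-B.
(my $keep = $records) =~ s/------ Yahoo[^\cB]*+//g;       # ctrl-B stays
(my $drop = $records) =~ s/------ Yahoo[^\cB]*+\cB//g;    # ctrl-B goes too

# Make control characters visible for printing.
sub show {
    my ($s) = @_;
    $s =~ s/\cB/^B/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

print "keep: ", show($keep), "\n";
print "drop: ", show($drop), "\n";
```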

        I don't know from where you got your information, but it appears to be incorrect.

                         Rate  JavaFan  JavaFan_noplus  ikegami
        JavaFan         104/s       --             -5%     -16%
        JavaFan_noplus  109/s       5%              --     -12%
        ikegami         123/s      19%             13%       --

                         Rate  JavaFan  JavaFan_noplus  ikegami
        JavaFan         109/s       --             -2%     -11%
        JavaFan_noplus  110/s       2%              --     -10%
        ikegami         122/s      13%             11%       --

                         Rate  JavaFan  JavaFan_noplus  ikegami
        JavaFan         103/s       --             -5%     -21%
        JavaFan_noplus  109/s       5%              --     -17%
        ikegami         131/s      27%             20%       --
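
The benchmark code behind these figures is not shown in the thread. A comparison along the following lines, using the core Benchmark module, prints tables of this shape. Everything here is an assumption: the test data is invented, and "JavaFan_noplus" is guessed to be the character-class regex without the possessive "+". Timings will differ by machine, Perl version, and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented test data: 2000 records, each with a two-line tag and a ctrl-B.
my $data = '';
$data .= "Message $_\nsome body text\n------ Yahoo! Groups ad\nmore ad text\cB"
    for 1 .. 2000;

# Each candidate works on a fresh copy so every run sees the same input.
my %candidates = (
    ikegami        => sub { (my $s = $data) =~ s/------ Yahoo.*?(?=\cB)//sg; $s },
    JavaFan        => sub { (my $s = $data) =~ s/------ Yahoo[^\cB]*+//g;    $s },
    JavaFan_noplus => sub { (my $s = $data) =~ s/------ Yahoo[^\cB]*//g;     $s },
);

# Sanity check: timings only mean something if all variants agree.
my %result;
$result{$_} = $candidates{$_}->() for keys %candidates;
for my $name (keys %result) {
    die "$name disagrees with ikegami\n"
        unless $result{$name} eq $result{ikegami};
}

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese(-1, \%candidates);
```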
Re: How do I delete from a delimiter to the end of a file?
by kyle (Abbot) on Aug 25, 2008 at 02:46 UTC

    I actually like ikegami's solution better, but this is what I thought of first:

    perl -pi -e 'BEGIN{$/="\cB"} s{-{6}\sYahoo.*\z}{$/}ms' list of files

      I suspect yours will work better than ikegami's in a large number of situations. Reading 20k messages one-at-a-time is likely a better idea than requiring the entire archive of 20k messages to be read into memory at once.

      Yours is even a complete example, not just a single regex that leaves the process of replacing files and slurping as an exercise.

      - tye
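
For reference, the record-at-a-time approach from the one-liner can be sketched as a standalone script. This is only an illustration: the sub name `strip_yahoo_tags` and the sample data are invented, and in-memory filehandles stand in for real files. Unlike the one-liner, it uses `chomp` so that a ctrl-B is written back only when the record actually had one.

```perl
use strict;
use warnings;

# Read ctrl-B terminated records one at a time and strip the tag line
# from each, so only one message is held in memory at once.
sub strip_yahoo_tags {
    my ($in, $out) = @_;
    local $/ = "\cB";    # input record separator: ctrl-B instead of newline
    while (my $record = <$in>) {
        # chomp removes a trailing $/ and returns how many characters it removed.
        my $had_separator = chomp $record;
        # Delete from the tag line to the end of the record (/s lets "." cross newlines).
        $record =~ s/-{6}\sYahoo.*\z//s;
        print {$out} $record, ($had_separator ? $/ : '');
    }
    return;
}

# Hypothetical sample data; real use would open the archive file instead.
my $archive = "Body one\n------ Yahoo ad\nline\cBBody two\cBBody three\n------ Yahoo x\cB";
my $cleaned = '';

open my $in,  '<', \$archive or die "cannot open input: $!";
open my $out, '>', \$cleaned or die "cannot open output: $!";
strip_yahoo_tags($in, $out);
close $out or die "cannot close output: $!";
close $in;

printf "%d bytes in, %d bytes out\n", length $archive, length $cleaned;
```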