RupertSw has asked for the wisdom of the Perl Monks concerning the following question:

I've used Mail::Mbox::MessageParser to get some emails, one at a time, in a reference to a string.

What I want to do is to eat a load of unrelated information in the (automated) emails until I get to a line that is just 70 underscores, after which the useful info starts. I think I can parse the rest quite happily with assorted regular expressions, but getting rid of all the stuff before looks difficult.

Is there a good/efficient way to do this? I've looked all over the site and CPAN, but can't find anyone reading through huge strings or files.

Any help greatly appreciated.

Re: Chomping most of a _long_ text string
by holli (Abbot) on May 29, 2005 at 15:34 UTC
    Why not just
    s/^.+_{70}//s

    ?


    holli, /regexed monk/

      That'll be fine on a long bit of text? I got the impression you were supposed to avoid REs on > ~1kb bits of text.

      In which case, thanks a lot, and I'll shut up and get on with it!

      Rupert
        That should be s/^.+?_{70}//s.
        No, regexes scale well with long strings.


        holli, /regexed monk/
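To illustrate why the `?` matters, here is a minimal sketch. The sample text with two marker lines is made up for the demonstration and is not from the original emails.

```perl
use strict;
use warnings;

# Hypothetical sample: two marker lines, to show where each pattern stops.
my $marker = '_' x 70;
my $text   = "junk\n$marker\nfirst part\n$marker\nsecond part\n";

# Greedy: .+ grabs as much as it can, so the match ends at the LAST marker.
(my $greedy = $text) =~ s/^.+_{70}//s;

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST marker.
(my $lazy = $text) =~ s/^.+?_{70}//s;

print "greedy keeps: $greedy";
print "lazy keeps:   $lazy";
```

If the useful part of a message could itself contain a 70-underscore line, the greedy form would silently throw away everything before the last one.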

      My goodness! Far more replies than I expected!

      Regexes are working wonderfully for me, and I'm pleased to say I got the anti-greediness question mark trick myself (but went to work as a waiter, hence not thanking all the replies earlier)

      What I was actually parsing was a mailbox full of past issues of @Risk, a security list. My code now has a loop doing something like this:

      $$email =~ s/^.+?_{70}\n\n([[:digit:]])/$1/s;
      while ($$email =~ /(\d{2}\.\d+\.\d) CVE: ([^\n]+)\nPlatform: ([^\n]+)\n +Title: ([^\n]+)\nDescription: (.+?)\nRef: (http[^\n]*)/gs) {
          print "CVE: $2\nPlat: $3\nTitle: $4\nDesc: $5\nURL: $6\n\n";
      }

      And, yes, I have yet to tidy up the RE, but it works and is more than fast enough for what I need. Many thanks again

      Rupert
Re: Chomping most of a _long_ text string
by TedPride (Priest) on May 29, 2005 at 17:51 UTC
    index works just fine here, no need to use regex.
    use strict;
    use warnings;

    my $u = 70;    # Number of underscores
    my $t = join '', <DATA>;
    substr($t, 0, index($t, "\n" . '_' x $u . "\n") + $u + 2) = '';
    print $t;

    __DATA__
    Junk data goes here
    ______________________________________________________________________
    Useful data goes here

      I pondered whether I preferred index() or $+[0], and I concluded $+[0]. The index() expression becomes so cluttered, and it has a problem if the substring isn't found: index() then returns -1, so you destroy the beginning of the string anyway, removing the first $u+1 chars. I figured that most likely, if the marker isn't there, it's already removed. If not, you still need to perform some check, and for me the regex version is nicer.

      I'd like to flip the coin and say "matching works fine here, no need to use index()", but if one likes index() one should use index(). :-)

      ihb

      See perltoc if you don't know which perldoc to read!

Re: Chomping most of a _long_ text string
by ihb (Deacon) on May 29, 2005 at 16:40 UTC

    $str =~ /_{70}/                        # Find the mark.
        and substr($str, 0, $+[0], '');    # Replace everything up to right after
                                           # the match with the empty string.

    ihb

    See perltoc if you don't know which perldoc to read!

      "See perltoc if you don't know which perldoc to read!"
      I would actually recommend starting with perldoc perl, not perltoc.

        Both documents are great, but I get the feeling perltoc is a far less known document, so I prefer to spread that instead. perltoc is right to the target if you're looking for documentation, just like perlfunc is when looking for functions.

        (Reading "See perldoc perl" might also feel like the ultimate RTFM slap.)

        ihb

        See perltoc if you don't know which perldoc to read!

Re: Chomping most of a _long_ text string
by davidrw (Prior) on May 29, 2005 at 18:52 UTC
    I'm not sure if this is any better performance-wise than the regex suggestions, but you could use split:
    (undef, $goodpart) = split(/_{70}/, $msg);
    This works better if $msg is a single email at a time, though you could split a whole inbox as well...

      You'd better put a limit on that split as well, so that you don't chop up any e-mails that have 70 underscores in them:

      (undef, $goodpart) = split /_{70}/, $msg, 2;
      Note that since this supposedly is a huge string you create a huge copy while doing this, even if you assign it back to $msg, afaik.

      ihb

      See perltoc if you don't know which perldoc to read!
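A small sketch of what the limit changes. The sample message, with a second marker line inside the useful part, is a made-up assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample message with a second marker inside the useful part.
my $marker = '_' x 70;
my $msg    = "junk\n$marker\nbody line\n$marker\nfooter\n";

# Without a limit, the text is cut at every marker and only the
# second field survives the list assignment.
my (undef, $unlimited) = split /_{70}/, $msg;

# With a limit of 2, the text is cut at the first marker only and
# everything after it stays in one piece.
my (undef, $limited) = split /_{70}/, $msg, 2;

print "no limit: $unlimited";
print "limit 2:  $limited";
```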

Re: Chomping most of a _long_ text string
by TedPride (Priest) on May 29, 2005 at 21:20 UTC
    He said that the emails are automated, so I'm assuming that every email has the marker in it. It would also be easy to modify the code so that the substr is only done if index > -1.

    You're probably right though that the savings between regex and substr are minimal enough so that the neater code can be used rather than the most efficient.
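For what it's worth, the index > -1 check could look something like the sketch below. strip_preamble is a hypothetical helper name, not something from the thread or from CPAN. It takes a reference to the string, since that is what the original poster says he has, and it edits the string in place with four-argument substr rather than returning a copy.

```perl
use strict;
use warnings;

# strip_preamble is a hypothetical helper, named here only for illustration.
# It removes everything up to and including the marker line, in place.
# Returns 1 if the marker was found, 0 if the string was left untouched.
sub strip_preamble {
    my ($ref, $count) = @_;
    my $marker = "\n" . ('_' x $count) . "\n";
    my $pos    = index($$ref, $marker);
    return 0 if $pos < 0;    # marker absent: do not touch the string
    substr($$ref, 0, $pos + length($marker), '');
    return 1;
}

my $email = "Junk data goes here\n" . ('_' x 70) . "\nUseful data goes here\n";
strip_preamble(\$email, 70) or warn "marker not found\n";
print $email;
```

Using length($marker) instead of $u + 2 keeps the offset correct if the marker is ever changed.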