Re: Chomping most of a _long_ text string
by holli (Abbot) on May 29, 2005 at 15:34 UTC
|
s/^.+_{70}//s
?
| [reply] [d/l] |
|
|
That'll be fine on a long bit of text? I got the impression you were supposed to avoid REs on bits of text larger than about 1 KB.
In which case, thanks a lot, and I'll shut up and get on with it!
Rupert
| [reply] |
|
|
That should be s/^.+?_{70}//s.
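A minimal sketch of why the question mark matters, assuming a made-up message that contains the separator twice (the sample text and variable names are illustrative, not from the original mailbox):

```perl
use strict;
use warnings;

# Hypothetical sample: two separator lines of 70 underscores each.
my $sep  = '_' x 70;
my $text = "header\n$sep\nfirst\n$sep\nsecond\n";

# Greedy .+ grabs everything, then backtracks to the LAST separator.
(my $greedy = $text) =~ s/^.+_{70}//s;

# Non-greedy .+? stops at the FIRST separator.
(my $lazy = $text) =~ s/^.+?_{70}//s;

print "greedy leaves: [$greedy]\n";
print "lazy leaves:   [$lazy]\n";
```

With only one separator in the string both forms remove the same text, but the greedy one has to scan to the end and backtrack first.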
| [reply] [d/l] |
|
|
No, regexes scale well with long strings.
| [reply] [d/l] |
|
|
My goodness! Far more replies than I expected!
Regexes are working wonderfully for me, and I'm pleased to say I worked out the anti-greediness question mark trick myself (but then went to work as a waiter, hence not thanking everyone for the replies earlier).
What I was actually parsing was a mailbox full of past issues of @Risk, a security list. My code now has a loop doing something like this:
$$email =~ s/^.+?_{70}\n\n([[:digit:]])/$1/s;
while($$email =~ /(\d{2}\.\d+\.\d) CVE: ([^\n]+)\nPlatform: ([^\n]+)\nTitle: ([^\n]+)\nDescription: (.+?)\nRef: (http[^\n]*)/gs)
{
print "CVE: $2\nPlat: $3\nTitle: $4\nDesc: $5\nURL: $6\n\n";
}
And, yes, I have yet to tidy up the RE, but it works and is more than fast enough for what I need. Many thanks again.
Rupert
| [reply] [d/l] |
Re: Chomping most of a _long_ text string
by TedPride (Priest) on May 29, 2005 at 17:51 UTC
|
index works just fine here, no need to use regex.
use strict;
use warnings;
my $u = 70; # Number of underscores
my $t = join '',<DATA>;
substr($t,0,index($t,"\n".'_'x$u."\n")+$u+2) = '';
print $t;
__DATA__
Junk data
goes here
______________________________________________________________________
Useful data
goes here
| [reply] [d/l] |
|
|
I pondered whether I preferred index() or $+[0], and I concluded $+[0]. The index() expression becomes so cluttered, and it has a problem if the substring isn't found: index() returns -1, so you destroy the beginning of the string anyway, removing $u+1 chars from it. I figured that, most likely, if the marker isn't there, it has already been removed. If not, you still need to perform some check, and for me the regex version is nicer.
I'd like to flip the coin and say "matching works fine here, no need to use index()", but if one likes index() one should use index(). :-)
ihb
See perltoc if you don't know which perldoc to read!
| [reply] [d/l] [select] |
Re: Chomping most of a _long_ text string
by ihb (Deacon) on May 29, 2005 at 16:40 UTC
|
# Find the mark. If it is there, replace everything up to
# right after the match with the empty string.
$str =~ /_{70}/
    and substr($str, 0, $+[0], '');
ihb
See perltoc if you don't know which perldoc to read!
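For anyone unfamiliar with @+ and @-: a small sketch, assuming a made-up $str, showing that $-[0] and $+[0] hold the start and end offsets of the last successful match, which is what makes the four-argument substr above work.

```perl
use strict;
use warnings;

# Hypothetical sample string: junk, a 70-underscore mark, then the useful part.
my $str = "junk\n" . ('_' x 70) . "\nuseful\n";

if ($str =~ /_{70}/) {
    # $-[0] is the offset where the whole match starts, $+[0] where it ends.
    printf "match spans offsets %d to %d\n", $-[0], $+[0];

    # Four-argument substr replaces that leading span in place, without copying $str.
    substr($str, 0, $+[0], '');
}

print $str;
```

If the pattern does not match, the block is skipped and $str is left untouched, which is the built-in check mentioned in the index() discussion.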
| [reply] [d/l] |
|
|
See perltoc if you don't know which perldoc to read!
I would actually recommend starting with perldoc perl, not perltoc.
| [reply] |
|
|
Both documents are great, but I get the feeling perltoc is far less well known, so I prefer to spread that instead. perltoc takes you straight to the target when you're looking for documentation, just like perlfunc does when you're looking for functions.
(Reading "See perldoc perl" might also feel like the ultimate RTFM slap.)
ihb
| [reply] |
Re: Chomping most of a _long_ text string
by davidrw (Prior) on May 29, 2005 at 18:52 UTC
|
I'm not sure if this is any better performance-wise than the regex suggestions, but you could use split:
(undef, $goodpart) = split(/_{70}/, $msg);
This works better if $msg is a single email at a time, though you could split a whole inbox as well...
| [reply] [d/l] [select] |
|
|
(undef, $goodpart) = split /_{70}/, $msg, 2;
Note that since this supposedly is a huge string, you create a huge copy while doing this, even if you assign it back to $msg, AFAIK.
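A rough sketch of what the limit of 2 changes, assuming a made-up $msg in which the separator shows up twice (the sample text and the variable names $no_limit and $with_limit are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample: the 70-underscore separator appears twice.
my $sep = '_' x 70;
my $msg = "junk\n$sep\npart one\n$sep\npart two\n";

# Without a limit, the second field ends at the next separator.
my (undef, $no_limit) = split /_{70}/, $msg;

# With a limit of 2, the second field is everything after the first separator.
my (undef, $with_limit) = split /_{70}/, $msg, 2;

print "no limit:   [$no_limit]\n";
print "with limit: [$with_limit]\n";
```

If the separator only ever occurs once per message, the two calls return the same thing.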
ihb
| [reply] [d/l] [select] |
Re: Chomping most of a _long_ text string
by TedPride (Priest) on May 29, 2005 at 21:20 UTC
|
He said that the emails are automated, so I'm assuming that every email has the marker in it. It would also be easy to modify the code so that the substr is only done if index > -1.
You're probably right, though, that the savings of substr over regex are small enough that the neater code can be used rather than the most efficient.
| [reply] |
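One way the "only if index > -1" guard might look. This is a sketch, not the code from the post above; strip_header and $marker are names made up for the example.

```perl
use strict;
use warnings;

my $u      = 70;                          # number of underscores
my $marker = "\n" . ('_' x $u) . "\n";    # marker line with its surrounding newlines

# Hypothetical helper: strips everything through the marker line,
# or returns the text unchanged when the marker is missing.
sub strip_header {
    my ($t) = @_;
    my $pos = index($t, $marker);
    substr($t, 0, $pos + length($marker)) = '' if $pos > -1;
    return $t;
}

print strip_header("Junk data\ngoes here${marker}Useful data\n");
print strip_header("No marker here\n");
```

Without the guard, a missing marker makes index() return -1, and the substr would still eat the first $u+1 characters of the text.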