Re: Get chars between 2 markers using regular expressions
by McDarren (Abbot) on Dec 06, 2005 at 12:16 UTC
Update: Fixed a couple of problems as pointed out by sauoq and Random_Walk. Thanks guys :)
perlre is your friend here :)
For your first marker, you probably want to use a character class. So you might have something like: He[012].
For the middle part of your expression, it depends what you expect it to contain. If you're confident that it will only be alphanumeric characters and whitespace, then you could use ([\w\s]+)
\w denotes a "word" character (alphanumeric or underscore), \s denotes any whitespace character. These are combined into a character class with the square brackets "[]", and the + quantifier means "match one or more". The whole lot is wrapped in parentheses because you want to "capture" the string.
The end part of the expression is easy, as you said it will always end with "~~"
So, putting it all together you get (untested):
m/He[012]([\w\s]+)~~/
The captured string will be available in the $1 variable afterwards.
One point to note: If you have several such strings in a single line of data, then only the first match will be returned. You could capture all matches into a list by using the 'g' modifier, like so:
my @strings = m/He[012]([\w\s]+)~~/g;
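To make the difference between the two forms concrete, here is a minimal sketch. The sample line and the variable names $line and $first are assumptions made up for illustration, not data from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for this illustration.
my $line = 'He0Hello world~~He2Second one~~';

# Without /g, only the first match is captured in $1.
my $first;
if ($line =~ m/He[012]([\w\s]+)~~/) {
    $first = $1;
}

# With /g in list context, every capture is returned.
my @strings = $line =~ m/He[012]([\w\s]+)~~/g;

print "first: $first\n";
print "all:   ", join(' | ', @strings), "\n";
```

Note that the match is bound to $line with =~ here; the one-liner above matches against $_ instead.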
Hope this helps,
Darren :)
Very good answer overall, but there are a couple nits.
Firstly, you keep using a slash where you need a backslash; it isn't /w and /s but \w and \s.
Secondly, you made the statement:
If you have several such strings in a single line of data, then only the last match will be returned.
That's incorrect in that it will be the first match returned, not the last one.
-sauoq
"My two cents aren't worth a dime.";
Re: Get chars between 2 markers using regular expressions
by tirwhan (Abbot) on Dec 06, 2005 at 12:08 UTC
$string="He0Hello~~He2World~~";
while ($string=~m/He\d(\w+)~~/g) {
print "$1\n";
}
Update: actually, that's not quite correct; this will only work if your content consists solely of alphanumeric or underscore characters. A more precise regex would be:
$string=~m/He\d([^~]+)~~/g
Which will capture all data and break on a single tilde sign. If you need to capture single tildes in your data as well you need
$string=~m/He\d(.+?)~~/g
which is more accurate but slower (because it needs to look ahead).
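A small side-by-side sketch may help show how the three patterns differ. The test string is an assumption chosen so that one field contains a single tilde and another contains a space; the array names are likewise made up for this example.

```perl
use strict;
use warnings;

# Hypothetical test string: one field has a single tilde, one has a space.
my $string = 'He0Hello~~He2W~orld~~He1two words~~';

# \w+ stops at anything that is not alphanumeric or underscore.
my @word = $string =~ m/He\d(\w+)~~/g;       # ("Hello")

# [^~]+ allows spaces but cannot cross a single tilde.
my @negated = $string =~ m/He\d([^~]+)~~/g;  # ("Hello", "two words")

# .+? allows single tildes and stops at the first "~~".
my @lazy = $string =~ m/He\d(.+?)~~/g;       # ("Hello", "W~orld", "two words")

print join(' | ', @$_), "\n" for \@word, \@negated, \@lazy;
```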
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
which is more accurate but slower (because it needs to look ahead).
According to this benchmark on my machine
it performs like this:
Rate [^~]+ .+?
[^~]+ 236537/s -- -22%
.+? 303038/s 28% --
which says that the non-greedy match-all is even faster than the inverted character class.
use strict;
use warnings;
use Benchmark qw(:all);
my $string="He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~";
my $f;
sub invertedCharclass { while($string=~m/He\d([^~]+)~~/g){$f=$1} }
sub nonGreedy { while($string=~m/He\d(.+?)~~/g){$f=$1} }
cmpthese (-10,
{
'[^~]+' => \&invertedCharclass,
'.+?' => \&nonGreedy,
}
);
gives this on my machine:
Rate .+? [^~]+
.+? 105088/s -- -28%
[^~]+ 146676/s 40% --
Your help has been terrific, thank you! I understand this better now. The next thing I want to do, though, kind of extends this...
He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~
Instead of getting what is between HeX and ~~ I need to get what is between Te and ~~, between the two HeX.
So in this case, I'd have an array 1,2,3,4, an array 5,6,7,8 and an array 9,10,11,12.
/Te(.+?)~~/ gets what is between Te and ~~ (not always digits)
What I cannot do is get what is after the ~~ of each HeX, up to the next one (if there is one). /~~(.+?)He/ matches up to the next He, but at the end of the string there is no He, so it misses the last part.
Can anyone show me how to do that? Thank you for your help. Stephen.
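One possible approach to the grouping described above, offered only as a sketch: split the string in front of each HeX header, then collect the Te items inside each piece. It assumes that titles never contain "Te", that items never contain "He" followed by a digit, and that an item may end either at "~~" or at the end of its section (as Te8 does in the sample). The names $data, @groups and $section are made up for this example.

```perl
use strict;
use warnings;

my $data = 'He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~';

my @groups;
# The lookahead splits *before* each "He<digit>" without consuming it.
for my $section (split /(?=He\d)/, $data) {
    # Capture what follows each "Te", up to "~~" or the end of the section.
    my @items = $section =~ /Te(.+?)(?:~~|\z)/g;
    push @groups, \@items;
}

# One line of output per header section.
print join(',', @$_), "\n" for @groups;
```

Because the last section simply runs to the end of the string, no trailing He is needed to terminate it.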
Re: Get chars between 2 markers using regular expressions
by inman (Curate) on Dec 06, 2005 at 12:18 UTC
my @matches = /
He\d #Match 'He' followed by a single digit.
(.+?) #Capture characters non-greedily
~~ #until the end marker is reached
/gx; #The g modifier matches multiple instances
#into the @matches list
Re: Get chars between 2 markers using regular expressions
by tphyahoo (Vicar) on Dec 06, 2005 at 15:55 UTC
My gut feeling is that you need more test data. For instance, what about
He0HelloHe2World~~. Do you want to match the string starting with Hello, or the string starting with World? It's not really clear from your description. This kind of edge case may drive you mad if you don't get concrete about the desired behavior up front.
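To see what is at stake, here is a sketch of how the patterns suggested elsewhere in this thread treat that string, next to one possible alternative. The second pattern, which refuses to cross another HeX marker, is an assumption about what might be wanted and was not proposed by anyone above; the names $outer and $inner are invented for this example.

```perl
use strict;
use warnings;

my $string = 'He0HelloHe2World~~';

# The thread's pattern starts at the first marker and runs to "~~".
my ($outer) = $string =~ m/He\d(.+?)~~/;                  # "HelloHe2World"

# Alternative (an assumption): never let the capture cross another He<digit>.
my ($inner) = $string =~ m/He\d((?:(?!He\d).)+)~~/;       # "World"

print "outer: $outer\n";
print "inner: $inner\n";
```

Which of the two results is right depends entirely on the answer to the question above.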
Re: Get chars between 2 markers using regular expressions
by thundergnat (Deacon) on Dec 06, 2005 at 17:20 UTC
If I were faced with trying to do something like this, I would probably modify the input record separator to automatically break the text into chunks and then parse it from there.
{
local $/ = '~~';
while (my $record = <DATA>){
chomp $record;
$record =~ s/\n//g;
# then do what you want with the records
print +($record =~ /^T/) ? ' ' : '', "$record\n";
}
}
__DATA__
He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hell
+o~~He2W~orld~~
He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~
+~Te9~~Te10~~Te11~~Te12~~
He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hell
+o~~He2W~orld~~
He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~
+~Te9~~Te10~~Te11~~Te12~~
He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hell
+o~~He2W~orld~~
Re: Get chars between 2 markers using regular expressions
by blazar (Canon) on Dec 06, 2005 at 13:15 UTC
my @wanted=/He[0-2](.*?)~~/g;