Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello PerlMonks,

I am new to Perl and regular expressions, and I was hoping someone could offer some help on how to get a string from between two markers.

The first marker is He0, He1 or He2, and the end marker is always ~~.

For example, from He1Hello World~~ I would get 'Hello World',
and from He0Hello~~He2World~~ I would get 'Hello' and 'World'.

I guess I need to use m/\W/ but I don't know how to specify between the two markers.

Thanks for your help. Stephen.

Replies are listed 'Best First'.
Re: Get chars between 2 markers using regular expressions
by McDarren (Abbot) on Dec 06, 2005 at 12:16 UTC
    Update: Fixed a couple of problems as pointed out by sauoq and Random_Walk. Thanks guys :)

    perlre is your friend here :)

    For your first marker, you probably want to use a character class. So you might have something like: He[012].

    For the middle part of your expression, it depends what you expect it to contain. If you're confident that it will only be alphanumeric characters and whitespace, then you could use ([\w\s]+)

    \w denotes a word character (alphanumeric or underscore), and \s denotes any whitespace character. These are wrapped in a character class by using the square brackets "[]", and the + quantifier is used, meaning "match one or more". The whole lot is wrapped in parentheses because you want to "capture" the string.

    The end part of the expression is easy, as you said it will always end with "~~"

    So, putting it all together you get (untested):

    m/He[012]([\w\s]+)~~/

    The captured string will be available in the $1 variable afterwards.

    One point to note: If you have several such strings in a single line of data, then only the first match will be returned. You could capture all matches into a list by using the 'g' modifier, like so:

    my @strings = m/He[012]([\w\s]+)~~/g;
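
    As a minimal sketch of the difference between the two forms, assuming the data sits in a variable (here called $line, a name chosen only for illustration) rather than in $_:

    ```perl
    use strict;
    use warnings;

    # Sample input taken from the original question.
    my $line = 'He0Hello~~He2World~~';

    # Without /g: only the first match is captured, into $1.
    my $first;
    if ($line =~ m/He[012]([\w\s]+)~~/) {
        $first = $1;
    }

    # List context with /g: every captured string is returned.
    my @strings = $line =~ m/He[012]([\w\s]+)~~/g;

    print "first: $first\n";      # first: Hello
    print "all:   @strings\n";    # all:   Hello World
    ```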

    Hope this helps,
    Darren :)

      Very good answer overall, but there are a couple nits.

      Firstly, you keep using a slash where you need a backslash; it isn't /w and /s but \w and \s.

      Secondly, you made the statement:

      If you have several such strings in a single line of data, then only the last match will be returned.
      That's incorrect in that it will be the first match returned, not the last one.

      -sauoq
      "My two cents aren't worth a dime.";
      

      McDarren, try tipping your / slashes to \ slashes in the character class [\w\s]

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
Re: Get chars between 2 markers using regular expressions
by tirwhan (Abbot) on Dec 06, 2005 at 12:08 UTC
    $string="He0Hello~~He2World~~";
    while ($string=~m/He\d(\w+)~~/g) {
        print "$1\n";
    }

    Update: actually, that's not quite correct, this will only work if your content consists of only alphanumeric or underscore characters. A more precise regex would be:

    $string=~m/He\d([^~]+)~~/g

    Which will capture all data and break on a single tilde sign. If you need to capture single tildes in your data as well you need

    $string=~m/He\d(.+?)~~/g
    which is more accurate but slower (because it needs to look ahead).
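
    To see how the three patterns differ, here is a small sketch run against a string whose second field contains a single tilde. The array names are made up for the example:

    ```perl
    use strict;
    use warnings;

    # The second field, "W~orld", contains a lone tilde.
    my $string = 'He0Hello~~He2W~orld~~';

    my @word_chars = $string =~ m/He\d(\w+)~~/g;     # word characters only
    my @no_tilde   = $string =~ m/He\d([^~]+)~~/g;   # anything except a tilde
    my @non_greedy = $string =~ m/He\d(.+?)~~/g;     # anything, up to the first ~~

    printf "%-6s => %s\n", '\w+',   "@word_chars";   # \w+    => Hello
    printf "%-6s => %s\n", '[^~]+', "@no_tilde";     # [^~]+  => Hello
    printf "%-6s => %s\n", '.+?',   "@non_greedy";   # .+?    => Hello W~orld
    ```

    Only the non-greedy version picks up the field with the embedded tilde; the other two skip it because the single ~ is not followed by a second one.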

    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
      which is more accurate but slower (because it needs to look ahead).
      According to this benchmark on my machine it performs like this:

                Rate [^~]+   .+?
      [^~]+ 236537/s    --  -22%
      .+?   303038/s   28%    --

      which says that the non-greedy match-all is even faster than the inverted character class.

        You're only matching the regex once, not collecting all instances of the match. The difference gets more pronounced the longer the string becomes:

        use strict;
        use warnings;
        use Benchmark qw(:all);

        my $string="He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~";
        my $f;

        sub invertedCharclass {
            while($string=~m/He\d([^~]+)~~/g){$f=$1}
        }

        sub nonGreedy {
            while($string=~m/He\d(.+?)~~/g){$f=$1}
        }

        cmpthese (-10, {
            '[^~]+' => \&invertedCharclass,
            '.+?'   => \&nonGreedy,
        } );
        gives this on my machine:
                  Rate   .+? [^~]+
        .+?   105088/s    --  -28%
        [^~]+ 146676/s   40%    --

      Your help has been terrific, thank you! I understand this better now. The next thing I want to do, though, kind of extends this...

      He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~

      Instead of getting what is between HeX and ~~ I need to get what is between Te and ~~, between the two HeX.

      So in this case, I'd have an array 1,2,3,4, an array 5,6,7,8 and an array 9,10,11,12.

      /Te(.+?)~~/ gets what is between Te and ~~ (not always digits)

      What I cannot do is get what is after the ~~ of each HeX, up to the next one (if there is one). /~~(.+?)He/ matches up to the next He, but at the end of the string there is no He, so it misses the last part.

      Can anyone show me how to do that? Thank you for your help. Stephen.

        You'll want to make a couple of passes. The first one should split on He\d. That will give you a separate string to turn into each array. For each of those strings, you can use your Te regex to extract what you're looking for.
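
        A rough sketch of that two-pass idea, using the sample string from the question. Two things here are assumptions rather than anything stated above: the variable names are invented, and the Te pattern also accepts the end of a chunk as a terminator, on the guess that the missing ~~ after Te8 in the sample is a typo:

        ```perl
        use strict;
        use warnings;

        my $string = 'He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~';

        my @groups;   # one array reference per HeX section

        # Pass 1: split on the HeX markers, one chunk per section.
        for my $chunk (split /He\d/, $string) {
            next unless length $chunk;   # skip the empty field before the first marker

            # Pass 2: pull out each Te item; an item ends at ~~ or
            # (assumption) at the end of the chunk.
            my @items = $chunk =~ /Te(.+?)(?:~~|$)/g;
            push @groups, \@items;
        }

        print join(',', @$_), "\n" for @groups;
        # 1,2,3,4
        # 5,6,7,8
        # 9,10,11,12
        ```

        If the trailing ~~ is always present in the real data, the plain /Te(.+?)~~/g from the question should work unchanged in the second pass.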

        Caution: Contents may have been coded under pressure.
Re: Get chars between 2 markers using regular expressions
by inman (Curate) on Dec 06, 2005 at 12:18 UTC
    You need to do a couple of things here. Match and capture. You have already started the match part with the m// construct (although the m is not required). You need to match the start and end and capture everything in between. You capture a value using parentheses around a pattern.

    The following code is commented by virtue of the x modifier. Take a look at perlre for details.

    my @matches = /
        He\d     # Match 'He' followed by a single digit.
        (.+?)    # Capture characters non-greedily
        ~~       # until the end marker is reached
    /gx;         # The g modifier matches multiple instances
                 # into the @matches list
Re: Get chars between 2 markers using regular expressions
by tphyahoo (Vicar) on Dec 06, 2005 at 15:55 UTC
    My gut feeling is that you need more test data. For instance, what about He0HelloHe2World~~? Do you want to match the string starting with Hello, or the string starting with World? It's not really clear from your description. This kind of edge case may drive you mad if you don't get concrete about the desired behavior up front.
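
    To make the ambiguity concrete, here is a sketch of the two readings. The second pattern, which refuses to cross another start marker, is just one possible way to get the "World" behavior, not something proposed elsewhere in this thread; the variable names are invented:

    ```perl
    use strict;
    use warnings;

    my $string = 'He0HelloHe2World~~';

    # Reading 1: the leftmost start marker wins, so the capture runs
    # from the first He\d all the way to ~~.
    my ($from_first) = $string =~ m/He\d(.+?)~~/;

    # Reading 2: the capture may not contain another He\d, so the
    # start marker nearest to ~~ wins.
    my ($from_nearest) = $string =~ m/He\d((?:(?!He\d).)+?)~~/;

    print "leftmost marker: $from_first\n";     # leftmost marker: HelloHe2World
    print "nearest marker:  $from_nearest\n";   # nearest marker:  World
    ```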
Re: Get chars between 2 markers using regular expressions
by thundergnat (Deacon) on Dec 06, 2005 at 17:20 UTC

    If I was faced with trying to do something like this, I would probably modify the input record separator to automatically break the text into chunks and then parse it from there.

    {
        local $/ = '~~';
        while (my $record = <DATA>){
            chomp $record;
            $record =~ s/\n//g;
            # then do what you want with the records
            print +($record =~ /^T/) ? ' ' : '', "$record\n";
        }
    }
    __DATA__
    He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~
    He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~
    He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~
    He0Title1~~Te1~~Te2~~Te3~~Te4~~He1Title1~~Te5~~Te6~~Te7~~Te8He1Title2~~Te9~~Te10~~Te11~~Te12~~
    He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~He0Hello~~He2W~orld~~
Re: Get chars between 2 markers using regular expressions
by blazar (Canon) on Dec 06, 2005 at 13:15 UTC