http://qs1969.pair.com?node_id=179125

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to match a data structure with a regex. I'm new to regex, but have made a start. The structure consists of 5-character blocks, separated by spaces. The first is always 'UFOFH', followed by a number of blocks containing the digits 0-9 and the / character. There are anywhere between 4 and 27 of these blocks. An example with 20 such blocks is below:

UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/

I have tried the below, and some variations (the # is the regex separator):

print $mailbox =~ m#UFOFH ([0-9//]{5} ){4,27}#ig;

But this only returns the penultimate character block before the newline (e.g. '5065/'). I have played around with the brackets, but found that this only returns specific bits of the structure (e.g. one set of numbers). How do I write the regex to return the whole structure, from the UFOFH all the way to the last 5-character block?

Thanks

Replies are listed 'Best First'.
Re: Regex matching
by bronto (Priest) on Jul 03, 2002 at 12:28 UTC

    Fast and simple solution:

    $text = q(UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/) ;
    $text =~ s/^UFOFH // and print split /\s+/,$text ;

    If the line you have in $text begins with "UFOFH ", it wipes it away and splits on spaces.

    Cheers!
    --bronto

    Update: pure regexp solution:

    $text = q(UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/) ;
    # Test the cheap anchored prefix first, then the block pattern
    $text =~ m|^UFOFH | and $text =~ m|([\d/]{5}\s?){4,27}|ig and print "GOOOOHH!\n" ;

    Update II: typo: you don't need the i when you are not matching alphabetics.

    Checking that the beginning of the string matches before running the main pattern should greatly improve matching speed compared to your solution. Your pattern is matched anywhere in the string, and that's expensive. The technique I used first matches a short string at a fixed position (the beginning of the line), and that's really fast! And if it doesn't match, you don't waste your time matching your pattern against a string you are not interested in.
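
    The "cheap anchored test first" idea can be sketched like this. The sample lines and the @hits array are made up for illustration, and whether the speed-up is noticeable will depend on your data:

```perl
use strict;
use warnings;

# Hypothetical sample lines; only the first starts with the UFOFH marker.
my @lines = (
    'UFOFH 33603 01231 /0000 0024/ 1024/',
    'Subject: weather report 33603 01231 /0000 0024/',
);

my @hits;
for my $line (@lines) {
    # Cheap test first: an anchored literal fails quickly on other lines.
    next unless $line =~ /^UFOFH /;
    # Only now run the unanchored block pattern.
    push @hits, $line if $line =~ m#(?:[\d/]{5}\s?){4,27}#;
}
print "$_\n" for @hits;
```

    The second sample line contains four valid-looking blocks, so the block pattern alone would accept it; the anchored test is what rejects it.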

    Anybody guessed I am in a hurry? :-)

    # Another Perl edition of a song:
    # The End, by The Beatles
    END {
      $you->take($love) eq $you->made($love) ;
    }

Re: Regex matching
by mikeirw (Pilgrim) on Jul 03, 2002 at 12:06 UTC

    This should work according to your example:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $string = "UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/";

    if ( $string =~ m#UFOFH ([\d/]{5}\s?){4,27}# ) {
        print "$string\n";
    }

    Update: I updated the regexp to be a little prettier. Need more coffee...

Re: Regex matching
by Ay_Bee (Monk) on Jul 03, 2002 at 13:23 UTC
    I don't know how you wish to use the data, but to split out each character group (except "UFOFH") into an array, try this snippet:

    $_='UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/';
    my @mailbox=( m#([0-9//]{5})#g);
    for my $i (0 .. $#mailbox) {
        print "mailbox[$i] => $mailbox[$i]\n";
    };

    which results in

    mailbox[0] => 33603
    mailbox[1] => 01231
    mailbox[2] => /0000
    mailbox[3] => 0024/
    mailbox[4] => 1024/
    mailbox[5] => 2025/
    mailbox[6] => 3027/
    mailbox[7] => 4030/
    mailbox[8] => 5025/
    mailbox[9] => 6028/
    mailbox[10] => 7060/
    mailbox[11] => 8081/
    mailbox[12] => 9098/
    mailbox[13] => 0110/
    mailbox[14] => 1107/
    mailbox[15] => 2106/
    mailbox[16] => 3102/
    mailbox[17] => 4080/
    mailbox[18] => 5065/
    mailbox[19] => 6057/

    Ay_Bee
    -_-_-_-_-_-_-_-_-_-_-_- My memory concerns me - but I forget why !!!
Re: Regex matching
by Aristotle (Chancellor) on Jul 03, 2002 at 13:24 UTC
    But this only returns the penultimate character block before the newline

    The last block is followed by a newline, not a blank, while in your regex there is a space trailing the block, so it does not match. Next problem: it only returns a single block because you match the entire string at once but always capture into the same parens. So after one single iteration you have "eaten" the entire string, but you only get the last paren content.
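
    A small sketch of that capture behaviour, using a shortened sample line (the variable names are only for illustration). The space is moved in front of each block so the last block is matched too:

```perl
use strict;
use warnings;

my $line = "UFOFH 33603 01231 /0000 0024/ 1024/\n";

# A quantified group is still one capture group: each repetition
# overwrites $1, so only the last block that matched survives.
my ($last) = $line =~ m#^UFOFH(?: ([\d/]{5})){4,27}#;

# Wrapping the whole repetition in one outer group keeps everything.
my ($whole) = $line =~ m#^(UFOFH(?: [\d/]{5}){4,27})#;

print "last block: $last\n";
print "whole run:  $whole\n";
```

    The first match yields only the final block, while the second returns the full run from UFOFH to the last block.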

    Update: Checking for a UFOFH at the start of the line and then splitting on blanks, as suggested, is probably the easiest and most robust solution, but you can do it with just regexen:

    $_ = q(UFOFH 33603 01231 /0000 0024/ 1024/ 2025/ 3027/ 4030/ 5025/ 6028/ 7060/ 8081/ 9098/ 0110/ 1107/ 2106/ 3102/ 4080/ 5065/ 6057/);
    my @block;
    if(/^UFOFH/gi) { # NB: /g is necessary for the following \G to work
        my @matches = m#\G\s([/\d]{5})#g;
        @block = splice @matches, 0, 27 if @matches >= 4;
    }
    print "@block\n";

    The \G is an anchor kind of like ^, except it does not match at the beginning of the string, but at the point where the previous /g match ended. As you see, it is slightly complicated to do this without split..
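
    To see \G at work on its own, here is a minimal sketch with a made-up line that has junk after the blocks. It uses a while loop instead of a list-context match, which is my choice for illustration, not a requirement:

```perl
use strict;
use warnings;

my $line = 'UFOFH 33603 01231 /0000 0024/ trailing 99999';

my @blocks;
if ($line =~ /^UFOFH/g) {    # scalar /g leaves pos() just after the marker
    # Each \G match must start exactly where the previous one stopped,
    # so collection ends at the first thing that is not " block".
    while ($line =~ m#\G\s([/\d]{5})#g) {
        push @blocks, $1;
    }
}
print "@blocks\n";
```

    Without \G, the pattern would skip over "trailing" and pick up the 99999 at the end as well.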

    Update: I have a ways yet to go, obviously. The above solution bugged me, just slightly but it did. It took a day until I realized how to do it well - or more like, slap my forehead and say "D'oh".. This is the real thing (until further notice..):

    my @matches = m#(?:^UFOFH|\G)\s([/\d]{5})#g;
    my @block;   # declared separately: "my ... if ..." has undefined behaviour
    @block = splice @matches, 0, 27 if @matches >= 4;
    print "@block\n";
    ____________
    Makeshifts last the longest.

      Thanks everyone for all your help. I have now got a working expression! And it is slightly neater than the previous effort...

      The data structures are among email headers, which sometimes contain bits of the data. Unfortunately not everyone sticks to quite the same data format: sometimes those /'s can be numbers, and the 'UFOFH' comes in mixed case.

      Thanks again

Re: Regex matching
by Sifmole (Chaplain) on Jul 03, 2002 at 12:27 UTC
    What sort of return are you expecting? You should also check out the split command, it may be more appropriate here than a regex.

    I "know" why you are not getting all the matches, and only the penultimate character block, but I can't seem to explain it right and don't want to misspeak. Hopefully someone else will properly explain that.

    Your regex can be improved as well:
    You can use \d instead of [0-9]
    You have the / listed twice in your set.
    There is no need for a set because your pattern is number followed by slash followed by space.
    The reason you don't match out to the last item is because the last item does not have the trailing space. Perhaps you want to match space, number, slash instead.
    Is it important that you get between 4 and 27 such blocks? What happens if the line has 29 of them? If it is not important then you don't need the {4,27} and can just use +.
    Why are you using case-insensitive matching? Is the initial UFOFH possibly in mixed or lower case? If not, then the i is just slowing you down.
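
    For comparison, the split route might look like the sketch below. It assumes the 4 to 27 limit and the exact 'UFOFH' spelling do matter; $tag, @blocks and $ok are names invented for this example:

```perl
use strict;
use warnings;

my $line = "UFOFH 33603 01231 /0000 0024/ 1024/\n";

# split ' ' splits on runs of whitespace, so the trailing newline goes too.
my ($tag, @blocks) = split ' ', $line;

my $ok = defined $tag
      && $tag eq 'UFOFH'
      && @blocks >= 4 && @blocks <= 27
      && !grep { !m#^[\d/]{5}\z# } @blocks;   # every block: 5 chars of digits or /

print $ok ? "valid: @blocks\n" : "rejected\n";
```

    If the marker can arrive in mixed case, comparing uc($tag) instead would be one way to allow for it.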

Re: Regex matching
by Anonymous Monk on Jul 04, 2002 at 12:39 UTC
    Err... this might be a stupid question, but if you're looking to match the whole UFOFH expression (UFOFH + all the associated 5-char fields), why not just parse lines matching "^UFOFH"?
Re: Regex Matching
by LAI (Hermit) on Jul 04, 2002 at 14:08 UTC
    If, as it seems, you want to get a match on a line which matches the specified UFOFH ... pattern, and print the whole line (or dump it into a variable), I would do the following (and, of course, TMTOWTDI):
    $mailbox =~ m#UFOFH( [\d/]{5}){4,27}# && print $&;
    This way, you use the $& variable to print out whatever was matched. With this implementation, any additional characters after 4 to 27 5-character blocks will be silently discarded, as will any characters before the UFOFH. You can prevent this (if you want) by rewriting the code as:
    $mailbox =~ m#^UFOFH( [\d/]{5}){4,27}$# && print $&;
    to trap the beginning and end of the string. Don't forget to chomp() it though!
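
    An alternative sketch that avoids $& by capturing the whole match in parentheses. As far as I know, perlvar warns that using $& could slow down every regex in the program on perls before 5.20. The sample data and the $record name are just for illustration:

```perl
use strict;
use warnings;

my $mailbox = "UFOFH 33603 01231 /0000 0024/ 1024/\n";
chomp $mailbox;

# Parentheses around the whole pattern put the matched text in $1.
my $record;
if ($mailbox =~ m#^(UFOFH(?: [\d/]{5}){4,27})\z#) {
    $record = $1;
    print "$record\n";
}
```

    Here \z anchors at the very end of the string, which is why the chomp matters in this version.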

    LAI
    :eof