Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am trying to get a very basic script to print out the matches from two regular expressions at once. Specifically, I am trying to print out all of the numbers (digits) and all of the words in between a "#" and the word "fin" in .txt files which take the following format:

The 2 cats and the dog.
The 8 cats and the 6 dogs.
The 3 pigs and the 2 sheep.
#story fin #cats and dogs fin #sheep fin

So, for example, from the above file, I would expect the output to be:

2 8 6 3 2 story cats and dogs sheep

At the moment, I am using the following script:

open(FILE, 'C:\Users\li\perl\animals.txt');
$/ = " ";
while (<FILE>) {
    if (m/((\d)+?)|((?<=#)(.*?)(?= fin))/g) {
        print "$1\n";
    }
}

However, while this returns the numbers, it does not return the desired words. I believe that my mistake is using the | operator, which I think is telling the script to finish because it has found the first part of the regex and doesn't need to continue for the rest?

A Google search suggested that lookaheads could be used in a way that mirrors an "and" operator: (?=.*word1)(?=.*word2)(?=.*word3) (http://www.ocpsoft.org/tutorials/regular-expressions/and-in-regex/). However, the following regex, built from the lookaheads suggested above, returns no results for me:

open(FILE, 'C:\Users\li\perl\animals.txt');
$/ = " ";
while (<FILE>) {
    if (m/(?=.*((\d)+?))(?=.*((?<=#)(.*?)(?= fin)))/g) {
        print "$1\n";
    }
}

I also read about using smartmatch in "How do I efficiently match many regular expressions at once?". However, when I run the following script, the only thing that appears is a notification that "Smartmatch is experimental at C:\Users\li\perl\animalscript2.pl line 3."

open(FILE, 'C:\Users\li\perl\animals.txt');
my @patterns = (
    qr/((\d)+)/,
    qr/((?<=#)(.*?)(?= fin))/
);
if( $string ~~ @patterns ) {
    print "$1\n";
};

Any help would be greatly appreciated!

Re: Printing out matches for two regular expressions
by choroba (Cardinal) on Oct 22, 2017 at 09:08 UTC
    Why do you set $/ to a space? That makes the script read the input file word by word, so it can never match the second part of the expression: it never sees the #cats together with their fin.
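To make the effect of the record separator concrete, here is a minimal sketch. The helper name read_records and the in-memory filehandle are assumptions for illustration only; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a string through
# an in-memory filehandle using the given input record separator and
# return the records that <$fh> produces.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;    # change $/ only inside this sub
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close($fh);
    return @records;
}

my $sample = "#cats and dogs fin\n";

# With a space as the separator, every word arrives as its own record,
# so no single record contains both "#cats" and "fin".
my @by_space = read_records($sample, " ");
printf "%d records with a space separator\n", scalar @by_space;

# With the usual newline separator, the whole line is one record.
my @by_line = read_records($sample, "\n");
printf "%d record with a newline separator\n", scalar @by_line;
```

Using local keeps the change to $/ confined to the helper, which is usually safer than assigning to it globally.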

    To find all the matches on one line, use while instead of if.

    Finally, capture groups are numbered from left to right across the whole pattern, so the groups after the vertical bar populate $3 and $4 rather than $1. Use a branch reset pattern, (?|...), so that each alternative starts populating $1:

    while (m/(?|(\d)+?|(?<=#)(.*?)(?= fin))/g) {

    Which could be simplified to

    while (m/(?|(\d)+?|#(.*?) fin)/g) {

    Update: Are you sure about (\d)+? ? Have you tested it with numbers of more than one digit? You probably wanted just plain (\d+) .
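Putting the points above together, a complete script might look like the following sketch. It assumes the sample data from the question, fed through an in-memory filehandle instead of the file on disk, and the helper name extract_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every number and every "#... fin" phrase
# found in $text, in the order in which they occur.
sub extract_matches {
    my ($text) = @_;
    my @found;
    # (?|...) is a branch reset group: both alternatives capture into $1.
    # "while" with /g keeps matching until the string is exhausted.
    while ($text =~ m/(?|(\d+)|#(.*?) fin)/g) {
        push @found, $1;
    }
    return @found;
}

# Sample data copied from the question; an in-memory filehandle stands in
# for animals.txt so the sketch does not depend on a file on disk.
my $data = <<'END';
The 2 cats and the dog.
The 8 cats and the 6 dogs.
The 3 pigs and the 2 sheep.
#story fin #cats and dogs fin #sheep fin
END

open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {    # $/ is left at its default, so this reads line by line
    print "$_\n" for extract_matches($line);
}
close($fh);
```

The sketch uses a lexical filehandle and the three-argument form of open, and checks the return value of open; the original script does none of these, so treat them as suggestions rather than requirements.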


      Brilliant! Thank you very much for your help!

      Also thanks for your tip about setting $/ to a space. In an earlier version of the script, I was trying to just locate multiple numbers across several lines; setting $/ to a space stopped the script from only printing the first number on each line. However, I realise now that it is inappropriate for the current task. Thanks again!

        The only thing I would add to choroba's comprehensive comments is that you might consider adding some kind of boundary assertion to the  fin delimiter pattern: see what happens when the  \b assertion in the match used below is omitted. I agree that the look-arounds don't seem needed, so I've left them out. (I use  \x23 instead of  # in my pattern only because my REPL doesn't like octothorpes.)

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use 5.010;
         ;;
         my @lines = (
           'The 2 cats and the dog.',
           'The 8 cats and the 6 dogs.',
           'The 3 pigs and the 2 sheep.',
           '',
           '#story fin #cats and dogs fin #sheep fin',
           'blah yada',
           '#sharkfin soup fin #fish fingers fin',
           '9 fleas, 87 ticks, 654 lice.',
           '42 cats #some sheep fin and 1 dog',
           );
         ;;
         for my $line (@lines) {
           printf qq{'$line' -> };
           ;;
           my $parsed = my @extracted =
             $line =~ m{ (?| (\d+) | \x23 (.*?) \s+ fin \b) }xmsg;
           ;;
           print $parsed ? map qq{'$_' }, @extracted : 'nothing parsed';
           }
        "
        'The 2 cats and the dog.' -> '2'
        'The 8 cats and the 6 dogs.' -> '8' '6'
        'The 3 pigs and the 2 sheep.' -> '3' '2'
        '' -> nothing parsed
        '#story fin #cats and dogs fin #sheep fin' -> 'story' 'cats and dogs' 'sheep'
        'blah yada' -> nothing parsed
        '#sharkfin soup fin #fish fingers fin' -> 'sharkfin soup' 'fish fingers'
        '9 fleas, 87 ticks, 654 lice.' -> '9' '87' '654'
        '42 cats #some sheep fin and 1 dog' -> '42' 'some sheep' '1'
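For anyone who wants to try the suggested experiment of omitting \b, here is a small sketch that runs the phrase part of the pattern both ways. The helper extract_phrases and its flag argument are assumptions made for this comparison and do not appear in the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the "#... fin" phrases from $text, with or
# without a word boundary after "fin". \x23 stands for "#", which would
# otherwise start a comment under the /x modifier.
sub extract_phrases {
    my ($text, $use_boundary) = @_;
    my $re = $use_boundary
        ? qr{ \x23 (.*?) \s+ fin \b }xms
        : qr{ \x23 (.*?) \s+ fin    }xms;
    # In list context, a /g match returns the list of captured substrings.
    return $text =~ m{$re}g;
}

my $line = '#sharkfin soup fin #fish fingers fin';

for my $use_boundary (1, 0) {
    my @phrases = extract_phrases($line, $use_boundary);
    printf "%-10s -> %s\n",
        $use_boundary ? 'with \b' : 'without \b',
        join ' ', map { "'$_'" } @phrases;
}
```

Without the boundary, the "fin" at the start of "fingers" is accepted as the delimiter, so the second phrase is cut short to 'fish'; with \b the full 'fish fingers' is captured.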

