New Novice has asked for the wisdom of the Perl Monks concerning the following question:

Most enlightened monks!

I am looking for a neat way to count how often a given expression occurs in a string. The answer will involve regular expressions, and the CPAN regular expressions tutorial touches on this question, but it is just too condensed for the mind of this humble novice.

The background to my inquiry is that I am extracting the number and dates of events from a calendar. A simple match and extraction only gets me the first occurrence of an event, even if there are several listings in the calendar. E.g. (this is not tested, but you will get the idea),

my @dates;
my $string = 'All kinds of text 01-01-2003 Perl Party more text 01-01-2004 Perl Party and even more text 01-01-2005 Perl Party and finally some other date 01-01-2006';
if ($string =~ /([0-9]{2})-([0-9]{2})-([0-9]{4})\s$/) {
    my $date = "$1\.$2\.$3";
    push @dates, $date;
}

will return only one Perl Party, namely the one in 2003. One idea to remedy this would be to split the string into substrings, using the event as a delimiter, and run the substrings through a for loop. E.g.,

my @substrings = split /Perl\sParty/, $string;
for my $substring (@substrings) {
    if ($substring =~ /([0-9]{2})-([0-9]{2})-([0-9]{4})\s$/) {
        my $date = "$1\.$2\.$3";
        print "There is a Perl Party on $date\n";
    }
}

But this would also return a Perl Party for 2006, which is simply a date which happens to be at the end of the final string. So we would have to test for genuine perl party dates by checking if the next (actually only if the last) substring starts with the expression "Perl Party". Because using the search string as a delimiter deletes the search expression from the substrings, we would have to go back to the original string and test all the dates we found. E.g.,

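for my $substring (@substrings) {
    if ($substring =~ /([0-9]{2})-([0-9]{2})-([0-9]{4})\s$/) {
        my $date = "$1\.$2\.$3";
        # The dots in $testdate act as regex wildcards, so they also match the hyphens in $string.
        my $testdate = "$1\.$2\.$3" . " Perl Party";
        if ($string =~ /$testdate/) {
            print "There is a Perl Party on $date\n";
        }
    }
}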

As you can see, this is getting rather complicated. I bet there is an easier way to at least count how often an expression occurs in a string, and possibly also to extract a substring preceding it.

Thank you for your efforts!

Your most humble novice

Replies are listed 'Best First'.
Re: Counting frequency of expressions in a string
by GrandFather (Saint) on Aug 28, 2005 at 11:04 UTC

    You need the g (global) modifier and possibly the s (ignore line ends) modifier. You also need to remove the $ (end of pattern match). You also need to match "Perl Party".

    Test code would look something like this:

        use strict;
        use warnings;

        my @dates;
        # Slurp the whole DATA section; local keeps the change to $/ inside the block.
        my $string = do { local $/; <DATA> };

        # /g in a while loop steps through every date that is followed by "Perl Party".
        while ($string =~ /\b(\d{2})-(\d{2})-(\d{4})\sPerl\sParty/g) {
            push @dates, "$1\.$2\.$3";
        }

        print join "\n", @dates;

        __DATA__
        All kinds of text 01-01-2003 Perl Party more text 01-01-2004 Perl Party
        and even more text 01-01-2005 Perl Party and finally some other date
        01-01-2006

    Update: /s removed and relevant comment struck. See bart's reply below.
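
    If only the count is needed, as the original question asks, one possible sketch is to evaluate the same /g match in list context and assign the result through an empty list. The sample string is copied from the question; the variable name $count is an assumption made for illustration.

    ```perl
    use strict;
    use warnings;

    # Sample text taken from the question above.
    my $string = 'All kinds of text 01-01-2003 Perl Party more text '
               . '01-01-2004 Perl Party and even more text 01-01-2005 Perl Party '
               . 'and finally some other date 01-01-2006';

    # In list context /g returns all matches; assigning that list to ()
    # and then to a scalar yields the number of matches.
    my $count = () = $string =~ /\b\d{2}-\d{2}-\d{4}\sPerl\sParty/g;

    print "Perl Party occurs $count times\n";
    ```

    The date 01-01-2006 is not counted because it is not followed by "Perl Party".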

    Perl is Huffman encoded by design.
       /\b(\d{2})-(\d{2})-(\d{4})\sPerl\sParty/gs
      You don't have any dots in this regexp, so the /s is useless. You might feel like I'm nitpicking, and I am: I wouldn't mind its presence in actual production code.
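
      A small sketch of why /s makes no difference here: the modifier only changes what the dot matches, while \s matches a newline either way. The test string and variable names are made up for illustration.

      ```perl
      use strict;
      use warnings;

      # A date and the event name separated by a newline.
      my $text = "01-01-2003\nPerl Party";

      # Without /s the dot does not match the newline; with /s it does.
      my $dot_plain = $text =~ /2003.Perl/  ? 1 : 0;
      my $dot_s     = $text =~ /2003.Perl/s ? 1 : 0;

      # \s matches the newline with or without /s.
      my $ws_plain = $text =~ /2003\sPerl/  ? 1 : 0;
      my $ws_s     = $text =~ /2003\sPerl/s ? 1 : 0;

      print "dot: plain=$dot_plain, /s=$dot_s\n";
      print "\\s:  plain=$ws_plain, /s=$ws_s\n";
      ```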

      But this is a site for earning learning, and you shouldn't set a bad example. Too many people are already cargo-culting code (especially regular expressions) from this site, so I think we should relieve the code we post here of any voodoo as much as possible, so that people will actually begin to understand the code they're copying.

      Thank you.

        For earning? I have not earned much here. Maybe you mean learning? ;-)


        holli, /regexed monk/

        You are of course quite right. The hidden story is that there was a .*? in there following a \G. I realised the anchor and wild card were redundant and removed them, but forgot to remove the /s.

        So the further lesson is: look, then look again. And each time you make a change, look twice more. That time I only looked once more. :)


        Perl is Huffman encoded by design.
Re: Counting frequency of expressions in a string
by spiritway (Vicar) on Aug 28, 2005 at 10:59 UTC

    Your re needs to have a 'g' after the final slash. So: /Perl\sParty/g for example. The 'g' stands for "global". I'm thinking I'm overlooking something else, but at this hour my poor brain is a bit fried. Hope this helps you.
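
    To illustrate what the 'g' buys you, here is a minimal sketch that uses a /g match in list context to collect every date that is followed by "Perl Party". The sample string comes from the question; capturing the whole date in one group and converting it with tr is a choice made for this example, not something from the thread.

    ```perl
    use strict;
    use warnings;

    # Sample text taken from the question above.
    my $string = 'All kinds of text 01-01-2003 Perl Party more text '
               . '01-01-2004 Perl Party and even more text 01-01-2005 Perl Party '
               . 'and finally some other date 01-01-2006';

    # In list context, /g returns the captured group from every match.
    my @dates = $string =~ /\b(\d{2}-\d{2}-\d{4})\sPerl\sParty/g;

    # Swap hyphens for dots to get the date format used in the question.
    tr/-/./ for @dates;

    print "There is a Perl Party on $_\n" for @dates;
    ```

    Without the 'g', the same match in list context would return only the first date.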