You have already captured the (daily|month|days|...) group as $2, so you can just do something like:

  my ($found, $term) = ($1, $2);

However, I question whether your regexp is going to be adequate. Usually "daily" has nothing relevant in front of it, so you'd be capturing, for example, "Here is (my) (daily) string!" (captured groups shown in parentheses). And do you really want "(bi)-(month)ly" captured as such? Or "Every (other) (week)"? "Biweekly" (no hyphen) and "semiweekly" are usually acceptable in English as well.
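To make the over-capturing concrete, here is a minimal sketch. The pattern in $LOOSE is only my assumed stand-in for your regexp (word, one non-word character, period keyword), and loose_groups is a hypothetical helper name, so adjust both to your real code.

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# ASSUMPTION: a stand-in for the original pattern, which is not shown here.
# It captures any word, skips one non-word character, then captures a keyword.
my $LOOSE = qr/(\w+)\W(daily|month|week)/i;

# Hypothetical helper: return both captured groups, or an empty list.
sub loose_groups {
    my ($text) = @_;
    return $text =~ $LOOSE ? ($1, $2) : ();
}

for my $text ('Here is my daily string!', 'bi-monthly', 'Every other week') {
    my ($found, $term) = loose_groups($text);
    say qq{"$text" => ($found) ($term)};
}
```

With this stand-in pattern the three strings give (my) (daily), (bi) (month) and (other) (week): the first group is noise in one case and essential in the other two, and nothing in the match tells you which.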

What I'm trying to say is that parsing periodic time expressions in English is hard enough, and pulling them out of sentences is harder still. My suggestion would be to match the entire period expression precisely, so you don't pull in extraneous information. Your regexp will match less often, but some false negatives are likely preferable to inaccurate parsing.

I found a periodic frequency list for you that contains additional terms (some of them archaic and likely not applicable). I suggest more research. Then, I'd start building a regexp something like what I've started below.

Note, of course, that this is only a suggestion of a starting point. What I've come up with is certainly incomplete and needs to be expanded and tested rigorously with a sizable corpus of input strings.

  #!/usr/bin/env perl
  use 5.010;
  use warnings;
  use strict;

  my $NUMBER = qr/(?i:three|four|five|six|seven|eight|nine|ten|\d+)/;
  my $PERIOD = qr/(?i:day|week|month|quarter|year)/;

  for ( map { chomp; $_ } <DATA> ) {
      say "`$_' contains `$1'" if /\b (
              (?:bi|semi)? [-]? (?:weekly|monthly)
          |   (?:every\sother | twice\s)? (?:daily|monthly|quarterly|annually)
          |   (?:once|twice|$NUMBER\stimes)\s (?:a|per)\s $PERIOD
          |   (?:every\s(?:(?:other|twice)\s)?)? $PERIOD
          |   (?:se|bi)?mestral
      ) \b /xi;
  }

  __DATA__
  Here is the weekly TPS report.
  I go for a walk semimonthly.
  How often do you clean this toilet? Quarterly?!
  The sun comes up seven times per week.
  I get older every year.
  Not many people say "bimestral" anymore.

Output:

  `Here is the weekly TPS report.' contains `weekly'
  `I go for a walk semimonthly.' contains `semimonthly'
  `How often do you clean this toilet? Quarterly?!' contains `Quarterly'
  `The sun comes up seven times per week.' contains `seven times per week'
  `I get older every year.' contains `every year'
  `Not many people say "bimestral" anymore.' contains `bimestral'

Only once you are able to extract the entire period would I suggest that you attempt to parse it (i.e., once you have "bimonthly", parse or interpret it further as you see fit).
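As a rough idea of what that second step might look like, here is a sketch under my own assumptions. The interpret function, its (times, unit) return value and the two lookup tables are all hypothetical; it handles only a few unambiguous forms and deliberately gives up on everything else, including "bimonthly".

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# ASSUMPTION: small lookup tables for illustration, far from complete.
my %WORD   = (once => 1, twice => 2, three => 3, four => 4, five => 5,
              six => 6, seven => 7, eight => 8, nine => 9, ten => 10);
my %ADVERB = (daily => 'day', weekly => 'week', monthly => 'month',
              quarterly => 'quarter', annually => 'year');

# Hypothetical helper: map an already-extracted phrase to (times, unit).
# Returns an empty list for anything it does not recognize.
sub interpret {
    my ($phrase) = @_;
    my $p = lc $phrase;
    return (1, $ADVERB{$p}) if exists $ADVERB{$p};
    return (1, $1) if $p =~ /^every\s(day|week|month|quarter|year)$/;
    if ($p =~ /^(?:(once|twice)|(\w+)\stimes)\s(?:a|per)\s(day|week|month|quarter|year)$/) {
        # Copy the captures first: the next successful match would reset them.
        my ($word, $num, $unit) = ($1, $2, $3);
        my $n     = $word // $num;
        my $count = $n =~ /^\d+$/ ? $n : $WORD{$n};
        return ($count, $unit) if defined $count;
    }
    return;
}

for my $phrase ('seven times per week', 'Quarterly', 'every year', 'bimonthly') {
    my ($times, $unit) = interpret($phrase);
    say defined $times ? "$phrase: $times per $unit" : "$phrase: not interpreted";
}
```

Returning nothing for unrecognized input follows the same reasoning as above: a false negative is easier to live with than a confident wrong answer.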


In reply to Re: Group matching - Extracting what matches by rjt
in thread Group matching - Extracting what matches by madbee
