Re: How to extract information that spans over two lines in HTML

To match your date format:

m{
  # any of the 12 months:
  (?: January   |
      February  |
      March     |
      April     |
      May       |
      June      |
      July      |
      August    |
      September |
      October   |
      November  |
      December
    )
    \s          # a space between month and day
    [123]?\d    # Is the first day 1, \s1, or 01?
    ,           # a comma between day and year
    \s?         # an optional space between comma and year
    [12]\d{3}   # a four digit year
}xms

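Each \s can match a newline as well as a space, which is what lets the pattern match a date that is split over two lines. Normally I'd write \s+ (one or more) and \s* (zero or more) instead of \s and \s?, but your spec seems to indicate that at most one whitespace character is expected. I wasn't sure how a single digit day is represented, so I assumed a bare single digit. If it can have a leading zero or space, the pattern needs to account for that. Also note that [123]?\d is loose: it accepts values such as 0 or 39, so it does not validate the day.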
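As a quick check of the point about \s, here is a minimal sketch that runs the same pattern against a date broken across two lines. The names $text, $date_re and $found and the sample string are made up for illustration, and the month list is only written more compactly.

```perl
use strict;
use warnings;

# Hypothetical sample: the line break falls between the month and the day.
my $text = "Posted on January\n5, 2009 by someone";

# Same pattern as above, with the month list folded onto fewer lines.
my $date_re = qr{
    (?: January | February | March     | April   | May      | June
      | July    | August   | September | October | November | December )
    \s          # one whitespace character, which may be the newline
    [123]?\d    # one or two digit day
    ,           # a comma between day and year
    \s?         # an optional whitespace character
    [12]\d{3}   # a four digit year
}xms;

# Capture the whole date; $found stays undef when nothing matches.
my $found = $text =~ m{ ( $date_re ) }xms ? $1 : undef;
print defined $found ? "matched: [$found]\n" : "no match\n";
```

The captured text still contains the newline, so you may want to replace it with a space before using the date.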

To match intervening tags, I'd make it like this:

m{
  # any of the 12 months:
  (?: January   |
      February  |
      March     |
      April     |
      May       |
      June      |
      July      |
      August    |
      September |
      October   |
      November  |
      December
    )
    (?:<[^>]+>|\s)*  # zero or more tag-like things or whitespace characters
    [123]?\d         # Is the first day 1, \s1, or 01?
    ,                # a comma between day and year
    (?:<[^>]+>|\s)*  # zero or more tag-like things or whitespace characters
    [12]\d{3}        # a four digit year
}xms

Note that this also matches when there is nothing at all between the month and the day (for example, "May5, 2008").

To extract the date from this, wrap the pattern in parentheses and take it from $1 afterward.

if ( $html =~ m{ ( blah blah as above blah blah ) }xms ) {
    my $date = $1;
    # remove the tags, if there are any ([^>]+ also copes with a tag that contains a newline)
    $date =~ s/<[^>]+>//g;
}
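Putting the pieces together, the following is a rough, self-contained sketch of the extraction. The HTML fragment and the names $html, $gap, $date_re and $date are assumptions made for illustration. It uses <[^>]+> for tags so that a single tag match cannot run past its own closing ">", and it adds one step not described above: collapsing the leftover whitespace into single spaces.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment: tags and a line break sit inside the date.
my $html = "<p>Updated: <b>March</b>\n<i>14, </i><span class=\"y\">2008</span></p>";

# Whatever may sit between the parts of a date: tags or whitespace.
my $gap = qr{ (?: <[^>]+> | \s )* }xms;

my $date_re = qr{
    (?: January | February | March     | April   | May      | June
      | July    | August   | September | October | November | December )
    $gap
    [123]?\d    # one or two digit day
    ,           # a comma between day and year
    $gap
    [12]\d{3}   # a four digit year
}xms;

my $date;
if ( $html =~ m{ ( $date_re ) }xms ) {
    $date = $1;
    $date =~ s/<[^>]+>//g;   # remove the tags, if there are any
    $date =~ s/\s+/ /g;      # collapse newlines and runs of spaces
}
print defined $date ? "$date\n" : "no date found\n";
```

Like the pattern it is built from, this expects the comma to follow the day directly, so a tag between the day and the comma would prevent a match.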

I have not tested any of the above. Hope it helps.
