coltman has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am totally new to Perl and would like to pick your brains on the following question. Any suggestions will be greatly appreciated.

I have a text file which contains HTML. I want to extract the first date that appears in this file. The date may take one of two formats, either November 12,2006 or September 21, 1999 (the only difference between the two formats is the space between the comma and the year). The annoying thing about this file is that

(1) the date may be broken across a line end, for example

bla bla bla bla bla bla bla bla bla bla bla bla June

25, 1998 bla bla bla bla bla bla bla bla bla bla

I don't know how to parse a date that spans two lines.

(2) As this is an HTML file, it contains a lot of HTML tags like <bla bla>, and some of them may be inserted into the date, such as "June <bla bla>25, <bla bla>1998". How can I extract the date while ignoring everything within the HTML tags (a <bla bla> tag may contain numbers, which can be confused with the day)?

Thanks a bunch in advance!

Thanks for the reply. Part of the HTML looks like:

<div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">November
1,2005</font></div>
<div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 322.55pt" align="left"><br></div>

or

<div style="MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="justify"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><u>Term</u></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">. The terms and conditions of this Agreement shall have effect as of and from </font><font style="DISPLAY: inline; FONT-SIZE: 8pt; FONT-FAMILY: Arial, sans-serif"><sup>st</sup></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">March 1, </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">2006 (the </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><strong>"Effective Date") </strong></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">and provided in this Agreement.</font></div>

20070327 Janitored by Corion: Removed square brackets around HTML, added code tags, as per Writeup Formatting Tips

Replies are listed 'Best First'.
Re: How to extract information that spans over two lines in HTML
by kyle (Abbot) on Mar 20, 2007 at 14:55 UTC

    To match your date format:

    m{
        # any of the 12 months:
        (?: January | February | March | April
          | May | June | July | August
          | September | October | November | December )
        \s          # a space between month and day
        [123]?\d    # Is the first day 1, \s1, or 01?
        ,           # a comma between day and year
        \s?         # an optional space between comma and year
        [12]\d{3}   # a four digit year
    }xms

    Each \s can match a newline as well as a space. Normally I'd write \s+ and \s* to match multiple spaces and zero-or-more spaces, respectively, but your spec seems to indicate only one is expected. I wasn't sure how a single digit day is to be represented, so I made it a single digit. If it needs a leading zero or space, the pattern needs to account for that.
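To make the point about \s and newlines concrete, here is a minimal sketch. It assumes the whole file has already been read into one string (for example by slurping it with a localized $/); if the file is read line by line, no pattern can see both halves of the date. The sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample: the line end falls between the month and the day.
my $text = "bla bla bla June\n25, 1998 bla bla bla";

my $date_re = qr{
    (?: January | February | March | April | May | June | July
      | August | September | October | November | December )
    \s          # a space OR a newline between month and day
    [123]?\d    # day of month
    ,           # comma after the day
    \s?         # optional space between comma and year
    [12]\d{3}   # four-digit year
}xms;

# List context returns the capture; undef if nothing matched.
my ($date) = $text =~ m{ ($date_re) }xms;

if ( defined $date ) {
    $date =~ s/\s+/ /g;    # turn the embedded newline into a space
    print "$date\n";       # June 25, 1998
}
```

Because the match is leftmost, this also picks up the first date in the string, which is what the question asks for.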

    To match intervening tags, I'd make it like this:

    m{
        # any of the 12 months:
        (?: January | February | March | April
          | May | June | July | August
          | September | October | November | December )
        (?:<.+?>|\s)*   # zero or more tag like things or spaces
        [123]?\d        # Is the first day 1, \s1, or 01?
        ,               # a comma between day and year
        (?:<.+?>|\s)*   # zero or more tag like things or spaces
        [12]\d{3}       # a four digit year
    }xms

    Note that this allows there to be no space between the month and day.

    To extract the date from this, wrap the pattern in parentheses and take it from $1 afterward.

    if ( $html =~ m{ ( blah blah as above blah blah ) }xms ) {
        my $date = $1;
        # remove the tags, if there are any
        # (/s so that a tag wrapped across two lines is removed too)
        $date =~ s/<.+?>//gs;
    }

    I have not tested any of the above. Hope it helps.
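Putting the pieces above together, a runnable sketch might look like the following. It is a variation, not a tested drop-in: it uses <[^>]+> instead of <.+?> for a tag so that one "tag" cannot stretch across several real ones, and the helper name first_date and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Tags and/or whitespace, in any order, possibly none at all.
my $gap = qr{ (?: <[^>]+> | \s )* }xms;

my $date_re = qr{
    (?: January | February | March | April | May | June | July
      | August | September | October | November | December )
    $gap        # tags or whitespace between month and day
    [123]?\d    # day of month
    ,           # comma after the day
    $gap        # tags or whitespace between comma and year
    [12]\d{3}   # four-digit year
}xms;

# first_date() is a hypothetical helper name, not from the thread.
# Returns the first date found, or nothing if there is none.
sub first_date {
    my ($html) = @_;
    return unless $html =~ m{ ($date_re) }xms;
    my $date = $1;
    $date =~ s/<[^>]+>//g;    # drop any tags inside the match
    $date =~ s/\s+/ /g;       # collapse line breaks and repeated spaces
    return $date;
}

# Made-up input modelled on the second sample in the question.
my $html = <<'END_HTML';
<font style="FONT-SIZE: 8pt"><sup>st</sup></font><font style="FONT-SIZE: 10pt">March 1,
</font><font style="FONT-SIZE: 10pt">2006 (the </font>
END_HTML

print first_date($html), "\n";    # March 1, 2006
```

The numbers inside the style attributes (8pt, 10pt) are never mistaken for a day, because the pattern only looks for digits after a month name and skips whole tags in between.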

Re: How to extract information that spans over two lines in HTML
by wfsp (Abbot) on Mar 20, 2007 at 14:39 UTC
    There are modules that will help you do this (e.g. HTML::TokeParser::Simple).

    We need a bit more information to go on though. Could you show some (short) samples of what you're dealing with?

    For instance, do you know what tags will be surrounding what you're looking for? Do you know what tags may be 'in' what you're looking for?

    If you can let us know it would be easier to help.

Re: How to extract information that spans over two lines in HTML
by GrandFather (Saint) on Mar 20, 2007 at 21:05 UTC

    For this sort of task I reach for HTML::TreeBuilder. Consider:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = <<HTML;
    <p>June <b>25, </b>1998</p>
    <p>November 12,2006 September 21, 1999</p>
    <p>December 36, 10</p>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content ($html);

    for ($tree->look_down ('_tag', 'p')) {
        my $text = $_->as_text ();
        print "$1\n" while $text =~ /(\w+\s+\d+,\s*\d+)/g;
    }

    Prints:

    June 25, 1998
    November 12,2006
    September 21, 1999
    December 36, 10

    DWIM is Perl's answer to Gödel
Re: How to extract information that spans over two lines in HTML
by shigetsu (Hermit) on Mar 20, 2007 at 14:41 UTC
    As we can only guess what your input data looks like, may I gently ask you to post some code? A few points come to mind when reading your description of the problem, but speculating may lead to misunderstandings and make it harder to give hints that actually help.
Re: How to extract information that spans over two lines in HTML
by jonsmith1982 (Beadle) on Mar 20, 2007 at 18:21 UTC
    I haven't looked much into the regexp needed to extract the date, but here is a little tip:
    /s Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
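A small sketch of what /s changes, using a made-up string with the line end inside the date. Note that /s only affects ".": a \s in the pattern matches a newline with or without it.

```perl
use strict;
use warnings;

# Made-up sample with a newline between month and day.
my $text = "June\n25, 1998";

# Without /s, "." refuses to match the newline, so this fails.
my $without_s = $text =~ /June.25/  ? 1 : 0;

# With /s, "." matches the newline as well, so this succeeds.
my $with_s    = $text =~ /June.25/s ? 1 : 0;

print "without /s: $without_s\n";    # without /s: 0
print "with /s:    $with_s\n";       # with /s:    1
```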