Hello all,

I am totally new to Perl and would like to pick your brains on the following question. Any suggestions will be greatly appreciated.

I have a text file that contains HTML. I want to extract the first date that appears in this file. The date can take one of two formats, for example November 12,2006 or September 21, 1999 (the only difference is the space between the comma and the year). The annoying things about this file are:

(1) The date may be broken across lines, for example:

bla bla bla bla bla bla bla bla bla bla bla bla June

25, 1998 bla bla bla bla bla bla bla bla bla bla

I don't know how to parse a date that spans two lines.
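
One possible way to handle the line-break case, as a minimal sketch rather than a definitive answer: read the whole file into a single string and let `\s+` bridge the gap, since `\s` matches newlines as well as spaces. The sub name `first_date` and the sample text are made up for illustration, and the sketch assumes month names are always spelled out in full English.

```perl
use strict;
use warnings;

# Full English month names only (an assumption; abbreviations are not handled).
my $month = qr/(?:January|February|March|April|May|June|July|August|September|October|November|December)/;

# Hypothetical helper: returns the first date found in $text, normalized to
# "Month D, YYYY", or undef if no date is found.
sub first_date {
    my ($text) = @_;
    # \s+ between month and day also matches line breaks;
    # \s* around the comma accepts both "12,2006" and "21, 1999".
    if ( $text =~ /\b($month)\s+(\d{1,2})\s*,\s*(\d{4})\b/ ) {
        return "$1 $2, $3";
    }
    return undef;
}

# Made-up sample standing in for the file contents. For a real file, the
# whole content could be slurped with: my $text = do { local $/; <$fh> };
my $text = "bla bla bla June\n\n25, 1998 bla bla\nlater: September 21,1999\n";
my $date = first_date($text);
print defined $date ? "$date\n" : "no date found\n";
```

Because the match runs over the whole string instead of line by line, it does not matter where the line breaks fall inside the date.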

(2) As this is an HTML file, it contains a lot of HTML tags like <bla bla>, and some of them may be inserted into the date, such as "June <bla bla>25, <bla bla>1998". How can I extract the date while ignoring everything inside the HTML tags (a <bla bla> tag may itself contain numbers, which could be confused with the day)?
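
For the tag problem, one rough sketch is to replace every tag with a space before looking for the date, so that numbers inside attributes can never be mistaken for the day. The sub names `strip_tags` and `first_date_in_html` are hypothetical, and the regex-based stripping is a deliberate simplification: it assumes no `>` appears inside attribute values and does not handle comments, scripts, or entities such as `&nbsp;`. A real HTML parser from CPAN would be more robust.

```perl
use strict;
use warnings;

# Full English month names only (an assumption; abbreviations are not handled).
my $month = qr/(?:January|February|March|April|May|June|July|August|September|October|November|December)/;

# Hypothetical helper: crude tag removal. Each tag becomes a space so that
# text on either side of a tag does not get glued together.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;    # drop everything between < and >
    $html =~ s/\s+/ /g;        # collapse runs of whitespace, including newlines
    return $html;
}

# Hypothetical helper: first date in the tag-free text, normalized to
# "Month D, YYYY", or undef if none is found.
sub first_date_in_html {
    my ($html) = @_;
    my $text = strip_tags($html);
    if ( $text =~ /\b($month)\s+(\d{1,2})\s*,\s*(\d{4})\b/ ) {
        return "$1 $2, $3";
    }
    return undef;
}

# Made-up sample: tags with numeric attributes interrupt the date.
my $html = qq{<font size="12">June </font><font size="31">25, </font>\n<b class="y2001">1998</b>};
my $date = first_date_in_html($html);
print defined $date ? "$date\n" : "no date found\n";
```

Stripping first and matching second keeps the date pattern simple, at the cost of trusting a regex to recognize tags.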

Thanks a bunch in advance!

Thanks for the reply. Part of the HTML looks like:

<div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">November] [1,2005</font></div> <div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 322.55pt" align="left"><br></div>]

or

<div style="MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="justify"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><u>Term</u></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">. The terms and conditions of this Agreement shall have effect as of and from </font><font style="DISPLAY: inline; FONT-SIZE: 8pt; FONT-FAMILY: Arial, sans-serif"><sup>st</sup></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">March 1, </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">2006] (the </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><strong>"Effective Date") </strong></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">and provided in this Agreement.</font></div>

20070327 Janitored by Corion: Removed square brackets around HTML, added code tags, as per Writeup Formatting Tips


In reply to How to extract information that spans over two lines in HTML by coltman
