coltman has asked for the wisdom of the Perl Monks concerning the following question:
Hello all,
I am totally new to Perl and would like to pick your brains on the following question. Any suggestions will be greatly appreciated.
I have a text file that contains HTML. I want to extract the first date that appears in this file. The date may take either of two formats, such as November 12,2006 or September 21, 1999 (the only difference between the two is the space between the comma and the year). The annoying thing about this file is that
(1) the date may be broken by a line break, for example
bla bla bla bla bla bla bla bla bla bla bla bla June
25, 1998 bla bla bla bla bla bla bla bla bla bla
I don't know how I can parse this date information which spans over two lines.
(2) As this is an HTML file, it contains a lot of HTML tags like <bla bla>, and some of them may be inserted into the date itself, such as "June <bla bla>25, <bla bla>1998". How can I extract the date while ignoring everything inside the HTML tags (a <bla bla> tag may contain numbers, which can be confused with the day)?
Thanks a bunch in advance!
Thanks for the reply. Part of the HTML looks like:
<div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">November 1,2005</font></div> <div style="DISPLAY: block; MARGIN-LEFT: 1pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 322.55pt" align="left"><br></div>
or
<div style="MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="justify"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><u>Term</u></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">. The terms and conditions of this Agreement shall have effect as of and from </font><font style="DISPLAY: inline; FONT-SIZE: 8pt; FONT-FAMILY: Arial, sans-serif"><sup>st</sup></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">March 1, </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">2006 (the </font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif"><strong>"Effective Date") </strong></font><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Arial, sans-serif">and provided in this Agreement.</font></div>
20070327 Janitored by Corion: Removed square brackets around HTML, added code tags, as per Writeup Formatting Tips
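Both problems in the question (a date split across lines, and tags sitting inside the date) can be handled by working on the whole file as one string instead of line by line. What follows is only a minimal sketch using core Perl, not a robust HTML parser: the helper name first_date and the sample string are made up for illustration, and the tag-stripping regex assumes no ">" appears inside attribute values or comments. For messy real-world HTML a proper parser module would be the safer route.

```perl
use strict;
use warnings;

# Full English month names only (an assumption; abbreviations are not handled).
my $MONTHS = join '|', qw(
    January February March April May June July
    August September October November December
);

# first_date() is a hypothetical helper: it takes the whole file content as
# one string and returns the first "Month D, YYYY" date, or undef if none.
sub first_date {
    my ($html) = @_;

    # Crude tag removal: replace everything from '<' to the next '>' with a
    # space. Numbers inside tags can no longer be mistaken for a day.
    (my $text = $html) =~ s/<[^>]*>/ /g;

    # Collapse every run of whitespace (newlines included) into one space,
    # so a date broken across two lines becomes contiguous.
    $text =~ s/\s+/ /g;

    # Optional spaces around the comma cover "12,2006" and "21, 1999".
    if ($text =~ /\b($MONTHS) (\d{1,2}) ?, ?(\d{4})\b/) {
        return "$1 $2, $3";    # normalized form
    }
    return;
}

# Made-up sample: the date spans two lines and is interrupted by tags.
my $sample = <<'END';
<p>bla bla <b size="3">June
<font style="FONT-SIZE: 10pt">25, </font><font>1998</font> bla bla</p>
END

print first_date($sample), "\n";   # prints "June 25, 1998"
```

To run this against a real file, read the whole file into one scalar first (for example by localizing $/ to undef before reading from the filehandle) and pass that string to first_date.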
Replies are listed 'Best First'.
Re: How to extract information that spans over two lines in HTML
by kyle (Abbot) on Mar 20, 2007 at 14:55 UTC
Re: How to extract information that spans over two lines in HTML
by wfsp (Abbot) on Mar 20, 2007 at 14:39 UTC
Re: How to extract information that spans over two lines in HTML
by GrandFather (Saint) on Mar 20, 2007 at 21:05 UTC
Re: How to extract information that spans over two lines in HTML
by shigetsu (Hermit) on Mar 20, 2007 at 14:41 UTC
Re: How to extract information that spans over two lines in HTML
by jonsmith1982 (Beadle) on Mar 20, 2007 at 18:21 UTC