apessos has asked for the wisdom of the Perl Monks concerning the following question:
Once again I come before the Perl Monks seeking wisdom. I have been plagued with editing large Word documents, and I am looking for a simple way to grab data from a document once it has been converted to HTML. The problem lies in the non-uniform way the HTML is generated, with newlines placed seemingly at random.
The data I am trying to grab are contained within the <b> tags. Here is an example...
<p class=para><a name="watch dog"></a><b>watch dog -</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>
I changed the content to protect the information, but the structure is the same. In about 3/4 of the document, the text between the <b> tags appears on a single line (bottom example). In the other 1/4, the <b> tags are spread out over multiple lines (top example).
I wrote a simple one-liner that grabs that 3/4 of the data, but I don't know how, or even whether, it is possible to easily grab the other 1/4. Here is the one-liner:
perl -e 'while(<>){print "$1\n" if /<b>(.*)<\/b>/;}' smaller.txt
Any words of advice?
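For reference, here is a minimal sketch of one way the multi-line case might be handled. The one-liner above reads one line at a time, so a <b>...</b> pair that straddles a newline never matches. It also uses a greedy .*, which would run two pairs on the same line together. The sketch works on the whole text at once, with a non-greedy match and the /s modifier. The helper name extract_bold and the sample data (including where the line break falls) are made up for illustration, and it assumes the <b> tags carry no attributes and are not nested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the text of
# every <b>...</b> pair in $html, even when a pair spans several lines.
sub extract_bold {
    my ($html) = @_;
    my @found;
    # /s lets . match newlines, .*? is non-greedy so each pair is matched
    # on its own, /i also accepts <B>, and /g walks through every match.
    while ($html =~ m{<b>(.*?)</b>}gis) {
        my $text = $1;
        $text =~ s/\s+/ /g;          # fold embedded newlines into single spaces
        $text =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
        push @found, $text;
    }
    return @found;
}

# Made-up sample in the shape of the post's example; the position of the
# line break inside the first <b> pair is an assumption.
my $sample = <<'HTML';
<p class=para><a name="watch dog"></a><b>watch
dog -</b> A big dog.</p>
<p class=para><a name="WR"></a><b>wooden round -</b> A big piece of round wood.</p>
HTML

print "$_\n" for extract_bold($sample);
# prints:
# watch dog -
# wooden round -
```

To run the same idea against a file, the text could be slurped first, for example with perl's -0777 switch: perl -0777 -ne 'print "$1\n" while /<b>(.*?)<\/b>/gs' smaller.txt (note that this shorter form prints any embedded newlines as they are). For HTML messier than this, a real parser from CPAN such as HTML::TokeParser is likely a safer choice than a regex.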
Replies are listed 'Best First'.

Re: Dealing with Word Compact HTML
by Fletch (Bishop) on Apr 14, 2004 at 14:45 UTC
by matija (Priest) on Apr 14, 2004 at 15:07 UTC
by format_c (Initiate) on Apr 14, 2004 at 22:45 UTC

Re: Dealing with Word Compact HTML
by b10m (Vicar) on Apr 14, 2004 at 15:01 UTC

Re: Dealing with Word Compact HTML
by seattlejohn (Deacon) on Apr 14, 2004 at 15:20 UTC

Re: Dealing with Word Compact HTML
by relax99 (Monk) on Apr 14, 2004 at 15:44 UTC
by qq (Hermit) on Apr 14, 2004 at 19:21 UTC
by relax99 (Monk) on Apr 15, 2004 at 12:42 UTC

Re: Dealing with Word Compact HTML
by eXile (Priest) on Apr 14, 2004 at 15:14 UTC

Re: Dealing with Word Compact HTML
by rje (Deacon) on Apr 14, 2004 at 15:42 UTC

Re: Dealing with Word Compact HTML
by apessos (Acolyte) on Apr 14, 2004 at 18:10 UTC