Once again I come before the Perl Monks seeking wisdom. I have been plagued with editing large Word documents, and I am looking for a simple way to grab data from a document once it has been converted to HTML. The problem lies in the non-uniform way the HTML is generated, with newlines placed seemingly at random.
The data I am trying to grab are contained within the <b> tags. Here is an example...
<p class=para><a name="watch dog"></a><b>watch dog -</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>
I changed the content to protect the information, but the structure is the same. The problem is that in about 3/4 of the document the text between the <b> tags appears on a single line (bottom example). In the other 1/4, the <b> tags are spread out over multiple lines (top example).
I wrote a simple one-liner that grabs 3/4 of the data, but I don't know whether it is possible to easily grab the other 1/4. Here is the one-liner.
perl -e 'while(<>){print "$1\n" if /<b>(.*)<\/b>/;}' smaller.txt
Any words of advice?
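One possible approach, offered as a minimal sketch rather than a definitive fix: the one-liner reads the file a line at a time, so a <b>...</b> pair that spans a newline can never match. Reading the whole input into a single string and matching with the /s modifier (so "." also matches newlines) and a non-greedy .*? should pick up both cases. The sub name extract_bold and the whitespace clean-up are my own assumptions for illustration, as is the guess that the opening <b> tag may carry attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <b>...</b> element in $html,
# in document order.
sub extract_bold {
    my ($html) = @_;
    my @terms;
    # /s lets "." match newlines, so a tag pair split across lines still matches.
    # .*? is non-greedy, so each match stops at the first closing </b>.
    # /i also accepts <B>; /g walks through every match in the string.
    # <b\b[^>]*> tolerates attributes on the tag but does not match <br>.
    while ($html =~ m{<b\b[^>]*>(.*?)</b>}sig) {
        my $term = $1;
        $term =~ s/\s+/ /g;     # collapse embedded newlines and runs of spaces
        $term =~ s/^ | $//g;    # trim a leading or trailing space
        push @terms, $term;
    }
    return @terms;
}

# Only runs when file names are given, e.g.: perl extract_bold.pl smaller.txt
if (@ARGV) {
    local $/;                   # slurp mode: each file is read in one go
    my $html = join '', <>;
    print "$_\n" for extract_bold($html);
}
```

The same idea fits in a one-liner, minus the whitespace clean-up (so terms split across lines print with their newlines intact): perl -0777 -ne 'print "$1\n" while /<b>(.*?)<\/b>/sg' smaller.txt. A regex like this assumes the <b> tags are never nested; for messier input, a real parser such as HTML::TokeParser from CPAN would likely be more robust.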
In reply to Dealing with Word Compact HTML by apessos