Once again I come before the Perl Monks seeking wisdom. I have been plagued with editing large Word documents, and I am looking for a simple way to grab data from a document once it has been converted to HTML. The problem lies in the non-uniform way the HTML is generated, with newlines placed seemingly at random.
The data I am trying to grab are contained within the <b> tags. Here is an example...
<p class=para><a name="watch dog"></a><b>watch dog
-</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>
I changed the content to protect the information, but the structure is the same. In about 3/4 of the document, the text between the <b> tags appears on a single line (bottom example). In the other 1/4, the text between the <b> tags is spread over multiple lines (top example).
I wrote a simple one-liner that grabs the 3/4 of the data that sits on a single line, but I don't know how to grab the other 1/4, or whether that is even easily possible.
Here is the one-liner.
perl -e 'while(<>){print "$1\n" if /<b>(.*)<\/b>/;}' smaller.txt
Any words of advice?
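For what it's worth, one direction that might work is to stop reading line by line: read the whole file into a single string, then match with the /s modifier so that . also matches newlines, and use a non-greedy .*? so each match stops at the first closing tag. The sketch below is only an illustration under those assumptions. The helper name extract_bold is made up for this example, and the sample data is the same disguised snippet shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns the text found
# between each <b>...</b> pair in the given HTML string.
sub extract_bold {
    my ($html) = @_;
    my @terms;
    # /g walks through every match, /s lets . match newlines,
    # /i tolerates <B>, and .*? stops at the first </b>.
    while ($html =~ m{<b>(.*?)</b>}gis) {
        my $term = $1;
        $term =~ s/\s+/ /g;        # fold embedded newlines into single spaces
        $term =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
        push @terms, $term;
    }
    return @terms;
}

# Sample input modeled on the two cases described in the post.
my $sample = <<'HTML';
<p class=para><a name="watch dog"></a><b>watch dog
-</b> A big dog that makes sure that you don't do anything.</p>
<p class=para><a name="WR"></a><b>wooden round</b> A big piece of round wood.</p>
HTML

# Prints one term per line, including the term that spanned two lines.
print "$_\n" for extract_bold($sample);
```

The same idea should fit in a one-liner by letting perl slurp the file with -0777, for example: perl -0777 -ne 'print "$1\n" while /<b>(.*?)<\/b>/gs' smaller.txt (this version prints the embedded newline as-is). Regexes on HTML are fragile: a <b> tag with attributes or with nested tags inside would need a looser pattern or a real parser.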