Parse into array

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need to parse this

<td>Suggested Categories or Articles</td><td> <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>

into an array containing
"personal injury", "accident lawyers", ....
That is, whatever sits between the bold tags:

<b>XXXXXX</b>

Please help. I know there should be a one-line regex for this, something like (@ar) = $str =~ /....(...).../g


Replies are listed 'Best First'.
Re: Parse into array by kennethk (Abbot) on Mar 11, 2009 at 23:18 UTC
Given that you are trying to parse HTML, have you considered HTML::Parser?
Re: Parse into array by bichonfrise74 (Vicar) on Mar 12, 2009 at 00:15 UTC
Try this... but as someone pointed out, this could break easily.

#!/usr/bin/perl
use strict;
use warnings;

while ( <DATA> ) {
    # After splitting on "</b>", every chunk that held a bold item ends in "<b>text".
    foreach ( split "</b>" ) {
        # Capture everything between the last "<b>" and the end of the chunk,
        # so multi-word entries such as "accident lawyers" come through whole.
        my ($name) = /<b>([^<]*)\z/;
        print "$name\n" if defined $name;
    }
}
__DATA__
<td>Suggested Categories or Articles</td><td> <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>
Re^2: Parse into array by vit (Friar) on Mar 12, 2009 at 01:54 UTC
There is one complication I should've mentioned. I have many HTML portions like this, and I need to parse only the one that starts with "Suggested Categories or Articles....."

Solved, no problem. Thanks, everybody!!
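The thread does not show how this was solved. One way to handle it, as a minimal sketch rather than the poster's actual solution: first isolate the cell that follows the "Suggested Categories or Articles" label, then run a single global match in list context to collect the bold items. The "Related Searches" row and the surrounding <tr> tags are hypothetical, added only to show that other portions are skipped. Like any regex approach, this assumes the markup keeps exactly this shape and that the cell contains no nested table cells.

```perl
use strict;
use warnings;

# Hypothetical input: two rows of the same shape. Only the second one
# carries the "Suggested Categories or Articles" label we care about.
my $html = <<'HTML';
<tr><td>Related Searches</td><td> <b>car insurance</b> <font size="-3" face="Verdana"> (0.2)</font><br></td></tr>
<tr><td>Suggested Categories or Articles</td><td> <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br></td></tr>
HTML

my @ar;

# Step 1: capture the contents of the cell right after the label cell.
if ( my ($cell) = $html =~ m{<td>Suggested Categories or Articles</td>\s*<td>(.*?)</td>}s ) {
    # Step 2: with /g in list context, the match returns every captured
    # <b>...</b> body at once, which is the "one-line regex" form.
    @ar = $cell =~ m{<b>(.*?)</b>}gs;
}

print "$_\n" for @ar;
```

With the sample input above, @ar ends up holding "personal injury", "accident lawyers" and "attorneys"; the bold item in the other row is ignored because it lies outside the captured cell.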
Re: Parse into array by nagalenoj (Friar) on Mar 12, 2009 at 04:11 UTC
This is just another way to solve your problem.

use strict;
use warnings;

while ( <DATA> ) {
    foreach ( split "<b>" ) {
        # used .* since HTML code can be nested like <b><i>injury</i></b>
        (my $name) = $_ =~ /(.*?)\<\/b\>/;
        print "$name\n" if ( defined( $name ) );
    }
}
__DATA__
<td>Suggested Categories or Articles</td><td> <b><i>personal injury</i></b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>

The output will be like

<i>personal injury</i>
accident lawyers
attorneys
law firms
litigation
Re: Parse into array by wfsp (Abbot) on Mar 12, 2009 at 08:40 UTC
Glad to hear you solved it with a regex. In case it fails in the future here's my go with a parser.

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $html = do{local $/;<DATA>};

my $r = HTML::TreeBuilder->new_from_content($html);

printf qq{%s\n}, $_->as_text for $r->look_down(_tag => q{b});

__DATA__
<td>Suggested Categories or Articles</td>
<td>
<b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br>
<b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br>
<b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br>
<b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br>
<b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br>
</td>

personal injury
accident lawyers
attorneys
law firms
litigation
Re: Parse into array by leslie (Pilgrim) on Mar 12, 2009 at 04:59 UTC
Use the code below; it should help.

while ( <DATA> ) {
    # Each chunk after a "<b>" starts with the bold text, followed by "</b>".
    foreach ( split "<b>" ) {
        (my $name) = $_ =~ /(\w+\s?\w.*)\<\/b\>/;
        print "$name\n" if ( defined( $name ) );
    }
}