vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need to parse this
<td>Suggested Categories or Articles</td><td> <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>
into an array with the contents
"personal injury", "accident lawyers", ....
that is, whatever sits between the bold tags:
<b>XXXXXX</b>
Please help. I know there should be a one-line regex of the form (@ar) = $str =~ /....(...).../g

Replies are listed 'Best First'.
Re: Parse into array
by kennethk (Abbot) on Mar 11, 2009 at 23:18 UTC
    Given that you are trying to parse html, have you considered HTML::Parser?
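    For what it's worth, here is a minimal sketch of how that could look with HTML::Parser's version 3 handler API. HTML::Parser is a CPAN module rather than part of core Perl, and the helper name bold_texts is made up for illustration. It collects the text inside every <b> element, including text wrapped in nested tags such as <i>.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Hypothetical helper: returns the text found inside each <b> element.
sub bold_texts {
    my ($html) = @_;
    my @found;
    my $depth = 0;    # greater than 0 while we are inside a <b> element

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Opening <b>: start a new entry unless we are already inside one.
        start_h => [ sub {
            return unless $_[0] eq 'b';
            push @found, '' unless $depth;
            $depth++;
        }, 'tagname' ],
        # Closing </b>: leave the element.
        end_h => [ sub { $depth-- if $_[0] eq 'b' && $depth }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub { $found[-1] .= $_[0] if $depth }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}
```

    Calling my @ar = bold_texts($str); would then fill the array. Note that this takes every <b> in the input, so the input should already be limited to the fragment of interest.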
Re: Parse into array
by bichonfrise74 (Vicar) on Mar 12, 2009 at 00:15 UTC
    Try this... but as someone pointed out, this could break easily.
    #!/usr/bin/perl
    use strict;
    use warnings;

    while ( <DATA> ) {
        # Each chunk ends just before a closing </b>, so the bold text is its tail.
        foreach ( split "</b>" ) {
            (my $name) = $_ =~ /\s?\w.*\<b\>(\w+.*)/;
            print "$name\n" if ( defined( $name ) );
        }
    }
    __DATA__
    <td>Suggested Categories or Articles</td><td> <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>
      There is one complication I should have mentioned.
      I have many HTML portions like this, and I need to parse only the one that starts with "Suggested Categories or Articles....." Solved, no problem. Thanks, everybody!!
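      For anyone landing here later, one way this restriction could be handled with plain regexes is sketched below. It is a sketch under assumptions, not what was actually used: the helper name suggested_terms is hypothetical, and it assumes the label sits in its own <td> immediately followed by the <td> holding the bold terms, as in the sample above. It first isolates that cell, then uses a /g match in list context to pull out every <b>...</b>, stripping any nested tags.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bold terms from the cell that follows
# the "Suggested Categories or Articles" label cell, or an empty list.
sub suggested_terms {
    my ($html) = @_;

    # Step 1: capture the body of the <td> right after the label cell.
    my ($cell) = $html =~ m{
        <td>\s*Suggested\ Categories\ or\ Articles\s*</td>
        \s*<td[^>]*>(.*?)</td>
    }xsi or return;

    # Step 2: in list context, m//g returns every captured group.
    # Nested markup such as <i>...</i> is removed from each term.
    return map { my $t = $_; $t =~ s/<[^>]+>//g; $t }
           $cell =~ m{<b>(.*?)</b>}gsi;
}
```

      Usage would be my @ar = suggested_terms($str);. As with any regex over HTML, this breaks as soon as the markup drifts from the assumed shape, for example if the cell itself contains a nested table.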
Re: Parse into array
by nagalenoj (Friar) on Mar 12, 2009 at 04:11 UTC
    This is just another way to solve your problem.
    use strict;
    use warnings;

    while ( <DATA> ) {
        foreach ( split "<b>" ) {
            # .*? is used since the HTML can be nested, like <b><i>injury</i></b>
            (my $name) = $_ =~ /(.*?)\<\/b\>/;
            print "$name\n" if ( defined( $name ) );
        }
    }
    __DATA__
    <td>Suggested Categories or Articles</td><td> <b><i>personal injury</i></b> <font size="-3" face="Verdana"> (0.56)</font><br><b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br><b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br><b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br><b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br></td>
    The output will be:
    <i>personal injury</i>
    accident lawyers
    attorneys
    law firms
    litigation
Re: Parse into array
by wfsp (Abbot) on Mar 12, 2009 at 08:40 UTC
    Glad to hear you solved it with a regex. In case it fails in the future here's my go with a parser.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = do{local $/;<DATA>};
    my $r = HTML::TreeBuilder->new_from_content($html);
    printf qq{*%s*\n}, $_->as_text for $r->look_down(_tag => q{b});

    __DATA__
    <td>Suggested Categories or Articles</td>
    <td>
    <b>personal injury</b> <font size="-3" face="Verdana"> (0.56)</font><br>
    <b>accident lawyers</b> <font size="-3" face="Verdana"> (0.4)</font><br>
    <b>attorneys</b> <font size="-3" face="Verdana"> (0.35)</font><br>
    <b>law firms</b> <font size="-3" face="Verdana"> (0.32)</font><br>
    <b>litigation</b> <font size="-3" face="Verdana"> (0.32)</font><br>
    </td>
    *personal injury*
    *accident lawyers*
    *attorneys*
    *law firms*
    *litigation*
Re: Parse into array
by leslie (Pilgrim) on Mar 12, 2009 at 04:59 UTC
    Use the code below; it should help.
    while ( <DATA> ) {
        foreach ( split "<b>" ) {
            (my $name) = $_ =~ /(\w+\s?\w.*)\<\/b\>/;
            print "$name\n" if ( defined( $name ) );
        }
    }