paola82 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a problem parsing web pages, and it is the same problem every time I have to parse data like this. I have an HTML page; here is part of it:

<TR class="violet3">
<TD ><B>hsa-miR-107</B></TD>
<TD >17.1922</TD>
<TD >-21.47</TD>
<TD >2.119850e-02</TD>
<TD >2.097540e-02</TD>
<TD >6.191350e-04</TD>
<TD >106</TD>
<TD >127</TD>
<TD ><pre><FONT COLOR="#FFFFFF">a</FONT><FONT COLOR="#FFFFFF">c</FONT><FONT COLOR="#FFFFFF">u</FONT><FONT ....
</TR>
<TR class="violet2">
<TD ><B>hsa-miR-103</B></TD>
<TD >17.1922</TD>
....
<TR class="violet3">
<TD ><B>hsa-miR-651</B></TD>
</TR>
<TR class="violet2">
<TD ><B>hsa-miR-320</B></TD>

I need to extract hsa-miR-651, hsa-miR-320, etc. What do I have to do? Regular expressions don't help me, and I don't understand how to use modules like HTML::Element: I don't understand the syntax, and I'm not actually sure whether it is OK to use it. Can anyone help me? Thank you all.

Replies are listed 'Best First'.
Re: parsing html
by wfsp (Abbot) on May 14, 2009 at 15:55 UTC
    HTML::TokeParser::Simple is a good HTML parser to get started with and the docs are friendly. :-)

    I'm not too sure about your spec, but if it is to get the text within bold tags that are within td tags, then something like this will get you started.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TokeParser::Simple;

    my $html = do{local $/;<DATA>};
    my $p = HTML::TokeParser::Simple->new(string => $html);

    my ($in_td, $in_b);
    while (my $t = $p->get_token) {
        $in_td++, next if $t->is_start_tag(q{td});
        $in_b++, next if $in_td and $t->is_start_tag(q{b});
        next unless $in_td and $in_b;
        if ($t->is_text){
            print $t->as_is, qq{\n};
            $in_td = 0;
            $in_b = 0;
        }
    }
    __DATA__
    <TR class="violet3">
    <TD ><B>hsa-miR-107</B></TD>
    <TD >17.1922</TD>
    <TD >-21.47</TD>
    <TD >2.119850e-02</TD>
    <TD >2.097540e-02</TD>
    <TD >6.191350e-04</TD>
    <TD >106</TD>
    <TD >127</TD>
    <TD ><pre><FONT COLOR="#FFFFFF">a</FONT><FONT COLOR="#FFFFFF">c</FONT><FONT COLOR="#FFFFFF">u</FONT><FONT></FONT>
    </TR>
    <TR class="violet2">
    <TD ><B>hsa-miR-103</B></TD>
    <TD >17.1922</TD>
    <TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    </TR>
    <TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>

    Output:

    hsa-miR-107
    hsa-miR-103
    hsa-miR-651
    hsa-miR-320

    Good luck!
Re: parsing html
by mirod (Canon) on May 14, 2009 at 15:51 UTC

    It is not entirely clear what you want to do, but if you need to extract the values of the first cell in TRs whose class name starts with 'violet', then you could use HTML::TreeBuilder::XPath:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new; # empty tree
    $tree->parse_file( 'myfile.html');

    my @values= $tree->findvalues( '//tr[@class=~/^violet[0-9]/]/td[1]');
    foreach my $value (@values) {
        print $value, "\n";
    }

Re: parsing html
by ramrod (Priest) on May 14, 2009 at 15:29 UTC
    Out of curiosity, did you try to use HTML::Element?

    I searched CPAN and came across HTML::Parser. I would start there if I were doing this on my own. The documentation has examples.

    At any rate, try these modules and post the problems/errors you receive. There's a better chance of receiving the advice you seek that way.

      Now I'll paste my code, the one I used, and the error message. I didn't paste it before so as not to look as stupid as I am... :-(

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use LWP::Simple;

      my $url3="http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253";
      my $content=get $url3;

      use HTML::TreeBuilder;
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_file($content);

      $tree->delete;

      use HTML::Element;

      my @elements = my $element->find('b',);
      my @anchors = $element->look_down('_tag' => 'b');
      print "@elements\n";

      And now the error: Can't call method "find" on an undefined value at test.pl line 17. I don't know how to select the string between "b" and "/b", because I don't actually know HTML, and I don't understand the syntax...

        Nearly! :-)
        #!/usr/bin/perl
        use warnings;
        use strict;
        use HTML::TreeBuilder;

        my $html = do{local $/;<DATA>};
        my $p = HTML::TreeBuilder->new;
        $p->parse_content($html); # parse_content if you have a string

        my @tds = $p->look_down(_tag => q{td}); # get a list of all the td tags
        for my $td (@tds){
            my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
            if ($bold){
                print $bold->as_text, qq{\n}; # if there is one print the text
            }
        }
        $p->delete; # when you've finished with it

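        For what it's worth, the original script seems to trip on three things: parse_file expects a file name (or filehandle) rather than the page content, $tree->delete is called before anything is read from the tree, and "my $element" declares a brand-new, undefined variable, which is where the "undefined value" error comes from. Below is a minimal sketch of the same approach wrapped in a sub so it can be fed the string returned by LWP::Simple's get. The name bold_cell_text and the sample rows are assumptions made up for illustration; they are not part of HTML::TreeBuilder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# bold_cell_text is a hypothetical helper name (not a module function).
# It returns the text of the first <b> element inside each <td>.
sub bold_cell_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # takes a string; parse_file wants a file name

    my @names;
    for my $td ( $tree->look_down( _tag => 'td' ) ) {
        # In scalar context look_down returns the first match, or undef.
        my $bold = $td->look_down( _tag => 'b' );
        push @names, $bold->as_text if $bold;
    }

    $tree->delete;    # free the tree only after the text has been copied out
    return @names;
}

# Sample rows modelled on the question; a real page would be the
# string returned by LWP::Simple's get($url).
my $sample = <<'HTML';
<table>
<TR class="violet3"><TD><B>hsa-miR-651</B></TD><TD>17.1922</TD></TR>
<TR class="violet2"><TD><B>hsa-miR-320</B></TD></TR>
</table>
HTML

print "$_\n" for bold_cell_text($sample);
```

        With the live page, something like "my $content = get $url3;" followed by "bold_cell_text($content)" should do, after checking that get did not return undef.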
        Aside from understanding wfsp's solution, definitely check out the Documentation section of HTML-Tree for some articles for relative beginners that certainly aided my understanding of OO modules, HTML, tree structures, and parsing.