parsing html

paola82 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a problem with parsing web pages, the same problem I run into every time I have to parse data like this. I have an HTML page; here is part of it:

<TR class="violet3">
    <TD ><B>hsa-miR-107</B></TD>
    <TD >17.1922</TD>
    <TD >-21.47</TD>
    <TD >2.119850e-02</TD>
    <TD >2.097540e-02</TD>

    <TD >6.191350e-04</TD>
    <TD >106</TD>
    <TD >127</TD>
    <TD ><pre><FONT COLOR="#FFFFFF">a</FONT><FONT COLOR="#FFFFFF">c</FONT><FONT COLOR="#FFFFFF">u</FONT><FONT 
....
</TR>
<TR class="violet2">
    <TD ><B>hsa-miR-103</B></TD>
    <TD >17.1922</TD>
    ....
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    </TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>

I need to extract hsa-miR-651, hsa-miR-320, etc. What should I do? Regular expressions don't help me, and I don't understand how to use modules like HTML::Element. I don't understand the syntax, and I'm not actually sure whether it is OK to use it. Can anyone help me? Thank you all.


Re: parsing html by wfsp (Abbot) on May 14, 2009 at 15:55 UTC
HTML::TokeParser::Simple is a good HTML parser to get started with and the docs are friendly. :-) I'm not too sure about your spec, but if it is to get the text within bold tags that are within td tags, then something like this will get you started.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TokeParser::Simple;

my $html = do{local $/;<DATA>};
my $p = HTML::TokeParser::Simple->new(string => $html);

my ($in_td, $in_b);
while (my $t = $p->get_token) {
    $in_td++, next if $t->is_start_tag(q{td});
    $in_b++,  next if $in_td and $t->is_start_tag(q{b});
    next unless $in_td and $in_b;
    if ($t->is_text){
        print $t->as_is, qq{\n};
        $in_td = 0;
        $in_b  = 0;
    }
}
__DATA__
<TR class="violet3">
    <TD ><B>hsa-miR-107</B></TD>
    <TD >17.1922</TD>
    <TD >-21.47</TD>
    <TD >2.119850e-02</TD>
    <TD >2.097540e-02</TD>
    <TD >6.191350e-04</TD>
    <TD >106</TD>
    <TD >127</TD>
    <TD ><pre><FONT COLOR="#FFFFFF">a</FONT><FONT COLOR="#FFFFFF">c</FONT><FONT COLOR="#FFFFFF">u</FONT><FONT></FONT>
</TR>
<TR class="violet2">
    <TD ><B>hsa-miR-103</B></TD>
    <TD >17.1922</TD>
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    </TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>
```

```
hsa-miR-107
hsa-miR-103
hsa-miR-651
hsa-miR-320
```

Good luck!
Re: parsing html by mirod (Canon) on May 14, 2009 at 15:51 UTC
It is not entirely clear what you want to do, but if you need to extract the values of the first cell in TRs whose class name starts with 'violet', then you could use HTML::TreeBuilder::XPath:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new; # empty tree
$tree->parse_file('myfile.html');

my @values = $tree->findvalues('//tr[@class=~/^violet[0-9]/]/td[1]');

foreach my $value (@values) {
    print $value, "\n";
}
```
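The `=~` match in that expression is, as far as I know, an extension offered by the XPath engine behind HTML::TreeBuilder::XPath rather than standard XPath. A minimal sketch of the same query using only the standard `starts-with()` function follows; the sub name `first_cells` and the `$sample` string are made up for illustration, and the HTML is parsed from a string instead of `myfile.html`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: text of the first cell of every row
# whose class attribute starts with "violet".
sub first_cells {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);    # parse from a string
    my @values = $tree->findvalues('//tr[starts-with(@class, "violet")]/td[1]');
    $tree->delete;                  # free the tree once we are done with it
    return @values;
}

# Sample data taken from the question, wrapped in a table (an assumption).
my $sample = <<'HTML';
<table>
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    <TD >17.1922</TD>
</TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>
</TR>
</table>
HTML

print "$_\n" for first_cells($sample);
```

Note that `starts-with()` is slightly looser than the regex: it would also match a class such as `violet` with no digit after it.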
Re: parsing html by ramrod (Priest) on May 14, 2009 at 15:29 UTC
Out of curiosity, did you try to use HTML::Element? I searched CPAN and came across HTML::Parser. I would start there if I were doing this on my own. The documentation has examples. At any rate, try these modules and post the problems/errors you receive. There's a better chance of receiving the advice you seek that way.
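For comparison, here is a minimal sketch of what the lower-level HTML::Parser route might look like for this task. It assumes the goal is simply "collect the text inside every `<b>` element"; the sub name `bold_text` and the `$sample` string are hypothetical and not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect the text found inside <b>...</b> elements.
sub bold_text {
    my ($html) = @_;
    my $in_b = 0;    # how many <b> elements we are currently inside
    my @found;

    my $p = HTML::Parser->new(
        api_version => 3,
        # 'tagname' is passed lowercased, so <B> arrives as 'b'
        start_h => [ sub { $in_b++ if $_[0] eq 'b' }, 'tagname' ],
        end_h   => [ sub { $in_b-- if $_[0] eq 'b' && $in_b }, 'tagname' ],
        # 'dtext' is the text with entities decoded
        text_h  => [ sub { push @found, $_[0] if $in_b }, 'dtext' ],
    );
    $p->unbroken_text(1);    # deliver each run of text in one piece
    $p->parse($html);
    $p->eof;                 # signal end of input
    return @found;
}

# Sample rows taken from the question.
my $sample = <<'HTML';
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    <TD >17.1922</TD>
</TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>
</TR>
HTML

print "$_\n" for bold_text($sample);
```

This event-driven style means keeping track of state by hand, which is why the tree-based and token-based modules in the other replies are usually easier to start with.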
Re^2: parsing html by paola82 (Sexton) on May 14, 2009 at 15:58 UTC
Now I'll paste my code, the one I used, and the error message. I didn't paste it before so as not to look as stupid as I am :-(

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url3="http://microrna.sanger.ac.uk/cgi-bin/targets/v5/detail_view.pl?transcript_id=ENST00000226253";
my $content=get $url3;

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($content);
$tree->delete;

use HTML::Element;
my @elements = my $element->find('b',);
my @anchors = $element->look_down('_tag' => 'b');
print "@elements\n";
```

And now the error:

```
Can't call method "find" on an undefined value at test.pl line 17
```

I don't know how to select the string between "b" and "/b" because I don't actually know HTML, and I don't understand the syntax.
Re^3: parsing html by wfsp (Abbot) on May 14, 2009 at 17:06 UTC
Nearly! :-)

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $html = do{local $/;<DATA>};
my $p = HTML::TreeBuilder->new;
$p->parse_content($html); # parse_content if you have a string

my @tds = $p->look_down(_tag => q{td}); # get a list of all the td tags
for my $td (@tds){
    my $bold = $td->look_down(_tag => q{b}); # look for a bold tag
    if ($bold){
        print $bold->as_text, qq{\n}; # if there is one print the text
    }
}
$p->delete; # when you've finished with it
__DATA__
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    </TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>
```
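To spell out what changed relative to the earlier attempt: `parse_file` expects a file name or handle, so a page already fetched into a string goes through `parse_content`; the tree has to stay alive until the last lookup, so `delete` moves to the end; and the lookups are called on the tree itself rather than on an undeclared `$element`. A small variation is sketched below, assuming the goal is the first cell of each "violet" row. `look_down` accepts a regex as an attribute criterion, and the sub name `first_cell_names` and the `$sample` string (including its `header` row) are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: text of the first cell of each row whose
# class looks like "violet" followed by a digit.
sub first_cell_names {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # takes a string, e.g. the result of LWP::Simple's get

    my @names;
    # An attribute criterion may be a regex; this keeps only the "violet" rows.
    for my $row ($tree->look_down(_tag => 'tr', class => qr/^violet\d/)) {
        my $cell = $row->look_down(_tag => 'td') or next;    # first td, if any
        push @names, $cell->as_text;
    }
    $tree->delete;                  # free the tree only after the last use
    return @names;
}

# Sample rows from the question, plus an invented non-matching row.
my $sample = <<'HTML';
<table>
<TR class="header">
    <TD >Name</TD>
</TR>
<TR class="violet3">
    <TD ><B>hsa-miR-651</B></TD>
    <TD >17.1922</TD>
</TR>
<TR class="violet2">
    <TD ><B>hsa-miR-320</B></TD>
</TR>
</table>
HTML

print "$_\n" for first_cell_names($sample);
```

With a live page, the string passed in would be the return value of `get $url3`, after checking that it is defined.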
quite SOLVED Re^4: parsing html by paola82 (Sexton) on May 15, 2009 at 09:10 UTC
Re: quite SOLVED Re^4: parsing html by wfsp (Abbot) on May 15, 2009 at 09:28 UTC
Some notes below your chosen depth have not been shown here
Re^3: parsing html by whakka (Hermit) on May 14, 2009 at 17:14 UTC
Aside from understanding wfsp's solution, definitely check out the Documentation section of HTML-Tree for some articles for relative beginners that certainly aided my understanding of OO modules, HTML, tree structures, and parsing.