http://qs1969.pair.com?node_id=574282

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Given this type of html (stored in a variable), how can I extract the string after the last <pre> tag:
$VAR1 = '<html><title>GAL7</title> <body bgcolor=white> <h2 align=center>GAL7</h2><hr> <form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="appl +ication/x-www-form-urlencoded"> <input type="submit" name="action" value="Get mapped sites" /><input t +ype="submit" name="action" value="Get putative sites" /><input type=" +submit" name="action" value="Get interg enic region" /><br /><input type="submit" name="action" value="Retriev +e sequence" />Start<-ATG <input type="text" name="start" value="-450" + size="5" maxlength="5" />ATG->End <inp ut type="text" name="end" value="50" size="5" maxlength="5" /><div></d +iv></form><hr> <pre> >YBR018C GAL7 275433 275933 TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA ';
such that it returns simply:
my $new_output = ' >YBR018C GAL7 275433 275933 #this fasta marker line is to be + kept TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA ';
I couldn't figure out how to create mechanism that can distinguished between html tag and 'fasta' tag marked by ">".

Regards,
Edward

Replies are listed 'Best First'.
Re: Extracting Text After <pre> tag in HTML
by GrandFather (Saint) on Sep 22, 2006 at 01:28 UTC

    Use HTML::TreeBuilder:

    use strict; use warnings; use HTML::TreeBuilder; my $str = '<html><title>GAL7</title> <body bgcolor=white> <h2 align=center>GAL7</h2> <hr> <form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="appl +ication/x-www-form-urlencoded"> <input type="submit" name="action" value="Get mapped sites" /> <input type="submit" name="action" value="Get putative sites" /> <input type="submit" name="action" value="Get interg enic region" /><br /> <input type="submit" name="action" value="Retrieve sequence" />Start<- +ATG <input type="text" name="start" value="-450" size="5" maxlength="5" /> +ATG->End <input type="text" name="end" value="50" size="5" maxlength="5" /> <div></div></form> <hr> <pre> >YBR018C GAL7 275433 275933 TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA </pre>'; my $tree = HTML::TreeBuilder->new; $tree->parse ($str); print $_->as_text () . "\n" for $tree->find ('pre');

    Prints:

    >YBR018C GAL7 275433 275933 TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA

    Update: Fixed link


    DWIM is Perl's answer to Gödel
Re: Extracting Text After <pre> tag in HTML
by graff (Chancellor) on Sep 22, 2006 at 01:35 UTC
    Well, given the sample of data you've shown, this would do what you want (assuming the string is in $_):
    s{.*<pre>}{}s;
    That is, delete everything up to and including the "pre" tag. Note the "m" modifier at the end, so that "." is allowed to match "\n".

    Now, if there's also a  </pre> tag that you're not showing us, and more html data after that, you'll probably want to get rid of that as well:

    s{</pre>.*}{}s;
    Of course, if a given html page contains more than one "pre" segment, you'll need to be more careful. Ultimately, you might need to actually read the manual page for an HTML parsing module, and start using it, because that would be the preferred approach for this sort of thing.

    But if the data are consistently as simple as your sample, a couple regex substitutions will probably suffice.

    (updated my regexes to use the "s" modifier as intendedm, rather than the "m" modifier. Thanks, mreece!!)

      Note the "m" modifier at the end, so that "." is allowed to match "\n".
      you have that backwards! /s allows . to match \n, not /m.
Re: Extracting Text After <pre> tag in HTML
by gellyfish (Monsignor) on Sep 22, 2006 at 21:25 UTC

    Just for the sake of completeness here's how you might do it with HTML::Parser:

    use HTML::Parser; my $VAR1 = '<html><title>GAL7</title> <body bgcolor=white> <h2 align=center>GAL7</h2><hr> <form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="appl +ication/x-www-form-urlencoded"> <input type="submit" name="action" value="Get mapped sites" /><input t +ype="submit" name="action" value="Get putative sites" /><input type=" +submit" name="action" value="Get interg enic region" /><br /><input type="submit" name="action" value="Retriev +e sequence" />Start<-ATG <input type="text" name="start" value="-450" + size="5" maxlength="5" />ATG->End <inp ut type="text" name="end" value="50" size="5" maxlength="5" /><div></d +iv></form><hr> <pre> >YBR018C GAL7 275433 275933 TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA </pre>Some other stuff</body></html>'; sub default_start { my ($self, $tagname) = @_; if ( $tagname eq 'pre' ) { $self->handler(text => \&get_text, "self,dtext"); $self->handler(end => \&end_text, "self,tagname"); } } sub get_text { my ($self, $text) = @_; if ( not exists $self->{_text} ) { $self->{_text} = $text; } else { $self->{_text} .= $text; } } sub end_text { my ( $self, $tagname) = @_; if ( $tagname eq 'pre' ) { $self->handler(text => ''); $self->handler(start => ''); $self->handler(end => ''); } } my $parser = HTML::Parser->new(start_h => [\&default_start,'self,tagna +me']); $parser->parse($VAR1); print $parser->{_text};
    This might have the advantage over using other parsers if you are dealing with large documents as it doesn't build a preparsed representation of the documentation before handing the events to you.

    /J\

      how to work
       PRE tag within text area
Re: Extracting Text After <pre> tag in HTML
by radiantmatrix (Parson) on Sep 27, 2006 at 15:02 UTC

    In case you haven't been given enough Ways To Do It, my first thought was HTML::TokeParser::Simple:

    use HTML::TokeParser::Simple; my $p = HTML::TokeParser::Simple->new( file => 'test_data.html' ); my $t; #token; my @text; #get all text between pre tags while ($t = $p->get_token) { next unless $t->is_start_tag('pre'); my $content; while ($t = $p->get_token) { last if $t->is_end_tag('pre'); $content .= $t->as_is; } push @text, $content; }

    I'm guessing this isn't the fastest approach... but hey, TMTOWTDI.

    <radiant.matrix>
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet
Re: Extracting Text After <pre> tag in HTML
by mreece (Friar) on Sep 22, 2006 at 20:28 UTC
    if you want to do this with regular expressions, which is in most cases a bad idea (arguably unless you know the precise structure of your html, such as being darned certain there won't be nested or unmatched tags, etc) .. consider:
    ## OP specified 'last <pre>' tag, ## so assume there can be more than one <pre>..</pre> block ## find all <pre> blocks, using non-greedy .*? and also ## get \n in the case where the html ends with a newline and no </pre> ## anchor to non-capturing match for closing </pre> or end of string my @pre = ( $VAR1 =~ m{<pre>(.*?\n?)(?:</pre>|$)}isg ); ## we want the last one my $new_output = pop @pre;