Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This, in some ways, follows on from the discussions on extracting the raw text from an HTML source (see the How to get HTML::Parser to return a line of parsed text thread).

What I would like to do is to extract the first n words of printable data from a string of HTML text.

Using the code:

use HTML::Parser;

# Create a new, empty, scalar
my $text = "";

# Define what the parser does
my $p = HTML::Parser->new( text_h => [ sub { $text .= shift }, 'dtext' ] );

# .. and parse!
$p->parse($full_text);
$p->eof;    # flush any text the parser is still buffering

Based on this, it is quite easy to then get the first n words:

# now hack off the first lump of words
@list_of_words = split /[ \t\r\f]+/, $text, $n;
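One caveat with the split above: the third argument limits the number of fields, so the last field returned holds the whole unsplit remainder of the text rather than the nth word. The character class also leaves out \n. A minimal sketch of one way round both points follows. The helper name first_words is hypothetical, not something from the original post.

```perl
use strict;
use warnings;

# first_words is a hypothetical helper name, not part of the original post.
# Returns up to $n whitespace-separated words from the start of $text.
sub first_words {
    my ($text, $n) = @_;
    # split ' ' skips leading whitespace and splits on runs of any whitespace.
    # Ask for one extra field: with a limit, the last field is the remainder.
    my @fields = split ' ', $text, $n + 1;
    # Drop the remainder field if the text had more than $n words.
    pop @fields if @fields > $n;
    # Trailing whitespace may leave an empty last field; discard it.
    return grep { length } @fields;
}

my @words = first_words("  Foo Bar\nSome more text here", 3);
print join('|', @words), "\n";
```

Asking for $n + 1 fields and throwing the last one away keeps split from scanning the whole document when only a handful of words is wanted.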

However, how do I then correlate the words from @list_of_words to the start of the HTML text in $full_text?
(The plan being that I can do the "blah blah blah (more...)" thing...)

Edit: chipmunk 2001-05-29

Re: getting the first n printable words from a string of HTML
by tachyon (Chancellor) on May 29, 2001 at 20:39 UTC

    You can correlate the words in the list with the HTML as shown below:

    my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";
    my @list = ('Foo','Bar');
    # eat up the bits in @list
    $html =~ m/$_/gc for @list;
    #use \G to match the rest
    ($rest) = $html =~ m/\G(.*)$/;
    print $rest;

    Here we use /gc and the \G assertion to first eat up the string by matching the words in @list in sequence, and then match the rest of the string starting just past the last match.
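To make the moving parts visible, here is a small sketch of the same idea that prints pos() as it goes. It assumes each match is evaluated in scalar context, uses \Q...\E to keep the words literal, and adds a deliberately failing match to show what /c changes. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";

# Each scalar-context m//g starts where the previous match on $html ended
# and records the new end position in pos($html).
for my $word ('Foo', 'Bar') {
    if ($html =~ m/\Q$word\E/gc) {
        print "matched '$word', pos is now ", pos($html), "\n";
    }
}

# A failed m//g would normally reset pos($html); with /c it is left alone.
my $found = $html =~ m/no such text/gc;
my $end   = pos($html);
print "after a failed /gc match pos is still $end\n";

# \G anchors the next match at pos($html), so this captures what is left.
my ($rest) = $html =~ m/\G(.*)$/;
print "$rest\n";
```

Without the /c, the failed match would throw the position away and the \G match would start again from the beginning of the string.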

    Note there are some circumstances where this will fail, such as when an element in @list matches an HTML tag or part thereof.
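A sketch of that failure mode, using a made-up snippet in which the first printable word also appears as an attribute value:

```perl
use strict;
use warnings;

# Hypothetical input: the first printable word also appears in an attribute.
my $html = '<p class="intro">intro text</p> tail';
my @list = ('intro');

for my $word (@list) {
    $html =~ m/\Q$word\E/gc or last;
}

# The match landed inside the class attribute, not on the printable text,
# so the "rest" starts in the middle of the opening tag.
my ($rest) = $html =~ m/\G(.*)$/;
print "$rest\n";
```

Because the regex sees only a flat string, it cannot tell markup from text; avoiding this reliably means walking the document with a parser instead.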

    It should work for most practical circumstances I think.

    tachyon

      I'm confused.

      tachyon posted an interesting solution, but i can't make any sense of it.

      first of all, i could find no mention of /c in perlman:perlre, but perl didn't barf on it -- so what does it do?

      second, i tried the following near-identical code and it didn't do at all what i expected:
      #!/usr/bin/perl -w
      use strict;
      my $string = "foo bar baz bar quux foo gin";
      my @list = qw(foo bar);
      $string =~ m/\Q$_\E/gc for @list;
      my ($rest) = $string =~ m/\G(.*)$/;
      print "$rest\n";
      .. it printed:
      foo bar baz bar quux foo gin
      and honestly, i can't see how it could possibly work the way you wanted it to. it seems to me that the /g on the m// would cause the 'foo' to match at the first and sixth words, and then the 'bar' to match at words 2 and 4, leaving the \G assertion after word 4, not after word 2 where i think it belongs.

      obviously, i'm missing some of what's going on here. can you enlighten me?

Re: getting the first n printable words from a string of HTML
by kiz (Monk) on May 29, 2001 at 18:58 UTC

    (This was actually entered by me, but I guffed the login bit and hadn't realised)

    -- Ian Stuart
    A man depriving some poor village, somewhere, of a first-class idiot.
    
Re: getting the first n printable words from a string of HTML
by tachyon (Chancellor) on May 30, 2001 at 07:07 UTC

    *UPDATE* It occurred to me that my code will cause a runtime error or weird behaviour if any element in @list contains regex metachars as these will be interpolated into the eat it up regex. To fix this we need to escape all these chars. Here is the patched code.

    tachyon

    my $html = "<h1>F(oo</h1><p>Bar</p><p>Some more text here</p>";
    my @list = ('F(oo','Bar');
    # you need to make the elements in @list regex
    # friendly by backslashing all the metachars
    # comment out this line to see this script choke
    # on the ( in F(oo
    s/([\$\^\*\(\)\+\{\[\\\|\.\?])/\\$1/g for @list;
    # eat up the bits in @list
    $html =~ m/$_/gc for @list;
    #use \G to match the rest
    ($rest) = $html =~ m/\G(.*)$/;
    print $rest;

    *Update* added \) which slipped through the net. Caught by chipmunk. chipmunk also points out that quotemeta is a good solution but my pride won't allow me to use it because it is both shorter and more elegant!

    # s/([\$\^\*\(\)\+\{\[\\\|\.\?])/\\$1/g for @list;
    $_ = quotemeta $_ for @list;
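For comparison, a short sketch of the two built-in ways of doing the escaping: quotemeta on a copy of the list, and \Q...\E at the point of interpolation. The @escaped array is only there to show what quotemeta produces.

```perl
use strict;
use warnings;

my $html = "<h1>F(oo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('F(oo', 'Bar');

# quotemeta backslashes every ASCII non-word character,
# so "F(oo" becomes "F\(oo" and "Bar" is left as it is.
my @escaped = map { quotemeta } @list;
print "$_\n" for @escaped;

# \Q...\E applies the same escaping inline, so @list itself stays untouched.
for my $word (@list) {
    $html =~ m/\Q$word\E/gc or warn "could not find '$word'\n";
}

my ($rest) = $html =~ m/\G(.*)$/;
print "$rest\n";
```

Using \Q...\E has the side benefit that the original words are still available unescaped for display afterwards.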

      I tried this on my own system (perl 5.6.0) and the script runs. However, on our main server (Perl 5.004_04) it gives a syntax error on the substitution command and on the following match line (though not on the final match command, huh?).
      As an extra challenge, can you solve the problem for perl 5.004_04?

      Also, $rest contains the text after the elements in @list, not the subset that matches the contents of @list - which is the bit I'm after.

      If it helps, I'm guaranteed that the segment I'm looking for is always at the start of the string

      :)

      -- Ian Stuart
      A man depriving some poor village, somewhere, of a first-class idiot.
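A possible sketch covering both points raised above. As far as I recall, the statement-modifier form of for/foreach only arrived in perl 5.005, which would explain the syntax errors on exactly those two lines, so this version uses an ordinary block loop. It also takes the leading segment with substr and pos() rather than capturing the remainder. It has not been run on 5.004_04, so treat the compatibility claim as an assumption.

```perl
# Written without "use warnings" or statement-modifier "for", on the
# assumption (untested here) that this keeps it acceptable to perl 5.004.
use strict;

my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('Foo', 'Bar');

# Block-form loop instead of "... for @list".
my $word;
foreach $word (@list) {
    $html =~ m/\Q$word\E/gc or last;
}

# pos($html) is the offset just past the last word matched, so everything
# before it is the leading chunk of HTML that contains the words.
my $end = pos($html);
$end = 0 unless defined $end;
my $lead = substr($html, 0, $end);
print "$lead\n";
```

Note that the leading chunk can stop with tags still open (here the second paragraph is never closed), so it is not safe to print as it stands.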

Re: getting the first n printable words from a string of HTML
by kiz (Monk) on May 31, 2001 at 17:15 UTC

    I have a solution, which solves the problem a different way:

    #!/usr/cpan/bin/perl
    use HTML::TokeParser;
    use Data::Dumper;

    my $full_text =
          '<P><img src="/images/logo.png" alt="logo" /> '
        . '<div class="nice_colours"><a href="http://edina.ac.uk/">EDINA</a> and '
        . '<a href="http://mimas.ac.uk/">MIMAS</A> are pleased to announce a new set of '
        . '<a href="http://digimap.ac.uk">EDINA Digimap</a> training dates</div> '
        . '(Modules One and Two only) at the <B>Universities of <I>Middlesex</I> '
        . 'and <I>Edinburgh</B></I>. <!-- This is a comment --> '
        . '<P>More details of Digimap training on the '
        . '<a href="http://edina.ac.uk/events/">events page</A>.';

    my $count = 25;
    my $p = HTML::TokeParser->new(\$full_text);
    my @display_elements = ();
    my $count_of_words = 0;
    my @tag_stack = ();
    my @problem_tag_stack = ();

    # The plan is to get one token at a time, and process it
    #   If the token is a start-tag, add the tag to the
    #     tag-stack, and add the raw text to the display-list
    #   If the token is an end-tag, then it should match the top
    #     tag on the tag-stack, so we pop that off (to show
    #     it's not outstanding), and add the raw text to the
    #     display-list
    #   If the token is a comment-tag, we just skip it
    #   If the token is text, we add it to the display-list,
    #     counting the words as we do so - stopping once we
    #     have $count listed.
    #
    # Once we have the requisite number of words, we then close
    # all the elements still left in the tag-stack
    #
    while ($token_ref = $p->get_token) {
        last unless ($count_of_words - $count);    # drop out when at $count

        if ($token_ref->[0] eq 'T') {
            # we have some text, so count it and stack it
            my @local_words = split /\b/, $token_ref->[1];    # split on boundaries
            foreach $_ (@local_words) {
                push @display_elements, $_;
                $count_of_words++ if /\w+/;    # only count when "words" are present
                last unless ($count_of_words - $count);    # drop out when at $count
            }
        }    # end of text

        if ($token_ref->[0] eq 'S') {
            # We have the start of a tag
            next if ($token_ref->[1] =~ /^img/i);
            # push the raw HTML onto display-list
            push @display_elements, pop @$token_ref;
            # push the tag name onto the tag stack, so that we
            # know which closing tag is still outstanding
            push @tag_stack, $token_ref->[1];
        }    # end of start tag

        if ($token_ref->[0] eq 'E') {
            # We have the end of a tag
            # push the raw HTML onto display-list
            push @display_elements, pop @$token_ref;    # the raw HTML
            # now to pop it off the stack (hopefully)
            my $tag = $token_ref->[1];
            my $top_tag = pop @tag_stack;
            # put it back if it was not the tag being closed
            push @tag_stack, $top_tag
                if defined $top_tag and $tag ne $top_tag;
        }    # end of end tag
    }

    # Now we need to close any outstanding tags, innermost first.
    # We have a list of element names in the order they were opened,
    # so we close them in reverse.
    foreach (reverse @tag_stack) {
        my $tag = "</$_>";
        push @display_elements, $tag;
    }

    $text = join '', @display_elements;
    print "Full text: $full_text\n\nLeader: $text\n";

    -- Ian Stuart
    A man depriving some poor village, somewhere, of a first-class idiot.