in reply to getting the first n printable words from a string of HTML

You can correlate the words in the list with the HTML as shown below:

my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('Foo','Bar');
# eat up the bits in @list
$html =~ m/$_/gc for @list;
# use \G to match the rest
($rest) = $html =~ m/\G(.*)$/;
print $rest;

Here we use /gc and the \G assertion to first eat up the string by matching the words in @list in sequence, and then match the rest of the string starting just past the last match.
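
To see what /gc and \G are doing, it can help to print pos() after each step. The following is only a minimal sketch: it reuses the $html and @list from the snippet above, while @positions and the loop variable $word are names introduced here for illustration. It assumes each match is evaluated in scalar context, where m//g finds one match per evaluation and continues from the current pos().

```perl
use strict;
use warnings;

my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('Foo', 'Bar');

my @positions;    # illustrative helper: records pos() after each word
for my $word (@list) {
    # Scalar (boolean) context: one match per evaluation, starting at pos($html).
    # /c leaves pos() where it was if the match fails instead of resetting it.
    if ($html =~ m/\Q$word\E/gc) {
        push @positions, pos($html);
    }
}
print "pos after each word: @positions\n";

# \G anchors at the current pos(), so this captures everything after 'Bar'.
my ($rest) = $html =~ m/\G(.*)$/s;
print "$rest\n";
```

The \Q...\E quoting is an addition here so that a word containing regex metacharacters is matched literally.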

Note there are some circumstances where this will fail, such as when an element in @list matches an HTML tag, or part thereof, that appears before the word itself.
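
A small sketch of that failure mode, using a made-up $html string and word list (neither is from the original question): the word 'a' first occurs inside the tag name 'span', so the match stops there and the "rest" begins in the middle of the tag.

```perl
use strict;
use warnings;

# Hypothetical input chosen so the word also occurs inside a tag name.
my $html = "<span>a word</span>";
my @list = ('a');

# Same technique as above: advance pos() past each word in turn.
for my $word (@list) {
    $html =~ m/\Q$word\E/gc or last;
}

# pos() now sits just after the 'a' in '<span', not after the visible word.
my ($rest) = $html =~ m/\G(.*)$/s;
print "$rest\n";
```

Here $rest starts with the tail of the tag name rather than with " word", which is the kind of mismatch the caveat describes.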

It should work for most practical circumstances, I think.

tachyon

Re: Re: getting the first n printable words from a string of HTML
by Vynce (Friar) on May 30, 2001 at 17:37 UTC

    I'm confused.

    tachyon posted an interesting solution, but i can't make any sense of it.

    first of all, i could find no mention of /c in perlman:perlre, but perl didn't barf on it -- so what does it do?

    second, i tried the following near-identical code and it didn't do at all what i expected:
    #!/usr/bin/perl -w
    use strict;
    my $string = "foo bar baz bar quux foo gin";
    my @list = qw(foo bar);
    $string =~ m/\Q$_\E/gc for @list;
    my ($rest) = $string =~ m/\G(.*)$/;
    print "$rest\n";
    .. it printed:
    foo bar baz bar quux foo gin
    and honestly, i can't see how it could possibly work the way you wanted it to. it seems to me that the /g on the m// would cause the 'foo' to match at the first and sixth words, and then the 'bar' to match at words 2 and 4, leaving the \G assertion after word 4, not after word 2 where i think it belongs.

    obviously, i'm missing some of what's going on here. can you enlighten me?