Re: getting the first n printable words from a string of HTML

*UPDATE* It occurred to me that my code will cause a runtime error or weird behaviour if any element in @list contains regex metachars, as these get interpolated into the "eat it up" regex. To fix this we need to escape all these chars. Here is the patched code.

tachyon

my $html = "<h1>F(oo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('F(oo','Bar');

# you need to make the elements in @list regex
# friendly by backslashing all the metachars
# comment out this line to see this script choke
# on the ( in F(oo
s/([\$\^\*\(\)\+\{\[\\\|\.\?])/\\$1/g for @list;

# eat up the bits in @list
$html =~ m/$_/gc for @list;

# use \G to match the rest, starting where the last m//gc left off
my ($rest) = $html =~ m/\G(.*)$/;
print $rest;
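For anyone unsure what the m//gc and \G pair is doing, here is a minimal sketch of the position bookkeeping. It assumes a simplified $html with no metachars in the words, and the variable names $after_foo and $after_bar are made up for illustration.

```perl
use strict;
use warnings;

my $html = "<h1>Foo</h1><p>Bar</p><p>Some more text here</p>";

# With /g in scalar context, a successful match records where it ended
# in pos($html); /c keeps that position even if a later match fails.
$html =~ m/Foo/gc or die "Foo not found\n";
my $after_foo = pos $html;    # offset just past "Foo"

# The next /g match starts searching from the saved position.
$html =~ m/Bar/gc or die "Bar not found\n";
my $after_bar = pos $html;    # offset just past "Bar"

# \G anchors this match at pos($html), so (.*) grabs whatever is left.
my ($rest) = $html =~ m/\G(.*)$/;

print "pos after Foo: $after_foo\n";
print "pos after Bar: $after_bar\n";
print "rest: $rest\n";
```

Because each match resumes from the previous pos, the words in @list have to appear in the string in the same order as they appear in the list.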

*Update* added \) which slipped through the net. Caught by chipmunk. chipmunk also points out that quotemeta is a good solution, but my pride won't allow me to use it because it is both shorter and more elegant!

# s/([\$\^\*\(\)\+\{\[\\\|\.\?])/\\$1/g for @list;
$_ = quotemeta $_ for @list;
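Putting the quotemeta variant together, a rough sketch of the whole script might look like the following. It uses block-form foreach loops instead of the statement-modifier for (which, as far as I know, only arrived in perl 5.005), and the names $word, $safe and $eaten are my own additions. It also uses pos to show where the eaten portion ends.

```perl
use strict;
use warnings;

my $html = "<h1>F(oo</h1><p>Bar</p><p>Some more text here</p>";
my @list = ('F(oo', 'Bar');

# Escape each element with quotemeta, then let m//gc advance
# pos($html) past it.
foreach my $word (@list) {
    my $safe = quotemeta $word;    # 'F(oo' becomes 'F\(oo'
    $html =~ m/$safe/gc
        or die "'$word' not found after the previous match\n";
}

# Everything before pos($html) is what the loop ate;
# \G picks up from that same position.
my $eaten  = substr($html, 0, pos($html));
my ($rest) = $html =~ m/\G(.*)$/;

print "eaten: $eaten\n";
print "rest:  $rest\n";
```

Copying into $safe leaves @list untouched, which is handy if the original words are needed again later.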


Replies are listed 'Best First'.
Re: Re: getting the first n printable words from a string of HTML
by kiz (Monk) on May 30, 2001 at 19:04 UTC

I tried this on my own system (perl 5.6.0) and the script runs; however, on our main server (Perl 5.004_04) it gives a syntax error on the substitution command and on the following match line (though not on the final match command, huh?).
As an extra challenge, can you solve the problem for perl 5.004_04?

Also, $rest contains the text after the elements in @list, not the subset that matches the contents of @list, which is the bit I'm after.

If it helps, I'm guaranteed that the segment I'm looking for is always at the start of the string :)

-- Ian Stuart
A man depriving some poor village, somewhere, of a first-class idiot.
