Re: Problem with parsing HTML with Regex's
by jeffa (Bishop) on Nov 10, 2003 at 13:30 UTC
|
Parsing HTML with regexes should be a last resort, IMHO.
Use a parser instead. That's why it's called HTML
parsing! Here is a demonstration using
HTML::TokeParser::Simple and doc's data.
use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my (@img,@link,@a);
while (my $token = $parser->get_token) {
    if ($token->is_start_tag('img')) {
        push @img, $token->return_attr->{src};
    } elsif ($token->is_start_tag('link')) {
        push @link, $token->return_attr->{href};
    } elsif ($token->is_start_tag('a')) {
        push @a, $token->return_attr->{href};
    }
}
print Dumper \@img,\@link,\@a;
__DATA__
<A href=normal.link2 class="foo" >
<img src="img.link2" alt="foo">
<a class=foo href='normal.link4'>
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<img src="img.link1">
<a href="normal.link3">
<a Href='normal.link5'>
It is not less code than doc's, but it is much more
readable. You might want to throw in error checking for
missing href and src attributes.
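A minimal sketch of that error checking, using only core Perl. The helper name push_attr is made up for illustration and is not part of HTML::TokeParser::Simple; the hashrefs in the loop are stand-ins for what $token->return_attr hands back.

```perl
use strict;
use warnings;

# Hypothetical helper: push the named attribute onto @$list only when it is
# present and non-empty. Returns true if a value was stored, false otherwise.
sub push_attr {
    my ($list, $attr, $name) = @_;
    my $value = $attr->{$name};
    return 0 unless defined $value && length $value;
    push @$list, $value;
    return 1;
}

# Stand-ins for the attribute hashrefs of three <img> start tags.
my @img;
my $missing = 0;
for my $attr ({ src => 'img.link1' }, { alt => 'no source here' }, { src => '' }) {
    push_attr(\@img, $attr, 'src') or $missing++;
}
print "kept: @img; skipped: $missing\n";
```

In the loop above, a line such as push @img, $token->return_attr->{src}; would then become push_attr(\@img, $token->return_attr, 'src'), so tags without the attribute no longer add undef to the list.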
|
Same exact thing using YAPE::HTML ;) (YAPE::HTML is pure Perl, for those who don't know.)
use YAPE::HTML;
use Data::Dumper;
use strict;
use warnings;
my $content = q[
<img src="img.link1">
<img src="img.link2" alt="foo">
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<A href=normal.link2 class="foo" >
<a href="normal.link3">
<a class=foo href='normal.link4'>
<a Href='normal.link5'>
];
my $parser = YAPE::HTML->new($content);
my( @a, @link, @img );
# here is the tokenizing part
while ( my $chunk = $parser->next ) {
    if( $chunk->type eq 'tag' ){
        if( $chunk->tag eq 'a' ){
            push @a,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif( $chunk->tag eq 'link' ){
            push @link,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif($chunk->tag eq 'img'){
            push @img,
                $chunk->get_attr('src')
                    if $chunk->has_attr('src');
        }
    }
}
print Dumper \@img,\@link,\@a;
__END__
$VAR1 = [
'img.link1',
'img.link2',
'img.link3',
'img.link4'
];
$VAR2 = [
'css.link1'
];
$VAR3 = [
'normal.link1',
'normal.link2',
'normal.link3',
'normal.link4',
'normal.link5'
];
Re: Problem with parsing HTML with Regex's
by diotalevi (Canon) on Nov 10, 2003 at 07:27 UTC
|
I already gave you an expression in the chatterbox. It's as close as I think you'll get without the use of some module (which is a really good idea, BTW).
s(
    ((?#1)              # Capture the entire tag beginning
        <img(?s:.+?)
        src\s*=\s*      # Optional space
    )
    ((?#2)['"])         # Capture the delimiter
    ((?#3)(?s:.*?))     # Capture the URL
    \2                  # Use whatever delimiter was used to start the URL
)
{ "$1$2" . resolveimg( $3 ) . $2 }gixe;
|
Doesn't that basically do exactly the same thing as this?
$text =~ s#src=\"(.*?)\"#&resolveimg($1)#sige;
And that wasn't the problem; the problem was the stylesheet links.
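Whether the two substitutions behave the same can be checked by running both on one input. A minimal sketch, assuming a stand-in resolveimg() (the real one is not shown in the thread) and made-up sample HTML; the long form is diotalevi's expression written with group 1 closed before the quote capture.

```perl
use strict;
use warnings;

# Stand-in for the poster's resolveimg(), whose real body is not shown.
sub resolveimg { return "/resolved/$_[0]" }

# Made-up input: single quotes with spaces, double quotes, and a non-img src.
my $html = q{<img alt="a" src = 'one.gif'><img src="two.gif"><script src="app.js">};

# Long form: keeps the tag beginning and reuses whichever quote was found.
(my $long = $html) =~ s(
    (<img(?s:.+?)src\s*=\s*)   # 1: tag beginning up to the value
    (['"])                     # 2: opening quote
    ((?s:.*?))                 # 3: the URL
    \2                         # matching closing quote
){ "$1$2" . resolveimg($3) . $2 }gixe;

# Short form from the reply: any src="..." with double quotes and no spaces.
(my $short = $html) =~ s#src=\"(.*?)\"#&resolveimg($1)#sige;

print "long:  $long\n";
print "short: $short\n";
```

With this stand-in, the long form rewrites both img URLs in place and leaves the script tag alone. The short form skips the single-quoted, space-padded src, rewrites the script src, and replaces the whole src="..." attribute with the function's return value. Whether that last point matters depends on what the real resolveimg() returns.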
Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 07:48 UTC
|
Which doesn't seem to work very well as I'm getting some rather strange output. TIA.
Such as?
Parsing HTML (which is what you're trying to do)
with regular expressions is hard (more so when you're green).
But as usual, there is always CPAN (HTML::StripScripts::Regex, YAPE::HTML).
use Regexp::Common qw /delimited/;
my $text = q~
qqq <img src = "src" >
sss <img src='src' >
~;
$text =~ s~
img \s+ src
\s* \= \s*
(?: $RE{delimited}{-delim=>'"'} |
$RE{delimited}{-delim=>"'"}
)
~bongo~sigx;
print $text,$/,$/;
__END__
qqq <bongo >
sss <bongo >
Re: Problem with parsing HTML with Regex's
by doc (Scribe) on Nov 10, 2003 at 13:05 UTC
|
local $/;
my $data = <DATA>;
# \b after the tag name keeps "<a" from also matching tags such as <abbr> or <area>
my @img = $data =~ m/<\s*img\b[^>]*?\bsrc\s*=\s*['"]?([^"' >\n]+)/gi;
my @css = $data =~ m/<\s*link\b[^>]*?\bhref\s*=\s*['"]?([^"' >\n]+)/gi;
my @lnk = $data =~ m/<\s*a\b[^>]*?\bhref\s*=\s*['"]?([^"' >\n]+)/gi;
use Data::Dumper;
print Dumper \@img, \@css, \@lnk;
__DATA__
<img src="img.link1">
<img src="img.link2" alt="foo">
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<A href=normal.link2 class="foo" >
<a href="normal.link3">
<a class=foo href='normal.link4'>
<a Href='normal.link5'>
__END__
$VAR1 = [
'img.link1',
'img.link2',
'img.link3',
'img.link4'
];
$VAR2 = [
'css.link1'
];
$VAR3 = [
'normal.link1',
'normal.link2',
'normal.link3',
'normal.link4',
'normal.link5'
];
Re: Problem with parsing HTML with Regex's
by ysth (Canon) on Nov 10, 2003 at 07:28 UTC
|
|
Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?
[^=]* is not much better
than .*?.
"A" is not "=", "B" is not "=", "C" is not "=", "D" is not "=", ....
Since he's parsing html, he should replace .*? with \s* (regular expressions are easy to write if you know precisely what you're matching).
|
But who's to say that the input will always look like
"img src..."? It could be "img border" or anything like that.
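A quick way to see the objection is to try a whitespace-only pattern against a tag that carries another attribute first. A small sketch with made-up tags; the second pattern borrows the [^>]* idea from doc's post and is only one possible alternative.

```perl
use strict;
use warnings;

# Made-up sample tags: one with src first, one with border before src.
my @tags = (
    q{<img src="a.gif">},
    q{<img border="0" src="b.gif">},
);

# \s+ only allows whitespace between the tag name and src.
my @strict_hits = grep { /<img\s+src\s*=/i } @tags;

# [^>]* also allows other attributes, but cannot run past the end of the tag.
my @loose_hits = grep { /<img\b[^>]*\bsrc\s*=/i } @tags;

printf "%d of %d matched with \\s+\n",  scalar @strict_hits, scalar @tags;
printf "%d of %d matched with [^>]*\n", scalar @loose_hits,  scalar @tags;
```

The whitespace-only pattern finds just the first tag; the second finds both.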
|
s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#...
and
s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#
(diotalevi's solution may be a better place to start, adjusting the .+ to not allow intervening tags.)
Or you may want to follow this suggestion; I had assumed you wanted to cover even something like <a title="whoohoo" href=...>, so I didn't switch to \s*.
But given that you want to be able to handle *any* web page that is entered, you ought to use a real html parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)
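The effect of swapping .*? for [^<>]* can be sketched on a made-up input where the first img tag has no src at all. The patterns below are simplified from the ones above (match only, double quotes only), so treat them as an illustration, not as the substitution itself.

```perl
use strict;
use warnings;

# Made-up input: an <img> without src, followed by a <script> that has one.
my $html = q{<img alt="logo"><script src="app.js"></script><img border="0" src="pic.gif">};

# .*? with /s is free to wander out of the first tag and into the next one.
my @loose = $html =~ m/img.*?src="(.*?)"/sig;

# [^<>]* stops at the tag boundary, so only a src inside the img tag counts.
my @tight = $html =~ m/\bimg\b[^<>]*src="(.*?)"/ig;

print "loose: @loose\n";
print "tight: @tight\n";
```

The loose pattern reports app.js as an image source; the tight one reports only pic.gif. Neither copes with a ">" inside a quoted attribute value, which is one more argument for a real parser.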
Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 08:21 UTC
|
Edit: Yes I know I should be using a module, but that's the reason I'm asking, I don't want to use one.
You don't have to use it to use it.
Look at the source and borrow the appropriate regular expressions.
Re: Problem with parsing HTML with Regex's
by idsfa (Vicar) on Nov 10, 2003 at 16:01 UTC
|
<img alt=">click here<" src="/images/button.gif" NAME=">click<">
Yet Another Reason to use an HTML parser.
Updated:
None of the proposed purely regex solutions ... happy now? ;-)
Yes, the given regexes can all be modified to work with this (psychotic) example. The point is that parsing HTML is difficult, and that this wheel has already been invented a few times.
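As a rough illustration of both halves of that point, the sketch below runs a pattern modelled on doc's against the tag above, then a quote-aware variant. The variant is an assumption for illustration, not something posted in the thread, and it needs Perl 5.10 or later for the (?| ... ) branch reset.

```perl
use strict;
use warnings;

my $html = q{<img alt=">click here<" src="/images/button.gif" NAME=">click<">};

# Modelled on doc's pattern: [^>]* cannot step over the ">" inside the
# quoted alt value, so the match never reaches src.
my @naive = $html =~ m/<\s*img[^>]*?src\s*=\s*['"]?([^"' >\n]+)/gi;

# Quote-aware variant (illustrative): a quoted value is consumed as one unit,
# so a ">" inside quotes does not end the tag.
my $attrs = qr/(?: "[^"]*" | '[^']*' | [^'">] )*?/x;
my @aware = $html =~ m/
    <\s*img\b $attrs \bsrc\s*=\s*
    (?| "([^"]*)" | '([^']*)' | ([^\s"'>]+) )   # value: double, single, or bare
/gix;

print "naive: ", scalar(@naive), " match(es)\n";
print "aware: @aware\n";
```

The first pattern finds nothing; the second returns /images/button.gif. It still knows nothing about comments, scripts, or malformed markup, which is the part a parser already handles.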
|
*cough*cough* YAPE::HTML is a regex solution ;)