csuresh01 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
I am fairly new to this group and to Perl. I have some HTML pages, and I have to parse the files and get some results from them. For example:

$tmp ="<tr bgcolor=#CCFFFF><td><a href="/Tracking/SmcRel30Story">SmcRel30Story</a></td>"

From the $tmp variable I want to know the position of "SmcRel30Story". Then I have to extract the value of the href attribute, like this: /Tracking/SmcRel30Story. Can someone please help me with how to proceed on this right away?

Regards
CS

20040716 Janitored by Corion: Added formatting, moved from PMD

Replies are listed 'Best First'.
Re: How to find the Postion of String
by pelagic (Priest) on Jul 16, 2004 at 14:18 UTC
Re: How to find the Postion of String
by gellyfish (Monsignor) on Jul 16, 2004 at 14:43 UTC

    In the spirit of showing the code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::Parser;

    my $tmp = '<tr bgcolor=#CCFFFF><td><a href="/Tracking/SmcRel30Story">SmcRel30Story</a></td>';

    # Call gethref() for every start tag, passing the tag name and a hash ref of its attributes
    my $parser = HTML::Parser->new( start_h => [ \&gethref, "tag,attr" ] );
    $parser->parse($tmp);
    $parser->eof;    # signal end of input so nothing stays buffered

    sub gethref {
        my ( $tag, $attribs ) = @_;
        if ( $tag eq 'a' && exists $attribs->{href} ) {
            if ( $attribs->{href} =~ /SmcRel30Story/ ) {
                print $attribs->{href}, "\n";
            }
        }
    }

    /J\

Re: How to find the Postion of String
by pbeckingham (Parson) on Jul 16, 2004 at 14:47 UTC

    Do what pelagic says.

    But if you are approaching this as a learning opportunity, then the following would work with caveats (note that your code contains unescaped double quotes, and therefore would not compile):

    #! /usr/bin/perl -w
    use strict;

    my $tmp = "<tr bgcolor=#CCFFFF><td><a href=\"/Tracking/SmcRel30Story\">SmcRel30Story</a></td>";

    my ($path) = $tmp =~
      m{<a        # literal <a
        \s+       # definite whitespace
        href      # literal href
        \s*       # possible whitespace
        =         # literal =
        \s*       # possible whitespace
        "         # literal double quote
        (         # capture the following
          [^"]+   # greedy string that does not contain a quote
        )         # end capture
        "         # literal double quote
       }imsx;     # case-insensitive

    print $path, "\n";

    See how much work it is? And it still isn't complete. For example, it only allows " double quote characters around the path, not single quotes or missing quotes; it doesn't handle escaped quotes within the attribute; it doesn't allow other attributes of the <A> tag to precede the href attribute; it only returns the first match in the string; etc, etc.

    See how much work it is, and it still isn't good enough? Don't be tempted - do what pelagic says.
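
The caveats listed above are easy to see by running the same pattern against a few variants of the link. This is only a rough sketch: href_via_regex and the sample strings are made up for illustration and are not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the reply above.
# Returns the captured path, or undef when the pattern does not match.
sub href_via_regex {
    my ($html) = @_;
    my ($path) = $html =~ m{<a \s+ href \s* = \s* " ( [^"]+ ) "}imsx;
    return $path;
}

# Illustrative inputs (assumptions, not taken from the poster's pages)
my @samples = (
    '<a href="/Tracking/SmcRel30Story">x</a>',                # the original form
    q{<a href='/Tracking/SmcRel30Story'>x</a>},               # single quotes
    '<a href=/Tracking/SmcRel30Story>x</a>',                  # no quotes
    '<a class="nav" href="/Tracking/SmcRel30Story">x</a>',    # attribute before href
);

for my $html (@samples) {
    my $path = href_via_regex($html);
    printf "%-55s => %s\n", $html, defined $path ? $path : '(no match)';
}
```

Only the first sample matches; the other three are all valid links that the pattern silently misses, which is the argument for using a real parser.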

Re: How to find the Postion of String
by EdwardG (Vicar) on Jul 16, 2004 at 14:29 UTC

    CPAN modules are your best bet, since link extraction can be difficult.

    But if you don't want modules, this code might get you started reinventing the wheel.

    # To get the position of the first occurrence of "SmcRel30Story"
    if ($tmp =~ /SmcRel30Story/) {
        print 'position of SmcRel30Story is ', 1 + length $`, "\n";
    }

    # To extract everything inside the href
    if ($tmp =~ /href="(.+?)"/i) {
        print "I found this inside the quotes: $1\n";
    }
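
For a fixed string, the same position can also be found with the built-in index() function, which avoids both the regex and the $` variable. A minimal sketch follows, assuming a 1-based position is wanted as in the code above. The helper names position_of and first_href are hypothetical, and first_href has the same limitations as any other regex approach in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1-based position of $needle in $haystack,
# or 0 if it does not occur (index() returns -1 when nothing is found).
sub position_of {
    my ( $haystack, $needle ) = @_;
    return 1 + index( $haystack, $needle );
}

# Hypothetical helper: value of the first double-quoted href, or undef.
sub first_href {
    my ($html) = @_;
    return $html =~ /href="([^"]*)"/i ? $1 : undef;
}

my $tmp = '<tr bgcolor=#CCFFFF><td><a href="/Tracking/SmcRel30Story">SmcRel30Story</a></td>';

print 'position of SmcRel30Story is ', position_of( $tmp, 'SmcRel30Story' ), "\n";
print 'href is ', first_href($tmp), "\n";
```

Note that the first occurrence of "SmcRel30Story" in this string is the one inside the href attribute, not the link text. index() also accepts a start offset as a third argument, which can be used to look for later occurrences.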