REMember has asked for the wisdom of the Perl Monks concerning the following question:

I'm not very good at the split command. I have
@lineTokens = split(/(.*)<(.*)=(.*)"(.*)>(.*)/, $lines[$l]);
That line reads in a line of HTML and breaks it into smaller pieces. It gives me a token of "a href" to tell me that it is a link, and the next token is the actual link. This works great until I get to a line that has multiple links on it; then I only get the last link for some reason. Does anyone know how to fix that problem? If you know a better way to parse the lines, that will work too. I can't use any modules because I don't have the necessary permissions to install them.

Re: Parsing HTML
by Tanktalus (Canon) on Feb 01, 2005 at 21:56 UTC

    I'm going to suggest pursuing the modules. I know, I know. No permission. To that, I have two answers, depending on where you are:

    1. You're at work, and you've been assigned a project that involves parsing HTML. My response to my manager has always been: help get these modules installed, or the cost (in effort) will be double or more. Someone else has already solved this part of my assignment, so why spend company money redoing it?
    2. You're at school, and you're just playing with your school-sponsored Unix account. You may still be able to get your sysadmin to install a module if you ask nicely. I do understand how unlikely that is. So, next best thing: install it to a local directory (perl Makefile.PL PREFIX=~/perllib), and then use "use lib '/home/me/perllib'" in your scripts, or "export PERL5LIB=~/perllib" in your environment. Better yet, get and install perl on your home computer; you'll have all the access you need there.
    But maybe that's because I'm a lazy arse who likes to shake up management once in a while ;-)
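
    A minimal sketch of the local-directory approach from point 2, assuming the modules ended up under ~/perllib (depending on how perl was configured, PREFIX may place the .pm files in a subdirectory such as lib/perl5 below that prefix, so adjust the path to wherever they actually land):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the modules were copied or installed under ~/perllib.
# 'use lib' runs at compile time and prepends the directory to @INC.
use lib "$ENV{HOME}/perllib";

# Print the search path so you can confirm the local directory is on it.
# A module stored there can then be loaded with an ordinary 'use' line.
print "$_\n" for @INC;
```

    Using $ENV{HOME} rather than a hard-coded /home/me keeps the script portable between accounts.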

      I had this same annoying problem when trying to host a website with a particularly stingy provider. Things get complicated when they disallow CPAN and you only have FTP access (so you cannot even try to run the Makefile if you manually upload the whole module).

      In effect, I had to install the module on my local box and then FTP the necessary components to the "host" machine (into my local directory, just as you said). It is the same conundrum, complicated by the fact that you cannot build anything remotely. I believe that once you put the files in your local dir and include the following lines in your code,

          #!/usr/bin/perl
          BEGIN { unshift(@INC, "<directory-path-of-the-modules>"); }

      it should find the module fine.

      I tried complaining to the provider, but they said, "In the name of security we cannot allow you to do these things," etc., ad nauseam. Hopefully you aren't in this same situation.
        use lib, Luke!

        Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

        Don't fool yourself.
Re: Parsing HTML
by sh1tn (Priest) on Feb 01, 2005 at 21:53 UTC
    Sure, there are many better ways (see the CPAN modules).
    But in this case you have to use a non-greedy match:
    (.*?)
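
    To illustrate the difference, here is a small sketch with a made-up sample line and a simplified pattern (not the original split pattern). A greedy .* grabs as much as it can, which is why only the last link shows up; the non-greedy .*? stops at the first opportunity:

```perl
use strict;
use warnings;

# Hypothetical sample line with two links on it.
my $line = '<a href="url1">one</a> and <a href="url2">two</a>';

# Greedy: the leading .* runs as far right as it can while still letting
# the rest of the pattern match, so the capture comes from the LAST link.
my ($greedy) = $line =~ /.*href="(.*)"/;

# Non-greedy: .*? takes as little as possible, so the capture comes
# from the FIRST link and stops at the first closing quote.
my ($lazy) = $line =~ /.*?href="(.*?)"/;

print "greedy: $greedy\n";   # greedy: url2
print "lazy:   $lazy\n";     # lazy:   url1
```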
Re: Parsing HTML
by geektron (Curate) on Feb 01, 2005 at 22:15 UTC
    I agree with the previous notes. Look into modules, and install them (if needed) in your home directory or elsewhere on the system where you have write permission.
Re: Parsing HTML
by reneeb (Chaplain) on Feb 02, 2005 at 07:58 UTC
    Use HTML::Parser. It's a very good module for parsing HTML, as the name suggests.

    Here is a code snippet that collects all the links in an HTML string:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;
    use Data::Dumper;

    my @links;
    my $string = qq~<a href="url1">linktext1</a> Some other text <a href="url2">linktext2</a> text~;

    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($string);
    $p->eof;    # signal end of input so buffered text is flushed

    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }

    # Called for every start tag; only <a> tags are of interest.
    sub start_handler {
        return if (shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        # Remember the most recent text chunk seen after the <a> tag.
        $self->handler(text => sub { $text = shift; }, "dtext");
        # On </a>, store the [URL, link text] pair.
        $self->handler(end => sub { push(@links, [$class, $text]) if (shift eq 'a') }, "tagname");
    }
Re: Parsing HTML
by kprasanna_79 (Hermit) on Feb 02, 2005 at 14:22 UTC
    Hey
    @lineTokens = split(/(.*)<(.*)=(.*)"(.*)>(.*)/, $lines[$l]);
    I don't think this works reliably. Consider the following cases:
    1.   When the <a href> tag appears at the very end of $lines[$l], I think this pattern match fails.
    2.   If it does not pick up every match, add /g at the end of the pattern:
    split(/(.*)<(.*)=(.*)"(.*)>(.*)/g

    3.   Why not use a pattern match that is a little easier to handle?
    --prasanna.k
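
    One possible shape for the simpler pattern match suggested in point 3, as a sketch rather than a robust parser. The /g modifier is more commonly used with the match operator than with split, so this uses a match in a while loop. The sample @lines and the @links structure are made up for illustration, and the pattern assumes that href is the first attribute, is double-quoted, and that each link sits on a single line:

```perl
use strict;
use warnings;

# Hypothetical input; in the original script these lines come from the HTML file.
my @lines = (
    'No links here',
    '<a href="url1">one</a> and <a href="url2">two</a>',
    'Link at the end: <a href="url3">three</a>',
);

my @links;
for my $l (0 .. $#lines) {
    # In scalar context, /g resumes where the previous match ended,
    # so the loop visits every link on the line, including one at the end.
    while ($lines[$l] =~ /<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/gi) {
        push @links, { line => $l, url => $1, text => $2 };
    }
}

printf "line %d: %s (%s)\n", @{$_}{qw(line url text)} for @links;
```

    Real-world HTML breaks these assumptions quickly (single quotes, extra attributes before href, tags split across lines), which is the reason the other replies recommend a parser module.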