Parsing HTML

REMember has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing HTML by Tanktalus (Canon) on Feb 01, 2005 at 21:56 UTC
I'm going to suggest pursuing the modules. I know, I know: no permission. To that, I have two answers, depending on where you are.

You're at work, and you've been assigned a project that involves parsing HTML. My response to my manager has always been: help get these modules installed, or the cost (in effort) will be double or more. Someone else has already solved this part of my assignment, so why spend company money redoing it?

You're at school, and you're just playing with your school-sponsored Unix account. You may still be able to get your sysadmin to install a module if you ask nicely. I do understand how unlikely that is. So, the next best thing: install it to a local directory (perl Makefile.PL PREFIX=~/perllib), and then put "use lib '/home/me/perllib'" in your scripts, or "export PERL5LIB=~/perllib" in your environment. Better yet, get and install perl on your home computer; you'll have all the access you need there.

But maybe that's because I'm a lazy arse who likes to shake up management once in a while ;-)
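One way to check that a local install is actually being picked up is sketched below. It assumes the modules went under a hypothetical ~/perllib, as in the PREFIX example above, and uses HTML::Parser only as an example module name. Depending on how the module was installed, the .pm files may end up in a subdirectory of the prefix (for example lib/perl5), in which case the path given to "use lib" needs adjusting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: modules were installed under ~/perllib (hypothetical path).
# %ENV is populated at compile time, so it can be interpolated here.
use lib "$ENV{HOME}/perllib";

# Example module name; turn Foo::Bar into the Foo/Bar.pm form require expects.
my $module = 'HTML::Parser';
(my $file = "$module.pm") =~ s{::}{/}g;

# Try to load the module at run time and report where it came from.
if (eval { require $file; 1 }) {
    print "$module loaded from $INC{$file}\n";
}
else {
    print "$module not found; directories searched:\n";
    print "  $_\n" for @INC;
}
```

If the module is reported as not found, the printed directory list shows whether the local directory made it into the search path at all.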
Re^2: Parsing HTML by Grundle (Scribe) on Feb 01, 2005 at 22:28 UTC
I had this same annoying problem when trying to host a website with a particularly stingy provider. Things get complicated when they disallow CPAN and you only have FTP access (so you cannot even run the makefile if you manually upload the whole module). In effect, I had to install the module on my local box and then FTP the necessary components to the host machine (into my local directory there, just as you said). It is the same conundrum, but complicated by the fact that you cannot generate anything remotely.

I believe that once you put it in your local directory and start your script with `#!/usr/bin/perl` followed by `BEGIN { unshift(@INC, "<directory-path-of-the-modules>"); }`, Perl should recognize it fine.

I tried complaining to the provider, but they said, "In the name of security we cannot allow you to do these things," etc., ad nauseam. Hopefully you aren't in this same situation.
Re^3: Parsing HTML by polettix (Vicar) on May 18, 2005 at 17:19 UTC
use lib, Luke!

Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')
Don't fool yourself.
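For comparison, here is a small sketch of the two forms discussed in this sub-thread. The lib documentation describes "use lib LIST" as almost the same as a BEGIN block that unshifts onto @INC. Both directory names below are hypothetical and exist only so the two entries can be told apart in the output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Manual form: change @INC at compile time, before any later 'use' runs.
# '/home/me/perllib-manual' is a hypothetical directory.
BEGIN { unshift @INC, '/home/me/perllib-manual'; }

# Pragma form: nearly the same effect in one line.
# '/home/me/perllib-pragma' is a hypothetical directory.
use lib '/home/me/perllib-pragma';

# Both directories are now searched before the default locations.
my %position;
$position{ $INC[$_] } //= $_ for 0 .. $#INC;
print "pragma entry at index $position{'/home/me/perllib-pragma'}\n";
print "manual entry at index $position{'/home/me/perllib-manual'}\n";
```

Because the pragma line comes later in the file, its directory ends up ahead of the one added by the BEGIN block.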
Re: Parsing HTML by sh1tn (Priest) on Feb 01, 2005 at 21:53 UTC
Sure, there are many better ways (see the CPAN modules). But in this case you have to use a non-greedy match: `(.*?)`
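The difference between the greedy and non-greedy forms can be seen in a short sketch. The sample string and variable names are made up for illustration, since the original question's input is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line with two links on it.
my $html = '<a href="url1">one</a> and <a href="url2">two</a>';

# Greedy: (.*) grabs as much as it can, so the match runs to the LAST '">'.
my ($greedy) = $html =~ /<a href="(.*)">/;

# Non-greedy: (.*?) stops at the FIRST '">' that lets the pattern match.
my ($lazy) = $html =~ /<a href="(.*?)">/;

print "greedy: $greedy\n";   # greedy: url1">one</a> and <a href="url2
print "lazy:   $lazy\n";     # lazy:   url1
```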
Re: Parsing HTML by geektron (Curate) on Feb 01, 2005 at 22:15 UTC
I agree with the previous notes. Look into modules, and install them (if needed) in your home directory or elsewhere on the system where you have write permission.
Re: Parsing HTML by reneeb (Chaplain) on Feb 02, 2005 at 07:58 UTC
Use HTML::Parser. It's a very good module for parsing HTML, as the name suggests. Here is a code snippet that collects all the links in an HTML string:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my @links;
    my $string = qq~<a href="url1">linktext1</a> Some other text <a href="url2">linktext2</a> text~;

    # Call start_handler for every opening tag.
    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($string);
    $p->eof;

    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }

    sub start_handler {
        # Only <a> tags are of interest.
        return if (shift ne 'a');
        my $href = shift->{href};
        my $self = shift;
        my $text;
        # Remember the text inside the link, then store the pair at </a>.
        $self->handler(text => sub { $text = shift; }, "dtext");
        $self->handler(end  => sub { push(@links, [$href, $text]) if (shift eq 'a') }, "tagname");
    }
Re: Parsing HTML by kprasanna_79 (Hermit) on Feb 02, 2005 at 14:22 UTC
Hey,

`@lineTokens = split(/(.)<(.)=(.)"(.)>(.)/, $lines[$l]);`

I don't think this works, for the following reasons:

1. When the `<a href>` tag appears at the very end of your `$lines[$l]`, this pattern fails to match. I think I am right about that.
2. If it does not match every occurrence, add /g to the end of the pattern: `/(.)<(.)=(.)"(.)>(.)/g`
3. Why not go for a pattern match that is a little easier to handle?

--prasanna.k
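As a rough illustration of point 3, a global match in a loop can pull out each link without split. This is only a sketch: the sub name extract_links and the sample line are hypothetical, and the pattern assumes href is the first attribute and is double-quoted. For real-world HTML, a parser module remains the more robust choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [url, link text] pairs found in a line.
sub extract_links {
    my ($line) = @_;
    my @links;
    # /g in a while loop walks through every match; (.*?) keeps each match short.
    while ($line =~ m{<a\s+href="([^"]*)"[^>]*>(.*?)</a>}gis) {
        push @links, [$1, $2];
    }
    return @links;
}

# The second link sits at the very end of the line and is still found.
my @links = extract_links('text <a href="url1">linktext1</a> more <a href="url2">linktext2</a>');
print "URL: $_->[0]\tText: $_->[1]\n" for @links;
```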