get the html source of a webpage .

manjulakp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a link to a document: http://data.lexus.nl/home/data/LexusV8/pdf/Lexusdealerlijst.pdf. I want to fetch the store details listed in it. How do I go about this? I have used the LWP module to get the HTML source of web pages, but I am unable to get the HTML source of a PDF. Can someone help me with this? Regards, Manjula


Re: get the html source of a webpage . by marto (Cardinal) on Apr 07, 2011 at 10:10 UTC
A PDF file isn't a web page, nor is it HTML. I'm not sure whether you're looking to extract plain text or hyperlinks from this PDF file. Depending on what you want to do, look at some of the PDF modules on CPAN, for example CAM::PDF.
Re: get the html source of a webpage . by Utilitarian (Vicar) on Apr 07, 2011 at 10:10 UTC
Hi Manjula, PDF (Portable Document Format) and HTML (HyperText Markup Language) are different formats addressing different needs: HTML is, or at least should be, display agnostic, whereas PDF is very strict about the format and layout of a document. Do you need to convert between them, or extract the content of the PDF? `print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
Re: get the html source of a webpage . by wind (Priest) on Apr 07, 2011 at 22:00 UTC
CAM::PDF should be able to get you what you want. You'll just have to enhance the script below to scan for the beginning and end of each store's contact info: `use strict; use warnings; use CAM::PDF; use LWP::Simple qw(getstore); my $url = 'http://data.lexus.nl/home/data/LexusV8/pdf/Lexusdealerlijst.pdf'; my ($file) = $url =~ m{([^/]*)$}; getstore($url, $file) if ! -e $file; my $cam = CAM::PDF->new($file) or die "$CAM::PDF::errstr\n"; print $cam->getPageText(1);`
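As a rough sketch of the "scan for the beginning and end of each store's contact info" step, something like the following might be a starting point. It assumes that dealer entries in the extracted text are separated by blank lines, which has not been checked against the actual PDF. The helper names `split_records` and `pdf_text` are made up for illustration; only `new`, `numPages` and `getPageText` come from CAM::PDF.

```perl
use strict;
use warnings;

# Split extracted text into one record (array ref of lines) per dealer.
# ASSUMPTION: entries are separated by one or more blank lines; the real
# PDF text may need a different delimiter (e.g. a line starting with "Lexus").
sub split_records {
    my ($text) = @_;
    my @records;
    for my $chunk (split /\n\s*\n/, $text) {
        my @lines;
        for my $line (split /\n/, $chunk) {
            $line =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
            push @lines, $line if length $line;
        }
        push @records, \@lines if @lines;
    }
    return @records;
}

# Concatenate the text of every page rather than page 1 only.
sub pdf_text {
    my ($file) = @_;
    require CAM::PDF;                          # loaded only when a file is given
    my $cam = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    return join "\n\n",
        map { my $t = $cam->getPageText($_); defined $t ? $t : '' }
        1 .. $cam->numPages;
}

# Usage: perl dealers.pl Lexusdealerlijst.pdf
if (@ARGV) {
    my $n = 0;
    for my $rec (split_records(pdf_text($ARGV[0]))) {
        printf "Dealer %d: %s\n", ++$n, join(' | ', @$rec);
    }
}
```

Run it against the file that the script above has already downloaded. If the entries turn out not to be blank-line separated, only the `split` pattern in `split_records` should need changing.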