Update2: I misread the question. I did not notice that you wanted the content with the HTML and links stripped off; I read it as "strip everything off except links". Oh well!

Anyway, I added this piece of code to the program I have below:

    my $x = 0; # Would be at the top before the loop
    my @stripped_html;
    $stripped_html[$x++] = $webcrawler->content( format => "text" );
    # Loop back, get more URLs and keep processing.
    map { print $_,$/; } @stripped_html;

It seems to give the content without any HTML tags, but the result looks kind of funny for Google. I tried our PM site and it works fine, but the text is not stored as an array of strings; the entire content ends up in the first element.
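Since content( format => "text" ) hands back the whole page as one string, it is no surprise that everything lands in a single array element. If an array of lines is what you are after, splitting on newlines is one way to get there. A minimal sketch, where $text is a made-up stand-in for whatever content() returned and text_to_lines is a helper name I invented:

```perl
use strict;
use warnings;

# Hypothetical helper: break one big text blob into trimmed, non-empty lines.
sub text_to_lines {
    my ($text) = @_;
    my @lines;
    for my $line ( split /\r?\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        push @lines, $line if length $line;
    }
    return @lines;
}

# Stand-in for: my $text = $webcrawler->content( format => "text" );
my $text = "The Monastery Gates\n\n  Seekers of Perl Wisdom  \nMeditations\n";

my @stripped_html = text_to_lines($text);
print "$_\n" for @stripped_html;
```

With this, each element of @stripped_html holds one line of text instead of the whole page.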

Output for google.com:

GoogleWebááááImagesááááGroupsááááNewsááááFroogleááááLocaláááámoreá╗áááAdvanced SearchááPreferencesááLanguage ToolsAdvertisingáPrograms - Business Solutions - About Google⌐2005 Google - Searching 8,058,044,651 web pages
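My guess about the funny characters: the Google page separates its links with &nbsp; entities, which come through in the text as non-breaking spaces (character 0xA0), and a DOS-style console shows that byte as á (and the copyright sign, 0xA9, as ⌐). If that is what is happening, replacing them with ordinary spaces should tidy things up. A rough sketch, assuming the text is a plain character string; normalize_spaces is a name I made up:

```perl
use strict;
use warnings;

# Hypothetical helper: turn non-breaking spaces into ordinary ones and tidy up.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;     # non-breaking space -> ordinary space
    $text =~ s/[ \t]+/ /g;     # collapse runs of spaces and tabs
    $text =~ s/^ +| +$//mg;    # trim the start and end of every line
    return $text;
}

# Made-up sample imitating the start of the output above.
my $sample = "Google\x{A0}\x{A0}Web\x{A0}\x{A0}\x{A0}\x{A0}Images";
print normalize_spaces($sample), "\n";
```

This only deals with the spacing; other characters that the console cannot display would still need their own handling.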

end update2

I think your problem is with the way you are using $webcrawler. When you ask it for content, it gives you the content. You stripped off everything but the content and put it in @website_links, but you never use that array?

I am confused. I don't think the contents of $webcrawler are changed by calling the links method.

Anyway, you need a Link object to print the links; at least that is what I see in the docs:

$mech->links()

When called in a list context, returns a list of the links found in the last fetched page. In a scalar context it returns a reference to an array with those links. Each link is a WWW::Mechanize::Link object.

I installed Mechanize, but Link does not seem to be getting installed for me. I will try to compile it and check it out again. Hopefully you can figure out the issue from here.

-SK

update: Here is a condensed version of your script

    #!/usr/bin/perl -w
    use WWW::Mechanize;
    use URI;

    print "WEB CRAWLER AND HTML EXTRACTOR \n";

    # Create an instance of the webcrawler
    my $webcrawler = WWW::Mechanize->new();
    my $url_name = "http://www.google.com";
    my $uri = URI->new($url_name); # Process the URL and make it a URI

    # Grab the contents of the URL given by the user
    $webcrawler->get($uri);
    die "Failed\n" unless $webcrawler->success(); # Check for return status

    # links() returns Link objects; call url() on each one.
    map { print ($_->url(),"\n"); } $webcrawler->links();

Output

    WEB CRAWLER AND HTML EXTRACTOR
    /imghp?hl=en&tab=wi&ie=UTF-8
    http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8
    /nwshp?hl=en&tab=wn&ie=UTF-8
    /frghp?hl=en&tab=wf&ie=UTF-8
    /lochp?hl=en&tab=wl&ie=UTF-8
    /intl/en/options/
    /advanced_search?hl=en
    /preferences?hl=en
    /language_tools?hl=en
    /ads/
    /intl/en/services/
    /intl/en/about.html

There are only two major things I changed from your code (the other changes just reduce the code size for testing):

1. Check for return status

2. links() returns Link objects, so call the url() method on each one. Check out the map section (you can store the results in an array and then do the printing if you want).
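Most of the links in the output are relative, so if you want to store them in an array and crawl them later, you will probably want absolute URLs. Here is a sketch using URI->new_abs from the URI module the script already loads. The hard-coded strings stand in for what url() would return, and absolute_urls is a helper name I made up. WWW::Mechanize::Link also documents a url_abs() method, which should give the same result per link.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve each (possibly relative) URL against a base URL.
sub absolute_urls {
    my ( $base, @urls ) = @_;
    return map { URI->new_abs( $_, $base )->as_string } @urls;
}

# Stand-ins for: my @urls = map { $_->url() } $webcrawler->links();
my @urls = (
    "/imghp?hl=en&tab=wi&ie=UTF-8",
    "http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8",
    "/intl/en/about.html",
);

my @website_links = absolute_urls( "http://www.google.com/", @urls );
print "$_\n" for @website_links;
```

URLs that are already absolute pass through unchanged, so the array can be fed straight back into get().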


In reply to Re: HTML stripper in WWW::Mechanize doesn't seem to work by sk
in thread HTML stripper in WWW::Mechanize doesn't seem to work by lampros21_7
