myfrndjk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I wish to scrape content and store it under its respective names. When I print the crawled content, none of the special characters come out right: they are all replaced by junk values. For example, the euro sign (€) is printed as (-aA). The site I am scraping is in German and full of special characters, so most of the crawled content differs from the original. Thanks in advance.

use LWP::Simple;
use File::Compare;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;

open(FILE, "C:/Users/jk/Desktop/input/input.txt");
{
    while(<FILE>)
    {
        chomp;
        $url=$_;
        foreach ($url)
        {
            ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
        }
        do 'C:/Users/jk/Desktop/perl/mainsub.pl';
        &domain_check();
        my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
        my $req = HTTP::Request->new(GET => "$url");
        my $res = $ua->request($req);
        die("error") unless $res->is_success;
        my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);
        my @node = $xp->findnodes_as_strings("$xpath");
        die("node doesn't exist") if $#node == -1;
        foreach(<@node>)
        {
            $death=$_;
            open HTML, ">C:/Users/jk/Desktop/fun/perl/$site.html";
            print HTML "$death\n";
        }
    }
}

subroutine

use LWP::Simple;
use File::Compare;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;

sub domain_check
{
    if($domain eq 'goo.eu')
    {
        $competitor = 'goo.eu';
        $xpath ='//p/strong'
    }
    if ($domain eq 'mov.it')
    {
        $competitor = 'mov.it';
        $xpath = '//div//table//td';
    }
    elsif ($domain eq 'lot.it')
    {
        $competitor = 'lot.it';
        $xpath = '//div//table';
    }
}

Replies are listed 'Best First'.
Re: Perl prints only last line of array
by RMGir (Prior) on Jun 22, 2014 at 00:09 UTC
    I think that your problem is that you're re-creating the file for every line.

    Try moving the "open HTML" line out above the loops...

    # You don't want to do this for each line!
    open HTML, ">C:/Users/jk/Desktop/fun/perl/$site.html";
    foreach(<@node>)
    {
        $death=$_;
        print HTML "$death\n";
    }

    Mike

      Hi Mike, Thanks for your help! I modified the code as you suggested and it works fine. But now I have another issue: the crawled content doesn't contain any special characters. They are all replaced by junk values. For example, the euro sign (€) is printed as (-aA). The site I am scraping is in German and full of special characters, so most of the crawled content differs from the original. Thanks again, jk

Re: Perl not printing any special characters in array
by hippo (Archbishop) on Jun 22, 2014 at 15:03 UTC
    All special characters are replaced by some junk values. for example (€)euro is printed as (-aA).

    This suggests that you have forgotten to decode your input or to encode your output or both. Have you read perlunitut and perlunifaq?
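    To make "decode your input, encode your output" concrete, here is a minimal sketch using only the core Encode module. It does not fetch anything: the byte string stands in for what `$res->content` would hand back for a UTF-8 page, and the file name `euro_demo.txt` is made up for the example. With LWP itself, `$res->decoded_content` is the usual way to get already-decoded text instead of calling `decode` by hand; whether the page in question really is UTF-8 is an assumption here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the raw bytes a server would send for "10 €" in UTF-8
# (an assumption for this sketch; no request is made).
my $bytes = "10 \xE2\x82\xAC";

# Decode input: bytes -> Perl character string.
my $text = decode('UTF-8', $bytes);
print length($bytes), " bytes, ", length($text), " characters\n";
# prints "6 bytes, 4 characters"

# Encode output: the :encoding layer turns characters back into UTF-8 bytes.
my $file = 'euro_demo.txt';    # hypothetical file name
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot open $file: $!";
print {$out} "$text\n";
close($out) or die "Cannot close $file: $!";
```

    In the posted script the two places to look would be the `$res->content` call (input) and the `open HTML` line (output).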

      Hi, Thanks for your suggestion, but I don't know where to add those in my code. I tried, but there is still no change in the result. Can you tell me where to add the encode/decode calls in my code? Thanks

        Useless reply, myfrndjk; show us the code you "tried" (presumably, 'added') and tell us, in detail, how it failed.

        Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
        1. code
        2. verbatim error and/or warning messages
        3. a coherent explanation of what "doesn't work" actually means.

        check Ln42!

Re: Perl prints only last line of array
by AnomalousMonk (Archbishop) on Jun 22, 2014 at 10:02 UTC
    ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;

    Just a side note: The regex quoted above is unlikely to be doing what you expect.

    Update: I'm not familiar with URL matching in general, but I cannot imagine this problem has not already been addressed in a Perl module — or modules! Maybe search CPAN or MetaCPAN with terms like  URL regex

      Hi, Thanks for your explanation. But my regex is working fine; after the code change suggested by Mike, everything works. Thanks, JK

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $url = q{is 'www' really a domain?!?};
         print qq{($1)} if $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
        "
        ( really a domain?!)
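
        The one-liner matches because every unescaped `.` accepts any character, the pattern is not anchored, and the spaces inside the character class still count as a literal space under a single /x. A stricter pattern might look like the sketch below. The helper name `domain_of` is hypothetical, and the pattern assumes plain "name.tld" hosts such as the ones in the original post; it is not a general URL parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return "name.tld" or undef.
sub domain_of {
    my ($url) = @_;
    my ($domain) = $url =~ m{
        \A (?: https?:// )?                  # optional scheme
        (?: www\. )?                         # optional literal www plus dot
        ( [A-Za-z0-9-]+ \. [A-Za-z]{2,} )    # name, literal dot, TLD
        (?: [/:?\#] | \z )                   # then path, port, query, fragment, or end
    }x;
    return $domain;
}

for my $url ('http://www.goo.eu/page', 'www.lot.it', q{is 'www' really a domain?!?}) {
    my $domain = domain_of($url);
    print defined $domain ? "($domain)\n" : "(no match)\n";
}
```

        For anything beyond simple hosts, a CPAN module is probably the better route; the URI module, for example, exposes the host through `URI->new($url)->host`.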
Re: Perl prints only last line of array
by Anonymous Monk on Jun 22, 2014 at 00:30 UTC
    What do you think  foreach(<@node>) does?

      Hi, I am new to Perl. What I thought was that I had to open a new HTML file for every URL. Now I understand that the open ran for every line, so each line replaced the previous ones and only the last line was left. Correct me if I am wrong. Thanks, jk

        See readline and glob: <> is used for both readline and glob; it's not used for iterating over an array.
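
        A small sketch of the difference, with made-up sample strings in place of the scraped nodes. Because `@node` inside angle brackets is neither a filehandle nor a simple scalar, Perl treats `<@node>` as a glob pattern: the elements are interpolated with spaces between them and the result is split on whitespace.

```perl
use strict;
use warnings;

my @node = ('10 EUR', 'free shipping');    # made-up sample data

# <@node> behaves like glob("@node"): two strings become four words.
my @globbed = <@node>;
print scalar(@globbed), " items from <\@node>\n";

# Iterating over the array itself keeps each element intact.
my @kept;
for my $text (@node) {
    push @kept, $text;
}
print scalar(@kept), " items from \@node\n";
```

        Strings containing glob metacharacters such as `*` or `?` would additionally be matched against file names, so `foreach my $text (@node)` is the safer loop for scraped text.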