myfrndjk has asked for the wisdom of the Perl Monks concerning the following question:

After all the URLs have been processed, the else branch prints additional "Invalid url" strings. For example, if I process 5 URLs in total (3 valid and 2 invalid), I get 7 results, 2 of which are extra "Invalid url" lines.

use LWP::Simple;
use File::Compare;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use Win32::Console::ANSI;
use Term::ANSIColor;

sub crawl_content {
    {
        open(FILE, "C:/Users/jeyakuma/Desktop/shipping project/input/input.txt");
        {
            while(<FILE>) {
                chomp;
                $url=$_;
                foreach ($url) {
                    ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
                }
                do 'C:/Users/jeyakuma/Desktop/perl/mainsub.pl';
                &domain_check();
                my $ua = LWP::UserAgent->new( agent => "Mozilla/5.0" );
                my $req = HTTP::Request->new( GET => "$url" );
                my $res = $ua->request($req);
                if ( $res->is_success ) {
                    print "working on $domain\n";
                    binmode ":utf8";
                    my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
                    my @node = $xp->findnodes_as_string("$xpath") or print "couldn't find the node\n" ;
                    open HTML, '>:encoding(cp1252)',"C:/Users/jeyakuma/Desktop/shipping project/data_$date/$competitor.html";
                    foreach(<@node>) {
                        print HTML @node;
                        close HTML ;
                    }
                }
                else {
                    print color("green"), "$domain Invalid url\n", color("reset") and open HTML,">C:/Users/jeyakuma/Desktop/log.txt";
                    print HTML " $domain Invalid URL";
                }
            }
        }
    }
}
do 'C:/Users/jeyakuma/Desktop/perl/comparefinal.pl';
compare_result();
}

output

footwedrer.eu Invalid url

working on autozona.it

dantae.eu Invalid url

working on infanziabimbo.it

working on footlocker.eu

Invalid url

Invalid url

Re: if/else loop prints extra values
by kcott (Archbishop) on Jun 28, 2014 at 08:41 UTC

    G'day myfrndjk,

    You have a number of issues with your regex.

    ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;

    As $domain is assigned at the start of each iteration, and used within the body of the loop, it would make sense to get this part fixed first.

    I'm guessing that, if the input was "www.example.com.au", the expected output should be either "example.com.au" or "example.com". Please clarify. (FYI: your code produces "example.co", see below.)

    [Please take a look at the guidelines in "How do I post a question effectively?" for information on useful materials to include with your post. (In this specific instance, sample input and expected output would have been on the list.)]

    This test code:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my $url = 'www.example.com.au';
    my ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
    print '$domain=[', defined $domain ? $domain : '<undef>', ']';

    produces this output:

    $domain=[example.co]

    Adding these additional lines of code:

    my ($alpha_num, $any_three, $final_dot) = $url =~ m|www.([A-Z a-z 0-9]+)(.{3})(.)|x;
    print '$alpha_num=[', defined $alpha_num ? $alpha_num : '<undef>', ']';
    print '$any_three=[', defined $any_three ? $any_three : '<undef>', ']';
    print '$final_dot=[', defined $final_dot ? $final_dot : '<undef>', ']';

    and the output now shows which parts of the regex are capturing which parts of the domain:

    $alpha_num=[example]
    $any_three=[.co]
    $final_dot=[m]

    The dot ('.') (meta)character is special in regexes: it matches any character except newline [including newline if the /s modifier is used].

    You seem to have used it, expecting a literal dot, in "m|www.". I'm not sure what's intended with ".{3}).", hence the request for clarification earlier. Anyway, this problem needs fixing.
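To make the difference concrete, here is a minimal sketch that compares the unescaped dots with escaped ones. The second pattern is only an assumption about the intent (capture everything after a leading "www."); the variable names $loose and $strict_match are made up for this illustration.

```perl
use strict;
use warnings;

my $url = 'www.example.com.au';

# Unescaped dots match any character: '.{3}' takes '.co'
# and the final '.' consumes the 'm', so the capture stops short.
my ($loose) = $url =~ m|www.([A-Za-z0-9]+.{3}).|;

# Escaped dots match only a literal '.'.
# ASSUMPTION: the goal is to capture every label after 'www.'.
my ($strict_match) = $url =~ m|\Awww\.([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)|;

print "loose=[$loose]\n";
print "strict=[$strict_match]\n";
```

If only "example.com" is wanted, the pattern would need to change again; that depends on the clarification requested above.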

    You also have a problem with spaces in "[A-Z a-z 0-9]". I suspect this is the result of a misunderstanding about the /x modifier:

    "/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class." [my emphasis]

    Decide whether you want domains with spaces or not; modify the character class to have no spaces or just one space.
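The effect of those spaces can be checked with a short sketch. The input string 'www.my shop.com' is a made-up example, chosen only because it contains a space; it is not taken from the original post.

```perl
use strict;
use warnings;

# Hypothetical input containing a space, for illustration only.
my $input = 'www.my shop.com';

# Under /x, whitespace outside a character class is ignored,
# but the spaces inside [...] remain literal, so a space is accepted.
my ($with_space) = $input =~ m| www\. ([A-Z a-z 0-9]+) |x;

# With no spaces in the class, the match stops at the first space.
my ($no_space) = $input =~ m| www\. ([A-Za-z0-9]+) |x;

print "with_space=[$with_space]\n";
print "no_space=[$no_space]\n";
```

The first capture comes back as "my shop" and the second as "my", which shows that /x alone does not make the spaces in the class disappear.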

    [See also: perlrequick, perlretut, perlre, strict, warnings, autodie, open()]

    -- Ken

      ... issues with your regex.

      ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;

      kcott: Hi Ken! The discussion stemming from Re: Perl prints only last line of array indicates the regex in question is working just fine for myfrndjk although I share your puzzlement as to how it could. Anyhoo... ++ for a valiant explanatory effort.

        G'day AnomalousMonk,

        [Sorry for the late reply. As you may have noticed, I haven't been here much in recent weeks. Various other things are taking up my time at the moment — none of which are any cause for concern :-)]

        I found it curious that mangling 5 valid URLs would result in 3 different valid URLs.

        I found it curiouser that those mangled URLs all turned out to be shopping sites.

        I found it curiousest that some of those names sounded familiar: perhaps from "Nodes To Consider".

        -- Ken

      thanks for your suggestion

Re: if/else loop prints extra values
by Anonymous Monk on Jun 27, 2014 at 14:32 UTC

    Whenever you print out the problem values, put some sort of marker around them so that you can see the whitespace, e.g.: print "[$problem] <-- this was bad!\n";

    Also, note that "" is not a valid URL. You probably have some empty strings / blank lines in your input.
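A minimal sketch of both suggestions, assuming the extra "Invalid url" lines do come from blank lines in the input file. The @lines array stands in for the lines read from input.txt, and its contents are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines of input.txt:
# three URLs plus two blank or whitespace-only lines.
my @lines = (
    "www.example.com\n",
    "www.example.org\n",
    "\n",
    "www.example.net\n",
    "   \n",
);

my @urls;
for my $line (@lines) {
    chomp $line;
    # Skip lines that are empty or contain only whitespace.
    next unless $line =~ /\S/;
    push @urls, $line;
}

# Brackets around each value make stray whitespace visible.
print "[$_]\n" for @urls;
```

In the original loop, the same `next unless /\S/;` check placed right after the chomp would keep empty lines from ever reaching the HTTP request.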