lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all the monks, I'm having some trouble with a piece of code I have written which doesn't seem to do what it's supposed to.

I have an array of URLs which I want to check, removing the elements that aren't plain text or HTML (a URL ending in .pdf, for example, should be deleted). Here is what I have written:

use LWP;

my $browser = LWP::UserAgent->new();
my $index = 0;
my $response = $browser->get($url_name);

foreach (@links) {
    $response = $browser->get($_);
    if (($response->content_type eq 'text/html')
        || ($response->content_type eq 'text/plain')) {
        $index++;
    }
    else {
        splice(@links, $index, 1);
    }
}

foreach (@links) {
    print "$_\n";
}

The idea is that pages with the content I need (plain text or HTML) will remain in the array and the loop will move on to the next element to check that one. If the loop finds an element to be deleted, it should use the splice function to delete it (which means that after deleting an element, the next element in the array takes its place, e.g. if I delete $links[5] then $links[6] becomes $links[5] and so on, so there's no need to move the index up. Hope this made sense).
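
To show what I mean by the reindexing, here's a tiny standalone example (made-up data, nothing to do with my real code):

use strict;
use warnings;

my @a = ('a' .. 'e');
splice(@a, 2, 1);    # remove 'c'
print "@a\n";        # prints "a b d e" -- 'd' now sits at index 2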

Hope someone can point out what I'm doing wrong. Thanks.

Re: Checking URL's and deleting element arrays without HTML
by fizbin (Chaplain) on Oct 24, 2005 at 13:19 UTC

    You're trying to select out of a list those items that match a certain criterion. Why not use the perl built-in that means "select those items out of the list that match this"?

    Namely, grep:

    use LWP;
    my $browser = LWP::UserAgent->new();
    @links = grep {
        my $response = $browser->head($_);
        $response->content_type =~ m[^text/(html|plain)$]
    } @links;
    foreach (@links) { print "$_\n"; }

    If you wanted to, you wouldn't even need the $response variable.
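
    For instance (untested sketch, dropping the variable and calling content_type straight off what head returns):

    use LWP;
    my $browser = LWP::UserAgent->new();
    @links = grep { $browser->head($_)->content_type =~ m[^text/(html|plain)$] } @links;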

    Note that you should probably also include the types text/xml and application/xhtml+xml in your list of what types are "text". (Servers should be serving XHTML documents as application/xhtml+xml these days.)
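
    If you go that route, widening the pattern above might look something like this (untested):

    @links = grep {
        $browser->head($_)->content_type
            =~ m[^(?:text/(?:html|plain|xml)|application/xhtml\+xml)$]
    } @links;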

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Checking URL's and deleting element arrays without HTML
by Zaxo (Archbishop) on Oct 24, 2005 at 13:17 UTC

    That's a clever way to handle the indexing problem, but you're still pulling the rug out from under your loop array: foreach aliases $_ to the elements of the original list, so what happens when one of those aliases refers to something beyond the new end of the array?

    Splicing elements out of the loop array is what causes the skips: copying the remaining elements downward to fill the gap makes the next element drop out of the loop, since the top of the array moves down while the loop's notion of where it stands does not.
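
    You can watch the skip happen with a throwaway example (strictly speaking, modifying an array while foreach is iterating over it is undefined behaviour in Perl, but this is what you'll typically see):

    my @list  = qw(good BAD1 BAD2 good);
    my $index = 0;
    foreach (@list) {
        print "visiting: $_\n";                  # BAD2 is never visited
        if   (/BAD/) { splice(@list, $index, 1) }
        else         { $index++ }
    }
    print "left over: @list\n";                  # good BAD2 good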

    Here's a safe way to remove elements from an array in a for loop. Use indexes and work from the top of the array down:

    foreach my $index (reverse 0 .. $#links) {
        $response = $browser->get($links[$index]);
        splice(@links, $index, 1)
            unless $response->content_type eq 'text/html'
                || $response->content_type eq 'text/plain';
    }
    It would be more economical to work with head instead of get, but many sites deny head requests for no good reason.
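
    If you want head's economy without losing the sites that deny it, one compromise (untested) is to try head first and fall back to get inside the loop above:

    my $response = $browser->head($links[$index]);
    $response = $browser->get($links[$index])
        unless $response->is_success;    # some servers reject HEAD outright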

    Update: fizbin's grep suggestion is most likely a better way to do what you want. ++frodo72's addition of checking the $response for success.

    After Compline,
    Zaxo

Re: Checking URL's and deleting element arrays without HTML
by polettix (Vicar) on Oct 24, 2005 at 13:20 UTC
    Untested:
    my @filtered_links = grep {
        my $response = $browser->get($_);    # Better: ->head($_)
        $response->is_success
            && (($response->content_type eq 'text/html')
                || ($response->content_type eq 'text/plain'))
    } @links;

    Update: fizbin's is better; use head instead of get. Anyway, I'd retain the check on is_success, for documentation and readability reasons.

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.