perlpoda has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I'm quite new to the Perl world and managed to write a small program that checks links with the LWP::Simple module.

That small script wrote a .csv file with all the non-working links, one per line. I now have quite a few of them (>1000) in that csv file.

I also have the original file with all the links, a .csv file with tab-separated fields.

What I want to do is search the original file for all the non-working links and delete the whole line for each one.

Is there a way to do this with an easy perl script? Big thanks in advance.

Regards, Rob

Replies are listed 'Best First'.
Re: Find & Delete by comparing two files
by kcott (Archbishop) on Sep 11, 2012 at 07:14 UTC

    G'day perlpoda,

    Welcome to the monastery.

    Without any code or data, you're only going to get vague responses. Follow these guidelines for a better answer: How do I post a question effectively?

    Here's a basic technique that does the sort of thing you appear to want:

    $ perl -Mstrict -Mwarnings -E '
        my @identified_bad_links = (qw{a c e});
        my %bad_link = map { $_ => 1 } @identified_bad_links;
        my @all_links = (qw{a b c d e f g});
        say $_ for grep { ! $bad_link{$_} } @all_links;
    '
    b
    d
    f
    g
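
    Applied to the two files you describe, a minimal sketch of the same idea might look like this. Note that the file names, the one-URL-per-line layout of the broken-links file, and the URL sitting in the third tab-separated column are all assumptions on my part:

    use strict;
    use warnings;

    # Build a lookup hash of the broken links (one URL per line assumed).
    open my $bad_fh, '<', 'broken_links.csv'
        or die "Cannot open broken_links.csv: $!";
    chomp( my @bad_links = <$bad_fh> );
    close $bad_fh;
    my %is_bad = map { $_ => 1 } @bad_links;

    # Copy the original file, skipping every line whose third
    # tab-separated field is a known broken link.
    open my $in,  '<', 'Linkfile.csv'       or die "Cannot open Linkfile.csv: $!";
    open my $out, '>', 'Linkfile_clean.csv' or die "Cannot open Linkfile_clean.csv: $!";
    while ( my $line = <$in> ) {
        my @fields = split /\t/, $line;
        my $url    = defined $fields[2] ? $fields[2] : '';
        chomp $url;
        print {$out} $line unless $is_bad{$url};
    }
    close $in;
    close $out;

    Writing to a new file and renaming it afterwards is usually safer than editing the original in place.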

    -- Ken

Re: Find & Delete by comparing two files
by Athanasius (Archbishop) on Sep 11, 2012 at 08:10 UTC

    As Anonymous Monk and kcott have said, the requirements are unclear.

    A guess: You need to discover which links are broken, and want to automate the discovery process using Perl. In that case, the 2009 thread Detect Broken links should help.

    Athanasius <°(((>< contra mundum

      Hello, thanks for getting back to me about my problem. OK, let's see: I have this pretty ugly code that I managed to write, and it is working. I'm sure there are nicer ways to achieve my goal, but I'm a complete beginner - sorry:
      #!/usr/bin/perl
      use LWP::Simple;

      $elem1 = "http://www.test.de/subfolder/";
      $elem3 = "http://www.test.de/subfolder/de/";

      # --------- Logfile ----------------------------
      my $delimiter = "\t";
      my $logfile   = "errorlog.txt";
      my $datum     = localtime();
      my $logmsg    = "$datum $ENV{USER} Broken Links";

      open LOGFILE, ">$logfile" or die $!;
      print LOGFILE $logmsg, "\n";

      # --------- Open File -----------------------
      open(READ, "Linkfile.csv") or die $!;
      while (my $line = <READ>) {
          if ($line =~ /\d+\t/) {
              my @liste = split($delimiter, $line);
              push(@urls, $liste[2]);
          }
      }
      close READ;

      # ----------------------------------------------
      foreach (@urls) {
          chomp($_);
          if (head($_)) {
              print $_ . " is working.\n\n";
          }
          else {
              print $_ . " is broken.\n\n";
              open LOGFILE, ">>$logfile" or die $!;
              print LOGFILE "\n" . $_ . " is broken.";
              close LOGFILE;

              $tmp_url  = "$_.";
              @array    = split(/\?/, $_);
              $myString = $array[0];

              if ($myString =~ m#http:\/\/www.test.de\/subfolder\/de\/#) {
                  $myString =~ s#http:\/\/www.test.de\/subfolder\/de\/##;
                  $newUrl = $elem1 . $myString;
              }
              else {
                  $myString =~ s#http:\/\/www.test.de\/subfolder\/##;
                  $newUrl = $elem3 . $myString;
                  print $newUrl . " generated \n\n";
              }

              if (head($newUrl)) {
                  print $newUrl . " is working.\n\n";
                  open LOGFILE, ">>$logfile" or die $!;
                  print LOGFILE "\n" . $newUrl . " is working.\n";
                  close LOGFILE;
              }
              else {
                  print $newUrl . " isn't working, too.\n\n";
                  open LOGFILE, ">>$logfile" or die $!;
                  print LOGFILE "\n" . $newUrl . " isn't working, too.\n";
                  close LOGFILE;
              }
          }
      }

      Basically it does the following: it opens the file, grabs the third tab-separated field, and tests whether the server responds with a 200 (all OK) or a 404 (not found).

      If a link is broken, it then checks whether the problem is only the subfolder: it generates a new URL with the subfolder swapped and tests the head of that link as well.

      So now I have a file with all the broken links. What I want to do is delete these broken links from my original file.

      The PROBLEM is that I don't want to delete just the link, but the whole line in the original file.

      I'm sorry if I can't describe it better - English is not my mother tongue ;)

      Thanks in advance. Regards, Robert

        Hello again perlpoda,

        Glad your code is working. Here are a few ways to improve it, in addition to the suggestions made by nemesdani:

        1. Always begin your scripts with:

          use strict;
          use warnings;

          strict will force you to pay attention to the scope of your variables, which is a good thing.

        2. Prefer lexical filehandles, and use the 3-argument form of open:

          open(my $log, '>', $logfile)
              or die "Cannot open file '$logfile' for writing: $!";
        3. As nemesdani noted, it is better to avoid opening and closing files more often than necessary. In this case, you open LOGFILE for writing and never close it, so later, in the foreach loop, there is no need to open it again for appending: it is still open, just write to it! Leave it open within the loop and close it explicitly, once, after the loop (see the sketch after this list).

        4. Add comments. For example, what is the code re-writing $newUrl all about? I have no idea, and chances are neither will you when you come back to this script in, say, 6 months' time.
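
        As an illustration of points 2 and 3 together, here is a minimal sketch, using $logfile, $logmsg and @urls from your script, with head() coming from LWP::Simple as in your code:

          # Sketch only: open the log once with a lexical handle, write to it
          # inside the loop, and close it once at the end.
          open my $log, '>', $logfile
              or die "Cannot open file '$logfile' for writing: $!";
          print {$log} $logmsg, "\n";

          for my $url (@urls) {
              chomp $url;
              if ( head($url) ) {
                  print "$url is working.\n\n";
              }
              else {
                  print "$url is broken.\n\n";
                  print {$log} "\n$url is broken.";    # handle is still open
                  # ... rebuild and re-test the URL here ...
              }
          }

          close $log or die "Cannot close '$logfile': $!";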

        nemesdani has given you some good ideas about deleting whole lines. You’re making progress, keep going!

        Athanasius <°(((>< contra mundum

        A few general suggestions (I haven't read your code thoroughly, sorry):
        Pack related steps (e.g. "if the link is broken, check whether it is only a subfolder problem, generate a new URL, and test the head of that link") into subroutines; your code will be clearer and easier to extend.

        Open the files once and write to them as you go; don't reopen them every time (it costs time and performance).

        About the question: if you find a broken link, you could save its line number in an array, and after you have checked every line, delete those lines.
        One solution that comes to mind is Tie::File.
        Here is an example of deleting the last line of a file, stolen from the Perl Cookbook (hell yeah, I am lazy):
        use Tie::File;
        tie @lines, Tie::File, $file or die "can't update $file: $!";
        delete $lines[-1];
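
        Applied to your two files, a sketch with Tie::File might look like this. The broken-links file name, its one-URL-per-line layout, and the URL sitting in the third tab-separated column of Linkfile.csv are assumptions:

        use strict;
        use warnings;
        use Tie::File;

        # Build a lookup of broken URLs (one per line assumed).
        open my $bad_fh, '<', 'broken_links.csv'
            or die "Cannot open broken_links.csv: $!";
        my %is_bad = map { chomp; $_ => 1 } <$bad_fh>;
        close $bad_fh;

        # Each element of @lines is one line of the original file.
        tie my @lines, 'Tie::File', 'Linkfile.csv'
            or die "can't update Linkfile.csv: $!";

        # Walk backwards so splicing doesn't shift the indices still to come.
        for my $i ( reverse 0 .. $#lines ) {
            my @fields = split /\t/, $lines[$i];
            my $url    = defined $fields[2] ? $fields[2] : '';
            splice @lines, $i, 1 if $is_bad{$url};
        }

        untie @lines;

        Keep in mind that Tie::File edits Linkfile.csv in place, so make a backup copy before you run anything like this.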

        I'm too lazy to be proud of being impatient.
Re: Find & Delete by comparing two files
by Anonymous Monk on Sep 11, 2012 at 07:09 UTC

    That isn't very clear, especially this part

    What I want to do is search the original file for all the non-working links and delete the whole line for each one.

    Since the file is made up of non-working links, just delete the file?