Link rot, what a pain. Here's something very quickly thrown together based upon your expanded criteria.

Create a new directory, copy your htm files in there.
Install the required modules by running the following from the command prompt: cpanm Mojolicious Path::Tiny
Download the code below to the same location. Run the code.

This code reads each htm file in the directory, parses it with Mojo::DOM, finds all links and checks each URL with Mojo::UserAgent. If a link looks dead (the request throws an error, e.g. the host can't be reached), it removes the link's parent HTML element, then saves the file.

Example HTML:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archive.org</a></li>
<li><a href="http://sitedoesnotexist9999.net">fakesite</a></li>
</ul>
</body>
</html>

Perl code:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;

# for each html file
for ( $dir->children( qr/\.htm$/ ) ){

  # read the contents into a variable
  my $html = path( $_->basename )->slurp;

  # get the dom
  my $dom = Mojo::DOM->new( $html );

  # find all links that actually have an href (skips bare <a name="..."> anchors)
  for( $dom->find('a[href]')->each ){
    
    # get target href
    my $url = $_->attr('href');
    say "checking link $url";
    
    # use Mojo::UserAgent to check if link is alive
    my $ua  = Mojo::UserAgent->new;
    my $res;
    eval { $res = $ua->max_redirects(5)->head( $url )->result };

    # if an error is thrown
    if ( $@ ){
      warn "$url seems dead, removing parent";
      $_->parent->remove;
    } 

    # play nice
    sleep(10);
    
  }
  # save file  
  path( $_->basename )->spew($dom->content);
}

Example HTML after running program:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archive.org</a></li>

</ul>
</body>
</html>

I don't have an example of what you're actually using, and placeholders like [404 Not Found] rarely make sense to keep around, so the code above removes dead links entirely. However, calling the replace method on the link, rather than remove on its parent, does exactly what you asked for:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use Mojo::DOM;
use Mojo::UserAgent;

# get current directory
my $dir = Path::Tiny->cwd;

# for each html file
for ( $dir->children( qr/\.htm$/ ) ){

  # read the contents into a variable
  my $html = path( $_->basename )->slurp;

  # get the dom
  my $dom = Mojo::DOM->new( $html );

  # find all links that actually have an href (skips bare <a name="..."> anchors)
  for( $dom->find('a[href]')->each ){
    
    # get target href
    my $url = $_->attr('href');
    say "checking link $url";
    
    # use Mojo::UserAgent to check if link is alive
    my $ua  = Mojo::UserAgent->new;
    my $res;
    eval { $res = $ua->max_redirects(5)->head( $url )->result };

    # if an error is thrown
    if ( $@ ){
      warn "$url seems dead, updating link";
      $_->replace('[404 Not Found]');
      
    } 

    # play nice
    sleep(10);
    
  }
  
  # save file
  path( $_->basename )->spew($dom->content);
}

Which leaves the example HTML looking like this:

<html>
<head>
<title>test</title>
</head>
<body>
<ul>
<li><a href="http://perlmonks.org">perlmonks</a></li>
<li><a href="http://archive.org">archive.org</a></li>
<li>[404 Not Found]</li>
</ul>
</body>
</html>
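
One thing to be aware of: as far as I know, result only dies on connection errors, so a server that answers with a 404 still comes back as a normal response and the link is kept. If you'd rather treat error statuses as dead too, here is a minimal sketch of the decision. looks_dead is a hypothetical helper of my own, not part of Mojolicious, and it assumes you pass it the error from eval and the response's status code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide whether a link should count as dead.
#   $err  - the error caught by eval ($@), empty if the request completed
#   $code - the HTTP status code of the response, undef if there was none
sub looks_dead {
    my ( $err, $code ) = @_;
    return 1 if $err;                 # connection-level failure
    return 1 unless defined $code;    # no response at all
    return $code >= 400 ? 1 : 0;      # 4xx and 5xx count as dead
}

print looks_dead( '',              200   ), "\n";  # alive
print looks_dead( '',              404   ), "\n";  # dead: HTTP error status
print looks_dead( "Can't connect", undef ), "\n";  # dead: request failed
```

In the scripts above the test would then become something like if ( looks_dead( $@, $res ? $res->code : undef ) ). Bear in mind that some servers may answer a HEAD request with an error status even though the page itself is fine, so check what gets flagged before trusting it.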

There's a 10 second sleep in there so you don't batter the sites you're checking. There is room for optimisation, for example skipping URLs that occur more than once, or keeping a list of URLs already tested as working, but I'll leave that as an exercise for you.
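
If you want a starting point for remembering URLs that have already been tested, something along these lines might do. It's only a sketch: is_alive, %seen and the stand-in checker are names I've made up for illustration, and the real check would wrap the Mojo::UserAgent request from the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remember the verdict for every URL already tested, so each one is
# requested (and slept on) at most once per run.
my %seen;

# $check is a code ref returning true when the URL is alive; in the
# scripts above it would wrap the Mojo::UserAgent HEAD request.
sub is_alive {
    my ( $url, $check ) = @_;
    $seen{$url} = $check->($url) ? 1 : 0 unless exists $seen{$url};
    return $seen{$url};
}

# Stand-in checker for demonstration only: counts its calls and treats
# the fake site from the example HTML as dead.
my $calls = 0;
my $fake_check = sub {
    my ($url) = @_;
    $calls++;
    return $url !~ /sitedoesnotexist9999/;
};

my @results = map { is_alive( $_, $fake_check ) } qw(
    http://perlmonks.org
    http://sitedoesnotexist9999.net
    http://perlmonks.org
);
print "verdicts: @results, checks made: $calls\n";
```

Declaring %seen outside the file loop means a URL that appears in several htm files is only checked once. Do the sleep inside the checker so that cached hits don't wait either.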

Update: small addition.

In reply to Re: Batch remove URLs by marto
in thread Batch remove URLs by bobafifi
