ww has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to find ALL html links in all *.ht* files in a tree, using File::Find. My script (find_cmcl.pl, longish and undoubtedly clumsy, but available if that's useful) finds all the links in ONLY ONE file, in the lowest subdirectory, when I feed it:
c:\foo> find_cmcl.pl ./
Shamefacedly, I've cargo-culted from suprperl's introductory Beginners guide to File::Find and from the pod's examples, but none of that has helped either solve the problem OR improve my understanding. And I'm finding the pod incomprehensible -- largely because it assumes more knowledge than I have so far acquired. (Ditto, of course, the code itself.)
Regrettably, this has gone past a learning experience. Management suddenly wants all links to all non-federal-govt or non-state-govt sites removed, and yesterday, damnit! (i.e., Red Cross, USO, and Joe's_Pool_Hall_with_discounts_for_MyAgency'sEmployees ALL have to go.)
So, much as I respect RTFM, can someone point me to an alternate tutorial or guidance, or, mayhaps, utter/author a few simple lines of wisdom to further study and understanding? I really don't want a solution, which is part of why I posted no code (...well, I sorta do, but that's not gonna be as effective as actually finding something written for my level of understanding).
TIA

UPDATE, 23 Jan 05
Solved, with many thanks to all who provided help. Working code follows, but be sure to read Joost, holli, scooterm, borisz and McMahon, below. Their insights may help you if you, too, are having trouble with the docs!
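Boiled down, the pattern the working code relies on is: save the full path of each qualifying file inside File::Find's "wanted" sub, and open the files only after find() has finished walking the tree (find() chdirs as it goes). A bare-bones sketch of just that skeleton -- the names here are illustrative, not lifted from the script below:

use strict;
use File::Find;

my @filenames;

sub wanted {
    # find() calls this once per file/directory, with $_ holding the bare
    # name and $File::Find::dir the directory it lives in; joining the two
    # lets the file be opened after the walk is over.
    push @filenames, "$File::Find::dir/$_" if -f and /\.html?$/i;
}

find( \&wanted, $ARGV[0] );
print "$_\n" for @filenames;

The full script fleshes that out with the actual link-hunting: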
#!c:/Perl/bin/perl -w
use strict;
use File::Find;

# PURPOSE: find_http.pl finds ALL links in which "http" appears inside "<a href.../a>"
# in all w3c 4.01 compliant *.ht* files in the $ARGV[0] directory and its subdirs; in
# other words, it finds absolute links, including both offsite links and those local
# links written in absolute notation, each prepended in the output by the line number
# in which the link is found.
#
# ACKNOWLEDGEMENTS: thanks to ikegami, Joost, holli, scooterm, zaxo, Corion and other
# Monks for their tutelage (but who bear NO responsibility for non-idiomatic constructs
# or errors). Readers should also note that holli warns emphatically NOT to use regexes
# to parse html ("because it is fatally error-prone, unnecessary (there is
# HTML::Parser)...") and that he says "fatally error-prone" especially when applied
# "to ill-formed html." The warning's been taken to heart, but was laid aside for the
# purposes of this drill. (schodckwm 1/22/05)

use vars qw( $dirname $file @filenames $input @input $linecounter
             $part @parts $href $offsite $link @found $subdir );

$linecounter = 0;
$href    = qr/<a /i;        # any link begins "<a " (used in sub gethtml)
$offsite = qr%http://%i;    # ...and an absolute one includes "http://"

if (@ARGV == 1) {           # get the name of the directory to scan
    $dirname = $ARGV[0];
} else {
    print STDERR "\tUsage: $0 dirname > outfile.txt\n",
        "\twhere dirname can be a relative or absolute path.\n",
        "\tEven under MS Windows, *nix-style '/' forward slashes in the path are\n",
        "\trecommended, i.e., from d:, 'd:/foo' or './foo' (or './long/path/to/target')\n";
    exit(1);
}

# call &process_file for each file in the directory (& subdirs) in $ARGV[0]
find( \&process_file, $dirname );
gethtml();
exit();

sub process_file {
    # From File::Find's docs: $File::Find::dir is the current directory name
    # and $_ is the current filename within that directory. Since find()
    # chdirs as it walks, save the full path so the file can be opened later.
    if ( -f and /\.ht(?:ml?)?$/i ) {
        $subdir = $File::Find::dir . "/";
        push @filenames, $subdir . $_;
    }
}

sub gethtml {
    print "\n\t Files found:\n";
    foreach $file (@filenames) {
        print "\n\t found: $file";
        unless ( open INFILE, '<', $file ) {
            warn "\tCan't open $file: $!";
            next;
        }
        @input = <INFILE>;      # slurp the whole file into @input
        close INFILE;
        push @found, "\n\n\t" . $file . "\n";
        $linecounter = 0;       # reset the line counter for each file
        foreach $input (@input) {
            $linecounter++;
            @parts = split m!(</a>)!i, $input;
            foreach $part (@parts) {
                if ( $part =~ m%
                        (       # capture the candidate link
                        $href   # start on '<a href="', '<a href="mailto:', etc.
                        .+      # one or more of anything but a newline
                                # (the close tag was already eaten by split)
                        )
                      %ix )     # the original /g flag was unnecessary
                {
                    my $link_a = $1;                # the match of (any) link
                    if ( $link_a =~ m%$offsite% ) { # check for http://
                        # NB: this does NOT constrain where $offsite sits in
                        # "<a ...</a>", because the link may be written, for
                        # example, <a name="foo" href="http....
                        $link = $link_a . '</a>';   # restore the close tag
                        push @found, $linecounter;
                        push @found, $link;
                    }
                }
            }   # end foreach $part (@parts)
        }       # end foreach $input (@input)
    }           # end foreach $file (@filenames)
    print_links(@found);
}

# NB: this sub was originally named "print", which collides with the builtin
# (and draws an "Ambiguous call" warning under -w), so it is renamed here.
sub print_links {
    print "\n\n\t Found these \"http\" Links \n";
    foreach $link (@_) {
        my $out = $link . " ";
        $out .= "\n" if $link !~ /^\d*$/;   # a bare line number stays on one
                                            # line with the link that follows it
        print $out;
    }
}

#ENDNOTES: With ActiveState perl 5.8.4, this extracts all "http" links from a
# local mirror (Xitami on E:) of a ~1600 page website and writes them to a
# local ATA drive (F: on a P4, 2.4GHz, w2k box) in ~16 seconds.
# The pages searched range from trivial to ~2400 lines of 4.01 html.
#
# It would be non-trivial to output to html: reformatting
# "Line_number <a href="whatever">rendered link</a>" so that a browser
# displays it properly is beyond the scope of this exercise.