ww has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find ALL html links in all *.ht* files in a tree, using File::Find. My script (find_cmcl.pl, longish and undoubtedly clumsy, but available if that's useful) finds all the links in ONLY ONE file, in the lowest subdirectory, when I feed it:

        c:\foo\ find_cmcl.pl ./

Shamefacedly, I've cargo-culted (from suprperl's introductory Beginners guide to File::Find) and from the pod's examples, but none of that's helped either solve the problem OR improve my understanding. And I'm finding the pod incomprehensible -- largely because it assumes more knowledge than I have so far acquired (ditto, of course, the code itself).

Regrettably, this has gone past a learning experience. Management suddenly wants all links to all non-federal-govt or non-state-govt sites removed, and yesterday, damnit! (i.e., Red Cross, USO, and Joe's_Pool_Hall_with_discounts_for_MyAgency'sEmployees ALL have to go.)

So, much as I respect RTFM, can someone point me to an alternate tutorial or guidance or, mayhaps, utter/author a few simple lines of wisdom to further study and understanding?

I really don't want a solution, which is part of why I posted no code ( ...well, I sorta' do, but that's not gonna' be as effective as actually finding something written for my level of understanding).

TIA

UPDATE, 23 Jan 05
Solved, with many thanks to all who provided help. Working code follows, but be sure to read Joost, holli, scooterm, borisz and McMahon, below. Their insights may help you if you, too, are having trouble with the docs!

#c:/Perl/bin -w
use File::Find;
use strict;

# PURPOSE: find_http.pl finds ALL links in which "http" appears inside "<a href.../a>"
# in all w3c 4.01 compliant *.ht* files in the ARGV[0] directory and its subdirs; in
# other words, it finds absolute links, including both offsite links and those local
# links written in absolute notation, each prepended in the output by the line number
# in which the link is found.
#
# ACKNOWLEDGEMENTS: thanks to ikegami, Joost, holli, Scooterm, zaxo, corion and other PM,
# for their tutelage (but who bear NO responsibility for non-idiomatic constructs or
# errors). Readers should also note that holli warns emphatically NOT to use regex to
# parse html ("because it is fatally error-prone, unnecessary (there is HTML::Parser)..."
# and that he says "fatally error-prone" especially when applied "to ill-formed html."
# The warning's been taken to heart, but was laid aside for the purposes of this drill
# (schodckwm 1/22/05)

use vars qw( $dirname $file @filenames $input @input $linecounter $part @parts
             $href $offsite $link @found $subdir );

$linecounter = 0;
$href    = qr/<a /i;          # any link beginning "<a " (see sub gethtml)
$offsite = qr%http:[/]{2}%i;  # find links which include "http://"

if (@ARGV == 1) {    # Get the name of the directory to scan
    $dirname = $ARGV[0];
}
else {
    print STDERR "\tUsage: $0 dirname > outfile.txt\n"
        . "\twhere dirname can be a relative or absolute path.\n"
        . "\tEven under MS Windows, use of *nix-style '/' forward slashes in the path\n"
        . "\tis recommended, i.e., from d:, 'd:/foo' or './foo' (or './long/path/to/target')\n";
    exit(1);
}

# call &process_file for each file in the directory (& subdirs) in $ARGV[0]
find \&process_file, $dirname;
&gethtml;
exit();

# From the File::Find pod: $File::Find::dir is the current directory name,
# and $_ is the current filename within that directory.
sub process_file {
    if ( $_ =~ /\.ht[ml]{1,2}$/i ) {
        $subdir = $File::Find::dir . "/";
        push @filenames, $subdir . $_;
    }
    return;
}

sub gethtml {
    print "\n\t Files found:\n";
    foreach $file (@filenames) {
        print "\n\t found: $file";
        open INFILE, "<", $file or warn "\tCan't open $file: $! ";
        push @found, "\n\n\t" . $file . "\n";
        @input = <INFILE>;    # slurp the whole file to @input
        $linecounter = 0;     # reset linecounter for next file
        foreach $input (@input) {
            $linecounter++;
            @parts = split m!(</a>)!i, $input;
            foreach $part (@parts) {
                if ( $part =~ m%
                        (        # CAPTURE to $1
                          $href  # START ON '<a href="' OR '<a href="mailto:', etc
                          .+     # one or more of anything not a newline
                                 # NB: the close tag was already eaten by split
                        )        # close capture
                      %ix        # end match
                   )
                {
                    my $link_a = $1;    # store the match of (any) link
                    # check for http:// -- NB: this regex does NOT constrain the
                    # position of $offsite in the <a href...</a>, because the link
                    # may be formatted, for example, <a name="foo" href="http....
                    if ( $link_a =~ m%$offsite% ) {
                        $link = $link_a . "</a>";    # restore the close tag
                        push @found, $linecounter;
                        push @found, $link;
                    }
                }
            }
        }
        close INFILE;
    }
    &print_found(@found);
}

# named print_found rather than print, so as not to collide with the builtin
sub print_found {
    print "\n\n\t Found these \"http\" Links \n";
    foreach $link (@found) {
        my $out = $link . " ";
        if ( $link !~ /^\d*$/ ) {    # each link (but not its line number) on its own line
            $out .= "\n";
        }
        print $out;
    }
}

#ENDNOTES: With ActiveState perl 5.8.4, this extracts all "http" links from a
# local mirror (Xitami on E:) of a ~1600 page website and writes them to a
# local ATA drive (F: on a P4, 2.4GHz, w2k box) in ~16 seconds.
# The pages searched range from trivial to ~2400 lines of 4.01 html.
#
# It would be non-trivial to output to html; however, reformatting
# "Line_number <a href="whatever">rendered link</a>" to be displayed properly
# by a browser is beyond the scope of this exercise.
PM ++!

Replies are listed 'Best First'.
Re: Tutorial on File::Find even more basic than "Beginners Guide"
by holli (Abbot) on Jan 20, 2005 at 23:10 UTC
    A first quick shot; this can be fine-tuned to better fit your needs:
    #always
    use strict;

    #load modules
    use File::Find;
    require HTML::LinkExtor;

    #create a HTML::LinkExtor-instance for later use
    my $links = HTML::LinkExtor->new(
        # first argument is a subroutine that will
        # be called for every link in the html
        # the object parses
        sub {
            # $tag can contain "a" or "img"
            # %links contains the "attributes" of the link
            my ($tag, %links) = @_;

            # print if we have an "a"-link that is not
            # page internal (no "#")
            print "$links{href}\n"
                if $tag eq "a" && $links{href} =~ /^[^#]/;
        }
    );

    #find all html-files in a tree
    find(
        # first argument is the sub that will be called
        # for every file AND directory found
        sub {
            # check if we have a file that has an htm- or html-suffix
            if ( -f $File::Find::name && $File::Find::name =~ /\.htm(l)?$/ ) {
                # if so, parse it for links
                print "$File::Find::name contains:\n";
                $links->parse_file($File::Find::name);
            }
        },
        "c:/perl"
    );
    Learn by examining code. You should change "c:/perl" to the path you need.

    p.s. what is wrong with the docs of "File::Find"? They are among the better ones.

    Update:
    Added comments

    holli, regexed monk
      Holli:

      I suspect the issue is NOT with the docs, but rather with this noob's overreaching. For example, both the pod and the ref'ed tutorial seem (to me) to say that, given a start_dir from the cli, F::F iterates through all subdirs, identifying those whose names match a regex in my processing sub. But stepping through my code with -d tells me I've misunderstood something. It comes back to the skill level of this reader.

      For the rest, thank you very much. I will study and understand... soon, I hope. <G>

Re: Tutorial on File::Find even more basic than "Beginners Guide"
by dimar (Curate) on Jan 20, 2005 at 23:32 UTC

    When I first learned how to use File::Find, it took some getting used to. The documentation and everything else about it was fine; nevertheless, I wrote a wrapper subroutine because I figured it could be made more readable to anyone who read my code but did not understand File::Find (or even Perl, for that matter).

    Here is an example; it might help you, then again it might not.

    ### begin_: init perl
    use strict;
    use warnings;
    use File::Basename;
    use File::Find;

    ### begin_: process all files in the dir tree
    my $oDataTable = dirTreeToDataTable("c:/temp");
    for my $oDataRec (@{$oDataTable}) {
        next unless $oDataRec->{extension} eq '.html';
        ProcessTheFile($oDataRec->{path});
    }
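    The snippet above calls a dirTreeToDataTable() wrapper whose body isn't shown in the post. A minimal sketch of what such a wrapper might look like (the `path` and `extension` field names come from the loop above; the other fields, and the implementation itself, are my assumptions, not dimar's actual code):

    ```perl
    use strict;
    use warnings;
    use File::Basename;
    use File::Find;

    # Hypothetical wrapper: walk a directory tree with File::Find and
    # return an array ref of hashrefs, one per file, each carrying the
    # full path plus the pieces File::Basename's fileparse() yields.
    sub dirTreeToDataTable {
        my ($root) = @_;
        my @table;
        find(
            {
                no_chdir => 1,    # keep $File::Find::name usable as a path
                wanted   => sub {
                    return unless -f $File::Find::name;
                    my ( $name, $dir, $ext ) =
                        fileparse( $File::Find::name, qr/\.[^.]*$/ );
                    push @table, {
                        path      => $File::Find::name,
                        directory => $dir,
                        filename  => $name,
                        extension => $ext,
                    };
                },
            },
            $root
        );
        return \@table;
    }
    ```

    The point of the wrapper is exactly what dimar describes: callers loop over plain hashrefs and never need to know File::Find's callback convention.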
Re: Tutorial on File::Find even more basic than "Beginners Guide"
by borisz (Canon) on Jan 20, 2005 at 23:17 UTC
    I think it is all in perldoc File::Find. Even if you do not want a solution, here is one ;-)
    use File::Find;
    use HTML::LinkExtor;

    my @files;
    find(
        {
            wanted => sub { -f && /\.ht/ && push @files, $File::Find::name }
        },
        @ARGV
    );

    my $p = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            return if $tag ne 'a';
            print $attr{href}, "\n";
        }
    );
    $p->parse_file($_) for (@files);
    Call it this way: perl script.pl dir ...
    Boris
Re: Tutorial on File::Find even more basic than "Beginners Guide"
by McMahon (Chaplain) on Jan 20, 2005 at 23:04 UTC
    Would this help? It's Windowsish, but'll still work. Of course, you could probably use real grep, too.


    use warnings;
    use strict;
    use File::Find;

    open( OUT, ">C://strings.txt" );
    my $search_string = 'http';

    find( \&grep, "C:\\Program Files\\whatever" );

    sub grep {
        my $file = $File::Find::name;
        if ( -f $file && -r _ ) {
            open( my $fh, "<", $file ) or return;
            while ( my $rec = <$fh> ) {
                print OUT $rec if $rec =~ /$search_string/;
            }
        }
    }
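    The "real grep" alternative mentioned above can be sketched as a one-liner (assuming GNU grep, e.g. under Cygwin on a Windows box; the path is a placeholder):

    ```shell
    # Recursively print every line containing "http" in *.htm / *.html
    # files under the target tree, prefixed by filename and line number.
    grep -rn --include='*.htm' --include='*.html' 'http' /path/to/site
    ```

    Like the Perl above, this matches raw lines rather than parsed links, so it will also catch "http" inside comments or plain text.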
      McMahon:

      I guess Joost's critique was fairer than I realized. If I understand your code correctly, this addresses the flip-side of my problem, but I will now compare your sub with docs, in the strong suspicion that the example WILL help me understand said docs/tutorials/whatever.

      Thank You!
Re: Tutorial on File::Find even more basic than "Beginners Guide"
by Joost (Canon) on Jan 20, 2005 at 22:56 UTC
      I think you may have missed the last graf: I am NOT seeking a solution; I'm seeking a reference to a tutorial on File::Find better suited to my level of understanding of perl than any I have so far located.

      As to the point of finding all .html links (such as <a href="foo.bar...., <a name="foo style="xxxx" href=".... and so on), I invite your attention to the para referring to management's directive to remove links to a wide variety of sites (which were previously "approved.")

      However, in all fairness, I don't like to see homework or no-work posted either.

      Update (23 Jan 05): Working code may be found in the grandparent, Tutorial on File::Find even more basic than "Beginners Guide".

        Ok, well, File::Find is not that hard:
        use File::Find;
        use strict;

        sub process_file {
            return unless -f $_;  # skip processing unless $_ is really a file (not a dir)
            # some code that
            # does stuff with $_ (contains the "current file")
        }

        # call &process_file recursively for each file in /some/directory
        find \&process_file, "/some/directory";
        The rest of File::Find is just "specifics" that you don't need for the problem you're trying to solve.

        update: Try using HTML::LinkExtor - something like

        use HTML::LinkExtor;

        my $p = HTML::LinkExtor->new();

        sub process_file {
            return unless /\.html?$/i;  # skip unless *.htm / *.html file
            return unless -f $_;        # skip processing unless $_ is really a file (not a dir)
            $p->parse_file($_);
            print $p->links;
        }