in reply to Script (ovid's) explanation
If you are new to Perl, you will probably need to read through this a few times to get the full meaning. Sorry about that.
The easiest way to explain this is to break it down with line numbers and do a quick walk-through. The first thing that you will notice is that I have added the indentation back in. This is important. Proper indentation allows an experienced programmer the opportunity to merely glance at an expression and know its scope. Without indentation, it can be difficult to determine whether or not a given statement is in the while, for, or if block that you're expecting it to be. Poor indentation can introduce bugs that are difficult to spot.
Now, on to the code:
01: use strict;
02: use File::Find;
03: use HTML::TokeParser;
04:
05: my $bak_ext = '.bak';
06: my $root_dir = '/temp';
07:
08: find(\&wanted, $root_dir);
09:
10: sub wanted {
11:     # if the extension fits...
12:     if ( /\.html?$/i ) {
13:         print "Processing $_\n";
14:         my $new = $_;
15:         my $bak = $_ . $bak_ext;
16:         rename $_, $bak or die "Cannot rename $_ to $bak: $!";
17:
18:         open NEW, "> $new" or die "Cannot open $new for writing: +$!";
19: + #WHAT IS THE + DOING?
20:         #I DONT UNDERSTAND THIS TOKEN PART
21:         my $p = HTML::TokeParser->new( $bak ); #IS new( $bak ) A FUNCTION
22:         # AND IF SO WHAT IS IT DOING?
23:         while ( my $token = $p->get_token ) {
24:
25:             # this index is the 'raw text' of the token
26:             #I AM LOST ON THIS PART ALTHOUGH I UNDERSTAND IT IS
27:             #AN IF ELSE STATEMENT WHAT IS THE 'T' AND 1 AND -1 DOING??
28:             my $text_index = $token->[0] eq 'T' ? 1 : -1;
29:
30:             # it's both a start tag and a meta tag
31:             #PLEASE EXPLAIN THIS PART
32:             if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
33:                 $token->[ $text_index ] =~ s/AA\.//g;
34:             }
35:             #I DONT UNDERSTAND THIS PART.
36:             print NEW $token->[ $text_index ];
37:         }
38:         close NEW;
39:     } else {
40:         print "Skipping $_\n";
41:     }
42: }
Line 1 tells Perl that we want to use some good programming practices such as predeclaring variables, not using things called 'soft references' and not using 'barewords' for subroutines unless the subroutines are predeclared. See "perldoc strict" for more information.
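To make the 'soft references' point a little more concrete, here is a minimal sketch (the variable names $count and $name are made up for illustration). With strict in effect, using a plain string as if it were a variable name is refused at run time, and the eval lets us look at the error instead of dying:

```perl
use strict;
use warnings;

our $count = 5;        # a package variable, reachable by name
my  $name  = 'count';  # a plain string that happens to match that name

# Treating the string in $name as a variable name is a symbolic ("soft")
# reference. Under strict 'refs' Perl refuses to do it.
my $value = eval { ${$name} };
my $error = $@;

print defined $value ? "got $value\n" : "strict stopped it: $error";
```

Without "use strict", the same expression would quietly return 5, which is exactly the kind of surprise strict is there to prevent.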
Lines 2 and 3 pull in the two modules that I want to use. File::Find is the standard module for recursively traversing directories. Most alternatives to this module are broken. HTML::TokeParser is a module that allows us to properly parse HTML. There are several good alternatives here, but I happen to be familiar with this one. Newer programmers often use regular expressions to parse HTML (I've been guilty of that), but their solutions are usually extremely flawed. Here is some sample HTML that this module will handle, but most regular expressions will have problems with:
<body bgcolor=#000000 text="white">
< input type="hidden" name="weird indenting is legal" value=??? >
<input type=text name="foobar" value='>>> watch out for angle brackets'>
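As a rough illustration of how a regular expression trips over the last tag in that sample, here is a small sketch. The pattern is just a typical naive attempt that I made up for the demonstration, not something from the original script:

```perl
use strict;
use warnings;

# The last tag from the sample above; its value attribute contains '>'.
my $html = q{<input type=text name="foobar" value='>>> watch out for angle brackets'>};

# A naive "everything up to the first >" pattern stops inside the quoted value.
my ($tag) = $html =~ /(<input[^>]*>)/i;

print "regex thinks the tag is: $tag\n";
print "but the real tag is:     $html\n";
```

The match ends at the first angle bracket inside the quoted value, so everything after it is lost. A real parser tracks the quoting and does not have this problem.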
Line 5 was a boo-boo on my part. This was just a quick hack. I either should have passed that into my subroutine or defined it as a constant at the top of the program. Subroutines should rarely, if ever, rely on variables declared outside of themselves. This makes it hard to find out how changes to those variables might affect the subroutines. In large programs, many of these will cause you a problem.
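A minimal sketch of both alternatives, assuming a hypothetical helper named backup_name (it is not in the original script): the extension is declared once as a constant and then passed in as an argument, so the subroutine no longer depends on an outside variable.

```perl
use strict;
use warnings;

# Declared once, at the top, as a constant rather than a file-level "my".
use constant BAK_EXT => '.bak';

# Hypothetical helper: everything it needs arrives through its arguments.
sub backup_name {
    my ( $file, $ext ) = @_;
    return $file . $ext;
}

my $bak = backup_name( 'index.html', BAK_EXT );
print "$bak\n";
```

Either approach on its own would have been an improvement; the two together just make the dependency easy to see.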
Line 6 is the root directory that you want to search. If this were a larger program, having this in one variable would make it easy to reference in more than one place, if necessary. For example, if we wanted to print out a report, this would be handy (following the idea that we never want to duplicate information as that forces us to synchronize things). As it stands, it's probably superfluous.
Line 8 is the File::Find routine that we're calling. There are a variety of ways to use this module. This seemed the easiest for your purposes. The first argument is the subroutine that is called when a file or directory is found. Note that when the subroutine is called, the name of the file or directory is stored in the special $_ variable. See "perldoc File::Find" for more information on this very useful module.
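Here is a small sketch of that call in isolation. It builds a throwaway directory with File::Temp (my addition, so the example never touches real files) and records what File::Find hands to the callback. The file names are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a throwaway tree so the sketch never touches real files.
my $root_dir = tempdir( CLEANUP => 1 );
mkdir "$root_dir/sub" or die "Cannot create $root_dir/sub: $!";
for my $path ( "$root_dir/a.html", "$root_dir/sub/b.txt" ) {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

my ( @names, @paths );
find(
    sub {
        push @names, $_;                   # bare name within the current directory
        push @paths, $File::Find::name;    # path including the starting directory
    },
    $root_dir
);

print "$_\n" for sort @names;
```

Note that the callback is invoked for directories as well as files, including the starting directory itself.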
Line 12 has a regular expression telling me the file extensions that I want to match. Usually, a regular expression follows a variable and a binding operator like so:
$foo =~ /bar/;
If the variable and binding operator (=~) are not present, then the regex is matching against $_ which, as I noted above, is the name of the current file or directory. I suspect that we should also check to see if it's a directory because a directory called "my.html" is going to have funny results here :)
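One way to add that directory check, sketched under the assumption that we only ever want plain files, is the -f file test. The temporary directory and the names "my.html" and "real.html" are only there to make the example self-contained:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $root_dir = tempdir( CLEANUP => 1 );
mkdir "$root_dir/my.html" or die "Cannot create directory: $!";   # misleading name
open my $fh, '>', "$root_dir/real.html" or die "Cannot create file: $!";
close $fh;

my @files;
find(
    sub {
        return unless /\.html?$/i;   # implicit match against $_
        return unless -f $_;         # skip directories and other non-files
        push @files, $_;
    },
    $root_dir
);

print "Would process: @files\n";
```

Only real.html survives both tests; the directory named my.html matches the regex but fails -f.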
The lines through 18 are self-explanatory. Note the "or die $!" on the end of the file open. If we didn't have that, any failed attempt to open the file (not having permissions, for example) would be ignored and the program would continue to run and you wouldn't know what went wrong.
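To see what that "or die" buys you, here is a sketch that deliberately tries to open a file in a directory that should not exist (the path is made up). I have used the three-argument open with a lexical filehandle here, which is a swap from the bareword NEW in the script above, and wrapped it in eval only so that the example can print the message rather than stop:

```perl
use strict;
use warnings;

my $new = '/no/such/directory/out.html';   # hypothetical path that should not exist

my $ok = eval {
    open my $fh, '>', $new or die "Cannot open $new for writing: $!";
    close $fh;
    1;
};
my $error = $ok ? '' : $@;

print "open failed: $error" unless $ok;
```

The $! variable carries the operating system's reason for the failure, which is usually the first thing you want to know.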
Line 19: that plus sign did not exist in my program. However, Perlmonks will wrap long lines and use a red plus sign (+) to show where the lines have been wrapped. To avoid that, either get an account and log in (which will allow you to customize the length at which lines wrap), or click the "d/l code" link at the bottom of the node.
Line 21 is object-oriented programming magic. We're creating a new HTML::TokeParser object, $p, using the backup of the current file, $bak, as an argument to the object's constructor. HTML::TokeParser parses the HTML document into a stream of tokens and tokens can be handed to you one at a time for analysis.
Line 23 is getting the next token from the HTML::TokeParser object.
Line 28 is a bit confusing:
28: my $text_index = $token->[0] eq 'T' ? 1 : -1;
Note: the "->" allows us to dereference a reference. Since $token contains an array reference, $token->[0] allows us to dereference the array and get the first element (remembering that array indices start at zero).
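A quick sketch of that dereferencing, using a hand-built array reference shaped like a text token (the contents are invented):

```perl
use strict;
use warnings;

my $token = [ 'T', 'Hello, world', 0 ];   # hypothetical text token

my $type = $token->[0];     # first element, via the arrow
my $same = ${$token}[0];    # equivalent, without the arrow
my $last = $token->[-1];    # negative indices count from the end

print "type=$type same=$same last=$last\n";
```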
To understand what's going on here, we have to do two things. First, read perldoc HTML::TokeParser. From that, we get a clue as to the structure of the tokens returned:
["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
What the heck does that mean? Well, by a careful reading of the documentation, we learn that each token contains a reference to an array and the first element in each array reference tells us what type of token we have (for example, "S" means we have a start tag). The following elements contain more information about the tag. The $text element is the exact text of the returned token. Here's how a token for a meta tag might look:
[
  'S',
  'meta',
  {
    'content' => 'Web data ',
    'name' => 'doc'
  },
  [
    'name',
    'content'
  ],
  '<META NAME="doc" CONTENT="Web data ">'
];
Note: see perlreftut for more information on references.
The first element (token type) identifies this as a start tag token. The second element identifies the tag type as meta. The third element is a hashref containing all of the attributes and their values and the fourth is an array ref containing the sequence of said attributes. The last element, the one we're interested in, is the exact text of the tag.
Remember that to get the last element of an array, we can always use -1 as the index. In the case of the text token ("T"), we see that the raw text of the token is stored at position 1. All other tokens have the raw text as the last element. So, since we want the raw text, if we have a text token we set the $text_index to 1, otherwise we set it to -1.
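Here is that index trick applied to a few hand-built tokens. These are plain array references that I typed in to mimic the shapes listed in the documentation, not output from the parser itself:

```perl
use strict;
use warnings;

# Hand-built stand-ins for what get_token would return.
my @tokens = (
    [ 'S', 'p', {}, [], '<p>' ],    # start tag: raw text is last
    [ 'T', 'Hello', '' ],           # text: raw text is at index 1
    [ 'E', 'p', '</p>' ],           # end tag: raw text is last
);

my $out = '';
for my $token (@tokens) {
    my $text_index = $token->[0] eq 'T' ? 1 : -1;
    $out .= $token->[$text_index];
}

print "$out\n";   # prints <p>Hello</p>
```

Stringing the raw text of every token back together reproduces the original document, which is why the script can rewrite the file this way.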
Confused? I certainly was when I started using the module. I've written an alternate interface that allows us to avoid memorizing those indices, but it would have been too confusing (and overkill) to have included it in your program.
Lines 32 through 34 should be self-explanatory, now:
32:             if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
33:                 $token->[ $text_index ] =~ s/AA\.//g;
34:             }
In other words, if this is a "start" token ($token->[0] eq 'S') and it's a meta tag ($token->[1] eq 'meta'), then we need to strip the "AA." from the actual text ($token->[ $text_index ]) of the tag.
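The same test and substitution, tried on a hand-built meta token (the attribute values are invented, with an "AA." planted so there is something to strip):

```perl
use strict;
use warnings;

# Hypothetical start token for a meta tag, shaped like the example above.
my $token = [
    'S',
    'meta',
    { 'name' => 'doc', 'content' => 'AA.Web data' },
    [ 'name', 'content' ],
    '<META NAME="doc" CONTENT="AA.Web data">',
];

my $text_index = $token->[0] eq 'T' ? 1 : -1;

if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
    $token->[ $text_index ] =~ s/AA\.//g;   # edits the raw text in place
}

print $token->[ $text_index ], "\n";
```

Note that only the raw text element is changed. The attribute hash in the third element still holds the old value, which does not matter here because the raw text is the only thing that gets printed.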
Line 36 merely prints the text to the new file.
Cheers,
Ovid