(Ovid) Re: Script (ovid's) explanation

If you are new to Perl, you will probably need to read through this a few times to get the full meaming. Sorry about that.

The easiest way to explain this is to break it down with line numbers and do a quick walk-through. The first thing that you will notice is that I have added the indentation back in. This is important. Proper indentation allows an experienced programmer the opportunity to merely glance at an expression and know it's scope. Without indentation, it can be difficult to determine whether or not a given statement is in the while, for, or if block that you're expecting it to be. Poor indentation can introduce bugs that are difficult to spot.

Now, on to the code:

01:  use strict;
02:  use File::Find;
03:  use HTML::TokeParser;
04:  
05:  my $bak_ext  = '.bak';
06:  my $root_dir = '/temp';
07:  
08:  find(\&wanted, $root_dir);
09:  
10:  sub wanted {
11:      # if the extension fits...
12:      if ( /\.html?$/i ) {
13:          print "Processing $_\n";
14:          my $new = $_;
15:          my $bak = $_ . $bak_ext;
16:          rename $_, $bak or die "Cannot rename $_ to $bak: $!";
17:  
18:          open NEW, "> $new" or die "Cannot open $new for writing: 
+$!";
19:              +  #WHAT IS THE + DOING?
20:          #I DONT UNDERSTAND THIS TOKEN PART
21:          my $p = HTML::TokeParser->new( $bak ); #IS new( $bak ) A 
+FUNCTION
22:          # AND IF SO WHAT IS IT DOING?
23:          while ( my $token = $p->get_token )    {
24:  
25:              # this index is the 'raw text' of the token
26:              #I AM LOST ON THIS PART ALTHOUGH I UNDERSTAND IT IS
27:              #AN IF ELSE STATEMENT WHAT IS THE 'T' AND 1 AND -1 DO
+ING??
28:              my $text_index = $token->[0] eq 'T' ? 1 : -1;
29:  
30:              # it's both a start tag and a meta tag
31:              #PLEASE EXPLAIN THIS PART
32:              if ( $token->[0] eq 'S' and $token->[1] eq 'meta' )  
+  {
33:                  $token->[ $text_index ] =~ s/AA\.//g;
34:              }
35:              #I DONT UNDERSTAND THIS PART.
36:              print NEW $token->[ $text_index ];
37:          }
38:          close NEW;
39:      } else {
40:          print "Skipping $_\n";
41:      }
42:  }
[download]

Line 1 tells Perl that we want to use some good programming practices such as predeclaring variables, not using things called 'soft references' and not using 'barewords' for subroutines unless the subroutines are predeclared. See "perldoc strict" for more information.

Lines 2 and 3 pull in the two modules that I want to use. File::Find is the standard module for recursively traversing directories. Most alternatives to this module are broken. HTML::TokeParser is a module that allows us to properly parse HTML. There are several good alternatives here, but I happen to be familiar with this one. Newer programmers often use regular expressions to parse HTML (I've been guilty of that), but there solutions are usually extremely flawed. Here is some sample HTML that this module will handle, but most regular expressions will have problems with:

<body bgcolor=#000000 text="white">

<
    input
    type="hidden"
    name="weird indenting is legal"
    value=???
>

<input type=text name="foobar" value='>>> watch out for angle brackets
+'>
[download]

Line 5 was a boo-boo on my part. This was just a quick hack. I either should have passed that into my subroutine or defined it as a constant at the top of the program. Subroutines should rarely, if ever, rely on variables declared outside of themselves. This makes it hard to find out how changes to those variables might affect the subroutines. In large programs, many of these will cause you a problem.

Line 6 is the root directory that you want to search. If this was a larger program, having this in one variable would make it easy to reference in more than one place, if necessary. For example, if we wanted to print out a report, this would be handy (following the idea that we never want to duplicate information as that forces us to synchronize things). As it stands, it's probably superfluous.

Line 8 is the File::Find routine that we're calling. There are a variety of ways to use this module. This seemed the easiest for your purposes. The first argument is the subroutine that is called when a file or directory is found. Note that when the subroutine is called, the name of the file or directory is stored in the special $_ variable. See "perldoc File::Find" for more information on this very useful module.

Line 12 has a regular expression telling me the file extensions that I want to match. Usually, a regular expression follows a variable and a binding operator like so:

$foo =~ /bar/;

If the variable and binding operator (=~) are not present, then the regex is matching against $_ which, as I noted above, is the name of the current file or directory. I suspect that we should also check to see if it's a directory because a directory called "my.html" is going to have funny results here :)

The lines through 18 are self-explanatory. Note the "or die $!" on the end of the file open. If we didn't have that, any failed attempt to open the file (not having permissions, for example) would be ignored at the program would continue to run and you wouldn't know what went wrong.

Line 19: that plus sign did not exist in my program. However, Perlmonks will wrap long lines and use a red plus sign (+) to show where the lines have been wrapped. To avoid that, either get an account and log in (which will allow you to customize the length at which lines wrap), or click the "d/l code" link at the bottom of the node.

Line 21 is object-oriented programming magic. We're creating a new HTML::TokeParser object, $p, using the backup of the current file, $bak, as an argument to the object's constructor. HTML::TokeParser parses the HTML document into a stream of tokens and tokens can be handed to you one at a time for analysis.

Line 23 is getting the next token from the HTML::TokeParser object.

Line 28 is a bit confusing:

28:              my $text_index = $token->[0] eq 'T' ? 1 : -1;
[download]

Note: the "->" allows us to dereference a reference. Since $token contains an array reference, $token->[0] allows us to dereference the array and get the first element (remembering that array indices start at zero).

To understand what's going on here, we have to do two things. First, read perldoc HTML::TokeParser. From that, we get a clue as to the structure of the tokens returned:

  ["S",  $tag, $attr, $attrseq, $text]
  ["E",  $tag, $text]
  ["T",  $text, $is_data]
  ["C",  $text]
  ["D",  $text]
  ["PI", $token0, $text]
[download]

What the heck does that mean? Well, by a careful reading of the documentation, we learn that each token contains a reference to an array and the first element in each array reference tells us what type of token we have (for example, "S" means we have a start tag). The following elements contain more information about the tag. The $text element is the exact text of the returned token. Here's how a token for a meta tag might look:

[
  'S',
  'meta',
  {
    'content' => 'Web data ',
    'name' => 'doc'
  },
  [
    'name',
    'content'
  ],
  '<META NAME="doc" CONTENT="Web data ">'
];
[download]

Note: see perlreftut for more information on references.

The first element (token type) identifies this as a start tag token. The second element identifes the tag type as meta. The third element is a hashref containing all of the attributes and their values and the fourth is an array ref containing the sequence of said attributes. The last element, the one we're interested in, is the exact text of the tag.

Remember that to get the last element of an array, we can always use -1 as the index. In the case of the text token ("T"), we see the the raw text of the token is stored at position 1. All other tokens have the raw text as the last element. So, since we want the raw text, if we have a text token we set the $text_index to 1, otherwise we set it to -1.

Confused? I certainly was when I started using the module. I've written an alternate interface that allows us to avoid memorizing those indices, but it would have been too confusing (and overkill) to have included it in your program.

Lines 32 through 34 should be self-explanatory, now:

32:              if ( $token->[0] eq 'S' and $token->[1] eq 'meta' )  
+  {
33:                  $token->[ $text_index ] =~ s/AA\.//g;
34:              }
[download]

In other words, if this this a "start" token ($token->[0] eq 'S' and it's a meta tag ($token->[1] eq 'meta'), then we need to strip the "AA." from the actual text ($token->[ $text_index ]) of the tag.

Line 36 merely prints the text to the new file.

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Comment on (Ovid) Re: Script (ovid's) explanation Select or Download Code

Replies are listed 'Best First'.
Re: (Ovid) Re: Script (ovid's) explanation by theguvnor (Chaplain) on Feb 27, 2002 at 00:25 UTC
Ovid: is your alternative interface to HTML::TokeParser in CPAN? ..Guv	[reply]
(Ovid) Re(3 or 4): Script (ovid's) explanation by Ovid (Cardinal) on Feb 27, 2002 at 00:33 UTC
No. It has some limitations that doesn't make it a general replacement (I can't recall what those limitations are, though). However, for most small projects, it should be fine. I've rummaged around and found a version (not sure if this is the one I was intending. If you want to play around, here it is. Full POD with code samples is included. ################## package HTML::TokeParser::Easy; ################## use strict; use Carp; use HTML::TokeParser; use vars qw/ @ISA $VERSION $AUTOLOAD /; $VERSION = '1.0'; @ISA = qw/ HTML::TokeParser /; use constant START_TAG => 'S'; use constant END_TAG => 'E'; use constant TEXT => 'T'; use constant COMMENT => 'C'; use constant DECLARATION => 'D'; use constant PROCESS_INSTRUCTION => 'PI'; my %token_spec = ( S => { _name => 'START_TAG', tag => 1, attr => 2, attrseq => 3, text => 4 }, E => { _name => 'END_TAG', tag => 1, text => 2 }, T => { _name => 'TEXT', text => 1 }, C => { _name => 'COMMENT', text => 1 }, D => { _name => 'DECLARATION', text => 1 }, PI => { _name => 'PROCESS_INSTRUCTION', token0 => 1, text => 2 } ); sub AUTOLOAD { no strict 'refs'; my ($self, $token) = @_; # was it an is_... method? if ( $AUTOLOAD =~ /.::is_(\w+)/ ) { my $token_type = uc $1; my $tag = &$token_type; { $AUTOLOAD } = sub { return $_[ 1 ]->[ 0 ] eq $tag ? 1 : 0 } +; return &$AUTOLOAD; } elsif ( $AUTOLOAD =~ /.::return_(\w+)/ ) { # was it a return_... method? my $token_attr = $1; { $AUTOLOAD } = sub { my $attr = $_[ 1 ]->[ 0 ]; if ( exists $token_spec{ $attr } and exists $token_spe +c{ $attr }{ $token_attr } ) { return $_[ 1 ]->[ $token_spec{ $attr }{ $token_att +r } ]; } else { if ( ! exists $token_spec{ $attr } ) { carp "No such token: '$attr'"; } else { carp "No such attribute: '$token_attr' for $to +ken_spec{ $attr }{ _name }"; } } }; return &$AUTOLOAD; } else { # Yo! You can't do that! croak "No such method: $AUTOLOAD"; } } sub DESTROY {}; __END__ =head1 NAME HTML::TokeParser::Easy - simple method access to TokeParser tokens (no + more memorizing array indices). =head1 SYNOPSIS use HTML::TokeParser::Easy; my $p = HTML::TokeParser::Easy->new( $somefile ); while ( my $token = $parser->get_token ) { # This prints all text in an HTML doc (i.e., it strips the HTML) next if ! $parser->is_text( $token ); print $parser->return_text( $token ); } =head1 DESCRIPTION C<HTML::TokeParser> is a fairly common method of parsing HTML. Howeve +r, the tokens returned are not exactly intuitive to parse: ["S", $tag, $attr, $attrseq, $text] ["E", $tag, $text] ["T", $text, $is_data] ["C", $text] ["D", $text] ["PI", $token0, $text] To simplify this, C<HTML::TokeParser::Easy> allows the user ask more i +ntuitive (read: more self-documenting) questions about the tokens returned. Specifically, +there are 6 C<is_foo> type methods and 6 C<return_bar> type methods. The C<is_> methods all +ow you to determine the token type and the C<return_> methods get the data that you need. Since this is a subclass of C<HTML::TokeParser>, all C<HTML::TokeParse +r> methods are available. To truly appreciate the power of this module, please read the document +ation for C<HTML::TokeParser> and C<HTML::Parser>. The following will be brief descriptions of the available methods foll +owed by examples. =head1 C<is_> Methods =head2 Note: Due to the way that C<AUTOLOAD> has been coded, the portion of the C<i +s_> methods after the C<is_> part is case-insensitive. For example, the following lines of +code are identical: $parser->is_start_tag( $token ); $parser->is_START_TAG( $token ); $parser->is_stArt_tAg( $token ); Yes, I could have done something about that, but why bother? =over 4 =item 1 C<is_start_tag> Use this to determine if you have a start tag. =item 2 C<is_end_tag.> Use this to determine if you have an end tag. =item 3 C<is_text> Use this to determine if you have text. Note that this is I<not> to b +e confused with the C<return_text> method described below! C<is_text> will identify text +that the user typically sees display in the Web browser. =item 4 C<is_comment> Are you still reading this? Nobody reads POD. Don't you know you're +supposed to go to CLPM, ask a question that's answered in the POD and get flamed? It's a rite + of passage. Really. C<is_comment> is used to identify comments. See the HTML::Parser docu +mentation for more information about comments. There's more than you might think. =item 5 C<is_declaration> This will match the DTD at the top of your HTML. (You I<do> use DTD's, + don't you?) =item 6 C<is_process_instruction> Process Instructions are from XML. This is very handy if you need to +parse out PHP and similar things with a parser. =back =head1 The 6 C<return_> methods =head2 Note: As noted for the 6 C<is_> methods, these methods are case-insensitive +after the C<return_> part. =over 4 =item 1 C<return_tag> Do you have a start tag or end tag? This will return the type (lower +case). =item 2 C<return_attr> If you have a start tag, this will return a hash ref with the attribut +e names as keys and the values as the values. =item 3 C<return_attrseq> For a start tag, this is an array reference with the sequence of the a +ttributes, if any. =item 4 C<return_text> This is the exact text of whatever the token is representing. =item 5 C<return_is_data> This text is in a CDATA section. =item 6 C<return_token0> For processing instructions, this will return the token found immediat +ely after the opening tag. For \<?php, "php" will be returned. =back =head1 Examples =head2 Finding comments For some strange reason, your Pointy-Haired Boss (PHB) is convinced th +at the graphics department is making fun of him by embedding rude things about him in HTML commen +ts. You need to get all HTML comments from the HTML. use strict; use HTML::TokeParser::Easy; my @html_docs = glob( ".html" ); open PHB, "> phbreport.txt" or die "Cannot open phbreport for writing +: $!"; foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { next if ! $p->is_comment( $token ); print PHB $p->return_text( $token ), "\n"; } } close PHB; =head2 Stripping Comments Uh oh. Turns out that your PHB was right for a change. Many of the c +omments in the HTML weren't very polite. Since your entire graphics department was just fired, it + falls on you need to strip those comments from the HTML. use strict; use HTML::TokeParser::Easy; my $new_folder = 'no_comment/'; my @html_docs = glob( ".html" ); foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $new_file = "$new_folder$doc"; open PHB, "> $new_file" or die "Cannot open $new_file for writing +: $!"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { next if $p->is_comment( $token ); print PHB $p->return_text( $token ); } close PHB; } =head2 Changing form tags Your company was foo.com and now is bar.com. Unfortunately, whoever w +rote your HTML decided to hardcode "http://www.foo.com/" into the C<action> attribute of the for +m tags. You need to change it to "http://www.bar.com/". use strict; use HTML::TokeParser::Easy; my $new_folder = 'new_html/'; my @html_docs = glob( "*.html" ); foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $new_file = "$new_folder$doc"; open FILE, "> $new_file" or die "Cannot open $new_file for writin +g: $!"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { if ( $p->is_start_tag( $token ) and $p->return_tag( $token ) +eq 'form' ) { my $form_tag = new_form_tag( $p->return_attr( $token ), $ +p->return_attrseq( $token ) ); print FILE $form_tag; } else { print FILE $p->return_text( $token ); } } close FILE; } sub new_form_tag { my ( $attr, $attrseq ) = @_; if ( exists $attr->{ action } ) { $attr->{ action } =~ s/www\.foo\.com/www.bar.com/; } my $tag = ''; foreach ( @$attrseq ) { $tag .= "$_=\"$attr->{ $_ }\" "; } $tag = "<form $tag>"; return $tag; } =head1 COPYRIGHT Copyright (c) 2001 Curtis "Ovid" Poe. All rights reserved. This program is free software; you may redistribute it and/or modify i +t under the same terms as Perl itself =head1 AUTHOR Curtis "Ovid" Poe L<poec@yahoo.com> =head1 BUGS 2001/10/04 There are no known bugs at this time. Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in + incorrect behavior as older versions do not always handle XHTML correctly. It is the programmer's + responsibility to verify that the behavior of this code matches the programmer's needs. Address bug reports and comments to: L<poec@yahoo.com>. When sending +bug reports, please provide the version of HTML::Parser, HTML::TokeParser, HTML::To +keParser::Easy, the version of Perl, and the version of the operating system you are u +sing. =head1 BUGS 2001/10/04 There are no known bugs at this time. Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in + incorrect behavior as older versions do not always handle XHTML correctly. It is the programmer's + responsibility to verify that the behavior of this code matches the programmer's needs. =cut [download] Cheers, Ovid Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.	[reply] [d/l]
Re: (Ovid) Re: Script (ovid's) explanation by Anonymous Monk on Feb 27, 2002 at 16:11 UTC
Thanks for the time and explanation on this script!	[reply]