in reply to Understanding Regular Expressions

First, I should say that the best way to do this is with one of the many already-written HTML parsers, like HTML::TokeParser::Simple or HTML::Tree. The above will work for some but not all HTML. It's fine to use if you control the HTML coming from the server and you know it won't do anything weird, but in any other circumstances it's much better to use an already-written, already-tested, and already-debugged (well, mostly) module instead of trying it yourself and making the same mistakes that the modules' authors made on their first try.

That said, learning more about regular expressions is a laudable goal, so here are some quick explanations of what the REs in this code do.

sub do_bondagefiles {
  my ($url, $html) = @_;

  $_ = $html;

  1 while (s@<!--.*?-->@ @gsi);  # lose comments
  # Search-and-replace, with @ separating the search part,
  # the replace part, and the search options.
  # <!-- matches itself as a literal string.
  # Same for -->.  These are the HTML comment characters.
  # .*? matches anything in between the HTML
  # comment characters.  The dot means "any character",
  # the * means "0 or more of", and the ? means "the
  # shortest match", instead of the default of the longest.
  # After the @ is the next argument, the replace string.
  # It's a single space.  So <!-- anything --> will be
  # replaced by a single space.
  # After the next @ is the final argument to the RE,
  # the options.  Options here are g, s, and i.  g means
  # "global"; if you find the same match multiple times,
  # replace all of them.  s means treat newlines as regular
  # characters (so the dot matches them too), instead of
  # treating them specially.  i means case insensitive
  # search, which doesn't matter, since all of the
  # characters in the search are symbols, which don't
  # have a case.

  s/[\r\n]+/ /gs;

  s@^.*?(<A HREF=\"[^\"]*article\.cgi)\b@$1@is ||
    error ("unable to trim head in $url");
  # Search for the beginning of the string (^), followed by
  # any number of characters (shortest match) (.*?),
  # Save this part of the match (the parenthesized part):
  #  * the literal string <A HREF=", followed by
  #  * zero or more non-quote characters, followed by
  #  * the literal string article.cgi
  # followed by a word-boundary.
  # Replace all of this with just the saved part.
  # Search treats newlines as normal characters, and
  # is case-insensitive.

  s@<[^<>]*\bblacktri\.gif\b.*$@@is ||
    error ("unable to trim tail in $url");
  # Search for the literal character <, followed
  # by a string of 0 or more characters which are
  # neither < nor >, followed by a word boundary,
  # followed by the literal string blacktri.gif,
  # followed by another word boundary, followed by
  # zero or more of any character, followed by the
  # end of the string.
  # Replace with an empty string.
  # Search treats newlines as normal characters, and is case
  # insensitive.

  s@(<A\b[^<>]*\bHREF\b)@\n\001\001\001\n$1@gi;
  # Save this:
  #  * The literal string <A followed by
  #  * a word boundary, followed by
  #  * zero or more characters which are neither < nor >,
  #    followed by
  #  * the literal string HREF, followed by
  #  * another word boundary.
  # Replace this with a newline character, three
  # characters with character code 1, another newline,
  # and the captured string.
  # Search is global (every match is replaced), and is case
  # insensitive.

  my @sec1 = split (/\n\001\001\001\n/s);
  my @sec2 = ();
  foreach (@sec1) {
    next if (m/^\s*$/s);

    s@^\s*<A\b[^<>]*?\bHREF=\"([^<>\"]+)\"[^<>]*>\s*(.*?)\s*</A>\s*@@is ||
      error ("unparsable entry (url) in $url");
    # Search for the beginning of the string, followed
    # by optional whitespace, then <A, followed by a
    # word-boundary character, then zero or more characters
    # which are neither < nor > (taking the shortest match),
    # then another word boundary character, then the
    # string HREF="
    # Save into register 1:
    #  * one or more characters which are none of
    #    <, >, or ".
    # Then look for a quote, followed by zero or more
    # characters which are neither < nor >, followed by
    # a > character, followed by zero or more whitespace
    # characters.
    # Save into register 2:
    #  * Zero or more characters (shortest match)
    # followed by zero or more whitespace characters,
    # followed by </A>, followed by zero or more whitespace
    # characters.
    # Replace with the empty string.
    # Search is case-insensitive, and newlines are treated
    # as regular characters.

    my $eurl = $1;   # $1 is register 1 from the above RE.
    my $title = $2;  # $2 is register 2 from the above RE.
    my $date = '';
    my $body = $_;

    $body =~ s@<[^<>]*>@@g;  # lose tags in body

    push @sec2, ($eurl, $date, $title, $body);
  }
  return @sec2;
}
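
To see why the comment-stripping RE uses .*? together with the s option, here is a small sketch you can run on its own. The sample string is made up for illustration and is not from the page the original code scrapes.

```perl
use strict;
use warnings;

# Hypothetical sample input: two HTML comments, the second spanning a newline.
my $html = "a<!-- one -->b<!-- two\nlines -->c";

# Shortest match plus /s: each comment is matched and replaced separately,
# even when it contains a newline.
(my $lazy = $html) =~ s@<!--.*?-->@ @gs;

# Longest match plus /s: .* runs from the first <!-- to the last -->,
# so the "b" between the two comments is swallowed as well.
(my $greedy = $html) =~ s@<!--.*-->@ @gs;

# Shortest match without /s: the dot cannot cross the newline,
# so the second comment is left in place.
(my $no_s = $html) =~ s@<!--.*?-->@ @g;

print "lazy:   [$lazy]\n";
print "greedy: [$greedy]\n";
print "no /s:  [$no_s]\n";
```

The first substitution leaves "a b c", the greedy one leaves "a c", and the one without s only removes the first comment.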

Update: Cody Pendant's clarifications about \b are correct. It represents the space between two characters, and the phrase "word boundary character" is somewhat misleading.
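
A quick way to convince yourself that \b is a position rather than a character is to substitute a visible marker at every boundary and check that nothing from the original string goes missing. This is only a sketch; the sample strings are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
my $text = "see article.cgi?id=7";

# Put a | at every word boundary.  Nothing is removed from the string,
# because \b matches a position between characters, not a character.
(my $marked = $text) =~ s/\b/|/g;
my $inserted = length($marked) - length($text);

print "$marked\n";
print "$inserted markers inserted, 0 characters consumed\n";

# The same idea explains article\.cgi\b in the code above: it matches when
# a non-word character (or the end of the string) follows "cgi", and fails
# when another word character follows.
my $before_query  = ("article.cgi?id=7" =~ /article\.cgi\b/) ? 1 : 0;
my $before_letter = ("article.cgix"     =~ /article\.cgi\b/) ? 1 : 0;

print "before '?': $before_query, before 'x': $before_letter\n";
```

The marked string comes out as |see| |article|.|cgi|?|id|=|7| with every original character still present.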

Re: Re: Understanding Regular Expressions
by Cody Pendant (Prior) on Dec 28, 2003 at 01:46 UTC
    You've done very detailed work there, sgifford, but I'm a bit nervous about you using the phrase "word boundary character". Thinking of \b and similar things as characters has got me into a lot of trouble in the past.

    I don't know the perfect phrase to describe it, however, and "zero-width assertion" has never really appealed to me, so I'd rather just call it a "word boundary" and explain that as "the place between a word character and a non-word character".



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print