Understanding Regular Expressions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am new to Perl and am trying to understand how regular expressions work and how to use them. I am also trying to convert an HTML page into an RSS feed. I came across a script that does this conversion for a set of known sites, and it can be modified to support more sites: all one needs to do is add the formatting details to capture.

A portion of the code from the program is shown below.

Parameters Passed
$url = http://www.sfgate.com/examiner/bondage/
$html = Is the content of the URL specified

Can someone help me document each section of the following code? And if I had to do something like this for the URL http://bsewebx.bseindia.com/qresann/announce.asp, what would the code look like, and what would need to change?

sub do_bondagefiles {
  my ($url, $html) = @_;

  $_ = $html;

  1 while (s@<!--.*?-->@ @gsi);  # lose comments

  s/[\r\n]+/ /gs;

  s@^.*?(<A HREF=\"[^\"]*article\.cgi)\b@$1@is ||
    error ("unable to trim head in $url");
  s@<[^<>]*\bblacktri\.gif\b.*$@@is ||
    error ("unable to trim tail in $url");

  s@(<A\b[^<>]*\bHREF\b)@\n\001\001\001\n$1@gi;

  my @sec1 = split (/\n\001\001\001\n/s);
  my @sec2 = ();
  foreach (@sec1) {
    next if (m/^\s*$/s);

    s@^\s*<A\b[^<>]*?\bHREF=\"([^<>\"]+)\"[^<>]*>\s*(.*?)\s*</A>\s*@@is ||
      error ("unparsable entry (url) in $url");
    my $eurl  = $1;
    my $title = $2;
    my $date  = '';
    my $body  = $_;

    $body =~ s@<[^<>]*>@@g;  # lose tags in body

    push @sec2, ($eurl, $date, $title, $body);
  }

  return @sec2;
}

Thanks for helping this newbie. Cheers.


Re: Understanding Regular Expressions by sgifford (Prior) on Dec 27, 2003 at 07:34 UTC
First, I should say that the best way to do this is with one of the many already-written HTML parsers, like HTML::TokeParser::Simple or HTML::Tree. The above will work for some but not all HTML. It's fine to use if you have control over the HTML coming from the server and you know it won't do anything weird, but in any other circumstances it's much better to use an already-written, already-tested, and already-debugged (well, mostly) module, instead of trying it yourself and making the same mistakes that the modules' authors did on their first try.

That said, learning more about regular expressions is a laudable goal, so here are some quick explanations of what the REs in this code do.

sub do_bondagefiles {
  my ($url, $html) = @_;

  $_ = $html;

  1 while (s@<!--.*?-->@ @gsi);  # lose comments
  # Search-and-replace, with @ separating the search part,
  # the replace part, and the search options.
  # <!-- matches itself as a literal string.
  # Same for -->.  These are the HTML comment characters.
  # .*? matches anything in between the HTML
  # comment characters.  The dot means "any character",
  # the * means "0 or more of", and the ? means "the
  # shortest match", instead of the default of the longest.
  # After the @ is the next argument, the replace string.
  # It's a single space.  So <!-- anything --> will be
  # replaced by a single space.
  # After the next @ is the final argument to the RE,
  # the options.  Options here are g, s, and i.  g means
  # "global"; if you find the same match multiple times,
  # replace all of them.  s means the dot also matches
  # newlines, instead of stopping at them.  i
  # means case insensitive search, which doesn't matter,
  # since all of the characters in the search are symbols,
  # which don't have a case.

  s/[\r\n]+/ /gs;
  # Replace each run of carriage returns and newlines
  # with a single space.

  s@^.*?(<A HREF=\"[^\"]*article\.cgi)\b@$1@is ||
    error ("unable to trim head in $url");
  # Search for the beginning of the string (^), followed by
  # any number of characters (shortest match) (.*?),
  # Save this part of the match (the parenthesized part):
  #   * the literal string <A HREF=", followed by
  #   * zero or more non-quote characters, followed by
  #   * the literal string article.cgi
  # followed by a word-boundary.
  # Replace all of this with just the saved part.
  # Search treats newlines as normal characters, and
  # is case-insensitive.

  s@<[^<>]*\bblacktri\.gif\b.*$@@is ||
    error ("unable to trim tail in $url");
  # Search for the literal character <, followed
  # by a string of 0 or more characters which are
  # neither < nor >, followed by a word boundary,
  # followed by the literal string blacktri.gif,
  # followed by another word boundary, followed by
  # zero or more of any character, followed by the
  # end of the string.  Replace with an empty string.
  # Search treats newlines as normal characters, and is case
  # insensitive.

  s@(<A\b[^<>]*\bHREF\b)@\n\001\001\001\n$1@gi;
  # Save this:
  #   The literal string <A followed by
  #   * a word boundary, followed by
  #   * zero or more characters which are neither < nor >,
  #     followed by
  #   * the literal string HREF, followed by
  #   * another word boundary.
  # Replace this with a newline character, three
  # characters with character code 1, another newline,
  # and the captured string.
  # Search is global (every match is replaced) and case
  # insensitive.

  my @sec1 = split (/\n\001\001\001\n/s);
  my @sec2 = ();
  foreach (@sec1) {
    next if (m/^\s*$/s);

    s@^\s*<A\b[^<>]*?\bHREF=\"([^<>\"]+)\"[^<>]*>\s*(.*?)\s*</A>\s*@@is ||
      error ("unparsable entry (url) in $url");
    # Search for the beginning of the string, followed
    # by optional whitespace and <A, followed by a
    # word-boundary character, then
    # zero or more characters which are neither < nor >
    # (taking the shortest match), then another word
    # boundary character, then the string HREF="
    # Save into register 1:
    #   * one or more characters which are none of
    #     <, >, or ".
    # Then look for a quote, followed by zero or more
    # characters which are neither < nor >, followed by
    # a > character, followed by zero or more whitespace
    # characters.
    # Save into register 2:
    #   * Zero or more characters (shortest match)
    # followed by zero or more whitespace characters,
    # followed by </A>, followed by zero or more
    # whitespace characters.
    # Replace with the empty string.
    # Search is case-insensitive, and newlines are treated
    # as regular characters.

    my $eurl  = $1;  # $1 is register 1 from the above RE.
    my $title = $2;  # $2 is register 2 from the above RE.
    my $date  = '';
    my $body  = $_;

    $body =~ s@<[^<>]*>@@g;  # lose tags in body

    push @sec2, ($eurl, $date, $title, $body);
  }

  return @sec2;
}

Update: Cody Pendant's clarifications about `\b` are correct. It represents the position between two characters, and the phrase "word boundary character" is somewhat misleading.
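As a small illustration of the ? and s options discussed for the comment-stripping substitution, here is a minimal sketch. The input string and the variable names ($lazy, $greedy, $no_s) are made up for the example and are not part of the original script.

use strict;
use warnings;

# Hypothetical sample input, not taken from any real page.
my $html = "<p>one</p><!-- first\ncomment --><p>two</p><!-- second --><p>three</p>";

# Shortest match (.*?) with /s: each comment is removed on its own.
(my $lazy = $html) =~ s@<!--.*?-->@ @gs;

# Longest match (.*) with /s: runs from the first <!-- to the LAST -->,
# so <p>two</p> is swallowed as well.
(my $greedy = $html) =~ s@<!--.*-->@ @gs;

# Shortest match without /s: the dot cannot cross the newline,
# so the first (two-line) comment is left in place.
(my $no_s = $html) =~ s@<!--.*?-->@ @g;

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
print "no /s:  $no_s\n";

The script normalises newlines only after stripping comments, which is presumably why the comment regex carries the s option: at that point a comment may still span several lines.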
Re: Re: Understanding Regular Expressions by Cody Pendant (Prior) on Dec 28, 2003 at 01:46 UTC
You've done very detailed work there, sgifford, but I'm a bit nervous about you using the phrase "word boundary character". Thinking of \b and similar things as characters has got me into a lot of trouble in the past.

I don't know the perfect phrase to describe it, however, and "zero-width assertion" has never really appealed to me, so I'd rather just call it a "word boundary" and explain that as "place between a word-character and a non-word character".

`($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print`
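To make the "place between a word-character and a non-word character" description concrete, here is a minimal sketch. The sample strings and the $marked variable are invented for illustration. It marks every position where \b matches and shows that no characters are consumed.

use strict;
use warnings;

my $text = "cat, dog";   # hypothetical sample string

# Put a visible marker at every word boundary. Nothing is removed,
# because \b matches a position rather than a character.
(my $marked = $text) =~ s/\b/|/g;
print "$marked\n";

# \bcat\b matches "cat" as a whole word, but not the "cat" buried
# inside "concatenate", where it has word characters on both sides.
for my $s ("cat, dog", "concatenate") {
    printf "%-12s %s\n", $s, ($s =~ /\bcat\b/ ? "matches" : "no match");
}

The marked string keeps every original character, including the comma and the space, which is the practical meaning of "zero-width".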
Re: Understanding Regular Expressions by xenchu (Friar) on Dec 26, 2003 at 19:05 UTC
I can't be of too much help, but I will point one thing out. ASP is not HTML, and you will need to research the regular expressions needed. The regexps used for HTML will not work for ASP.

If you have a specific question about a regexp, I am sure the Perl Monks will be happy to work with you. But I don't think anyone will want to write your documentation for you.

Also, I suggest you look at perlretut and perlrequick to set up your documentation. If you get your documentation from someone else, you won't understand how to do it and you won't further your understanding of regular expressions either.

xenchu

`The Needs of the World and my Talents run parallel to infinity.`
Re: Re: Understanding Regular Expressions by Anonymous Monk on Dec 27, 2003 at 01:34 UTC
Yes, you are quite right that ASP is not HTML, but here I am referring to the actual output of the link "http://bsewebx.bseindia.com/qresann/announce.asp", which is an HTML page. Thanks.
Re: Re: Understanding Regular Expressions by BUU (Prior) on Dec 26, 2003 at 20:09 UTC
"ASP is not HTML and"

True, but the page he links to is HTML.