sfrattura has asked for the wisdom of the Perl Monks concerning the following question:

$catreg = "^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)";
I'm a newbie... so if you respond, speak in English, not PERL, please. Be gentle!

The script above is used to parse through an HTML document. In practice, what it is doing is getting the text between <H1> and </H1>. For instance, if it found <H1>ABCD</H1>, it would isolate and return ABCD.

This works most of the time, but for some combinations of letters, it does not work. It's very, very weird.

So, my question is... can someone explain what this line of code does? Again, it is below.

$catreg = "^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)";
Sandro Frattura

PS... below is the later section of the code where it is REALLY used. The line I show above, I think, is sort of a variable declaration.

if ( m!$catreg! ) {
    print "found(pre) $cat\n";
    print STDERR "checking for cat :$3:\n";
    $cat = $catlist{$3};
    print "cat == $cat\n";
    next;
}

Replies are listed 'Best First'.
Re: (**corrected**) What Does This Line Do?
by cLive ;-) (Prior) on Jun 07, 2001 at 00:16 UTC
    <H1>(<[A-Za-z0-9]+>)
    matches the ABCD in <H1>ABCD</H1>?

    I think not. Remove the < and >, and add an H1 closing tag to clarify, ie:

    <H1>([A-Za-z0-9]+)</H1>

    Or if you don't mind it picking up underscores as well as letters and numbers, you can use the much more succinct:

    <H1>(\w+)</H1>
    \w means match a word character. I suggest you do a little search on parsing html in the Super Search, or look at the HTML::Parser module, discussed here
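
    As a rough sketch of how the two patterns above differ (the sample strings and the helper name heading are made up for illustration):

```perl
use strict;
use warnings;

# Returns the heading text captured by $pattern, or undef if there is no match.
sub heading {
    my ($pattern, $line) = @_;
    my ($text) = $line =~ $pattern;   # list context: first capture, or nothing
    return $text;
}

my $class = qr!<H1>([A-Za-z0-9]+)</H1>!;   # letters and digits only
my $word  = qr!<H1>(\w+)</H1>!;            # \w also accepts the underscore

# Hypothetical sample lines, not taken from the original script.
for my $line ('<H1>ABCD</H1>', '<H1>AB_CD</H1>', '<H1>AB CD</H1>') {
    my $by_class = heading($class, $line);
    my $by_word  = heading($word,  $line);
    printf "%-16s class: %-8s \\w: %s\n",
        $line,
        defined $by_class ? $by_class : '(none)',
        defined $by_word  ? $by_word  : '(none)';
}
```

    Neither pattern accepts the heading that contains a space, which is one reason a parsing module is the safer route.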

    You may want to also add modifiers to your regex, ie:

    if ( m!$catreg!is ) {
    The i makes it case insensitive (picks up h1 and H1); the s makes . match newlines too, so a pattern that uses . can treat the whole string/page as one line, matching H1's created by that lovely editor, Dreamweaver, eg:
    <h1>I am a heading created by Dreamweaver</h1>
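
    A small sketch of what the two modifiers change, assuming a heading that an editor has split across two lines (the sample string is made up). Note that s only affects the . metacharacter, so it makes no difference to a pattern built purely from character classes or \w:

```perl
use strict;
use warnings;

# Hypothetical heading split across two lines.
my $page = "<h1>Heading\nText</h1>";

# /i alone: <H1> in the pattern matches <h1>, but . stops at the newline.
my ($without_s) = $page =~ m!<H1>(.+)</H1>!i;

# /i and /s: . also matches the newline, so the whole heading is captured.
my ($with_s) = $page =~ m!<H1>(.+)</H1>!is;

print defined $without_s ? "without /s: $without_s\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: $with_s\n"       : "with /s: no match\n";
```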

    <rant>
    (not that Dreamweaver users ever seem to use <Hn> when <p><b><font size=6> will do instead :)
    </rant>

    cLive ;-)

    Update: I missed that you were matching a possible tag before the match you use (see below). I strongly suggest you look at a parsing module if you don't know whether tags will contain tags or not!

      The regex also appears to be attempting to skip any tags immediately following the opening H1 tag, and not requiring a closing H1 tag (is a closing tag required for H1?).
Re: (**corrected**) What Does This Line Do?
by tachyon (Chancellor) on Jun 07, 2001 at 01:06 UTC
    $catreg = "^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)";

    This places the characters on the right hand side of the equals sign and between the " into a variable called $catreg.

    if ( m!$catreg! ) {

    This line is a conditional test. IF the bit between the ( ) is true then the block following the condition will be evaluated - this block goes from { to }.

    To simplify, m!foo! is a regular expression that will match any string that contains the letters 'foo' in that order.

    In your case we have m!$catreg! When perl sees the $catreg it looks to see what the variable catreg contains and uses this content in the regular expression. Thus you are looking for a match like this:

    m!^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)!

    Note we are matching against the unspecified default value for regex matches which is stored in the magical $_ variable. In English this reads as follows:

    # the lack of a $var =~ here indicates
    # that we are matching against $_
    m                  # this is a match regex
    !                  # we define the extent of the regex using a !
    ^                  # starting at the beginning of the string
    (                  # grouping and capturing opening bracket
    <                  # matches this character '<'
    [A-Za-z0-9]+       # match one or more alphanumerics in a row
    >                  # match a literal '>'
    )                  # closing bracket
                       # the match between brackets captures to $1
    *                  # accept 0 or more of the preceding stuff
                       # but note only the last repetition of the stuff
                       # within the brackets is kept in $1
    <H1>               # match a literal '<H1>'
    (<[A-Za-z0-9]+>)*  # same as first but captures into $2
                       # the last '<alphanumeric>' sequence
    ([A-Za-z0-9]+)     # matches sequential alphanumerics and
                       # captures all into $3
    !                  # end of regex

    I'm afraid that this regex does not do anything like what you think it does. Its results are complex; I suggest you try this sample code. Play with the value of $_ to see what I mean.

    $catreg = "^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)";
    $_="<a><b><H1><c><d>foo";
    if ( m!$catreg! ) {
        print "Matched \$1:$1 \$2$2 \$3$3\n";
    }
    else {
        print "No Match\n";
    }
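
    To see where the pattern gives up, here is a sketch that runs it over a handful of made-up input lines. The helper name category_of and the sample lines are assumptions for illustration only; the pattern itself is the one from the question.

```perl
use strict;
use warnings;

my $catreg = "^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)";

# Returns the text captured in $3, or undef when the pattern does not match.
sub category_of {
    my ($line) = @_;
    return $line =~ m!$catreg! ? $3 : undef;
}

# Hypothetical inputs, chosen to exercise different parts of the pattern.
my @samples = (
    '<H1>ABCD</H1>',               # plain heading
    '<H1>AB CD</H1>',              # space inside the heading
    '<H1>AB-CD</H1>',              # punctuation inside the heading
    '<h1>ABCD</h1>',               # lower-case tag
    '  <H1>ABCD</H1>',             # leading whitespace before the tag
    '<FONT SIZE=6><H1>ABCD</H1>',  # preceding tag with an attribute
    '<B><H1><I>ABCD</I></H1>',     # simple tags before and after <H1>
);

for my $line (@samples) {
    my $cat = category_of($line);
    printf "%-28s => %s\n", $line, defined $cat ? $cat : 'no match';
}
```

    Headings containing a space or punctuation are cut off at the first non-alphanumeric character, and lines with a lower-case tag, leading whitespace, or a preceding tag that has attributes do not match at all. That would explain why it works for some combinations of letters and not others.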

    If you want to capture the text between <H1> and </H1>, regardless of whether the tags are written <H1>...</H1> or <h1>...</h1>, then this will work.

    $_='<h1>foo</h1>blah<H1>bar</H1>more blah';
    while (m!<h1>(.*?)</h1>!ig) {
        print "Found $1\n";
    }
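
    The ? after .* matters here. A short sketch, reusing the same sample string, of what happens with and without it (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = '<h1>foo</h1>blah<H1>bar</H1>more blah';

# Non-greedy .*? stops at the nearest closing tag, so /g finds each heading.
my @lazy = $text =~ m!<h1>(.*?)</h1>!ig;

# Greedy .* runs on to the last closing tag and swallows everything between.
my ($greedy) = $text =~ m!<h1>(.*)</h1>!i;

print "non-greedy: $_\n" for @lazy;
print "greedy:     $greedy\n";
```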

    tachyon

Re: (**corrected**) What Does This Line Do?
by ChemBoy (Priest) on Jun 07, 2001 at 00:19 UTC

    What it does is exactly what you say it does: parse a line for the text within an <H1> tag, and frequently screw up. :-)

    Slightly more precisely (though not precisely precisely), it matches a line that contains any text, an <H1> tag, possibly another HTML tag, and another arbitrary string of alphanumerics, and then does something with that last string. The circumstances under which it will fail are, alas, too numerous to go into here <wry>--I recommend looking into HTML::TokeParser if you have that option.
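
    For what it's worth, a minimal sketch of the HTML::TokeParser route might look like the following. It assumes the module is installed (it ships with the HTML-Parser distribution on CPAN, not with core Perl), and the helper name h1_texts and the sample document are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution on CPAN

# Returns the text of every <h1> element found in the given HTML string.
sub h1_texts {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @headings;
    # get_tag skips ahead to the next <h1> start tag (tag names are lower-cased).
    while ( $parser->get_tag('h1') ) {
        # Collect the text up to the closing tag, with whitespace trimmed.
        push @headings, $parser->get_trimmed_text('/h1');
    }
    return @headings;
}

# Hypothetical document, not from the original post.
my $doc = '<p>intro</p><h1><b>Foo</b> Bar</h1><H1 class="x">Baz</H1>';
print "Found: $_\n" for h1_texts($doc);
```

    The parser copes with upper- or lower-case tags, attributes, and tags nested inside the heading, which are the cases a hand-written regex tends to trip over.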

    If you want a slightly more lucid explanation of the regular expression, you might want to try OGRE, which I believe is written by a fellow monk.



    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders

Re: (**corrected**) What Does This Line Do?
by srawls (Friar) on Jun 07, 2001 at 00:39 UTC
    ^(<[A-Za-z0-9]+>)*<H1>(<[A-Za-z0-9]+>)*([A-Za-z0-9]+)

    Matches the beginning, then '<', then the char class. Maybe you should change that to \w+ or [^<\/]+. Then the '>'. Now it matches that any number of times, storing the opening tag right before '<H1>' in $1. If you don't use that tag, I would advise you to take out the capturing parens and put in (?: ... ). Alright, let's move on ...
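
    A sketch of the (?: ... ) suggestion, under the assumption that only the heading word is needed. The helper name heading_word is made up, and the test string is borrowed from tachyon's reply:

```perl
use strict;
use warnings;

# Same shape as the original pattern, but the tag groups no longer capture,
# so the heading word lands in $1 instead of $3.
my $catreg = qr/^(?:<[A-Za-z0-9]+>)*<H1>(?:<[A-Za-z0-9]+>)*([A-Za-z0-9]+)/;

# Returns the heading word, or undef when the line does not match.
sub heading_word {
    my ($line) = @_;
    return $line =~ $catreg ? $1 : undef;
}

my $word = heading_word('<a><b><H1><c><d>foo');
print defined $word ? "heading word: $word\n" : "no match\n";
```

    With this change the rest of the script would need to read $1 where it currently reads $3.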

    Next, it matches '<H1>' (and not '<h1>'). In the next set of parens, I think you need to take out the '<' and '>'. Also, you should match the '</h1>' tag as well. Then, it matches the next run of A-Za-z0-9 chars. If you take out the '<' and '>' in those parens ($2 doesn't match unless you do) then there will be no A-Za-z0-9 chars left (the previous parens gobbled them up), so $3 would get the last char in between the header tags (due to backtracking).

    You are checking $3 to see what's in between the header tags, and that confused me at first, but you aren't capturing anything in $2 because of those '<' and '>' chars.

    Well, that was by no means a complete look at the regex, but it should have given you an idea of how error-prone parsing HTML is. I would look into one of the modules like HTML::Parser to parse the HTML.

    The 15 year old, freshman programmer,
    Stephen Rawls

Re: (**corrected**) What Does This Line Do?
by stephen (Priest) on Jun 07, 2001 at 00:41 UTC
Re: (**corrected**) What Does This Line Do?
by runrig (Abbot) on Jun 07, 2001 at 00:22 UTC
    It appears to be (badly) attempting to return first some tag after an H1 tag (skipping all tags following the H1) in $1, then the longest sequence of word characters immediately following those tags in $2.
Re: What Does This Line Do?
by John M. Dlugosz (Monsignor) on Jun 07, 2001 at 00:42 UTC
    This will match a line that contains zero or more open tags, followed by a <H1>, another block of zero or more tags, and then a word. The $3 mentioned on the print line "checking for cat" will hold the word after all the tags.