JediWizard has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks,

I must be missing something here, so perhaps somebody can enlighten me. I have written a regular expression, and as far as I can tell, it should work. What I want to do is remove all HTML tags except for <BR>. The following code is not doing what I expect from it.

#!/usr/local/bin/perl -w
use strict;

my $file = do{local $/; <DATA>};

$file =~ s/ < (?!=BR) [^>]* > //gsmx;

print $file;

exit;

__DATA__
<P ID="pStyle21X0"><SPAN ID="textStyle14">Parthenogenesis . . . . . . . . . . . . . . 592 <BR>Patents on Life-Forms . . . . . . . . . . . 594 <BR>Paternity Tests . . . . . . . . . . . . . . . 596 <BR>Pedigree Analysis . . . . . . . . . . . . . 599 <BR>Penetrance . . . . . . . . . . . . . . . . . 602 <BR>Phenylketonuria (PKU) . . . . . . . . . . 604 <BR>Plasmids . . . . . . . . . . . . . . . . . . 606 <BR>Polygenic Inheritance . . . . . . . . . . . 609 <BR>Polymerase Chain Reaction . . . . . . . . 611 <BR>Polyploidy . . . . . . . . . . . . . . . . . 613 <BR>Population Genetics . . . . . . . . . . . . 617 <BR>Prader-Willi and Angelman </SPAN>^M
</P>^M
<P ID="pStyle21X4"><SPAN ID="textStyle14">Syndromes. . . . . . . . . . . . . . . . 623 <BR>Prenatal Diagnosis . . . . . . . . . . . . . 626 <BR>Prion Diseases: Kuru and </SPAN>^M
</P>^M
<P ID="pStyle21X4"><SPAN ID="textStyle14">Creutzfeldt-Jakob Syndrome . . . . . . 631 <BR>Protein Structure . . . . . . . . . . . . . 634 <BR>Protein Synthesis. . . . . . . . . . . . . . 638 <BR>Proteomics . . . . . . . . . . . . . . . . . 643 </SPAN>^M
^M
<SPAN ID="textStyle14">Pseudogenes . . . . . . . . . . . . . . . . 646 <BR>Pseudohermaphrodites . . . . . . . . . . 648 <BR>Punctuated Equilibrium. . . . . . . . . . 650 </SPAN>^M
</P>^M

I would expect that this would print out the data with only "BR" tags remaining.... however this is what I get:

Parthenogenesis . . . . . . . . . . . . . . 592 Patents on Life-Forms . . . . . . . . . . . 594 Paternity Tests . . . . . . . . . . . . . . . 596 Pedigree Analysis . . . . . . . . . . . . . 599 Penetrance . . . . . . . . . . . . . . . . . 602 Phenylketonuria (PKU) . . . . . . . . . . 604 Plasmids . . . . . . . . . . . . . . . . . . 606 Polygenic Inheritance . . . . . . . . . . . 609 Polymerase Chain Reaction . . . . . . . . 611 Polyploidy . . . . . . . . . . . . . . . . . 613 Population Genetics . . . . . . . . . . . . 617 Prader-Willi and Angelman ^M
^M
Syndromes. . . . . . . . . . . . . . . . 623 Prenatal Diagnosis . . . . . . . . . . . . . 626 Prion Diseases: Kuru and ^M
^M
Creutzfeldt-Jakob Syndrome . . . . . . 631 Protein Structure . . . . . . . . . . . . . 634 Protein Synthesis. . . . . . . . . . . . . . 638 Proteomics . . . . . . . . . . . . . . . . . 643 ^M
^M
Pseudogenes . . . . . . . . . . . . . . . . 646 Pseudohermaphrodites . . . . . . . . . . 648 Punctuated Equilibrium. . . . . . . . . . 650 ^M
^M

What am I missing? Should not the (?!=BR) cause the regex to not match the <BR> tags?

Any wisdom would be much appreciated.

May the Force be with you

Replies are listed 'Best First'.
Re: Forward look ahead assertion problems
by Enlil (Parson) on Feb 09, 2005 at 17:40 UTC
    The zero width negative look ahead assertion is (?!pattern) not (?!=pattern).

    -enlil

      Doh! I knew I was missing something... and it figures it would be something small. Thank you!

      enlil++

      May the Force be with you
Re: Forward look ahead assertion problems
by cowboy (Friar) on Feb 09, 2005 at 17:42 UTC
    According to perldoc perlre, a zero-width negative look-ahead looks like:
    (?!pattern)
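    The = is actually causing it to reject only a "<" that is followed by the literal string "=BR". A plain <BR> does not start with "=BR", so the look-ahead succeeds and the tag is matched and removed like any other.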
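
For illustration, here is a minimal sketch of the corrected substitution, assuming the goal is exactly what the question describes (strip every tag except <BR>). The helper name strip_tags_except_br and the sample string are made up for this example. The \b and the /i modifier are additions that are not in the thread: \b keeps a tag such as <BRIEF> from being mistaken for <BR>, and /i also spares a lowercase <br>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post):
# remove every tag except <BR>.
sub strip_tags_except_br {
    my ($html) = @_;
    # (?!BR\b) makes the match fail when "<" is followed by the word "BR",
    # so <BR> tags are left alone and every other tag is deleted.
    $html =~ s/ < (?!BR\b) [^>]* > //gix;
    return $html;
}

# Made-up sample input in the style of the question's data.
my $sample = '<P ID="p1"><SPAN ID="s1">Plasmids 606 <BR>Polyploidy 613 </SPAN></P>';
print strip_tags_except_br($sample), "\n";
```

With this sample only the <BR> tag should survive. A regex like this is fine for simple, well-formed markup such as the data above; it is not a general HTML parser, and a ">" inside an attribute value would end the match early.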

      Thank you. I guess I just needed another set of eyes on it.

      cowboy++

      May the Force be with you
Re: Forward look ahead assertion problems
by perlsen (Chaplain) on Feb 10, 2005 at 04:16 UTC

    Hi, have you tried the right syntax?

    (?=PATTERN) (positive lookahead)
    (?!PATTERN) (negative lookahead)
    (?<=PATTERN) (positive lookbehind)
    (?<!PATTERN) (negative lookbehind)
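
The four assertions listed above can be tried side by side. This is only a sketch: the sub name lookaround_demo, the hash keys, and the sample string are invented for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical demo: return what each of the four assertions captures.
sub lookaround_demo {
    my ($text) = @_;
    my %found;
    # (?=PATTERN): digits that ARE followed by " USD"
    ($found{ahead})      = $text =~ /(\d+)(?= USD)/;
    # (?!PATTERN): a whole number that is NOT followed by " USD"
    ($found{not_ahead})  = $text =~ /\b(\d+)\b(?! USD)/;
    # (?<=PATTERN): digits that ARE preceded by "$"
    ($found{behind})     = $text =~ /(?<=\$)(\d+)/;
    # (?<!PATTERN): a whole number that is NOT preceded by "$"
    ($found{not_behind}) = $text =~ /(?<!\$)\b(\d+)\b/;
    return %found;
}

my %result = lookaround_demo('price: $42, cost: 17 USD');
print "$_ => $result{$_}\n" for sort keys %result;
```

None of the assertions consumes any characters, which is why only the digits end up in the capture group. The \b anchors in the two negative forms keep the regex engine from backtracking to a partial number such as the "2" in "42".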