rsriram has asked for the wisdom of the Perl Monks concerning the following question:

Hi, in my markup file, I want to find every occurrence that matches <figr n="(.+?)">(.*?)<\/figr> and replace it with the tag shown below.

if($_ =~ /<figr n="(.+?)">(.*?)<\/figr>/)
    {
        $fno=$1;
        $figno=sprintf("%03d", $fno);
        $_ =~ s/<figr n="(.+?)">(.*?)<\/figr>/<FIGIND NUM="$fno" ID="FG.$figno">$2<\/FIGIND>/g;
    }

The original figr number from the markup file should be padded with leading zeros to make it a 3-digit number. My problem is that if I have two <figr n="#"> elements on the same line, the value of the first indicator gets used for the second indicator too.

For example,

Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <figr n="1">Figure 1</figr>. Like all other cells in the body, it has a nucleus that contains genetic information.<figr n="2">Figure 2</figr>. The cell is covered by a membrane and is filled with a fluid.

The output is created as:

Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <FIGIND NUM="1" ID="FG.001">Figure 1</FIGIND>. Like all other cells in the body, it has a nucleus that contains genetic information.<FIGIND NUM="1" ID="FG.001">Figure 2</FIGIND>. The cell is covered by a membrane and is filled with a fluid.

Can anyone tell me what's wrong in my code? I also tried using <figr n="([^"]+)"> in search instead of the above pattern.

Replies are listed 'Best First'.
Re: Matching a pattern in Regex
by davorg (Chancellor) on Jul 25, 2006 at 10:22 UTC

    This is a good use for the /e option on the substitution operator.

    use strict;
    use warnings;

    $_ = <DATA>;

    if (/<figr n="/) {
        s[<figr n="(\d+)">(.*?)</figr>]
         [qq(<FIGIND NUM="$1" ID="FG.) . sprintf('%03d', $1) . qq(">$2</FIGIND>)]eg;
    }

    print;

    __DATA__
    Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <figr n="1">Figure 1</figr>. Like all other cells in the body, it has a nucleus that contains genetic information.<figr n="2">Figure 2</figr>. The cell is covered by a membrane and is filled with a fluid.
    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      There is IMHO no point in the 'if' condition in your snippet :) Or am I missing something?

           s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print

        Well, it was there (albeit in a more complex form) in the original code, so I thought I'd simplify it but leave it there.

        If there's a lot of text and substitutions are only carried out on a small number of lines, then it's probably worth doing a simple check like this - but I'd almost certainly leave it out if I was writing this code from scratch.

        As in all cases, the only sure way to know if it's useful is to benchmark the options with something similar to the real data.
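
        One way such a comparison could be set up is with the core Benchmark module. This is only a sketch: the sample lines, the 99:1 ratio of plain lines to lines with figures, and the sub names with_check and without_check are all made up for illustration, so the numbers it prints say nothing about the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: mostly plain lines, one line with <figr> elements.
my @lines = (
    ('Plain text without any figure reference.') x 99,
    'See <figr n="1">Figure 1</figr> and <figr n="2">Figure 2</figr>.',
);

# Variant 1: run the substitution only when a cheap pre-check matches.
sub with_check {
    my @copy = @lines;
    for (@copy) {
        if (/<figr n="/) {
            s{<figr n="(\d+)">(.*?)</figr>}
             {qq(<FIGIND NUM="$1" ID="FG.) . sprintf('%03d', $1) . qq(">$2</FIGIND>)}eg;
        }
    }
    return \@copy;
}

# Variant 2: always run the substitution.
sub without_check {
    my @copy = @lines;
    for (@copy) {
        s{<figr n="(\d+)">(.*?)</figr>}
         {qq(<FIGIND NUM="$1" ID="FG.) . sprintf('%03d', $1) . qq(">$2</FIGIND>)}eg;
    }
    return \@copy;
}

# Run each variant for roughly one CPU second and print a comparison table.
cmpthese(-1, {
    with_check    => \&with_check,
    without_check => \&without_check,
});
```

        Swapping @lines for a slice of the real markup file would give a more meaningful answer than this toy data.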


Re: Matching a pattern in Regex
by Hofmator (Curate) on Jul 25, 2006 at 10:35 UTC

    davorg provided you with a working solution; let me briefly explain why your code is not working.

    Paraphrasing your code into pseudo-code:

    if pattern matches
        set variable $fno based on above pattern match
        set variable $figno based on above pattern match
        substitute pattern with string that uses $fno and $figno

    So the variables $fno and $figno are set once, based on the if-match (which grabs the first occurrence on a line). Then the substitution replaces all (/g) occurrences of the pattern with a string that contains the two variables. These variables don't change during the substitution, so you always get a replacement built from the first match. The only variables that change 'automatically' during the substitution are the special variables $1, $2, ... as you can see in the result string ('Figure 2' is correctly substituted for the 2nd occurrence).

    There is nothing wrong with your patterns, both work just fine - with the usual caveat about parsing such markup with regexes ...
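
    The difference described above can be seen side by side in a small sketch. The sub names convert_fixed and convert_each and the shortened sample line are made up for illustration; the first sub follows the original code, the second evaluates the replacement per match with /e.

```perl
use strict;
use warnings;

my $line = 'See <figr n="1">Figure 1</figr> and <figr n="2">Figure 2</figr>.';

# Variables are filled in once, from the first match, before the
# substitution runs (the approach from the original question).
sub convert_fixed {
    my ($text) = @_;
    if ($text =~ /<figr n="(.+?)">(.*?)<\/figr>/) {
        my $fno   = $1;
        my $figno = sprintf('%03d', $fno);
        $text =~ s/<figr n="(.+?)">(.*?)<\/figr>/<FIGIND NUM="$fno" ID="FG.$figno">$2<\/FIGIND>/g;
    }
    return $text;
}

# The replacement is evaluated as code for every match (/e), so $1 and $2
# always refer to the occurrence currently being replaced.
sub convert_each {
    my ($text) = @_;
    $text =~ s{<figr n="(.+?)">(.*?)</figr>}
              {sprintf('<FIGIND NUM="%s" ID="FG.%03d">%s</FIGIND>', $1, $1, $2)}eg;
    return $text;
}

print convert_fixed($line), "\n";   # both elements get NUM="1" ID="FG.001"
print convert_each($line),  "\n";   # second element gets NUM="2" ID="FG.002"
```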

    -- Hofmator

Re: Matching a pattern in Regex
by GrandFather (Saint) on Jul 25, 2006 at 10:36 UTC

    Generally, using regexen for parsing markup is tricky and best left to modules such as HTML::TreeBuilder. If you can absolutely predict what the markup will be, then you may get away with the following:

    use warnings;
    use strict;

    my $str = <<TXT;
    Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <figr n="1">Figure 1</figr>. Like all other cells in the body, it has a nucleus that contains genetic information.<figr n="2">Figure 2</figr>. The cell is covered by a membrane and is filled with a fluid.
    TXT

    $str =~ s/
        <figr\sn="(\d+?)">                 # tag including figure number
        (.*?)                              # element contents
        <\/figr>                           # close tag
        /
        "<FIGIND NUM=\"$1\" ID=\"FG." .    # replacement tag
        sprintf("%03d", $1) .              # padded number
        "\">$2<\/FIGIND>"                  # remainder of element
        /gmsxe; # Global, multi-line, ignore newline, ignore whitespace, evaluate

    print $str;

    Prints:

    Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <FIGIND NUM="1" ID="FG.001">Figure 1</FIGIND>. Like all other cells in the body, it has a nucleus that contains genetic information.<FIGIND NUM="2" ID="FG.002">Figure 2</FIGIND>. The cell is covered by a membrane and is filled with a fluid.

    DWIM is Perl's answer to Gödel
      If we care about variations in the input data, even your regex is not enough; e.g. it won't match <figr  n="2">Figure 2</figr>, as it contains more than one space between figr and n. The IMHO best solution is to process XML-like data as XML. But the input string may not be well-formed...
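
      For the extra-whitespace case only, the pattern could be loosened as in the sketch below. The sub name convert_loose is made up, and this is still a regex rather than a parser: it assumes double-quoted, purely numeric n values and no other attributes on the tag.

```perl
use strict;
use warnings;

# Hypothetical helper: tolerate extra whitespace inside the opening tag.
# \s+ after "figr" and \s* around "=" and before ">" are the only changes.
sub convert_loose {
    my ($text) = @_;
    $text =~ s{<figr\s+n\s*=\s*"(\d+)"\s*>(.*?)</figr>}
              {sprintf('<FIGIND NUM="%s" ID="FG.%03d">%s</FIGIND>', $1, $1, $2)}egs;
    return $text;
}

print convert_loose('<figr  n="2">Figure 2</figr>'), "\n";
```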

      I wrote a simple example of how the job could be done if the input data is part of a well-formed XML document:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use XML::Twig;

      my $twig = XML::Twig->new(
          twig_handlers => {
              figr => sub {
                  my $fnum = $_->att('n');
                  $_->del_att('n');
                  $_->set_tag('FIGIND');
                  $_->set_att(NUM => $fnum, ID => sprintf('FG.%03d', $fnum));
              }
          }
      );

      my $str;
      {
          local $/ = undef;
          $str = <DATA>;
      }
      $str = "<dummy>$str</dummy>";
      $twig->parse($str);
      $str = $twig->sprint;
      $str =~ s!</?dummy>!!g;
      print $str;

      __DATA__
      Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <figr n="1">Figure 1</figr>. Like all other cells in the body, it has a nucleus that contains genetic information. <figr n="2">Figure 2</figr>. The cell is covered by a membrane and is filled with a fluid.
      It prints:
      Nerve cells come in many shapes and sizes, but they all have a number of identifiable parts. A typical nerve cell is shown in <FIGIND ID="FG.001" NUM="1">Figure 1</FIGIND>. Like all other cells in the body, it has a nucleus that contains genetic information. <FIGIND ID="FG.002" NUM="2">Figure 2</FIGIND>. The cell is covered by a membrane and is filled with a fluid.

Re: Matching a pattern in Regex
by rodion (Chaplain) on Jul 25, 2006 at 10:57 UTC
    Lots of good advice is already here. Just one more item to deal with: the OP's question at the end of his post.
    Can anyone tell me what's wrong in my code?
    The line
    $_ =~ s/<figr n="(.+?)">(.*?)<\/figr>/<FIGIND NUM="$fno" ID="FG.$figno">$2<\/FIGIND>/g;
    should use the $fno variable, as in
    $_ =~ s/<figr n="($fno)">(.*?)<\/figr>/<FIGIND NUM="$fno" ID="FG.$figno">$2<\/FIGIND>/g;
    That will correct the repeated substitutions for the wrong number. After that, change the "if" to a "while" and the code works.
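
    Put together, the two changes might look like the sketch below. The wrapper sub convert_loop is made up so the loop can work on a copy of the text, and the \Q...\E around $fno is an addition not in the line above; it quotes any regex metacharacters the captured value might contain.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the "while" version of the original code.
sub convert_loop {
    my ($text) = @_;
    # Each pass picks up the first <figr> still left in the text, so the
    # loop ends once every element has been converted.
    while ($text =~ /<figr n="(.+?)">(.*?)<\/figr>/) {
        my $fno   = $1;
        my $figno = sprintf('%03d', $fno);
        # Replace only the elements carrying this particular number.
        $text =~ s/<figr n="(\Q$fno\E)">(.*?)<\/figr>/<FIGIND NUM="$fno" ID="FG.$figno">$2<\/FIGIND>/g;
    }
    return $text;
}

print convert_loop('See <figr n="1">Figure 1</figr> and <figr n="2">Figure 2</figr>.'), "\n";
```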

    Works is good, but there are better ways. For those, see previous posts.

Re: Matching a pattern in Regex
by Ieronim (Friar) on Jul 25, 2006 at 10:28 UTC
    You need the /e switch in substitution.
    Use something like

    s{<figr n="([^"]+)">(.*?)</figr>}
     {'<FIGIND NUM="'.$1.'" ID="FG.'.sprintf("%03d", $1).'">'.$2.'</FIGIND>'}eg;

    instead of the snippet you have shown.

Re: Matching a pattern in Regex
by davidrw (Prior) on Jul 25, 2006 at 12:23 UTC
    No one has mentioned it yet, so here goes: to avoid escaping the / in the regex, use s### (or similar) instead of s///. It just adds one more tidbit of legibility. A trivial example:

    my $s = "<foo>stuff</foo>";
    #$s =~ s/<foo>(.*?)<\/foo>/<bar>$1<\/bar>/;
    $s =~ s#<foo>(.*?)</foo>#<bar>$1</bar>#;
    print $s;