tommyw has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on converting some pseudo-SGML we've got at work into XML. The first step is ensuring that it's well formed. We've got some tags which look like:

<e st="obs" st="ali" num="1">
which won't parse, since duplicate attributes aren't allowed.

So I'm converting them with the following code:

my $NMTEXT='[A-Za-z]\w*';
while (s/<([^>]*)\s+            # Open tag, name and assorted crud.
         ($NMTEXT)="([^ >]*)"   # Attribute and value
         (\s+[^>]*)?\s+         # Optional assorted crud
         \2="([^ >]*)"          # Repeated attribute and value
        /<$1 $2="$3,$5"$4/gx) { }
This does the job fine, and the above example gets spat back out as:
<e st="obs,ali" num="1">
But in the process, I get warned about:
Use of uninitialized value in concatenation (.) or string at mkdtd.pl line 33, <XT> line 1.
I've run it through the debugger and found that in this case, I'm getting: ($1, $2, $3, $4, $5) = ('e', 'st', 'obs', undef, 'ali'). Yup, it's right: there's no optional assorted crud, so $4 is undefined.

So my question is: what's the best way of dealing with this? I don't want to turn warnings off, in case anything else might slip past, but I don't want to have this complaint thrown at me all the time.

(As a side issue, I also feel guilty about using a while loop with no body. Any better suggestions?)

--
Tommy
Too stupid to live.
Too stubborn to die.

Re: Using undefined back-references?
by Abigail-II (Bishop) on Aug 19, 2002 at 16:10 UTC
    Just add one extra character in the regexp:
    while (s/<([^>]*)\s+            # Open tag, name and assorted crud.
             ($NMTEXT)="([^ >]*)"   # Attribute and value
             (|\s+[^>]*)?\s+        # Optional assorted crud
             \2="([^ >]*)"          # Repeated attribute and value
            /<$1 $2="$3,$5"$4/gx) { }
    Notice the "Optional assorted crud" part.

    Abigail

Re: Using undefined back-references?
by Courage (Parson) on Aug 19, 2002 at 16:04 UTC
    This is one of the reasons non-capturing groups (?:...) were invented. When you don't need to capture something, or a group in your pattern is optional, just use them.

    Where you wrote:

    (\s+[^>]*)?\s+ # Optional assorted crud
    write instead:
    (?:\s+[^>]*)?\s+ # Optional assorted crud

    Courage, the Cowardly Dog

    Addition: I suggest you use the trick that fruiture suggests below. Personally, I use that trick quite often.

      Won't do it, I'm afraid, 'cos I actually need to keep that crud, if it exists. Especially if somebody throws me the attributes in a different order:

      <e st="obs" num="1" st="ali">
      
      needs to group in the same way. By using ?: the entire intermediate string disappears, and I end up with
      <e st="obs,ali">
      
      Which wasn't precisely what I was after. Although it does shut the warning up :-)

      --
      Tommy
      Too stupid to live.
      Too stubborn to die.

        How about

        ((?:\s+[^>]*)?)\s+ # Optional assorted crud

        ? The capturing () is not marked with ? (or *) so it won't be undef, but the non-capturing (?:) is marked with ?, so it can fail to match: in that case the () saves an empty string, not undef.

        --
        http://fruiture.de
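
A minimal sketch of fruiture's suggestion, assuming the same $NMTEXT pattern as in the original post. The sub name merge_duplicate_attributes and the two sample tags are only illustrative, and the loop is written in the statement-modifier form rather than with an empty block.

```perl
use strict;
use warnings;

my $NMTEXT = '[A-Za-z]\w*';

# Hypothetical helper: merge repeated attributes inside each tag of the text.
sub merge_duplicate_attributes {
    my ($text) = @_;
    # Repeat until no tag contains a repeated attribute any more.
    1 while $text =~ s/<([^>]*)\s+           # Open tag, name and assorted crud
                       ($NMTEXT)="([^ >]*)"   # Attribute and value
                       ((?:\s+[^>]*)?)\s+     # Optional crud, empty string when absent
                       \2="([^ >]*)"          # Repeated attribute and value
                      /<$1 $2="$3,$5"$4/gx;
    return $text;
}

# Adjacent duplicates: group 4 captures an empty string, so no warning.
print merge_duplicate_attributes('<e st="obs" st="ali" num="1">'), "\n";
# prints: <e st="obs,ali" num="1">

# Duplicates separated by another attribute: group 4 keeps the crud.
print merge_duplicate_attributes('<e st="obs" num="1" st="ali">'), "\n";
# prints: <e st="obs,ali" num="1">
```

Because the outer parentheses always take part in the match, $4 is defined in both cases, and the intermediate attributes survive when they exist.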
Re: Using undefined back-references?
by mirod (Canon) on Aug 19, 2002 at 16:08 UTC

    You can try using the e modifier in the substitution and have the replacement be something like this:

    /qq{<$1 $2="$3,$5"} . ($4 || '')/gex

    This should remove the warning by using '' in place of $4 when it's undefined. (When $4 does match, it starts with whitespace, so it can never be the false string '0' and there is no need for defined $4 here.)

    --
    The Error Message is GOD - MJD
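
For comparison, here is a sketch of mirod's /e approach applied to the original pattern, which is left unchanged so that $4 may still be undef. The sub name merge_with_e is made up for illustration, and the replacement uses an explicit defined test instead of ||, which is a slightly more defensive variant of what was suggested rather than the exact code from the reply.

```perl
use strict;
use warnings;

my $NMTEXT = '[A-Za-z]\w*';

# Hypothetical helper: same merge, but the replacement is evaluated as code.
sub merge_with_e {
    my ($text) = @_;
    1 while $text =~ s/<([^>]*)\s+           # Open tag, name and assorted crud
                       ($NMTEXT)="([^ >]*)"   # Attribute and value
                       (\s+[^>]*)?\s+         # Optional crud, undef when absent
                       \2="([^ >]*)"          # Repeated attribute and value
                      /qq{<$1 $2="$3,$5"} . (defined $4 ? $4 : '')/gex;
    return $text;
}

print merge_with_e('<e st="obs" st="ali" num="1">'), "\n";
# prints: <e st="obs,ali" num="1">
```

On perl 5.10 or later the ternary could also be written as ($4 // ''), which tests definedness rather than truth.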

Re: Using undefined back-references?
by theorbtwo (Prior) on Aug 20, 2002 at 00:57 UTC

    I find regexes with (metachar...) tricks in them hard to read -- I'm not much of a regexer; it's one of the more important holes in my understanding of perl. Therefore, I'd use $4||''. However, that's already been mentioned in this thread. My actual point is about the empty while: I think it would be clearer with a comment in the body, such as # empty while; the work is in the condition. Writing it as 1 while foo; is more idiomatic. Either way, a bare pair of empty braces doesn't make it clear that you haven't just left an implied #WRITEME... which I think is what's bothering you.


    Confession: It does an Immortal Body good.
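
To show why the loop is needed at all, and what the 1 while form looks like, here is a small sketch. The tag with three st attributes is a made-up input, and the pattern is the always-defined-capture variant from earlier in the thread, compiled once with qr//.

```perl
use strict;
use warnings;

my $NMTEXT = '[A-Za-z]\w*';
my $dup_re = qr/<([^>]*)\s+           # Open tag, name and assorted crud
                ($NMTEXT)="([^ >]*)"   # Attribute and value
                ((?:\s+[^>]*)?)\s+     # Optional crud, empty string when absent
                \2="([^ >]*)"          # Repeated attribute and value
               /x;

my $tag = '<e st="a" st="b" st="c">';

# A single pass, even with the g flag, merges only one pair per tag.
(my $once = $tag) =~ s/$dup_re/<$1 $2="$3,$5"$4/g;
print "$once\n";    # prints: <e st="a" st="b,c">

# The statement-modifier loop keeps substituting until nothing matches.
my $all = $tag;
1 while $all =~ s/$dup_re/<$1 $2="$3,$5"$4/g;
print "$all\n";     # prints: <e st="a,b,c">
```

The constant 1 is just a do-nothing statement; all the work happens in the condition, which is the same thing the empty braces were expressing.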