dsayars has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple script that extracts shape names from a Visio stencil in .vsx (XML) format. The problem is that a bug in Visio puts newlines in some of the name strings. If the names are clean, this works as a regex:

<Master ID='.*?' NameU='(.*?)'

"(.*?)" then extracts fine as $1.

However, since newlines are present, I have to OR with something that matches them:

while ($text=~/<Master ID='.*?' NameU='(.*?)' |<Master ID='.*?' NameU='(.*?)/sg)

This matches the names containing newlines, but apparently because the match goes over the line boundary, the $1 contains a null. Only an empty line in my output shows that the match was made.

Is there a way out of this catch-22, where you have to match something containing a newline and also extract data from it?

Re: Extract data from regex match where "." is newline?
by Eliya (Vicar) on Dec 15, 2011 at 21:22 UTC
    However, since newlines are present, I have to OR with something that matches them

    Not sure I understand. Normally, /s is sufficient for "." to also match newlines. No alternation required.

    my $text = "<Master ID='42' NameU='foo\nbar'";
    if ($text =~ /<Master ID='.*?' NameU='(.*?)'/s) {
        print "'$1'\n";
    }

    outputs (as expected)

    'foo
    bar'
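
For what it's worth, the empty $1 in the original attempt is probably not a line-boundary effect. Capture groups are numbered left to right across the whole pattern, so the group in the second alternative is $2, not $1, and a lazy (.*?) at the very end of a pattern is satisfied by the empty string. The sketch below shows this on a made-up string (the sample text and the sub names alternation_captures and name_of are assumptions for illustration, not taken from the original script). It also shows a single pattern built on [^']*, which needs neither the alternation nor /s.

```perl
use strict;
use warnings;

# Hypothetical sample: the closing quote is followed by '>' rather than a
# space, so the first alternative of the question's pattern cannot match.
my $text = "<Master ID='42' NameU='foo\nbar'>";

# The pattern from the question: each alternative has its own capture group,
# so a match through the second alternative fills $2 and leaves $1 undefined.
sub alternation_captures {
    my ($s) = @_;
    if ($s =~ /<Master ID='.*?' NameU='(.*?)' |<Master ID='.*?' NameU='(.*?)/s) {
        return ($1, $2);
    }
    return;
}

# One pattern, no alternation: [^']* matches newlines without needing /s.
sub name_of {
    my ($s) = @_;
    return $s =~ /<Master ID='[^']*' NameU='([^']*)'/ ? $1 : undef;
}

my ($first, $second) = alternation_captures($text);
printf "\$1 is %s, \$2 is '%s'\n",
    defined $first ? "'$first'" : 'undef', $second;
print "name: ", name_of($text), "\n";
```

Run as is, this should report $1 as undefined and $2 as empty for the alternation, and print the two-line name for the single pattern.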
Re: Extract data from regex match where "." is newline?
by ww (Archbishop) on Dec 16, 2011 at 02:05 UTC
    Second the motion from Deacon Eliya ...even after putting the /g back into the mix:
    #!/usr/bin/perl
    use Modern::Perl;
    # 943829
    my $str = "<Master ID='foobar\nblivitz' NameUI='12345'";
    if ( $str =~ /<Master ID='(.*?)' NameUI='(.*?)'/sg ) {
        say "\$1: $1";
        say "\$2: $2";
    } else {
        say "WTH?"
    }
    Prints:
    $1: foobar
    blivitz
    $2: 12345
    So,
    1. Have you looked rilly, rilly carefully at the source data? For example,
          Characters? Character encoding? Tabs masquerading as spaces?
          Spurious appearance as if newlines were present because of wrap-on-render?
          Anything else?
    2. Any possibility that you have a quoting problem when you tell your regex to test $text?
    3. And, precisely, where are the newlines that you think are giving you trouble?

    Please post a sample, wrapped in <code>...</code> tags.

      Thanks to Eliya, muba and ww. It turned out the Perl solution was not a Perl solution. I discovered by accident that the unwanted newlines could be removed by saving the Visio .vsx (stencil) file as a Visio .vdx (drawing) file. (Simply removing all newlines was no good because it created one long 70 MB line.) Since you are a religious order, you will want to know what moral lesson I drew from this. It is that the tool you used to create the file on which you are going to run Perl often has the means to make the file more presentable to Perl, which is to say more worthy of Perl.
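
If re-saving from Visio were not an option, one middle ground between keeping every newline and removing all of them might be to collapse only the newlines that fall inside NameU values. The following is a minimal sketch under that assumption. The sub name flatten_name_newlines is hypothetical, and it assumes single-quoted values that contain no apostrophes, as in the examples in this thread.

```perl
use strict;
use warnings;

# Replace line breaks (with any surrounding whitespace) inside NameU='...'
# values by a single space. Everything outside those values is left alone,
# so the file keeps its normal line structure.
sub flatten_name_newlines {
    my ($xml) = @_;
    $xml =~ s{(\bNameU=')([^']*)(')}{
        # Copy the captures first: the inner s/// resets $1, $2 and $3.
        my ($open, $value, $close) = ($1, $2, $3);
        $value =~ s/\s*\R\s*/ /g;    # \R matches \n, \r\n, etc. (Perl 5.10+)
        "$open$value$close"
    }ge;
    return $xml;
}

my $xml = "<Master ID='1' NameU='foo\nbar'>\n<Master ID='2' NameU='baz'>\n";
print flatten_name_newlines($xml);
```

The line break between the two tags survives, so the result is not squeezed into one very long line.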

Re: Extract data from regex match where "." is newline?
by muba (Priest) on Dec 16, 2011 at 01:58 UTC

    Of course, one thing you could wonder is whether using regexps for this is the right solution. What if the ID and NameU attributes for some reason or another appear in reverse order?
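
A proper XML parser is the sturdier route for that. Short of one, a two-step regex at least removes the dependence on attribute order: first isolate each Master start tag, then read its attributes into a hash. This is only a sketch. The sub name master_names and the sample string are made up for illustration, and it assumes single-quoted attribute values that contain neither an apostrophe nor a '>' character.

```perl
use strict;
use warnings;

# Return the NameU value of every <Master ...> start tag in $xml,
# regardless of the order in which the attributes appear.
sub master_names {
    my ($xml) = @_;
    my @names;
    # Step 1: grab the attribute text of each <Master ...> start tag.
    # [^>]* also matches newlines, so no /s is needed.
    while ($xml =~ /<Master\b([^>]*)>/g) {
        my $attr_text = $1;
        # Step 2: turn that text into attribute => value pairs.
        my %attr = $attr_text =~ /([\w:]+)\s*=\s*'([^']*)'/g;
        push @names, $attr{NameU} if defined $attr{NameU};
    }
    return @names;
}

# Hypothetical input with the attributes in both orders.
my $xml = "<Master NameU='foo\nbar' ID='42'><Master ID='7' NameU='baz'/>";
print "[$_]\n" for master_names($xml);
```

Both names are found, including the one that contains a newline and the one whose NameU comes before ID.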