in reply to extracting a substring from a string - multiple variables

Nobody answered after 16 min? Oh, graff did (and was faster than me) ;-)

...
my ($fiop, $length, $data) = $string =~ m{<file   # tag anchor
    \s+ fiop="([^"]+)"      # (foo)
    \s+ length="([^"]+)"    # (bar)
    />                      # end: start file tag
    \s* (.*?)               # (baz) - note the "nongreedyness" .*?
    </file>                 # end: file tag
}x;
print "$fiop, $length, $data\n";
...

Addendum: forgot the tag-cleaning part:

...
(my $notags = $string) =~ s{<file.+?</file>}{};
print "$notags\n";
...

Your mistake was basically to use the greedy quantifier (.*), which first matches all the way to the end of the string, then backtracks, and so ends up matching from the rear ...
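To make the greedy point concrete, here is a minimal sketch. The input string with two <file> elements is made up for illustration, it is not the original poster's data:

```perl
use strict;
use warnings;

# Hypothetical input with two <file> elements, so the difference is visible.
my $string = '<file fiop="a" length="3"/>one</file> mid '
           . '<file fiop="b" length="3"/>two</file>';

# Greedy: .* runs to the end of the string, then backtracks to the LAST </file>.
my ($greedy) = $string =~ m{/>(.*)</file>};

# Non-greedy: .*? stops at the FIRST </file> that lets the match succeed.
my ($lazy) = $string =~ m{/>(.*?)</file>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows everything up to the second closing tag, while the non-greedy one stops after "one".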

Regards

mwa

Re^2: extracting a substring from a string - multiple variables
by walinsky (Scribe) on Oct 27, 2007 at 23:23 UTC
    I'm not sure if I should rephrase...
    The data I get is from a POST request; it's almost XML except for the <file/> part that I need to take out, as parser.pm barfs on it.
    in "<file fiop="foo" length="bar"/>baz</file>", baz is raw binary;
    that's why (I think) the preg matching (m^) doesn't work.
    FYI baz looks like: µÜ¡3õ§©AEurope/Amsterdam$...
      Ohh, *if* there is some binary within the tag and *if* the "length" field says something about its *length*, you could easily construct a regex that extracts binary data of that length:

      my $binary = pack 'F*', (3.141592) x 10; # make binary vector of length 80 bytes
      my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>' . $binary . '</file>...blah...';

      my ($fiop, $length, $data) = $string =~ m{<file   # tag anchor
          \s+ fiop="([^"]+)"      # (fiop)
          \s+ length="([^"]+)"    # (length)
          />                      # end: start file tag
          ((??{ "\\C{$2}" }))     # self modifying regex for binary stuff
          </file>                 # end: file tag
      }sx;
      print "$fiop, $length (data comes below)\n";
      print join ',', unpack("F*", $data); # extract binary data again

      (my $notags = $string) =~ s{<file.+</file>}{};
      print "\n$notags\n";

      In the above I pack a binary sequence of 10 Pi-Numbers (double, 10 x 8 bytes) into the tag, match a binary sequence of its length ($2) and unpack it afterwards.
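
      The same idea can be sketched without putting the length into a regex quantifier: match only the opening tag, then let pos() and substr() take the declared number of bytes. This is a rough sketch under assumptions, not a tested drop-in: extract_file() is a made-up helper name, the string is assumed to hold raw bytes (not decoded characters), and "length" is assumed to be a plain decimal byte count. It also avoids \C, which newer Perls (5.24 and later) no longer accept in patterns.

```perl
use strict;
use warnings;

# Same test data as above: 10 native doubles packed into a binary string.
my $binary = pack 'F*', (3.141592) x 10;
my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
           . $binary . '</file>...blah...';

# extract_file() is a hypothetical helper, not part of any module.
sub extract_file {
    my ($input) = @_;

    # Match only the opening tag; /g in scalar context sets pos() to its end.
    $input =~ m{<file \s+ fiop="([^"]+)" \s+ length="(\d+)" \s* />}gx
        or return;
    my ($fiop, $length, $start) = ($1, $2, pos($input));

    # Take exactly $length bytes, then insist that </file> follows directly.
    my $data = substr($input, $start, $length);
    return unless length($data) == $length
              and substr($input, $start + $length, 7) eq '</file>';

    # Everything before the opening tag plus everything after the closing tag.
    my $rest = substr($input, 0, $-[0]) . substr($input, $start + $length + 7);
    return ($fiop, $length, $data, $rest);
}

my ($fiop, $length, $data, $notags) = extract_file($string);
print "$fiop, $length (data comes below)\n";
print join(',', unpack 'F*', $data), "\n";
print "$notags\n";
```

      Because the payload is cut out by offset, it does not matter which bytes it contains (newlines, NULs, or even a literal </file>), as long as the declared length is right.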

      Regards

      mwa

        seems like graff was faster again ;)
        I did like your approach though!
        I'm not quite sure if (in your solution) $binary is known - in my case I'm handling POSTed data, where $length is declared in the data, and $binary just sits between the <file/> tags.
        Still; if $length and $binary could be extracted, unpack sounds more logical to me.
        Seems like we're getting somewhere now; the code throws an error though, when applying the regex:
        Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/\\C{ <-- HERE 304507}/

        Any idea?