in reply to Re: extracting a substring from a string - multiple variables
in thread extracting a substring from a string - multiple variables

I'm not sure if I should rephrase...
The data I get is from a POST request; it's almost XML, except for the <file/> part, which I need to take out because parser.pm barfs on it.
In "<file fiop="foo" length="bar"/>baz</file>", baz is raw binary;
that's why (I think) the regex matching (m^) doesn't work.
FYI baz looks like: µÜ¡3õ§©AEurope/Amsterdam$...

Re^3: extracting a substring from a string - multiple variables
by mwah (Hermit) on Oct 28, 2007 at 00:24 UTC
    Ohh, *if* there is some binary within the tag and *if* the "length" field says something about its *length*, you could easily construct a regex that extracts binary data of that length:
        my $binary = pack 'F*', (3.141592) x 10; # make binary vector of length 80 bytes
        my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
                   . $binary . '</file>...blah...';

        my ($fiop, $length, $data) =
            $string =~ m{<file                  # tag anchor
                         \s+ fiop="([^"]+)"     # (fiop)
                         \s+ length="([^"]+)"   # (length)
                         />                     # end: start file tag
                         ((??{ "\\C{$2}" }))    # self modifying regex for binary stuff
                         </file>                # end: file tag
                        }sx;

        print "$fiop, $length (data comes below)\n";
        print join ',', unpack("F*", $data);    # extract binary data again

        (my $notags = $string) =~ s{<file.+</file>}{};
        print "\n$notags\n";

    In the above I pack a binary sequence of 10 Pi-Numbers (double, 10 x 8 bytes) into the tag, match a binary sequence of its length ($2) and unpack it afterwards.

    Regards

    mwa

      seems like graff was faster again ;)
      I did like your approach though!
      I'm not quite sure if (in your solution) $binary is known - in my case I'm handling POSTed data, where $length is declared in the data, and $binary just sits between the <file/> tags.
      Still; if $length and $binary could be extracted, unpack sounds more logical to me.
        walinsky
        in my case I'm handling POSTed data, where $length is declared in the data, and $binary just sits between the <file/> tags.

        If that's so, you *definitely* can't use any approach other than blindly extracting a byte sequence of the given length (as in Example 2), because the data *might* at some point contain the sequence \x00€µ</file>³á>>~, which would break your program if you used a regex like ... =~ m{<file>.*?</file>} ...
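        To make the failure mode concrete, here is a minimal sketch. The payload and the surrounding <doc> wrapper are made up for illustration; the only assumption taken from the thread is that the length attribute holds the payload size in bytes. It compares a non-greedy match with a copy by declared length:

```perl
use strict;
use warnings;

# Hypothetical payload that happens to contain the closing tag as raw bytes.
my $binary = "\x00\xb5</file>\xb3\xe1>>~";
my $string = '<doc><file fiop="foo" length="' . length($binary) . '"/>'
           . $binary . '</file></doc>';

# Naive approach: stop at the first "</file>", wherever it occurs.
my ($naive) = $string =~ m{<file[^>]*/>(.*?)</file>}s;

# Length-based approach: find the start tag, then copy exactly as many
# bytes as the length attribute declares, ignoring what they contain.
my $by_length;
if ($string =~ m{<file \s+ fiop="[^"]+" \s+ length="(\d+)" />}gx) {
    $by_length = substr $string, pos($string), $1;
}

printf "naive: %d bytes, by length: %d bytes, expected: %d bytes\n",
    length($naive), length($by_length), length($binary);
```

        The non-greedy version stops at the "</file>" inside the payload and returns a truncated chunk, while the length-based copy returns the whole payload.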

        Regards

        mwa

      Seems like we're getting somewhere now; the code throws an error though, when applying the regex:
      Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/\\C{ <-- HERE 304507}/

      any idea ?

        Hmmm .. you are hitting the "quantifier length limit" of your perl implementation (I thought it should be 0xffff, but going by your error message it is 32766).

        (1) How long is your binary chunk actually (the above message says "304507" - dooh!) and (2) what number is in the ... length="xxx" ... field? Really *that* large?


        update:

        How to read arbitrarily big binary chunks from within regular expressions ...

        You could advance until you hit the data (right after the end of the start tag) and simply read the data that follows. This assumes you have one ... ...<file>..</file> ... entry per string at this point.

        ...
        my $binary = pack 'F*', (3.141592) x 8001; # this will dump a 64K+ binary chunk
        my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
                   . $binary . '</file>...blah...';

        my ($fiop, $length, $data);

        if( $string =~ m{<file \s+ fiop="([^"]+)" \s+ length="([^"]+)" />}gx ) {
           ($fiop, $length) = ($1, $2);                   # extract tag properties as usual
           $data = substr $string, pos($string), $length; # extract data by direct string copy
        }

        print "$fiop, $length\n";
        print join ',', unpack("F*", $data);

        (my $notags = $string) =~ s{<file.+</file>}{};
        print "\n$notags\n";
        ...
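        If a string can hold more than one <file> entry, or if you want the binary removed by length as well before handing the rest to the XML parser, the same idea can be put in a loop. This is only a sketch under assumptions: split_files is a made-up helper name, the <a>/<b> wrapper elements are invented test data, and the length attribute is assumed to be a plain decimal byte count.

```perl
use strict;
use warnings;

# Remove every <file .../>DATA</file> chunk from a string, using the declared
# length to skip the payload. Returns the cleaned text followed by one hash
# per payload. split_files is a hypothetical helper, not part of any module.
sub split_files {
    my ($string) = @_;
    my ($clean, @files) = ('');
    my $last = 0;    # offset just past the previously handled chunk

    while ($string =~ m{<file \s+ fiop="([^"]+)" \s+ length="(\d+)" />}gx) {
        my ($fiop, $length) = ($1, $2);
        my $tag_start  = $-[0];           # where "<file" begins
        my $data_start = pos($string);    # first byte after "/>"

        # Keep the text before the tag, then copy the payload by length.
        $clean .= substr $string, $last, $tag_start - $last;
        push @files, { fiop => $fiop, data => substr($string, $data_start, $length) };

        $last = $data_start + $length;
        last if $last > length $string;   # declared length overruns the input

        # Skip the closing tag if it is where we expect it, then resume there.
        $last += 7 if substr($string, $last, 7) eq '</file>';
        pos($string) = $last;
    }

    # Append whatever follows the last chunk.
    $clean .= substr $string, $last if $last <= length $string;
    return ($clean, @files);
}

my $binary = pack 'F*', (3.141592) x 8001;    # same kind of test payload as in the post
my $string = '<a>x</a><file fiop="foo" length="' . length($binary) . '"/>'
           . $binary . '</file><b>y</b>';

my ($clean, @files) = split_files($string);
print "$clean\n";
print scalar(@files), " file(s), first has ", length($files[0]{data}), " bytes\n";
```

        Since the payload is skipped by offset rather than matched, it does not matter whether it contains newlines or a stray </file>, and no quantifier limit comes into play.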

        Regards

        mwa