in reply to Re^2: extracting a substring from a string - multiple variables
in thread extracting a substring from a string - multiple variables

Ohh, *if* there is some binary data within the tag and *if* the "length" field gives its *length*, you could easily construct a regex that extracts binary data of exactly that length:
    my $binary = pack 'F*', (3.141592) x 10;   # make binary vector of length 80 bytes
    my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
               . $binary . '</file>...blah...';

    my ($fiop, $length, $data) = $string =~
      m{<file                    # tag anchor
        \s+ fiop="([^"]+)"       # (fiop)
        \s+ length="([^"]+)"     # (length)
        />                       # end: start file tag
        ((??{ "\\C{$2}" }))      # self modifying regex for binary stuff
                                 # (\C matches one byte; removed in Perl 5.24)
        </file>                  # end: file tag
       }sx;

    print "$fiop, $length (data comes below)\n";
    print join ',', unpack("F*", $data);       # extract binary data again

    # /s so that '.' also matches newline bytes inside the binary data
    (my $notags = $string) =~ s{<file.+</file>}{}s;
    print "\n$notags\n";

In the above I pack a binary sequence of 10 Pi-Numbers (double, 10 x 8 bytes) into the tag, match a binary sequence of its length ($2) and unpack it afterwards.
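
To make the byte arithmetic concrete, here is a minimal round-trip sketch. It assumes the `d` template (native double, 8 bytes on common platforms) rather than the `F` used above, because `F` is Perl's own NV type and can be wider on builds with long doubles.

```perl
use strict;
use warnings;

# Ten copies of the same value, packed as native doubles.
# 'd' is an assumption here; the post above uses 'F' (Perl's NV).
my @values = (3.141592) x 10;
my $binary = pack 'd*', @values;

printf "packed length: %d bytes\n", length($binary);

# Unpacking with the same template gives the values back.
my @back = unpack 'd*', $binary;
printf "values recovered: %d\n", scalar @back;
```

The packed length is what would go into the length="..." attribute.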

Regards

mwa

Re^4: extracting a substring from a string - multiple variables
by walinsky (Scribe) on Oct 28, 2007 at 01:07 UTC
    seems like graff was faster again ;)
    I did like your approach though!
    I'm not quite sure if (in your solution) $binary is known - in my case I'm handling POSTed data, where $length is declared in the data, and $binary just sits between the <file/> tags.
    Still, if $length and $binary could be extracted, unpack sounds more logical to me.
      walinsky
      "in my case I'm handling POSTed data, where $length is declared in the data, and $binary just sits between the <file/> tags."

      If that's so, you *definitely* can't use any approach other than blindly extracting a byte sequence of the given length (as in Example 2), because the data *might* at some point contain a sequence like  \x00€µ</file>³á>>~  which would break your program if you used a delimiter-based regex such as ... =~ m{<file>.*?</file>} ...
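
A small sketch of that failure mode, with a made-up payload that happens to contain the text `</file>`. The payload, the variable names and the simplified start-tag pattern are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical payload: 20 bytes that happen to contain "</file>".
my $payload = "\x00\x01</file>\xff\xfe" . ('A' x 9);
my $string  = '<file fiop="foo" length="' . length($payload) . '"/>'
            . $payload . '</file>';

# Delimiter-based match: stops at the first "</file>", which is inside the payload.
my ($by_regex) = $string =~ m{<file [^>]*/>(.*?)</file>}s;

# Length-based copy: take exactly the declared number of bytes after the start tag.
my $by_length;
if ($string =~ m{<file \s+ fiop="[^"]+" \s+ length="(\d+)" />}gx) {
    $by_length = substr($string, pos($string), $1);
}

printf "delimiter-based: %d bytes, length-based: %d bytes\n",
    length($by_regex), length($by_length);
```

Only the length-based copy returns the whole payload; the non-greedy match is cut short at the embedded `</file>`.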

      Regards

      mwa

Re^4: extracting a substring from a string - multiple variables
by walinsky (Scribe) on Oct 28, 2007 at 12:41 UTC
    Seems like we're getting somewhere now; the code throws an error though, when applying the regex:
    Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/\\C{ <-- HERE 304507}/

    Any idea?

      Hmmm .. you are hitting the "quantifier length limit" of your perl implementation (I expected 0xffff, but the message puts it at 32766).

      (1) How long is your binary chunk actually (the message above says "304507" - dooh!) and (2) what number is in the ... length="xxx" ... field? Really *that* large?
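
For reference, a minimal sketch that probes the limit on whatever perl you are running. The exact maximum depends on the Perl version and build, so the code only reports what happens rather than assuming a number; `$n` is simply the length from the error message.

```perl
use strict;
use warnings;

my $n = 304507;    # the length reported in the error message above

# The pattern contains a variable, so it is compiled at run time and a
# compile failure ends up in $@ instead of aborting the script.
my $re  = eval { qr/(?s:.{$n})/ };
my $err = $@;

print defined $re ? "compiled\n" : "rejected: $err";
```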


      update:

      How to read arbitrarily large binary chunks from within regular expressions ...

      You could advance until you hit the data (right after the end of the start tag) and simply read the data that follows. This assumes you have one ... <file>..</file> ... entry per string at this point.

          ...
          my $binary = pack 'F*', (3.141592) x 8001;   # this will dump a 64K+ binary chunk
          my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
                     . $binary . '</file>...blah...';

          my ($fiop, $length, $data);
          if( $string =~ m{<file \s+ fiop="([^"]+)" \s+ length="([^"]+)" />}gx ) {
             ($fiop, $length) = ($1, $2);                    # extract tag properties as usual
             $data = substr $string, pos($string), $length;  # extract data by direct string copy
          }

          print "$fiop, $length\n";
          print join ',', unpack("F*", $data);

          # /s so that '.' also matches newline bytes inside the binary data
          (my $notags = $string) =~ s{<file.+</file>}{}s;
          print "\n$notags\n";
          ...

      Regards

      mwa

        since:
            if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
                ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );

        works, I wondered if we couldn't just back reference like:
            if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.{$2})</file>}{}s ) {
                ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );

        Wouldn't something like that be possible? It would also drop the assumption that there's just one <file></file> pair.
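
As far as I can tell, `.{$2}` cannot work as written: `$2` is interpolated when the pattern is compiled, before the match starts, so it does not see the length captured by the same match. One way to handle several entries anyway is to loop over the start tags with `m//g` and copy each payload with `substr`. What follows is only a sketch; the test input, `@files` and `$rest` are made-up names for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with two <file> entries; the first payload contains "</file>".
my @payloads = ("\x00</file>\xff", pack('d*', (3.141592) x 3));
my $indata = 'head';
for my $i (0 .. $#payloads) {
    $indata .= sprintf('<file fiop="f%d" length="%d"/>', $i, length($payloads[$i]))
             . $payloads[$i] . '</file>';
}
$indata .= 'tail';

my @files;        # one hash per extracted entry
my $rest = '';    # everything outside the <file> entries
my $last = 0;     # offset just past the previous entry
my $close = '</file>';

while ($indata =~ m{<file \s+ fiop="([^"]+)" \s+ length="(\d+)" />}gx) {
    my ($fiop, $length) = ($1, $2);
    my $start = pos($indata);                          # first byte of the payload
    $rest .= substr($indata, $last, $-[0] - $last);    # text before this start tag

    my $data = substr($indata, $start, $length);       # copy by declared length
    # Stop if the closing tag is not where the length field says it should be.
    last unless substr($indata, $start + $length, length($close)) eq $close;

    push @files, { fiop => $fiop, length => $length, data => $data };
    $last = $start + $length + length($close);
    pos($indata) = $last;                              # resume after this entry
}
$rest .= substr($indata, $last);

printf "%d entries, %d bytes of surrounding text\n", scalar(@files), length($rest);
```

Setting `pos($indata)` by hand keeps the next search from looking inside a payload, so a `<file` or `</file>` byte sequence in the binary data does no harm.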