in reply to Re^5: extracting a substring from a string - multiple variables
in thread extracting a substring from a string - multiple variables

Since

if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
    ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
}

works, I wondered if we couldn't just back-reference the length, like:

if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.{$2})</file>}{}s ) {
    ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
}

Wouldn't something like that be possible? It would also leave out the implication that there's just one <file></file> pair.

Re^7: extracting a substring from a string - multiple variables
by graff (Chancellor) on Oct 28, 2007 at 20:30 UTC
    Whoa... let's take a step back.
    • You are trying to handle POSTed data, so there's a reasonable chance that you can't trust the 'length="..."' information.
    • There is also a concern (because it's POSTed data) of corruptions involving "file" tags somehow being present within the binary data.
    • The binary chunks are apparently rather large, so you might run into memory issues if your approach involves having too many copies of too much data in perl variables.
    • Now you seem to be hinting that a given POST might contain two or more segments within "file" tags.
    • You haven't said much about the content outside the "file" tags, but apparently it's supposed to be valid XML once the "file" tags are removed.

    I think you'd be better off if your client(s) used ftp to transfer the binary stuff as data files (with distinct file names), and then just put references to the file names in the XML stream that gets posted. This way, there's nothing in the XML stream except valid XML, and doing stuff with the binary data will be easier, putting less load on the overall process.

    But if there's no chance of doing it sensibly like that, then you just need to use a while loop for handling more than one <file/>...</file> element in the data, and hope for the best:

    while ( $indata =~ s{<file fiop="([^"]+)" length="(\d+)"/>(.*?)</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
        # do something with $fileData, possibly after checking that
        # $fileLength == length( $fileData ), if that matters to you
    }
    if ( $indata =~ m{<file fiop=|</file>} ) {
        # there's something wrong with the posted data, so it's still
        # not suitable for XML parsing...
    }
Re^7: extracting a substring from a string - multiple variables
by mwah (Hermit) on Oct 28, 2007 at 20:25 UTC
    walinsky
    Wouldn't something like that be possible; that would also leave out the implication that there's just one <file/></file> pair

    To make the problem clearer:

    • we have a text like
      <file fiop="fiop_name" length="333333"/>#333K binary chunk goes here#}</file>
    • the "binary chunk" may have any length and may contain any data, possibly (with a lower probability) even the ending tag ... \x02\xc5</file>\x64\xf4  ...
    • per string $string, more than one such <file .../> ...</file> sequence is to be expected

    The only chance I'd see here would be to advance to the start of data, extract the data by substr($string, pos($string), $length) and update the string's pos($string) behind the data: pos($string) += $length. At that point, it could be checked for the expected ending tag </file>. All this happens in a while loop under /g until no more <file> blocks can be found.
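    The steps above can be sketched roughly as follows. This is only a sketch under assumptions: the helper name extract_files, its return shape, and the sample input are made up for illustration, and the declared length is trusted apart from a bounds check.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a reference to a
# list of [ fiop, data ] pairs plus the text with the file blocks removed.
sub extract_files {
    my ($string) = @_;
    my @files;
    my $rest = '';
    my $last = 0;    # offset just past the previous </file>

    while ( $string =~ m{<file fiop="([^"]+)" length="(\d+)"/>}g ) {
        my ( $fiop, $length ) = ( $1, $2 );
        my $tag_start = $-[0];           # where the <file .../> tag begins
        my $start     = pos($string);    # first byte of the binary chunk

        # The declared length must leave room for the closing tag.
        die "truncated data for $fiop\n"
            if $start + $length + length('</file>') > length($string);

        my $data = substr( $string, $start, $length );

        # Jump over the chunk without scanning it, then check the end tag.
        pos($string) = $start + $length;
        $string =~ m{\G</file>}gc
            or die "missing </file> after $fiop\n";

        $rest .= substr( $string, $last, $tag_start - $last );
        $last = pos($string);
        push @files, [ $fiop, $data ];
    }
    $rest .= substr( $string, $last );
    return ( \@files, $rest );
}

# Made-up input: the first chunk contains a stray </file> on purpose.
my ( $files, $rest ) = extract_files(
    qq{<a><file fiop="x" length="12"/>ab</file>cde</file><b/></a>} );
printf "%s: %d bytes\n", $_->[0], length $_->[1] for @$files;
print "remaining: $rest\n";
```

    Because the chunk is skipped by offset rather than matched, a </file> inside the binary data is never looked at, and no counted quantifier is involved.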

    Could the above text describe problem and solution?

    Regards

    mwa

Re^7: extracting a substring from a string - multiple variables
by walinsky (Scribe) on Oct 28, 2007 at 21:46 UTC
    Replying to myself, mwah and graff

    I'm handling POSTed data; actually, I'm reverse engineering .Mac services. This is why I'm not afraid the 'length' information can't be trusted; programmers from Cupertino wouldn't fool themselves _that_ much.
    Also, as far as I've seen, there's never been more than 1 <file></file> pair. It would just have been the cherry on the cake to take that chance out.
    While
    if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
    }

    works (for now), I definitely want to rule out the chance that the binary data contains </file>, so I'd just like to optimize the regex.
    I _do_ know it can be done with substr, but (knowing, though not completely understanding, the power of regexes) I just wondered whether a back-reference to the length within the regex would be possible.
    Is something like:
    if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.{$2})</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
    }
    possible?
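    For what it's worth, a plain .{$2} interpolates whatever $2 held before the match started, so it is not a back-reference to the length just captured. One way that might work is a postponed subexpression, (??{ ... }), which builds the sub-pattern at match time. The sketch below assumes made-up sample data, and it only suits chunks shorter than perl's build-dependent cap on counted quantifiers.

```perl
use strict;
use warnings;

# Made-up input: the 12-byte chunk contains a stray </file> on purpose.
my $indata =
    qq{<doc><file fiop="demo" length="12"/>ab</file>cde</file><tail/></doc>};
my ( $fiop, $fileLength, $fileData );

# (\d+) captures the declared length. (??{ ... }) then builds "(?s:.{N})"
# from it while matching, so group 3 consumes exactly N characters.
# (?s:...) is spelled out inside the generated pattern rather than
# relying on the outer /s flag reaching it.
if ( $indata =~ s{<file fiop="([^"]+)" length="(\d+)"/>((??{ "(?s:.{$2})" }))</file>}{}s ) {
    ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
    print "fiop=$fiop, declared=$fileLength, got=", length($fileData), "\n";
}
print "remaining: $indata\n";
```

    If the declared length is wrong, the closing </file> won't line up and the substitution simply fails, which doubles as a sanity check.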

    update:
    As regex matching is limited to a given (64k or so) length, we decided to assume there's only 1 occurrence of a <file/> node, so we can match greedily:
    if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*)</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
    }
    (matching the final occurrence of </file>)
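    To illustrate the difference between the two quantifiers, here is a small sketch. The sample data is made up, and it assumes exactly one file block per POST, as decided above.

```perl
use strict;
use warnings;

# Made-up input: one file block whose 12-byte chunk contains </file>.
my $indata =
    qq{<doc><file fiop="demo" length="12"/>ab</file>cde</file><tail/></doc>};

# Non-greedy: stops at the first </file>, which sits inside the chunk.
my ($lazy) = $indata =~ m{<file fiop="[^"]+" length="[^"]+"/>(.*?)</file>}s;

# Greedy: runs on to the last </file> in the string.
my ( $declared, $greedy ) =
    $indata =~ m{<file fiop="[^"]+" length="([^"]+)"/>(.*)</file>}s;

printf "lazy:   %d bytes\n", length $lazy;      # 2 bytes ("ab")
printf "greedy: %d bytes\n", length $greedy;    # 12 bytes

# Cheap sanity check against the declared length.
print "length matches\n" if $declared == length $greedy;
```

    With a second file block in the same POST, the greedy version would swallow everything up to the last </file>, so the length check is worth keeping.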