karavay has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to extract multiple images ("image data") from a file - the storage structure is as follows:

AZII*
.
.->e.g 20 bytes of data
.
AZII*
.
.
etc...

The following code extracts the data between the AZII* markers and puts it into numbered files:
open (FILE, '1.dat') || die "$!";
$num=0;
while ($line=<FILE>) {
    $num++ if $line =~/AZII*/;
    if ($line=~/AZII*/ .. /AZII*/) {
        open (DES, ">>out/$num.txt");
        print DES $line;
    }
    if ($num == "5"){ # process 5 files only
        last;
    }
}

The problem I came across is that the image files (TIFFs in this case) must keep a fixed layout, like any binary structure, e.g.:

header
image data

Currently the above code extracts the data in the following way:
it looks for AZII* and, if found, outputs the whole line into the file, which damages the image:

file 1:
-header(AZII*)
-image data
file 2:
-last row of image data from file 1 + header file 2 (AZII*)
-image data

Basically it prints the whole row containing the last AZII* match.
The part of that last row before AZII* (the .* part) should be printed into file 1, and AZII* alone should start the next file (file 2).

------------------------------------------------
e.g ƒÿýGDÄÄ#>q0‡‘‘„#"x°ÂñÇÿÿÿÿ ñ½ÿÿ AZII*
---------------------------------------------------

Where: ƒÿýGDÄÄ#>q0‡‘‘„#"x°ÂñÇÿÿÿÿ should be the end of file 1
And: AZII* should start the next file

Hope this makes sense - any suggestions on how I can approach this?

Thanks,

Replies are listed 'Best First'.
Re: File Extraction - Cont...
by graff (Chancellor) on Sep 26, 2007 at 02:44 UTC
    If you are using MS-Windows, you probably need to do "binmode" on both the input and output files after you open them, to turn off "crlf" mode. If you don't do that, the binary data will get corrupted.
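The binmode advice above can be sketched as a small byte-for-byte copy. This is only an illustration: the helper name copy_raw, the temporary file names, and the 4096-byte chunk size are assumptions made for the example and are not part of the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy a file byte-for-byte, with binmode on both handles.
sub copy_raw {
    my ($src, $dst) = @_;
    open(my $in,  '<', $src) or die "$src: $!";
    open(my $out, '>', $dst) or die "$dst: $!";
    binmode $in;     # no CRLF translation on input
    binmode $out;    # no LF -> CRLF expansion on output
    local $/ = \4096;    # read fixed-size chunks instead of "lines"
    while (defined(my $chunk = <$in>)) {
        print {$out} $chunk;
    }
    close $out or die "$dst: $!";
    close $in;
    return -s $dst;      # size of the copy in bytes
}

# Sample data containing bytes that text mode can mangle on Windows.
my $dir   = tempdir(CLEANUP => 1);
my $bytes = "AZII*\x00\x0d\x0a\x1a\xff";

open(my $fh, '>', "$dir/in.dat") or die "$dir/in.dat: $!";
binmode $fh;
print {$fh} $bytes;
close $fh or die "$dir/in.dat: $!";

my $size = copy_raw("$dir/in.dat", "$dir/out.dat");
print "copied $size bytes\n";
```

On Unix-like systems binmode makes no difference for plain byte handles, so the sketch behaves the same with or without it there; the calls matter on platforms that translate line endings.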

    Apart from that, you really should set the input record separator variable ($/) to "AZII*" (as recommended in replies to your previous thread), because using line-oriented i/o on non-text data is just strange. Try something like this:

    use strict;

    open( FILE, '1.dat' ) or die "1.dat: $!";
    binmode FILE;
    $/ = "AZII*";    # read "records" that end at the marker, not at newlines
    my $num = 0;
    while (<FILE>) {
        chomp;                  # removes the trailing "AZII*"
        next unless length();   # skip the initial (empty) input record
        $num++;
        open( DES, ">out/$num.txt" ) or die "out/$num.txt: $!";
        binmode DES;
        print DES "AZII*" . $_; # put the marker back at the start of each file
        close DES;
        last if ( $num == 5 );
    }
    (Updated to add the "next unless length()" condition -- if the file begins with "AZII*", the first input record will be empty after the "chomp".)

    You may need to play with that, e.g. if "AZII*" is supposed to be followed by a line-feed (or carriage-return + line-feed).
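To see how the record separator approach behaves without touching real files, here is a minimal sketch that runs the same logic over an in-memory string. The helper name split_records and the sample data are made up for illustration; only $/, chomp and the length check come from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a byte string into records that each begin
# with $marker, using $/ the same way as the file-based loop.
sub split_records {
    my ($data, $marker) = @_;
    open(my $fh, '<', \$data) or die "in-memory open: $!";
    binmode $fh;
    local $/ = $marker;            # records end at the marker, not at "\n"
    my @records;
    while (defined(my $rec = <$fh>)) {
        chomp $rec;                # chomp strips a trailing $/ (the marker)
        next unless length $rec;   # empty record before the first marker
        push @records, $marker . $rec;
    }
    close $fh;
    return @records;
}

# The newline inside the first record is kept as ordinary data.
my @recs = split_records("AZII*one\ntwo AZII*three", "AZII*");
printf "%d records\n", scalar @recs;
```

Note one assumption baked into this sketch: any bytes that appear before the first marker would also be emitted with a marker prepended, so it only suits data that really starts with "AZII*".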

    I don't understand why the output file names in the OP were set to "$num.txt" -- these are not text files. You say they are image files (tiff), so why not use the appropriate extension for the output files? Am I missing something in your description of the problem?

      I've used txt in this example because it is easier to debug and check the output - I actually use binmode in my code (I probably deleted it, along with the rest of the comments, while writing up my request). Thanks for your reply, I'll play around with it - the problem is that I am a newbie to Perl and still have a pretty poor vocabulary (I really hope to improve it in the future, and the only way to achieve this is to practice and ask questions :) Thanks again,
      Your code has solved the problem - really appreciate your help.
Re: File Extraction - Cont...
by CountZero (Bishop) on Sep 26, 2007 at 05:57 UTC
    As an aside, your regex does not do what you probably think it does.

    /AZII*/ checks whether there is a literal string "AZI" followed by zero or more characters "I" in $line. If you want to check for the literal string "AZII*" then you must escape the "*", as it has a special meaning in a regex: /AZII\*/. It probably did not hurt you here, unless somewhere in the binary data there was a sequence of bytes which translates to "AZI".

    Update: Changed "AZII" to "AZI" as per apl's and johngg's (in CB) comments.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Minor nit... the pattern "AZII*" would match AZI as well as AZII, AZIII, etc.
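The difference between the unescaped and escaped patterns discussed in this subthread can be checked with a short sketch. The test strings and the %hits table are invented for the example; \Q...\E and index() are shown as two standard alternatives to escaping the "*" by hand.

```perl
use strict;
use warnings;

my $marker = 'AZII*';

# For each sample string, record which ways of looking for the marker succeed.
my %hits;
for my $s ('AZI', 'AZII', 'AZIII', 'AZII*') {
    $hits{$s} = {
        unescaped => ($s =~ /AZII*/           ? 1 : 0),  # "AZI" + zero or more "I"
        escaped   => ($s =~ /AZII\*/          ? 1 : 0),  # literal "AZII*"
        quoted    => ($s =~ /\Q$marker\E/     ? 1 : 0),  # metacharacters quoted
        by_index  => (index($s, $marker) >= 0 ? 1 : 0),  # plain substring search
    };
}

for my $s (sort keys %hits) {
    printf "%-6s unescaped=%d escaped=%d quoted=%d index=%d\n",
        $s, @{ $hits{$s} }{qw(unescaped escaped quoted by_index)};
}
```

Only the unescaped pattern accepts "AZI", "AZII" and "AZIII"; the other three checks succeed only when the full literal marker is present.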
Re: File Extraction - Cont...
by jdporter (Paladin) on Sep 26, 2007 at 05:02 UTC

    This ought to work:

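    use open IO => ':raw';
    use IO::File;

    undef $/;    # slurp mode: read each input file in one go
    while (<>) {
        my $num;
        # Each match is one "AZII*" marker plus everything up to the next
        # marker or the end of the data. \z is used rather than $ so that a
        # newline inside the image data cannot end a record early.
        ( IO::File->new( join( '.', $ARGV, ++$num ), 'w' )
              or die "Can't write - $!\n" )->print($1)
            while /(AZII\*.*?)(?=\z|AZII\*)/sg;
    }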
    A word spoken in Mind will reach its own level, in the objective world, by its own weight