in reply to Text Extraction

karavay: I need to extract large chunks of text from files larger than 20 MB each ...
What would you suggest using - regex or external module (which one) ...


20 MB is small nowadays. Last week I extracted data with a Perl regex from a
~100 MB file, which took less than 2 seconds on an old 3.4 GHz Athlon.

What are your constraints? Is this done 'on the fly'? Does speed matter?

If so, I'd write a small (few lines) Inline::C wrapper around the C strstr()
function and look up "AZII"... (This pays off if your library's implementation
does DWORD- or QWORD-aligned accesses and reads a machine word at a time.)

But how to do that depends *strongly* on the context. What is to be
done with the found text then? Extract it? Or only find it and report?
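
For the "find only and tell" case, here is a minimal sketch of the marker lookup. It is not the Inline::C wrapper itself: it swaps in Perl's built-in index(), which also does its searching in C. One thing to keep in mind is that C's strstr() works on NUL-terminated strings and stops at the first NUL byte, while index() is length-aware, which matters for binary data. The helper name find_markers and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every occurrence of
# $marker in $data. index() is a stand-in here for a C strstr() wrapper.
sub find_markers {
    my ($data, $marker) = @_;
    my @offsets;
    my $pos = 0;
    while ((my $hit = index($data, $marker, $pos)) >= 0) {
        push @offsets, $hit;
        $pos = $hit + length $marker;   # continue after this marker
    }
    return @offsets;
}

# Made-up sample; real image data contains arbitrary bytes, NULs included.
my $sample = "AZII\x{00}img1\x{00}AZIIimg2AZII";
my @found  = find_markers($sample, 'AZII');
print "marker at byte $_\n" for @found;
```

With the offsets in hand, substr() could cut out whatever lies between two markers, if extraction turns out to be what is needed.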

Can you provide one small, minimal but exact sample of the text in question?

Regards
mwa

Re^2: Text Extraction
by karavay (Beadle) on Sep 25, 2007 at 18:16 UTC
    The file is a container of TIFF images. Each image's location is marked by AZII (the image border), so what I need to do is extract multiple images from one file:
    AZII
    ..
    ..(image 1)
    ..
    ..
    AZII
    ..
    ..(image 2)
    ..
    ..
    AZII -> etc.
    ..

    Speed is not important, as long as the extraction is not too slow :)
    Thanks
      If you can define a "record separator" between the images,
      then this should be easy. Let Perl find the images; you only
      tell it where the images are delimited:
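          my $fn = 'file.dat';
          $/ = "AZII\n";    # image separator; adjust to the exact bytes in the file
          open my $fh, '<', $fn or die "cant stand smell of $!";
          binmode $fh;      # needed on Windows for binary data
          my @images = <$fh>;
          close $fh;

          my $num = 1110;
          for my $img (@images) {
              # each record still ends with the separator
              open $fh, '>', ++$num . '.tiff' or die "can't dump image! $!";
              binmode $fh;
              print $fh $img;
              close $fh;
          }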
      The $/ sets the "image separator". Please check which characters
      are *exactly* in the file: are there any line separators? Is this Unix/Linux?
      On Windows, for example, you have to make sure to open the files in binmode ...
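
To make the behaviour of $/ and binmode concrete, here is a small sketch that runs on an in-memory stand-in for the container instead of a real file. The helper name read_records and the sample bytes are invented for illustration. It also assumes the marker is a pure delimiter that may be dropped, which the thread does not confirm; if the marker bytes belong to the image, leave the chomp out.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything behind $fh as records delimited
# by $sep. ASSUMPTION: the separator is not part of the image data.
sub read_records {
    my ($fh, $sep) = @_;
    binmode $fh;                 # no CRLF translation on Windows
    local $/ = $sep;             # keep the change local to this sub
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;              # chomp removes the current value of $/
        push @records, $rec if length $rec;   # drop the empty leading record
    }
    return @records;
}

# In-memory stand-in for the container file (made-up bytes).
my $container = "AZII\x{0D}\x{0A}one\x{00}AZIItwo\x{01}AZIIthree";
open my $in, '<', \$container or die "cannot open in-memory file: $!";
my @images = read_records($in, 'AZII');
close $in;
printf "%d records\n", scalar @images;
```

Note that the CR/LF pair inside the first record comes through untouched. That is the point of binmode: without it, Perl on Windows would translate CRLF to LF while reading and corrupt the image.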

      Another variant would be not to save the records in an array (which is unnecessary).
      Like:
          ...
          while( my $img = <$fh> ) { # read one record
              # [update $fh => $ih]
              open my $ih, '>', ++$num .'.tiff' or die "can't dump image! $!";
              print $ih $img
          }
          ...
      I'll leave that one as an exercise for you ...
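
For reference, one possible way the streaming variant could be fleshed out is sketched below. This is only a sketch under assumptions: the helper name extract_images, the numbering from 1, and the throwaway container built in a temporary directory are all invented here, and the marker is again assumed to be a pure delimiter that can be stripped. The ':raw' layer does the same job as calling binmode.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: stream records from $src and write each one to
# $dir/<n>.tiff, holding only one record in memory at a time.
sub extract_images {
    my ($src, $dir, $sep) = @_;
    open my $fh, '<:raw', $src or die "cannot read $src: $!";
    local $/ = $sep;
    my $num = 0;
    my @written;
    while (my $img = <$fh>) {
        chomp $img;                  # ASSUMPTION: separator is not image data
        next unless length $img;     # skip the empty chunk before the first marker
        my $out = File::Spec->catfile($dir, ++$num . '.tiff');
        open my $ih, '>:raw', $out or die "cannot write $out: $!";
        print {$ih} $img;
        close $ih or die "cannot close $out: $!";
        push @written, $out;
    }
    close $fh;
    return @written;
}

# Demo on a throwaway container with made-up contents.
my $dir = tempdir(CLEANUP => 1);
my $src = File::Spec->catfile($dir, 'container.dat');
open my $mk, '>:raw', $src or die "cannot create $src: $!";
print {$mk} "AZIIfirst\x{00}imageAZIIsecond image";
close $mk or die "cannot close $src: $!";

my @files = extract_images($src, $dir, 'AZII');
print "wrote $_\n" for @files;
```

Checking close on the output handle is a cheap way to notice a full disk before the next image is started.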

      Regards
      mwa

      (updated to correct stupid copy/paste error in second code block).
        Thanks a lot for the tips - I'll take over from here :)