karavay has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract large chunks of text from files larger than 20 MB each:

e.g.

AZII*
.
.(text I need to extract, approx. 20 KB)
.
AZII*

What would you suggest using - a regex or an external module (and if so, which one)? What would be the most efficient solution, as performance is an issue here?
Thanks,

Re: Text Extraction
by johngg (Canon) on Sep 25, 2007 at 17:59 UTC
    You might be able to do this as a one-liner using a flip-flop.

    $ cat xxxx
    adasasd
    AZII*
    dasdasd
    eregrg
    rtghyth
    ujyyujyu
    ghnggh
    AZII*
    rtrthyyt
    juhgj
    AZII*
    fddfgd
    hgftyhyj
    iukuikuik
    AZII*
    gytty
    ioiukuy
    $ perl -ne 'print if /^AZII\*/ ... /^AZII\*/;' xxxx > yyyy
    $ cat yyyy
    AZII*
    dasdasd
    eregrg
    rtghyth
    ujyyujyu
    ghnggh
    AZII*
    AZII*
    fddfgd
    hgftyhyj
    iukuikuik
    AZII*
    $

    I hope this is of use.

    Cheers,

    JohnGG

Re: Text Extraction
by kyle (Abbot) on Sep 25, 2007 at 17:54 UTC

    What have you tried so far?

    Off the top of my head, I think you could:

    1. Slurp the whole file and regex (or index) through it.
    2. Read a little at a time and use /AZII\*/ .. /AZII\*/ to find the part you're looking for.
    3. Set $/='AZII*' (see perlvar) and go from there.

    If you really want to be sure you have the fastest, implement it a few different ways and use Benchmark to choose.
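
    For illustration only - this is my own sketch, not code from the thread, and the file name and marker layout are assumptions taken from the question - approaches 1 and 3 could be timed against each other with Benchmark's cmpthese():

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my $file = 'big.dat';                      # hypothetical input file

    cmpthese( -3, {
        # approach 1: slurp the whole file, split on the marker line
        slurp_split => sub {
            open my $fh, '<', $file or die $!;
            local $/;                          # slurp mode
            my $data   = <$fh>;
            my @chunks = split /^AZII\*\n/m, $data;
        },
        # approach 3: let the marker line act as the input record separator
        record_sep => sub {
            local $/ = "AZII*\n";
            open my $fh, '<', $file or die $!;
            my @chunks = <$fh>;
        },
    } );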

Re: Text Extraction
by mwah (Hermit) on Sep 25, 2007 at 17:56 UTC
    karavay  I need to extract large chunks of text from files larger than 20mb each ...
    What would you suggest using - regex or external module (which one) ...


    20 MB is small nowadays. Last week I extracted stuff (w/Perl) from a
    ~100 MB file via regex, which took less than 2 seconds (old 3.4 GHz Athlon).

    What are your limits? Is this 'on the fly'? Is speed important?

    If so, I'd write a small (few lines) Inline::C wrapper around the C strstr()
    function and look up "AZII"... (if your library's implementation does
    DWORD- or QWORD-aligned accesses and reads machine words at a time).
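
    Just to illustrate the idea - a rough sketch of my own, untested, with the
    function name find_marker and the file name made up for the example - one
    could wrap strstr() via Inline::C and ask for the byte offset of the marker
    in a slurped buffer:

    use strict;
    use warnings;
    use Inline C => q{
        #include <string.h>

        /* Return the byte offset of needle in haystack, or -1 if not found.
           Caveat: strstr() stops at the first NUL byte, so this suits
           text-like data rather than arbitrary binary buffers. */
        int find_marker( char *haystack, char *needle ) {
            char *p = strstr( haystack, needle );
            return p ? (int)( p - haystack ) : -1;
        }
    };

    my $buf = do {
        open my $fh, '<', 'file.dat' or die $!;    # hypothetical file name
        local $/;                                  # slurp the whole file
        <$fh>;
    };
    printf "first marker at byte %d\n", find_marker( $buf, 'AZII' );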

    But how to do that depends *strongly* on the context. What's to be
    done with the found text then? Extract it? Just find it and report?

    Can you provide one small, minimal but exact sample of the text in question?

    Regards
    mwa
      The file is a container of TIFF images - each image's location is defined by AZII (image border) - so what I need to do is extract multiple images from one file:
      AZII
      ..
      ..(image 1)
      ..
      ..
      AZII
      ..
      ..(image 2)
      ..
      ..
      AZII -> etc.
      ..

      Speed is not important if the extraction process is not tooooo slow :)
      Thanks
        If you can define a "record separator" between the images,
        then this should be easy. Let Perl find the images; you only
        have to say where they are delimited:
        my $fn = 'file.dat';
        $/ = "AZII\n";                  # record separator: the image border marker

        open my $fh, '<', $fn or die "cant stand smell of $!";
        my @images = <$fh>;
        close $fh;

        my $num = 1110;
        for my $img ( @images ) {
            open $fh, '>', ++$num . '.tiff'
                or die "can't dump image! $!";
            print $fh $img;
        }
        The $/ sets the "image separator"; please check which characters
        are *exactly* in the file - are there any line separators? Is this Unix/Linux?
        On Win, for example, you have to make sure to open the files in binmode ...
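
        A hedged illustration of that binmode point (the file and variable
        names here are just placeholders, not taken from the snippet above):

        open my $in, '<', 'file.dat' or die $!;
        binmode $in;                      # no CRLF translation of the binary TIFF data
        open my $out, '>', 'image.tiff' or die $!;
        binmode $out;                     # same for each output file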

        Another variant would be not to save the records in an array (which is unnecessary).
        Like:
        ...
        while ( my $img = <$fh> ) {     # read one record   [update: $fh => $ih]
            open my $ih, '>', ++$num . '.tiff'
                or die "can't dump image! $!";
            print $ih $img;
        }
        ...
        I'll leave that one to your own exercise ...

        Regards
        mwa

        (updated to correct stupid copy/paste error in second code block).