karavay has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract large chunks of text from files larger than 20 MB each:

e.g.

AZII*
.
.(text I need to extract, approx. 20 KB)
.
AZII*

What would you suggest using - a regex or an external module (and if so, which one)? What would be the most efficient solution, as performance is an issue here?
Thanks,

Re: Text Extraction
by johngg (Canon) on Sep 25, 2007 at 17:59 UTC
    You might be able to do this as a one-liner using a flip-flop.

    $ cat xxxx
    adasasd
    AZII*
    dasdasd
    eregrg
    rtghyth
    ujyyujyu
    ghnggh
    AZII*
    rtrthyyt
    juhgj
    AZII*
    fddfgd
    hgftyhyj
    iukuikuik
    AZII*
    gytty
    ioiukuy
    $ perl -ne 'print if /^AZII\*/ ... /^AZII\*/;' xxxx > yyyy
    $ cat yyyy
    AZII*
    dasdasd
    eregrg
    rtghyth
    ujyyujyu
    ghnggh
    AZII*
    AZII*
    fddfgd
    hgftyhyj
    iukuikuik
    AZII*
    $

    I hope this is of use.

    Cheers,

    JohnGG

Re: Text Extraction
by kyle (Abbot) on Sep 25, 2007 at 17:54 UTC

    What have you tried so far?

    Off the top of my head, I think you could:

    1. Slurp the whole file and regex (or index) through it.
    2. Read a little at a time and use /AZII\*/ .. /AZII\*/ to find the part you're looking for.
    3. Set $/='AZII*' (see perlvar) and go from there.

    If you really want to be sure you have the fastest, implement it a few different ways and use Benchmark to choose.
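
    For illustration only - this is my own sketch, not code from the thread, and the file name and marker layout are assumptions taken from the question - approaches 1 and 3 could be timed against each other with Benchmark's cmpthese():

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my $file = 'big.dat';                      # hypothetical input file

    cmpthese( -3, {
        # approach 1: slurp the whole file, split on the marker line
        slurp_split => sub {
            open my $fh, '<', $file or die $!;
            local $/;                          # slurp mode
            my $data   = <$fh>;
            my @chunks = split /^AZII\*\n/m, $data;
        },
        # approach 3: let the marker line act as the input record separator
        record_sep => sub {
            local $/ = "AZII*\n";
            open my $fh, '<', $file or die $!;
            my @chunks = <$fh>;
        },
    } );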

Re: Text Extraction
by mwah (Hermit) on Sep 25, 2007 at 17:56 UTC
    karavay  I need to extract large chunks of text from files larger than 20mb each ...
    What would you suggest using - regex or external module (which one) ...


    20 MB is small nowadays. Last week I extracted stuff (w/Perl) from a
    ~100 MB file via regex, which took less than 2 seconds (old 3.4 GHz Athlon).

    What are your limits? Is this 'on the fly'? Is speed important?

    If so, I'd write a small (few lines) Inline::C wrapper around the C strstr()
    function and look up "AZII"... (if your library's implementation does
    DWORD- or QWORD-aligned accesses and reads machine words at a time).
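
    Just to illustrate the idea - a rough sketch of my own, untested, with the
    function name find_marker and the file name made up for the example - one
    could wrap strstr() via Inline::C and ask for the byte offset of the marker
    in a slurped buffer:

    use strict;
    use warnings;
    use Inline C => q{
        #include <string.h>

        /* Return the byte offset of needle in haystack, or -1 if not found.
           Caveat: strstr() stops at the first NUL byte, so this suits
           text-like data rather than arbitrary binary buffers. */
        int find_marker( char *haystack, char *needle ) {
            char *p = strstr( haystack, needle );
            return p ? (int)( p - haystack ) : -1;
        }
    };

    my $buf = do {
        open my $fh, '<', 'file.dat' or die $!;    # hypothetical file name
        local $/;                                  # slurp the whole file
        <$fh>;
    };
    printf "first marker at byte %d\n", find_marker( $buf, 'AZII' );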

    But how to do that depends *strongly* on the context. What's to be
    done with the found text then? Extract it? Just find it and report?

    Can you provide one small, minimal but exact sample of the text in question?

    Regards
    mwa
      The file is a container of TIFF images - each image's location is defined by AZII (image border) - so what I need to do is extract multiple images from one file:
      AZII
      ..
      ..(image 1)
      ..
      ..
      AZII
      ..
      ..(image 2)
      ..
      ..
      AZII -> etc.
      ..

      Speed is not important if the extraction process is not tooooo slow :)
      Thanks
        If you can define a "record separator" between the images,
        then this should be easy. Let Perl find the images; you only
        have to say where they are delimited:
        my $fn = 'file.dat';
        $/ = "AZII\n";                  # record separator: the image border marker

        open my $fh, '<', $fn or die "cant stand smell of $!";
        my @images = <$fh>;
        close $fh;

        my $num = 1110;
        for my $img ( @images ) {
            open $fh, '>', ++$num . '.tiff'
                or die "can't dump image! $!";
            print $fh $img;
        }
        The $/ sets the "image separator"; please check which characters
        are *exactly* in the file - are there any line separators? Is this Unix/Linux?
        On Win, for example, you have to make sure to open the files in binmode ...
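
        A hedged illustration of that binmode point (the file and variable
        names here are just placeholders, not taken from the snippet above):

        open my $in, '<', 'file.dat' or die $!;
        binmode $in;                      # no CRLF translation of the binary TIFF data
        open my $out, '>', 'image.tiff' or die $!;
        binmode $out;                     # same for each output file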

        Another variant would be not to save the records in an array (which is unnecessary).
        Like:
        ...
        while ( my $img = <$fh> ) {     # read one record   [update: $fh => $ih]
            open my $ih, '>', ++$num . '.tiff'
                or die "can't dump image! $!";
            print $ih $img;
        }
        ...
        I'll leave that one to your own exercise ...

        Regards
        mwa

        (updated to correct stupid copy/paste error in second code block).