Re: Text Extraction

Maybe the following. As jdporter did, I have put your sample data in a file named 782426.pl.

The first record in your sample data appears to be incomplete. I have discarded it. Similarly, the last record appears to be an exception and I have discarded that also.

use strict;
use warnings;


my $file = '782426.pl';
open(my $fh, '<', $file) or die "$file: $!";
my @records = do { local $/ = "\032"; <$fh> };
close($fh);


# Discard first and last records
shift(@records);
pop(@records);

foreach (@records) {
        chop;   # remove trailing \032 (record separator)
        s/^[^\.]*\.+\s*//gm;
        # Do what you want with the record here
        print "\n****\n$_\n";
}

update: removed useless substitution (s/^$//gm) from loop.
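
For anyone wondering how the substitution in the loop works, here is a minimal sketch of the same pattern spelled out with the /x modifier. The sample record and its labels are invented for illustration, and the helper name strip_labels is hypothetical; only the pattern itself comes from the script above.

```perl
use strict;
use warnings;

# Hypothetical record: the labels and layout are invented for illustration.
my $record = join "\n",
    'STOCK NO.... G0001',
    'YEAR........ 2010',
    'MAKE........ GMC';

# Same pattern as s/^[^\.]*\.+\s*//gm, written out piece by piece.
sub strip_labels {
    my ($text) = @_;
    $text =~ s{
        ^        # with the m modifier: start of every line, not only of the string
        [^.]*    # the label: any run of characters that are not a dot
        \.+      # the dot leader: one or more literal dots
        \s*      # whitespace between the dots and the value
    }{}gmx;      # g: repeat for every line; the match is replaced with nothing
    return $text;
}

print strip_labels($record), "\n";
```

One thing to keep in mind: both `[^.]*` and `\s*` can match newlines, so a line with no dots at all, or a label with an empty value, will be merged with the line that follows it.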

Re^2: Text Extraction by sonicscott9041 (Novice) on Jul 23, 2009 at 01:02 UTC
Just for clarity: The lone '1' at the top of the data is the page number for the first page. The part at the very bottom of the file showing the number of records, and the garbage below that, is just that... garbage.

I would like to learn how this regex works. This especially: `s/^[^\.]*\.+\s*//gm;`

NOTE: The data file changed because someone changed the report for their use. I have changed the script that runs the report, so that it creates the report at run time (as opposed to running a 'canned' report)! Sorry for the confusion. Here is the data now: Read more... (23 kB)
Re^3: Text Extraction by ig (Vicar) on Jul 23, 2009 at 01:48 UTC
In a private message, sonicscott9041 said he needs to produce a CSV file and pointed out that there are multiple sets of data on a page. After reviewing the data a little more attentively, I see obvious sets of records separated by blank lines, with page breaks interrupting these.

Here is a simple approach to producing CSV output. It is based on the original report, but the matches for the start and end values of each record set can easily be changed to accommodate the new report.

use strict;
use warnings;

my $file = '782426.pl';
open(my $fh, '<', $file) or die "$file: $!";
my $csv;
foreach my $line (<$fh>) {
        chomp($line);
        # Only lines of the form "LABEL.... value" are of interest
        next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
        if($1 eq 'STOCK NO') {
                $csv = $2;      # first field: start a new record
        }
        $csv .= ",$2";
        print "$csv\n" if($1 eq 'SALES CST');   # last field: emit the record
}
close($fh);
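
To see what the match in that loop captures, here is a small sketch that runs the same pattern against a few in-memory lines. The sample lines and the helper name parse_line are made up for illustration; the real report has more fields and may differ in layout.

```perl
use strict;
use warnings;

# Split one report line into (label, value). Returns an empty list for
# lines that do not look like "LABEL.... value".
sub parse_line {
    my ($line) = @_;
    return $line =~ m/^([^.]+)\.+\s+(.*)/ ? ($1, $2) : ();
}

# Hypothetical sample lines, including two that should be skipped.
my @lines = (
    'STOCK NO.... G0001',
    '',
    'Page 1',
    'SALES CST... 30100.00',
);

for my $line (@lines) {
    # A list assignment in scalar context yields the number of values
    # returned, so lines that do not match are skipped.
    my ($label, $value) = parse_line($line) or next;
    print "label=[$label] value=[$value]\n";
}
```

Here $1 is everything before the first dot, and $2 is whatever follows the dots and the whitespace after them. Blank lines and page headers contain no dot leader, so they never match.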
Re^4: Text Extraction by sonicscott9041 (Novice) on Jul 23, 2009 at 02:02 UTC
Thanks ig. I never expected to go from raw data to a .csv file in so few lines of code! Using the new data file, and changing the print condition to eq 'LST PRICE', here is a small sampling of the output:

G0001,G0001,2010,GMC,ACADIA,/,33410.00
G0002,G0002,2010,GMC,ACADIA,/,32615.00
G0003,G0003,2010,GMC,ACADIA,/,33010.00
G0004,G0004,2010,GMC,ACADIA,/,32615.00
G0005,G0005,2010,GMC,ACADIA,/,33410.00

Haven't pinned down why the STOCK NO is showing up twice. Just wanted to update you.

UPDATE: Now have it removing the first STOCK NO in the final output and printing a quoted .csv file:

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'gwfnvi.txt';
open(my $fh, '<', $file) or die "$file: $!";
my $csv;
foreach my $line (<$fh>) {
        chomp($line);
        next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
        if($1 eq 'STOCK NO') {
                $csv = $3;      # $3 is never set by this match, so this resets $csv to undef
        }
        $csv .= ",\"$2\"";
        my $str2 = substr($csv, 1);     # drop the leading comma
        print "$str2\n" if($1 eq 'LST PRICE');
}
close($fh);

Thanks again!
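
On the doubled STOCK NO: in the script in the parent post, the `if` block assigns `$2` to `$csv`, and the `.=` on the next statement then appends `,$2` again for that same line. Below is a rough sketch of one way to add each value exactly once and quote the fields. The helper names csv_field and report_to_csv are hypothetical, and the sample labels other than STOCK NO and LST PRICE are invented.

```perl
use strict;
use warnings;

# Quote one CSV field, doubling any embedded double quotes.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Turn report lines into CSV rows. $first and $last are the labels that
# open and close a record (assumed here; adjust to the actual report).
sub report_to_csv {
    my ($first, $last, @lines) = @_;
    my (@rows, @fields);
    for my $line (@lines) {
        next unless $line =~ m/^([^.]+)\.+\s+(.*)/;
        my ($label, $value) = ($1, $2);
        @fields = () if $label eq $first;   # start a new record
        push @fields, csv_field($value);    # add each value exactly once
        push @rows, join(',', @fields) if $label eq $last;
    }
    return @rows;
}

# Hypothetical input lines.
my @sample = (
    'STOCK NO.... G0001',
    'YEAR........ 2010',
    'MAKE........ GMC',
    'LST PRICE... 33410.00',
);
print "$_\n" for report_to_csv('STOCK NO', 'LST PRICE', @sample);
```

Collecting the values in an array and joining them at the end avoids both the duplicate first field and the need to strip a leading comma with substr.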
Re^5: Text Extraction by ig (Vicar) on Jul 23, 2009 at 02:28 UTC
Re^6: Text Extraction by sonicscott9041 (Novice) on Jul 23, 2009 at 02:43 UTC
Some notes below your chosen depth have not been shown here