Text Extraction

sonicscott9041 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Text Extraction by Limbic~Region (Chancellor) on Jul 22, 2009 at 16:25 UTC
sonicscott9041, It sounds like your requirements are simple: Ignore any line that begins with a whitespace or control character Ignore blank lines Remove everything up to and including the first space on remaining lines `#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input_file>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; while (<$fh>) { next if /^(?:\s\|[:cntrl:])/ \|\| /^$/; s/^\S+\s//; print; }` [download] If that's not what you want, you will need to do a better job of describing your requirements. The code above is untested. Cheers - L~R	[reply] [d/l]
Re^2: Text Extraction by sonicscott9041 (Novice) on Jul 22, 2009 at 16:32 UTC
Thank you L~R. I will try this. I guess I was over-thinking it.	[reply]
Re: Text Extraction by jdporter (Paladin) on Jul 22, 2009 at 17:03 UTC
Presumably, that `^Z` is actually a Control-Z character, i.e. ASCII character 26, which was used in DOS (and in CP/M) as the end-of-file marker. It is no longer necessary but is ignored by Windows and Perl (unless you have turned on binmode for the stream). The following is a succinct solution which is specific to your input data format. If the format changes, you'd probably have to tweak this somewhat. I show how to extract just the values in each record, as you asked; but I also show how to extract the values along with the field names, in case that is useful. (I would think it would be, ordinarily.) use strict; use warnings; $/ = ''; # paragraph mode while (<DATA>) { next if /^\s/; # the header chomp; my @just_the_values = /^[^.]\. {1,2}(.)/mg; my %keys_and_values = /^([^.])\.* {1,2}(.)/mg; print "'$_'\n" for @just_the_values; print "$_ = '$keys_and_values{$_}'\n" for sort keys %keys_and_valu +es; print "----------------\n"; } __DATA__ NEW VEHICLE INVENTORY prepared by ANYUSER 04:15:00pm 21 Jul 2009 - PAGE # 2 STOCK NO........... G0034203 YR................. 10 CARLINE............ ACADIA SERIAL#............ 1GKLRKEDXAJ102450 COLOR DESCRIPTIONS. / LST PRICE.......... 36010.00 SALES CST.......... 36010.00 DAY................ 20 SC................. 1 STOCK NO........... G0034204 YR................. 10 CARLINE............ ACADIA SERIAL#............ 1GKLRKED1AJ101543 COLOR DESCRIPTIONS. / LST PRICE.......... 33615.00 SALES CST.......... 33615.00 DAY................ 20 SC................. 1 [download] Between the mind* which plans and the hands which build, there must be a mediator... and this mediator must be the heart.	[reply] [d/l] [select]
Re^2: Text Extraction by sonicscott9041 (Novice) on Jul 22, 2009 at 20:22 UTC
Here is the entire data file: / snip / See updated data below.	[reply]
Re^3: Text Extraction by jdporter (Paladin) on Jul 22, 2009 at 21:15 UTC
I've stored your data in a file called `782426.txt`, because I couldn't get `DATA` to work with embedded Ctrl-Z characters. Now, in the below solution, I read the data a line at a time, building up a record until I encounter the line which I know, a priori, to be the last one in each record. I could have tried to exploit the blank line which occurs between records, but in dealing with the header, I rather crudely blow away all blank lines. use strict; use warnings; open F, '<', '782426.txt' or die; $/="\r\n"; binmode F; my $last_key; my $record = {}; my @records; while (<F>) { s/^//; # kill that pesky thing. /^\s+NEW CAR INV prepared by / and scalar(<F>), next; # the header /^[\.\s]$/ and next; # skip any blank lines /^\d+ records listed\./ and last; # end of report chomp; my($key,$val) = /^(.{19}) +(.)/; if ($key=~/\S/) { $key =~ s/\.+$//; # kill trailing dots $record->{ $last_key=$key } = $val; } elsif(defined $last_key) { $record->{$last_key} .= $val; } if ( $key eq 'SALES CST' ) # last line of record { push @records, $record; $record = {}; } } close F; for my $record ( @records ) { print "$_='$record->{$_}'\n" for sort keys %$record; print "\n"; } [download] I didn't bother trying to deal with that line at the top of your data which contains a lone '1' character. Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.	[reply] [d/l] [select]
Re: Text Extraction by ig (Vicar) on Jul 22, 2009 at 22:51 UTC
Maybe the following. As jdporter did, I have put your sample data in a file named 782426.pl. The first record in your sample data appears to be incomplete. I have discarded it. Similarly, the last record appears to be an exception and I have discarded that also. `use strict; use warnings; my $file = '782426.pl'; open(my $fh, '<', $file) or die "$file: $!"; my @records = do { local $/ = "\032"; <$fh> }; close($fh); # Discard first and last records shift(@records); pop(@records); foreach (@records) { chop; # remove trailing \032 (record separator) s/^[^\.]\.+\s//gm; # Do what you want with the record here print "\n**\n$_\n"; }` [download] update:** removed useless substitution (s/^$//gm) from loop.	[reply] [d/l]
Re^2: Text Extraction by sonicscott9041 (Novice) on Jul 23, 2009 at 01:02 UTC
Just for clairity: The lone '1' at the top of the data is the page number for the first page. The part at the very bottom of the file showing the number of records and the garbage below that, is just that.... garbage. I would like to learn how this regex works. This especially: `s/^[^\.]\.+\s//gm;` NOTE: The data file changed because someone changed the report for their use. I have created a changed the script that runs the report, so that it creates the report at run time (as opposed to running a 'canned' report)! Sorry for the confusion. Here is the data now: Read more... (23 kB)	[reply] [d/l] [select]
Re^3: Text Extraction by ig (Vicar) on Jul 23, 2009 at 01:48 UTC
In a private message, sonicscott9041 said he needs to produce a CSV file and pointed out that there are multiple sets of data on a page. After reviewing the data a little more attentively, there are obvious sets of records separated by blank lines, with page brakes interrupting these. Here is a simple approach to producing CSV output. It is based on the original report but the matches for start and end values for each record set can easily be changed to accommodate the new report. `use strict; use warnings; my $file = '782426.pl'; open(my $fh, '<', $file) or die "$file: $!"; my $csv; foreach my $line (<$fh>) { chomp($line); next unless($line =~ m/^([^\.]+)\.+\s+(.*)/); if($1 eq 'STOCK NO') { $csv = $2; } $csv .= ",$2"; print "$csv\n" if($1 eq 'SALES CST'); } close($fh);` [download]	[reply] [d/l]
Re^4: Text Extraction by sonicscott9041 (Novice) on Jul 23, 2009 at 02:02 UTC
Re^5: Text Extraction by ig (Vicar) on Jul 23, 2009 at 02:28 UTC
Some notes below your chosen depth have not been shown here