Re: Text Extraction
by Limbic~Region (Chancellor) on Jul 22, 2009 at 16:25 UTC
|
sonicscott9041,
It sounds like your requirements are simple:
- Ignore any line that begins with a whitespace or control character
- Ignore blank lines
- Remove everything up to and including the first space on remaining lines
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0] or die "Usage: $0 <input_file>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $
+!";
while (<$fh>) {
next if /^(?:\s|[:cntrl:])/ || /^$/;
s/^\S+\s//;
print;
}
If that's not what you want, you will need to do a better job of describing your requirements. The code above is untested.
| [reply] [d/l] |
|
|
Thank you L~R. I will try this. I guess I was over-thinking it.
| [reply] |
Re: Text Extraction
by jdporter (Paladin) on Jul 22, 2009 at 17:03 UTC
|
Presumably, that ^Z is actually a Control-Z character, i.e. ASCII character 26, which was used in DOS (and in CP/M) as the end-of-file marker. It is no longer necessary but is ignored by Windows and Perl (unless you have turned on binmode for the stream).
The following is a succinct solution which is specific to your input data format.
If the format changes, you'd probably have to tweak this somewhat.
I show how to extract just the values in each record, as you asked; but I also show how to extract the values along with the field names, in case that is useful. (I would think it would be, ordinarily.)
use strict;
use warnings;
$/ = ''; # paragraph mode
while (<DATA>)
{
next if /^\s/; # the header
chomp;
my @just_the_values = /^[^.]*\.* {1,2}(.*)/mg;
my %keys_and_values = /^([^.]*)\.* {1,2}(.*)/mg;
print "'$_'\n" for @just_the_values;
print "$_ = '$keys_and_values{$_}'\n" for sort keys %keys_and_valu
+es;
print "----------------\n";
}
__DATA__
NEW VEHICLE INVENTORY prepared by ANYUSER
04:15:00pm 21 Jul 2009 - PAGE # 2
STOCK NO........... G0034203
YR................. 10
CARLINE............ ACADIA
SERIAL#............ 1GKLRKEDXAJ102450
COLOR DESCRIPTIONS. /
LST PRICE.......... 36010.00
SALES CST.......... 36010.00
DAY................ 20
SC................. 1
STOCK NO........... G0034204
YR................. 10
CARLINE............ ACADIA
SERIAL#............ 1GKLRKED1AJ101543
COLOR DESCRIPTIONS. /
LST PRICE.......... 33615.00
SALES CST.......... 33615.00
DAY................ 20
SC................. 1
Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
| [reply] [d/l] [select] |
|
|
Here is the entire data file:
/ snip /
See updated data below.
| [reply] |
|
|
I've stored your data in a file called 782426.txt, because I couldn't get DATA to work with embedded Ctrl-Z characters.
Now, in the below solution, I read the data a line at a time, building up a record until I encounter the line which I know, a priori, to be the last one in each record. I could have tried to exploit the blank line which occurs between records, but in dealing with the header, I rather crudely blow away all blank lines.
use strict;
use warnings;
open F, '<', '782426.txt' or die;
$/="\r\n";
binmode F;
my $last_key;
my $record = {};
my @records;
while (<F>)
{
s/^//; # kill that pesky thing.
/^\s+NEW CAR INV prepared by / and scalar(<F>), next; # the header
/^[\.\s]*$/ and next; # skip any blank lines
/^\d+ records listed\./ and last; # end of report
chomp;
my($key,$val) = /^(.{19}) +(.*)/;
if ($key=~/\S/)
{
$key =~ s/\.+$//; # kill trailing dots
$record->{ $last_key=$key } = $val;
}
elsif(defined $last_key)
{
$record->{$last_key} .= $val;
}
if ( $key eq 'SALES CST' ) # last line of record
{
push @records, $record;
$record = {};
}
}
close F;
for my $record ( @records )
{
print "$_='$record->{$_}'\n" for sort keys %$record;
print "\n";
}
I didn't bother trying to deal with that line at the top of your data which contains a lone '1' character.
Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
| [reply] [d/l] [select] |
Re: Text Extraction
by ig (Vicar) on Jul 22, 2009 at 22:51 UTC
|
Maybe the following. As jdporter did, I have put your sample data in a file named 782426.pl.
The first record in your sample data appears to be incomplete. I have discarded it. Similarly, the last record appears to be an exception and I have discarded that also.
use strict;
use warnings;
my $file = '782426.pl';
open(my $fh, '<', $file) or die "$file: $!";
my @records = do { local $/ = "\032"; <$fh> };
close($fh);
# Discard first and last records
shift(@records);
pop(@records);
foreach (@records) {
chop; # remove trailing \032 (record separator)
s/^[^\.]*\.+\s*//gm;
# Do what you want with the record here
print "\n****\n$_\n";
}
update: removed useless substitution (s/^$//gm) from loop. | [reply] [d/l] |
|
|
Just for clairity:
The lone '1' at the top of the data is the page number for the first page.
The part at the very bottom of the file showing the number of records and the garbage below that, is just that.... garbage.
I would like to learn how this regex works. This especially: s/^[^\.]*\.+\s*//gm;
NOTE: The data file changed because someone changed the report for their use. I have created a changed the script that runs the report, so that it creates the report at run time (as opposed to running a 'canned' report)! Sorry for the confusion.
Here is the data now:
| [reply] [d/l] [select] |
|
|
In a private message, sonicscott9041 said he needs to produce a CSV file and pointed out that there are multiple sets of data on a page.
After reviewing the data a little more attentively, there are obvious sets of records separated by blank lines, with page brakes interrupting these.
Here is a simple approach to producing CSV output. It is based on the original report but the matches for start and end values for each record set can easily be changed to accommodate the new report.
use strict;
use warnings;
my $file = '782426.pl';
open(my $fh, '<', $file) or die "$file: $!";
my $csv;
foreach my $line (<$fh>) {
chomp($line);
next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
if($1 eq 'STOCK NO') {
$csv = $2;
}
$csv .= ",$2";
print "$csv\n" if($1 eq 'SALES CST');
}
close($fh);
| [reply] [d/l] |
|
|
|
|
|