in reply to Re^2: Text Extraction
in thread Text Extraction

In a private message, sonicscott9041 said he needs to produce a CSV file and pointed out that there are multiple sets of data on a page.

After reviewing the data a little more attentively, I can see obvious sets of records separated by blank lines, with page breaks interrupting them.

Here is a simple approach to producing CSV output. It is based on the original report but the matches for start and end values for each record set can easily be changed to accommodate the new report.

use strict;
use warnings;

my $file = '782426.pl';
open(my $fh, '<', $file) or die "$file: $!";
my $csv;
foreach my $line (<$fh>) {
    chomp($line);
    next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
    if($1 eq 'STOCK NO') { $csv = $2; }
    $csv .= ",$2";
    print "$csv\n" if($1 eq 'SALES CST');
}
close($fh);

Re^4: Text Extraction
by sonicscott9041 (Novice) on Jul 23, 2009 at 02:02 UTC
    Thanks ig. I never expected to go from raw data to a .csv file in so few lines of code! Using the new data file, and changing the print condition to eq 'LST PRICE', here is a small sampling of the output:

    G0001,G0001,2010,GMC,ACADIA,/,33410.00
    G0002,G0002,2010,GMC,ACADIA,/,32615.00
    G0003,G0003,2010,GMC,ACADIA,/,33010.00
    G0004,G0004,2010,GMC,ACADIA,/,32615.00
    G0005,G0005,2010,GMC,ACADIA,/,33410.00

    Haven't pinned down why the STOCK NO is showing up twice. Just wanted to update you. UPDATE: Now have it removing the first STOCK NO in the final output and printing a quoted .csv file:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = 'gwfnvi.txt';
    open(my $fh, '<', $file) or die "$file: $!";
    my $csv;
    foreach my $line (<$fh>) {
        chomp($line);
        next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
        if($1 eq 'STOCK NO') { $csv = $3; }
        $csv .= ",\"$2\"";
        my $str2 = substr($csv, 1);
        print "$str2\n" if($1 eq 'LST PRICE');
    }
    close($fh);

    Thanks again!
      I never expected to go from raw data to a .csv file in so few lines of code!

      This is one of the reasons so many people like Perl.

      Haven't pinned down why the STOCK NO is showing up twice.

      My mistake.

      if($1 eq 'STOCK NO') { $csv = $2; } $csv .= ",$2";

      should have been

      if($1 eq 'STOCK NO') { $csv = $2; } else { $csv .= ",$2"; }

      or perhaps

      ($1 eq 'STOCK NO') ? ( $csv = $2 ) : ( $csv .= ",$2" );
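
For anyone following along, here is a minimal sketch of the corrected logic as a self-contained script. The sub name records_to_csv, its $first and $last label parameters, and the sample lines are illustrative assumptions, not taken from the real report; only the regex comes from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "LABEL.... value" lines into CSV rows.
# $first and $last are the labels that open and close a record set.
sub records_to_csv {
    my ($first, $last, @lines) = @_;
    my (@rows, $csv);
    foreach my $line (@lines) {
        chomp($line);
        next unless $line =~ m/^([^.]+)\.+\s+(.*)/;
        my ($label, $value) = ($1, $2);
        if ($label eq $first) {
            $csv = $value;          # start a new row, no duplicate field
        }
        elsif (defined $csv) {
            $csv .= ",$value";      # append to the current row
        }
        if ($label eq $last && defined $csv) {
            push @rows, $csv;       # record set is complete
            undef $csv;
        }
    }
    return @rows;
}

# Made-up sample data in the assumed "LABEL.... value" layout.
my @sample = (
    "STOCK NO.... G0001\n",
    "YEAR........ 2010\n",
    "MAKE........ GMC\n",
    "LST PRICE... 33410.00\n",
    "\n",
    "STOCK NO.... G0002\n",
    "YEAR........ 2010\n",
    "MAKE........ GMC\n",
    "LST PRICE... 32615.00\n",
);

print "$_\n" for records_to_csv('STOCK NO', 'LST PRICE', @sample);
```

Passing the start and end labels as arguments means a different report only needs different arguments, for example 'SALES CST' instead of 'LST PRICE' as the closing label.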

      Here's an improved version that uses Text::CSV to quote the data and doesn't duplicate the first field.

      use strict;
      use warnings;
      use Text::CSV;

      my $file = '782426.pl';
      open(my $fh, '<', $file) or die "$file: $!";
      my $csv = Text::CSV->new( { eol => "\n" } );
      my @columns;
      foreach ( <$fh> ) {
          chomp;
          next unless( m/^([^\.]+)\.+\s+(.*)/ );
          push(@columns, $2);
          if($1 eq 'SALES CST') {    # last record of a set
              $csv->print(\*STDOUT, \@columns);
              @columns = ();
          }
      }
      close($fh);
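
One note on quoting: by default Text::CSV only quotes fields that need it, so plain values such as G0001 come out unquoted. If every field should be quoted, the always_quote option can be set. The sketch below assumes Text::CSV is installed; the sub name quoted_row and the sample values are made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV;

# always_quote forces quotes around every field, not only those that need them.
my $quoting_csv = Text::CSV->new( { always_quote => 1 } )
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Hypothetical helper: join a list of fields into one quoted CSV line.
sub quoted_row {
    my @fields = @_;
    $quoting_csv->combine(@fields) or die "combine failed";
    return $quoting_csv->string();
}

print quoted_row('G0001', '2010', 'GMC', 'ACADIA', '33410.00'), "\n";
```

Unlike wrapping each value in \" by hand, this also doubles any double quote that appears inside a value.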
        Thanks ig ! I think we posted at the same time. Any comments on my coding to obtain the desired output (comma separated, quoted csv)? Critique away!