
Thanks, ig. I never expected to go from raw data to a .csv file in so few lines of code! Using the new data file, and changing the print statement to test for eq 'LST PRICE', here is a small sample of the output:
G0001,G0001,2010,GMC,ACADIA,/,33410.00
G0002,G0002,2010,GMC,ACADIA,/,32615.00
G0003,G0003,2010,GMC,ACADIA,/,33010.00
G0004,G0004,2010,GMC,ACADIA,/,32615.00
G0005,G0005,2010,GMC,ACADIA,/,33410.00
Haven't pinned down why the STOCK NO is showing up twice. Just wanted to update you.

UPDATE: It now removes the first STOCK NO from the final output and prints in quoted .csv format:
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'gwfnvi.txt';
open(my $fh, '<', $file) or die "$file: $!";
my $csv;
foreach my $line (<$fh>) {
    chomp($line);
    next unless($line =~ m/^([^\.]+)\.+\s+(.*)/);
    if($1 eq 'STOCK NO') {
        $csv = $3;
    }
    $csv .= ",\"$2\"";
    my $str2 = substr($csv, 1);
    print "$str2\n" if($1 eq 'LST PRICE');
}
close($fh);
Thanks again!

Re^5: Text Extraction
by ig (Vicar) on Jul 23, 2009 at 02:28 UTC
    I never expected to go from raw data to a .csv file in so few lines of code!

    This is one of the reasons so many people like Perl.

    Haven't pinned down why the STOCK NO is showing up twice.

    My mistake.

    if($1 eq 'STOCK NO') { $csv = $2; } $csv .= ",$2";

    should have been

    if($1 eq 'STOCK NO') { $csv = $2; } else { $csv .= ",$2"; }

    or perhaps

    ($1 eq 'STOCK NO') ? ( $csv = $2 ) : ( $csv .= ",$2" );

    Here's an improved version that uses Text::CSV to quote the data and doesn't duplicate the first field.

    use strict;
    use warnings;
    use Text::CSV;

    my $file = '782426.pl';
    open(my $fh, '<', $file) or die "$file: $!";
    my $csv = Text::CSV->new( { eol => "\n" } );
    my @columns;
    foreach ( <$fh> ) {
        chomp;
        next unless( m/^([^\.]+)\.+\s+(.*)/ );
        push(@columns, $2);
        if($1 eq 'SALES CST') { # last record of a set
            $csv->print(\*STDOUT, \@columns);
            @columns = ();
        }
    }
    close($fh);
      Thanks, ig! I think we posted at the same time. Any comments on my coding to obtain the desired output (comma separated, quoted csv)? Critique away!

        In the if block you are setting $csv to $3 but $3 is undefined. Then you append $2 to $csv. This has the result that the first field is only present once but it has a comma before it, which shouldn't be there.
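A minimal sketch of why $3 is empty, assuming an input line shaped like "STOCK NO.......... G0001". The sample line and the helper name parse_line are made up for illustration; the regex is the one used throughout this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the thread's regex and return all captures.
sub parse_line {
    my ($line) = @_;
    # Two pairs of capturing parentheses, so only $1 and $2 can ever be set.
    return $line =~ m/^([^\.]+)\.+\s+(.*)/;
}

my @captures = parse_line('STOCK NO.......... G0001');  # made-up sample line
print scalar(@captures), " captures: @captures\n";

# Building the row the way the posted script does:
my $csv = undef;              # what "$csv = $3" amounts to
$csv .= ",\"$captures[1]\"";  # the first field now has a leading comma
print "$csv\n";
print substr($csv, 1), "\n";  # substr() is what strips that comma again
```

In other words, the posted script works only because substr() removes the comma that the undefined $3 left behind. Assigning $2 in the if block and appending in an else block avoids the need for that workaround.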

        In your original data you had one record set that didn't have the 'SALES CST' record. As a result, that record set and the following one were concatenated together. Such problems are not uncommon when parsing irregular text files. It is good to include error checking, but that depends on having a good sense of what is or isn't allowed. For example: is it an error for a set of records to be missing 'SALES CST'? Or is this OK? If it is OK for the 'SALES CST' record to be missing from a set, what should be written to the CSV file for such a record set?

        Here is a variation that requires the first record to be 'STOCK NO' and reports errors if there is an unknown field or a duplicate field. Any field but the first can be missing from a record set and will default to an empty string.

        use strict;
        use warnings;
        use Text::CSV;

        # Field names in the order they are to appear in the CSV file
        my @fields = (
            'STOCK NO',
            'YEAR',
            'MAKE',
            'CARLINE',
            'COLOR DESCRIPTIONS',
            'SALES CST',
        );

        my $file = '782426.pl';
        open(my $fh, '<', $file) or die "$file: $!";
        my $csv = Text::CSV->new( { eol => "\n" } );
        $csv->print(\*STDOUT, \@fields );
        my @columns;
        my %columns = map { ( $_ => "" ) } @fields;
        while ( <$fh> ) {
            chomp;
            next unless( m/^([^\.]+)\.+\s+(.*)/ );
            if($1 eq $fields[0]) {
                # A new record set starts: flush the previous one
                write_columns(\%columns);
                reset_columns(\%columns);
            }
            die "Unknown column $1" unless(exists($columns{$1}));
            die "duplicate column $1" if($columns{$1});
            $columns{$1} = $2;
        }
        write_columns(\%columns);
        close($fh);
        exit(0);

        # Print one record set, but only if it has its first field
        sub write_columns {
            my $columns = shift;
            if($columns->{$fields[0]}) {
                $csv->print(\*STDOUT, [ map { $columns->{$_} } @fields ] );
            }
        }

        # Set every known field back to an empty string
        sub reset_columns {
            my $columns = shift;
            %$columns = map { ( $_ => "" ) } @fields;
        }

        This has added some complexity but it is less likely to produce a CSV file with hard-to-detect errors. You can add more checks to reduce the risk of errors further.
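As one possible further check, here is a rough sketch that validates field values before a record set is written. The helper name check_record and the patterns in %rules are assumptions, not something taken from the real data; they would need adjusting to whatever the input actually allows.

```perl
use strict;
use warnings;

# Hypothetical validation rules; adjust them to what the real data allows.
my %rules = (
    'STOCK NO'  => qr/^\w+$/,
    'YEAR'      => qr/^\d{4}$/,
    'SALES CST' => qr/^\d+(?:\.\d{2})?$/,
);

# Return a list of problems found in one record set (empty list if none).
sub check_record {
    my ($columns) = @_;
    my @problems;
    foreach my $field (sort keys %rules) {
        my $value = $columns->{$field};
        next if !defined($value) || $value eq '';   # missing fields are allowed
        push(@problems, "bad $field: '$value'") unless $value =~ $rules{$field};
    }
    return @problems;
}

# Made-up record with a typo in YEAR (lowercase L instead of the digit 1)
my %record = ( 'STOCK NO' => 'G0001', 'YEAR' => '20l0', 'SALES CST' => '33410.00' );
print "$_\n" for check_record(\%record);
```

A call to something like check_record(\%columns) could sit just before write_columns, either dying on the first problem or logging it and carrying on, depending on how strict the conversion needs to be.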