Sterling_Malory has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

So I have created this script with help from previous posts on Perl Monks. It successfully reads from large tab and comma delimited text docs and picks out the rows of data which I require and places them into new docs that are much smaller and more manageable in size.

However, I have been unable to create a version that does this in bulk. The command I use to run it is as follows:

perl midasproc2.pl midas_wxhrly_199901-199912.txt 19260 "1999-12-15 11:00"

This means that a different value must be entered manually each time in place of "19260". Is it possible to alter the script so that I can enter numerous values in addition to 19260?

1999-01-01 00:00, EGPK, ICAO, METAR, 1, 1006, 1001, , , 120, 13, 00, , , , , , , , , 1000, , 1, , 96, 6, , $
1999-01-01 00:00, EGSS, ICAO, METAR, 1, 484, 1001, , , 130, 8, 00, , , , , , , , , 1000, , 1, , 120, , , , $
1999-01-01 01:00, 03002, WMO, SYNOP, 1, 12, 1011, 4, 6, 160, 20, , , , , , , , , 20, 440, 1004.1, 7, , 24, $

Below is the script I pieced together:

------------------------------------------

#!/usr/bin/perl
use strict;
use warnings;

my ($record, $date, $outfile, $station, $sstation, $linecnt, $pcntg, @values);

print "Processing \"$ARGV[0]\"...\n";

$date    = $ARGV[2];
$station = $ARGV[1];
$outfile = $date . ".txt";
$outfile =~ s/ /-/;

print "For Date: $date and Station: $station\n\n";

$linecnt = `wc -l < $ARGV[0]`;

open (INFILE, $ARGV[0]);
open (OUTFILE, ">>", $outfile);

while (<INFILE>) {
    $pcntg = int (($. / $linecnt) * 100);
    print "$pcntg %\r";
    chomp();
    $record = $_;
    @values = split (',', $record);
    $sstation = $values[5];
    $sstation =~ s/ //g;
    if ($values[0] eq $date && $sstation eq $station) {
        print OUTFILE $record . "\n";
        # print "xx".$values[5]."xx\n";
        print "Record written to $outfile...\n";
        last;
    }
}

print "\nFinished\n";
close (INFILE);
close (OUTFILE);

-----------------------------------------------

Any help anyone can offer will be greatly appreciated. Also I hope this code can help others filter through large delimited documents.

Re: Bulk Reading and Writing of Large Text Files
by BrowserUk (Patriarch) on May 21, 2013 at 10:56 UTC

    Minimal changes might be something like:

    ...
    my ($record, $date, $outfile, %stations, $sstation, $linecnt, $pcntg, @values);
    ...
    $date = $ARGV[1];
    %stations = map{ $_ => 1 } @ARGV[ 2 .. $#ARGV ];
    ...
    if( $values[0] eq $date && exists $stations{ $sstation } ) {
    • It re-orders the command line arguments to put the station number(s) at the end of the command line.
    • It then builds a hash from the station number(s) entered.
    • And checks each value of $sstation against the hash using exists

    The command line then becomes:

    perl midasproc2.pl midas_wxhrly_199901-199912.txt "1999-12-15 11:00" 19260 19261 19262 [...]
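    The hash-membership idea above can be sketched in isolation; the station numbers here are made-up examples, not values from the MIDAS data:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build a set-like hash from the list of wanted stations
    # (in the real script this list comes from @ARGV).
    my @wanted   = (19260, 19261);
    my %stations = map { $_ => 1 } @wanted;

    # Membership is a single hash lookup with exists(),
    # no matter how many stations were supplied.
    for my $sstation (19260, 19999) {
        print exists $stations{$sstation}
            ? "$sstation: keep\n"
            : "$sstation: skip\n";
    }
    ```

    This is why the hash beats a chain of `eq` comparisons: adding more stations costs nothing at the comparison site.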

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for your response. I have been trying both options. Using your advice my script now reads:
      #!/usr/bin/perl
      use strict;
      use warnings;

      my ($record, $date, $outfile, $station, $sstation, $linecnt, $pcntg, @values);

      print "Processing \"$ARGV[0]\"...\n";

      $date = $ARGV[1];
      %stations = map{ $_ => 1 } @ARGV[ 2 .. $#ARGV ];
      $outfile = $date . ".txt";
      $outfile =~ s/ /-/;

      print "For Date: $date and Station: $station\n\n";

      $linecnt = `wc -l < $ARGV[0]`;

      open (INFILE, $ARGV[0]);
      open (OUTFILE, ">>", $outfile);

      while (<INFILE>) {
          $pcntg = int (($. / $linecnt) * 100);
          print "$pcntg %\r";
          chomp();
          $record = $_;
          @values = split (',', $record);
          $sstation = $values[5];
          $sstation =~ s/ //g;
          if( $values[0] eq $date && exists $stations{ $sstation } ) {
              print OUTFILE $record . "\n";
              # print "xx".$values[5]."xx\n";
              print "Record written to $outfile...\n";
              last;
          }
      }

      print "\nFinished\n";
      close (INFILE);
      close (OUTFILE);
      It hasn't worked just yet; I get the following errors:
      [s1269452@burn MIDAS_DATA]$ perl bulk_reorder_2105.txt midas_wxhrly_199901-199912.txt "1999-12-15 11:00" 19260
      Global symbol "%stations" requires explicit package name at bulk_reorder_2105.txt line 10.
      Global symbol "$stations" requires explicit package name at bulk_reorder_2105.txt line 14.
      Global symbol "%stations" requires explicit package name at bulk_reorder_2105.txt line 27.
      Execution of bulk_reorder_2105.txt aborted due to compilation errors.
      I have tried tweaking it a bit but I haven't really got anywhere.

        I changed your declaration line

        from:

        my ($record, $date, $outfile, $station, $sstation, $linecnt, $pcntg, @values);
        #.............................^^^^^^^^

        to

        my ($record, $date, $outfile, %stations, $sstation, $linecnt, $pcntg, @values);
        #.............................^^^^^^^^^

        You didn't!


Re: Bulk Reading and Writing of Large Text Files
by choroba (Cardinal) on May 21, 2013 at 10:49 UTC
    strict is useful if you declare each variable in the tightest scope possible. Declaring all the variables at the beginning of the script just gives you global variables with all the pitfalls.
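    As a minimal illustration of that point (the sample records below are made up, not from the MIDAS data): a variable declared at file scope survives every loop iteration, while one declared inside the loop is created fresh each time and cannot leak elsewhere:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # File-scoped: shared by all iterations; a typo elsewhere
    # in the script can silently reuse it.
    my $line_count = 0;

    # Tighter: declare each value where it is born, so strict
    # can catch any use outside this loop body.
    while (my $record = <DATA>) {
        chomp $record;
        my @fields = split /,/, $record;   # scoped to this iteration
        $line_count++;
        print "first field: $fields[0]\n";
    }
    print "lines: $line_count\n";

    __DATA__
    1999-01-01 00:00, EGPK, ICAO
    1999-01-01 01:00, 03002, WMO
    ```

    With the tight declarations, misspelling `@fields` after the loop is a compile-time error instead of a silent empty list.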

    I created a hash of the stations so you can easily check whether a report for a given station was requested. On the command line, just place all the stations where you originally had one.

    As you have not posted a sample of the input data and your specification is not clear on this, I do not know what to do with the last on line 30. You definitely do not want to end the loop there, because it still has to run for the other stations. If a station can be reported multiple times but you only want the first report, you can delete its entry from the hash. If you just use last to speed the processing up and you know each station is mentioned only once for the given date, you can still delete the entries and use last unless %stations to quit the loop once all the stations have been processed.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $filename = shift;
    my $date     = pop;

    print qq(Processing "$filename"...\n);

    (my $outfile = "$date.txt") =~ s/ /-/;

    open my $IN, '<', $filename or die "$filename: $!\n";

    # Count the lines.
    1 while <$IN>;
    my $line_count = $.;
    seek $IN, $. = 0, 0;

    open my $OUT, '>', $outfile or die "$outfile: $!\n";

    my %stations;
    @stations{ @ARGV } = ();

    while (<$IN>) {
        chomp;
        my $pcntg = int 100 * ($. / $line_count);
        print STDERR "$pcntg %\r";
        my ($file_date, $file_station) = (split /,/)[0, 5];
        $file_station =~ s/ //g;
        if ($file_date eq $date and exists $stations{$file_station}) {
            print $OUT "$_\n";
            print "Record written to $outfile...\n";
            delete $stations{$file_station};
            last unless %stations;
        }
    }
    print "\nFinished\n";
    close $OUT;
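    The delete/last handling on its own looks like this; the station numbers are made-up examples:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %stations = map { $_ => 1 } qw(19260 19261);

    for my $seen (qw(19261 19260 19262)) {
        next unless exists $stations{$seen};
        print "matched $seen\n";
        delete $stations{$seen};   # keep only the first report per station
        last unless %stations;     # all wanted stations found: stop early
    }
    ```

    Once both wanted stations have matched, the hash is empty and the loop exits without ever examining 19262.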

    Update: Added missing indices at line 27 and missing angle brackets at line 14.

    Update 2: Added the last handling. Also, STDERR is now used to report the percentage, as it is not buffered.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thank you for the response

      What is the best way to provide you with a sample of the data?

      Each value should only occur once per date/time entry so I think what you have provided me with should work

        A good practice is to include a small sample of the data in the question (3-4 lines) enclosed in the <code> tags for easy download.
Re: Bulk Reading and Writing of Large Text Files
by Tux (Canon) on May 23, 2013 at 06:33 UTC

    This looks like (malformed: spaces) CSV data to me. Splitting the lines, however simple it may look right now, is seldom the correct way to deal with that kind of data. Use Text::CSV_XS or Text::CSV:

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({ binary => 1, allow_whitespace => 1, auto_diag => 1 });

    my $fs = -s $ARGV[0] or die "Empty input file";
    open my $fh, "<", $ARGV[0];
    while (my $row = $csv->getline ($fh)) {
        print int (100 * tell ($fh) / $fs), "%%\r";
        my $sstation = $row->[5];
        ...

    Enjoy, Have FUN! H.Merijn