Better Way to Search Large File.

walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

Without the use of a database, I am trying to find a better way than what I have to search a large CSV file. I got the size of it down to about 10mb, 221,000 lines. I'm trying to pull records and put stuff in the middle, as you can see, but I just know there is a much better way to do exactly what this does. Well, anyway, here's my code:

#!/usr/bin/perl -w

my $employee_file = "employee_list.csv";
my $employee_name;
my $ans;
my @exact_breakdown;
open(OUTFILE,">test.script");
my $ans2;

if (@ARGV == 1) {
   chomp ($employee_name=$ARGV[0]);
}
else {
   print "\n\nPlease enter the employee's name: ";
   chomp ($employee_name=<STDIN>);
   while(1) {
      print "\n\nYou entered $employee_name - is this correct? <y or n
+>:";
      chomp($ans=<STDIN>);
      if ($ans =~ /[Nn]/) {
         print "\n\nPlease enter the server name: ";
         chomp ($employee_name=<STDIN>);
         next;
      }     
      elsif ($ans =~ /[Yy]/) {
         last;
      }
      else {
         next;
      }
   }
}

my @name_breakdown=split(/ /,$employee_name);

my @exact_check=`grep \"$employee_name\" $employee_file | cut -d : -f 
+2`;
chomp(@exact_check);
my @approx_check = `grep \"$name_breakdown[1]\" $employee_file | grep 
+\"$name_breakdown[0]\" | cut -d : -f 2`;
chomp(@approx_check);

print "EXACT MATCHES:\n";
print "=" x50,"\n";
if ($exact_check[0] ne "") {
    print "a. $exact_check[0]\n";
}
if ($exact_check[1] ne "") {
    print "b. $exact_check[1]\n";
}
if ($exact_check[2] ne "") {
    print "c. $exact_check[2]\n";
}
print "\n\nAPPROXIMATE MATCHES:\n";
print "=" x50,"\n";
if ($approx_check[0] ne "") {
    print "d. $approx_check[0]\n";
}
if ($approx_check[1] ne "") {
    print "e. $approx_check[1]\n";
}
if ($approx_check[2] ne "") {
    print "f. $approx_check[2]\n";
}
if ($approx_check[3] ne "") {
    print "g. $approx_check[3]\n";
}
if ($approx_check[4] ne "") {
    print "h. $approx_check[4]\n";
}
if ($approx_check[5] ne "") {
    print "i. $approx_check[5]\n";
}
if ($approx_check[6] ne "") {
    print "j. $approx_check[6]\n";
}
if ($approx_check[7] ne "") {
    print "k. $approx_check[7]\n";
}
if ($approx_check[8] ne "") {
    print "l. $approx_check[8]\n";
}
if ($approx_check[9] ne "") {
    print "m. $approx_check[9]\n";
}
print "Please select which is correct: ";
chomp($ans2=<STDIN>);

if ($ans2 =~ /[Aa]/) {
    @exact_breakdown = split(/,/,$exact_check[0]);
    print OUTFILE "$exact_breakdown[0],TESTING,$exact_breakdown[1],$ex
+act_breakdown[2]";
}

elsif ($ans2 =~ /[Bb]/) {
    @exact_breakdown = split(/,/,$exact_check[1]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Cc]/) {
    @exact_breakdown = split(/,/,$exact_check[2]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Dd]/) {
    @exact_breakdown = split(/,/,$approx_check[0]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Ee]/) {
    @exact_breakdown = split(/,/,$approx_check[1]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Ff]/) {
    @exact_breakdown = split(/,/,$approx_check[2]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Gg]/) {
    @exact_breakdown = split(/,/,$approx_check[3]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Hh]/) {
    @exact_breakdown = split(/,/,$approx_check[4]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Ii]/) {
    @exact_breakdown = split(/,/,$approx_check[5]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Jj]/) {
    @exact_breakdown = split(/,/,$approx_check[6]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Kk]/) {
    @exact_breakdown = split(/,/,$approx_check[7]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Ll]/) {
    @exact_breakdown = split(/,/,$approx_check[8]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
elsif ($ans2 =~ /[Mm]/) {
    @exact_breakdown = split(/,/,$approx_check[9]);
    print OUTFILE "$exact_breakdown[0],THINGS,$exact_breakdown[1],$exa
+ct_breakdown[2]";
}
[download]

Comment on Better Way to Search Large File. Download Code

Replies are listed 'Best First'.
Re: Better Way to Search Large File. by dHarry (Abbot) on Aug 08, 2008 at 07:05 UTC
You will have to go through the entire file, there is no way around that. Unless you can calculate the exact spot you want to modify upfront. The Perl CookBook gives several strategies for updating a file. 1. Read from the original file, output to a temporary file. Make the changes you want to the temporary file. Rename the temporary back to the original once you’re done. A snippet to explain the idea: `use strict; use warnings; my ($old, $new); open(OLD, "< $old") or die "can't open $old: $!"; open(NEW, "> $new") or die "can't open $new: $!"; while (<OLD>) { # Some logic... print NEW $_ or die "can't write $new: $!"; } close(OLD) or die "can't close $old: $!"; close(NEW) or die "can't close $new: $!"; rename($old, "$old.orig") or die "can't rename $old to $old.orig: $!"; rename($new, $old) or die "can't rename $new to $old: $!";` [download] 2. Use the -i and -p switches to Perl. This also creates a temporary file but Perl takes care of the file manipulation. There are a few more like: "Updating the file in place without a temporary file" and "Updating a Random-Access File" but in your case (a small csv file) this seems like overkill to me. Bless the CookBook! BTW You have a lot of if/elsif statements. This can probably be improved upon. You might want to take a look at the Switch. Or use the ternary operator to improve the readability. A last remark: there are many modules around on CPAN to work with CSV data. This might be the simplest/most elegant solution of all:-) Hope this helps.	[reply] [d/l]
Re: Better Way to Search Large File. by jwkrahn (Abbot) on Aug 08, 2008 at 07:46 UTC
You could try it like this: #!/usr/bin/perl use warnings; use strict; my $employee_file = 'employee_list.csv'; open OUTFILE, '>', 'test.script' or die "Cannot open 'test.script' $!" +; my $employee_name; if ( @ARGV == 1 ) { $employee_name = $ARGV[ 0 ]; } else { print "\n\nPlease enter the employee's name: "; chomp( $employee_name = <STDIN> ); while ( 1 ) { print "\n\nYou entered $employee_name - is this correct? <y or + n>:"; my $ans = lc substr <STDIN>, 0, 1; if ( $ans eq 'n' ) { print "\n\nPlease enter the server name: "; chomp( $employee_name = <STDIN> ); next; } elsif ( $ans eq 'y' ) { last; } } } my @name_breakdown = split ' ', $employee_name; open my $EFH, '<', $employee_file or die "Cannot open '$employee_file' + $!"; while ( <$EFH> ) { my $field = ( split /:/ )[ 1 ]; push @exact_check, $field if length $field and /\Q$employee_name/ +; push @approx_check, $field if length $field and /\Q$name_breakdown +[1]/ and /\Q$name_breakdown[0]/; } close $EFH; print "EXACT MATCHES:\n", '=' x 50, "\n"; my $key = 'a'; my %lookup; for my $check ( @exact_check ) { $lookup{ $key } = $check; print $key++, ". $check\n"; } print "\n\nAPPROXIMATE MATCHES:\n", '=' x 50, "\n"; for my $check ( @approx_check ) { $lookup{ $key } = $check; print $key++, ". $check\n"; } print "Please select which is correct: "; my $ans = lc substr <STDIN>, 0, 1; if ( exists $lookup{ $ans } ) { my @breakdown = split /,/, $lookup{ $ans }; splice @breakdown, 1, 0, $ans eq 'a' ? 'TESTING' : 'THINGS'; print OUTFILE join ',', @breakdown; } __END__ [download]	[reply] [d/l]
Re: Better Way to Search Large File. by eosbuddy (Scribe) on Aug 08, 2008 at 05:58 UTC
I suppose this sounds crazy (might be retrograded suggestion even), how about reading in the whole file content as an array and using `grep` ... venerated ones here might be able to let you know the cons of such an action (personally I feel that breaking up the file into an appropriate database in mysql and querying that database will free you of a lot of work ... but I guess you know better :-) ).	[reply] [d/l]
Re: Better Way to Search Large File. by apl (Monsignor) on Aug 08, 2008 at 11:38 UTC
Off-topic... arrays, hashes and subroutines are Good Things when you see repeated code. An array that maps $exact_breakdown subscripts to $ans2 regex would reduce the size of your code greatly. A subroutine that returns $employee_name would make your code more readable, etc.	[reply]
Re: Better Way to Search Large File. by hangon (Deacon) on Aug 08, 2008 at 11:09 UTC
If your CSV file is sorted on the employee_name field, you could try a slightly modified binary search. Its a little tricky - seek to the middle of each search span, parse the second line from that point, and use tell to keep track of your position. I don't know if this would gain you anything unless the file was too large to load into memory.	[reply]