best way to use grep

vbynagari has asked for the wisdom of the Perl Monks concerning the following question:

I need help with this.

I need to build a cross reference table from one file with say distributor ID and customer ID.

I have another file that I need to check whether the distributor ID and customer ID (from above cross reference table I built) combination already exists in this new file and create a new file only if the xref table I built doesn't have this record.

I used grep and since the files I am dealing with are huge, it is taking 5-6 hrs to complete the grep.

Is there a better way to do this? Any help is appreciated.

here is the sample code I have:

# Now read the values from XREF into an array
open XREF, "<$xref" or warn $!;
@xreflines=<XREF>;
#remove the blank lines from the araary
@xreflines = grep /\S/, @xreflines;
print LOG "XREF lines are read into an array $now \n";
close XREF;


#now read the good customer file created from VIPOUT and check against
+ the XREF file and see whether the distributor/customer combination a
+lready exists. If not create a new file and this is the one that gets
+ loaded into BW
open CUSTOMERFILE, "<$out_file" or warn $!;
@lines=<CUSTOMERFILE>;
close CUSTOMERFILE;

open XFILE, ">>$goodfile" or warn $!;
foreach $line (@lines) {
    ($DISTID1,$CUST1,$junk)=split('\;',$line);
    chomp $DISTID1;
    chomp $CUST1;
    print LOG "\$xrefvalue is $xrefvalue \n";
    $x=grep /$xrefvalue/, @xreflines;
    if ( $x == 0 ) {
        print LOG "\$x value is $x \n";
        print LOG "\$line is $line \n";
        print XFILE "$line \n";
    }
}
close XFILE;
[download]

Here is the same xref file: (combination of distributor ID & customer ID)

3036802849

3036802842

3036802854

3036802856

30368021983

3036802882

30368021703

3036802258

30368026951

30368022425

30368025243

and the data from the customer file:

3696693;5308;;BJS BREWHOUSE;2631 EDMONDSON RD;452091910;CINCINNATI;OH;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1871781;01800;;BRADYS;25 UNION ST;045382116;BOOTHBAY HARBOR;ME;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3172110;59475;;ARRIVEDERCI ITALIAN CUISINE;8900 E PINNACLE PEAK RD STE D1;852553647;SCOTTSDALE;AZ;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3172110;26154;;KINGS MINI MART;4150 N 35TH AVE;850173858;PHOENIX;AZ;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

2534996;5830;;CAZADOREZ MEXICAN RESTAURANT;3900 N HARRISON ST;748041427;SHAWNEE;OK;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

2534996;3473;;ROUTE 66 NAMAN LIQUOR;4301 N SARA RD UNIT 106;730993223;YUKON;OK;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1871316;8P670;;H L PENINSULA PEARL;1590 BAYSHORE HWY;940101601;BURLINGAME;CA;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3138693;46526;;K & N MARKET;464 N BAILEY ST;480654710;ROMEO;MI;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1870957;044816-SAME;;SHELL GAS;1752 WEBSTER AVE;104577341;BRONX;NY;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

first 2 fields are distributor ID and customer ID.

if the distributor ID and customer ID matches from xref file, then I need to ignore it else, I need to create a new file. The output is the same line from the customer file.

Comment on best way to use grep Download Code

Replies are listed 'Best First'.
Re: best way to use grep by Corion (Patriarch) on Dec 27, 2018 at 19:25 UTC
Crossposted on Stackoverflow. It's polite to mention other places on the web where you asked the same question so that work doesn't get doubled. Some notes on your post: `$x=grep /$xrefvalue/, @xreflines;` [download] Instead of searching through the `@xreflines` like that, use instead a hash as the datastructure: `my %xreflines = map { $_ => 1 } grep /\S/, <XREF>; ... $x = $xreflines{ $xrefvalue };` [download] This removes the loop and changes it into a direct lookup whether the xref exists. Also, for reading a (comma, or semicolon-)delimited file, consider using Text::CSV_XS instead of splitting the data yourself.	[reply] [d/l] [select]
Re: best way to use grep by jwkrahn (Abbot) on Dec 27, 2018 at 21:39 UTC
`@xreflines=<XREF>; #remove the blank lines from the araary @xreflines = grep /\S/, @xreflines;` [download] You didn't chomp to remove newlines. `$x=grep /$xrefvalue/, @xreflines;` [download] if `@xreflines` contains the values 30368021983, 30368021984 and 30368021985 and `$xrefvalue` contains the value 3036802198 then it will match all three, incorrectly. As Corion points out, it is better to use a hash: `my %xreflines = map { /(\d+)/, 1 } <XREF>; unless ( exists $xreflines{ "$DISTID1$CUST1" } ) { print LOG "\$x value is $x \n"; print LOG "\$line is $line \n"; print XFILE "$line \n"; }` [download]	[reply] [d/l] [select]
Re: best way to use grep by Laurent_R (Canon) on Dec 27, 2018 at 22:11 UTC
In terms of performance, the single most important change to be done is to use a hash enabling direct lookup instead of an array to store ther `$xref` data, as already suggested by Corion. Depending on the size of the `$xref` file, this change alone can make your program thousands times (or even much more) faster. A few other changes might improve your code. Insert these pragmas: `use strict; use warnings;` [download] at the top of all your Perl programs, and declare your variables (with the `my` keyword). You don't need to store your customer file into an array (`@lines`) and then read the array, this leads you to process twice the same data, which is a waste of time. Do your checks when reading your customer file line by line; Use the three-argument syntax of `open` and use lexical file handles (see examples below). This could lead to something like this (untested): use strict; use warnings; # Read the values from XREF into a hash my $xref = "input_file.txt"; # put here real filename open my $XREF, "<", $xref or warn "Could not open $xref $!"; # A more + detailed message can sometimes be useful my %xreflines = map { chomp; $_ => 1 } grep /\S/, <$XREF>; # popu +lating the hash with the content of the xref file close $XREF; my $goodfile = ...; # insert the name +of the output file here open my $XFILE, ">>", $goodfile or warn "Could not open $goodfile $!"; my $outfile = ...; # insert the name +of the customer file here open my $CUSTOMERFILE, "<", $outfile or warn "Could not open $outfile +$!"; while (my $line = <$CUSTOMERFILE>) { my ($dist, $cust) = (split /;/, $line)[0,1]; # no need to chomp + the fields, there are other fields afterwards in the line my $key = "$dist$cust"; # Building the loo +kup key unless (exists $xreflines{$key}) { # hash lookup print $XFILE "$line"; # printing the lin +es whose ID is not found in the cross ref file } } close $XFILE; close $CUSTOMERFILE; [download] Update: removed some extra quote marks left out from the OP code when changing the `open` syntax. Thanks to choroba for pointing out.	[reply] [d/l] [select]
Re: best way to use grep by AnomalousMonk (Archbishop) on Dec 28, 2018 at 00:58 UTC
`print LOG "\$xrefvalue is $xrefvalue \n";` `$x=grep /$xrefvalue/, @xreflines;` I don't see where `$xrefvalue` is being declared and initialized. I assume it's being built from `$DISTID1` and `$CUST1` in something like the way that Laurent_R shows here: `my $xrefvalue = "$DISTID1$CUST1";` If so, there's a potential problem because two records like `3696693;5308;;BJS BREWHOUSE;2631 EDMONDSON RD;...` `369669;35308;;HORSESHOE ROAD INN;12 3RD ST;...` will have the same `$xrefvalue` value, "36966935308", unless there is some unstated rule that tells you this can never happen. Better, IMHO, to use a non-numeric separator to guarantee unambiguous cross-ref values: `my $separator = ';';` `...` `my $xrefvalue = "$DISTID1$separator$CUST1";` A semicolon seems nice because the CSV field separator is already a semicolon. The advice to build a cross-ref lookup hash seems very, very good. I imagine the rest of the code might look a lot like the code in Laurent_R's post except the split statement could be `my ($dist, $cust) = (split $separator, $line)[0,1];` or `my ($dist, $cust) = split $separator, $line, 3;` You don't say how big your database is, but a hash could accommodate tens of millions of cross references in system memory for very fast lookup; much more than that and you're looking at a database. (Other approaches, like using Text::CSV_XS or a regex field extractor, or perhaps emulated multidimensional hash keys might be better (or sexier), but let's just take one step at a time!) Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]