vbynagari has asked for the wisdom of the Perl Monks concerning the following question:

I need help with this.

I need to build a cross reference table from one file with say distributor ID and customer ID.

I have another file that I need to check whether the distributor ID and customer ID (from above cross reference table I built) combination already exists in this new file and create a new file only if the xref table I built doesn't have this record.

I used grep and since the files I am dealing with are huge, it is taking 5-6 hrs to complete the grep.

Is there a better way to do this? Any help is appreciated.

here is the sample code I have:

# Now read the values from XREF into an array open XREF, "<$xref" or warn $!; @xreflines=<XREF>; #remove the blank lines from the araary @xreflines = grep /\S/, @xreflines; print LOG "XREF lines are read into an array $now \n"; close XREF; #now read the good customer file created from VIPOUT and check against + the XREF file and see whether the distributor/customer combination a +lready exists. If not create a new file and this is the one that gets + loaded into BW open CUSTOMERFILE, "<$out_file" or warn $!; @lines=<CUSTOMERFILE>; close CUSTOMERFILE; open XFILE, ">>$goodfile" or warn $!; foreach $line (@lines) { ($DISTID1,$CUST1,$junk)=split('\;',$line); chomp $DISTID1; chomp $CUST1; print LOG "\$xrefvalue is $xrefvalue \n"; $x=grep /$xrefvalue/, @xreflines; if ( $x == 0 ) { print LOG "\$x value is $x \n"; print LOG "\$line is $line \n"; print XFILE "$line \n"; } } close XFILE;

Here is the same xref file: (combination of distributor ID & customer ID)

3036802849

3036802842

3036802854

3036802856

30368021983

3036802882

30368021703

3036802258

30368026951

30368022425

30368025243

and the data from the customer file:

3696693;5308;;BJS BREWHOUSE;2631 EDMONDSON RD;452091910;CINCINNATI;OH;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1871781;01800;;BRADYS;25 UNION ST;045382116;BOOTHBAY HARBOR;ME;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3172110;59475;;ARRIVEDERCI ITALIAN CUISINE;8900 E PINNACLE PEAK RD STE D1;852553647;SCOTTSDALE;AZ;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3172110;26154;;KINGS MINI MART;4150 N 35TH AVE;850173858;PHOENIX;AZ;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

2534996;5830;;CAZADOREZ MEXICAN RESTAURANT;3900 N HARRISON ST;748041427;SHAWNEE;OK;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

2534996;3473;;ROUTE 66 NAMAN LIQUOR;4301 N SARA RD UNIT 106;730993223;YUKON;OK;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1871316;8P670;;H L PENINSULA PEARL;1590 BAYSHORE HWY;940101601;BURLINGAME;CA;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

3138693;46526;;K & N MARKET;464 N BAILEY ST;480654710;ROMEO;MI;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

1870957;044816-SAME;;SHELL GAS;1752 WEBSTER AVE;104577341;BRONX;NY;;;;;US0109;;;;;;;;;;;110207;;;;;;;;;3;;;;;;;;;

first 2 fields are distributor ID and customer ID.

if the distributor ID and customer ID matches from xref file, then I need to ignore it else, I need to create a new file. The output is the same line from the customer file.

Replies are listed 'Best First'.
Re: best way to use grep
by Corion (Patriarch) on Dec 27, 2018 at 19:25 UTC

    Crossposted on Stackoverflow. It's polite to mention other places on the web where you asked the same question so that work doesn't get doubled.

    Some notes on your post:

    $x=grep /$xrefvalue/, @xreflines;

    Instead of searching through the @xreflines like that, use instead a hash as the datastructure:

    my %xreflines = map { $_ => 1 } grep /\S/, <XREF>; ... $x = $xreflines{ $xrefvalue };

    This removes the loop and changes it into a direct lookup whether the xref exists.

    Also, for reading a (comma, or semicolon-)delimited file, consider using Text::CSV_XS instead of splitting the data yourself.

Re: best way to use grep
by jwkrahn (Abbot) on Dec 27, 2018 at 21:39 UTC
    @xreflines=<XREF>; #remove the blank lines from the araary @xreflines = grep /\S/, @xreflines;

    You didn't chomp to remove newlines.

    $x=grep /$xrefvalue/, @xreflines;

    if @xreflines contains the values 30368021983, 30368021984 and 30368021985 and $xrefvalue contains the value 3036802198 then it will match all three, incorrectly.

    As Corion points out, it is better to use a hash:

    my %xreflines = map { /(\d+)/, 1 } <XREF>; unless ( exists $xreflines{ "$DISTID1$CUST1" } ) { print LOG "\$x value is $x \n"; print LOG "\$line is $line \n"; print XFILE "$line \n"; }
Re: best way to use grep
by Laurent_R (Canon) on Dec 27, 2018 at 22:11 UTC
    In terms of performance, the single most important change to be done is to use a hash enabling direct lookup instead of an array to store ther $xref data, as already suggested by Corion. Depending on the size of the $xref file, this change alone can make your program thousands times (or even much more) faster.

    A few other changes might improve your code. Insert these pragmas:

    use strict; use warnings;
    at the top of all your Perl programs, and declare your variables (with the my keyword).

    You don't need to store your customer file into an array (@lines) and then read the array, this leads you to process twice the same data, which is a waste of time. Do your checks when reading your customer file line by line;

    Use the three-argument syntax of open and use lexical file handles (see examples below).

    This could lead to something like this (untested):

    use strict; use warnings; # Read the values from XREF into a hash my $xref = "input_file.txt"; # put here real filename open my $XREF, "<", $xref or warn "Could not open $xref $!"; # A more + detailed message can sometimes be useful my %xreflines = map { chomp; $_ => 1 } grep /\S/, <$XREF>; # popu +lating the hash with the content of the xref file close $XREF; my $goodfile = ...; # insert the name +of the output file here open my $XFILE, ">>", $goodfile or warn "Could not open $goodfile $!"; my $outfile = ...; # insert the name +of the customer file here open my $CUSTOMERFILE, "<", $outfile or warn "Could not open $outfile +$!"; while (my $line = <$CUSTOMERFILE>) { my ($dist, $cust) = (split /;/, $line)[0,1]; # no need to chomp + the fields, there are other fields afterwards in the line my $key = "$dist$cust"; # Building the loo +kup key unless (exists $xreflines{$key}) { # hash lookup print $XFILE "$line"; # printing the lin +es whose ID is not found in the cross ref file } } close $XFILE; close $CUSTOMERFILE;
    Update: removed some extra quote marks left out from the OP code when changing the open syntax. Thanks to choroba for pointing out.
Re: best way to use grep
by AnomalousMonk (Archbishop) on Dec 28, 2018 at 00:58 UTC
    print LOG "\$xrefvalue is $xrefvalue \n";
    $x=grep /$xrefvalue/, @xreflines;

    I don't see where  $xrefvalue is being declared and initialized. I assume it's being built from  $DISTID1 and  $CUST1 in something like the way that Laurent_R shows here:
        my $xrefvalue = "$DISTID1$CUST1";
    If so, there's a potential problem because two records like
        3696693;5308;;BJS BREWHOUSE;2631 EDMONDSON RD;...
        369669;35308;;HORSESHOE ROAD INN;12 3RD ST;...
    will have the same  $xrefvalue value, "36966935308", unless there is some unstated rule that tells you this can never happen.

    Better, IMHO, to use a non-numeric separator to guarantee unambiguous cross-ref values:
        my $separator = ';';
        ...
        my $xrefvalue = "$DISTID1$separator$CUST1";
    A semicolon seems nice because the CSV field separator is already a semicolon.

    The advice to build a cross-ref lookup hash seems very, very good. I imagine the rest of the code might look a lot like the code in Laurent_R's post except the split statement could be
        my ($dist, $cust) = (split $separator, $line)[0,1];
    or
        my ($dist, $cust) = split $separator, $line, 3;
    You don't say how big your database is, but a hash could accommodate tens of millions of cross references in system memory for very fast lookup; much more than that and you're looking at a database. (Other approaches, like using Text::CSV_XS or a regex field extractor, or perhaps emulated multidimensional hash keys might be better (or sexier), but let's just take one step at a time!)


    Give a man a fish:  <%-{-{-{-<