Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise monks! May a feeble noob ask for wisdom!
I have some very inefficient code. It takes way too long to run, and I just know there is a better way to do it, but my skills are lacking! I have 2 files that share a string. I want to match a string from file A in file B and add another string from file A to the end of the line in file B where the common string was matched. I know this is bad, but that's why I asked for help!
#!/usr/bin/perl -w

chomp($phi  = `ls slk`);
chomp($phi2 = `ls *.csv`);

open(OUT, "> /home/Aug18.csv") or die "Cannot open output file";
open(READDATA, "< /home/$phi") or die "Cannot Open file";

my $stuff;
my %h = ();
while ($stuff = <READDATA>) {
    if ($stuff =~ /(\d\d\d\d\w\d?)\s+(\w*)/) {
        $port = $1;
        $lsn  = $2;
        $h{$port} = $lsn;
    }
}
close READDATA;

open(READDATA2, "< /home/$phi2") or die "Cannot Open file.";
my $stuff2;
while ($stuff2 = <READDATA2>) {
    while (($port2, $lsn2) = each %h) {
        if ($stuff2 =~ /$port2/) {
            chomp($stuff2);
            print OUT "$stuff2,$lsn2\n"
        }
    }
}

It works for now, but my god it's slow! Thanks for any advice!

Replies are listed 'Best First'.
Re: 2 files 1 output
by tcf22 (Priest) on Aug 21, 2003 at 14:07 UTC
    You don't need to loop through the hash for every line of file 2. Just check if the hash key exists. Try something like this:
    #!/usr/bin/perl -w
    use strict;

    my $phi  = `ls slk`;
    my $phi2 = `ls *.csv`;
    chomp($phi);
    chomp($phi2);

    open(OUT, "> /home/Aug18.csv") or die "Cannot open output file";
    open(READDATA, "< /home/$phi") or die "Cannot Open file";

    # Build the lookup table: port => lsn
    my %h;
    while (my $stuff = <READDATA>) {
        if ($stuff =~ /(\d\d\d\d\w\d?)\s+(\w*)/) {
            $h{$1} = $2;
        }
    }
    close READDATA;

    open(READDATA2, "< /home/$phi2") or die "Cannot Open file.";
    while (my $stuff2 = <READDATA2>) {
        chomp($stuff2);
        # One hash lookup per line instead of a loop over every key
        if (exists $h{$stuff2}) {
            print OUT "$stuff2,$h{$stuff2}\n";
        }
    }
    close READDATA2;
Re: 2 files 1 output
by bfish (Novice) on Aug 21, 2003 at 13:37 UTC
    How big are these files and how much memory do you have on your system? If the first file is sufficiently small then you can read it into a hash based on the matching string and then just check the hash for each entry in the second file. If this looks practical, I can help you with the code...
Re: 2 files 1 output
by graff (Chancellor) on Aug 22, 2003 at 02:27 UTC
    Just a few "peripheral" comments about your code:
    chomp($phi = `ls slk`);
    chomp($phi2 = `ls *.csv`);
    open(OUT, "> /home/Aug18.csv") or die "Cannot open output file";
    open(READDATA, "< /home/$phi") or die "Cannot Open file";
    Since you know you are going to open a file named "slk", why not just open( READDATA, "slk" ) or die $!; -- no need to run a "ls" in a sub-shell to get the file name!
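
    As a minimal sketch of opening a known file name directly, the helper below uses the three-argument form of open with a lexical filehandle. That is a more modern idiom than the bareword handles used elsewhere in this thread, and the name read_lines is hypothetical, introduced only for illustration:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not from the thread): open a known file name
    # directly, with no sub-shell, and return its lines without newlines.
    sub read_lines {
        my ($path) = @_;
        open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
        chomp(my @lines = <$fh>);    # chomp applies to every element
        close $fh;
        return @lines;
    }
    ```

    Including $! in the die message reports why the open failed (missing file, permissions, and so on), which the original "Cannot Open file" messages did not.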

    And the second "chomp" line doesn't do what you think (neither would the first one, if "slk" happened to be a directory); if the "ls *.csv" is likely to return more than one file, then you want to assign its output to an array:

    @phi2 = `ls *.csv`; chomp @phi2; # chomp will apply to every element of the array
    (Of course, if there's only one *.csv file, then your usage as shown in the OP would work, just by coincidence.)
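
    If the goal is only to collect the *.csv file names, Perl's built-in glob can do that without running "ls" at all. This is a swapped-in alternative to the backtick approach above, not something from the original post, and csv_files is a hypothetical name:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: list the *.csv files in a directory using the
    # built-in glob instead of `ls` in a sub-shell. Assumes $dir contains
    # no whitespace, because glob splits its pattern on whitespace.
    sub csv_files {
        my ($dir) = @_;
        my @files = sort glob("$dir/*.csv");
        return @files;
    }
    ```

    A caller that expects exactly one file can then check scalar(@files) == 1 and die with a clear message otherwise, rather than silently building a bad file name.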

    In general, if you have just two files that the script is supposed to deal with, it makes more sense to have the two file names provided as command-line arguments (available to the script as @ARGV). The script can print to STDOUT, and you can redirect that on the command line to some other file. So the command line would look like this:

    your_script.pl slk someinput.csv > /home/Aug18.csv
    And the script would look like this (I haven't tested it, since I don't have appropriate data, but it does compile):
    #!/usr/bin/perl -w
    use strict;

    die "Usage: $0 infile1 infile2\n" unless (@ARGV==2 and -f $ARGV[0]);
    my ($infile1, $infile2) = @ARGV;

    open( READDATA, $infile1 ) or die "Cannot open $infile1: $!\n";
    my %h = ();
    while (<READDATA>) {
        if ( /(\d{4}\w\d?)\s+(\w*)/ ) {
            my ( $port, $lsn ) = ( $1, $2 );
            $h{$port} = $lsn;
        }
    }
    close READDATA;

    open( READDATA2, $infile2 ) or die "Cannot open $infile2: $!\n";
    while (<READDATA2>) {
        chomp;
        if ( /(\d{4}\w\d?)/ and exists( $h{$1} )) {
            print "$_,$h{$1}\n";
        }
    }
    Note that while tcf22's reply got to the core issue (just look for the existence of a matching hash key -- don't loop over the entire hash -- when reading each line of the second file), tcf22's suggested code might not do the same thing as your original code. (And tcf22's suggestion kept your strange handling of file names, which makes me wonder...)

    You didn't say exactly what was in the second file that you are reading, and tcf22 assumed that each line of that file contained just a string that would match a "port" string. But according to your original code, if the first file had a port string like "1234x", then you should get a match when the second file has a line like "foo bar 1234x baz".

    For that matter, if file1 mentions two ports, "1234x" and "1234x5", and file2 contains a line with "1234x5", then your original code (looping over all the hash keys) would print two matches, because the pattern /1234x/ also matches inside "1234x5" -- so in that sense, the code I'm suggesting above doesn't really match your original script's behavior either. But I think mine works the way you intended: using the same regex when reading both files, and then using $1 as the hash key, will always yield just the single exact match. (Now, you just need to worry about whether one or the other file happens to contain multiple lines with the same port pattern.)
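
    The three matching strategies discussed in this thread can be compared side by side with a small sketch. The sub names and the sample data are made up for illustration; only the regexes come from the posts above:

    ```perl
    use strict;
    use warnings;

    # Original approach: try every hash key as a pattern against the line.
    sub match_by_loop {
        my ($line, $h) = @_;
        return sort grep { $line =~ /$_/ } keys %$h;
    }

    # tcf22's approach: the whole line must itself be a hash key.
    sub match_whole_line {
        my ($line, $h) = @_;
        return exists $h->{$line} ? ($line) : ();
    }

    # graff's approach: extract the port with the same regex used for
    # file 1, then do a single hash lookup on the captured string.
    sub match_by_capture {
        my ($line, $h) = @_;
        if ( $line =~ /(\d{4}\w\d?)/ ) {
            my $port = $1;
            return ($port) if exists $h->{$port};
        }
        return ();
    }

    # Made-up sample data: two ports where one is a prefix of the other.
    my %h = ( '1234x' => 'LSN_A', '1234x5' => 'LSN_B' );
    for my $line ( 'foo bar 1234x5 baz', '1234x' ) {
        printf "%-20s loop=[%s] whole=[%s] capture=[%s]\n", $line,
            join( ',', match_by_loop( $line, \%h ) ),
            join( ',', match_whole_line( $line, \%h ) ),
            join( ',', match_by_capture( $line, \%h ) );
    }
    ```

    With this sample data, the line "foo bar 1234x5 baz" matches both keys under the looping approach, no key under the whole-line lookup, and only "1234x5" under the capture-then-lookup approach.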

    Finally, please do note how my example differs from yours in terms of indentation -- this is important for legibility. And using $_ is not a bad thing, it is a Good Thing. (I don't know why we've been seeing such a slew of recent SoPW posts that seem unwilling to use $_.)