Re: Seeking guidance on how to approach a task
by SamCG (Hermit) on Dec 16, 2005 at 22:08 UTC
I use this regularly to compare columns in tab-delimited files. It could be quite easily modified to suit your purposes.
I wouldn't normally just hand you code, but I had this already. If you want the results to go to a file instead of the screen, all you need to do is run compcolumns.pl file1 column1 file2 column2 > outputfile.txt:
#! perl -w
use strict;

my ($file, $col, $file2, $col2) = @ARGV;
my (%unique, %unique2);
my @column;
my ($x, $header);
my @fields;

# I put something like this in all my scripts so I don't need to remember argument order
if ($file eq "?") {
    die "usage: compcolumns.pl file1 column1 file2 column2\n";
}

## whenever you open a file, do yourself a favor and tell yourself if it doesn't open
open FIL, $file or die "Could not open $file: $!\n";
$x = 0;
%unique = ();
while (<FIL>) {
    chomp;
    ## The split here, a regular expression, is what you would change
    ## (at least if your data is based on space delimiting instead of tab-delimiting).
    @fields = split /\t/;
    if (defined($fields[$col])) {
        $header = $fields[$col] if $x == 0;
        $unique{$fields[$col]}++;
    }
    else {
        $unique{"BLANK"}++;
        $header = "BLANK";
    }
    $x++;
}

open FIL, $file2 or die "Could not open $file2: $!\n";
$x = 0;
while (<FIL>) {
    chomp;
    @fields = split /\t/;
    if (defined($fields[$col2])) {
        $header = $fields[$col2] if $x == 0;
        $unique2{$fields[$col2]}++;
    }
    else {
        $unique2{"BLANK"}++;
        $header = "BLANK";
    }
    $x++;
}

## The output routine. Note that if something's in both files, it will be printed twice.
foreach (sort keys %unique) {
    if (!exists $unique2{$_}) {
        print "$_ in $file but not $file2\n";
    }
    else {
        print "$_ in both files\n";
    }
}
foreach (sort keys %unique2) {
    if (!exists $unique{$_}) {
        print "$_ in $file2 but not $file\n";
    }
    else {
        print "$_ in both files\n";
    }
}
Since I usually get passed Excel files, I haven't had the occasion to work with 190,000-line files. I haven't had any slowness issues, though.
Re: Seeking guidance on how to approach a task
by injunjoel (Priest) on Dec 16, 2005 at 21:56 UTC
Greetings,
Untested Suggestion
#!/usr/bin/perl -w
use strict;

my %unique_values;
#open file two first and read in the values of interest
unless(open(F2, "file2name.txt")){
    die "a horrible death!";
}else{
    #line by line split off the first word i.e. user01, user02, etc
    %unique_values = map{ (split / /,$_)[0], undef } <F2>;
    close F2;
    unless(open(F1, "file1name.txt")){
        die "a horrible death!";
    }else{
        while(<F1>){
            chomp;
            if(exists($unique_values{$_})){
                delete $unique_values{$_};
            }
        }
        close F1;
    }
}
#open a file for output
unless(open(OUT, ">outputfile.txt")){
    die "again we fail!";
}else{
    print OUT $_."\n" for(sort keys %unique_values);
    close OUT;
}
Basically, open the second file first, since it's got the unique values of interest. Load the values into a hash for easy lookup, then open the first file. If a value is already in the hash, delete it! Whatever is left in the hash at the end is exactly the set of entries that appear only in the second file.
-InjunJoel
"I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
First, thank you to everyone for the replies. I apologize for the tardiness of my response back, but I've been through airport hell this weekend and I'm just getting back to the task at hand.
InjunJoel and SamCG:
Thank you for posting the code; I'll read through it and definitely learn from it!
davies:
A database is something I hadn't thought of. I might be able to get change control to let me install MySQL on the box, although I have minimal exposure to SQL.
TedPride:
This process will only have to be run once a night, so that helps, from what I'm seeing in your post.
graff:
That makes sense too! Thank you for the code example!
CountZero:
CPAN! Got it! I was reading some of that, but there is a lot to wade through!
ambrus:
I do have Cygwin installed on the box, so join is an option.
To all:
Thank you for being so informative with your replies. I have multiple paths to look at and learn from. I am reading man pages and the Camel book, but it's taking some time.
mn
Re: Seeking guidance on how to approach a task
by TedPride (Priest) on Dec 16, 2005 at 22:25 UTC
Untested, but it should work:
use strict;
use warnings;

my ($in, $out, %f1);

# file 1: one userid per line, stored as hash keys for fast lookup
open($in, '<', 'in1.dat') or die "in1.dat: $!";
while (<$in>) {
    chomp; $f1{$_} = ();
}
close($in);

# file 2: print only the lines whose first field isn't already known
open($out, '>', 'out.dat') or die "out.dat: $!";
open($in, '<', 'in2.dat') or die "in2.dat: $!";
while (<$in>) {
    print {$out} $_ if !exists $f1{(split /\t/)[0]};
}
close($in);
close($out);
I assumed your delimiter was \t. Go ahead and change that if I'm wrong.
I'm also assuming you don't have to run this a bunch of times simultaneously. If memory usage is a problem, you can just step through the files one line at a time and compare (since the files appear to be sorted already). But this should be easily doable with a hash, even in Perl.
EDIT: By "this" I meant the task at hand, not walking through the files one line at a time and comparing. Just thought I'd clarify my bad writing :)
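For what it's worth, here is a rough, untested sketch of that line-by-line walk, in case memory ever does become a problem. It assumes both files are already sorted on the userid and that in2.dat is tab-delimited (same made-up file names as above):
use strict;
use warnings;

# Untested sketch of the sorted, line-by-line comparison mentioned above.
# in1.dat: one userid per line; in2.dat: "userid<TAB>email"; both sorted.
open(my $f1, '<', 'in1.dat') or die "in1.dat: $!";
open(my $f2, '<', 'in2.dat') or die "in2.dat: $!";

my $id1 = <$f1>;
chomp $id1 if defined $id1;

while (my $line = <$f2>) {
    my ($id2) = split /\t/, $line;
    # advance file 1 until its userid catches up with the file 2 userid
    while (defined $id1 && $id1 lt $id2) {
        $id1 = <$f1>;
        chomp $id1 if defined $id1;
    }
    # no matching userid in file 1, so this record is new
    print $line unless defined $id1 && $id1 eq $id2;
}

close $f1;
close $f2;
Only one line from each file is held in memory at a time, which is the whole point of that approach.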
Re: Seeking guidance on how to approach a task
by CountZero (Bishop) on Dec 17, 2005 at 10:34 UTC
CPAN is your friend!
Using the List::Compare module:
use strict;
use List::Compare;

my @list_one = qw/user01 user02 user03/;
my %hash_two = qw/user01 user01@emailaddress.com
                  user03 user03@emailaddress.com
                  user04 user04@emailaddress.com/;
my @list_two = keys %hash_two;

my $lc = List::Compare->new(\@list_one, \@list_two);
my @results = map { $_ . ' ' . $hash_two{$_} } $lc->get_Ronly;
print join "\n", @results;
Result: user04 user04@emailaddress.com
Update: I quickly made two files of about 190,000 records and used those to see how long it took. Not including the reading in of the files, the script zipped through all 190,000 records in 14 seconds (on Windows XP, Athlon XP 1600+, ActiveState Perl 5.8.7, running within the Komodo editor).
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Seeking guidance on how to approach a task
by davies (Monsignor) on Dec 16, 2005 at 22:19 UTC
At the risk of stating the bleeping obvious, I would use a database for this. It's exactly the sort of task that most database engines are optimised for. While you might not want to store the data from run to run, even in this situation I would create a temporary database, populate it with the two files, and then use its query language to extract the data.
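For instance, here is an untested sketch of that idea. It assumes DBD::SQLite purely to keep the example self-contained (any engine with a LEFT JOIN will do), and the file names are made up:
use strict;
use warnings;
use DBI;

# Untested sketch: load both files into a throwaway database, then let
# the engine do the anti-join. DBD::SQLite is an assumption; swap in
# whatever driver you end up installing.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE current (userid TEXT PRIMARY KEY)');
$dbh->do('CREATE TABLE nightly (userid TEXT PRIMARY KEY, email TEXT)');

# file1.txt: one userid per line
my $ins1 = $dbh->prepare('INSERT INTO current VALUES (?)');
open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
while (<$f1>) { chomp; $ins1->execute($_); }
close $f1;

# file2.txt: "userid emailaddress" per line
my $ins2 = $dbh->prepare('INSERT INTO nightly VALUES (?, ?)');
open my $f2, '<', 'file2.txt' or die "file2.txt: $!";
while (<$f2>) { chomp; $ins2->execute(split ' ', $_, 2); }
close $f2;

# nightly records whose userid does not appear in the current list
my $rows = $dbh->selectall_arrayref(
    'SELECT n.userid, n.email FROM nightly n
     LEFT JOIN current c ON c.userid = n.userid
     WHERE c.userid IS NULL');
print "$_->[0] $_->[1]\n" for @$rows;
Once the data is in tables, other comparisons become one-line queries as well.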
Regards,
John Davies
While a worthy reply, it ignores some real issues. Primarily, does the OP have easy access to a database? If so, does the OP understand how they work and does he know how to structure the query? If not, does the OP really want to go through the trouble of learning how they work to solve this one problem? etc.
Re: Seeking guidance on how to approach a task
by graff (Chancellor) on Dec 17, 2005 at 08:22 UTC
Contrary to an earlier suggestion, I'd read "file 1" first, and store each "user*" string as a hash key -- there will be less data overall to keep in memory.
Then, reading "file 2", only print a line if its first token does not exist as a hash key. These are the records in file 2 that do not exist in file 1.
Put that way, the script can be very short, especially if you just give the file names as command line args:
use strict;

die "Usage: $0 file.1 file.2\n"
    unless ( @ARGV == 2 );    # test for number of args

my %known;

open( F, "<", $ARGV[0] ) or die "$ARGV[0]: $!";
while (<F>) {
    chomp;
    $known{$_} = undef;    # don't need a value, just the key
}
close F;

open( F, "<", $ARGV[1] ) or die "$ARGV[1]: $!";
while (<F>) {
    my $key = ( split ' ', $_ )[0];    # get first "word"
    print unless exists( $known{$key} );
}
close F;
That will print the targeted records to STDOUT, which you can redirect to a file via the command line:
little_perl_script.pl file.1 file.2 > target.set
(update: just noticed that this is identical to TedPride's suggestion, ignoring a few trivial, irrelevant differences. sorry about the redundancy)
Re: Seeking guidance on how to approach a task
by ambrus (Abbot) on Dec 17, 2005 at 17:04 UTC
There's an alternative way to do this kind of thing: use the standard Unix text utilities instead of Perl.
Of course, you might not want to do that if you want to learn perl.
So here's the method.
[am]king ~/a/tm$ cat emails
user01 user01@emailaddress.com
user03 user03@emailaddress.com
user04 user04@emailaddress.com
[am]king ~/a/tm$ cat names
user01
user02
user03
First, you sort the two files on their first field.
They appear sorted in this example, but it's always best to make sure.
[am]king ~/a/tm$ sort -k1 emails > emails.sort
[am]king ~/a/tm$ sort -k1 names > names.sort
Then, you use join to find the unpairable names in emails.sort.
[am]king ~/a/tm$ join -v1 -o0 emails.sort names.sort
user04
You might also be able to use the comm utility instead of join.
Update 2009 Sep 2: see Re^2: Joining two files on common field for a list of other nodes where the Unix text utilities are suggested for merging files.
Re: Seeking guidance on how to approach a task
by mark_nelson (Initiate) on Dec 19, 2005 at 21:30 UTC
I've been trying to digest the information above and I've been playing around with the suggestions with success.
In my original post, I did leave out some information, mainly because I didn't want the full "answer" to be given, as I have to put effort into this as well. In trying to apply the information above, as well as doing my own reading, I'm trying to understand arrays and how they can apply to the full need.
The one function that I did leave out was that I will also need to compare email addresses from the current list of users against a nightly generated file of the same format, to make sure that the email addresses haven't changed.
Example format of both files is:
userid emailaddress@here.com
I realize that this will be separate coding, but I was thinking I could use the same hashes, just adding the email addresses to them. Am I thinking along the correct lines?
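Something along these lines is what I have in mind, untested and with made-up file names, keeping the email address as the hash value instead of ignoring it:
use strict;
use warnings;

# Untested sketch: same hash idea as above, but keep the email address
# as the value so a changed address can be reported as well as a new user.
my %current;
open my $cur, '<', 'current_users.txt' or die "current_users.txt: $!";
while (<$cur>) {
    chomp;
    my ($id, $email) = split ' ', $_, 2;
    $current{$id} = $email;
}
close $cur;

open my $new, '<', 'nightly_users.txt' or die "nightly_users.txt: $!";
while (<$new>) {
    chomp;
    my ($id, $email) = split ' ', $_, 2;
    if (!exists $current{$id}) {
        print "NEW: $id $email\n";
    }
    elsif ($current{$id} ne $email) {
        print "CHANGED: $id was $current{$id}, now $email\n";
    }
}
close $new;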
mn