hash jerkin

rerunn has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a problem which I have been struggling with and have basically been running in circles trying to solve it.

I have 2 files, (file_A and file_B) consisting many lines of records. Each record is in this format:

field_1:field_2:field_3:field_4:field_5

What i need to do is take file_A and compare field_1 in each line with field_1 in file_B. If an occurence of field_1 does not exist in file_B, I must remove the line from file_A.

This first part was simple enough -- I made a hash of arrays. field_1 was the key and the whole line was the array/value. I then compared the keys in each array to yank out the lines. So this leaves me %file_A with the correct lines taken out. What I need to do next is take this %file_A and then do the same thing with field_4. I attempted loop through %file_A comparing field_4 in the array with its counterpart in %file_B. It worked but was painfully slow. There must be a better way. There is also a file_C involved and this involves only keeping the lines in file_A that dont have the corresponding field_3 value -- I can wait till I figure out the first one.

Hopefully someone didnt get painfully confused enough to answer my question :)

Thanks and warm regards,
rerunn.

Comment on hash jerkin

Replies are listed 'Best First'.
Re: hash jerkin by BrowserUk (Patriarch) on Feb 12, 2003 at 03:59 UTC
You should be able to do the whole thing including file_C in one pass of file_A. #!perl -slw use strict; use vars qw[$NEW_A]; use Data::Dumper; use Inline::Files '-backup'; local $,=', '; my %b = map{ ( '1'.$_->[0] => 1, '4'.$_->[1] => 1, ) } map{ [ (split':')[0,3] ] } <FILE_B>; my %c = map{ $_ => 1 } map{ (split':')[2] } <FILE_C>; #print Dumper \%b, \%c; open NEW_A, "> $NEW_A" or die $!; while(<FILE_A>) { chomp; my @fields = split':'; print( "Removing '$_' from FILE_A because field 1:'$fields[0]' not + in FILE_B"), next if !exists $b{ '1'.$fields[0] }; print( "Removing '$_' from FILE_A because field 4:'$fields[3]' not + in FILE_B"), next if !exists $b{ '4'.$fields[3] }; print( "Removing '$_' from FILE_A because field 3:'$fields[2]' not + in FILE_C"), next if !exists $c{ $fields[2] }; print NEW_A $_, $/; } __FILE_A__ 1::Z:A: 2::Y:B: 3::X:C: 4::Z:D: 5::Y:E: 6::X:F: 7::T:G: __FILE_B__ 1:::A:* 2:::B:* 3:::C:* 4:::D:* 5:::E:* 6:::F:* __FILE_C__ ::Z:: ::Y:: ::T:: __NEW_A__ 1::Z:A: 2::Y:B: 4::Z:D: 5::Y:E: [download] Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply] [d/l]
Re: hash jerkin by Cabrion (Friar) on Feb 11, 2003 at 23:49 UTC
`open A, "<fileA" \|\| die $!; open B, "<fileA" \|\| die $!; open OUT, ">fileAtmp" \|\| die $!; my ($bufA, $bufB); while ($bufA = <A>) { $bufB = <B> \|\| last; #jump out if file B is shorter than A my @A = split ':', $bufA; my @B = split ':', $bufB; print OUT $bufA if $A[0] eq $B[0]; } close A; close B; close OUT;` [download] The rest is left as an exercise for the reader. You should get the gist.	[reply] [d/l]
Re: Re: hash jerkin by Fletch (Bishop) on Feb 12, 2003 at 04:36 UTC
`open A, "<fileA" \|\| die $!;` Aherm, ITYM `or` not `\|\|`. Precedence matters.	[reply] [d/l] [select]
Re: hash jerkin by mikezone (Novice) on Feb 11, 2003 at 23:49 UTC
I'm kinda confused as to what you did, so I'm going to try to write a translation of what you say in Perl: `my $file_A = read_file( "file_A" ); my $file_B = read_file( "file_B" ); foreach (keys %$file_A) { delete $file_A->{ $_ } unless exists( $file_B->{ $_ } ); } sub read_file { my $file = shift; open FILE, $file or die "Can't open '$file' for reading $!"; my @file = <FILE>; close FILE; my $hoa = {}; foreach ( @file ) { my (@fields) = split /:/; my $key = $fields[0]; $hoa->{ $key } = [ @fields ]; } return $hoa; }` [download] If this is how you parsed the files, then it seems to me you could easily do the following: `sub change_field_to_check { my $hoa = shift; my $field = shift; $field -= 1; my $new_arrangement; foreach (keys %$hoa) { $new_arrangement->{ $hoa->{ $_ }[ $field ] } = $hoa->{ $_ }; } return $new_arrangement; } $file_A = change_field_to_check( $file_A, 4 ); # Field 4 $file_B = change_field_to_check( $file_B, 4 ); # Field 4 foreach (keys %$file_A) { delete $file_A->{ $_ } unless exists( $file_B->{ $_ } ); }` [download] This ought to extrapolate to other files and other fields. If your fields are named something else, having a hash that translates the field name to their order will give you the right key. Hope this helps. - m.	[reply] [d/l] [select]
Re: hash jerkin by steves (Curate) on Feb 12, 2003 at 02:36 UTC
If order of lines in file_B is not important, and if there's no one-to-one line relationship order-wise between file_A and file_B, I'd change your algorithm to this: Read file_B and hash each line twice (two hashes): a hash for field_1 and a hash for field_4. Hash values are just 1 (true) for "found it in file_B". Read file_C and hash its field_3 the same way. Now open file_A and an output file of the desired lines. For each line in file_A, split the line. Now you just compare the split values to each hash and decide whether to write it to the temporary file or not. Write the lines you want to keep to the temporary file.	[reply]
Re: hash jerkin by JamesNC (Chaplain) on Feb 12, 2003 at 05:11 UTC
Here is another way using Tie::File `use Tie::File; use strict; my (@A, @B, @C); my ($field1, $A, $B); tie @B, 'Tie::File', 'txtA.txt' or die "couldn't TiE $!\n"; tie @A, 'Tie::File', 'txtB.txt' or die "couldn't TiE $!\n"; foreach $A (@A) { my ($field1) = split/:/, $A; print $field1; foreach $B (@B){ if($B=~/^$field1/){ #found match push it onto C push @C, $A;} } } ## We remove those that didn't match by replacing @A with @C :) @A = @C; print @A;` [download] Here are the test files txtA.txt `filea:file2:file1 file2:something:old` [download] txtB.txt `fileB:file2:file3 filea:file2:file1 file2:filex:file23 file34:file2:file3 filea:file2:file1 file2:filex:file23` [download] Cheers, James Edit:removed a label	[reply] [d/l] [select]