rerunn has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a problem which I have been struggling with and have basically been running in circles trying to solve it.

I have 2 files, (file_A and file_B) consisting many lines of records. Each record is in this format:

field_1:field_2:field_3:field_4:field_5

What i need to do is take file_A and compare field_1 in each line with field_1 in file_B. If an occurence of field_1 does not exist in file_B, I must remove the line from file_A.

This first part was simple enough -- I made a hash of arrays. field_1 was the key and the whole line was the array/value. I then compared the keys in each array to yank out the lines. So this leaves me %file_A with the correct lines taken out. What I need to do next is take this %file_A and then do the same thing with field_4. I attempted loop through %file_A comparing field_4 in the array with its counterpart in %file_B. It worked but was painfully slow. There must be a better way. There is also a file_C involved and this involves only keeping the lines in file_A that dont have the corresponding field_3 value -- I can wait till I figure out the first one.

Hopefully someone didnt get painfully confused enough to answer my question :)

Thanks and warm regards,
rerunn.

Replies are listed 'Best First'.
Re: hash jerkin
by BrowserUk (Patriarch) on Feb 12, 2003 at 03:59 UTC

    You should be able to do the whole thing including file_C in one pass of file_A.

    #!perl -slw use strict; use vars qw[$NEW_A]; use Data::Dumper; use Inline::Files '-backup'; local $,=', '; my %b = map{ ( '1'.$_->[0] => 1, '4'.$_->[1] => 1, ) } map{ [ (split':')[0,3] ] } <FILE_B>; my %c = map{ $_ => 1 } map{ (split':')[2] } <FILE_C>; #print Dumper \%b, \%c; open NEW_A, "> $NEW_A" or die $!; while(<FILE_A>) { chomp; my @fields = split':'; print( "Removing '$_' from FILE_A because field 1:'$fields[0]' not + in FILE_B"), next if !exists $b{ '1'.$fields[0] }; print( "Removing '$_' from FILE_A because field 4:'$fields[3]' not + in FILE_B"), next if !exists $b{ '4'.$fields[3] }; print( "Removing '$_' from FILE_A because field 3:'$fields[2]' not + in FILE_C"), next if !exists $c{ $fields[2] }; print NEW_A $_, $/; } __FILE_A__ 1:*:Z:A:* 2:*:Y:B:* 3:*:X:C:* 4:*:Z:D:* 5:*:Y:E:* 6:*:X:F:* 7:*:T:G:* __FILE_B__ 1:*:*:A:* 2:*:*:B:* 3:*:*:C:* 4:*:*:D:* 5:*:*:E:* 6:*:*:F:* __FILE_C__ *:*:Z:*:* *:*:Y:*:* *:*:T:*:* __NEW_A__ 1:*:Z:A:* 2:*:Y:B:* 4:*:Z:D:* 5:*:Y:E:*

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: hash jerkin
by Cabrion (Friar) on Feb 11, 2003 at 23:49 UTC
    open A, "<fileA" || die $!; open B, "<fileA" || die $!; open OUT, ">fileAtmp" || die $!; my ($bufA, $bufB); while ($bufA = <A>) { $bufB = <B> || last; #jump out if file B is shorter than A my @A = split ':', $bufA; my @B = split ':', $bufB; print OUT $bufA if $A[0] eq $B[0]; } close A; close B; close OUT;
    The rest is left as an exercise for the reader. You should get the gist.
      open A, "<fileA" || die $!;

      Aherm, ITYM or not ||. Precedence matters.

Re: hash jerkin
by mikezone (Novice) on Feb 11, 2003 at 23:49 UTC

    I'm kinda confused as to what you did, so I'm going to try to write a translation of what you say in Perl:

    my $file_A = read_file( "file_A" ); my $file_B = read_file( "file_B" ); foreach (keys %$file_A) { delete $file_A->{ $_ } unless exists( $file_B->{ $_ } ); } sub read_file { my $file = shift; open FILE, $file or die "Can't open '$file' for reading $!"; my @file = <FILE>; close FILE; my $hoa = {}; foreach ( @file ) { my (@fields) = split /:/; my $key = $fields[0]; $hoa->{ $key } = [ @fields ]; } return $hoa; }

    If this is how you parsed the files, then it seems to me you could easily do the following:

    sub change_field_to_check { my $hoa = shift; my $field = shift; $field -= 1; my $new_arrangement; foreach (keys %$hoa) { $new_arrangement->{ $hoa->{ $_ }[ $field ] } = $hoa->{ $_ }; } return $new_arrangement; } $file_A = change_field_to_check( $file_A, 4 ); # Field 4 $file_B = change_field_to_check( $file_B, 4 ); # Field 4 foreach (keys %$file_A) { delete $file_A->{ $_ } unless exists( $file_B->{ $_ } ); }

    This ought to extrapolate to other files and other fields. If your fields are named something else, having a hash that translates the field name to their order will give you the right key.

    Hope this helps.

    - m.

Re: hash jerkin
by steves (Curate) on Feb 12, 2003 at 02:36 UTC

    If order of lines in file_B is not important, and if there's no one-to-one line relationship order-wise between file_A and file_B, I'd change your algorithm to this:

    • Read file_B and hash each line twice (two hashes): a hash for field_1 and a hash for field_4. Hash values are just 1 (true) for "found it in file_B".
    • Read file_C and hash its field_3 the same way.
    • Now open file_A and an output file of the desired lines. For each line in file_A, split the line. Now you just compare the split values to each hash and decide whether to write it to the temporary file or not. Write the lines you want to keep to the temporary file.
Re: hash jerkin
by JamesNC (Chaplain) on Feb 12, 2003 at 05:11 UTC
    Here is another way using Tie::File
    use Tie::File; use strict; my (@A, @B, @C); my ($field1, $A, $B); tie @B, 'Tie::File', 'txtA.txt' or die "couldn't TiE $!\n"; tie @A, 'Tie::File', 'txtB.txt' or die "couldn't TiE $!\n"; foreach $A (@A) { my ($field1) = split/:/, $A; print $field1; foreach $B (@B){ if($B=~/^$field1/){ #found match push it onto C push @C, $A;} } } ## We remove those that didn't match by replacing @A with @C :) @A = @C; print @A;
    Here are the test files
    txtA.txt
    filea:file2:file1 file2:something:old
    txtB.txt
    fileB:file2:file3 filea:file2:file1 file2:filex:file23 file34:file2:file3 filea:file2:file1 file2:filex:file23
    Cheers, James Edit:removed a label