chavanak has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have two arrays that I have to compare. E.g.:
Array1
ATOM 2198 [b]SG CYS L 51[/b] 39.781 -12.827 5.691 1.00 26.67
ATOM 2199 N MET L 52 37.845 -15.766 5.722 1.00 33.08
ATOM 2200 CA MET L 52 38.312 -17.144 5.674 1.00 33.08
ATOM 2201 C MET L 52 37.329 -18.022 4.901 1.00 33.08

Array2
ATOM 2212 [b] CB MET L 52[/b] 17.332 94.112 87.029 1.00 0.00
ATOM 2213 CG MET L 52 18.017 94.866 88.170 1.00 0.00
ATOM 2214 SD MET L 52 18.711 96.457 87.699 1.00 0.00
ATOM 2215 CE MET L 52 17.198 97.429 87.820 1.00 0.00
ATOM 2216 N ARG L 53 19.331 91.671 87.132 1.00 0.00

I am supposed to remove elements from array2 that are not present in array1. My problem is that I have to judge the differences based only on the bold text above, i.e. the Perl program should compare both arrays and, if the bold part is common to both, remove that element from array2. I don't understand how to tell Perl to look only at the bold text and ignore the remaining text and numbers. Can anyone help me? Cheers

Re: Accessing secondary elements in array
by johngg (Canon) on Nov 04, 2009 at 15:07 UTC

    If I've understood correctly, you could construct a hash keyed by the four columns of interest. If you split the line on whitespace and slice out the columns and join them again with some delimiter (I chose a colon) you construct the key. I have added a line to your "array 2" data with a common "bold part" so you can see that it gets removed.

    Note that I have used Data::Dumper so that you can see the lookup hash and resultant @array2. Here's the code.

    use strict;
    use warnings;

    use Data::Dumper;

    # Read "array 1" from an in-memory heredoc.
    open my $array1FH, q{<}, \ <<'EOF1' or die qq{open: < HEREDOC 1: $!\n};
    ATOM 2198 SG CYS L 51 39.781 -12.827 5.691 1.00 26.67
    ATOM 2199 N MET L 52 37.845 -15.766 5.722 1.00 33.08
    ATOM 2200 CA MET L 52 38.312 -17.144 5.674 1.00 33.08
    ATOM 2201 C MET L 52 37.329 -18.022 4.901 1.00 33.08
    EOF1

    my @array1 = <$array1FH>;
    close $array1FH or die qq{close: < HEREDOC 1: $!\n};

    # Build the lookup hash, keyed by columns 3 to 6 joined with a colon.
    my %array1Lookup =
        map { join( q{:}, ( split )[ 2 .. 5 ] ), 1 }
        @array1;
    print Data::Dumper->Dumpxs( [ \ %array1Lookup ], [ qw{ *array1Lookup } ] );

    # Read "array 2"; the last line shares its key with a line in "array 1".
    open my $array2FH, q{<}, \ <<'EOF2' or die qq{open: < HEREDOC 2: $!\n};
    ATOM 2212 CB MET L 52 17.332 94.112 87.029 1.00 0.00
    ATOM 2213 CG MET L 52 18.017 94.866 88.170 1.00 0.00
    ATOM 2214 SD MET L 52 18.711 96.457 87.699 1.00 0.00
    ATOM 2215 CE MET L 52 17.198 97.429 87.820 1.00 0.00
    ATOM 2216 N ARG L 53 19.331 91.671 87.132 1.00 0.00
    ATOM 2217 CA MET L 52 19.331 91.671 87.132 1.00 0.00
    EOF2

    # Keep only the lines whose key is not in the lookup hash.
    my @array2 = ();
    while ( <$array2FH> )
    {
        chomp;
        my $lookupKey = join q{:}, ( split )[ 2 .. 5 ];
        next if $array1Lookup{ $lookupKey };
        push @array2, $_;
    }
    close $array2FH or die qq{close: < HEREDOC 2: $!\n};

    print Data::Dumper->Dumpxs( [ \ @array2 ], [ qw{ *array2 } ] );

    The output.

    %array1Lookup = (
                      'C:MET:L:52' => 1,
                      'N:MET:L:52' => 1,
                      'CA:MET:L:52' => 1,
                      'SG:CYS:L:51' => 1
                    );
    @array2 = (
                'ATOM 2212 CB MET L 52 17.332 94.112 87.029 1.00 0.00',
                'ATOM 2213 CG MET L 52 18.017 94.866 88.170 1.00 0.00',
                'ATOM 2214 SD MET L 52 18.711 96.457 87.699 1.00 0.00',
                'ATOM 2215 CE MET L 52 17.198 97.429 87.820 1.00 0.00',
                'ATOM 2216 N ARG L 53 19.331 91.671 87.132 1.00 0.00'
              );

    I hope I have guessed correctly and this is of some help.

    Cheers,

    JohnGG

    Update: Added missing @ sigil to array2 in 2nd paragraph.
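
Since the original question starts from two arrays that are already in memory, the same lookup-hash idea can also be sketched without the heredoc filehandles. This is only a minimal sketch: the helper name `atom_key` and the two-line sample arrays are made up for illustration, and it assumes the columns of interest are always fields 3 to 6 of a whitespace-separated line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a key from
# fields 3..6 (indices 2..5) of a whitespace-separated ATOM record.
sub atom_key {
    my ($line) = @_;
    return join q{:}, ( split q{ }, $line )[ 2 .. 5 ];
}

# Shortened sample data, taken from the records quoted above.
my @array1 = (
    'ATOM 2198 SG CYS L 51 39.781 -12.827 5.691 1.00 26.67',
    'ATOM 2200 CA MET L 52 38.312 -17.144 5.674 1.00 33.08',
);
my @array2 = (
    'ATOM 2212 CB MET L 52 17.332 94.112 87.029 1.00 0.00',
    'ATOM 2217 CA MET L 52 19.331 91.671 87.132 1.00 0.00',
);

# Record every key that occurs in @array1.
my %seen = map { ( atom_key($_) => 1 ) } @array1;

# Keep only the @array2 records whose key does not occur in @array1.
my @kept = grep { !$seen{ atom_key($_) } } @array2;

print "$_\n" for @kept;
```

With this sample data only the CB record should be left in `@kept`, because the CA record shares its key with a line in `@array1`. If the opposite filter is wanted (keep only the common records), dropping the `!` in the `grep` block would do that.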

Re: Accessing secondary elements in array
by JavaFan (Canon) on Nov 04, 2009 at 12:03 UTC
    I would use a regexp to extract the "bold" part, and compare that. Which part do you have a problem with? The regexp? Comparing two strings? Intersecting the array (for that: see the perlfaq)?
      The problem for me is in regexp and intersection. To be very honest I have no idea how to use regexp for this particular task :( I am very new to perl so any guidance to material or example code will be really helpful
        A few assumptions: "bold text" is the part of the text that is surrounded by [b] and [/b]; "bold text" isn't nested inside "bold text"; and there's at most one piece of "bold text" per string. Strings not containing any bold text are to be ignored.

        Then I would do something like (not tested):

        my %seen;
        m{\[b\](.*?)\[/b\]} and $seen{$1} = 1 for @array1;
        my @result = grep {m{\[b\](.*?)\[/b\]} && !$seen{$1}} @array2;
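
For anyone who wants to try the regexp approach on real strings, here is a self-contained sketch of the same idea. It rests on assumptions that are not in the thread: every record is taken to carry its own [b]...[/b] marker (in the question only the first line of each array is marked), the sample arrays are shortened, and the pattern additionally trims whitespace just inside the tags so that "[b] CB MET L 52[/b]" and "[b]CB MET L 52[/b]" give the same key.

```perl
use strict;
use warnings;

# Capture the text between [b] and [/b], ignoring whitespace
# directly inside the tags (an addition to the pattern above).
my $bold = qr{\[b\]\s*(.*?)\s*\[/b\]};

# Shortened sample data; the [b] markers on every line are an assumption.
my @array1 = (
    'ATOM 2198 [b]SG CYS L 51[/b] 39.781 -12.827 5.691 1.00 26.67',
    'ATOM 2200 [b]CA MET L 52[/b] 38.312 -17.144 5.674 1.00 33.08',
);
my @array2 = (
    'ATOM 2212 [b] CB MET L 52[/b] 17.332 94.112 87.029 1.00 0.00',
    'ATOM 2217 [b]CA MET L 52[/b] 19.331 91.671 87.132 1.00 0.00',
);

# Remember the bold part of every string in @array1.
my %seen;
for my $line (@array1) {
    $seen{$1} = 1 if $line =~ $bold;
}

# Keep the @array2 strings that have a bold part not seen in @array1;
# strings without any bold part are dropped, as assumed above.
my @result = grep { $_ =~ $bold && !$seen{$1} } @array2;

print "$_\n" for @result;
```

With this sample data only the CB record should survive, since the CA record has the same bold part as a string in `@array1`.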