Column Comparison of a File in Perl

snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file with a format like this:

A class 1 1 1 1 1 1
A id 12 12 15 15 16 16 
B class 0 0 0 0 0 0 
B id 0 0 0 0 0 0 0 
C id 0 0 0 0 0 0 0 
M X1 1 2 2 2 2 2 
M X2 2 1 1 1 1 1 
M X3 1 2 2 2 2 2 
M X4 2 2 2 1 1 2 
M X5 1 1 1 1 1 1 
M X6 1 1 1 2 2 1 
M X7 1 1 1 1 1 1 
M X8 1 1 1 1 1 1 
M X9 1 1 1 1 1 1 
M X10 1 1 1 1 1 1 
M X11 1 2 1 1 1 1 
M X12 2 2 2 2 2 2 
M X13 2 1 2 2 2 2 
M X14 2 1 2 2 2 2 
M X15 1 2 1 1 1 1 
M X16 1 1 2 2 2 2 
M X17 1 2 2 2 2 2

This is a subset of a much larger file, and I need some help writing a Perl program to process it. 1. I would like to delete all rows EXCEPT the A id row followed by numbers and the M X<number> rows followed by numbers. The file is too big to read into an array and loop over with foreach, so how should I read it? How do I do this with a while loop, reading line by line? The output should be as follows:

A id 12 12 15 15 16 16 
M X1 1 2 2 2 2 2 
M X2 2 1 1 1 1 1 
M X3 1 2 2 2 2 2 
M X4 2 2 2 1 1 2 
M X5 1 1 1 1 1 1 
M X6 1 1 1 2 2 1 
M X7 1 1 1 1 1 1 
M X8 1 1 1 1 1 1 
M X9 1 1 1 1 1 1 
M X10 1 1 1 1 1 1 
M X11 1 2 1 1 1 1 
M X12 2 2 2 2 2 2 
M X13 2 1 2 2 2 2 
M X14 2 1 2 2 2 2 
M X15 1 2 1 1 1 1 
M X16 1 1 2 2 2 2 
M X17 1 2 2 2 2 2

2. Since the columns hold only two values, 1 or 2, I would like to find values that are identical in the same row but sit in different columns. For example:

A id  12(F.C.)  12(S.C.)  15(F.C.)  15(S.C.)  16(F.C.)  16(S.C.)
M X1     1         2         2         2         2         2
M X2     2         1         1         1         1         1
M X3     1         2         2         2         2         2

This is a subset of the data above, where F.C. means First Column and S.C. means Second Column (added only to be more descriptive). Here the second column of 12 is identical to the first and second columns of 15 and 16. I would therefore like to find the longest stretch over which two columns are identical, and to do the same for all the other columns. Note that I can't compare the first column of 12 with the second column of 12. I would be obliged for any help. Any ideas and, if possible, snippets of code would be highly appreciated. Thanks a lot.


Replies are listed 'Best First'.
Re: Column Comparison of a File in Perl by bv (Friar) on Jan 19, 2010 at 21:43 UTC
Your first problem is definitely the easiest. To loop over each line of the file, use `while (<>)` like so: `open my $outfile, '>', 'out.txt' or die "Cannot open out.txt: $!"; while (<>) { print $outfile $_ if /^(?:A id|M X)/; } close $outfile;` Obviously you may need a more rigorous regular expression, but this works for your example data. Your second problem is more difficult, so it'll have to wait for now.

Update: On second glance, your second problem is not well stated. What kind of comparison do you want to do? What should the output look like? I'm afraid I can't help with a poorly stated problem.

`print map{substr'hark, suPerJacent other l',$_,1}(11,7,6,16,5,1,15,18..23,8..10,24,17,0,12,13,3,14,2,4);`
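
As a sketch of what a more rigorous regular expression might look like: the pattern below assumes whitespace-separated fields whose values are all digits, which is only a guess from the sample data. `filter_rows` and `$KEEP` are hypothetical names, and the in-memory sample stands in for the real file.

```perl
use strict;
use warnings;

# Assumption: fields are whitespace-separated and every value is digits only.
# Keeps "A id <numbers>" and "M X<number> <numbers>" rows, nothing else.
my $KEEP = qr/^(?:A\s+id|M\s+X\d+)(?:\s+\d+)+\s*$/;

# Copy matching rows from $in to $out one line at a time, so the
# whole file is never held in memory.
sub filter_rows {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} $line if $line =~ $KEEP;
    }
    return;
}

# Demo on an in-memory sample. For a real file, open it instead:
#   open(my $in, '<', $path) or die "Cannot open $path: $!";
my $sample = <<'END';
A class 1 1 1 1 1 1
A id 12 12 15 15 16 16
B id 0 0 0 0 0 0 0
M X1 1 2 2 2 2 2
M Xtra 1 2
END

open my $in, '<', \$sample or die "Cannot open sample: $!";
filter_rows($in, \*STDOUT);
close $in;
```

Compared with the looser prefix match, this version also rejects rows such as `M Xtra 1 2` that merely start with `M X`. Whether that strictness is wanted depends on what the full file actually contains.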
Re: Column Comparison of a File in Perl by toolic (Bishop) on Jan 19, 2010 at 21:46 UTC
Problem #1

This code will read your input file 1 line at a time, and filter out just the lines you require, without reading the whole file into memory at once. You would need to redirect the output to a file.

use strict;
use warnings;

my $flag = 0;
while (<DATA>) {
    if (/^A id/) {
        print;
        $flag = 1;
    }
    print if $flag and /^M/;
}

__DATA__
A class 1 1 1 1 1 1
A id 12 12 15 15 16 16
B class 0 0 0 0 0 0
B id 0 0 0 0 0 0 0
C id 0 0 0 0 0 0 0
M X1 1 2 2 2 2 2
M X2 2 1 1 1 1 1
M X3 1 2 2 2 2 2
M X4 2 2 2 1 1 2
M X5 1 1 1 1 1 1
M X6 1 1 1 2 2 1
M X7 1 1 1 1 1 1
M X8 1 1 1 1 1 1
M X9 1 1 1 1 1 1
M X10 1 1 1 1 1 1
M X11 1 2 1 1 1 1
M X12 2 2 2 2 2 2
M X13 2 1 2 2 2 2
M X14 2 1 2 2 2 2
M X15 1 2 1 1 1 1
M X16 1 1 2 2 2 2
M X17 1 2 2 2 2 2

Problem #2

I'm trying to understand your comparison requirements. Could you elaborate?

Update: Here's a guess...

use strict;
use warnings;

while (<DATA>) {
    next if /^A id/;
    my ($x, $c1, $c2, @cols) = (split)[1..7];
    print "$x: col 12, 1st: $c1\n";
    for my $col (@cols) {
        if ($c1 == $col) {
            print " matches\n";
        }
        else {
            print " does not match\n";
        }
    }
    print "$x: col 12, 2nd: $c2\n";
    for my $col (@cols) {
        if ($c2 == $col) {
            print " matches\n";
        }
        else {
            print " does not match\n";
        }
    }
}

__DATA__
A id 12 12 15 15 16 16
M X1 1 2 2 2 2 2
M X2 2 1 1 1 1 1
M X3 1 2 2 2 2 2
M X4 2 2 2 1 1 2

Prints...

X1: col 12, 1st: 1
 does not match
 does not match
 does not match
 does not match
X1: col 12, 2nd: 2
 matches
 matches
 matches
 matches
X2: col 12, 1st: 2
 does not match
 does not match
 does not match
 does not match
X2: col 12, 2nd: 1
 matches
 matches
 matches
 matches
X3: col 12, 1st: 1
 does not match
 does not match
 does not match
 does not match
X3: col 12, 2nd: 2
 matches
 matches
 matches
 matches
X4: col 12, 1st: 2
 matches
 does not match
 does not match
 matches
X4: col 12, 2nd: 2
 matches
 does not match
 does not match
 matches
Re^2: Column Comparison of a File in Perl by snape (Pilgrim) on Jan 19, 2010 at 22:06 UTC
Thanks a lot for the reply, and sorry for the bad explanation. Since the columns hold only two values, 1 or 2, I would like to find values that are identical in the same row but sit in different columns. For example:

A id  12(F.C.)  12(S.C.)  15(F.C.)  15(S.C.)  16(F.C.)  16(S.C.)
M X1     1         2         2         2         2         2
M X2     2         1         1         1         1         1
M X3     1         2         2         2         2         2

This is a subset of the data above, where F.C. means First Column and S.C. means Second Column (added only to be more descriptive). Here the second column of 12 is identical to the first and second columns of 15 and 16. I would therefore like to find the longest stretch over which two columns are identical, and to do the same for all the other columns. Note that I can't compare the first column of 12 with the second column of 12.
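
One possible reading of this request, sketched under assumptions rather than confirmed by the poster: for every pair of columns that belong to different ids, find the longest run of consecutive M rows in which the two columns hold the same value. `longest_runs` and the `id.position` column labels (for example `12.1` for 12 F.C. and `12.2` for 12 S.C.) are hypothetical names invented for this example.

```perl
use strict;
use warnings;

# Assumed reading: "longest stretch" means the longest run of consecutive
# M rows in which two columns (from different ids) are equal.
sub longest_runs {
    my ($in) = @_;
    my (@ids, @labels, %run, %best);

    while (my $line = <$in>) {
        my @f = split ' ', $line;
        next unless @f > 2;

        if ($f[0] eq 'A' && $f[1] eq 'id') {
            # Label each column as id.position, e.g. 12.1 and 12.2.
            @ids = @f[2 .. $#f];
            my %seen;
            @labels = map { $_ . '.' . ++$seen{$_} } @ids;
            next;
        }
        next unless $f[0] eq 'M' && @ids;

        my ($marker, @v) = @f[1 .. $#f];
        my $last = $#v < $#ids ? $#v : $#ids;

        for my $i (0 .. $last - 1) {
            for my $j ($i + 1 .. $last) {
                next if $ids[$i] eq $ids[$j];    # same id: never compared
                my $key = "$labels[$i] vs $labels[$j]";
                if ($v[$i] eq $v[$j]) {
                    # Extend the current run, or start a new one at this row.
                    my $r = $run{$key} ||= { len => 0, start => $marker };
                    $r->{len}++;
                    $r->{end} = $marker;
                    $best{$key} = { %$r }
                        if !$best{$key} || $r->{len} > $best{$key}{len};
                }
                else {
                    delete $run{$key};           # a mismatch ends the run
                }
            }
        }
    }
    return \%best;
}

# Demo on an in-memory sample taken from the question.
my $sample = <<'END';
A id 12 12 15 15 16 16
M X1 1 2 2 2 2 2
M X2 2 1 1 1 1 1
M X3 1 2 2 2 2 2
M X4 2 2 2 1 1 2
END

open my $in, '<', \$sample or die "Cannot open sample: $!";
my $best = longest_runs($in);
close $in;

for my $pair (sort keys %$best) {
    my $r = $best->{$pair};
    printf "%-14s len %d (%s to %s)\n", $pair, @{$r}{qw(len start end)};
}
```

This prints one line per column pair that matched in at least one row. The state kept between lines grows with the number of column pairs, not the number of rows, so it should still work when the file is read line by line. If "longest stretch" means something else, the comparison inside the inner loop is the part to change.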