check for contiguous row values

gudluck has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract information of the rows which have contigous values in first column. Data has lines " THIS IS A BREAK" where I need to split it. as well as when the difference between number is more than one.(for example between 10000066 and 10000071). There should be atleast two or more contigous rows.
Input tab delimited file:

10000058        3DKG000004283       290.48
10000059        3DKG000004315       290.48
10000060        3DKG000004421       1693.9
10000061        3DKG000004543       3118.77
 THIS   IS      A      BREAK
10000062        3DiKG000004569      2372.94
10000063        3DiKG000004681      528.87
10000064        3DiKG000004741      187.54
10000065        3DiKG000004773      327.84
10000066        3DiKG000004879      1301.43
10000071        3DiKG000005165      17.94
10000072        3DiKG000005193      13.45
10000074        3DiKG000005261      14.33
10000076        3DiKG000005331      144 
 THIS   IS      A       BREAK
10000145        3DKG000007633       10.43
10000146        3DKG000007663       10.43
10000147        3DKG000007693       1224.8
10000148        3DKG000007727       1224.8
10000149        3DKG000007769       1359.73
10000162        3DKG000008189       307.62
10000163        3DKG000008231       307.62
10000164        3DKG000008261       14.69
[download]

OutPut should be start, end and count:

3DKG000004283 3DKG000004543 4
3DiKG000004569 3DiKG000004879 5
3DiKG000005165 3DiKG000005193 2
3DKG000007633 3DKG000007769 5
3DKG000008189 3DKG000008261 3
[download]

My code below is not giving me what i wanted:

#!/usr/bin/perl -w
open(IN, "input.txt") || die "Can't open output1.txt: $!";
open(FILE, ">output.txt")     || die "couldn't create the file\n";

while (<IN>) {
$lines = $_;
($n,$p) = $lines =~ /^(\d+)\t(.+)\n/;

 push(@num, $n);
 push(@data, $p);
}
close(IN);

$nlit = scalar(@num);

for($c=0;$c<=$nlit;$c++) {

$first = $num[$c];
$second = $num[$c+1];

$diff= $second-$first;

if ($diff <= 1)
    {
    push(@B, $data[$c]);
    push(@N, $num[$c]);
    }
elsif ($diff > 1)
    {
  if (scalar(@B) >=2)
  {
   $si = scalar(@B);

   @firstty = split /\t/, $B[0];
   @lastty = split /\t/, $B[$#B];

   print FILE "$firstty[0]\t$lastty[0]\t$si\n";
 }

undef @B;
undef @N;
}

}
close (IN);
[download]

Iam a newbie, Please correct my script. Thanks

Comment on check for contiguous row values Select or Download Code

Replies are listed 'Best First'.
Re: check for contiguous row values by kennethk (Abbot) on Aug 31, 2010 at 18:41 UTC
Welcome to the monastery. We are more than happy to help new members of the Perl community learn and grow. On line 7, your regular expression assumes your data is tab delimited. However, the input file you posted is fixed width. Was this a copy/paste error, or are you unfamiliar with different types of whitespace? In either case, the initial parse failure that this results in can be fixed by swapping `\t` (the tab character) to `\s+` (1 or more whitespace character), assuming column 1 will never contain whitespace. `($n,$p) = $lines =~ /^(\d+)\s+(.+)\n/;` On every iteration of your loop that didn't involve the record break, you reset your `@B` and `@N` arrays. Therefore, you can never accumulate enough records to hit your output loop. You likely mean to move those inside your `if (scalar(@B) >=2)` block. As per the first issue, you need to change you splits from `/\t/` to `/\s+/`. Making those changes to your script, I get the following output: `3DKG000004283 3DKG000004543 4 3DiKG000004569 3DiKG000004773 4 3DiKG000005165 3DiKG000005331 2 3DKG000007633 3DKG000007727 4` [download] as well as a number of warnings for uninitialized subtraction. This is not yet your goal, but is working toward it. Further corrections require modifications to your algorithm. What resources are you using for learning Perl? There are enough wholly unnecessary operations in the above that understanding where you are coming from (languages you know, resources available, ...) would help us guide you toward making this script work as well as improve your own style.	[reply] [d/l] [select]
Re: check for contiguous row values by pemungkah (Priest) on Aug 31, 2010 at 18:41 UTC
I would recommend a slight rethink of the problem. There are two things (as you've explained it) that need to happen: We need to capture lines that aren't "THIS IS A BREAK" and process them. We need to know when we've hit a break line. I would suggest that you change the logic to reflect that. Right now I don't see you ever checking at all for the break line, which is certainly not going to work. Try it this way (pseudo-code so you can get the practice writing the Perl): my @captured; While I still have records: Is this a break line? Y - check_for_more_breaks(@captured) @captured = () next item N - save line to @captured Anything left in @captured? check_for_more_breaks(@captured) sub check_for_more_breaks: # Your actual processing here is unclear; I've taken # a guess at it. my @subgroup; my $last_sequence_number; for each line: extract the fields is last_sequence number defined? N - save this one as the last sequence number save the current data of interest in @subgroup next else is there a gap in the sequence numbers? Y - summarize(@subgroup) @subgroup = (); save current sequence number as last save the current data of interest next [download] For `summarize(@subgroup)`, I don't quite get from your question what the processing you do is (record first and last in group and sum fields?). Anyway - your gap finding is missing, and you need to add logic for the two different kinds of gaps before you can summarize.	[reply] [d/l]
Re: check for contiguous row values by jwkrahn (Abbot) on Aug 31, 2010 at 20:15 UTC
This seems to work: `#!/usr/bin/perl use warnings; use strict; open my $IN, '<', 'input.txt' or die "Can't open 'input.txt' $!"; open my $FILE, '>', 'output.txt' or die "Can't open 'output.txt' $!"; my $previous = -1; my @data; while ( <$IN> ) { my ( $n, $p ) = split; no warnings qw/ numeric uninitialized /; if ( $n == $previous + 1 ) { push @data, $p } else { print $FILE "$data[0]\t$data[-1]\t", scalar @data, "\n +" if @data > 1; @data = $p; } $previous = $n; } __END__` [download]	[reply] [d/l]
Re^2: check for contiguous row values by kennethk (Abbot) on Aug 31, 2010 at 20:26 UTC
Not quite - you exit before outputting the final line. You need to add another line to flush `@data`, e.g.: `print $FILE "$data[0]\t$data[-1]\t", scalar @data, "\n" if @data > 1;` after the while loop. This depends on how your file is terminated - your code will function correctly if it is terminated with two or more newlines.	[reply] [d/l] [select]
Re: check for contiguous row values by Marshall (Canon) on Aug 31, 2010 at 21:04 UTC
I did things a bit differently. I would advise having some clear code that talks about the BREAK line rather than using some trick to skip it. Striving for the absolute shortest code is often meaningless. Update: I went ahead and took it the redo loop trick. I put in way more comments than normal to try to help Op understand how it works. #!/usr/bin/perl -w use strict; my $last_num=0; my @data =(); while ( my $line=<DATA>) { # Output CASE 1: # A BREAK line just causes the output to be dumped if ($line =~ m/BREAK\s*$/) { output_data(); next; } my ($seq_num, $data) = split(/\s+/,$line); #no need for chomp # Output CASE 2: # when numbers go to far, also a signal to dump output # but we've already read one line too far! if ( ($last_num != 0) and ($seq_num > $last_num+1) ) { output_data(); } # # This is the normal data processing push (@data, $data); $last_num = $seq_num; #print "seq_num = $seq_num data = $data\n"; #debug statement } #Output CASE 4: output_data(); #last record may be here! sub output_data { #Output CASE 3: #won't print anything if data set is less than 2 print "$data[0] $data[-1] ".@data, "\n" if @data>1; @data =(); #erase current data; $last_num =0; #set up for next set } =prints 3DKG000004283 3DKG000004543 4 3DiKG000004569 3DiKG000004879 5 3DiKG000005165 3DiKG000005193 2 3DKG000007633 3DKG000007769 5 3DKG000008189 3DKG000008261 3 =cut __DATA__ 10000058 3DKG000004283 290.48 10000059 3DKG000004315 290.48 10000060 3DKG000004421 1693.9 10000061 3DKG000004543 3118.77 THIS IS A BREAK 10000062 3DiKG000004569 2372.94 10000063 3DiKG000004681 528.87 10000064 3DiKG000004741 187.54 10000065 3DiKG000004773 327.84 10000066 3DiKG000004879 1301.43 10000071 3DiKG000005165 17.94 10000072 3DiKG000005193 13.45 10000074 3DiKG000005261 14.33 10000076 3DiKG000005331 144 THIS IS A BREAK 10000145 3DKG000007633 10.43 10000146 3DKG000007663 10.43 10000147 3DKG000007693 1224.8 10000148 3DKG000007727 1224.8 10000149 3DKG000007769 1359.73 10000162 3DKG000008189 307.62 10000163 3DKG000008231 307.62 10000164 3DKG000008261 14.69 [download]	[reply] [d/l]
Re^2: check for contiguous row values by gudluck (Novice) on Sep 01, 2010 at 16:17 UTC
Thank you all for nice suggestions and useful pseudo script and as well working scripts with detailed explanation. I appreciate your prompt replies.	[reply]