gudluck has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract information of the rows which have contigous values in first column. Data has lines " THIS IS A BREAK" where I need to split it. as well as when the difference between number is more than one.(for example between 10000066 and 10000071). There should be atleast two or more contigous rows.
Input tab delimited file:
10000058 3DKG000004283 290.48 10000059 3DKG000004315 290.48 10000060 3DKG000004421 1693.9 10000061 3DKG000004543 3118.77 THIS IS A BREAK 10000062 3DiKG000004569 2372.94 10000063 3DiKG000004681 528.87 10000064 3DiKG000004741 187.54 10000065 3DiKG000004773 327.84 10000066 3DiKG000004879 1301.43 10000071 3DiKG000005165 17.94 10000072 3DiKG000005193 13.45 10000074 3DiKG000005261 14.33 10000076 3DiKG000005331 144 THIS IS A BREAK 10000145 3DKG000007633 10.43 10000146 3DKG000007663 10.43 10000147 3DKG000007693 1224.8 10000148 3DKG000007727 1224.8 10000149 3DKG000007769 1359.73 10000162 3DKG000008189 307.62 10000163 3DKG000008231 307.62 10000164 3DKG000008261 14.69
OutPut should be start, end and count:
3DKG000004283 3DKG000004543 4 3DiKG000004569 3DiKG000004879 5 3DiKG000005165 3DiKG000005193 2 3DKG000007633 3DKG000007769 5 3DKG000008189 3DKG000008261 3
My code below is not giving me what i wanted:
#!/usr/bin/perl -w open(IN, "input.txt") || die "Can't open output1.txt: $!"; open(FILE, ">output.txt") || die "couldn't create the file\n"; while (<IN>) { $lines = $_; ($n,$p) = $lines =~ /^(\d+)\t(.+)\n/; push(@num, $n); push(@data, $p); } close(IN); $nlit = scalar(@num); for($c=0;$c<=$nlit;$c++) { $first = $num[$c]; $second = $num[$c+1]; $diff= $second-$first; if ($diff <= 1) { push(@B, $data[$c]); push(@N, $num[$c]); } elsif ($diff > 1) { if (scalar(@B) >=2) { $si = scalar(@B); @firstty = split /\t/, $B[0]; @lastty = split /\t/, $B[$#B]; print FILE "$firstty[0]\t$lastty[0]\t$si\n"; } undef @B; undef @N; } } close (IN);
Iam a newbie, Please correct my script. Thanks

Replies are listed 'Best First'.
Re: check for contiguous row values
by kennethk (Abbot) on Aug 31, 2010 at 18:41 UTC
    Welcome to the monastery. We are more than happy to help new members of the Perl community learn and grow.
    1. On line 7, your regular expression assumes your data is tab delimited. However, the input file you posted is fixed width. Was this a copy/paste error, or are you unfamiliar with different types of whitespace? In either case, the initial parse failure that this results in can be fixed by swapping \t (the tab character) to \s+ (1 or more whitespace character), assuming column 1 will never contain whitespace. ($n,$p) = $lines =~ /^(\d+)\s+(.+)\n/;

    2. On every iteration of your loop that didn't involve the record break, you reset your @B and @N arrays. Therefore, you can never accumulate enough records to hit your output loop. You likely mean to move those inside your if (scalar(@B) >=2) block.

    3. As per the first issue, you need to change you splits from /\t/ to /\s+/.

    Making those changes to your script, I get the following output:

    3DKG000004283 3DKG000004543 4 3DiKG000004569 3DiKG000004773 4 3DiKG000005165 3DiKG000005331 2 3DKG000007633 3DKG000007727 4

    as well as a number of warnings for uninitialized subtraction. This is not yet your goal, but is working toward it. Further corrections require modifications to your algorithm.

    What resources are you using for learning Perl? There are enough wholly unnecessary operations in the above that understanding where you are coming from (languages you know, resources available, ...) would help us guide you toward making this script work as well as improve your own style.

Re: check for contiguous row values
by pemungkah (Priest) on Aug 31, 2010 at 18:41 UTC
    I would recommend a slight rethink of the problem. There are two things (as you've explained it) that need to happen:
    1. We need to capture lines that aren't "THIS IS A BREAK" and process them.
    2. We need to know when we've hit a break line.
    I would suggest that you change the logic to reflect that. Right now I don't see you ever checking at all for the break line, which is certainly not going to work.

    Try it this way (pseudo-code so you can get the practice writing the Perl):

    my @captured; While I still have records: Is this a break line? Y - check_for_more_breaks(@captured) @captured = () next item N - save line to @captured Anything left in @captured? check_for_more_breaks(@captured) sub check_for_more_breaks: # Your actual processing here is unclear; I've taken # a guess at it. my @subgroup; my $last_sequence_number; for each line: extract the fields is last_sequence number defined? N - save this one as the last sequence number save the current data of interest in @subgroup next else is there a gap in the sequence numbers? Y - summarize(@subgroup) @subgroup = (); save current sequence number as last save the current data of interest next
    For summarize(@subgroup), I don't quite get from your question what the processing you do is (record first and last in group and sum fields?). Anyway - your gap finding is missing, and you need to add logic for the two different kinds of gaps before you can summarize.
Re: check for contiguous row values
by jwkrahn (Abbot) on Aug 31, 2010 at 20:15 UTC

    This seems to work:

    #!/usr/bin/perl use warnings; use strict; open my $IN, '<', 'input.txt' or die "Can't open 'input.txt' $!"; open my $FILE, '>', 'output.txt' or die "Can't open 'output.txt' $!"; my $previous = -1; my @data; while ( <$IN> ) { my ( $n, $p ) = split; no warnings qw/ numeric uninitialized /; if ( $n == $previous + 1 ) { push @data, $p } else { print $FILE "$data[0]\t$data[-1]\t", scalar @data, "\n +" if @data > 1; @data = $p; } $previous = $n; } __END__
      Not quite - you exit before outputting the final line. You need to add another line to flush @data, e.g.:

      print $FILE "$data[0]\t$data[-1]\t", scalar @data, "\n" if @data > 1;

      after the while loop. This depends on how your file is terminated - your code will function correctly if it is terminated with two or more newlines.

Re: check for contiguous row values
by Marshall (Canon) on Aug 31, 2010 at 21:04 UTC
    I did things a bit differently. I would advise having some clear code that talks about the BREAK line rather than using some trick to skip it. Striving for the absolute shortest code is often meaningless. Update: I went ahead and took it the redo loop trick.

    I put in way more comments than normal to try to help Op understand how it works.

    #!/usr/bin/perl -w use strict; my $last_num=0; my @data =(); while ( my $line=<DATA>) { # Output CASE 1: # A BREAK line just causes the output to be dumped if ($line =~ m/BREAK\s*$/) { output_data(); next; } my ($seq_num, $data) = split(/\s+/,$line); #no need for chomp # Output CASE 2: # when numbers go to far, also a signal to dump output # but we've already read one line too far! if ( ($last_num != 0) and ($seq_num > $last_num+1) ) { output_data(); } # # This is the normal data processing push (@data, $data); $last_num = $seq_num; #print "seq_num = $seq_num data = $data\n"; #debug statement } #Output CASE 4: output_data(); #last record may be here! sub output_data { #Output CASE 3: #won't print anything if data set is less than 2 print "$data[0] $data[-1] ".@data, "\n" if @data>1; @data =(); #erase current data; $last_num =0; #set up for next set } =prints 3DKG000004283 3DKG000004543 4 3DiKG000004569 3DiKG000004879 5 3DiKG000005165 3DiKG000005193 2 3DKG000007633 3DKG000007769 5 3DKG000008189 3DKG000008261 3 =cut __DATA__ 10000058 3DKG000004283 290.48 10000059 3DKG000004315 290.48 10000060 3DKG000004421 1693.9 10000061 3DKG000004543 3118.77 THIS IS A BREAK 10000062 3DiKG000004569 2372.94 10000063 3DiKG000004681 528.87 10000064 3DiKG000004741 187.54 10000065 3DiKG000004773 327.84 10000066 3DiKG000004879 1301.43 10000071 3DiKG000005165 17.94 10000072 3DiKG000005193 13.45 10000074 3DiKG000005261 14.33 10000076 3DiKG000005331 144 THIS IS A BREAK 10000145 3DKG000007633 10.43 10000146 3DKG000007663 10.43 10000147 3DKG000007693 1224.8 10000148 3DKG000007727 1224.8 10000149 3DKG000007769 1359.73 10000162 3DKG000008189 307.62 10000163 3DKG000008231 307.62 10000164 3DKG000008261 14.69
      Thank you all for nice suggestions and useful pseudo script and as well working scripts with detailed explanation. I appreciate your prompt replies.