in reply to Spliting file + removing column

Let me see if I have this straight. If this is what you are asking for, the following code should do it:
#!/usr/bin/perl -w
use strict;

# This should really be passed in on the command line or something.
my $data_file = "HIVgag.ct";

# Find the vertices.
my @vertices;
open(my $fh, "<", $data_file) or die "Can't open '$data_file': $!";
while (<$fh>) {
  if (/energy/i) {
    if (not @vertices) {
      # This is the first line of the first block.  Do nothing.
      next;
    }
    else {
      # We have completed the first block.
      last;
    }
  }
  my @row = split /\s+/, $_;
  push @vertices, $row[0];
}

# Scalar context turns @vertices into the number of elements it has.
print "*vertices " . @vertices . "\n";
for my $vertex (@vertices) {
  print "$vertex G\n";
}

print "*edges\n";
seek($fh, 0, 0);
my %connect;
my $position = @vertices - 1;
while (<$fh>) {
  $position++;
  if (/energy/i) {
    if ($position != @vertices) {
      die "In line $., too few vertices found";
    }
    $position = -1;
    next;
  }
  my ($this_vertex, $type, @row) = split /\s+/, $_;
  if ($this_vertex ne $vertices[$position]) {
    die "Unexpected vertex '$this_vertex' at line $.";
  }
  for my $other_vertex (@row) {
    if (0 == $other_vertex or $this_vertex == $other_vertex) {
      next;
    }
    my $key = "$this_vertex $other_vertex";
    $connect{$key}++;
    print "$key $connect{$key}\n";
  }
}
Things to note.

Re^2: Spliting file + removing column
by AG87 (Acolyte) on Jan 13, 2011 at 17:24 UTC

    Yes, you are right. Being a biologist, I do not know much about programming.

    The first block of the code you provided always prints G in the vertices block, but the letter could be A, T, G or C. Secondly, I also want the serial numbers in the vertices block, i.e. the first block should look like this:

    1 G
    2 G
    3 G

    In the *edges section, like you said, there is no use in printing the same pair twice. Is it possible to print each pair once, followed by a number giving its total count in the file? It's a huge file; I have given only a part of it here. For example, the *edges portion in this case should look like the lines below, where the third column is the total count of the pair in the whole file, which is 2 in this case:

    1 2 2
    2 1 2
    2 3 2
    2 533 2
    3 2 2
    3 4 2
    3 532 2

    Please note that the 1st column of the input file (the serial number) pairs only with the 3rd, 4th and 5th columns, NOT with the 6th column (which is what happens in my code above).

    I shall be very thankful for your guidance.

      The simplest thing to do is, on the first pass, to collect the correct vertex output into an array at the same time that @vertices is being collected, then print that second array for the *vertices section.
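As a rough illustration of that suggestion, here is a minimal sketch. It assumes each data line starts with the serial number followed by the base letter, and that block headers contain the word "energy". The helper name vertex_lines, the second array @vertex_output and the sample lines are made up for the example; they are not taken from the real HIVgag.ct file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the input and returns the
# "serial base" lines for the first block only.
sub vertex_lines {
    my @lines = @_;
    my (@vertices, @vertex_output);
    for my $line (@lines) {
        if ($line =~ /energy/i) {
            next if not @vertices;   # header of the first block
            last;                    # header of the second block: done
        }
        # split ' ' ignores leading whitespace, unlike split /\s+/.
        my ($serial, $base) = split ' ', $line;
        next unless defined $base;   # skip blank or short lines
        push @vertices, $serial;
        push @vertex_output, "$serial $base";
    }
    return @vertex_output;
}

# Made-up sample data (assumed layout: serial, base, other columns).
my @sample = (
    "3 ENERGY = -1.0 demo\n",
    "1 G 0 2 0 1\n",
    "2 G 1 3 0 2\n",
    "3 A 2 0 0 3\n",
    "3 ENERGY = -0.9 demo\n",
    "1 G 0 2 0 1\n",
);

my @out = vertex_lines(@sample);
# Concatenation puts @out in scalar context, giving its element count.
print "*vertices " . @out . "\n";
print "$_\n" for @out;
```

In the real script the same two push statements would go inside the existing first while loop, reading from $fh instead of a sample list.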

      I already told you how to correct *edges.
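For reference, one possible way to get the output AG87 describes (each pair printed once with its total over the whole file, using only the 3rd, 4th and 5th columns) is to count first and print after the loop. This is only a sketch under the same assumed column layout; count_edges and the sample lines are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count each "from to" pair across all blocks.
# Returns the pairs in first-seen order and a hash of their totals.
sub count_edges {
    my @lines = @_;
    my (%total, @order);
    for my $line (@lines) {
        next if $line =~ /energy/i;        # block header
        my @field = split ' ', $line;
        next if @field < 5;                # blank or short line
        my $from = $field[0];
        for my $to (@field[2 .. 4]) {      # columns 3, 4 and 5 only
            next if $to == 0 or $to == $from;
            my $key = "$from $to";
            # Post-increment is false the first time a pair is seen.
            push @order, $key if not $total{$key}++;
        }
    }
    return (\@order, \%total);
}

# Made-up sample data with two blocks.
my @sample = (
    "3 ENERGY = -1.0 demo\n",
    "1 G 0 2 0 1\n",
    "2 G 1 3 0 2\n",
    "3 A 2 0 0 3\n",
    "3 ENERGY = -0.9 demo\n",
    "1 G 0 2 3 1\n",
    "2 G 1 3 0 2\n",
    "3 A 2 0 1 3\n",
);

my ($order, $total) = count_edges(@sample);
print "*edges\n";
print "$_ $total->{$_}\n" for @$order;
```

Printing after the loop is what makes each pair appear once with its final total; printing inside the loop, as the original script does, shows a running count instead.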

      Give it a try and see how it goes.

        It's not working correctly :( Anyway, I will try to sort it out. Thanks.