Spliting file + removing column

AG87 has asked for the wisdom of the Perl Monks concerning the following question:

I have a little bit problem in my code which I am unable to sort out. My input file is

536 ENERGY = -176.2 pro
1 G 0 2 0 1
2 G 1 3 533 2
3 G 2 4 532 3
536 ENERGY = -175.9 pro
1 G 0 2 0 1
2 G 1 3 533 2
3 G 2 4 532 3

My code is

$data_file="HIVgag.ct";

open(MYFILE, $data_file) || die("Could not open file!");
$b=$between=$energy=0;
while (<MYFILE>) {
    if (/energy/i) {
        $energy++;
        $between=$b;
    }
    elsif ($energy) {$b++}
}

print "$energy / $between\n";
close(MYFILE);

open(MYFILE, $data_file) || die("Could not open file!");
open(WRITE, ">final.net");
print WRITE "*vertices $between\n";

 while(<MYFILE>)
 {
 foreach $row (<MYFILE>)
 {
 @row = $row;
(@row[0],@row[1],@row[2], @row[3], @row[4], @row[5], @row[6])=split(/\
+s+/,$row);
if (@row[2] ne ENERGY)
{
 print  WRITE "@row[1]  $row[2]\n";
 print"@row[1]  $row[2]\n";
 }
    }
 }
 close(MYFILE);
 
 open(MYFILE, $data_file) || die("Could not open file!");
 {
print WRITE "*edges\n";
my $energy;
while ( <MYFILE> ) {
    if ( /\bENERGY\b/ ) {
        ++$energy;
        next;
       }
       
    my ( $key, @fields ) = grep $_, split;
    for my $field ( @fields ) {
    if ( $field ne 'G' && $field ne 'A' && $field ne 'C' && $field ne 
+'G' )
    {
        print WRITE "$key $field $energy\n";
        }
    }
    }
    }
close(WRITE);
[download]

My code rearranges the input file in a way such that the first two columns of the file is written in *vertices portion. And in the *edges portion the 1st column of the input file makes a pair with the 3rd, 4th , 5th and 6th column omitting the pairs containing zero in 3rd, 4th, 5th and 6th column. The 3rd column of *edges consists of the number of “energy block” in which the pair exists. Thus the output file is

*vertices 3
1 G
2 G
3 G
1 G
2 G
3 G
*edges
1 2 1
1 1 1
2 1 1
2 3 1
2 533 1
2 2 1
3 2 1
3 4 1
3 532 1
3 3 1
1 2 2
1 1 2
2 1 2
2 3 2
2 533 2
2 2 2
3 2 2
3 4 2
3 532 2
3 3 2

What I want is in the *vertices portion it should only read the first two columns of the first “energy block” and then it should jump to *edges portion which implies that in this case it should only print 1 G, 2 G and 3 G and then jump to *edges portion. And in *edges portion the first column of input file should not make pair with 6th (last) column. So the output should be like

*vertices 3
1 G
2 G
3 G
*edges
1 2 1
2 1 1
2 3 1
2 533 1
3 2 1
3 4 1
3 532 1
1 2 2
2 1 2
2 3 2
2 533 2
3 2 2
3 4 2
3 532 2

I shall be very thankful for help.

Comment on Spliting file + removing column Download Code

Replies are listed 'Best First'.
Re: Spliting file + removing column by tilly (Archbishop) on Jan 13, 2011 at 15:22 UTC
Let me see if I have this straight. Your input is divided into blocks. Each block starts with a weird line saying ENERGY. Within each block you have lines with a vertex, G, then vertices that the first vertex is connected to. You don't want to count a vertex as being connected to 0 or itself. (In your data this solves your last column issue, and I think is the real requirement.) Every block has the same vertices. (This is a big assumption that is implicit in how you want to process everything.) First you want to go through the first block, then print off all of the vertices. Then for each block, at the end of the block,' For each vertex For each vertex it is connected to print off first vertex, second vertex, and the total multiplicity you have found. Note in particular that in my understanding if a pair of vertices does not appear in a later block, it won't be printed in edges. If this is what you are asking for, the following code should do it: #!/usr/bin/perl -w use strict; # This should really be passed in on the command line or something. my $data_file = "HIVgag.ct"; # Find the vertices. my @vertices; open(my $fh, "<", $data_file) or die "Can't open '$data_file': $!"; while (<$fh>) { if (/energy/i) { if (not @vertices) { # This is the first line of the first block. Do nothing. next; } else { # We have completed the first block. last; } } my @row = split /\s+/, $_; push @vertices, $row[0]; } # Scalar context turns @vertices into the number of elements it has. print "vertices " . @vertices . "\n"; for my $vertex (@vertices) { print "$vertex G\n"; } print "edges\n"; seek($fh, 0, 0); my %connect; my $position = @vertices - 1; while (<$fh>) { $position++; if (/energy/i) { if ($position != @vertices) { die "In line $., too few vertices found"; } $position = -1; next; } my ($this_vertex, $type, @row) = split /\s+/, $_; if ($this_vertex ne $vertices[$position]) { die "Unexpected vertex '$this_vertex' at line $."; } for my $other_vertex (@row) { if (0 == $other_vertex or $this_vertex == $other_vertex) { next; } my $key = "$this_vertex $other_vertex"; $connect{$key}++; print "$key $connect{$key}\n"; } } [download] Things to note. I've used warnings and strict.pm. This is a very good habit that will catch a lot of bugs. Absolutely do this. My error message on the open uses $! as perlstyle recommends. Note all of the sanity checking to verify that the file looks exactly like I think it should. I am extremely puzzled at why you want to print the same pair of vertices multiple times in your edges section. This makes no sense to me. Instead I would recommend taking out the print from my inner loop, and then add a short section at the end that prints out the contents of multiplicity. Like this: `for my $vertex_pair (sort keys %multiplicity) { print "$vertex_pair $multiplicity{$vertex_pair}\n"; }` [download] I have no idea what the purpose of G is in your file. I suspect from your code that it has to do with DNA, and I wonder whether you should be dropping it from the edges section.	[reply] [d/l] [select]
Re^2: Spliting file + removing column by AG87 (Acolyte) on Jan 13, 2011 at 17:24 UTC
yes you are right. I being the biologist do not know much about programming. The first block of the above code you provided is just printing G explicitly in the vertices block. It could either be A, T, G or C. Secondly I also want to have their serial numbers in vertices block. i-e the first block should look like 1 G 2 G 3 G In edges section, like you said it is of no use to print the same pair twice. Is it possible to print the pairs once followed by the number indicating their total sum in file. Its a huge file. I have given a part of it here. For example, the edges portion in this case should be like below where third column indicates the total sum of the pair in whole file which is 2 in this case 1 2 2 2 1 2 2 3 2 2 533 2 3 2 2 3 4 2 3 532 2 Please note that the 1st column of the input file i-e serial number pairs only with 3rd, 4th and 5th column NOT with the 6th column (which is happening in my code above) I shall be very thankful for your guidance	[reply]
Re^3: Spliting file + removing column by tilly (Archbishop) on Jan 13, 2011 at 17:42 UTC
The simplest thing to do is on the first pass collect the correct vertex output into an array at the same time that @vertices is being collected, then print that second arrayfor the vertices section. I already told you how to correct edges. Give it a try and see how it goes.	[reply]
Re^4: Spliting file + removing column by AG87 (Acolyte) on Jan 13, 2011 at 17:59 UTC
Re^5: Spliting file + removing column by tilly (Archbishop) on Jan 13, 2011 at 18:20 UTC
Re: Spliting file + removing column by Anonyrnous Monk (Hermit) on Jan 13, 2011 at 11:02 UTC
Just a quick side note. This `@row = $row; (@row[0],@row[1],@row[2], @row[3], @row[4], @row[5], @row[6])=split(/\ +s+/,$row);` [download] is more compactly written as `my @row = split(/\s+/,$row);` [download] (Strictly speaking, the latter would create more than 7 array elements, in case there are more fields in the input — but you only seem to have 6 anyway. The direct equivalent would be `@row = (split /\s+/,$row)[0..6];`) Also, `@row[3]` (i.e. an array slice with a single element) is generally better written as `$row[3]`.	[reply] [d/l] [select]
Re^2: Spliting file + removing column by jwkrahn (Abbot) on Jan 13, 2011 at 11:15 UTC
`my @row = split(/\s+/,$row);` [download] is more compactly written as `my @row = split' ',$row;` [download]	[reply] [d/l] [select]
Re^3: Spliting file + removing column by Anonyrnous Monk (Hermit) on Jan 13, 2011 at 12:05 UTC
...it's not the same. In other words, it depends on what the OP wants. `split ' '` is "magic" in that it ignores leading white space, while `split /\s+/` does not.	[reply] [d/l] [select]
Re: Spliting file + removing column by pat_mc (Pilgrim) on Jan 13, 2011 at 09:12 UTC
Hi, AG87, Can you provide a reduced and condensed version of the actual problem you are having? This would make it easier for everyone here to help you with the Perl. Like this it looks like an awful lot of problem-specific code the meaning of which you probably know better than anyone else here. Looking forward to helping you out - Pat	[reply]
Re^2: Spliting file + removing column by AG87 (Acolyte) on Jan 13, 2011 at 17:47 UTC
Ok in short my input and output both are divided in blocks. My 1st problem 1) In the output of vertices portion my code prints vertices 3 1 G 2 G 3 G 1 G 2 G 3 G But in actual I want it to print the first two columns of the only first energy block like vertices 3 1 G 2 G 3 G My code for this section is $data_file="HIVgag.ct"; open(MYFILE, $data_file) \|\| die("Could not open file!"); $b=$between=$energy=0; while (<MYFILE>) { if (/energy/i) { $energy++; $between=$b; } elsif ($energy) {$b++} } close(MYFILE); open(MYFILE, $data_file) \|\| die("Could not open file!"); open(WRITE, ">final.net"); print WRITE "vertices $between\n"; while(<MYFILE>) { foreach $row (<MYFILE>) { @row = $row; (@row[0],@row[1],@row[2], @row[3], @row[4], @row[5], @row[6])=split(/\ +s+/,$row); if (@row[2] ne ENERGY) { print WRITE "@row[1] $row[2]\n"; } } } close(MYFILE); [download]	[reply] [d/l]
Re: Spliting file + removing column by locked_user sundialsvc4 (Abbot) on Jan 13, 2011 at 13:08 UTC
Your code is a bit too much for me to “grok” on my first cup of coffee, but when I am presented with a problem like this, I try to identify what are the different kinds of input-lines my program may expect (I see two...), and to write a handler `sub` for each one. I will use a regular-expression to parse the lines as specifically as possible. And, I will aggressively code this part of the program so that it will `die` if it encounters anything that does not conform to my specifications because, in a file of 1,000,000 records, that’ll be the only possible way to find the problem. (The fact that “the program seems to run to completion” doesn’t cut it...) My next concern, once I have verified that I can correctly read the entire file without any untimely deaths, will be to consider the order in which the various types of records may occur. For this, I use “finite-state machine (FSM)” logic, where the program initializes, say, `$state = 'INITIAL_STATE';`, and then examines and changes the value of `$state` to indicate where it is and what it has seen recently. This state-driven logic constructs some kind of in-memory data structure, and, at the appropriate point (`$state`) in time, does something with it.