in reply to Getting data from a file (Operons and Genes).

chrisantha, try this:

#!/perl/bin/perl -w
use strict;
use Data::Dumper;

my $hash = {};

for (<DATA>) {
    chomp;
    my @fields = split(/\s{4}/, $_);
    my @genes  = split(',', $fields[3]);
    @genes = map { (split('\|'))[1] } @genes;
    push(@{$hash->{$fields[0]}}, \@genes);
}

print Dumper($hash);

__DATA__
malS    1    forward    malS|b3571,
malT    1    forward    malT|b3418,
malXY    2    forward    malX|b1621,malY|b1622,
malZ    1    forward    malZ|b0403,
manA    1    forward    manA|b1613,
__OUTPUT__
$VAR1 = {
    'malZ' => [ [ 'b0403' ] ],
    'malS' => [ [ 'b3571' ] ],
    'manA' => [ [ 'b1613' ] ],
    'malXY' => [ [ 'b1621', 'b1622' ] ],
    'malT' => [ [ 'b3418' ] ]
};

Is this what you're looking for?

Updated: Added __OUTPUT__


Where do you want *them* to go today?

Replies are listed 'Best First'.
Re^2: Getting data from a file (Operons and Genes).
by ysth (Canon) on Jun 14, 2007 at 22:56 UTC
    You've got an unnecessary level of array reference there; you could remove it by changing
    push(@{$hash->{$fields[0]}}, \@genes);
    to
    push(@{$hash->{$fields[0]}}, @genes);
    giving output:
    $VAR1 = {
        'malZ' => [ 'b0403' ],
        'malS' => [ 'b3571' ],
        'manA' => [ 'b1613' ],
        'malXY' => [ 'b1621', 'b1622' ],
        'malT' => [ 'b3418' ]
    };
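
To make the difference between the two push forms concrete, here is a minimal sketch. The hash names %nested and %flat are made up for illustration and do not appear in the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @genes = ('b1621', 'b1622');

my (%nested, %flat);    # hypothetical names, for illustration only
push @{ $nested{malXY} }, \@genes;    # stores one array reference per line
push @{ $flat{malXY} },   @genes;     # stores the gene ids themselves

# The nested form needs an extra index to reach a gene id.
print $nested{malXY}[0][1], "\n";
print $flat{malXY}[1], "\n";
```

Both print statements reach the same id; the flat form just needs one index fewer.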

      I chose not to assume that an operon was unique throughout the file, since it wasn't explicitly stated in the spec.

      If I can assume uniqueness, then I'd remove the push and change to a HoA:

      $hash->{$fields[0]} = \@genes;
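
A small sketch of the HoA variant, assuming each operon really does appear only once. The name %genes_for and the two sample lines (four spaces between fields) are illustrative choices, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample records in the thread's format; fields are separated by
# exactly four spaces, which is what the split below expects.
my @lines = (
    'malXY    2    forward    malX|b1621,malY|b1622,',
    'malZ    1    forward    malZ|b0403,',
);

my %genes_for;    # hypothetical name: operon => [ gene ids ]
for my $line (@lines) {
    my @fields = split /\s{4}/, $line;
    # "malX|b1621" becomes "b1621"; split drops the empty item that the
    # trailing comma would otherwise produce.
    my @genes = map { (split /\|/)[1] } split /,/, $fields[3];
    $genes_for{ $fields[0] } = \@genes;    # assumes each operon occurs once
}

print "$_: @{ $genes_for{$_} }\n" for sort keys %genes_for;
```

If an operon did occur twice, the later line would silently replace the earlier one, which is the reason for the push in the non-unique case.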

      Where do you want *them* to go today?
        Thanks, I was trying this but failing.
        #!/usr/bin/perl
        use strict;

        my $operon;
        my %operonHash;

        while (<>) {
            chomp;
            if ( /(\b.+?\b)/ ) {    # word boundary + any character at least once, up to the first word boundary.
                #print "Matched: |$`<$&?>$'|\n";
                $operon = $_;
                #print $& . " " ;
                $operonHash{$&} = ();
            } else {
                print "No match. \n";
            }
            print "\n";
            if ( /\w+\|/ ) {    # word boundary + any character at least once, up to the first word boundary.
                print "Matched: |$`<<$&>>$'|\n";
            } else {
                print "No match. \n";
            }
        }
        #The problem is, 1. How to get rid of the | from the expression that was found, and 2. How to get MULTIPLE genes before | when more than one gene appears on a line?
      Thanks again. I tried your code on this data:

      C0067 1 forward C0067|,
      C0293 1 forward C0293|,
      C0343-dbpA 2 forward C0343|,dbpA|b1343,
      C0465 1 forward C0465|,
      C0614 1 reverse C0614|,
      C0719 1 forward C0719|,
      IS128 1 forward IS128|,
      aaeR 1 forward aaeR|b3243,
      aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242,
      aas-ygeD 2 reverse aas|b2836,ygeD|b2835,
      aat 1 reverse aat|b0885,
      abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335,
      abgR 1 forward abgR|b1339,
      abrB 1 reverse abrB|b0715,
      accA 1 forward accA|b0185,
      accBC 2 forward accB|b3255,accC|b3256,
      accD 1 reverse accD|b2316,
      aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016,
      ackA-pta 2 forward ackA|b2296,pta|b2297,
      acnA 1 forward acnA|b1276,
      acnB 1 forward acnB|b0118,
      acpH 1 reverse acpH|b0404,
      acpT 1 forward acpT|b3475,
      acrAB 2 reverse acrA|b0463,acrB|b0462,
      acrD 1 forward acrD|b2470,
      acrEF 2 forward acrE|b3265,acrF|b3266,
      acrR 1 forward acrR|b0464,
      acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068,
      ada-alkB 2 reverse ada|b2213,alkB|b2212,
      add 1 forward add|b1623,
      ade 1 forward ade|b3665,
      adhE 1 reverse adhE|b1241,
      adhP 1 reverse adhP|b1478,
      adiA 1 reverse adiA|b4117,
      adiC 1 reverse adiC|b4115,
      adiY 1 reverse adiY|b4116,
      adk 1 forward adk|b0474,
      adrA 1 forward adrA|b0385,
      aegA 1 reverse aegA|b2468,
      aer 1 reverse aer|b3072,
      aes 1 reverse aes|b0476,
      agaR 1 reverse agaR|b3131,
      agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,
      malS    1    forward    malS|b3571,
      malT    1    forward    malT|b3418,
      malXY    2    forward    malX|b1621,malY|b1622,
      malZ    1    forward    malZ|b0403,
      manA    1    forward    manA|b1613,

      which gives this output:

      $VAR1 = {
          'malZ' => [ [ 'b0403' ] ],
          ' => [ 'ada-alkB 2 reverse ada|b2213,alkB|b2212, [] ],
          'malS' => [ [ 'b3571' ] ],
          ' => [ 'acnA 1 forward acnA|b1276, [] ],
          ' => [ 'aegA 1 reverse aegA|b2468, [] ],
          ' => [ 'adhP 1 reverse adhP|b1478, [] ],
          ' => [ 'abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335, [] ],
          ' => [ 'acnB 1 forward acnB|b0118, [] ],
          ' => [ 'acpT 1 forward acpT|b3475, [] ],
          ' => [ 'ade 1 forward ade|b3665, [] ],
          ' => [ 'acrD 1 forward acrD|b2470, [] ],
          ' => [ 'acpH 1 reverse acpH|b0404, [] ],
          ' => [ 'agaR 1 reverse agaR|b3131, [] ],
          ' => [ 'C0293 1 forward C0293|, [] ],
          'malXY' => [ [ 'b1621', 'b1622' ] ],
          ' => [ 'aaeR 1 forward aaeR|b3243, [] ],
          ' => [ 'ackA-pta 2 forward ackA|b2296,pta|b2297, [] ],
          ' => [ 'aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016, [] ],
          ' => [ 'aat 1 reverse aat|b0885, [] ],
          ' => [ 'acrR 1 forward acrR|b0464, [] ],
          ' => [ 'accA 1 forward accA|b0185, [] ],
          ' => [ 'IS128 1 forward IS128|, [] ],
          ' => [ 'aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242, [] ],
          'malT' => [ [ 'b3418' ] ],
          ' => [ 'abrB 1 reverse abrB|b0715, [] ],
          ' => [ 'C0465 1 forward C0465|, [] ],
          ' => [ 'aas-ygeD 2 reverse aas|b2836,ygeD|b2835, [] ],
          ' => [ 'C0614 1 reverse C0614|, [] ],
          ' => [ 'adk 1 forward adk|b0474, [] ],
          'agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,' => [ [] ],
          ' => [ 'adiC 1 reverse adiC|b4115, [] ],
          ' => [ 'aes 1 reverse aes|b0476, [] ],
          ' => [ 'accBC 2 forward accB|b3255,accC|b3256, [] ],
          ' => [ 'add 1 forward add|b1623, [] ],
          ' => [ 'accD 1 reverse accD|b2316, [] ],
          ' => [ 'adrA 1 forward adrA|b0385, [] ],
          ' => [ 'adiA 1 reverse adiA|b4117, [] ],
          ' => [ 'aer 1 reverse aer|b3072, [] ],
          ' => [ 'acrEF 2 forward acrE|b3265,acrF|b3266, [] ],
          ' => [ 'acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068, [] ],
          ' => [ 'abgR 1 forward abgR|b1339, [] ],
          ' => [ 'C0719 1 forward C0719|, [] ],
          ' => [ 'adiY 1 reverse adiY|b4116, [] ],
          ' => [ 'C0343-dbpA 2 forward C0343|,dbpA|b1343, [] ],
          'manA' => [ [ 'b1613' ] ],
          ' => [ 'acrAB 2 reverse acrA|b0463,acrB|b0462, [] ],
          ' => [ 'C0067 1 forward C0067|, [] ],
          ' => [ 'adhE 1 reverse adhE|b1241, [] ]
      };

      Which is strange because it works for the example data you sent me which I cut and pasted into the above data, but not for the original data file. Please can you explain this? Yours, Chrisantha
        I think you have tabs separating the fields instead of the 4 spaces that the posted code expects, plus unexpected carriage return characters at the end of the lines. Changing the split to
        my @fields = split(' ', $_);
        (or just my @fields = split;, since ' ' and $_ are what split defaults to) solves both problems.
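
A hedged sketch of why the whitespace split helps. The two input lines below are made up to mimic the suspected file (tab-separated fields, "\r\n" line endings); the real file may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input mimicking the suspected file: tab-separated fields and
# Windows-style "\r\n" line endings.
my @lines = (
    "aaeR\t1\tforward\taaeR|b3243,\r\n",
    "accBC\t2\tforward\taccB|b3255,accC|b3256,\r\n",
);

my %hash;
for (@lines) {
    chomp;                         # removes "\n" only; the "\r" remains
    my @strict = split /\s{4}/;    # no run of four whitespace characters here
    my @fields = split ' ';        # any whitespace run; also discards the "\r"
    printf "%d field(s) vs %d field(s)\n", scalar @strict, scalar @fields;

    my @genes = map { (split /\|/)[1] } split /,/, $fields[3];
    push @{ $hash{ $fields[0] } }, @genes;
}

print "$_ => @{ $hash{$_} }\n" for sort keys %hash;
```

With the four-space pattern the whole line stays in one field, carriage return included, which matches the odd-looking keys in the dump above.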
Re^2: Getting data from a file (Operons and Genes).
by chrisantha (Initiate) on Jun 14, 2007 at 23:50 UTC
    Yes, thanks, I was trying something like this and failing.

    #!/usr/bin/perl
    use strict;

    my $operon;
    my %operonHash;

    while (<>) {
        chomp;
        if ( /(\b.+?\b)/ ) {    # word boundary + any character at least once, up to the first word boundary.
            #print "Matched: |$`<$&?>$'|\n";
            $operon = $_;
            #print $& . " " ;
            $operonHash{$&} = ();
        } else {
            print "No match. \n";
        }
        print "\n";
        if ( /\w+\|/ ) {    # word boundary + any character at least once, up to the first word boundary.
            print "Matched: |$`<<$&>>$'|\n";
        } else {
            print "No match. \n";
        }
    }

    #The problem is, 1. How to get rid of the | from the expression that was found, and 2. How to get MULTIPLE genes before | when more than one gene appears on a line?
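
For the two regex questions, one possible approach (a sketch, not the only way) is to capture just the wanted part in parentheses and to use the /g modifier in list context to collect every match on a line. The sample line and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'malXY    2    forward    malX|b1621,malY|b1622,';

# 1. Parentheses capture only the part you want, so the "|" is matched
#    but not kept.
# 2. /g in list context returns every match on the line, not just the first.
my ($operon) = $line =~ /^(\S+)/;           # first run of non-whitespace
my @names    = $line =~ /(\w+)\|/g;         # text before each "|"
my %id_for   = $line =~ /(\w+)\|(\w*)/g;    # name => id pairs (id may be empty)

print "operon: $operon\n";
print "genes:  @names\n";
print "$_ => $id_for{$_}\n" for sort keys %id_for;
```

Using captures instead of $& also avoids having to trim the | off the match afterwards.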