
You've got an unnecessary level of array reference there; you could remove it by changing
push(@{$hash->{$fields[0]}}, \@genes);
to
push(@{$hash->{$fields[0]}}, @genes);
giving output:
$VAR1 = {
    'malZ'  => [ 'b0403' ],
    'malS'  => [ 'b3571' ],
    'manA'  => [ 'b1613' ],
    'malXY' => [ 'b1621', 'b1622' ],
    'malT'  => [ 'b3418' ]
};
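
For what it's worth, here is a minimal sketch of the difference between the two push forms. The hash names %nested and %flat and the sample gene list are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical gene list for one operon.
my @genes = ('b1621', 'b1622');

my (%nested, %flat);
push @{ $nested{malXY} }, \@genes;   # adds ONE element: a reference to @genes
push @{ $flat{malXY} },   @genes;    # adds each gene as its own element

printf "nested has %d element(s)\n", scalar @{ $nested{malXY} };
printf "flat has %d element(s)\n",   scalar @{ $flat{malXY} };

# The nested form needs two indexes to reach a gene, the flat form one.
print "nested: $nested{malXY}[0][1]\n";
print "flat:   $flat{malXY}[1]\n";
```

Both of the last two lines print the same gene; only the number of indexes needed to reach it differs.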

Re^3: Getting data from a file (Operons and Genes).
by thezip (Vicar) on Jun 14, 2007 at 23:04 UTC

    I chose not to assume that an operon was unique throughout the file, since it wasn't explicitly stated in the spec.

    If I can assume uniqueness, then I'd remove the push and change to a HoA:

    $hash->{$fields[0]} = \@genes;
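
    A rough sketch of why uniqueness matters, using invented rows in which the made-up operon name 'opX' appears twice: plain assignment keeps only the last row, while push accumulates genes from both.

```perl
use strict;
use warnings;

# Hypothetical rows: operon name followed by its genes; 'opX' is listed twice.
my @rows = ( [ 'opX', 'b0001' ], [ 'opX', 'b0002', 'b0003' ] );

my (%assigned, %pushed);
for my $row (@rows) {
    my ($operon, @genes) = @$row;
    $assigned{$operon} = \@genes;           # a later row replaces an earlier one
    push @{ $pushed{$operon} }, @genes;     # a later row is appended
}

print "assigned: @{ $assigned{opX} }\n";
print "pushed:   @{ $pushed{opX} }\n";
```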

    Where do you want *them* to go today?
      Thanks, I was trying this but failing.
      #!/usr/bin/perl
      use strict;

      my $operon;
      my %operonHash;

      while (<>) {
          chomp;
          if ( /(\b.+?\b)/ ) {
              # word boundary + any character at least once, up to the first word boundary.
              #print "Matched: |$`<$&?>$'|\n";
              $operon = $_;
              #print $& . " " ;
              $operonHash{$&} = ();
          } else {
              print "No match. \n";
          }
          print "\n";
          if ( /\w+\|/ ) {
              # word boundary + any character at least once, up to the first word boundary.
              print "Matched: |$`<<$&>>$'|\n";
          } else {
              print "No match. \n";
          }
      }
      The problem is: 1. how to get rid of the | from the expression that was found, and 2. how to get MULTIPLE genes before | when more than one gene appears on a line?
        From inspection of the first sample, I noticed that there always seemed to be four spaces delimiting the columns. In the data set provided here, that does not seem to be the case (i.e. there are tabs instead).

        If tabs are the actual delimiter, then use this line instead:

        my @fields = split(/\t/, $_);

        The initial goal is to split each line of data into four separate fields.
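
        As a sketch only: assuming tab-separated columns and gene entries of the form name|id, the four fields and the gene IDs could be pulled apart like this. The variable names ($operon, $count, $strand, $gene_list) are my own labels for the columns, not something stated in the thread.

```perl
use strict;
use warnings;

# Sample line, assuming tabs between the columns.
my $line = "malXY\t2\tforward\tmalX|b1621,malY|b1622,";

# Four fields; the names are assumed labels for the columns.
my ($operon, $count, $strand, $gene_list) = split /\t/, $line;

# Each gene entry looks like name|id. Splitting on ',' handles several genes
# per line; splitting on '|' drops the pipe and separates name from id.
my @ids;
for my $entry (split /,/, $gene_list) {
    my ($name, $id) = split /\|/, $entry;
    push @ids, $id if defined $id && length $id;   # entries like "C0067|" have no id
}

print "$operon: @ids\n";
```

        Entries with nothing after the | are simply skipped here; whether that is the right treatment depends on what the data is meant to say.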


Re^3: Getting data from a file (Operons and Genes).
by chrisantha (Initiate) on Jun 15, 2007 at 00:02 UTC
    Thanks again. I tried your code on this data
    C0067 1 forward C0067|,
    C0293 1 forward C0293|,
    C0343-dbpA 2 forward C0343|,dbpA|b1343,
    C0465 1 forward C0465|,
    C0614 1 reverse C0614|,
    C0719 1 forward C0719|,
    IS128 1 forward IS128|,
    aaeR 1 forward aaeR|b3243,
    aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242,
    aas-ygeD 2 reverse aas|b2836,ygeD|b2835,
    aat 1 reverse aat|b0885,
    abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335,
    abgR 1 forward abgR|b1339,
    abrB 1 reverse abrB|b0715,
    accA 1 forward accA|b0185,
    accBC 2 forward accB|b3255,accC|b3256,
    accD 1 reverse accD|b2316,
    aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016,
    ackA-pta 2 forward ackA|b2296,pta|b2297,
    acnA 1 forward acnA|b1276,
    acnB 1 forward acnB|b0118,
    acpH 1 reverse acpH|b0404,
    acpT 1 forward acpT|b3475,
    acrAB 2 reverse acrA|b0463,acrB|b0462,
    acrD 1 forward acrD|b2470,
    acrEF 2 forward acrE|b3265,acrF|b3266,
    acrR 1 forward acrR|b0464,
    acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068,
    ada-alkB 2 reverse ada|b2213,alkB|b2212,
    add 1 forward add|b1623,
    ade 1 forward ade|b3665,
    adhE 1 reverse adhE|b1241,
    adhP 1 reverse adhP|b1478,
    adiA 1 reverse adiA|b4117,
    adiC 1 reverse adiC|b4115,
    adiY 1 reverse adiY|b4116,
    adk 1 forward adk|b0474,
    adrA 1 forward adrA|b0385,
    aegA 1 reverse aegA|b2468,
    aer 1 reverse aer|b3072,
    aes 1 reverse aes|b0476,
    agaR 1 reverse agaR|b3131,
    agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,
    malS 1 forward malS|b3571,
    malT 1 forward malT|b3418,
    malXY 2 forward malX|b1621,malY|b1622,
    malZ 1 forward malZ|b0403,
    manA 1 forward manA|b1613,

    which gives output

    $VAR1 = {
      'malZ' => [ [ 'b0403' ] ],
      ' => [ 'ada-alkB 2 reverse ada|b2213,alkB|b2212, [] ],
      'malS' => [ [ 'b3571' ] ],
      ' => [ 'acnA 1 forward acnA|b1276, [] ],
      ' => [ 'aegA 1 reverse aegA|b2468, [] ],
      ' => [ 'adhP 1 reverse adhP|b1478, [] ],
      ' => [ 'abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335, [] ],
      ' => [ 'acnB 1 forward acnB|b0118, [] ],
      ' => [ 'acpT 1 forward acpT|b3475, [] ],
      ' => [ 'ade 1 forward ade|b3665, [] ],
      ' => [ 'acrD 1 forward acrD|b2470, [] ],
      ' => [ 'acpH 1 reverse acpH|b0404, [] ],
      ' => [ 'agaR 1 reverse agaR|b3131, [] ],
      ' => [ 'C0293 1 forward C0293|, [] ],
      'malXY' => [ [ 'b1621', 'b1622' ] ],
      ' => [ 'aaeR 1 forward aaeR|b3243, [] ],
      ' => [ 'ackA-pta 2 forward ackA|b2296,pta|b2297, [] ],
      ' => [ 'aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016, [] ],
      ' => [ 'aat 1 reverse aat|b0885, [] ],
      ' => [ 'acrR 1 forward acrR|b0464, [] ],
      ' => [ 'accA 1 forward accA|b0185, [] ],
      ' => [ 'IS128 1 forward IS128|, [] ],
      ' => [ 'aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242, [] ],
      'malT' => [ [ 'b3418' ] ],
      ' => [ 'abrB 1 reverse abrB|b0715, [] ],
      ' => [ 'C0465 1 forward C0465|, [] ],
      ' => [ 'aas-ygeD 2 reverse aas|b2836,ygeD|b2835, [] ],
      ' => [ 'C0614 1 reverse C0614|, [] ],
      ' => [ 'adk 1 forward adk|b0474, [] ],
      'agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,' => [ [] ],
      ' => [ 'adiC 1 reverse adiC|b4115, [] ],
      ' => [ 'aes 1 reverse aes|b0476, [] ],
      ' => [ 'accBC 2 forward accB|b3255,accC|b3256, [] ],
      ' => [ 'add 1 forward add|b1623, [] ],
      ' => [ 'accD 1 reverse accD|b2316, [] ],
      ' => [ 'adrA 1 forward adrA|b0385, [] ],
      ' => [ 'adiA 1 reverse adiA|b4117, [] ],
      ' => [ 'aer 1 reverse aer|b3072, [] ],
      ' => [ 'acrEF 2 forward acrE|b3265,acrF|b3266, [] ],
      ' => [ 'acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068, [] ],
      ' => [ 'abgR 1 forward abgR|b1339, [] ],
      ' => [ 'C0719 1 forward C0719|, [] ],
      ' => [ 'adiY 1 reverse adiY|b4116, [] ],
      ' => [ 'C0343-dbpA 2 forward C0343|,dbpA|b1343, [] ],
      'manA' => [ [ 'b1613' ] ],
      ' => [ 'acrAB 2 reverse acrA|b0463,acrB|b0462, [] ],
      ' => [ 'C0067 1 forward C0067|, [] ],
      ' => [ 'adhE 1 reverse adhE|b1241, [] ]
    };
    Which is strange because it works for the example data you sent me which I cut and pasted into the above data, but not for the original data file. Please can you explain this? Yours, Chrisantha
      I think you have tabs separating the fields instead of the four spaces that the code is expecting, and unexpected carriage return characters at the end of lines. Changing it to
      my @fields = split(' ', $_);
      (or just my @fields = split;, since ' ' and $_ are what split defaults to) solves both problems.
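
      A small sketch to show the effect, assuming a line with tabs between fields and a Windows-style "\r\n" ending (the sample string is made up to resemble the posted data):

```perl
use strict;
use warnings;

# A line as it might look with tab separators and a CRLF line ending.
my $line = "malZ\t1\tforward\tmalZ|b0403,\r\n";
chomp $line;                              # removes "\n" but leaves "\r"

my @by_spaces = split /    /, $line;      # four literal spaces: nothing to split on
my @by_tab    = split /\t/,  $line;       # splits, but the last field keeps "\r"
my @by_ws     = split ' ',   $line;       # splits on any whitespace, "\r" included

printf "four spaces: %d field(s)\n", scalar @by_spaces;
printf "tab:         %d field(s)\n", scalar @by_tab;
printf "' ':         %d field(s)\n", scalar @by_ws;

my $tab_cr = $by_tab[3] =~ /\r\z/ ? "yes" : "no";
my $ws_cr  = $by_ws[3]  =~ /\r\z/ ? "yes" : "no";
print "trailing CR after tab split: $tab_cr\n";
print "trailing CR after ' ' split: $ws_cr\n";
```

      With the four-space pattern the whole line stays in one field, which is consistent with whole lines ending up as hash keys in the dump above.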