chrisantha has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'd like to read in the file format below. The first word on each line is the operon name, and the words followed by | are the genes contained within that operon. I want to make a hash table in which each operon points to a list of the genes it contains. I'd then like to store this and print it out. Could you please help me with this?

Yours,
Chrisantha
Birmingham, UK
malS 1 forward malS|b3571,
malT 1 forward malT|b3418,
malXY 2 forward malX|b1621,malY|b1622,
malZ 1 forward malZ|b0403,
manA 1 forward manA|b1613,
manXYZ 3 forward manX|b1817,manY|b1818,manZ|b1819,
map-glnD-dapD 3 reverse dapD|b0166,glnD|b0167,map|b0168,
marC 1 reverse marC|b1529,
marRAB 3 forward marA|b1531,marB|b1532,marR|b1530,
mbhA 1 forward mbhA|,
mcrA 1 forward mcrA|b1159,
mcrBC 2 reverse mcrB|b4346,mcrC|b4345,
mdaB 1 forward mdaB|b3028,
mdh 1 reverse mdh|b3236,
mdlAB 2 forward mdlA|b0448,mdlB|b0449,
mdoB 1 reverse mdoB|b4359,
mdoC 1 reverse mdoC|b1047,
mdoD 1 forward mdoD|b1424,
mdoGH 2 forward mdoG|b1048,mdoH|b1049,
mdtABCD-baeSR 6 forward baeR|b2079,baeS|b2078,mdtA|b2074,mdtB|b2075,mdtC|b2076,mdtD|b2077,
mdtEF 2 forward mdtE|b3513,mdtF|b3514,
mdtG 1 reverse mdtG|b1053,
mdtH 1 reverse mdtH|b1065,
mdtJI 2 reverse mdtI|b1599,mdtJ|b1600,
mdtK 1 forward mdtK|b1663,
mdtL 1 forward mdtL|b3710,
mdtM-yjiN 2 reverse mdtM|b4337,yjiN|b4336,
mdtNOP 3 reverse mdtN|b4082,mdtO|b4081,mdtP|b4080,
mdtQ 1 reverse mdtQ|b2138,
melAB 2 forward melA|b4119,melB|b4120,
melR 1 reverse melR|b4118,
menA 1 reverse menA|b3930,
menFD-yfbB-menBCE 6 reverse menB|b2262,menC|b2261,menD|b2264,menE|b2260,menF|b2265,yfbB|b2263,
metA 1 forward metA|b4013,
metBL 2 forward metB|b3939,metL|b3940,
metC 1 forward metC|b3008,
metE 1 forward metE|b3829,
metF 1 forward metF|b3941,
metG 1 forward metG|b2114,
metH 1 forward metH|b4019,
metJ 1 reverse metJ|b3938,
metK 1 forward metK|b2942,
metNIQ 3 reverse metI|b0198,metN|b0199,metQ|b0197,
metR 1 reverse metR|b3828,
metT-leuW-glnUW-metU-glnVX 7 reverse glnU|b0670,glnV|b0665,glnW|b0668,glnX|b0664,leuW|b0672,metT|b0673,metU|b0666,
metY-yhbC-nusA-infB-rbfA-truB-rpsO-pnp 8 reverse infB|b3168,metY|b3171,nusA|b3169,pnp|b3164,rbfA|b3167,rpsO|b3165,truB|b3166,yhbC|b3170,
metZWV 3 forward metV|b2816,metW|b2815,metZ|b2814,
mfd 1 reverse mfd|b1114,
mglBAC 3 reverse mglA|b2149,mglB|b2150,mglC|b2148,
mgrB 1 reverse mgrB|b1826,
mgsA 1 reverse mgsA|b0963,
mgtA 1 forward mgtA|b4242,

Replies are listed 'Best First'.
Re: Getting data from a file (Operons and Genes).
by thezip (Vicar) on Jun 14, 2007 at 22:32 UTC

    chrisantha, try this:

    #!/perl/bin/perl -w
    use strict;
    use Data::Dumper;

    my $hash = {};

    for (<DATA>) {
        chomp;
        my @fields = split(/\s{4}/, $_);
        my @genes = split(',', $fields[3]);
        @genes = map { (split('\|'))[1] } @genes;
        push(@{$hash->{$fields[0]}}, \@genes);
    }

    print Dumper($hash);

    __DATA__
    malS    1    forward    malS|b3571,
    malT    1    forward    malT|b3418,
    malXY    2    forward    malX|b1621,malY|b1622,
    malZ    1    forward    malZ|b0403,
    manA    1    forward    manA|b1613,

    __OUTPUT__
    $VAR1 = {
        'malZ' => [ [ 'b0403' ] ],
        'malS' => [ [ 'b3571' ] ],
        'manA' => [ [ 'b1613' ] ],
        'malXY' => [ [ 'b1621', 'b1622' ] ],
        'malT' => [ [ 'b3418' ] ]
    };

    Is this what you're looking for?

    Updated: Added __OUTPUT__


    Where do you want *them* to go today?
      You've got an unnecessary level of array reference there; you could remove it by changing
      push(@{$hash->{$fields[0]}}, \@genes);
      to
      push(@{$hash->{$fields[0]}}, @genes);
      giving output:
      $VAR1 = {
          'malZ' => [ 'b0403' ],
          'malS' => [ 'b3571' ],
          'manA' => [ 'b1613' ],
          'malXY' => [ 'b1621', 'b1622' ],
          'malT' => [ 'b3418' ]
      };

        I chose not to assume that an operon was unique throughout the file, since it wasn't explicitly stated in the spec.

        If I can assume uniqueness, then I'd remove the push and change to a HoA:

        $hash->{$fields[0]} = \@genes;

        Where do you want *them* to go today?
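
To make the hash-of-arrays (HoA) variant concrete, here is a minimal sketch, assuming each operon should appear only once and that a repeat is worth reporting rather than silently overwriting. The names `%genes_for` and `@duplicates`, the sample lines, and the use of `split ' '` (instead of the four-character split above) are illustrative assumptions, not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the last one repeats an operon on purpose.
my @lines = (
    'malXY 2 forward malX|b1621,malY|b1622,',
    'malZ 1 forward malZ|b0403,',
    'malZ 1 forward malZ|b0403,',
);

my %genes_for;     # operon name => array reference of b-numbers
my @duplicates;    # operon names seen more than once

for my $line (@lines) {
    my ($operon, $count, $strand, $list) = split ' ', $line;

    # Keep the part after the | for every comma-separated item.
    my @genes = map { (split /\|/)[1] } split /,/, $list;

    if (exists $genes_for{$operon}) {
        push @duplicates, $operon;    # keep the first occurrence, note the repeat
        next;
    }
    $genes_for{$operon} = \@genes;    # plain assignment, no push
}

for my $operon (sort keys %genes_for) {
    print "$operon: ", join(', ', @{ $genes_for{$operon} }), "\n";
}
print "Repeated operons: @duplicates\n" if @duplicates;
```

With plain assignment a later line would replace an earlier one, which is why the sketch checks `exists` first.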
        Thanks again. I tried your code on this data
        C0067 1 forward C0067|,
        C0293 1 forward C0293|,
        C0343-dbpA 2 forward C0343|,dbpA|b1343,
        C0465 1 forward C0465|,
        C0614 1 reverse C0614|,
        C0719 1 forward C0719|,
        IS128 1 forward IS128|,
        aaeR 1 forward aaeR|b3243,
        aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242,
        aas-ygeD 2 reverse aas|b2836,ygeD|b2835,
        aat 1 reverse aat|b0885,
        abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335,
        abgR 1 forward abgR|b1339,
        abrB 1 reverse abrB|b0715,
        accA 1 forward accA|b0185,
        accBC 2 forward accB|b3255,accC|b3256,
        accD 1 reverse accD|b2316,
        aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016,
        ackA-pta 2 forward ackA|b2296,pta|b2297,
        acnA 1 forward acnA|b1276,
        acnB 1 forward acnB|b0118,
        acpH 1 reverse acpH|b0404,
        acpT 1 forward acpT|b3475,
        acrAB 2 reverse acrA|b0463,acrB|b0462,
        acrD 1 forward acrD|b2470,
        acrEF 2 forward acrE|b3265,acrF|b3266,
        acrR 1 forward acrR|b0464,
        acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068,
        ada-alkB 2 reverse ada|b2213,alkB|b2212,
        add 1 forward add|b1623,
        ade 1 forward ade|b3665,
        adhE 1 reverse adhE|b1241,
        adhP 1 reverse adhP|b1478,
        adiA 1 reverse adiA|b4117,
        adiC 1 reverse adiC|b4115,
        adiY 1 reverse adiY|b4116,
        adk 1 forward adk|b0474,
        adrA 1 forward adrA|b0385,
        aegA 1 reverse aegA|b2468,
        aer 1 reverse aer|b3072,
        aes 1 reverse aes|b0476,
        agaR 1 reverse agaR|b3131,
        agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,
        malS 1 forward malS|b3571,
        malT 1 forward malT|b3418,
        malXY 2 forward malX|b1621,malY|b1622,
        malZ 1 forward malZ|b0403,
        manA 1 forward manA|b1613,

        which gives output

        $VAR1 = {
            'malZ' => [ [ 'b0403' ] ],
            ' => [ 'ada-alkB 2 reverse ada|b2213,alkB|b2212, [] ],
            'malS' => [ [ 'b3571' ] ],
            ' => [ 'acnA 1 forward acnA|b1276, [] ],
            ' => [ 'aegA 1 reverse aegA|b2468, [] ],
            ' => [ 'adhP 1 reverse adhP|b1478, [] ],
            ' => [ 'abgABT-ogt 4 reverse abgA|b1338,abgB|b1337,abgT|b1336,ogt|b1335, [] ],
            ' => [ 'acnB 1 forward acnB|b0118, [] ],
            ' => [ 'acpT 1 forward acpT|b3475, [] ],
            ' => [ 'ade 1 forward ade|b3665, [] ],
            ' => [ 'acrD 1 forward acrD|b2470, [] ],
            ' => [ 'acpH 1 reverse acpH|b0404, [] ],
            ' => [ 'agaR 1 reverse agaR|b3131, [] ],
            ' => [ 'C0293 1 forward C0293|, [] ],
            'malXY' => [ [ 'b1621', 'b1622' ] ],
            ' => [ 'aaeR 1 forward aaeR|b3243, [] ],
            ' => [ 'ackA-pta 2 forward ackA|b2296,pta|b2297, [] ],
            ' => [ 'aceBAK 3 forward aceA|b4015,aceB|b4014,aceK|b4016, [] ],
            ' => [ 'aat 1 reverse aat|b0885, [] ],
            ' => [ 'acrR 1 forward acrR|b0464, [] ],
            ' => [ 'accA 1 forward accA|b0185, [] ],
            ' => [ 'IS128 1 forward IS128|, [] ],
            ' => [ 'aaeXAB 3 reverse aaeA|b3241,aaeB|b3240,aaeX|b3242, [] ],
            'malT' => [ [ 'b3418' ] ],
            ' => [ 'abrB 1 reverse abrB|b0715, [] ],
            ' => [ 'C0465 1 forward C0465|, [] ],
            ' => [ 'aas-ygeD 2 reverse aas|b2836,ygeD|b2835, [] ],
            ' => [ 'C0614 1 reverse C0614|, [] ],
            ' => [ 'adk 1 forward adk|b0474, [] ],
            'agaS-kbaY-agaBCDI 6 forward agaB|b3138,agaC|b3139,agaD|b3140,agaI|b3141,agaS|b3136,kbaY|b3137,' => [ [] ],
            ' => [ 'adiC 1 reverse adiC|b4115, [] ],
            ' => [ 'aes 1 reverse aes|b0476, [] ],
            ' => [ 'accBC 2 forward accB|b3255,accC|b3256, [] ],
            ' => [ 'add 1 forward add|b1623, [] ],
            ' => [ 'accD 1 reverse accD|b2316, [] ],
            ' => [ 'adrA 1 forward adrA|b0385, [] ],
            ' => [ 'adiA 1 reverse adiA|b4117, [] ],
            ' => [ 'aer 1 reverse aer|b3072, [] ],
            ' => [ 'acrEF 2 forward acrE|b3265,acrF|b3266, [] ],
            ' => [ 'acs-yjcH-actP 3 reverse acs|b4069,actP|b4067,yjcH|b4068, [] ],
            ' => [ 'abgR 1 forward abgR|b1339, [] ],
            ' => [ 'C0719 1 forward C0719|, [] ],
            ' => [ 'adiY 1 reverse adiY|b4116, [] ],
            ' => [ 'C0343-dbpA 2 forward C0343|,dbpA|b1343, [] ],
            'manA' => [ [ 'b1613' ] ],
            ' => [ 'acrAB 2 reverse acrA|b0463,acrB|b0462, [] ],
            ' => [ 'C0067 1 forward C0067|, [] ],
            ' => [ 'adhE 1 reverse adhE|b1241, [] ]
        };
        Which is strange because it works for the example data you sent me which I cut and pasted into the above data, but not for the original data file. Please can you explain this? Yours, Chrisantha
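
The thread does not settle why the original file behaves differently here. One plausible cause, and it is only an assumption, is that the file separates fields with tabs (or some other width of whitespace) and ends lines with a carriage return, so `split(/\s{4}/, ...)` never finds a run of four whitespace characters. A small diagnostic sketch, using a made-up line and the real `$Data::Dumper::Useqq` setting to make invisible characters visible:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # print \t, \r and similar as escapes

# Hypothetical line: tab-separated fields and a Windows line ending.
my $line = "malS\t1\tforward\tmalS|b3571,\r\n";
chomp $line;                 # removes "\n" only; the "\r" stays

# Splitting on a run of four whitespace characters finds nothing to split on,
# so the whole line ends up as a single field (and would become the hash key).
my @strict = split /\s{4}/, $line;
print Dumper(\@strict);

# split ' ' splits on any run of whitespace and ignores trailing whitespace.
my @loose = split ' ', $line;
print Dumper(\@loose);
```

If the first dump shows one long string containing `\t` or ending in `\r`, the separator is the culprit, and `split ' '` is the more forgiving choice.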
      Yes, thanks, I was trying something like this and failing.

      #!/usr/bin/perl
      use strict;

      my $operon;
      my %operonHash;

      while (<>) {
          chomp;
          if ( /(\b.+?\b)/ ) { # word boundary + any character at least once, up to the first word boundary.
              #print "Matched: |$`<$&?>$'|\n";
              $operon = $_;
              #print $& . " " ;
              $operonHash{$&} = ();
          } else {
              print "No match. \n";
          }
          print "\n";
          if ( /\w+\|/ ) { # word boundary + any character at least once, up to the first word boundary.
              print "Matched: |$`<<$&>>$'|\n";
          } else {
              print "No match. \n";
          }
      }

      The problem is: 1. how to get rid of the | from the expression that was found, and 2. how to get MULTIPLE genes before | when more than one gene appears on a line?
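
Both questions can be handled with capture groups, sketched below under the assumption that gene names consist only of word characters (as they do in the posted data). Parentheses keep the `|` out of the captured text, and the `/g` modifier in list context returns every capture on the line. The variable names and the sample line are illustrative.

```perl
use strict;
use warnings;

my %operonHash;

# Hypothetical input line in the posted format.
my $line = 'manXYZ 3 forward manX|b1817,manY|b1818,manZ|b1819,';

# 1. The operon name is the first run of non-whitespace characters.
my ($operon) = $line =~ /^(\S+)/;

# 2. Capture the word characters in front of each |; the | itself is
#    matched but sits outside the parentheses, so it is not captured.
my @genes = $line =~ /(\w+)\|/g;

$operonHash{$operon} = [@genes];

print "$operon: @genes\n";    # manXYZ: manX manY manZ
```

Using explicit captures also avoids `$&`, `` $` `` and `$'`, which are harder to read and were historically slow.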
Re: Getting data from a file (Operons and Genes).
by GrandFather (Saint) on Jun 15, 2007 at 00:46 UTC

    It's not at all clear what the bigger picture is, but if you want to subsequently do some sort of lookup to find the genes associated with a particular operon, then generating a hash is more useful. Consider:

    use strict;
    use warnings;

    my %operons;

    while (<DATA>) {
        chomp;
        next unless /^(\S+) \s+\S+\s+\S+\s+ (\S+)$/x;
        my ($operon, $targets) = ($1, $2);
        my @genes = split /,/, $targets;

        $operons{$operon} ||= {};
        for (@genes) {
            my ($subOperon, $gene) = /([^|]*)\|(.*)/;
            $operons{$operon}{$subOperon} = $gene;
        }
    }

    for my $operon (sort keys %operons) {
        print "$operon: ",
            join (', ', map {"$_ ($operons{$operon}{$_})"} sort keys %{$operons{$operon}}),
            "\n";
    }

    __DATA__
    malS 1 forward malS|b3571,
    malT 1 forward malT|b3418,
    malXY 2 forward malX|b1621,malY|b1622,
    malZ 1 forward malZ|b0403,
    manA 1 forward manA|b1613,
    manXYZ 3 forward manX|b1817,manY|b1818,manZ|b1819,
    map-glnD-dapD 3 reverse dapD|b0166,glnD|b0167,map|b0168,
    marC 1 reverse marC|b1529,
    marRAB 3 forward marA|b1531,marB|b1532,marR|b1530,
    mbhA 1 forward mbhA|,
    mcrA 1 forward mcrA|b1159,
    mcrBC 2 reverse mcrB|b4346,mcrC|b4345,
    mdaB 1 forward mdaB|b3028,
    mdh 1 reverse mdh|b3236,
    mdlAB 2 forward mdlA|b0448,mdlB|b0449,
    mdoB 1 reverse mdoB|b4359,
    mdoC 1 reverse mdoC|b1047,
    mdoD 1 forward mdoD|b1424,
    mdoGH 2 forward mdoG|b1048,mdoH|b1049,
    mdtABCD-baeSR 6 forward baeR|b2079,baeS|b2078,mdtA|b2074,mdtB|b2075,mdtC|b2076,mdtD|b2077,
    mdtEF 2 forward mdtE|b3513,mdtF|b3514,
    mdtG 1 reverse mdtG|b1053,
    mdtH 1 reverse mdtH|b1065,
    mdtJI 2 reverse mdtI|b1599,mdtJ|b1600,
    mdtK 1 forward mdtK|b1663,
    mdtL 1 forward mdtL|b3710,
    mdtM-yjiN 2 reverse mdtM|b4337,yjiN|b4336,
    mdtNOP 3 reverse mdtN|b4082,mdtO|b4081,mdtP|b4080,
    mdtQ 1 reverse mdtQ|b2138,
    melAB 2 forward melA|b4119,melB|b4120,
    melR 1 reverse melR|b4118,
    menA 1 reverse menA|b3930,
    menFD-yfbB-menBCE 6 reverse menB|b2262,menC|b2261,menD|b2264,menE|b2260,menF|b2265,yfbB|b2263,
    metA 1 forward metA|b4013,
    metBL 2 forward metB|b3939,metL|b3940,
    metC 1 forward metC|b3008,
    metE 1 forward metE|b3829,
    metF 1 forward metF|b3941,
    metG 1 forward metG|b2114,
    metH 1 forward metH|b4019,
    metJ 1 reverse metJ|b3938,
    metK 1 forward metK|b2942,
    metNIQ 3 reverse metI|b0198,metN|b0199,metQ|b0197,
    metR 1 reverse metR|b3828,
    metT-leuW-glnUW-metU-glnVX 7 reverse glnU|b0670,glnV|b0665,glnW|b0668,glnX|b0664,leuW|b0672,metT|b0673,metU|b0666,
    metY-yhbC-nusA-infB-rbfA-truB-rpsO-pnp 8 reverse infB|b3168,metY|b3171,nusA|b3169,pnp|b3164,rbfA|b3167,rpsO|b3165,truB|b3166,yhbC|b3170,
    metZWV 3 forward metV|b2816,metW|b2815,metZ|b2814,
    mfd 1 reverse mfd|b1114,
    mglBAC 3 reverse mglA|b2149,mglB|b2150,mglC|b2148,
    mgrB 1 reverse mgrB|b1826,
    mgsA 1 reverse mgsA|b0963,
    mgtA 1 forward mgtA|b4242,

    Prints:

    malS: malS (b3571)
    malT: malT (b3418)
    malXY: malX (b1621), malY (b1622)
    malZ: malZ (b0403)
    manA: manA (b1613)
    manXYZ: manX (b1817), manY (b1818), manZ (b1819)
    map-glnD-dapD: dapD (b0166), glnD (b0167), map (b0168)
    marC: marC (b1529)
    marRAB: marA (b1531), marB (b1532), marR (b1530)
    mbhA: mbhA ()
    mcrA: mcrA (b1159)
    mcrBC: mcrB (b4346), mcrC (b4345)
    mdaB: mdaB (b3028)
    mdh: mdh (b3236)
    mdlAB: mdlA (b0448), mdlB (b0449)
    mdoB: mdoB (b4359)
    mdoC: mdoC (b1047)
    mdoD: mdoD (b1424)
    mdoGH: mdoG (b1048), mdoH (b1049)
    mdtABCD-baeSR: baeR (b2079), baeS (b2078), mdtA (b2074), mdtB (b2075), mdtC (b2076), mdtD (b2077)
    mdtEF: mdtE (b3513), mdtF (b3514)
    mdtG: mdtG (b1053)
    mdtH: mdtH (b1065)
    mdtJI: mdtI (b1599), mdtJ (b1600)
    mdtK: mdtK (b1663)
    mdtL: mdtL (b3710)
    mdtM-yjiN: mdtM (b4337), yjiN (b4336)
    mdtNOP: mdtN (b4082), mdtO (b4081), mdtP (b4080)
    mdtQ: mdtQ (b2138)
    melAB: melA (b4119), melB (b4120)
    melR: melR (b4118)
    menA: menA (b3930)
    menFD-yfbB-menBCE: menB (b2262), menC (b2261), menD (b2264), menE (b2260), menF (b2265), yfbB (b2263)
    metA: metA (b4013)
    metBL: metB (b3939), metL (b3940)
    metC: metC (b3008)
    metE: metE (b3829)
    metF: metF (b3941)
    metG: metG (b2114)
    metH: metH (b4019)
    metJ: metJ (b3938)
    metK: metK (b2942)
    metNIQ: metI (b0198), metN (b0199), metQ (b0197)
    metR: metR (b3828)
    metT-leuW-glnUW-metU-glnVX: glnU (b0670), glnV (b0665), glnW (b0668), glnX (b0664), leuW (b0672), metT (b0673), metU (b0666)
    metY-yhbC-nusA-infB-rbfA-truB-rpsO-pnp: infB (b3168), metY (b3171), nusA (b3169), pnp (b3164), rbfA (b3167), rpsO (b3165), truB (b3166), yhbC (b3170)
    metZWV: metV (b2816), metW (b2815), metZ (b2814)
    mfd: mfd (b1114)
    mglBAC: mglA (b2149), mglB (b2150), mglC (b2148)
    mgrB: mgrB (b1826)
    mgsA: mgsA (b0963)
    mgtA: mgtA (b4242)

    DWIM is Perl's answer to Gödel
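
To illustrate the kind of lookup a nested hash makes possible, here is a short sketch. It assumes a hash-of-hashes shaped like the one built above (operon name, then gene name, then b-number); the hand-written sample data and the reverse index `%operon_of` are illustrative additions, not part of the reply.

```perl
use strict;
use warnings;

# Hand-written sample in the same shape: operon => { gene => b-number }.
my %operons = (
    malXY => { malX => 'b1621', malY => 'b1622' },
    mbhA  => { mbhA => '' },
);

# Direct lookup: which b-number does malY have inside malXY?
my $id = $operons{malXY}{malY};

# Reverse index: which operon does a given gene belong to?
my %operon_of;
for my $operon (keys %operons) {
    $operon_of{$_} = $operon for keys %{ $operons{$operon} };
}

print "malY is in $operon_of{malY} with id $id\n";
print "malXY contains: ", join(', ', sort keys %{ $operons{malXY} }), "\n";
```

The reverse index costs one extra pass over the data but turns "which operon is this gene in?" into a single hash lookup.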
      Thanks, that is precisely what I want, except when I run it, it does not get past this line:
      next unless /^(\S+) \s+\S+\s+\S+\s+ (\S+)$/;
      The lines don't match this, and so it skips the loop. This is very strange, considering you were able to get the output. Might it be a problem with how I sent the data to this website? Did the data format get changed?
      Yours, Chrisantha
      Sorry, actually it works nicely. I was saving my file as RTF, not txt. I like the way you've done the next unless; I wouldn't have thought of doing it that way. Yours, Chrisantha
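
The original question also asks how to store the table, which none of the replies cover. One option is the core `Storable` module, sketched here with its `store` and `retrieve` functions. The sample hash is hand-written and the temporary file stands in for whatever path you would actually use; both are assumptions for the sake of a runnable example.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hand-written sample: operon => list of gene names.
my %operons = (
    malXY => [ 'malX', 'malY' ],
    malZ  => [ 'malZ' ],
);

# A temporary file keeps the example self-contained; a real script would
# use a fixed path of its own choosing.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

store(\%operons, $file) or die "Could not store to $file";

# Later, possibly in another script: read the structure back.
my $loaded = retrieve($file);

for my $operon (sort keys %$loaded) {
    print "$operon: ", join(', ', @{ $loaded->{$operon} }), "\n";
}
```

`retrieve` returns a reference to a copy of the stored structure, so the nested arrays come back intact.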
Re: Getting data from a file (Operons and Genes).
by scmason (Monk) on Jun 14, 2007 at 22:28 UTC
    So, I am not sure which part you need help with. Here is how you break each line into 'parts'
    foreach my $line (@lines){
        @lineFields = split( ' ', $line );
    }
    that gives us an array of the words in the line, so it has the following parts:

    $lineFields[0] is the Operon name
    $lineFields[1] is 1 2 3...
    $lineFields[2] is forward or reverse
    $lineFields[3] is the 'genes'

    Then we parse the 'genes' in a similar manner

    # get the comma-separated list of genes
    my @geneItems = split( ',', $lineFields[3] );

    # now look at each one in turn
    foreach my $item ( @geneItems ){
        # just continue processing and store in hash
    }
    That is not the complete answer, but it will get you where you want to go.
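
For reference, the two fragments above can be joined into one runnable script along these lines. This is only a sketch of one way to finish the outline: the hash name `%genes_in`, the sample lines, and the choice to keep the gene name (the part before the `|`) are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the posted format.
my @lines = (
    'malXY 2 forward malX|b1621,malY|b1622,',
    'mbhA 1 forward mbhA|,',
);

my %genes_in;    # operon name => array reference of gene names

foreach my $line (@lines) {
    # Operon name, gene count, strand, comma-separated gene list.
    my @lineFields = split ' ', $line;

    # Trailing empty fields are dropped, so the final comma is harmless.
    my @geneItems = split ',', $lineFields[3];

    foreach my $item (@geneItems) {
        my ($name) = split /\|/, $item;    # the text before the |
        push @{ $genes_in{ $lineFields[0] } }, $name;
    }
}

foreach my $operon (sort keys %genes_in) {
    print "$operon: @{ $genes_in{$operon} }\n";
}
```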
    "Never take yourself too seriously, because everyone knows that fat birds dont fly" -FLC