
Pasted below is exactly the input data I am using this script on. The script chokes on the first cell (the first column, if you were to import the whole file into Excel), which has the form <field1> | <field2> | <field3> | <field4>; all of this is one annotation in one CELL. Code 1 below is the actual data. I need to keep only <field1> | <field2> of that cell while still retaining the values in the other cells. Code 2 is the form I intend to have. The previous one-line code you sent does strip the data into the format I want, but it also takes away all the other data in my table, which is not intended. Looking at the two examples pasted here as code (tab file), I am sure you will understand what I am asking.

ARRAYS  V1.4HFChip-1  V1.4HFChip-10  V1.4HFChip-100  V1.4HFChip-11  V1.4HFChip-12  V1.4HFChip-13
37133 | Tula virus | Bunyaviridae | V1.3_110017:22.1  0.539026  0.357762  0.801409  0.315076  0.207579  0.946322
263532 | Possum enterovirus W6 | Picornaviridae | V1.3_116027:82.1  0.242743  0.712059  0.474686  0.738211  0.26494  0.529945
271479 | Papaya leaf curl Guandong virus | Geminiviridae | V1.3_105649:75.8  0.291412  0.726736  0.277159  0.893388  0.24579  0.904211
12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4  0.39146  0.567612  0.771404  0.671439  0.427434  0.855816
116056 | Pelargonium zonate spot virus | Bromoviridae | V1_111931:65.5  0.704965  0.750921  0.66365  0.835392  0.654149  0.04262
45709 | Sabia virus | Arenaviridae | V1_112261:16.8  0.392471  0.740175  0.584603  0.861441  0.434677  0.758832
130556 | Culex nigripalpus NPV | Baculoviridae | V1_112047:15.8  0.315955  0.882084  0.551393  0.909915  0.088346  0.745482
312349 | Procyon lotor papillomavirus type 1 | Papillomaviridae | V1.3_113827:83.8  0.652409  0.200222  0.65569  0.239118  0.537655  0.889673
243550 | Calicivirus isolate TCG | Caliciviridae | V1.3_115411:78.6  0.324359  0.820308  0.238306  0.88163  0.311354  0.741035
150285 | Garlic virus E | Flexiviridae | V1.3_103783:90.0  0.267302  0.809609  0.55432  0.908932  0.193653  0.718928

ARRAYS  V1.4HFChip-1  V1.4HFChip-10  V1.4HFChip-100  V1.4HFChip-11  V1.4HFChip-12  V1.4HFChip-13
37133 | Tula virus  0.539026  0.357762  0.801409  0.315076  0.207579  0.946322
263532 | Possum enterovirus W6  0.242743  0.712059  0.474686  0.738211  0.26494  0.529945
271479 | Papaya leaf curl Guandong virus  0.291412  0.726736  0.277159  0.893388  0.24579  0.904211
12202 | Lettuce mosaic virus  0.39146  0.567612  0.771404  0.671439  0.427434  0.855816
116056 | Pelargonium zonate spot virus  0.704965  0.750921  0.66365  0.835392  0.654149  0.04262
45709 | Sabia virus  0.392471  0.740175  0.584603  0.861441  0.434677  0.758832
130556 | Culex nigripalpus NPV  0.315955  0.882084  0.551393  0.909915  0.088346  0.745482
312349 | Procyon lotor papillomavirus type 1  0.652409  0.200222  0.65569  0.239118  0.537655  0.889673
243550 | Calicivirus isolate TCG  0.324359  0.820308  0.238306  0.88163  0.311354  0.741035
150285 | Garlic virus E  0.267302  0.809609  0.55432  0.908932  0.193653  0.718928

I hope this makes my question clear. To restate: I just want the curation to happen in the first cell while all the values are retained. For information, I have several thousand cells like this.

Re^5: formating data input
by BrowserUk (Patriarch) on Jun 27, 2007 at 17:22 UTC

    Try this. It appears to split the lines as you want them. See the Data::Dumper output it produces. I haven't attempted to verify the sort, as it's not clear to me what you are attempting to achieve.

    Sorting the Array name (number) and the virus name (string) in amongst the data values (Reals) using a numeric sort doesn't make a lot of sense (to me). (If you had warnings enabled, perl would yell at you about that.)
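To illustrate the point about warnings, here is a minimal sketch with a made-up three-element list (not the poster's real data structure). It mixes one virus name with two p-values and captures whatever warnings the numeric sort raises.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Hypothetical mixed column: one virus name among numeric p-values.
my @mixed = ( 0.801409, 'Tula virus', 0.315076 );

# A numeric sort treats the non-numeric string as 0 and warns about it.
my @sorted = sort { $a <=> $b } @mixed;

print "sorted: @sorted\n";
print "warning: $_" for @warnings;
```

Because the string counts as 0 in numeric context, it sorts ahead of the real p-values, which is probably not the ranking anyone wants.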

    Your indentation leaves something to be desired, but that may be an artifact of copying and pasting the code.

    Using $#{ @profile_names } instead of $#profile_names also looks weird, but perl seems to silently DWYM (do what you mean) on that as well.
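For reference, a short sketch of the usual ways to get the last index of an array. The reference variable $names_ref is introduced here only for illustration and does not appear in the original script.

```perl
use strict;
use warnings;

my @profile_names = qw( V1.4HFChip-1 V1.4HFChip-10 V1.4HFChip-100 );
my $names_ref     = \@profile_names;

# Last index of a named array.
my $last_named = $#profile_names;

# Last index through an array reference: short form and block form.
my $last_short = $#$names_ref;
my $last_block = $#{ $names_ref };

print "$last_named $last_short $last_block\n";    # prints "2 2 2"
```

The block form with braces is meant to hold a reference expression, which is why putting a plain array inside it reads oddly.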

    #!/usr/bin/perl
    # Rank array data in accordance with their P-values
    # Script author: AR, 2007.
    use Data::Dumper;

    my $pvalues_file    = $ARGV[0];
    my $MAX_TOP_ENTRIES = 20;

    #open PVAL, "< $pvalues_file" or die "Error: Can't open $pvalues_file: $!";

    my %arrays;
    my @profile_names;
    while (my $line = <DATA>) {
        chomp $line;
        if ($line =~ /^ARRAYS/) {
            my $hdr;
            ( $hdr, @profile_names ) = split( ' ', $line );
        }
        else {
            ## Split first on the pipes, leaving the values attached to the end of the last field
            my( $array_name, @pvalues ) = split( '\|', $line );
            ## Then split the last field on whitespace and overlay the fields you wish to discard
            ## Discarding the unwanted first two fields of the second split at the same time.
            @pvalues[ 1 .. 6 ] = ( split( '\s+', $pvalues[ 2 ] ) )[ 2 .. 7 ];
            for my $i ( 0 .. $#profile_names ) {
                $arrays{ $array_name }{ $profile_names[$i] } = $pvalues[$i];
            }
        }
    }
    print Dumper \%arrays;
    exit;

    foreach $array (keys %arrays) {
        my $top_count = 0;
        print "$array\t";
        %pvalues = %{$arrays{$array}};
        @profiles_sorted = sort { $pvalues{$a} <=> $pvalues{$b} } ( keys %pvalues );
        foreach $key (@profiles_sorted) {
            $top_count++;
            if ($top_count <= $MAX_TOP_ENTRIES) {
                print "$key:$pvalues{$key}\t";
            }
        }
        print "\n";
    }
    __DATA__
    ARRAYS  V1.4HFChip-1  V1.4HFChip-10  V1.4HFChip-100  V1.4HFChip-11  V1.4HFChip-12  V1.4HFChip-13
    37133 | Tula virus | Bunyaviridae | V1.3_110017:22.1  0.539026  0.357762  0.801409  0.315076  0.207579  0.946322
    263532 | Possum enterovirus W6 | Picornaviridae | V1.3_116027:82.1  0.242743  0.712059  0.474686  0.738211  0.26494  0.529945
    271479 | Papaya leaf curl Guandong virus | Geminiviridae | V1.3_105649:75.8  0.291412  0.726736  0.277159  0.893388  0.24579  0.904211
    12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4  0.39146  0.567612  0.771404  0.671439  0.427434  0.855816
    116056 | Pelargonium zonate spot virus | Bromoviridae | V1_111931:65.5  0.704965  0.750921  0.66365  0.835392  0.654149  0.04262
    45709 | Sabia virus | Arenaviridae | V1_112261:16.8  0.392471  0.740175  0.584603  0.861441  0.434677  0.758832
    130556 | Culex nigripalpus NPV | Baculoviridae | V1_112047:15.8  0.315955  0.882084  0.551393  0.909915  0.088346  0.745482
    312349 | Procyon lotor papillomavirus type 1 | Papillomaviridae | V1.3_113827:83.8  0.652409  0.200222  0.65569  0.239118  0.537655  0.889673
    243550 | Calicivirus isolate TCG | Caliciviridae | V1.3_115411:78.6  0.324359  0.820308  0.238306  0.88163  0.311354  0.741035
    150285 | Garlic virus E | Flexiviridae | V1.3_103783:90.0  0.267302  0.809609  0.55432  0.908932  0.193653  0.718928
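If the goal is only to rewrite the file so that the first cell keeps <field1> | <field2> and every other cell passes through unchanged, something along these lines might be enough. This is a sketch, not part of the script above: it assumes the cells really are tab-separated (as "tab file" suggests), and trim_first_cell is a hypothetical helper name.

```perl
use strict;
use warnings;

# Keep only the first two pipe-separated fields of the first cell and
# leave every other tab-separated cell untouched.
# Assumption: cells are separated by tabs; pipes occur only in the first cell.
sub trim_first_cell {
    my ($line) = @_;

    # The -1 limit preserves trailing empty cells.
    my ( $first, @rest ) = split /\t/, $line, -1;
    return $line unless defined $first;    # empty line: nothing to do

    # Shorten the annotation only when it has more than two fields.
    my @fields = split /\s*\|\s*/, $first;
    $first = join ' | ', @fields[ 0 .. 1 ] if @fields > 2;

    return join "\t", $first, @rest;
}

# One sample row from the posted data, rebuilt with tab separators.
my $row = join "\t",
    '37133 | Tula virus | Bunyaviridae | V1.3_110017:22.1',
    qw( 0.539026 0.357762 0.801409 0.315076 0.207579 0.946322 );

print trim_first_cell($row), "\n";
```

To convert a whole file, the same helper could be called inside a while (<>) loop after chomp, printing each returned line. The ARRAYS header line has no pipes in its first cell, so it comes back unchanged.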

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.