manu7495 has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all Monks, I am a newbie in Perl and badly stuck in a project with a deadline hanging over my head. I need to format my input data. E.g., the data to be formatted has the form <field1> | <field2> | <field3> | <field4>..., as shown below (real data):
271479 | Papaya leaf curl Guandong virus | Geminiviridae | V1.3_105649:75.8
12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4
116056 | Pelargonium zonate spot virus | Bromoviridae | V1_111931:65.5
45709 | Sabia virus | Arenaviridae | V1_112261:16.8
FORMAT that I need: <field1> | <field2>
271479 | Papaya leaf curl Guandong virus
12202 | Lettuce mosaic virus
116056 | Pelargonium zonate spot virus
45709 | Sabia virus
130556 | Culex nigripalpus NPV
The actual code I wrote to sort and rank is below. It has no code to accommodate input data of the form shown above, and hence it fails unless I can preprocess the data before running the script. Here is the actual code:
#!/usr/bin/perl
# Rank array data in accordance with their P-values
# Script author: AR, 2007.

my $pvalues_file = $ARGV[0];
my $MAX_TOP_ENTRIES = 20;

open PVAL, "< $pvalues_file" or die "Error: Can't open $pvalues_file: $!";

my %arrays;
my @profile_names;

while (my $line = <PVAL>) {
    chomp $line;
    if ($line =~ /^ARRAYS/) {
        # Header line: the remaining tab-separated cells are profile names
        my ($hdr, $i);
        ($hdr, @profile_names) = split("\t", $line);
        for $i (0 .. $#profile_names) {
            $profile_names[$i] =~ s/^\s+//;
            $profile_names[$i] =~ s/\s+$//;
        }
    }
    else {
        # Data line: first cell is the array name, the rest are P-values
        (my $array_name, @pvalues) = split("\t", $line);
        for $i (0 .. $#profile_names) {
            $arrays{$array_name}{$profile_names[$i]} = $pvalues[$i];
        }
    }
}

foreach $array (keys %arrays) {
    my $top_count = 0;
    print "$array\t";
    %pvalues = %{$arrays{$array}};
    @profiles_sorted = sort { $pvalues{$a} <=> $pvalues{$b} } ( keys %pvalues );
    foreach $key (@profiles_sorted) {
        $top_count++;
        if ($top_count <= $MAX_TOP_ENTRIES) {
            print "$key:$pvalues{$key}\t";
        }
    }
    print "\n";
}
Any help will be greatly appreciated. If I can just run a one-line Perl command to do this on the command line, that is also fine. Please help. Thanks.

Replies are listed 'Best First'.
Re: formating data input
by talexb (Chancellor) on Jun 27, 2007 at 13:51 UTC

    A one-liner that would do the job is (not tested)

    perl -ne 'chomp; @a = split(/\|/, $_, 3); pop @a; print join("|", @a), "\n";' <file

    But it helps tremendously if you learn a bit of Perl to understand what's going on. Just copying and pasting a solution is a very dangerous thing.
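As a rough illustration of what the split limit of 3 in the one-liner is for, here is a minimal sketch. The helper name `first_two_fields` is hypothetical and does not appear in the thread. With a limit of 3, `split` returns the first two fields plus one final element holding everything else, so dropping that last element leaves only the two wanted fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep the first two
# pipe-delimited fields of one line.
sub first_two_fields {
    my ($line) = @_;
    chomp $line;
    # Limit 3: field 1, field 2, and "everything else" as a third element.
    my @parts = split /\|/, $line, 3;
    pop @parts if @parts == 3;    # discard the remainder, if there is one
    return join '|', @parts;
}

# One row of the data quoted in the question.
my $sample = "45709 | Sabia virus | Arenaviridae | V1_112261:16.8\n";
print first_two_fields($sample), "\n";
```

Because the split pattern is a bare `|`, the spaces around each field are kept, so the second field still carries its trailing space.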

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: formating data input
by BrowserUk (Patriarch) on Jun 27, 2007 at 14:00 UTC

    Using -a (autosplit) -F (split delimiter), -n (read file line by line) -l (add newlines): See perlrun

    Switch "s for 's on a unix system.

    C:\test>perl -aF"\|" -nle"print join '|', @F[0,1]" junk6.pl
    271479 | Papaya leaf curl Guandong virus
    12202 | Lettuce mosaic virus
    116056 | Pelargonium zonate spot virus
    45709 | Sabia virus
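For readers new to these switches, the following is a sketch of roughly what they expand to when written as an ordinary script. The function name `print_first_two` and the in-memory demo data are illustrative assumptions, and the `next` guard for short lines is an addition that the one-liner does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roughly what  perl -aF"\|" -nle "print join '|', @F[0,1]"  does,
# written out as a function over open filehandles.
sub print_first_two {
    my ($in, $out) = @_;
    while (my $line = <$in>) {       # -n: loop over the input line by line
        chomp $line;                 # -l: strip the line ending on input
        my @F = split /\|/, $line;   # -a with -F"\|": autosplit on '|'
        next if @F < 2;              # assumption: skip lines lacking two fields
        print {$out} join('|', @F[0, 1]), "\n";   # -l: newline on output
    }
}

# Demo on an in-memory file holding two rows quoted in the thread.
my $data = "45709 | Sabia virus | Arenaviridae | V1_112261:16.8\n"
         . "12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
print_first_two($in, \*STDOUT);
close $in;
```

Passing the filehandles in as arguments keeps the loop reusable: the same function can be pointed at a real file opened with `open` instead of the in-memory string.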

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      THANKS TONS TONS AND TONS for all the wonderful suggestions. I could get what I wanted on the shell. Thanks again. Is there any way I could use this inside the code I have posted and still do what I am intending to do? I just tried and it screwed up a bit, so I would welcome any suggestions. Again, thanks a lot; I very much appreciate the help.

        I do not know how to answer you because the code you posted doesn't seem to relate to the data posted--either before or after the reduction to two fields.

        The program is looking for lines that begin with the word ARRAY and ignores any other lines, but none of your posted data has lines that begin with ARRAY?


Re: formating data input
by syphilis (Archbishop) on Jun 27, 2007 at 13:46 UTC
    Hi,
    We need to be able to (easily) see the input you're getting and the output you want. The input you're getting is not readily apparent from what you've posted. Is it:
    271479 | Papaya leaf curl Guandong virus | Geminiviridae | V1.3_105649:75.8
    12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4
    116056 | Pelargonium zonate spot virus | Bromoviridae | V1_111931:65.5
    45709 | Sabia virus | Arenaviridae | V1_112261:16.8
    FORMAT that I need <field1> | <field2>
    271479 | Papaya leaf curl Guandong virus
    12202 | Lettuce mosaic virus
    116056 | Pelargonium zonate spot virus
    45709 | Sabia virus
    130556 | Culex nigripalpus
    Does that represent the input ? (If not, please amend.) And please indicate the output you wish to produce.

    Cheers,
    Rob
    Update: What can I say ? ... the original (updated-from-when-I-first-replied-to-it) post merely reveals what we all knew all along ... that I'm just a braindead moron dickhead .... thanks manu7495 .... /me will go away now ... never to be seen again ... for at least 3 minutes :-)
Re: formating data input
by swampyankee (Parson) on Jun 27, 2007 at 14:59 UTC

    I'm not quite sure I understand your question: usually input data are pre-existing, and the program has to be written to process them.

    What you seem to want is the first two fields of your input data. There are, of course, several ways to do that. What I would probably do would look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $input_file = pop(@ARGV) or die "Please enter file name on command line\n";
    open(my $input, "<", $input_file) or die "Could not open $input_file because $!\n";
    while(<$input>) {
        my @line = split(/\s*\|\s*/, $_);
        print join(' | ', @line[0..1]) . "\n";
    }
    close ($input);

    emc

    Any New York City or Connecticut area jobs? I'm currently unemployed.

    There are some enterprises in which a careful disorderliness is the true method.

    —Herman Melville
      Pasted below is exactly the input data I am using this script for. It chokes on the first cell (if you were to import the whole data into Excel) when that cell has the form <field1> | <field2> | <field3> | <field4>; all of this is one annotation in one cell. Code 1 below is the actual data. I need to keep only <field1> | <field2> in this cell but still retain the values in the other cells; code 2 is the form I intend to have. The one-line code you sent does strip the data into the format I want, but it also takes away all the other data in my table, which is not intended. Looking at the two examples pasted here as code (tab file), I am sure you will understand what I am asking.
      ARRAYS    V1.4HFChip-1    V1.4HFChip-10    V1.4HFChip-100    V1.4HFChip-11    V1.4HFChip-12    V1.4HFChip-13
      37133 | Tula virus | Bunyaviridae | V1.3_110017:22.1    0.539026    0.357762    0.801409    0.315076    0.207579    0.946322
      263532 | Possum enterovirus W6 | Picornaviridae | V1.3_116027:82.1    0.242743    0.712059    0.474686    0.738211    0.26494    0.529945
      271479 | Papaya leaf curl Guandong virus | Geminiviridae | V1.3_105649:75.8    0.291412    0.726736    0.277159    0.893388    0.24579    0.904211
      12202 | Lettuce mosaic virus | Potyviridae | V1.3_118815:65.4    0.39146    0.567612    0.771404    0.671439    0.427434    0.855816
      116056 | Pelargonium zonate spot virus | Bromoviridae | V1_111931:65.5    0.704965    0.750921    0.66365    0.835392    0.654149    0.04262
      45709 | Sabia virus | Arenaviridae | V1_112261:16.8    0.392471    0.740175    0.584603    0.861441    0.434677    0.758832
      130556 | Culex nigripalpus NPV | Baculoviridae | V1_112047:15.8    0.315955    0.882084    0.551393    0.909915    0.088346    0.745482
      312349 | Procyon lotor papillomavirus type 1 | Papillomaviridae | V1.3_113827:83.8    0.652409    0.200222    0.65569    0.239118    0.537655    0.889673
      243550 | Calicivirus isolate TCG | Caliciviridae | V1.3_115411:78.6    0.324359    0.820308    0.238306    0.88163    0.311354    0.741035
      150285 | Garlic virus E | Flexiviridae | V1.3_103783:90.0    0.267302    0.809609    0.55432    0.908932    0.193653    0.718928
      ARRAYS    V1.4HFChip-1    V1.4HFChip-10    V1.4HFChip-100    V1.4HFChip-11    V1.4HFChip-12    V1.4HFChip-13
      37133 | Tula virus    0.539026    0.357762    0.801409    0.315076    0.207579    0.946322
      263532 | Possum enterovirus W6    0.242743    0.712059    0.474686    0.738211    0.26494    0.529945
      271479 | Papaya leaf curl Guandong virus    0.291412    0.726736    0.277159    0.893388    0.24579    0.904211
      12202 | Lettuce mosaic virus    0.39146    0.567612    0.771404    0.671439    0.427434    0.855816
      116056 | Pelargonium zonate spot virus    0.704965    0.750921    0.66365    0.835392    0.654149    0.04262
      45709 | Sabia virus    0.392471    0.740175    0.584603    0.861441    0.434677    0.758832
      130556 | Culex nigripalpus NPV    0.315955    0.882084    0.551393    0.909915    0.088346    0.745482
      312349 | Procyon lotor papillomavirus type 1    0.652409    0.200222    0.65569    0.239118    0.537655    0.889673
      243550 | Calicivirus isolate TCG    0.324359    0.820308    0.238306    0.88163    0.311354    0.741035
      150285 | Garlic virus E    0.267302    0.809609    0.55432    0.908932    0.193653    0.718928
      Hope this makes my question clear. To restate: I just want the curation to happen in the first cell while retaining all the values. For information, I have several thousand cells like this.
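One way to read the request above is: split each row on tabs, shorten only the first cell to its first two pipe-delimited fields, and join the row back together. The following is a minimal sketch under that reading. The helper name `trim_first_cell` is hypothetical, and it assumes the real file separates cells with tab characters, as the poster's script does when it calls `split("\t", $line)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: shorten only the first tab-separated cell of a row
# to "<field1> | <field2>", leaving the remaining cells untouched.
sub trim_first_cell {
    my ($row) = @_;
    chomp $row;
    # Limit -1 keeps trailing empty cells, so the column count is preserved.
    my @cells = split /\t/, $row, -1;
    return $row unless @cells;                  # empty line: nothing to do
    my @fields = split /\s*\|\s*/, $cells[0];
    # Rows without a second field (such as the ARRAYS header) pass through.
    $cells[0] = join ' | ', @fields[0, 1] if @fields >= 2;
    return join "\t", @cells;
}

# Demo on one row from the thread, assuming tabs separate the cells.
my $row = join "\t",
    '45709 | Sabia virus | Arenaviridae | V1_112261:16.8',
    qw(0.392471 0.740175 0.584603 0.861441 0.434677 0.758832);
print trim_first_cell($row), "\n";
```

Used as a filter, the same function could be applied line by line, for example `while (<>) { print trim_first_cell($_), "\n" }`, to preprocess the file before the ranking script runs. Alternatively, the same split and join could be applied to `$array_name` inside the ranking script before it is used as a hash key.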