in reply to Re: Parsing csv without changing dimension of original file
in thread Parsing csv without changing dimension of original file

Thank you very much for your reply. Here is the sample of my original data "kegg_pathway_title.txt":

PVX_088085 Protein processing in endoplasmic reticulum
PVX_114095 Protein processing in endoplasmic reticulum
PVX_123055 Ribosome biogenesis in eukaryotes
PYYM_1032000 -
PYYM_1120600 -
PCYB_031930 Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

Orthogroups_3.csv has 13 columns:

Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii Pmalariae Tgondii
OG0000000 PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,
OG0000001 PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,

Expected output:

Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomol +gi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii + Pmalariae Tgondii OG0000000 - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , +- , - , - , - , - , - , - , - , - , - , + - , - , - , - , - , - , - , - , - , - +, - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , - + , - , - , - , - , - , - , - , - , - , +- , - , - , - , - , - , - , - , - , - , + - , - , - , - , - , - , - , - , - , - +, - , - OG0000024 - , - , - , - - , - , - - + , - , - - , - , - , - - , - , - + Protein processing in endoplasmic reticulum , - , - , - + , - - , - , - - , - , - - , - + , - - , - , - , - - , - , - - + , - , - , - - , - , - , - , - , - , +- , - , - , - , - , - , - , - , - OG0000025 - , - , - - , - , - , - - + , - , - , - - , - , - , - - , - , + - , - Protein processing in endoplasmic reticulum , Pro +tein processing in endoplasmic reticulum , - , Ribosome biogene +sis in eukaryotes - , - , - , - - , - , +- , - - , - , - , - - , Protein processi +ng in endoplasmic reticulum , Protein processing in endoplasmic re +ticulum , Ribosome biogenesis in eukaryotes - , - , - + , - - , - , - , - - , - , - , - + , - , - , - OG0000026

I want the parsed result to have the same number of columns (13) as Orthogroups_3.csv.

Best regards,
Zillur

Re^3: Parsing csv without changing dimension of original file
by huck (Prior) on Mar 07, 2017 at 01:08 UTC

    Well, there are many, many things wrong here.

    In your original example page, http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary, you missed the part where it says "I have a dictionary(dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. In a way that is easy to fix: change the line to

    my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
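
A minimal sketch of what that line does, assuming each dictionary line is an ID, one tab, then a title. The sample lines below are hypothetical stand-ins for the contents of kegg_pathway_title.txt, not the real file.

```perl
use strict;
use warnings;

# Hypothetical sample lines (assumption): ID, one tab, then a title
# that itself contains spaces.
my @lines = (
    "PVX_088085\tProtein processing in endoplasmic reticulum\n",
    "PVX_123055\tRibosome biogenesis in eukaryotes\n",
    "PYYM_1032000\t-\n",
);

# Split each line on the first tab only (limit 2), so the whole title
# stays in one piece as the hash value.
my %dict = map { chomp; split /\t/, $_, 2 } @lines;

printf "%s => %s\n", $_, $dict{$_} for sort keys %dict;
```

With a space as the separator instead, the limit of 2 is what keeps a multi-word title from being cut at its first space.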

    Next

    > grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
    implies that the fields are separated by a tab (\t). Yes, your column headers are separated by tabs, and there are tabs in your other rows, but row OG0000000 only has 5 tab-separated fields, three of them being blank due to consecutive tabs, and
    "PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"
    being considered as one field, number 5. There is at least a tab after the OG0000000, however.

    The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 has a tab after it to put it in its own column, followed by 11 blank fields due to consecutive tabs, and

    "PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"
    being considered the contents of the 13th column.
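
One way to check this kind of thing is to count the tab-delimited fields in every row and compare against the header. The sketch below uses made-up rows and a hypothetical helper named field_count; it is an illustration, not part of the script discussed in this thread.

```perl
use strict;
use warnings;

# Count tab-delimited fields in one line. The limit of -1 keeps trailing
# empty fields, which split would otherwise silently drop.
sub field_count {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

# Made-up data (assumption): a 4-column header and three rows.
my $header = "id\tA\tB\tC";
my @rows   = (
    "OG1\t\t\tg1, g2,",   # 4 fields, two of them empty
    "OG2\tg3,",           # only 2 fields: a short row
    "OG3\tg4\t\t",        # 4 fields, the last two empty
);

my $expected = field_count($header);
my @short;
for my $row (@rows) {
    my $n = field_count($row);
    next if $n == $expected;
    my ($id) = split /\t/, $row;
    push @short, $id;
    warn "row $id has $n fields, expected $expected\n";
}
```

Rows reported here are the ones that will not line up under the header when the file is read with a tab separator.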

    Given the following as your dictionary

    PVX_088085 Protein processing in endoplasmic reticulum
    PVX_114095 Protein processing in endoplasmic reticulum
    PVX_123055 Ribosome biogenesis in eukaryotes
    PYYM_1032000 -
    PYYM_1120600 -
    PCYB_031930 Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism
    one notices that none of the replace-from fields in it even occur in your sample Orthogroups_3.csv file at all, so your expected output is a myth.

    Besides the tab after the replace-from field in your dictionary file, adding this code identifies a major problem:

    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }
    result
    for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
    for lookup:PYYM_1032000 tab in field:-
    for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum
    Those new tabs introduce "extra columns" to the output.
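
A possible way to keep those tabs out of the output is to clean the replacement values before substituting. This is only a sketch: the hash below is hypothetical, and where the stray tabs actually sit in the real values (trailing or interior) is an assumption, so both cases are handled.

```perl
use strict;
use warnings;

# Hypothetical dictionary (assumption): some values carry stray tabs.
my %dict = (
    'PVX_088085'   => "Protein processing in endoplasmic reticulum\t",
    'PYYM_1032000' => "-\t",
    'PVX_123055'   => "Ribosome\tbiogenesis in eukaryotes",
);

# values() returns aliases, so editing $v changes the hash in place.
for my $v (values %dict) {
    $v =~ s/\t+\z//;   # drop trailing tabs
    $v =~ s/\t+/ /g;   # turn any interior tabs into a single space
}

my @still_tabbed = grep { $dict{$_} =~ /\t/ } keys %dict;
print "values still containing tabs: ", scalar(@still_tabbed), "\n";
```

After this, a substitution such as s/($re)/$dict{$1}/g can no longer add tab characters, so the number of tab-delimited columns in each row stays what it was.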

    The code that identifies all these problems is

    # This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
    use strict;
    use warnings;
    #use Text::CSV;
    use Data::Dumper;
    local $Data::Dumper::Deepcopy=1;
    local $Data::Dumper::Purity=1;
    local $Data::Dumper::Sortkeys=0;
    local $Data::Dumper::Indent=3;

    open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
    my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
    warn Dumper \%dict;
    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }
    #my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
    my $re = join '|', keys %dict;
    #close $fh;

    open $fh, '<', 'Orthogroups_3.csv' or die $!;
    while (<$fh>) {
        print $. ."\n";
        next if $. < 2;
        my @a0=split("\t",$_);
        warn Dumper \@a0;
        s/($re)/$dict{$1}/g;
        print;
    }

    All this leads me to think you don't have much of a clue as to what you are doing and are just trying cookie-cutter examples found on the web. This is a bad thing to do.

    Edit: added code tags around the huge fields, but I'm not sure it's any better.

      Thank you very much for your comment. Sorry for the inconvenience. Using "\t" as the separator in your first script solved the problem, and it gives me exactly the same output as the latest script:

      my %dict =  map { chomp; split '\t', $_, 2 } <$fh>;

      I have another problem. In the result, I still have the previous text in many cells. It's strange: some cells were replaced exactly and some were not. Either they were not replaced at all, or they were replaced by the first column of 'kegg_pathway_title.txt'. I was trying to delete those strings but failed. Here is what I tried:

      cut -f1 bioDBnet_db2db_KEGG_Title_final.txt > exclude-these.txt

      but it failed. I have tried many ways. Is there any way to upload my files here?

      Best regards,
      Zillur