Re^2: Parsing csv without changing dimension of original file

Thank you very much for your reply. Here is a sample of my original data, "kegg_pathway_title.txt":

 PVX_088085    Protein processing in endoplasmic reticulum    
PVX_114095    Protein processing in endoplasmic reticulum    
PVX_123055    Ribosome biogenesis in eukaryotes
PYYM_1032000    -    
PYYM_1120600    -
PCYB_031930    Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

orthogroups_3.csv has 13 columns:

Cparvum    Bmicroti    Tparva    Pberghei    Pchabaudi    Pcynomolgi    Pfalciparum    Pknowlesi    Preichenowi    Pvivax    Pyoelii    Pmalariae    Tgondii
OG0000000                PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,
OG0000001                                                PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,

Expected output:

    Cparvum    Bmicroti    Tparva    Pberghei    Pchabaudi    Pcynomol
+gi    Pfalciparum    Pknowlesi    Preichenowi    Pvivax    Pyoelii   
+ Pmalariae    Tgondii
OG0000000                -    , -    , -    , -    , -    , -    , -  
+  , -    , -    , -    , -    , -    , -    , -    , -    , -    , - 
+   , -    , -    , -    , -    , -    , -    , -    , -    , -    , -
+    , -    , -    , -    , -    , -    , -    , -    , -    , -    , 
+-    , -    , -    , -    , -    , -    , -    , -    , -    , -    ,
+ -    , -    , -    , -    , -    , -    , -    , -    , -    , -    
+, -    , -    , -    , -    , -    , -    , -    , -    , -    , -   
+ , -    , -    , -    , -    , -    , -    , -    , -    , -    , -  
+  , -    , -    , -    , -    , -    , -    , -    , -    , -    , - 
+   , -    , -    , -    , -    , -    , -    , -    , -    , -    , -
+    , -    , -    , -    , -    , -    , -    , -    , -    , -    , 
+-    , -    , -    , -    , -    , -    , -    , -    , -    , -    ,
+ -    , -    , -    , -    , -    , -    , -    , -    , -    , -    
+, -    , -    
OG0000024    -    , -    , -    , -        -    , -    , -        -   
+ , -    , -        -    , -    , -    , -        -    , -    , -     
+   Protein processing in endoplasmic reticulum    , -    , -    , -  
+  , -        -    , -    , -        -    , -    , -        -    , -  
+  , -        -    , -    , -    , -        -    , -    , -        -  
+  , -    , -    , -        -    , -    , -    , -    , -    , -    , 
+-    , -    , -    , -    , -    , -    , -    , -    , -    
OG0000025    -    , -    , -        -    , -    , -    , -        -   
+ , -    , -    , -        -    , -    , -    , -        -    , -    ,
+ -    , -        Protein processing in endoplasmic reticulum    , Pro
+tein processing in endoplasmic reticulum    , -    , Ribosome biogene
+sis in eukaryotes        -    , -    , -    , -        -    , -    , 
+-    , -        -    , -    , -    , -        -    , Protein processi
+ng in endoplasmic reticulum    , Protein processing in endoplasmic re
+ticulum    , Ribosome biogenesis in eukaryotes        -    , -    , -
+    , -        -    , -    , -    , -        -    , -    , -    , -  
+  , -    , -    , -    
OG0000026
[download]

I want the parsed result to have the same number of columns (13) as orthogroups_3.csv.

Best regards,
Zillur
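A quick way to check that requirement is to count the tab-separated fields on every row before and after the replacement. The sketch below is only an illustration: `count_fields` is a hypothetical helper, and the two rows are made-up stand-ins, not lines from the real file.

```perl
use strict;
use warnings;

# Count tab-separated fields in one line. A limit of -1 makes split keep
# trailing empty fields, which it would otherwise silently drop.
sub count_fields {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                  # strip the line ending only
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

# Made-up rows (assumption): a 13-field header and a short 5-field row.
my @rows = (
    join("\t", map { "Col$_" } 1 .. 13) . "\n",
    "OG0000000\t\t\t\tID_1, ID_2,\n",
);

for my $i (0 .. $#rows) {
    my $n = count_fields($rows[$i]);
    printf "row %d: %d fields%s\n",
        $i + 1, $n, $n == 13 ? '' : ' (expected 13)';
}
```

Running the same count on the input and on the output shows at once which rows gained or lost columns.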

Re^3: Parsing csv without changing dimension of original file by huck (Prior) on Mar 07, 2017 at 01:08 UTC
Well, there are many, many things wrong here. In your original example page, http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary, you missed the part where it says "I have a dictionary(dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. In a way that is easy to fix; change the line to

`my %dict = map { chomp; split "\t", $_, 2 } <$fh>;`

Next,

`> grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)`

implies that the fields are separated by a tab (\t). Yes, your column headers are separated by tabs, and there are tabs in your other rows, but row OG0000000 only has 5 tab-separated fields, three of them blank due to consecutive tabs, with

`"PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"`

being considered as one field, number 5.
There is at least a tab after OG0000000, however.

The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 has a tab after it to put it in its own column, followed by 11 blank fields due to consecutive tabs, and

`"PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"`

being considered the contents of the 13th column.

Given the following as your dictionary

    PVX_088085    Protein processing in endoplasmic reticulum
    PVX_114095    Protein processing in endoplasmic reticulum
    PVX_123055    Ribosome biogenesis in eukaryotes
    PYYM_1032000    -
    PYYM_1120600    -
    PCYB_031930    Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

one notices that none of the replace-from fields in it occur in your sample Orthogroups_3.csv file at all, so your expected output is a myth.

Besides the tab after the replace-from field in your dictionary file, adding this code identifies a major problem:

    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }

Result:

    for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
    for lookup:PYYM_1032000 tab in field:-
    for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum

Those new tabs introduce "extra columns" to the output.
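One possible way to keep those stray tabs out of the lookup table is to trim whitespace from both the key and the value while loading the dictionary. This is a minimal sketch, not the code from the thread: `load_dict` is a hypothetical helper, and the sample text is an in-memory stand-in for kegg_pathway_title.txt with trailing tabs added on purpose.

```perl
use strict;
use warnings;

# Build an ID => title lookup from "ID<TAB>title" lines, trimming stray
# whitespace (including trailing tabs) from both parts.
sub load_dict {
    my ($fh) = @_;
    my %dict;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                    # drops newline and trailing tabs
        next unless length $line;
        my ($id, $title) = split /\t/, $line, 2;
        next unless defined $title;            # skip lines without a tab
        $id    =~ s/^\s+//;
        $id    =~ s/\s+$//;
        $title =~ s/^\s+//;
        $title =~ s/\s+$//;
        $dict{$id} = $title;
    }
    return \%dict;
}

# In-memory stand-in for the dictionary file (assumption, not real data layout).
my $sample = " PVX_088085\tProtein processing in endoplasmic reticulum\t\n"
           . "PYYM_1032000\t-\t\n"
           . "PVX_123055\tRibosome biogenesis in eukaryotes\n";
open my $fh, '<', \$sample or die $!;
my $dict = load_dict($fh);
close $fh;

print "$_ => [$dict->{$_}]\n" for sort keys %$dict;
```

The square brackets in the printout make any leftover leading or trailing whitespace visible.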
The code that identifies all these problems is

    # This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
    use strict;
    use warnings;
    #use Text::CSV;
    use Data::Dumper;
    local $Data::Dumper::Deepcopy=1;
    local $Data::Dumper::Purity=1;
    local $Data::Dumper::Sortkeys=0;
    local $Data::Dumper::Indent=3;

    open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
    my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
    warn Dumper \%dict;
    # report dictionary values that still contain a tab
    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }
    #my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
    # '|' is regex alternation; the excerpt's '\|' would match a literal pipe
    my $re = join '|', map { quotemeta } keys %dict;
    #close $fh;
    open $fh, '<', 'Orthogroups_3.csv' or die $!;
    while (<$fh>) {
        print $. ."\n";
        next if $. < 2;
        my @a0=split("\t",$_);
        warn Dumper \@a0;          # show the tab-separated fields of this row
        s/($re)/$dict{$1}/g;
        print;
    }

All this leads me to think you don't have much of a clue as to what you are doing and are just trying cookie-cutter examples found on the web. This is a bad thing to do.

Edit: put code tags around the huge fields, but I'm not sure it's any better.
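To keep the column count fixed, as the thread title asks, one option is to leave the tabs alone entirely and replace only whole IDs by exact hash lookup. The following is a sketch under assumptions, not the thread's solution: `translate_line` is a hypothetical helper, IDs are assumed to contain no whitespace or commas, dictionary values are assumed to be tab-free, and IDs without an entry are left unchanged.

```perl
use strict;
use warnings;

# Replace every whole token (a run of characters that are neither
# whitespace nor commas) by its dictionary value, if it has one.
# Tabs, commas and spaces are never touched, so the number of
# tab-separated columns cannot change.
sub translate_line {
    my ($line, $dict) = @_;
    $line =~ s/([^\s,]+)/exists $dict->{$1} ? $dict->{$1} : $1/ge;
    return $line;
}

# Made-up dictionary and row for illustration only.
my %dict = (
    PVX_088085 => 'Protein processing in endoplasmic reticulum',
    PVX_123055 => 'Ribosome biogenesis in eukaryotes',
);
my $row = "OG0000025\t\tPVX_088085, PVX_999999\t\tPVX_123055,\n";

my $out    = translate_line($row, \%dict);
my $before = () = $row =~ /\t/g;      # number of tabs before
my $after  = () = $out =~ /\t/g;      # number of tabs after
print $out;
print "tabs before: $before, after: $after\n";
```

Because each token is looked up as a whole string, a short ID can never match inside a longer one, which an unanchored regex alternation would allow.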
Re^4: Parsing csv without changing dimension of original file by zillur (Novice) on Mar 07, 2017 at 03:54 UTC
Thank you very much for your comment, and sorry for the inconvenience. Using "sep=\t" in your first script solved the problem, and it gives me exactly the same output as the latest script:

`my %dict = map { chomp; split '\t', $_, 2 } <$fh>;`

I have another problem. In the result, many cells still contain the previous text. Strangely, some cells are replaced exactly and some are not: they are either not replaced at all or replaced by the first column of 'kegg_pathway_title.txt'. I was trying to delete those strings. What I have done is

`cut -f1 bioDBnet_db2db_KEGG_Title_final.txt > exclude-these.txt`

but it failed. I have tried many ways. Is there any way to upload my files here?

Best regards,
Zillur
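A possible first step for tracking down the cells that were not replaced is to list every ID in the table that has no entry in the dictionary. The sketch below is illustrative only: `missing_ids` is a hypothetical helper, the data is made up, and IDs are again assumed to contain no whitespace or commas.

```perl
use strict;
use warnings;

# Return the sorted list of IDs that appear in the table lines
# but have no entry in the dictionary.
sub missing_ids {
    my ($lines, $dict) = @_;
    my %missing;
    for my $line (@$lines) {
        my (undef, @cells) = split /\t/, $line, -1;   # skip the row ID
        for my $cell (@cells) {
            for my $id ($cell =~ /([^\s,]+)/g) {
                $missing{$id}++ unless exists $dict->{$id};
            }
        }
    }
    my @ids = sort keys %missing;
    return @ids;
}

# Made-up data for illustration only.
my %dict  = (PVX_088085 => 'Protein processing in endoplasmic reticulum');
my @lines = ("OG0000024\tPVX_088085, PVX_0880850\t\tPmUG01_00010100.1-p1,\n");

print "$_\n" for missing_ids(\@lines, \%dict);
```

In this made-up row, PVX_0880850 contains the dictionary key PVX_088085 as a prefix, so an unanchored regex alternation would rewrite part of it; the exact lookup above reports it as missing instead.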