Re^3: Parsing csv without changing dimension of original file

Well, there are many, many things wrong here.

In your original example page, http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary, you missed the part where it says "I have a dictionary(dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. That part is easy to fix; change the line to:

my %dict =  map { chomp; split "\t", $_, 2 } <$fh>;

Next, your call

 > grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)

implies that the fields are separated by a tab (\t). Yes, your column headers are separated by tabs, and there are tabs in your other rows, but row OG0000000 has only 5 tab-separated fields: three of them are blank because of consecutive tabs, and

"PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"

is treated as a single field, number 5. At least there is a tab after OG0000000.

The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 is followed by a tab that puts it in its own column, then come 11 blank fields from consecutive tabs, and

"PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"

is treated as the contents of the 13th column.
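If you want to check field counts like these yourself, here is a minimal sketch. The count_fields helper and the two sample rows are made up for illustration; they are not taken from your files.

```perl
use strict;
use warnings;

# Hypothetical helper: count tab-separated fields in one line,
# keeping trailing empty fields.
sub count_fields {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 stops split from discarding trailing empty fields.
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

# Made-up sample rows (assumption), not copied from Orthogroups_3.csv.
my @rows = (
    "OG_A\t\t\t\t\"g1, g2, g3,\"\n",
    "OG_B\t\"g4, g5,\"\t\t\n",
);

for my $row (@rows) {
    printf "%d fields\n", count_fields($row);
}
```

Running something like this over every line of the real file would show quickly whether all rows have the same number of columns as the header.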

Given the following as your dictionary

PVX_088085    Protein processing in endoplasmic reticulum    
PVX_114095    Protein processing in endoplasmic reticulum    
PVX_123055    Ribosome biogenesis in eukaryotes
PYYM_1032000    -    
PYYM_1120600    -
PCYB_031930    Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

one notices that none of the replace-from fields in it occur anywhere in your sample Orthogroups_3.csv file, so your expected output is a myth.
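A quick way to test that claim is to look for each dictionary key in the data before attempting any replacement. This is only a sketch: keys_found is a hypothetical helper, and the inline dictionary and data line are small samples built from the identifiers quoted in this post rather than read from your files.

```perl
use strict;
use warnings;

# Hypothetical helper: return (sorted) the dictionary keys that
# appear somewhere in the given text.
sub keys_found {
    my ($dict, $text) = @_;
    my @hits = grep { index($text, $_) >= 0 } keys %$dict;
    @hits = sort @hits;
    return @hits;
}

# Small inline samples (assumption: stand-ins for the real files).
my %dict = (
    PVX_088085 => 'Protein processing in endoplasmic reticulum',
    PVX_123055 => 'Ribosome biogenesis in eukaryotes',
);
my $data = "OG0000000\t\t\t\t\"PBANKA_0000600, PBANKA_0000701,\"\n";

my @found = keys_found(\%dict, $data);
print @found ? "found: @found\n" : "no dictionary keys occur in the data\n";
```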

And besides the tab after the replace-from field in your dictionary file, adding this code identifies a major problem:

for my $k (keys %dict) {
  my $v = $dict{$k};
  warn 'for lookup:'.$k.' tab in field:'.$v."\n" if $v =~ /\t/;
}

Result:

for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
for lookup:PYYM_1032000 tab in field:-
for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum

Those new tabs introduce "extra columns" to the output.
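One possible way around that is to trim trailing whitespace from each value while loading the dictionary. The sketch below assumes the stray tabs are trailing, as the dictionary listing suggests; load_dict is a hypothetical helper and the input lines are inline samples, not your actual file.

```perl
use strict;
use warnings;

# Hypothetical helper: build key => value pairs from tab-separated lines,
# trimming trailing whitespace (including stray tabs) from each value.
sub load_dict {
    my @lines = @_;
    my %dict;
    for my $line (@lines) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        # Skip lines that have no key or no tab-separated value.
        next unless defined $key && length $key && defined $value;
        $value =~ s/\s+\z//;
        $dict{$key} = $value;
    }
    return %dict;
}

# Inline sample lines (assumption: trailing tabs as in the listing above).
my %dict = load_dict(
    "PVX_088085\tProtein processing in endoplasmic reticulum\t\n",
    "PYYM_1032000\t-\t\n",
    "PVX_123055\tRibosome biogenesis in eukaryotes\n",
);

my @with_tab = grep { $dict{$_} =~ /\t/ } keys %dict;
print scalar(@with_tab), " values still contain a tab\n";
```

With the real file you would pass the lines read from kegg_pathway_title.txt instead of the inline list.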

The code that identifies all these problems is:

 
# This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
use strict;
use warnings;
#use Text::CSV;
use Data::Dumper;
local $Data::Dumper::Deepcopy = 1;
local $Data::Dumper::Purity   = 1;
local $Data::Dumper::Sortkeys = 0;
local $Data::Dumper::Indent   = 3;

# Load the dictionary: key, a tab, then the rest of the line as the value.
open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
warn Dumper \%dict;

# Report any value that still contains a tab.
for my $k (keys %dict) {
  my $v = $dict{$k};
  warn 'for lookup:'.$k.' tab in field:'.$v."\n" if $v =~ /\t/;
}
#my %dict =  map { chomp; split ' ', $_, 2 } <$fh>;
# quotemeta keeps characters such as "." in a key from acting as regex metacharacters.
my $re = join '|', map { quotemeta } keys %dict;
#close $fh;
open $fh, '<', 'Orthogroups_3.csv' or die $!;

while (<$fh>) {
  print $. ."\n";                              # debug: current line number
  next if $. < 2;                              # skip the header line
  my @a0 = split("\t", $_); warn Dumper \@a0;  # debug: dump the tab-separated fields
  s/($re)/$dict{$1}/g;
  print;
}

All this leads me to think you don't have much of a clue as to what you are doing and are just trying cookie-cutter examples found on the web. This is a bad thing to do.

Edit: added code tags around the huge fields, but I'm not sure it's any better.
