in reply to Re^2: Parsing csv without changing dimension of original file
in thread Parsing csv without changing dimension of original file
Well, there are many, many things wrong here.
In your original example page, http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary, you missed the part where it says "I have a dictionary(dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. In a way that is easy to fix: change the line to
my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
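To show what the limit of 2 does in that split, here is a minimal sketch. The two dictionary lines are made up for illustration and are not taken from your file. Everything after the first tab stays in the value, including any further tabs:

```perl
use strict;
use warnings;

# Hypothetical dictionary lines: key, a tab, then the replacement text.
my @lines = (
    "GENE_A\tPurine metabolism; Metabolic pathways\n",
    "GENE_B\tRibosome biogenesis\textra\n",    # the value itself contains a tab
);

# chomp strips the newline; the limit of 2 splits on the first tab only.
my %dict = map { chomp; split /\t/, $_, 2 } @lines;

for my $k (sort keys %dict) {
    (my $shown = $dict{$k}) =~ s/\t/<TAB>/g;    # make embedded tabs visible
    print "$k => $shown\n";
}
```

The second entry keeps its embedded tab, which matters as soon as the value is pasted into a tab-separated row.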
Next,

    > grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)

implies that the fields are separated by a tab (\t). Yes, your column headers are separated by a tab, and there are tabs in your other rows, but row OG0000000 only has 5 tab-separated fields, three of them being blank due to consecutive tabs, and

    "PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"

being considered as one field, being number 5. There is a tab, however, after the OG0000000 at least.

The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 has a tab after it to put it in its own column, followed by 11 blank fields due to consecutive tabs, and

    "PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"

being considered the contents of the 13th column.
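A quick way to check field counts like these yourself is to split each line on tabs and compare against the header. This is only a sketch: the rows below are invented stand-ins, not your Orthogroups_3.csv data.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for the real file.
my @rows = (
    "ID\tc1\tc2\tc3\tc4\n",        # header: 5 columns
    "OG_X\t\t\t\tg1, g2, g3\n",    # 5 fields, three of them empty
    "OG_Y\t\t\t\t\t\tg4, g5\n",    # 7 fields: more than the header has
);

my @counts;
for my $row (@rows) {
    chomp(my $line = $row);
    # A limit of -1 keeps trailing empty fields, which split drops by default.
    my @fields = split /\t/, $line, -1;
    push @counts, scalar @fields;
}

# Compare every line against the header's field count.
my $expected = $counts[0];
for my $i (0 .. $#counts) {
    my $note = $counts[$i] == $expected ? 'ok' : "expected $expected";
    print 'line ', $i + 1, ": $counts[$i] fields ($note)\n";
}
```

Any line flagged here has a different number of columns than the header, which is exactly the kind of mismatch described above.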
Given the following as your dictionary

    PVX_088085 Protein processing in endoplasmic reticulum
    PVX_114095 Protein processing in endoplasmic reticulum
    PVX_123055 Ribosome biogenesis in eukaryotes
    PYYM_1032000 -
    PYYM_1120600 -
    PCYB_031930 Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

one notices that none of the replace-from fields in it even occur in your sample Orthogroups_3.csv file at all, so your expected output is a myth.

And besides the tab after the replace-from in your dictionary file, adding this code identifies a major problem

    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }

Result:

    for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
    for lookup:PYYM_1032000 tab in field:-
    for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum

Those new tabs introduce "extra columns" to the output.
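One possible way to stop the replacements from adding columns is to clean the dictionary values before using them. The sketch below assumes that a single space is an acceptable substitute for a tab inside the replacement text; the keys, values, and row are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical dictionary whose values contain stray tabs.
my %dict = (
    GENE_A => "Protein processing\tin endoplasmic reticulum",
    GENE_B => "-\t",
);

for my $k (keys %dict) {
    $dict{$k} =~ s/\t+/ /g;    # collapse each run of tabs into one space
    $dict{$k} =~ s/\s+\z//;    # drop trailing whitespace
}

my $row    = "OG_X\t\tGENE_A, GENE_B";
my $before = () = $row =~ /\t/g;    # count the tabs before replacing
(my $out = $row) =~ s/(GENE_A|GENE_B)/$dict{$1}/g;
my $after  = () = $out =~ /\t/g;    # and again afterwards

print "tabs before: $before, after: $after\n";
```

If the two counts match, the substitution has not changed the number of columns in that row.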
The code that identifies all these problems is

    # This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
    use strict;
    use warnings;
    #use Text::CSV;
    use Data::Dumper;
    local $Data::Dumper::Deepcopy=1;
    local $Data::Dumper::Purity=1;
    local $Data::Dumper::Sortkeys=0;
    local $Data::Dumper::Indent=3;

    # Load the dictionary: key, a tab, then the replacement text
    open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
    my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
    warn Dumper \%dict;
    # Report every replacement value that itself contains a tab
    for my $k (keys %dict) {
        my $v=$dict{$k};
        warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v=~"\t");
    }
    #my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
    my $re = join '|', keys %dict;
    close $fh;    # an explicit close resets $.; the implicit close done by open does not
    open $fh, '<', 'Orthogroups_3.csv' or die $!;
    while (<$fh>) {
        print $. ."\n";
        next if $. < 2;             # skip the header line
        my @a0=split("\t",$_);
        warn Dumper \@a0;           # show the fields of each row
        s/($re)/$dict{$1}/g;
        print;
    }
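One detail the script above does not address: IDs such as PmUG01_00010100.1-p1 contain a ".", which is a regex metacharacter, and one ID can be a prefix of another. A more defensive pattern could be built along these lines. This is a sketch with made-up keys, not a drop-in fix for your data.

```perl
use strict;
use warnings;

# Hypothetical keys; note the dot and the shared prefix.
my %dict = (
    'ID_1.1'  => 'alpha',
    'ID_1.10' => 'beta',
);

# Longest keys first, so 'ID_1.10' is tried before 'ID_1.1';
# quotemeta makes the dot match only a literal dot.
my $re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %dict;

my $line = "ID_1.10, ID_1.1, ID_1x1\n";
$line =~ s/($re)/$dict{$1}/g;
print $line;    # prints "beta, alpha, ID_1x1"
```

Without quotemeta, the unescaped dot would also match the "x" in ID_1x1; without the length sort, the shorter key could match first and leave a stray "0" behind.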
All this leads me to think you don't have much of a clue as to what you are doing and are just trying cookie-cutter examples found on the web. This is a bad thing to do.
Edit: added code tags around the huge fields, but I'm not sure it's any better.
Re^4: Parsing csv without changing dimension of original file
by zillur (Novice) on Mar 07, 2017 at 03:54 UTC