Alfumao has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm trying to solve a very annoying problem I've been having lately.

The thing is that I have a file containing a tab-separated "table" with thousands of rows and dozens of columns, and another tab-separated text file that maps each name in the table to its, let's say, "translation", e.g.:

Table_name1 New_name1

Table_name2 New_name2

...

Table_name3 New_name3

The table has the following format:

Table_name1 Table_name43 Table_name17 Table_name1245 ...

Table_name2 Table_name4 Table_name37 Table_name125 ...

Table_name3 Table_name51 Table_name69 Table_name342 ...

...

Where any name can appear at any position (and all the names in the table are present in the text file to be "translated").

I tried the following script, which calls a one-liner to edit the table file in place, replacing each entry from the text file wherever it is found.

My problem is that the replacements are not done correctly and I end up with a table full of replicated values that don't correspond to the original. I think this may be some issue with the for loop that contains the call to the one-liner, but I can't seem to be able to fix it.

Maybe the code is easier to understand than the intended explanation of the issue...so here it is:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

#usage example: perl GetbackIDs.pl -p /path_to_files -e [table file extension]
#requires a table file and a "IDs database" in ".txt" format that share their name

my ($path, $ext);
GetOptions(
    'path=s' => \$path,
    'extension=s' => \$ext,
);
print "$path\n";
chdir $path or die "ERROR: Unable to enter $path: $!\n";
opendir (TEMP , ".");
my @files = readdir (TEMP);
closedir TEMP;
print "@files\n";
my $name;
my @db;
for my $file (@files) {
    if($file=~/(\w+).$ext/){
        $name = "$1";
        print"This is the Filename: $file\n";
        open (INFILE, "$file") || die ("cannot open input file");
        chomp(my @data = <INFILE>);
        my$file2= "$name.bd";
        print"This is the DBname:$file2\n";
        open (DB, "$file2") || die ("cannot open input file");
        chomp(@db = <DB>);
    }
    #Edition "on the fly" via One-Liner
    for(@db){
        my ($dbid,$firstid) = split(/\t/, $_);
        chomp $firstid;
        print"This is my $dbid and its $firstid\n";
        ##ONELINER
        #if id matches, replace id
        my$susti=`perl -pi -e 's/$dbid/$firstid/g' $name.$ext`;
    }
}

Examples of data

#Database of table names and new names
Aspergillus_clavatus_1    XP_001276684.1 pectate lyase, putative [Aspergillus clavatus NRRL 1]
Aspergillus_fumigatus_2    XP_001276694.1 conserved hypothetical protein [Aspergillus fumigatus NRRL 1]
Aspergillus_flavus_3    XP_001276726.1 tyrosinase central domain protein [Aspergillus flavus NRRL 1]
Aspergillus_terreus_4    XP_001276738.1 endoglucanase, putative [Aspergillus terreus NRRL 1]

#Lines of the table to be renamed
Aspergillus_clavatus_1    Aspergillus_flavus_198    Aspergillus_terreus_166    Aspergillus_fumigatus_2
Aspergillus_clavatus_1    Aspergillus_flavus_3    Aspergillus_terreus_4    Aspergillus_fumigatus_2
Aspergillus_clavatus_3    Aspergillus_flavus_198    Aspergillus_terreu_166    Aspergillus_fumigatus_16

#Expected result (See that in some cases there's no replacement to be done, if the ID is not present in the names "database" file)
XP_001276684.1 pectate lyase, putative [Aspergillus clavatus NRRL 1]    Aspergillus_flavus_198    Aspergillus_terreus_166    XP_001276694.1 conserved hypothetical protein [Aspergillus fumigatus NRRL 1]
XP_001276684.1 pectate lyase, putative [Aspergillus clavatus NRRL 1]    XP_001276726.1 tyrosinase central domain protein [Aspergillus flavus NRRL 1]    XP_001276738.1 endoglucanase, putative [Aspergillus terreus NRRL 1]    XP_001276694.1 conserved hypothetical protein [Aspergillus fumigatus NRRL 1]
Aspergillus_clavatus_3    Aspergillus_flavus_198    Aspergillus_terreu_166    Aspergillus_fumigatus_16

Thanks in advance for your help

Best

I'd appreciate your counsel on how to rename the title of the post conveniently, because I do not think it is illustrative enough in its current form...

*Update*

As suggested by BrowserUk, I found out that after a few records names may overlap, e.g. Aspergillus_fumigatus_1 overlaps Aspergillus_fumigatus_10 and Aspergillus_fumigatus_17.

So I guess that is the main source of error during translation.
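
A minimal sketch of that overlap, assuming tab-separated fields. The helpers `naive_replace` and `field_replace` and the placeholder value `NEW` are hypothetical names used only for illustration: the first mimics `s/$dbid/$firstid/g`, the second replaces a field only when the whole field equals the old name.

```perl
use strict;
use warnings;

# Hypothetical helper: plain substring substitution, like s/$dbid/$firstid/g.
sub naive_replace {
    my ($line, $old, $new) = @_;
    $line =~ s/$old/$new/g;
    return $line;
}

# Hypothetical helper: replace only when an entire tab-separated field equals $old.
sub field_replace {
    my ($line, $old, $new) = @_;
    # The -1 limit keeps trailing empty fields, so the column count is preserved.
    return join "\t", map { $_ eq $old ? $new : $_ } split /\t/, $line, -1;
}

my $line = join "\t", qw(Aspergillus_fumigatus_1 Aspergillus_fumigatus_10 Aspergillus_fumigatus_17);

# Substring matching also rewrites the prefix of _10 and _17,
# leaving NEW, NEW0 and NEW7 in the three columns.
print naive_replace($line, 'Aspergillus_fumigatus_1', 'NEW'), "\n";

# Whole-field matching changes only the first column.
print field_replace($line, 'Aspergillus_fumigatus_1', 'NEW'), "\n";
```

Because the one-liner rewrites the file once per database entry, every later pass also sees the damage left by earlier passes, which would be consistent with the replicated values described above.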

Re: Replace table values from text database
by BrowserUk (Patriarch) on Mar 14, 2016 at 14:04 UTC
    My problem is that the replacements are not done correctly and I end up with a table full of replicated values that don't correspond to the original.

    Does the set of replacement names overlap with the set of original names?

    (And, on the face of it, processing the whole file completely to perform each substitution is a nuts way to approach the problem. Horribly inefficient, when the whole process can be done in a single pass.)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      (And, on the face of it, processing the whole file completely to perform each substitution is a nuts way to approach the problem. Horribly inefficient, when the whole process can be done in a single pass.)

      I totally agree, but I do not know how to achieve such a thing. Would you please show/link an example I can use to figure out how to do it?

      Thanks in advance

        Sure. If you answer my question?

        And, as requested elsewhere, post some real data: inputs and expected output. It only need be a dozen lines of each file; preferably that connect with each other.


Re: Replace table values from text database
by 1nickt (Canon) on Mar 14, 2016 at 14:02 UTC

    I do not think it is illustrative enough in its current form

    Perhaps, but I would say the main cause of that is the lack of sample data and expected output, not the node title. My own eye was drawn to the unescaped dot in your regexp, and a couple of other small things, but it's impossible to tell whether they are affecting your script, because we can't see any data. So please post a small sample of the input data and the expected output.
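
    To illustrate the unescaped dot: in `/(\w+).$ext/` the dot matches any character and the pattern is not anchored, so it can accept file names that do not really end in the extension. The sketch below compares it with an anchored, quoted version. The helper `base_name`, the extension `tab`, and the file names are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the base name if $file is exactly "<name>.<ext>", else undef.
sub base_name {
    my ($file, $ext) = @_;
    # \. is a literal dot, \Q...\E quotes any metacharacters in $ext,
    # and ^ ... \z anchor the match to the whole file name.
    return $file =~ /^(\w+)\.\Q$ext\E\z/ ? $1 : undef;
}

my $ext = 'tab';    # assumed extension, for illustration only
for my $file ('genes.tab', 'genes_tab', 'genes.tab.bak') {
    my $loose  = $file =~ /(\w+).$ext/       ? 'matches' : 'no match';
    my $strict = defined base_name($file, $ext) ? 'matches' : 'no match';
    print "$file: loose pattern $loose, strict pattern $strict\n";
}
```

    With these three names the loose pattern accepts all of them, while the strict one accepts only `genes.tab`.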

    Also, I would suggest using a hash, loop, and substitution inside your program rather than shelling out to call a Perl one-liner for each line you are comparing; just seems cleaner, not to mention more efficient and safer.
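
    A minimal sketch of that hash-based approach, not a drop-in replacement for the script. It assumes that everything after the first tab on a database line is the new name, and that table fields are compared whole. The helper names `load_map` and `translate_line` are hypothetical, and the data is passed in as strings rather than read from files to keep the example self-contained.

```perl
use strict;
use warnings;

# Build an old-name => new-name hash from "old<TAB>new" lines.
sub load_map {
    my (@lines) = @_;
    my %map;
    for my $line (@lines) {
        chomp $line;
        # Limit of 2: everything after the first tab is taken as the new name (assumption).
        my ($old, $new) = split /\t/, $line, 2;
        $map{$old} = $new if defined $new;
    }
    return \%map;
}

# Translate one table line: fields found in the hash are replaced, all others are kept.
sub translate_line {
    my ($line, $map) = @_;
    chomp $line;
    return join "\t",
        map { exists $map->{$_} ? $map->{$_} : $_ }
        split /\t/, $line, -1;
}

my $map = load_map(
    "Aspergillus_clavatus_1\tXP_001276684.1 pectate lyase, putative [Aspergillus clavatus NRRL 1]\n",
    "Aspergillus_fumigatus_2\tXP_001276694.1 conserved hypothetical protein [Aspergillus fumigatus NRRL 1]\n",
);

# Each table line is visited once; unknown IDs such as Aspergillus_flavus_198 pass through unchanged.
print translate_line("Aspergillus_clavatus_1\tAspergillus_flavus_198\tAspergillus_fumigatus_2\n", $map), "\n";
```

    With real files, the database would be read once into the hash and the table processed line by line in a `while` loop, writing to a new output file. Because each field is looked up by exact key, a name like Aspergillus_fumigatus_1 cannot match inside Aspergillus_fumigatus_10.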


    The way forward always starts with a minimal test.
      How can I upload a sample of the test files to this thread?
Re: Replace table values from text database
by Alfumao (Initiate) on Mar 16, 2016 at 08:36 UTC

    Thank you BrowserUk and 1nickt for your help and answers.

    Best

    A.