marylein has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is a very naive question from someone who forgot how to use Perl. Here is my problem:

I have two tab-delimited text files, each with 2 columns. File 1: line 1: goes | go; line 2: leaves | leave; line 3: sees | see; line 4: eats | eat. And File 2: line 1: went | go; line 2: jumped | jump; line 3: saw | see; line 4: ate | eat.

I need to create File 3, which contains all verbs that appear BOTH in the 2nd column of file 1 and in the 2nd column of file 2, and combines their information. It would look like this, with 3 columns: File 3: line 1: go | goes | went; line 2: see | sees | saw; line 3: eat | eats | ate.

From what I remember there is a very fast way to do it in Perl, but I don't know how. What would be the script which does it?

Thank you!!!!

Replies are listed 'Best First'.
Re: script to merge files
by davido (Cardinal) on Sep 23, 2011 at 16:17 UTC

    Are the files in the same order? It seems not, and that's relevant to the solution. If each line were a direct match for the same line number in the other file, a simple O(n) pass over both files in parallel would do. If the files don't have a "line number for line number" relationship, the solution needs more bookkeeping. A hash can be used; lookups stay roughly constant-time on average, but the word sets then have to be held in memory.

    By the way, this problem does seem like something that may have been approached in some form previously. In fact there is a module, Lingua::EN::Conjugate, but it addresses the general need for conjugation, not the specific task of matching up verb tenses and persons.

    Here's one solution that could work if the word-sets can fit into memory all at once. With the magic of references and the uniqueness as well as associativity of hash keys, the solution just sort of falls into place with a single hash.

        use strict;
        use warnings;
        use v5.14;
        use autodie;

        my %word_sets;

        # Read every file named on the command line and group the
        # conjugations under their infinitive.
        while( <> ) {
            chomp;
            my ( $conjugation, $infinitive ) = split /\s*\|\s*/;
            push @{ $word_sets{ $infinitive } }, $conjugation;
        }

        open my $outfh, '>', 'combined_conjugations.txt';

        while( my( $infinitive, $conjugations ) = each %word_sets ) {
            next unless @{$conjugations} > 1; # Skip items that didn't appear twice.
            say $outfh join( ' | ', $infinitive, @{ $conjugations } );
        }

        close $outfh;

    By your problem description it looks like we can assume that in each file the alternate conjugation comes before the infinitive, whereas your solution file would list the infinitive first on each line, followed by the alternate conjugations. I took care to preserve what seemed to be your intent in this respect. What I didn't preserve, however, is any notion of line ordering. You probably do want some form of sorting, but it wasn't apparent in your question. If you do need the list to be sorted, or to preserve some original order, you'll have to modify the solution provided.
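    If sorted output is wanted, one possible modification is to walk the hash keys in sorted order instead of using each. The following is only a sketch under that assumption: merge_sorted is a hypothetical helper name, and the sample lines are the ones from the question, passed in directly rather than read from files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the reply above): groups conjugations
# by infinitive and returns the merged lines sorted by infinitive.
sub merge_sorted {
    my @lines = @_;
    my %word_sets;
    for my $line (@lines) {
        chomp( my $copy = $line );
        my ( $conjugation, $infinitive ) = split /\s*\|\s*/, $copy;
        next unless defined $infinitive;    # skip blank or malformed lines
        push @{ $word_sets{$infinitive} }, $conjugation;
    }
    # Keep only infinitives seen more than once, in alphabetical order.
    return map  { join ' | ', $_, @{ $word_sets{$_} } }
           grep { @{ $word_sets{$_} } > 1 }
           sort keys %word_sets;
}

print "$_\n" for merge_sorted(
    'goes | go', 'leaves | leave', 'sees | see', 'eats | eat',
    'went | go', 'jumped | jump',  'saw | see',  'ate | eat',
);
```

    With the sample data this lists eat, go and see, in that order. Preserving the original file order instead would need an extra array recording when each infinitive was first seen.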


    Dave

Re: script to merge files
by aaron_baugher (Curate) on Sep 23, 2011 at 16:55 UTC

    Since you only want the ones that are contained in both files, you only need one hash, containing a hash or an array (your choice, really) for each infinitive found. Then go through the hash and output all those that have both parts filled.

        #!/usr/bin/perl

        my %k; # hash to hold infinitives and their other parts

        process( 'thirdperson', 'file1');
        process( 'past',        'file2');

        for (keys %k){
            if( $k{$_}{thirdperson} and $k{$_}{past} ){
                print "$_\t$k{$_}{thirdperson}\t$k{$_}{past}\n";
            }
        }

        sub process {
            my $part     = shift;
            my $filename = shift;
            open my $fn, '<', $filename or die $!;
            while(<$fn>){
                chomp;
                my($other, $infinitive) = split /\t/;
                $k{$infinitive}{$part} = $other;
            }
            close $fn;
        }

    I used a hash of hashes, because I thought that would make what I was doing clearer. Each hash within %k is only printed if it contains values for both keys 'thirdperson' and 'past'. But you could also use a two-element array, as long as you keep track of which form of the verb goes in which numbered element.
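    For comparison, the two-element array variant could look roughly like this. It is a sketch only: the slot constants THIRD and PAST and the helper name store are made up for illustration, and the sample lines are fed in directly rather than read from file1 and file2.

```perl
use strict;
use warnings;

# Assumed slot layout (an illustration, not from the reply above):
# index 0 holds the third-person form, index 1 the past form.
use constant { THIRD => 0, PAST => 1 };

my %k;    # infinitive => [ third-person form, past form ]

# Hypothetical helper: store one "other<TAB>infinitive" line in the given slot.
sub store {
    my ( $slot, $line ) = @_;
    chomp $line;
    my ( $other, $infinitive ) = split /\t/, $line;
    $k{$infinitive}[$slot] = $other;
}

store( THIRD, $_ ) for "goes\tgo", "leaves\tleave", "sees\tsee", "eats\teat";
store( PAST,  $_ ) for "went\tgo", "jumped\tjump", "saw\tsee", "ate\teat";

for my $inf ( sort keys %k ) {
    my ( $third, $past ) = @{ $k{$inf} };
    next unless defined $third and defined $past;    # need both slots filled
    print join( "\t", $inf, $third, $past ), "\n";
}
```

    Naming the slots with constants is one way to keep track of which form goes in which element; bare 0 and 1 would work just as well.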

    Edited to add: The unix command 'join' also does this, printing lines from multiple files that share a common field. Both files have to be sorted first by the field being compared, though.

        sort -k2 -o file1 file1
        sort -k2 -o file2 file2
        join -j 2 -t ' ' file1 file2   # inserted a tab between the quotes with Ctrl-V

Re: script to merge files
by pvaldes (Chaplain) on Sep 23, 2011 at 16:11 UTC

    1. Open file 1 and load its contents into a hash.

    2. Open file 2 and load its contents into another hash.

    I need to create File 3, which contains all verbs which are contained BOTH in the 2nd column of file 1 and in the 2nd column of file 2

    So you want to compare two hashes; see more examples in Super Search.
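    A minimal sketch of that two-hash approach, assuming tab-delimited input with the infinitive in the second column. The helper names load_forms and common_lines are hypothetical, and the two file names are taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: read "form<TAB>infinitive" lines into a hash
# keyed by infinitive, and return a reference to it.
sub load_forms {
    my ($filename) = @_;
    my %forms;
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $form, $infinitive ) = split /\t/, $line;
        $forms{$infinitive} = $form if defined $infinitive;
    }
    close $fh;
    return \%forms;
}

# Hypothetical helper: keep only the infinitives present in both hashes
# and build one tab-separated output line for each.
sub common_lines {
    my ( $first, $second ) = @_;
    return map  { join "\t", $_, $first->{$_}, $second->{$_} }
           grep { exists $second->{$_} }
           sort keys %$first;
}

# Usage (file names are placeholders): perl merge.pl file1 file2 > file3
if ( @ARGV == 2 ) {
    print "$_\n" for common_lines( load_forms( $ARGV[0] ), load_forms( $ARGV[1] ) );
}
```

    The comparison itself is just the grep with exists: each key of the first hash is looked up in the second, so both files are read once.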

Re: script to merge files
by graff (Chancellor) on Sep 24, 2011 at 04:58 UTC
    I wrote a script a while back that does something similar to what you describe: cmpcol.

    If your two input files have just the vertical-bar character as the column delimiter (without spaces), then this command line:

        cmpcol -i -lb '|' -d '|' file1:2 file2:2

    would produce output like this:

        goes|go|went|go
        sees|see|saw|see
        eats|eat|ate|eat

    (that is, if there are in fact just those three lines that involve matching verbs). If the data files have spaces around the vertical bars, just include those in the command-line option values.

    Then all you have to do is re-arrange the columns to suit your taste:

        cmpcol -i -lb '|' -d '|' file1:2 file2:2 | perl -F'\|' -lape '$_=join "|",@F[1,0,2]'

    (The quote-mark usage above is based on using a Bourne/bash shell.)