Select only desired features from a text

remluvr has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone.
Here I am with a new problem I can't solve.
I have two input files. One contains a list of semantic relations structured like the following (let's call it INPUT1):

alligator-n        amphibian_reptile    attri    long-j
alligator-n        amphibian_reptile    attri    old-j
alligator-n        amphibian_reptile    coord    crocodile-n
alligator-n        amphibian_reptile    coord    frog-n
alligator-n        amphibian_reptile    event    walk-v
alligator-n        amphibian_reptile    hyper    animal-n

And another one that is like the following (obviously the following is just a very reduced version):

frog-n    about    adage-n    8.8016
frog-n    appearance-1    broad-j    11.9640
frog-n    coord    albino-n    6.7667
frog-n    be    jumper-n    6.0272
frog-n    be    key-n    3.8779
frog-n    of    body-n    8.3063
frog-n    of    bone-n    20.7982
frog-n    of    book-n    0.4229
crocodile-n    be    key-n    3.2572
crocodile-n    of    chorus-n    24.9515
crocodile-n    of    book-n    2.3460
crocodile-n    obj    sit-v    3.1857
crocodile-n    obj    size-v    57.3257
crocodile-n    obj    skewer-v    6.1105
animal-n    coord-1    investigation-n    0.9666
animal-n    coord-1    irrigation-n    2.6058
animal-n    coord-1    isolation-n    1.4074
animal-n    coord-1    isotope-n    2.7420

I need to check input1 for rows whose relation (the third field) is "coord", then search input2 for rows whose first field matches the fourth field of those rows. In this case that gives crocodile-n and frog-n. I have to build another file that looks like input2 but contains every row whose first field is crocodile-n or frog-n. If an element has already been found, I must not repeat it, but instead add its score to the one I already found.
I understand this explanation is not really clear, so here is an example of the desired output:

not_alligator-n    about    adage-n    8.8016
not_alligator-n    appearance-1    broad-j    11.9640
not_alligator-n    coord    albino-n    6.7667
not_alligator-n    be    jumper-n    6.0272
not_alligator-n    be    key-n    7.1351(3.8779+3.2572)
not_alligator-n    of    body-n    8.3063
not_alligator-n    of    chorus-n    24.9515
not_alligator-n    of    bone-n    20.7982
not_alligator-n    of    book-n    2.7689(0.4229+2.3460)
not_alligator-n    obj    sit-v    3.1857
not_alligator-n    obj    size-v    57.3257
not_alligator-n    obj    skewer-v    6.1105

I have no idea where to start. It is less than a month since I started using Perl again, and I still have a lot to learn.
Every suggestion, tip, or indication on what to do would be really appreciated.
I need it because I'm analyzing some statistical measures to be used on semantic relations for my PhD thesis.
Thanks to all
Giulia


Replies are listed 'Best First'.
Re: Select only desired features from a text by JavaFan (Canon) on Mar 19, 2012 at 15:44 UTC
I'm getting the impression that the same question, with similar data, is asked every few days here. The only thing that seems to be changing is the name of the animal. Given the size of the file, and the fact that you seem to need to do this over and over again, I'd say take a 2-day basic SQL course, load your data into a database, and run some SQL queries. Considering how you're struggling with Perl, the 2-day investment should pay for itself in about 2.1 days!
Re: Select only desired features from a text by moritz (Cardinal) on Mar 19, 2012 at 13:10 UTC
The general procedure is to first read the file that contains the interesting mapping into a hash, and then traverse the second file and transform its lines based on that hash. Something like this:

use 5.010;
use strict;
use warnings;
use autodie;

# Map the fourth field of each 'coord' row to the row's first field.
my %map;
open my $IN, '<', 'f1';
while (<$IN>) {
    my ($first, undef, $type, $fourth) = split;
    $map{$fourth} = $first if $type eq 'coord';
}
close $IN;

# Print every row of the second file whose first field is in the map.
open $IN, '<', 'f2';
while (<$IN>) {
    my ($first, $rest) = split /\s/, $_, 2;
    if ($map{$first}) {
        print "not_$map{$first} $rest"
    }
}
close $IN;

Note that the variable names are quite terrible, because I don't know what the values stand for.

Perl 6 - second systems done right
Re^2: Select only desired features from a text by remluvr (Sexton) on Mar 19, 2012 at 15:15 UTC
Thanks, this was really useful, but my problem is I don't want to have duplicates. Given this output:

not_alligator-n about adage-n 8.8016
not_alligator-n appearance-1 broad-j 11.9640
not_alligator-n coord albino-n 6.7667
not_alligator-n be jumper-n 6.0272
not_alligator-n be key-n 3.8779
not_alligator-n of body-n 8.3063
not_alligator-n of bone-n 20.7982
not_alligator-n of book-n 0.4229
not_alligator-n be key-n 3.2572
not_alligator-n of chorus-n 24.9515
not_alligator-n of book-n 2.3460
not_alligator-n obj sit-v 3.1857
not_alligator-n obj size-v 57.3257
not_alligator-n obj skewer-v 6.1105

I'd like "not_alligator-n be key-n 3.8779" and "not_alligator-n be key-n 3.2572" to appear just once, but with their scores summed up. How can I achieve that?
Thanks
Giulia
Re^3: Select only desired features from a text by moritz (Cardinal) on Mar 19, 2012 at 18:03 UTC
Use a second hash to store those (partial) lines that you've already seen, and only print out those lines that aren't in the hash yet. Perl 6 - second systems done right
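A minimal sketch of that second-hash idea, extended to add up the scores instead of only skipping repeats, since summing is what the question asks for. The names %sum and @order and the tab-separated output are assumptions for illustration; the sample rows are taken from the output quoted above.

```perl
use strict;
use warnings;

# A few rows as produced by the first script (sample from the thread).
my @rows = (
    "not_alligator-n be key-n 3.8779",
    "not_alligator-n of book-n 0.4229",
    "not_alligator-n be key-n 3.2572",
    "not_alligator-n of book-n 2.3460",
    "not_alligator-n obj sit-v 3.1857",
);

# %sum holds the running score per partial line (the row without its score);
# @order remembers the order in which each partial line was first seen.
my (%sum, @order);
for my $row (@rows) {
    my ($head, $rel, $word, $score) = split ' ', $row;
    my $key = join "\t", $head, $rel, $word;
    push @order, $key unless exists $sum{$key};
    $sum{$key} += $score;
}

# Emit each partial line once, followed by its summed score.
my @out = map { sprintf "%s\t%.4f", $_, $sum{$_} } @order;
print "$_\n" for @out;
```

Because nothing can be printed until every row has been read, the hash has to hold one entry per distinct partial line; for a very large input, that count is what determines the memory use rather than the raw file size.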
Re^3: Select only desired features from a text by bitingduck (Deacon) on Mar 19, 2012 at 15:29 UTC
You might want to consider loading the whole thing into a database if it's that large and you need to do a lot of key lookup (e.g. to avoid dupes) as you process the data, particularly if you need to sort on it in different ways or pull out subsets based on certain conditions.
Re: Select only desired features from a text by aaron_baugher (Curate) on Mar 19, 2012 at 17:01 UTC
The basic answer is to make a hash from the relationships in input1, and use that to parse and process the information you need from input2. If I understand your problem, in this case I would probably create a hash of arrays, keyed on the values from column 4, so I'd have something like this:

my %hoa = (
    'frog-n'      => ['alligator-n'],
    'crocodile-n' => ['alligator-n'],
);

(I'd use a hash of arrays instead of a simple hash because I assume other values from column 1 could have a relationship with 'frog-n'. If that's not true, then this could be a simple hash.) Even if input1 is 4GB, since you're only interested in parts of certain lines, your hash may be much smaller.

Then I'd start going through input2, building a new multilevel hash based on the array elements from %hoa, with sub-keys from the new file, so I would be assigning values like this:

# from the first line: frog-n about adage-n 8.8016
for my $key (@{ $hoa{'frog-n'} }) {
    # keys containing a hyphen must be quoted
    $newhash{$key}{about}{'adage-n'} += 8.8016;
}

That will sum up repeated patterns as it goes, and it won't matter whether they are consecutive. When it's done, go through that second hash and print it out in whatever format you like. There are still details to work out (for example, if you really want the summed elements displayed next to the sum like that, you may want to store them as an array and sum them in the last step), but that's the basic structure.

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.
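The structure described in this reply could be put together roughly as follows. This is only a sketch under stated assumptions: the sub names read_coord_map and sum_scores are hypothetical, the fields are assumed to be whitespace-separated, and small in-memory samples from the thread stand in for the real files.

```perl
use strict;
use warnings;

# Build the hash of arrays: column 4 => [column 1, ...] for 'coord' rows.
sub read_coord_map {
    my ($fh) = @_;
    my %hoa;
    while (my $line = <$fh>) {
        my ($target, undef, $type, $related) = split ' ', $line;
        next unless defined $related && $type eq 'coord';
        push @{ $hoa{$related} }, $target;
    }
    return \%hoa;
}

# Stream input2 once, summing scores per target / relation / word.
sub sum_scores {
    my ($fh, $hoa) = @_;
    my %total;
    while (my $line = <$fh>) {
        my ($first, $rel, $word, $score) = split ' ', $line;
        next unless defined $score && $hoa->{$first};
        $total{$_}{$rel}{$word} += $score for @{ $hoa->{$first} };
    }
    return \%total;
}

# In-memory samples (assumption: the real data would come from files).
my $input1 = <<'END';
alligator-n amphibian_reptile attri long-j
alligator-n amphibian_reptile coord crocodile-n
alligator-n amphibian_reptile coord frog-n
alligator-n amphibian_reptile hyper animal-n
END
my $input2 = <<'END';
frog-n be key-n 3.8779
frog-n of book-n 0.4229
crocodile-n be key-n 3.2572
crocodile-n of book-n 2.3460
animal-n coord-1 isotope-n 2.7420
END

open my $in1, '<', \$input1 or die $!;
my $hoa = read_coord_map($in1);
open my $in2, '<', \$input2 or die $!;
my $total = sum_scores($in2, $hoa);

# Walk the nested hash and print one line per target / relation / word.
for my $target (sort keys %$total) {
    for my $rel (sort keys %{ $total->{$target} }) {
        for my $word (sort keys %{ $total->{$target}{$rel} }) {
            printf "not_%s\t%s\t%s\t%.4f\n",
                $target, $rel, $word, $total->{$target}{$rel}{$word};
        }
    }
}
```

Both files are read one line at a time, so only the 'coord' mapping and the summed totals are held in memory. The output here is sorted by key, which differs from the order shown in the question; keeping first-seen order would need an extra array.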
Re^2: Select only desired features from a text by Anonymous Monk on Mar 19, 2012 at 22:19 UTC
Aaron, thanks a lot. I tried writing my code based on your suggestions and I succeeded. Thanks!!!!
Re: Select only desired features from a text by RichardK (Parson) on Mar 19, 2012 at 12:24 UTC
Well, that depends on how many lines there are in the second file. The easiest way is to store the matched records in a hash. You might find it useful to look at the Perl data structures cookbook, perldsc. BTW, there is lots of documentation shipped with your copy of perl - try 'man perl' or 'perldoc perl' ;)
Re^2: Select only desired features from a text by remluvr (Sexton) on Mar 19, 2012 at 14:12 UTC
Problem is, it is a 4 GB file.