Re: using hashes
by BrowserUk (Patriarch) on Sep 26, 2013 at 14:56 UTC
and iterate through all the hash keys
Don't ever iterate hash keys! (Well, hardly ever :)
The major purpose of hashes is that you can lookup the value associated with any key directly, avoiding iteration.
For your purpose, the major part of the code should be something like:
while( <$names_to_be_replaced_file> ) {   ## read each line
    s[\b([a-z]+)\b][ $name_id{ $1 } ]ge;  ## find words, look them up and replace them with the id
    print;                                ## Send the modified lines to stdout
}
Simple and very efficient.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Thanks for the help Browser, but apparently I'm way behind in Perl knowledge yet, as I don't really get the code... That's how I know I'm overcomplicating something that is really simple :|
Is the $1 var pointing to the value of the hash?
"Is the $1 var pointing to the value of the hash?"
$1 captures the words in the string one at a time. $hash{ $1 } then looks that word up in the hash and returns the associated value (the id). The ge modifiers cause the ids to be substituted for every word in the line.
Perhaps this will clarify things?
%hash = ( brown=>1, fox=>2, quick=>3, the=>4 );;
$line = 'the quick brown fox';;
$line =~ s[\b([a-z]+)\b][ $hash{ $1 } ]ge;;
print $line;;
4 3 1 2
$hash{ $1 } // $1
which means that if $1 is not found in your hash, the word is replaced with itself, i.e. left unchanged.
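To make that concrete, here is a minimal sketch of the fallback in action. The sample data and the variable $replaced are made up for illustration; %name_id just follows the naming used earlier in the thread. Note that // (defined-or) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original poster's files.
my %name_id = ( bananas => 456, peaches => 897236, kiwis => 3726 );

my $line = 'peaches,peaches,foobar,kiwis';

# Copy the line, then substitute on the copy.
# For each word: use its id if the hash has one, otherwise keep the word itself.
( my $replaced = $line ) =~ s[\b([a-z]+)\b][ $name_id{ $1 } // $1 ]ge;

print "$replaced\n";    # 897236,897236,foobar,3726
```

Without the // $1 part, 'foobar' would be replaced by an empty string, and you would get an "uninitialized value" warning under use warnings.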
s[\b([a-z]+)\b][ $name_id{ $1 } ]ge;
The 's' at the beginning says to find a pattern and replace it. The 'g' at the end says to repeat this process as many times as possible. The 'e' at the end says that the replacement part should be evaluated as code, not treated as literal text.
In the first part, the pattern, the \b matches a "word boundary," the boundary between word characters and non-word characters like your commas. [a-z]+ means a string of 1 or more consecutive lowercase letters. The parentheses around that capture whatever is matched within them and save it in the special variable $1.
In the replacement part, $1 contains the matched word, so this becomes a simple lookup for that word as a key in the %name_id hash, replacing it with the value corresponding to that key. As mentioned before, because of the 'g', this entire process is repeated for each match found in the line.
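If it helps to see the pieces working on comma-separated data like yours, here is a small sketch. The hash contents and the $count variable are assumptions for the example; it relies on s///g returning the number of substitutions it made.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my %name_id = ( bananas => 456, peaches => 897236, kiwis => 3726 );
my $line    = 'bananas,peaches,kiwis';

# \b matches at the edges of each word (start of line, either side of a comma, end of line).
# ([a-z]+) captures one word into $1; /e evaluates the hash lookup; /g repeats for every word.
my $count = ( $line =~ s[\b([a-z]+)\b][ $name_id{ $1 } ]ge );

print "$count replacements: $line\n";    # 3 replacements: 456,897236,3726
```

Every word in this line has an id. A word without one would be replaced by nothing, so real data probably needs a fallback.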
Aaron B.
Available for small or large Perl jobs; see my home node.
Re: using hashes
by kennethk (Abbot) on Sep 26, 2013 at 15:06 UTC
First, what mtmcc said. Second, a quote from the illustrious prophet: "Doing linear scans over an associative array is like trying to club someone to death with a loaded Uzi." -- TimToady
You should put your keys into a hash, yes, but then just iterate over your array. The array values are exactly what you need to access the hash values. So it might look like:
my %id = (bananas => 456,
          oranges => 23,
          peaches => 897236,
          kiwis   => 3726,
         );
my @replaces = ('kiwis','oranges','bananas','bananas');
for my $i (0 .. $#replaces) {
    $replaces[$i] = $id{$replaces[$i]};
}
If I were going to actually write this, I'd take advantage of the fact that the loop iterator for Foreach Loops is an lvalue for the array element ($_ = $id{$_} for @replaces;), but that might be a little too magical for your taste given your familiarity with the language.
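For the curious, this is roughly what the one-liner does when run against the same data as above. It is only a sketch of the aliasing behaviour, not a recommendation either way.

```perl
use strict;
use warnings;

my %id = ( bananas => 456, oranges => 23, peaches => 897236, kiwis => 3726 );
my @replaces = ( 'kiwis', 'oranges', 'bananas', 'bananas' );

# Inside the loop, $_ is an alias for the current array element,
# so assigning to $_ overwrites that element of @replaces in place.
$_ = $id{$_} for @replaces;

print "@replaces\n";    # 3726 23 456 456
```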
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Thanks Kenneth, I understand your code, but I may have more than one name in the same line, such as:
bananas,peaches,kiwis
peaches,peaches
pineapple
(...)
So we couldn't use that kind of cycling on the array positions, right? Or am I missing something?
... am I missing something?
You're missing what BrowserUk said here, an approach that processes an entire line at a time.
The next part to think about is what happens if you encounter a 'word' in a line that doesn't exist in your translation hash, e.g., the line
"peaches,peaches,foobar,kiwis\n"
(hints: exists, next, maybe // (defined-or) or ?: (ternary/conditional operator) – see perlop for the latter two).
Re: iterating hash keys?
by kcott (Archbishop) on Sep 26, 2013 at 18:59 UTC
G'day R56,
Welcome to the monastery.
Firstly, a word about your data.
The term list has a special meaning in Perl: see "perldata: List value constructors". I've taken what you've described as lists to be records in files.
Given you wrote "... the 'names to be replaced' file ...", that seems correct for the second list; although, until I had read that far, I initially thought you might have been talking about a list of lists (which is something different — see perllol).
Anyway, this means you (probably) have a CSV (comma-separated values) file which is best read using a module like Text::CSV.
The reason for this is that there are all sorts of gotchas with CSV files which have already been coded for in these modules.
As an example, consider two records: "apples, red,cherries" and "apples, red cherries". If you had an ID for "apples, red", how would you handle the replacement in those two records?
So, I'd suggest you check whether your data really is as simple as the examples you've posted; and consider the chances of it staying that way in the future.
You may need to revisit whatever solution you choose based on those findings.
The solution I provide below assumes nothing more complex than what you currently show.
Here's my take on a solution.
I create a hash mapping names to IDs (same as you).
Next, I use the keys of that hash to create a regex with an alternation (e.g. bananas|oranges|...) such that only the names with IDs will be matched.
Finally, the replacements are made and the new data is output.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $in_file_name_id = 'pm_1055846_name_id_data.txt';
my $in_file_name_replace = 'pm_1055846_name_replace_data.txt';
my $out_file_name_replaced = 'pm_1055846_name_replaced_out.txt';
open my $in_id_fh, '<', $in_file_name_id;
my %id_for = map { split } <$in_id_fh>;
close $in_id_fh;
my $re = '\b(' . join('|', keys %id_for) . ')\b';
open my $in_replace_fh, '<', $in_file_name_replace;
open my $out_replaced_fh, '>', $out_file_name_replaced;
while (<$in_replace_fh>) {
    s/$re/$id_for{$1}/g;
    print $out_replaced_fh $_;
}
Here are the files. Notice I added "pineapples", which didn't have an ID, and so wasn't replaced.
$ cat pm_1055846_name_id_data.txt
bananas 456
oranges 23
peaches 897236
kiwis 3726
$ cat pm_1055846_name_replace_data.txt
bananas,oranges
peaches,peaches,peaches
kiwis
oranges
kiwis,oranges,bananas,bananas
bananas,oranges,pineapples,peaches,kiwis
$ cat pm_1055846_name_replaced_out.txt
456,23
897236,897236,897236
3726
23
3726,23,456,456
456,23,pineapples,897236,3726
Hey Ken, good to be here :)
Thank you for the patience to write all that.
I don't know yet if the data will be this simple at all times, but it's always better to cover all the options if it doesn't sacrifice speed.
Will definitely try out your code to see if I can improve this!
Well, comparing to what I had, your code is faster than the speed of light!
Is there a simple way for the s// to also include names with hyphens in the middle?
"Well, comparing to what I had, your code is faster than the speed of light!"
That's a good start. :-)
"Is there a simple way for the s// to also include names with hyphens in the middle?"
The short answer is: yes.
The longer answer depends on details.
I found a reference you made to input data with hyphens in "Re^8: using hashes"; however, you provided no indication of the output you wanted (except that 20-10,25 was the wrong output when bana-na,banana was the input).
The following is based on the code I provided earlier.
Given these input files:
$ cat pm_1055846_name_id_data.txt
bananas 456
oranges 23
peaches 897236
kiwis 3726
banana 25
bana 20
bana-na 15
na 10
$ cat pm_1055846_name_replace_data.txt
bananas,oranges
peaches,peaches,peaches
kiwis
oranges
kiwis,oranges,bananas,bananas
bananas,oranges,pineapples,peaches,kiwis
bana-na,banana
ba-na-na,bana-bana,bana-nana
If you want output like this:
$ cat pm_1055846_name_replaced_out.txt
456,23
897236,897236,897236
3726
23
3726,23,456,456
456,23,pineapples,897236,3726
15,25
ba-10-10,20-20,20-nana
Change
my $re = '\b(' . join('|', keys %id_for) . ')\b';
to
my $re = '\b(' . join('|', sort { $b cmp $a } keys %id_for) . ')\b';
If you want output like this:
$ cat pm_1055846_name_replaced_out.txt
456,23
897236,897236,897236
3726
23
3726,23,456,456
456,23,pineapples,897236,3726
15,25
ba-na-na,bana-bana,bana-nana
Change
my $re = '\b(' . join('|', keys %id_for) . ')\b';
to
my $re = '(^|,)(' . join('|', sort { $b cmp $a } keys %id_for) . ')(?=,|$)';
and
s/$re/$id_for{$1}/g;
to
s/$re/$1$id_for{$2}/g;
If you want something different to these, and are unable to work it out for yourself, provide details as outlined in the "How do I post a question effectively?" guidelines.
It would also be useful to advise what version of Perl you're using: I wrote those changes for v5.8; a more efficient version could have been written for a later version.
As a hint for doing this yourself, see (?<=pattern) \K under Look-Around Assertions in "perlre: Extended Patterns" — \K was introduced in v5.10.0 (see "perl5100delta: Regular expressions" for this, and other, regex enhancements).
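One possible reading of that hint, as a sketch only: with \K (Perl 5.10+) the leading "start of record or comma" no longer has to be captured and put back, so the replacement goes back to using $1. The sample data, the variable names $alt, @lines and @out, and the use of quotemeta are my additions for illustration and are not part of the code above.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# A subset of the name/ID data shown above.
my %id_for = (
    bananas => 456, oranges   => 23, banana => 25,
    bana    => 20,  'bana-na' => 15, na     => 10,
);

# Same reverse sort as before, so longer names are tried before their prefixes.
# quotemeta (an addition) protects any regex metacharacters in the names.
my $alt = join '|', map { quotemeta } sort { $b cmp $a } keys %id_for;

# (?:^|,) must match, but \K drops it from the text that gets replaced.
my $re = qr/(?:^|,)\K($alt)(?=,|$)/;

my @lines = ( 'bana-na,banana', 'ba-na-na,bana-bana,bana-nana', 'bananas,oranges,pineapples' );
my @out;
for my $line (@lines) {
    ( my $copy = $line ) =~ s/$re/$id_for{$1}/g;
    push @out, $copy;
}

print "$_\n" for @out;
# 15,25
# ba-na-na,bana-bana,bana-nana
# 456,23,pineapples
```

Only whole comma-separated fields are replaced here, which matches the second output shown above.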
Re: using hashes
by mtmcc (Hermit) on Sep 26, 2013 at 14:51 UTC
|
| [reply] |
|
|
for my $line (@lines) {
while(my ($find, $replace) = each %ids) {
s/$find/$replace/g
}
}
This should work, and is clear to read. While it is not optimally efficient, efficiency shouldn't be your concern at this stage. If this isn't working, you need to post more information about your actual script. Posting real input, expected output, and actual code (all wrapped in <code> tags) will greatly facilitate the debugging, as discussed in How do I post a question effectively?
For an effective solution to your problem, see BrowserUK's comment below. As to why the code you've shown doesn't work, it's probably because you're storing each line of your file/array in $line, but doing your substitution against $_. Try this: $line =~ s/$find/$replace/g.
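A minimal sketch of the loop with that fix applied, using made-up sample data. The \b anchors and \Q...\E quoting are my additions (to avoid matching inside longer words and to treat the names as literal text); they are not in the code you posted.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my %ids   = ( bananas => 456, oranges => 23, kiwis => 3726 );
my @lines = ( 'bananas,oranges', 'kiwis', 'kiwis,oranges,bananas,bananas' );

for my $line (@lines) {
    while ( my ( $find, $replace ) = each %ids ) {
        # Bind the substitution to $line; a bare s/// would operate on $_ instead.
        $line =~ s/\b\Q$find\E\b/$replace/g;
    }
}

print "$_\n" for @lines;
# 456,23
# 3726
# 3726,23,456,456
```

This still walks the whole hash for every line, so it only shows the bug fix; the lookup-based approach scales much better.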