Re: reading/writing to a file
by TedPride (Priest) on Jun 18, 2005 at 18:44 UTC
You're also reading every line of the dictionary file for every word, which is horribly inefficient. Given that most of the time spent is going to be in file I/O, it would be much better to read a large chunk of the dictionary file at once (if not all of it), and cycle through all the word / new word combinations at once, moving new words from the original array to a new array as a match is found. You'll only have to read the dictionary file once.
And there's no need to read and write at the same time. You can read first and then open it for append after and add all the new words in one print.
Depending on the number of words you're checking each run, you might do better just loading the entire dictionary file into memory as a hash and checking the new words that way. Perhaps you can have the script choose between the two methods depending on how many words there are to check?
Or you could even keep the dictionary file in alphabetic order, which will significantly cut down on the number of matches you have to do if you don't use a hash. New words would go into a second dictionary file, which would be unsorted and checked only if the first file didn't match everything, and you could run a process every so often to merge the new words into the main dictionary file in alphabetic order (which can't be done every run, since it would require rewriting most of the file).
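A minimal sketch of the read-once, hash-lookup approach described above; the file names and sample words are illustrative assumptions, not taken from the original post:

```perl
#!/usr/bin/perl
# Hypothetical sketch: read the dictionary exactly once into a hash,
# check new words against it, then append the misses in one print.
# File names and demo words are assumptions for illustration.
use strict;
use warnings;

# Set up tiny demo files so the example is self-contained.
open my $fh, '>', 'dictionary.txt' or die $!;
print $fh "apple\nbanana\n";
close $fh;
open $fh, '>', 'newwords.txt' or die $!;
print $fh "banana\ncherry\n";
close $fh;

# One pass over the dictionary, ever; hash gives O(1) lookups.
my %dict;
open my $dh, '<', 'dictionary.txt' or die "Can't read dictionary: $!";
while (my $word = <$dh>) {
    chomp $word;
    $dict{$word} = 1;
}
close $dh;

# Collect only the words not already in the dictionary.
my @new;
open my $nh, '<', 'newwords.txt' or die "Can't read new words: $!";
while (my $word = <$nh>) {
    chomp $word;
    push @new, $word unless $dict{$word};
}
close $nh;

# Read first, then open for append and add all new words in one print.
open my $out, '>>', 'dictionary.txt' or die "Can't append: $!";
print $out map { "$_\n" } @new;
close $out;
```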
I thought the only way to read in from a file was one line at a time or all at once? And the files are too big to read in all at once.
Re: reading/writing to a file
by davidrw (Prior) on Jun 18, 2005 at 19:34 UTC
Above solutions look good, but just to mention another tool in the toolbox: if this is on *nix, keep in mind the sort and uniq commands. For example, your Perl could just create a raw dictionary file, not worrying about duplicates (and thus eliminating the need for a possibly very large in-memory hash), and then invoke something like:
system("sort raw_outfile | uniq > real_outfile");
unlink "raw_outfile";
Not sure if it's the best use here, but in general sort/uniq on the command line is very useful.
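As a side note, `sort -u` combines the sort and the duplicate removal in a single step, so the pipe to `uniq` isn't strictly needed; a self-contained sketch (the file names are illustrative):

```shell
# Demo input (illustrative): three lines with one duplicate.
printf 'banana\napple\nbanana\n' > raw_outfile

# sort -u sorts and removes duplicate lines in one step.
sort -u raw_outfile > real_outfile
rm raw_outfile
```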
Re: reading/writing to a file
by tlm (Prior) on Jun 18, 2005 at 19:54 UTC
See if this works for you:
Update: BTW, I neglected to mention that some aspects of your original code made no sense to me (though I didn't change them). Specifically, I think that instead of
@count = split(//, $word); ### why split before removing \W ???
$word=~s/\W//g;
next if @count < 4; ### short circuit (keep nesting down); don't
### need scalar
what you want is something more like
$word =~ s/\W+//g;
next if length $word < 4;
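A small self-contained demonstration of the corrected order of operations: strip the non-word characters first, then test the length of what remains (the sample words are illustrative):

```perl
#!/usr/bin/perl
# Demonstrates the corrected filter: remove \W characters first,
# then skip words shorter than 4 characters. Sample data is made up.
use strict;
use warnings;

my @words = ("don't", 'a-b', 'hello!', 'to');
my @kept;
for my $word (@words) {
    (my $clean = $word) =~ s/\W+//g;   # strip punctuation etc.
    next if length $clean < 4;          # now test the cleaned length
    push @kept, $clean;
}
print "@kept\n";   # prints: dont hello
```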
I don't follow you on the autovivification point. No keys in %$exclude are autovivified in the code I posted.
The decision to use two hashes was one of several that I made for the sake of clarity alone, since I thought that in that way it would be easier for the OP to adapt it to his/her needs. (I.e., I agree with the quote at the beginning of your post :) )
Re: reading/writing to a file
by thundergnat (Deacon) on Jun 18, 2005 at 19:15 UTC
How about something like this? Ignores any punctuation except for internal apostrophes (contractions), works with any Unicode words, not just ASCII or Latin-1.
#! /usr/bin/perl
use warnings;
use strict;
if(scalar(@ARGV) != 3){
die "Usage: $0 inputfile.txt excludefile.txt outputfile.txt\n" ;
}
my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
my %excluded;
open my $exclude, '<', $ARGV[1] or die "Error opening exclude file: $!\n";
while (my $line = <$exclude>)
{
while ($line =~ /($word('$word)?)/g)
{
undef $excluded{$1};
}
}
close $exclude;
my %included;
open my $input, '<', $ARGV[0] or die "Error opening input file: $!\n";
while (my $line = <$input>)
{
while ($line =~ /($word('$word)?)/g)
{
undef $included{$1} unless exists $excluded{$1};
}
}
close $input;
open my $output, '>>', $ARGV[2] or die "Error opening output file: $!\n";
print $output join "\n", sort keys %included;
Update: Changed open for output to append instead of truncate, as pointed out by tlm
Re: reading/writing to a file
by Fletch (Bishop) on Jun 18, 2005 at 18:22 UTC
You're trying to read from the OUT filehandle which you've opened for writing, not reading.
--
We're looking for people in ATL
But I opened it using "+>>$ARGV2", so shouldn't that open it for both reading and writing? If it doesn't, can you suggest how I might do this?
You can use "+" with ">" to open a file for reading and writing, and the initial status is that the "file pointer" (the internal element of the file handle structure that says where the next read or write will occur) is set at the start of the file. (This means that if you write before reading, you'll replace existing content, if any.)
You can use ">>" to write to a file in append mode, and the initial status is that the file pointer is set at the end of the file, so if there is any existing content, your write operations will show up after that.
Using "+>>" is nonsensical because file access begins with the file pointer at EOF, where there's nothing to read. You can get acquainted with the "tell" and "seek" functions, to query and control the file pointer, but that gets complicated, and I don't recommend it.
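For completeness, a hypothetical sketch of the "+<" (read-write) mode with an explicit seek, which is usually the saner choice when you genuinely need both operations on one handle; the file name and contents are assumptions for illustration:

```perl
#!/usr/bin/perl
# Illustrative sketch: open read-write with '+<', read everything,
# then place the file pointer explicitly at EOF before writing.
# The file name 'words.txt' is an assumption, not from the thread.
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Demo file so the example is self-contained.
open my $fh, '>', 'words.txt' or die $!;
print $fh "alpha\nbeta\n";
close $fh;

open $fh, '+<', 'words.txt' or die "Can't open: $!";
my @existing = <$fh>;        # reading advances the file pointer
seek $fh, 0, SEEK_END;       # make the position at EOF explicit
print $fh "gamma\n";         # new content lands after the old
close $fh;
```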