Re: reading/writing to a file
by TedPride (Priest) on Jun 18, 2005 at 18:44 UTC
You're also reading every line of the dictionary file for every word, which is horribly inefficient. Given that most of the time spent is going to be in file I/O, it would be much better to read a large chunk of the dictionary file at once (if not all of it), and cycle through all the word / new word combinations at once, moving new words from the original array to a new array as a match is found. You'll only have to read the dictionary file once.
And there's no need to read and write at the same time. You can read first and then open it for append after and add all the new words in one print.
Depending on the number of words you're checking each run, you might do better just loading the entire dictionary file into memory as a hash and checking the new words that way. Perhaps you can have the script choose between the two methods depending on how many words there are to check?
Or you could even keep the dictionary file in alphabetic order, which will significantly cut down on the number of matches you have to do if you don't use a hash. New words would go into a second dictionary file, which would be unsorted and checked only if the first file didn't match everything, and you could run a process every so often to merge the new words into the main dictionary file in alphabetic order (which can't be done every run, since it would require rewriting most of the file).
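A minimal sketch of the read-once, hash-lookup approach described above; the file names and sample words are illustrative assumptions, not taken from the original post:

```perl
#!/usr/bin/perl
# Hypothetical sketch: read the dictionary exactly once into a hash,
# check new words against it, then append the misses in one print.
# File names and demo words are assumptions for illustration.
use strict;
use warnings;

# Set up tiny demo files so the example is self-contained.
open my $fh, '>', 'dictionary.txt' or die $!;
print $fh "apple\nbanana\n";
close $fh;
open $fh, '>', 'newwords.txt' or die $!;
print $fh "banana\ncherry\n";
close $fh;

# One pass over the dictionary, ever; hash gives O(1) lookups.
my %dict;
open my $dh, '<', 'dictionary.txt' or die "Can't read dictionary: $!";
while (my $word = <$dh>) {
    chomp $word;
    $dict{$word} = 1;
}
close $dh;

# Collect only the words not already in the dictionary.
my @new;
open my $nh, '<', 'newwords.txt' or die "Can't read new words: $!";
while (my $word = <$nh>) {
    chomp $word;
    push @new, $word unless $dict{$word};
}
close $nh;

# Read first, then open for append and add all new words in one print.
open my $out, '>>', 'dictionary.txt' or die "Can't append: $!";
print $out map { "$_\n" } @new;
close $out;
```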
I thought the only way to read in from a file was one line at a time or all at once? And the files are too big to read in all at once.
Re: reading/writing to a file
by davidrw (Prior) on Jun 18, 2005 at 19:34 UTC
Above solutions look good, but just to mention another tool in the toolbox: if this is on *nix, keep in mind the sort and uniq commands. For example, your Perl could just create a raw dictionary file, not worrying about duplicates (and thus eliminating the need for a possibly very large in-memory hash), and then invoke something like:
system("sort raw_outfile | uniq > real_outfile");
unlink "raw_outfile";
Not sure if it's the best use here, but in general sort/uniq on the command line is very useful.
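As a side note, `sort -u` combines the sort and the duplicate removal in a single step, so the pipe to `uniq` isn't strictly needed; a self-contained sketch (the file names are illustrative):

```shell
# Demo input (illustrative): three lines with one duplicate.
printf 'banana\napple\nbanana\n' > raw_outfile

# sort -u sorts and removes duplicate lines in one step.
sort -u raw_outfile > real_outfile
rm raw_outfile
```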
Re: reading/writing to a file
by tlm (Prior) on Jun 18, 2005 at 19:54 UTC
See if this works for you:
Update: BTW, I neglected to mention that some aspects of your original code made no sense to me (though I didn't change them). Specifically, I think that instead of
@count = split(//, $word); ### why split before removing \W ???
$word=~s/\W//g;
next if @count < 4; ### short circuit (keep nesting down); don't
### need scalar
what you want is something more like
$word =~ s/\W+//g;
next if length $word < 4;
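A small self-contained demonstration of the corrected order of operations: strip the non-word characters first, then test the length of what remains (the sample words are illustrative):

```perl
#!/usr/bin/perl
# Demonstrates the corrected filter: remove \W characters first,
# then skip words shorter than 4 characters. Sample data is made up.
use strict;
use warnings;

my @words = ("don't", 'a-b', 'hello!', 'to');
my @kept;
for my $word (@words) {
    (my $clean = $word) =~ s/\W+//g;   # strip punctuation etc.
    next if length $clean < 4;          # now test the cleaned length
    push @kept, $clean;
}
print "@kept\n";   # prints: dont hello
```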
I don't follow you on the autovivification point. No keys in %$exclude are autovivified in the code I posted.
The decision to use two hashes was one of several that I made for the sake of clarity alone, since I thought that in that way it would be easier for the OP to adapt it to his/her needs. (I.e., I agree with the quote at the beginning of your post :) )
Re: reading/writing to a file
by thundergnat (Deacon) on Jun 18, 2005 at 19:15 UTC
How about something like this? Ignores any punctuation except for internal apostrophes (contractions), works with any Unicode words, not just ASCII or Latin-1.
#! /usr/bin/perl
use warnings;
use strict;
if(scalar(@ARGV) != 3){
die "Usage: $0 inputfile.txt excludefile.txt outputfile.txt\n" ;
}
my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
my %excluded;
open my $exclude, '<', $ARGV[1] or die "Error opening exclude file: $!\n";
while (my $line = <$exclude>)
{
while ($line =~ /($word('$word)?)/g)
{
undef $excluded{$1};
}
}
close $exclude;
my %included;
open my $input, '<', $ARGV[0] or die "Error opening input file: $!\n";
while (my $line = <$input>)
{
while ($line =~ /($word('$word)?)/g)
{
undef $included{$1} unless exists $excluded{$1};
}
}
close $input;
open my $output, '>>', $ARGV[2] or die "Error opening output file: $!\n";
print $output join "\n", sort keys %included;
Update: Changed open for output to append instead of truncate, as pointed out by tlm
Re: reading/writing to a file
by Fletch (Bishop) on Jun 18, 2005 at 18:22 UTC
You're trying to read from the OUT filehandle which you've opened for writing, not reading.
--
We're looking for people in ATL
But I opened it using "+>>$ARGV2", so shouldn't that open it for both reading and writing? If it doesn't, can you suggest how I might do this?
You can use "+" with ">" to open a file for reading and writing, and the initial status is that the "file pointer" (the internal element of the file handle structure that says where the next read or write will occur) is set at the start of the file. (This means that if you write before reading, you'll replace existing content, if any.)
You can use ">>" to write to a file in append mode, and the initial status is that the file pointer is set at the end of the file, so if there is any existing content, your write operations will show up after that.
Using "+>>" is nonsensical because file access begins with the file pointer at EOF, where there's nothing to read. You can get acquainted with the "tell" and "seek" functions, to query and control the file pointer, but that gets complicated, and I don't recommend it.
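For completeness, a hypothetical sketch of the "+<" (read-write) mode with an explicit seek, which is usually the saner choice when you genuinely need both operations on one handle; the file name and contents are assumptions for illustration:

```perl
#!/usr/bin/perl
# Illustrative sketch: open read-write with '+<', read everything,
# then place the file pointer explicitly at EOF before writing.
# The file name 'words.txt' is an assumption, not from the thread.
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Demo file so the example is self-contained.
open my $fh, '>', 'words.txt' or die $!;
print $fh "alpha\nbeta\n";
close $fh;

open $fh, '+<', 'words.txt' or die "Can't open: $!";
my @existing = <$fh>;        # reading advances the file pointer
seek $fh, 0, SEEK_END;       # make the position at EOF explicit
print $fh "gamma\n";         # new content lands after the old
close $fh;
```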