Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to save a growing list of words (>100,000) and occasionally update it. What would be the best way to do it? I need the list in an array to create a regex. At the moment I save it in a plain text file, one word per line. I read it like this:

my @words;
my $filename = "terms.txt";
if ( open my $FH, "<:encoding(UTF-8)", $filename ) {
    while ( my $line = <$FH> ) {
        chomp $line;
        push @words, $line;
    }
    close $FH;
}
my $wordsRX = join "|", map quotemeta, @words;

I write it like this:

# Read @terms with the script above, add new words to it,
# eliminate duplicates, if any, and write the list back.
my @wordsfiltered = uniq_array(@terms);
my $fh = openFileAndWrite($filename);
foreach (@wordsfiltered) {
    print $fh $_ . "\n";
}
close $fh;

This runs fine, but I am wondering if there are better ways to do it. I am thinking, for example, of serialisation, which is incredibly compact. Are there any drawbacks?

use Storable qw( store retrieve );
store( \@wordsfiltered, "terms.array" );
my @words = @{ retrieve("terms.array") };

PS: Is there an easy way in Perl to compare the performance (speed) of the two approaches without working with timestamps directly myself?

Replies are listed 'Best First'.
Re: best way to read and save long list of words
by tybalt89 (Monsignor) on Apr 19, 2020 at 01:24 UTC

    Instead of using explicit while(){} and for(){} loops, I tend to let Perl's built-in list operations handle the whole file at once.

    # read
    use Path::Tiny;
    my @words = ( path($filename)->slurp ) =~ /^.+$/gm;

    or

    # optional read without Path::Tiny
    my @words = do { local ( @ARGV, $/ ) = $filename; <> =~ /^.+$/gm };
    # or
    my @words = do { local ( @ARGV, $/ ) = $filename; split /\n/, <> };

    and for write

    # write
    use Path::Tiny;
    use List::Util qw( uniq );
    path($filename)->spew( join "\n", uniq(@words), '' );

    The regex on input was a win for me reading /usr/share/dict/words on my system, which has over 123,000 lines in it.

    General rule: Don't do things item by item when you can do multiple things at once.

Re: best way to read and save long list of words
by Fletch (Bishop) on Apr 18, 2020 at 23:09 UTC

    Use the Benchmark module to time your approaches.
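    A minimal sketch of such a comparison, reusing the file names from the question (terms.txt and the Storable image terms.array) and assuming both already exist on disk:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );
    use Storable  qw( retrieve );

    # Compare reading the plain-text list against retrieving the Storable
    # image; cmpthese prints a rate table, so no manual timestamps are needed.
    cmpthese( -1, {    # run each sub for at least 1 CPU second
        text => sub {
            open my $fh, '<:encoding(UTF-8)', 'terms.txt' or die $!;
            chomp( my @words = <$fh> );
            close $fh;
        },
        storable => sub {
            my @words = @{ retrieve('terms.array') };
        },
    } );
    ```

    A negative first argument means "run for at least that many CPU seconds" rather than a fixed iteration count, which gives more stable rates for fast subs.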

    Oops edit: Additionally, if you're going to be uniq'ing the list, it might make more sense to keep a hash and use keys on it instead of just an array. As you read the words in, set the corresponding key to 1 and you've in effect already done the uniq pass.
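    A short sketch of that hash-as-set idea, reusing the terms.txt name and the regex-building step from the question:

    ```perl
    use strict;
    use warnings;

    # Reading into a hash makes duplicates collapse as they arrive,
    # so no separate uniq pass is needed afterwards.
    my %seen;
    my $filename = 'terms.txt';
    open my $fh, '<:encoding(UTF-8)', $filename
        or die "open $filename: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $fh;

    my @words   = sort keys %seen;                  # already unique
    my $wordsRX = join '|', map quotemeta, @words;
    ```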

    The cake is a lie.

Re: best way to read and save long list of words
by roboticus (Chancellor) on Apr 19, 2020 at 00:15 UTC

    Anonymous Monk:

    So long as the time spent reading and writing it is insignificant, I'd suggest leaving the file as it is. It's nice to be able to edit the file without worrying about breaking a serialization format.

    If reading and writing the file costs enough time to be worth looking at serialization, then I'd use serialization. But in that event, if it's a file you read often but edit infrequently, then I'd keep the text file as the 'master' so you can edit it and then run your serializer against it to generate the file you'd read frequently.
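    A sketch of that "text file as master" arrangement, using the file names from the question; the freshness test via -M (file age in days since script start) is one simple way to decide when to re-serialize:

    ```perl
    use strict;
    use warnings;
    use Storable qw( store retrieve );

    my ( $txt, $bin ) = ( 'terms.txt', 'terms.array' );

    # Regenerate the Storable image only when the editable text master
    # is newer than the image (or the image doesn't exist yet).
    if ( !-e $bin or -M $txt < -M $bin ) {
        open my $fh, '<:encoding(UTF-8)', $txt or die "open $txt: $!";
        chomp( my @fresh = <$fh> );
        close $fh;
        store( \@fresh, $bin );
    }

    # Everyday reads hit the fast serialized copy.
    my @words = @{ retrieve($bin) };
    ```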

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Or, if the time becomes significant, maybe the OP needs to rethink and start using a database instead? The rethink part may mean replacing the regex with some other technique, of course.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

        I had considered earlier mentioning that if there's a list of words that may need to have additions once in awhile, DBD::SQLite isn't a bad alternative.

        But then I couldn't put my finger on what problem we're solving here. Optimizing disk usage? Optimizing read speed? I understand he's constructing a big regex out of the words, but is that even the best approach to whatever problem is being solved? Maybe a SELECT would be useful? It's just a little hard to recommend an approach without understanding the objective.
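        A hedged sketch of the DBD::SQLite alternative; the terms.db file name and the one-column table layout are made up for illustration, not taken from the thread:

        ```perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=terms.db', '', '',
            { RaiseError => 1, AutoCommit => 1 } );

        $dbh->do('CREATE TABLE IF NOT EXISTS terms (word TEXT PRIMARY KEY)');

        # Adding words: the PRIMARY KEY makes re-inserting a duplicate a no-op,
        # so the uniq step disappears entirely.
        my $ins = $dbh->prepare('INSERT OR IGNORE INTO terms (word) VALUES (?)');
        $ins->execute($_) for qw( apple banana apple );

        # An exact-match lookup via SELECT replaces the big alternation regex.
        my ($found) = $dbh->selectrow_array(
            'SELECT COUNT(*) FROM terms WHERE word = ?', undef, 'banana' );
        ```

        Whether this beats the regex depends on the actual queries: SQLite wins for exact-word lookups and incremental updates, while a compiled regex may still make sense for matching inside larger texts.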


        Dave

      I agree. I also have a long list of English words sorted alphabetically. I received the original file in .zip format and keep that as backup. I keep the working copy in plain text, one word per line. On average, I add one new word every few months. It is convenient to do this with my editor (gvim). I originally planned to develop a library of functions to access the list so I could change the format without changing my applications. My 'lazy' approach worked so well that I could never justify the effort.
      Bill
Re: best way to read and save long list of words
by GrandFather (Saint) on Apr 18, 2020 at 23:25 UTC

    When you run the script (any script) does it take too long to run (for some context specific "too long")? If not, whatever the script is doing is fast enough - no need to waste more time on it. See XKCD: Is It Worth the Time?.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond