Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re: Words in Words

by BrowserUk (Patriarch)
on Sep 30, 2011 at 18:57 UTC ( [id://928897]=note: print w/replies, xml ) Need Help??


in reply to Words in Words

Try this. I project that it should complete your 410 billion comparisons in a little under 10 hours.

The main attempt at efficiency here is to invoke the regex once in global mode (/g) for each word, against a single large string containing all the words and have it return all the matches. It then filters just the matching ones for your specific exclusions.

#! perl -slw use strict; my @words = do{ local @ARGV = 'words.txt'; <> }; chomp @words; my $all = join ' ', @words; my $start = time; my $n = 0; for my $i ( @words ) { for my $j ( $all =~ m[ ([^ ]*$i[^ ]*) ]g ) { next if $j eq $i or $j eq "${i}s" or $j eq "${i}'s"; # print "$j contains $i"; } } printf STDERR "Took %d seconds for %d words\n", time() - $start, scalar @words;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Words in Words
by Lotus1 (Vicar) on Sep 30, 2011 at 21:27 UTC

    Your approach does n^2 comparisons with n=(number of words in list). I took a different aproach of sorting the list then comparing each word only with the following words in the array. This cuts the comparisons by half when n gets large as in this case. I'm curious how grep with the regex will stand up against your approach.

    As for the extra overhead for sorting the word list my guess is the extra sorts at the beginning pay off as n gets large. What do you think? I have been wanting to try out benchmarking for some time after seeing you use it so much. This could be a good opportunity.

    use warnings; use strict; open LOG, ">", 'wordsinwords_LOG.txt' or die $!; #Sort alphabetically first so identical words end up together. chomp (my @words = sort { length($a) <=> length($b) } sort <DATA>); @words = map {lc} @words; for (my $i=0; $i<$#words; $i++) { my $word = $words[$i]; next if $word eq $words[$i+1]; my @matched = grep {/$word/ and $_ ne "${word}s" and $_ ne "${word +}'s" } @words[$i+1..$#words]; print LOG "$word => @matched\n" if @matched; } __DATA__ at Ate crate tree Trip tread read ad at ads crate's

    Log file contains:

    ad => read tread at => ate crate crate's ate => crate crate's read => tread

    This is interesting. http://wordlist.sourceforge.net/ SCOWL (Spell Checker Oriented Word Lists) has 652,000 words plus a range of smaller lists.

      Sorry, but only comparing words with those that come after it alphabetically doesn't work.

      For example, the work 'call' appears in all these words that precede it alphabetically:

      abiogenically abiotically academically acerbically achromatically acoustically acrobatically acronymically acrostically actinically adiabatically adrenergically aerobically aerodynamically aeronautically aesthetically agonistically agronomically alchemically alcoholically algebraically algorithmically allegorically allopatrically allosterically allotypically alogically alphabetically altruistically amitotically anacoluthically anaerobically anagogically analogically analytically anaphorically anarchically anatomically anchoritically anecdotically anemically anesthetically angelically animatronically anisotropically anodically antibiotically antically antidromically antigenically antiseptically antithetically aoristically apathetically aperiodically aphetically aphoristically apically apocalyptically apodictically apolitically apologetically apomictically apoplectically aposematically apotropaically aquatically archaically arctically arithmetically aromatically arthritically artistically ascetically aseptically asthmatically astrologically astronautically astronomically astrophysically asymmetrically asymptotically asyndetically atavistically atheistically athletically atmospherically atomically atomistically atypically authentically autistically autocratically autographically automatically autonomically autotrophically axenically axiologically axiomatically ballistically barbarically barometrically basically bathetically bathymetrically beatifically biblically biochemically biogenetically biographically biologically biomechanically birdcall birdcalls bolometrically bombastically botanically buccally bucolically caecally call

      And the word 'the' appears in 1300 words that precede it in my 178,000 word dictionary.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Sorry, but only comparing words with those that come after it alphabetically doesn't work.

        True, but in the solution I posted I sort by length after the alphabetical sort. The alphabetical sort is probably overkill but needed for this particular solution.

        chomp (my @words = sort { length($a) <=> length($b) } sort <DATA>);

        This process only needs to identify 1 word in the list that meets the criteria resulting in a unique list of the words that were contained within other words.

        I suppose there would still be a problem if this process doesn't look at the words that preceed each word because something could be missed.

        I have this code working and it sure is fast at producing a list of words. It just needs to return 1 word instead of all of the words that match and ensure that it searches the entire list.

        Thank you!

      SCOWL is the wordlist that I am using!. I just combined the 256 files and pulled a distinct list of words

      I am starting to look at trying the code you provided but need to figure out where you are defining what is loaded into <DATA>

      I will let you know what I find

      Thank you.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://928897]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (4)
As of 2024-04-25 19:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found