Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.

The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

After preliminaries,

#!/usr/bin/perl -w -- use strict; # Usage: ./Pattern [n [string]]
I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.
my $depth = shift || 2; my $string = shift || "helloworldhellohellohihellohiworld"; $string =~ s/[\000-\037\0177]/ /g; # strip non-printable
Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.
# seed dictionary with printable characters... my %dict = map {(chr$_ => 0)} '32'..'126'; # and populate it from the input string for (1..$depth) { my $j = ''; for (split //, $string) { my $tmp = $j . $_; $j = defined $dict{$tmp} ? $tmp : do{ $dict{$j}++; $dict{$tmp}=0; $_}; } }
Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
# the following sort routine is arbitrary, # chosen because it behaves the way I want... sort of. { local $, = " "; local $\ = "\n"; print sort { $dict{$a}*length($a)**2 <=> $dict{$b}*length($b)**2 #|| } grep {length$_>1} keys %dict; } #use Data::Dumper; #print Dumper(\%dict); __END__
Run without arguments, the program prints:
ow ohi llo worl dh ldh orl rld hellohi hi hellow lohi iwo ihe ell ld h +e ll lo iw rl oh ih el or wo hel wor loh hell helloh hello

Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.

Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.

++artist for a stimulating question to think about.

After Compline,
Zaxo


In reply to And Now For Something Completely Different... Re: Pattern Finding by Zaxo
in thread Pattern Finding by artist

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2024-04-19 15:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found