Check this out. It finds the recurring phrases "of", "on", "Leonardo da Vinci", "da Vinci", and "Vinci".

$_ = <<"...";
Leonard of Quirm, a character in the Discworld series of novels, is
based largely on Leonardo da Vinci. Leonardo da Vinci died at Clos
Lucé, France, on 2nd May, 1519.
...

# Normalize the whitespace
s/\s+/ /g;

my $RX = qr/
    # NODE                     EXPLANATION
    # ----------------------------------------------------------------------
    \b                       # the boundary between a word char (\w) and
                             # something that is not a word char
    # ----------------------------------------------------------------------
    (                        # group and capture to \1:
    # ----------------------------------------------------------------------
      \w+                    #   word characters (a-z, A-Z, 0-9, _) (1 or
                             #   more times (matching the most amount
                             #   possible))
    # ----------------------------------------------------------------------
      (?:                    #   group, but do not capture (0 or more
                             #   times (matching the most amount
                             #   possible)):
    # ----------------------------------------------------------------------
        \s+                  #     whitespace (\n, \r, \t, \f, and " ")
                             #     (1 or more times (matching the most
                             #     amount possible))
    # ----------------------------------------------------------------------
        \w+                  #     word characters (a-z, A-Z, 0-9, _) (1
                             #     or more times (matching the most
                             #     amount possible))
    # ----------------------------------------------------------------------
      )*                     #   end of grouping
    # ----------------------------------------------------------------------
    )                        # end of \1
    # ----------------------------------------------------------------------
    \b                       # the boundary between a word char (\w) and
                             # something that is not a word char
    # ----------------------------------------------------------------------
    .+?                      # any character except \n (1 or more times
                             # (matching the least amount possible))
    # ----------------------------------------------------------------------
    \b                       # the boundary between a word char (\w) and
                             # something that is not a word char
    # ----------------------------------------------------------------------
    \1                       # what was matched by capture \1
    # ----------------------------------------------------------------------
    \b                       # the boundary between a word char (\w) and
                             # something that is not a word char
/xms;

while ( /$RX/sg ) {
    pos() = $-[0] + 1;
    print "<$1>\n";
}
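For anyone who wants to play with it, here is the same idea stripped of the generated commentary and wrapped in a sub. Treat it as a rough sketch: the name recurring_phrases is made up for this example and is not part of the code above. The regex is the same one, just written compactly. The key trick is resetting pos to one character past the start of each match, so the engine retries from inside the phrase it just found and can report the shorter phrases nested in it.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns each phrase
# that occurs again later in $text, in the order the matches are found.
sub recurring_phrases {
    my ($text) = @_;

    # Normalize the whitespace, as the original does
    $text =~ s/\s+/ /g;

    # Longest run of words (\1), at least one character, then \1 again
    my $rx = qr/\b(\w+(?:\s+\w+)*)\b.+?\b\1\b/s;

    my @found;
    while ( $text =~ /$rx/g ) {
        push @found, $1;

        # Restart one character past the start of this match so that
        # phrases beginning inside it are tried as well.
        pos($text) = $-[0] + 1;
    }
    return @found;
}

print "<$_>\n" for recurring_phrases("to be or not to be");
# prints:
# <to be>
# <be>
```

Note that each starting word reports only the longest phrase that recurs from that point, which is why the sample above gives "to be" and "be" but not a separate "to".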

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊


In reply to Re: Finding recurring phrases by diotalevi
in thread Finding recurring phrases by Anonymous Monk
