
The standard way to do this type of text analysis on large bodies of text is to use MapReduce. The workflow for that looks like this:

1. Take text, emit key/value pairs.
2. Sort the key/value pairs by key then value.
3. For each key, ~~organize~~ process the sorted values.

With a typical framework like Hadoop you only have to write the first and third steps, which are called Map and Reduce respectively. All three steps can also be distributed across multiple machines, allowing you to scale the work across a cluster.
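To make the three steps concrete, here is a toy, purely in-memory sketch in Perl. The sub names (map_step, sort_step, reduce_step) and the choice of lowercased words as keys are made up for illustration; a real framework would stream these records to and from disk instead of holding them in arrays.

```perl
use strict;
use warnings;

# Map: take text, emit [key, value] pairs (here: word, line number).
sub map_step {
    my @lines = @_;
    my @pairs;
    my $line_number = 0;
    for my $line (@lines) {
        $line_number++;
        push @pairs, [ lc($_), $line_number ] for $line =~ /\w+/g;
    }
    return @pairs;
}

# Sort: order the pairs by key, then by value.
sub sort_step {
    return sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @_;
}

# Reduce: equal keys are now adjacent, so one pass groups their values.
sub reduce_step {
    my @sorted = @_;
    my @results;
    for my $pair (@sorted) {
        my ($key, $value) = @$pair;
        if (@results and $results[-1][0] eq $key) {
            push @{ $results[-1][1] }, $value;
        }
        else {
            push @results, [ $key, [$value] ];
        }
    }
    return @results;
}

my @results = reduce_step(sort_step(map_step("the cat sat", "the dog sat")));
print "$_->[0]: @{ $_->[1] }\n" for @results;
```

The point of the sort is that the reduce step never needs more than one key's values in memory at a time.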

In your example you can benefit from the same approach, even using just one machine, even without a framework.

Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory. (Particularly not with how wasteful Perl is with memory.) So don't even try. Plan on using the disk. And MapReduce uses the disk in a way that is very friendly to how disks like to be used. (Stream data to and from the disk, don't seek.)

What you should do is read your original file, and print out all of your n-grams to a second file. It will have lines of the form $n_gram: $line_number.

Then call the unix sort utility on the second file to get a third file that will have the exact same lines, only sorted by $n_gram, then line number. (Line numbers will be sorted asciibetically, not numerically, so 10 comes before 9.)

Now take one pass through the third file to collapse it to a file with lines of the form $n_gram: @line_numbers. (This file will be trivially sorted. If you care, you can sort your line numbers correctly before printing this file.)

And now you can use the look function from the built-in module Search::Dict to quickly look up any n-gram of interest in that file. (But if you have to do any significant further processing of this data, I would recommend trying to think about that processing using a similar MapReduce idea. Doing lots of lookups means that you'll be seeking to disk a lot, and disk seeks are slow.)
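Here is a rough sketch of that pipeline, not tested against anything like 1 GB of data. Everything named here is an assumption for illustration: the sub names, the temporary file names, the fixed n-gram size of 2, and the choice to lowercase and split on whitespace. It sets LC_ALL=C for sort so that the file ends up in the plain byte order that Search::Dict compares with, and it assumes every n-gram has the same number of words.

```perl
use strict;
use warnings;
use Search::Dict;

my $n = 2;    # n-gram size (assumption: bigrams)

# Map: stream the text, print one "$n_gram: $line_number" line per n-gram.
sub write_ngrams {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't read $in_path: $!";
    open my $out, '>', $out_path or die "Can't write $out_path: $!";
    while (my $line = <$in>) {
        my @words = split ' ', lc $line;
        for my $i (0 .. $#words - $n + 1) {
            print {$out} join(' ', @words[$i .. $i + $n - 1]), ": $.\n";
        }
    }
    close $out or die "Can't close $out_path: $!";
}

# Reduce: one pass over the sorted file, collapsing each n-gram's lines.
sub collapse_sorted {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't read $in_path: $!";
    open my $out, '>', $out_path or die "Can't write $out_path: $!";
    my ($current, @numbers);
    my $flush = sub {
        return unless defined $current;
        # Sort the line numbers numerically before printing.
        print {$out} "$current: ", join(' ', sort { $a <=> $b } @numbers), "\n";
    };
    while (my $line = <$in>) {
        chomp $line;
        my ($ngram, $number) = $line =~ /^(.*): (\d+)$/ or next;
        if (!defined $current or $ngram ne $current) {
            $flush->();
            ($current, @numbers) = ($ngram);
        }
        push @numbers, $number;
    }
    $flush->();
    close $out or die "Can't close $out_path: $!";
}

# Map, sort with the unix sort utility, then reduce.
sub build_index {
    my ($text_path, $index_path) = @_;
    my ($unsorted, $sorted) = ("$index_path.unsorted", "$index_path.sorted");
    write_ngrams($text_path, $unsorted);
    local $ENV{LC_ALL} = 'C';    # byte order, to match Search::Dict
    system('sort', '-o', $sorted, $unsorted) == 0 or die "sort failed: $?";
    collapse_sorted($sorted, $index_path);
    unlink $unsorted, $sorted;
}

# Binary search the index; returns the line numbers, or nothing if absent.
sub lookup_ngram {
    my ($index_path, $ngram) = @_;
    open my $fh, '<', $index_path or die "Can't read $index_path: $!";
    my $prefix = "$ngram: ";
    return if look($fh, $prefix) < 0;
    my $line = <$fh>;
    return unless defined $line and index($line, $prefix) == 0;
    chomp $line;
    return split ' ', substr($line, length $prefix);
}
```

You would call build_index('corpus.txt', 'ngrams.idx') once, then lookup_ngram('ngrams.idx', 'the cat') as often as needed (both file names are placeholders). Note that an n-gram that appears twice on one line will have that line number listed twice.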

In reply to Re: statistics of a large text by tilly
in thread statistics of a large text by perl_lover_always
