I think halley has it right: building an index keyed on an MD5 checksum of each paragraph is the way to go.
I thought it worth mentioning that you can easily read a file paragraph by paragraph by setting local $/ = '';
The following script builds an in-memory index (filename/byte offset) of every paragraph in every file listed on the command line.
It indexed every paragraph in the 512 HTML files in my Perl distribution in under 10 seconds and 6MB.
#! perl -slw
use strict;
use G; ## Expands command line wildcards
use Data::Dumper;
use Digest::MD5 qw[ md5_hex ];

local $/ = ''; ## paragraph mode.

my( $pos, %index ) = 0; ## The first para starts at offset 0

while( <> ) {
    ## build a HoAoAs, MD5 is the key
    ## The values are arrays of [ filename, offset ].
    push @{ $index{ md5_hex( $_ ) } }, [ $ARGV, $pos ];

    ## Get the next offset
    $pos = tell ARGV;

    ## Back to 0 if we reached the EOF
    $pos = 0 if eof( ARGV );
}

print Dumper \%index;

__END__
C:\Perl\html>p:359522 *.html *\*.html *\*\*.html
Processed 512 files and 5829 paragraphs into 3293 unique signatures.
By storing the offset, you can seek to the place in the file(s) that have a matching MD5 and verify that the paras are indeed identical. The chance of two different paragraphs producing the same MD5 is small, but it is not zero.
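The verification step could look something like the sketch below. It is only an illustration: read_para_at and paras_identical are hypothetical helper names, not part of the script above, and it assumes the files are reopened the same way they were read when the index was built, so the stored byte offsets still line up.

```perl
use strict;
use warnings;

## Hypothetical helper: return the paragraph starting at byte $offset of $file.
sub read_para_at {
    my( $file, $offset ) = @_;
    local $/ = '';    ## paragraph mode, matching the indexing pass
    open my $fh, '<', $file or die "Can't open $file: $!";
    seek $fh, $offset, 0 or die "Can't seek in $file: $!";
    my $para = <$fh>;
    close $fh;
    return $para;
}

## Hypothetical helper: true if every [ filename, offset ] pair passed in
## (i.e. the entries stored under one MD5 key) holds exactly the same text.
sub paras_identical {
    my @paras = map { read_para_at( @$_ ) } @_;
    my $first = shift @paras;
    return 0 unless defined $first;
    my $mismatches = grep { !defined $_ or $_ ne $first } @paras;
    return $mismatches == 0;
}
```

With the index from the script above, you would call paras_identical( @{ $index{ $md5 } } ) for each key that has more than one entry, and treat a false result as a genuine collision rather than a duplicate paragraph.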
In reply to Re: Efficiency: Finding if a file contains a paragraph
by BrowserUk
in thread Efficiency: Finding if a file contains a paragraph
by C_T