Hi,
I have a base directory containing (amongst other files/dirs) eleven subdirectories (Sub1-Sub11). Each of these contains approx. 10000 html files (amongst other files), which don't actually contain HTML, just a list of keywords (30 max). I wish to produce a sorted list of unique keywords. There is a lot of repetition, and new keywords can be added at any time to any file (hence this needs to be done dynamically). Worst case is that I might have to sort and uniq a list of 11 x 10000 x 30 = 3300000 items. I can't flatten the directory structure, as the OS (Windows) has difficulty with directories containing such large numbers of small files. My code is definitely not optimised; I've never really worried about optimisation for saving milliseconds, but given the magnitude here, I am happy to save minutes! Apologies for any offensive code!
Any solutions, pointers, references, or just thoughts and comments will be well received.
Thanks
ant
use strict;
use Cwd;
my ($PWD) = getcwd;
my (@DirList);
my ($DirItem);
my (@SUBDirList);
my ($SUBDirItem);
my ($CurrentHtmlFile);
my (@Lines);
my ($Line);
my (%een);
my (@KeyWords);
opendir(DIR, $PWD) || die "Cannot Open The Directory \"$PWD\"\n";
@DirList = readdir(DIR);
closedir DIR;
foreach $DirItem (@DirList)
{
    if ($DirItem =~ /^Sub/)
    {
        opendir(SUBDIR, "$PWD\\$DirItem") || die "Cannot Open The Directory \"$PWD\\$DirItem\"\n";
        @SUBDirList = readdir(SUBDIR);
        closedir SUBDIR;
        foreach $SUBDirItem (@SUBDirList)
        {
            if ( $SUBDirItem =~ /html$/)
            {
                $CurrentHtmlFile = "$PWD\\$DirItem\\$SUBDirItem" ;
                open (READ, "<$CurrentHtmlFile") || die "Couldn't Read From $CurrentHtmlFile";
                # Only the first line of each file is read
                $Line = <READ>;
                close (READ);
                next unless defined $Line;   # skip empty files
                chomp $Line;                 # otherwise "cat\n" and "cat" count as different keywords
                @Lines = split (/,/,$Line);
                push (@KeyWords, @Lines);
            }
        }
    }
}
# Hash keys are unique, so this removes the duplicates
foreach (@KeyWords) {++$een{$_};}
# One keyword per line
print map { "$_\n" } sort keys %een;
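One possible way to trim the script above is to drop each keyword straight into the hash as the files are read, so the 3300000-item intermediate array is never built. The following is only a sketch under assumptions: the helper name collect_keywords is made up for illustration, keywords are taken to be comma-separated on the first line of each file (as in the script above), and the file match is tightened to /\.html$/.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: returns the sorted unique keywords found in
# $base/Sub*/*.html. Assumes comma-separated keywords on the first line.
sub collect_keywords {
    my ($base) = @_;
    my %seen;

    opendir(my $dh, $base) or die "Cannot open directory \"$base\": $!\n";
    my @subdirs = grep { /^Sub/ && -d File::Spec->catdir($base, $_) } readdir($dh);
    closedir $dh;

    for my $sub (@subdirs) {
        my $dir = File::Spec->catdir($base, $sub);
        opendir(my $sdh, $dir) or die "Cannot open directory \"$dir\": $!\n";
        # Read one directory entry at a time instead of slurping the listing
        while (defined(my $name = readdir($sdh))) {
            next unless $name =~ /\.html$/;
            my $file = File::Spec->catfile($dir, $name);
            open(my $fh, '<', $file) or die "Couldn't read from $file: $!\n";
            my $line = <$fh>;
            close $fh;
            next unless defined $line;      # empty file
            $line =~ s/\r?\n\z//;           # strip the line ending
            # Record each non-empty keyword; duplicates collapse in the hash
            $seen{$_} = 1 for grep { length($_) } split /,/, $line;
        }
        closedir $sdh;
    }
    return sort keys %seen;
}

# Only runs when a base directory is given on the command line
if (@ARGV) {
    print "$_\n" for collect_keywords($ARGV[0]);
}
```

The sort then only has to handle the unique keywords rather than every occurrence, which is probably where most of the saving would come from given the amount of repetition.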
A bit of background: I have an image library. Keywords for image \d+.jpg are stored in \d+.html, and this can be updated.
Slight paradigm shift. I am not going to run this dynamically each time a keyword file is updated, but will instead just update the keyword list (trawl through the keyword list; if the keyword has been seen, do nothing, else append the new keyword to the end). Keywords can only be added, not removed, so this should reduce the load considerably.
Thanks to all that commented; it will definitely help with the initial collection, and I'm a better Perl programmer now too!
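The append-only update could look roughly like this. It is a sketch, not a finished solution: the helper name append_new_keywords and the one-keyword-per-line layout of the keyword list file are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: appends any keywords in @candidates that are not
# already in $list_file (assumed to hold one keyword per line).
# Returns the keywords that were actually added.
sub append_new_keywords {
    my ($list_file, @candidates) = @_;
    my %seen;

    # Load the existing list, if there is one
    if (-e $list_file) {
        open(my $in, '<', $list_file) or die "Cannot read $list_file: $!\n";
        while (my $kw = <$in>) {
            chomp $kw;
            $seen{$kw} = 1;
        }
        close $in;
    }

    # Keep only non-empty keywords not seen before (also dedupes @candidates)
    my @new = grep { length($_) && !$seen{$_}++ } @candidates;

    if (@new) {
        open(my $out, '>>', $list_file) or die "Cannot append to $list_file: $!\n";
        print {$out} "$_\n" for @new;
        close $out or die "Cannot close $list_file: $!\n";
    }
    return @new;
}
```

Because new keywords go on the end, the list file itself is not kept in sorted order; it would need sorting when it is read back if a sorted list is required.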
cheers