Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone help me with this? When parsing a log text file (about 300KB) to display in the browser, the script hangs. Here is the code I have...
use File::Find;
use File::Basename;

my $start = $getdata;
my $end   = $getdata2;
my $dir   = '../weblog/';
@ARGV = ();

die "start must be less than end" if $start >= $end;
die "no dir $dir here" unless -d $dir;

find sub {
    my $numb = (fileparse($_, '.txt'))[0];
    return unless $numb =~ /^\d+$/;
    push @ARGV, $File::Find::name if $numb >= $start and $numb <= $end;
}, $dir;

die "no .txt files found in $dir" unless @ARGV;

my %replace = (
    iapw_p1 => "<font color=#000000 size=\"2\"><b><i>P1 = (search inquiry, launch page)</i></b></font>\n",
    iapw_p2 => "<font color=#000000 size=\"2\"><b><i>P2 = (policy coverages, endorsements, operators)</i></b></font>\n",
    iapw_p3 => "<font color=#000000 size=\"2\"><b><i>P3 = (policy notepad)</i></b></font>\n",
    iapw_b1 => "<font color=#000000 size=\"2\"><b><i>B1 = (billing inquiry: auto, home and properties)</i></b></font>",
    iapw_p0 => "<font color=#000000 size=\"2\"><b><i>P0 = (policy search)</i></b></font>\n",
    iapw_c0 => "<font color=#000000 size=\"2\"><b><i>C0 = (record search)</i></b></font>\n",
    iapw_c1 => "<font color=#000000 size=\"2\"><b><i>C1 = (record_two inquiry)</i></b></font>\n",
    iapw_c3 => "<font color=#000000 size=\"2\"><b><i>C3 = (noted notepad)</i></b></font>\n",
    iapw_h1 => "<font color=#000000 size=\"2\"><b><i>H1 = (owners and rental inquiry and history)</i></b></font>\n",
);

while (<>) {
    foreach my $key (keys %replace) {
        s/$key/$replace{$key}/g;
    }
    print;
}

Replies are listed 'Best First'.
Re: Text File Parsing / Homegrown Template
by tadman (Prior) on Jun 18, 2002 at 21:48 UTC
    You might want to consider using something like Text::Template which could do this for you.
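    For illustration, a minimal Text::Template sketch (assuming the module is installed from CPAN; the template text and variable name here are made up):

```perl
use strict;
use warnings;
use Text::Template;

# Fill a template from a hash of values instead of running one
# substitution per hash key over the whole file.
my $template = Text::Template->new(
    TYPE   => 'STRING',
    SOURCE => 'P1 = ({$p1_desc})',
);
my $out = $template->fill_in(HASH => { p1_desc => 'search inquiry, launch page' });
print $out, "\n";
```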

    That being said, you certainly want to reduce the complexity of your looping. The first while can be killed off if you read in the entire file before processing:
    my $file = join('', <>);
    foreach (keys %replace) {
        $file =~ s/$_/$replace{$_}/g;
    }
    Further, you could likely generalize this into something like so, assuming a certain consistency to your tags:
    $file =~ s/(iapw_..)/$replace{$1}/g;
    Loop free, this should be much faster.
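    A loop-free version of both steps might look like this (a sketch, with the %replace values abbreviated for brevity):

```perl
use strict;
use warnings;

# One alternation pattern built from the hash keys replaces every
# tag in a single pass over the slurped text.
my %replace = ( iapw_p1 => 'P1', iapw_p2 => 'P2' );  # abbreviated values
my $file    = "start iapw_p1 middle iapw_p2 end";    # stands in for join('', <>)
my $pattern = join '|', map quotemeta, keys %replace;
$file =~ s/($pattern)/$replace{$1}/g;
print $file, "\n";
```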

    Your technique of pushing things onto @ARGV is clever, and yet at the same time kind of disturbing. If this were production code, I'd suggest doing something a little more formalised.
      "Your technique of pushing things onto @ARGV is clever, and yet at the same time kind of disturbing."

      Looks like the code i gave out at (jeffa) Re: Print contents of a range of text files to browser (was: text files). That is a trick i picked up from merlyn at •Re: Re: Listing Files.

      "If this were production code, I'd suggest doing something a little more formalised."

      Let's see what merlyn has to say about that: •Re: Re: •Re: Re: Listing Files. I'm with merlyn on this one.

      UPDATE:
      Although ... i really do believe that using @ARGV in this manner is what is causing the script to hang - if any of the conditions that cause a die in the code above were met without that die in place, @ARGV would end up empty and <> would fall back to reading STDIN, so the code would hang waiting for input. Maybe Anony should be a bit more careful and explicitly open the files like you suggest.

      Myself, i will still use that technique anyway. You just have to know what you are doing. ;)

      jeffa

      Scissors are for running
        I can see where merlyn is coming from. It is true that @ARGV and <> are intertwined, that this behaviour is a fundamental property of Perl, but even so, it's still kind of odd.

        This use of @ARGV is that sometimes irritating "Swiss-Army" property of Perl. Sure, it slices and dices, but just because you can doesn't mean you should. merlyn has his reasons for promoting it, I'm sure, but I'm not totally sold.

        The thing I don't like about it is that, apart from smelling a little too much of shell-script programming, it isn't re-entrant. If you need to process a sub-file in the same manner, things are going to get a little hairy because you're using a global. You could make it local, presumably, but I was under the impression that local was going to walk the plank in short order.
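        One way to address the re-entrancy worry is to localize @ARGV before using <>, so the global is restored on scope exit (a sketch; slurp_files is a made-up name):

```perl
use strict;
use warnings;

# Read a list of files through the magic <> operator without
# clobbering the caller's @ARGV; local() restores it on return.
sub slurp_files {
    local @ARGV = @_;
    my $text = '';
    $text .= $_ while <>;
    return $text;
}
```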

        What I'd probably do is sub-class IO::File and make something that could read from a group (or glob) of files. That way the "file chaining" stuff is safely contained.
Re: text file parsing
by jarich (Curate) on Jun 19, 2002 at 00:08 UTC
    find sub {
        my $numb = (fileparse($_, '.txt'))[0];
        return unless $numb =~ /^\d+$/;
        push @ARGV, $File::Find::name if $numb >= $start and $numb <= $end;
    }, $dir;

    Basically, this code goes to a directory and reads all the files in it.
    Sounds like a job for readdir.
    opendir DIR, $dir or die "Failed to open directory $dir: $!";
    foreach my $file (readdir DIR) {
        # much the same as using fileparse
        my ($number, $ext) = split /\./, basename($file);
        next unless defined $ext and $ext eq "txt";
        next unless $number =~ /^\d+$/;
        next if $number < $start || $number > $end;   # skip files outside the range
        push @ARGV, "$dir/$file";
    }
    closedir DIR;
    but that's just one suggestion. It seems to me that using File::Find is somewhat overkill for this problem as File::Find will happily recurse your entire directory tree and that might take quite some time.

    Of course if you need it to recurse your directory tree then readdir will be useless and you've chosen the correct tool.

    If you're certain that the problem does not lie in your call to find, perhaps you should print out @ARGV and have a look at its contents. Something like:

    { local $, = "\n"; print @ARGV; }
    should do the trick nicely.

    Note that when you print in your while loop at the bottom you're concatenating all the files together when they were originally separate. So if @ARGV ended up containing 10 files then your output will be all 10 files concatenated after the substitution.
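    If the files should stay visually separate, one idiom is to print a header at the start of each file: $ARGV holds the current file name, and closing ARGV at eof resets $. for the next file (the header format and sub name here are made up):

```perl
use strict;
use warnings;

# Print each file named in the argument list with a header line so
# the concatenated output stays readable.
sub print_with_headers {
    local @ARGV = @_;
    while (<>) {
        print "==== $ARGV ====\n" if $. == 1;  # first line of each file
        print;
        close ARGV if eof;                     # resets $. for the next file
    }
}
```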

    Hope this helps.

    jarich

Re: text file parsing
by kvale (Monsignor) on Jun 18, 2002 at 21:49 UTC
    I don't see any obvious spots where the program might go into an infinite loop and without more information it is hard to tell where to start.

    A general strategy for debugging these sorts of problems is to put print statements in the code saying "I got here" and then run it to see if you got there. If you got there, check later in the program, if not, check earlier. This will allow you to soon find the pathological code.
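    For example (a sketch; the checkpoint labels are made up, and the handler just collects what was warned so the checkpoints can be inspected):

```perl
use strict;
use warnings;

# warn() writes to STDERR, which ends up in the web server's error
# log even when STDOUT is already committed to the browser.
my @seen;
$SIG{__WARN__} = sub { push @seen, $_[0]; print STDERR $_[0] };

warn "checkpoint 1: about to scan the log directory\n";
# ... the find() call would go here ...
warn "checkpoint 2: file list built, starting substitutions\n";
# ... the while (<>) loop would go here ...
warn "checkpoint 3: done\n";
```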

    To help us solve your problem, you might also describe how the code hangs. Is any output produced? Did you try -w to catch typos?

    -Mark
Re: text file parsing
by caedes (Pilgrim) on Jun 18, 2002 at 21:52 UTC
    What does "find" do?
      Basically, this code goes to a directory and reads all the files in it. They are all log files and are named individually by date, like 04132002.txt, up to today's date, 06182002.txt. Then, from a form with two select menus, the user chooses two dates to check the activity for a particular action. For example, choosing the dates 04132002 to 06082002, the program has to display to the browser all the users for that date range and what page they visited during that period on the website. Complications arise when parsing the files: the program hangs while doing some substitutions, or it could be for some other unknown reason.
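      One thing to watch with MMDDYYYY file names: comparing them as plain numbers only works within a single calendar year (12312001 compares greater than 01012002). A sketch of a fix, reordering the digits to YYYYMMDD before comparing (sortable is a made-up helper name):

```perl
use strict;
use warnings;

# Rewrite an MMDDYYYY date string as YYYYMMDD so numeric or string
# comparison orders the dates correctly across year boundaries.
sub sortable {
    my ($mmddyyyy) = @_;
    my ($mm, $dd, $yyyy) = $mmddyyyy =~ /^(\d{2})(\d{2})(\d{4})$/
        or die "bad date: $mmddyyyy";
    return "$yyyy$mm$dd";
}
```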
        This might be totally unrelated.

        Just as a note, but on most filesystems, directories with a large number of entries (e.g. 10,000) can start to get really slow to access. Sometimes you don't realize that merely finding the file is an issue until you try to open just one, and that can take some time as the OS scans for the requisite directory entry.