tsk1979 has asked for the wisdom of the Perl Monks concerning the following question:

I can do this with two passes over the file, but I am looking for a way to do it in one pass, as the files are very big. This is what I intend to do. Imagine a text file with the following data:
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    .
    .
    .
I want the output file to contain
    data -start <string> -end <string>
    data -start <string> -end <string>
    data -start <string> -end <string>
    .
    .
    .
The catch? After removing the intermediates, there will be lots of duplicates, which I want to remove. In my current flow, I read in the file, write the records out to an array, and then remove duplicates from the array. This two-pass process seems to be a waste of time. If I can get a one-pass algorithm, that would be great!

Re: Sorting and subsituting a data file, one pass
by ikegami (Patriarch) on Jun 21, 2010 at 06:05 UTC
    my %seen;
    while (<>) {
        if (s/\\\n\z/ /) {
            my $next = <>;
            if (defined($next)) {
                if ($next =~ /^\s*-end\s/) {
                    $_ .= $next;
                } elsif ($next =~ /(\\\n)\z/) {
                    $_ .= "\\\n";
                } else {
                    $_ .= "\n";
                }
                redo;
            }
        }
        print if /^data\s/ && !$seen{$_}++;
    }
      This should be the solution. But I am stumped at <>. Won't this stop for user input at every stage?
        Only if you neither pass filenames on the command line nor redirect a file to STDIN:
        $ perl myprogram.pl file1 file2 file3
        $ myprogram.pl < file4
Re: Sorting and subsituting a data file, one pass
by CountZero (Bishop) on Jun 21, 2010 at 06:27 UTC
    Go through your data file line by line, assembling your data -start <string> -end <string> as you go along. Once each item is assembled store it as the key of a hash (the value can be anything you like or left empty). Duplicate hash keys will disappear automatically and you can then sort the keys.
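The approach described above might be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `unique_records` is hypothetical, and it assumes each record starts on a line beginning with `data -start` and finishes on a later line beginning with `-end`, as in the sample data. It also assumes the set of unique records fits in memory, even if the file itself is very big.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads a filehandle line by line
# and returns the unique "data -start ... -end ..." records, sorted.
sub unique_records {
    my ($fh) = @_;
    my %seen;     # assembled record => 1; duplicate keys collapse automatically
    my $start;    # the "data -start ..." part of the record being assembled

    while (my $line = <$fh>) {
        # Strip the trailing newline, continuation backslash and padding.
        $line =~ s/\s*\\?\s*\z//;

        if ($line =~ /^\s*(data\s+-start\s.*)/) {
            $start = $1;                      # a new record begins
        }
        elsif (defined $start && $line =~ /^\s*(-end\s.*)/) {
            $seen{"$start $1"} = 1;           # record complete, store as hash key
            undef $start;
        }
        # Anything else (garbage, -intermediate lines) is ignored.
    }
    return sort keys %seen;
}
```

One way to drive it would be `print "$_\n" for unique_records(\*ARGV);`, which reads the files named on the command line (or STDIN) in a single pass. Because every record is rebuilt from its stripped `-start` and `-end` parts, two records that differ only in their intermediate lines end up as the same hash key.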

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James