Text Conversion

artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I need help with a text conversion job.
The heart of the program is:

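while (<IN>) {
    # parse one input line into its sections
    my $source = Source->new(_text => $_);
    $source->parsing;
    # build the output parts from the parsed source, then print them
    my $destination = Destination->new(_source => $source);
    $destination->conversion;
    $destination->print;
}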

$source->parsing takes the text, divides it into several small segments, and creates hash entries like $source->{_section1}, $source->{_section2}, etc. $source->{_section2} is further divided into $source->{_section21}, $source->{_section22}, etc.

$destination->conversion takes the source object and creates its own parts, such as $destination->{_part1}, $destination->{_part2}, etc.

$destination->print combines the different parts of the destination and prints the final text appropriately.

Here is the problem.

If I process a single line of input at a time, it works fine. But some input lines are continuations of the previous line. They can be identified by a special marker character '\' at the beginning of $source->{_section2}, which is available only after $source->parsing.

There can be 0, 1, or more continuation lines.

What I would like to do is attach the 'new' $source->{_section2} to the previous $source->{_section2} if the new one has a continuation marker, so that when I pass the $source object as a parameter to $destination it is a single source object (which includes the data from any continuation lines it has). In other words, I would like to wait for the next input line to see whether it continues the current line.

Also note, there is no indication whether the current line has further continuation or not.
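
The wait-for-the-next-line idea described above can be sketched by holding one parsed source object back until the following line shows whether it is a continuation. This is only an illustration: the Source package here is a hypothetical stand-in for the real class, is_continuation, absorb, and convert are invented names, and the split into _section1 and _section2 assumes whitespace-separated fields.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real Source class: the first field becomes
# _section1 and the rest of the line becomes _section2.
package Source;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub parsing {
    my ($self) = @_;
    my $text = $self->{_text};
    chomp $text;
    @{$self}{qw(_section1 _section2)} = split ' ', $text, 2;
    return $self;
}

# True when _section2 starts with the continuation marker '\'.
sub is_continuation {
    my ($self) = @_;
    return (defined $self->{_section2} && $self->{_section2} =~ /^\\/) ? 1 : 0;
}

# Append a continuation's _section2 (minus the marker) to this object.
sub absorb {
    my ($self, $next) = @_;
    (my $extra = $next->{_section2}) =~ s/^\\//;
    $self->{_section2} .= " $extra";
    return $self;
}

package main;

my @done;    # stands in for the Destination conversion and print steps
sub convert {
    my ($source) = @_;
    push @done, join ' ', $source->{_section1}, $source->{_section2};
}

my $data = <<'END';
foo1 bar1 etc1
foo21 bar21 etc21
foo22 \bar22 etc22
foo3 bar3 etc3
END

open my $in, '<', \$data or die "cannot open in-memory input: $!";
my $pending;    # the one source object being held back
while (my $line = <$in>) {
    my $source = Source->new(_text => $line);
    $source->parsing;
    if ($pending && $source->is_continuation) {
        $pending->absorb($source);    # merge and keep waiting
        next;
    }
    convert($pending) if $pending;    # previous record is now complete
    $pending = $source;
}
convert($pending) if $pending;        # flush the last record at end of input
print "$_\n" for @done;
```

Only one source object is held at a time, so memory use does not grow with the size of the input file.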

Size: A single line is around 400 characters at most, and there are 200,000 lines in the file. Each section can be from 0 to 300 characters.

Frequency: This is not a one-time job.

I would appreciate any suitable architecture.

Thanks,
Artist


Re: Text Conversion by talexb (Chancellor) on Dec 30, 2002 at 20:20 UTC
Leave your existing script alone, and instead write a separate script that simply takes care of handling continuation lines. Then you run that, pipe the output into the input stream of your existing script, and you are done.

--t. alex
Life is short: get busy!
Re: Text Conversion by Mr. Muskrat (Canon) on Dec 30, 2002 at 20:41 UTC
The following is recipe 8.1 "Reading Lines with Continuation Characters" from The Perl Cookbook. (Well worth the money btw.) Tom and Nate, I hope you don't mind...

    while (defined($line = <FH>)) {
        chomp $line;
        if ($line =~ s/\\$//) {
            $line .= <FH>;
            redo unless eof(FH);
        }
        # process full record in $line here
    }

Perhaps you could modify it to suit your needs.
Re: Text Conversion by pg (Canon) on Dec 30, 2002 at 20:22 UTC
Update: Okay, now that you have explained more, here is some code that works for the sample data you gave. For me, the best way to interpret your requirement is to base everything on the sample you gave, so you may have to make some changes to make it fit:

    use strict;
    open(IN, "<", "data.txt") or die "data.txt: $!";
    my @in = <IN>;
    close(IN);
    open(OUT, ">", "out.txt") or die "out.txt: $!";
    my $out = "";
    foreach my $in (@in) {
        chomp($in);
        if ($in =~ m/^(.*?)\\(.*?)\s*$/) {
            # continuation line: append the text after the marker
            $out .= " $2";
        }
        else {
            # new record: print the finished one, then start over
            if ($out ne "") {
                print OUT "$out\n";
            }
            $in =~ m/^(.*?)\s*$/;
            $out = $1;
        }
    }
    print OUT "$out\n";
    close(OUT);

Original Reply:

The way you store your data is really problematic. Using a flattened structure like _section1, _section2, _section12, _section13, etc. is not a good idea at all. It is much more natural to look at your data structure as a tree. For this purpose, there are lots of implementation choices.

One choice is an array-ref of array-refs. So first you would have $source{_source}, which is an array ref. For example, your _section1 would be

    $source{_source}->[0]

If you need to store some text for section1 itself, even though it is not a leaf, then store it in:

    $source{_source}->[0][0]  # dedicate 0 to myself

If this section1 is not a leaf, it would itself be another array-ref, and its first child would be

    $source{_source}->[0][1]  # my 1st child

This goes on and on. There is a good chance that you could make your parse function recursive. A piece of sample code:

    use Data::Dumper;
    use strict;
    my %source;
    $source{_source} = [];
    $source{_source}->[0] = [];
    $source{_source}->[0][0] = "section1";   # myself
    $source{_source}->[0][1] = "section11";  # 1st child
    $source{_source}->[0][2] = "section12";  # 2nd child
    print Dumper(\%source);
Re: Re: Text Conversion by artist (Parson) on Dec 30, 2002 at 20:54 UTC
Hi,

To all those who are attempting to answer and asking me to change the data structure: my source object has methods which can be called from outside. Thus, from the destination object I access the method with source->section1 rather than source->{_section1}. Also, section1 and section2 are just placeholders here for the actual names, such as account_number or key_code.

Here is the sample data suggested by jdporter.

    foo1 bar1 etc1
    foo21 bar21 etc21
    foo22 \bar22 etc22
    foo3 bar3 etc3
    foo31 bar31 etc31
    foo32 \bar32 etc32
    foo33 \bar33 etc33

Output required:

    foo1 bar1 etc1
    foo21 bar21 etc21 bar22 etc22
    foo3 bar3 etc3
    foo31 bar31 etc31 bar32 etc32 bar33 etc33

Thanks,
Artist
Re: Re: Text Conversion by artist (Parson) on Dec 30, 2002 at 21:52 UTC
Hi pg,

I like your reply and it works fine. There is one problem: my input file is about 30 MB, and I don't want to store the entire thing in memory, as that may create problems. So

    my @in = <IN>;

is not a good option for this case.

Also, since I have to parse further, I would like to have one item added. At what point would I add that?

    $key = foo_item
    $value = bar_items
    process($key,$value)

Thanks,
Artist
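
One possible way to address both points (no slurping, and a place to call process) is to read line by line and call process only once the next line proves the current record is complete. The sketch below makes several assumptions that are not in the thread: the key is taken to be the first whitespace-separated field, the value is everything after it, read_records is a hypothetical helper name, and process is a stand-in that only records its arguments.

```perl
use strict;
use warnings;

my @calls;    # what process() received, kept for demonstration only

# Hypothetical stand-in for the real per-record work.
sub process {
    my ($key, $value) = @_;
    push @calls, "$key => $value";
}

# Read one line at a time, holding a single pending record in memory.
sub read_records {
    my ($fh, $callback) = @_;
    my ($key, $value);
    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $rest) = split ' ', $line, 2;
        next unless defined $first;        # skip blank lines
        $rest = '' unless defined $rest;
        # A leading '\' on the second field marks a continuation line.
        if (defined $key && $rest =~ s/^\\//) {
            $value .= " $rest";
            next;
        }
        # A new record starts here, so the pending one is complete.
        $callback->($key, $value) if defined $key;
        ($key, $value) = ($first, $rest);
    }
    $callback->($key, $value) if defined $key;   # flush the last record
    return;
}

my $data = <<'END';
foo1 bar1 etc1
foo21 bar21 etc21
foo22 \bar22 etc22
foo3 bar3 etc3
foo31 bar31 etc31
foo32 \bar32 etc32
foo33 \bar33 etc33
END

open my $in, '<', \$data or die "cannot open in-memory input: $!";
read_records($in, \&process);
print "$_\n" for @calls;
```

With a real file, the in-memory handle would be replaced by a handle opened on the input file; only the pending key and value are held at any time, regardless of file size.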
Re: Text Conversion by poj (Abbot) on Dec 30, 2002 at 22:06 UTC
My offering to remove the continuation lines into the correct format:

    open (IN, "in.txt") or die "$!";
    open (OUT, ">out.txt") or die "$!";
    $_ = <IN>;
    chomp;
    print OUT;
    while (<IN>) {
        chomp;
        if (m!\\(.*)!) {
            print OUT (" " . $1);
        }
        else {
            print OUT "\n$_";
        }
    }
    print OUT "\n";

poj
Re: Text Conversion by John M. Dlugosz (Monsignor) on Dec 30, 2002 at 21:36 UTC
I like recipe 8.1, already posted by someone else, to read the continuation lines. It only affects the top of the loop: change your while to the block of several lines. A little messier in your case since you have to stay one line ahead.

But if that's too much, an elegant idea that's halfway between this and the pipeline as a separate process/script is to use a tied file handle. Have the "read a line" logic do the continuation, reading one ahead. This isolates the state into the object and doesn't change your main code at all.

--John
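
A minimal sketch of the tied file handle idea follows, under some assumptions: ContinuationHandle is a hypothetical class name, a continuation is recognized by a '\' at the start of the second whitespace-separated field (as in the sample data), and only scalar-context reads such as while (<IN>) are handled.

```perl
use strict;
use warnings;

package ContinuationHandle;    # hypothetical class name

# Wrap an already opened filehandle; 'ahead' holds the one line read too far.
sub TIEHANDLE {
    my ($class, $fh) = @_;
    return bless { fh => $fh, ahead => undef }, $class;
}

# Return one complete record with its continuation lines already merged.
sub READLINE {
    my ($self) = @_;
    my $fh = $self->{fh};
    my $record = delete $self->{ahead};
    $record = readline($fh) unless defined $record;
    return undef unless defined $record;          # end of input
    chomp $record;
    while (defined(my $next = readline($fh))) {
        chomp $next;
        if ($next =~ /^\S+\s+\\(.*)$/) {
            $record .= " $1";                     # merge continuation text
            next;
        }
        $self->{ahead} = $next;                   # belongs to the next record
        last;
    }
    return "$record\n";
}

package main;

my $data = <<'END';
foo1 bar1 etc1
foo21 bar21 etc21
foo22 \bar22 etc22
foo3 bar3 etc3
foo31 bar31 etc31
foo32 \bar32 etc32
foo33 \bar33 etc33
END

open my $raw, '<', \$data or die "cannot open in-memory input: $!";
tie *IN, 'ContinuationHandle', $raw;

# The main loop keeps its original shape; each $_ is now a full record.
my @records;
while (<IN>) {
    chomp;
    push @records, $_;
}
untie *IN;
print "$_\n" for @records;
```

The lookahead state lives entirely in the tied object, so the existing while (<IN>) loop that builds the Source and Destination objects would not need to change. Note that this version merges at the line level, before parsing, which differs from merging after $source->parsing.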