Re: Efficiency issues in text parsing

Hi, thanks for your comments. I have one more small question.

Here is the input data:

<input>
This is to test. this is to test
<p>This is to test. This is
to test</p>
<p>This is to test. This is
to test</p>
This is to test.
this is to test
</input>

<output>
This is to test. this is to test
<p>This is to test. This is to test</p>
<p>This is to test. This is to test</p>
This is to test.
this is to test
</output>

That is, I want to make each <p>...</p> a single line, i.e. delete the carriage returns only inside <p>...</p>. My code below does the job, but only for the last <p>...</p>. I don't know how to loop it here. Please suggest.

$infile = $ARGV[0];

open(IN, '<', "temp.in") || die "\nCan't open temp.in \n";
open(OUT, '>', "temp.out");
$/="";
while(<IN>)
{
    if($_=~s/(.*)<p>(.*)<\/p>(.*)//ms)
    {
        $pre =  $1;
        $par =  $2;
        $pos =  $3;

        $par=~s#\n# #ig;
        print OUT "$pre<p>$par</p>$pos";
    }
}
close(IN);
close(OUT);

Note: Also, please let me know how to include source code on this page. Are there any special tags for that? The code formatting often gets messed up when I post.

(edited by ybiC: Reformatted to avoid lateral scrolling in browser window - balanced <code> tags around example input+output and code)
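
For reference, the code above only fixes the last block because the leading (.*) is greedy and, with /s, swallows everything up to the final <p>; the substitution also runs only once per record. Below is a minimal sketch of one alternative, assuming the whole input fits in memory as a single string. The helper name join_p_lines is hypothetical and not part of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: replace newlines with spaces inside every <p>...</p> block.
sub join_p_lines {
    my ($text) = @_;
    # .*? is non-greedy, so each match stops at the nearest </p>;
    # /s lets . match newlines, /g repeats for every block,
    # /e evaluates the replacement as Perl code.
    $text =~ s{(<p>.*?</p>)}{ (my $par = $1) =~ tr/\n/ /; $par }gse;
    return $text;
}

my $sample = "one\n<p>two\nthree</p>\n<p>four\nfive</p>\nsix\n";
print join_p_lines($sample);
```

Text outside the <p>...</p> blocks is left untouched, so the line breaks in the rest of the input survive.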


Re: Re: Efficiency issues in text parsing by CombatSquirrel (Hermit) on Aug 27, 2003 at 01:04 UTC
Have a look at Writeup Formatting Tips. To your problem: the following program did the trick for me:

#!perl
use strict;
use warnings;

{   # braces for localization of $/
    local $/ = '<p>';       # end of record is now <p>
    print scalar <DATA>;    # first chunk contains everything before first <p> tag, just print
    for (<DATA>) {
        s@([\d\D]*?</p>)@
            my $var = $1;
            $var =~ s!\n! !g;
            $var
        @e;   # substitute newlines by spaces before the closing </p> tag
        print;
    }
}
__DATA__
This is to test. this is to test
<p>This is to test. This is
to test</p>
<p>This is to test. This is
to test</p>
This is to test.
this is to test

Hope this helped.

CombatSquirrel.
Entropy is the tendency of everything going to hell.
Re: Efficiency issues in text parsing by texuser74 (Monk) on Aug 27, 2003 at 06:25 UTC
Your code does the magic. Thank you very much. One small question, though: what does "$var @e;" mean, particularly the "@e"? Can you please suggest a good Perl book that covers this kind of thing? Once again, thanks a lot.
Re: Re: Efficiency issues in text parsing by CombatSquirrel (Hermit) on Aug 27, 2003 at 09:50 UTC
The "@" is just the separator (delimiter) for the regex, which starts with "s@". The "e" is a modifier that specifies that the replacement part should be evaluated as Perl code and the result used as the actual replacement. And since the value of the last statement evaluated is the return value, I just put $var last, because it contains the substitute.

Cheers, CombatSquirrel.
Entropy is the tendency of everything going to hell.
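
To illustrate the two points in the reply above, here is a small sketch that uses "@" as the delimiter together with the /e modifier. The helper name double_numbers and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: with /e the replacement is run as Perl code, and the
# value of its last statement becomes the replacement text.
sub double_numbers {
    my ($text) = @_;
    $text =~ s@(\d+)@ my $n = $1; $n * 2 @ge;   # "@" is only the delimiter here
    return $text;
}

print double_numbers("3 apples and 10 pears"), "\n";   # prints "6 apples and 20 pears"

# Without /e the replacement is an ordinary interpolated string.
(my $plain = "3 apples") =~ s/(\d+)/$1 * 2/;
print "$plain\n";                                       # prints "3 * 2 apples"
```

Any punctuation character can stand in for "@"; s/.../.../e, s!...!...!e and s{...}{...}e behave the same way.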
Multiline Mode by Anonymous Monk on Sep 05, 2003 at 08:40 UTC
Re: Multiline Mode by CombatSquirrel (Hermit) on Sep 06, 2003 at 14:37 UTC