in reply to Efficiency issues in text parsing
in thread replacing text in specific tags

Hi, thanks for your comments. I have one more small question.

Here is the input data, followed by the output I want:

<input> This is to test. this is to test <p>This is to test. This is to test</p> <p>This is to test. This is to test</p> This is to test. this is to test </input>
<output> This is to test. this is to test <p>This is to test. This is to test</p> <p>This is to test. This is to test</p> This is to test. this is to test </output>

That is, I want each <p>...</p> to end up on a single line: delete the carriage returns only inside <p>...</p>. My code below does the job, but only for the last <p>...</p>. I don't know how to loop it here. Please suggest.

    $infile = $ARGV[0];
    open(IN, '<', "temp.in") || die "\nCan't open temp.in \n";
    open(OUT, '>', "temp.out");
    $/ = "";
    while (<IN>) {
        if ($_ =~ s/(.*)<p>(.*)\<\/p\>(.*)//ms) {
            $pre = $1;
            $par = $2;
            $pos = $3;
            $par =~ s#\n# #ig;
            print OUT "$pre<p>$par\<\/p\>$pos";
        }
    }
    close(IN);
    close(OUT);

Note: please also let me know how to include source code on this page. Are there any special tags for that? The code formatting often gets messed up when I post.

edited by ybiC: Reformatted - balanced <code> tags around sample input and code

Re: Re: Efficiency issues in text parsing
by CombatSquirrel (Hermit) on Aug 25, 2003 at 08:51 UTC
    Your formatting is pretty much screwed up. To be honest, I don't see what you are saying. Try to fix it and I'll do my best to help you. In the meantime I'd recommend the HTML::TokeParser Tutorial. In general, if you do extensive HTML or XML processing, consider using a module.
    Cheers, CombatSquirrel.

        Have a look at Writeup Formatting Tips.
        To your problem: The following program did the trick for me:
        #!perl
        use strict;
        use warnings;

        {   # braces for localization of $/
            local $/ = '<p>';      # end of record is now <p>
            print scalar <DATA>;   # first chunk contains everything before first <p> tag, just print
            for (<DATA>) {
                s@([\d\D]*?</p>)@
                    my $var = $1;
                    $var =~ s!\n! !g;
                    $var
                @e;   # substitute newlines by spaces before the closing </p> tag
                print;
            }
        }
        __DATA__
        This is to test. this is to test <p>This is to test. This is to test</p> <p>This is to test. This is to test</p> This is to test. this is to test
        Hope this helped.
        CombatSquirrel.
        Entropy is the tendency of everything going to hell.
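
For comparison, here is a minimal sketch of another way to loop over every <p>...</p> block. It is not from either poster. In the original regex the first (.*) is greedy, so with /s it swallows everything up to the last <p>, which would explain why only the final paragraph was handled. A non-greedy match combined with /g and /e visits each block in turn. The helper name join_p_lines and the sample text are hypothetical, and the sketch assumes the whole input fits in one string, that <p> tags carry no attributes and are not nested, and that line breaks are plain \n.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace newlines with spaces
# inside every <p>...</p> block of the given text and return the result.
sub join_p_lines {
    my ($text) = @_;
    # .*? is non-greedy, so each match stops at the nearest </p>.
    # /s lets . match newlines, /g repeats for every block,
    # /e evaluates the replacement as Perl code.
    $text =~ s{(<p>.*?</p>)}{ (my $block = $1) =~ tr/\n/ /; $block }gse;
    return $text;
}

# Made-up sample input with line breaks inside and outside the <p> blocks.
my $input = "intro line\n<p>first\nparagraph</p>\n<p>second\nparagraph</p>\noutro line\n";
print join_p_lines($input);
```

Line breaks outside the <p> blocks are left alone. For anything beyond simple, well-behaved markup, a parser module such as the HTML::TokeParser recommended above is likely the safer choice.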