catch22 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
Hoping you might assist with code that would allow me to parse a text file.
Basically, I would like to delete all text in a file that lies between two specific points in the file. These points occur numerous times. Specifically:
1. Starting at text that reads (without the quotes): "<?xml version="1.0" encoding="UTF-8"?>"
2. Ending at the HTML that closes a table: </table>. This is the first </table> after my start point and is always preceded by this HTML:
5 tabs </b>
4 tabs </div>
3 tabs </td>
2 tabs </tr>
2 tabs </table>
Then the file continues on to text I would like to keep and tables in the text that I also do not want deleted.

Basically, it looks like this: (start point text, <?xml version="1.0" encoding="UTF-8"?>) - text to delete - (end point text, </table>) - text to keep - then the pattern repeats with the start point text again.

Was hoping to run this from the command line (#!/usr/bin/perl) and specify the input and output file.

FYI, background: I would describe myself as a computer enthusiast (good with Linux, HTML, PHP, MySQL...), but I have only recently begun dabbling in Perl. I have been scouring the internet for code snippets to cobble together, and perlmonks.org for some knowledge for this project, but have been unsuccessful.

Any assistance is greatly appreciated.

Replies are listed 'Best First'.
Re: file parsing
by Tanktalus (Canon) on Jul 11, 2008 at 03:25 UTC

    Well, dragonchild suggests HTML::Parser because it sounds like HTML. However, with that <?xml version... at the beginning, it sounds like it's actually XML (or, more specifically, probably XHTML). While HTML::Parser would probably do the job, so would XML::Twig. I suspect, though I've never used HTML::Parser, that XML::Twig is probably a bit more difficult to learn. That said, it is a tool in my toolbox that I keep coming back to, time and again. I can use it for general XML as well as for XHTML processing, and I do both. It's nice to have a general tool that is so usable and so reliable. Any excuse to learn this module is likely to come back and pay for itself many times over.

    I'm not saying that learning HTML::Parser is a waste of time. Only that XML::Twig is more general and so probably even more worth the time investment.

Re: file parsing
by martin (Friar) on Jul 11, 2008 at 05:04 UTC
    While XML parsing might help a lot if you wanted to process the structure of the file, weeding out stuff between certain markers is a lot easier. I see a classical use case for the flip-flop operator here. You want to copy input to output, skipping some of the input. Seeing certain patterns should switch between the two modes. Perl can do that like this:
    #!/usr/bin/perl
    while (<>) {
        next if m{^\Q<?xml version="1.0" encoding="UTF-8"?>} ... m{</table>};
        print;
    }
    Note the use of \Q to protect a literal string in a regex. The three dots are not something to fill in but precisely three dots here. This is (in scalar context) a flip-flop operator, which evaluates either the left or the right expression according to its state. To catch beginning and end markers on the same line, you'd have to replace the three dots with two dots. Also look up the -n and -i switches in perlrun if you'd like to edit your files in place with a single short command line.
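Since the original question asked for explicit input and output files, the same flip-flop idea can be wrapped in a small script. This is only a minimal sketch: the helper name strip_blocks and the file names in the usage comment are assumptions made for illustration, and it presumes, as in the description above, that the XML declaration starts a line and that the closing </table> shares its line with nothing worth keeping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): copy lines from $in to
# $out, skipping every block that starts at a line beginning with the XML
# declaration and ends at the next later line containing </table>.
# The flip-flop keeps its state between calls, so the input is expected
# to end outside of a block.
sub strip_blocks {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        next if $line =~ m{^\Q<?xml version="1.0" encoding="UTF-8"?>\E}
            ... $line =~ m{</table>};
        print {$out} $line;
    }
    return;
}

# Usage (file names are placeholders): perl strip.pl input.html output.html
my ($infile, $outfile) = @ARGV;
if (defined $outfile) {
    open my $in,  '<', $infile  or die "Cannot read $infile: $!\n";
    open my $out, '>', $outfile or die "Cannot write $outfile: $!\n";
    strip_blocks($in, $out);
    close $out or die "Cannot close $outfile: $!\n";
}
else {
    warn "usage: $0 INPUT OUTPUT\n";
}
```

Because whole lines are skipped, anything sharing a line with either marker is dropped along with it; the lines in between the blocks are printed exactly as read.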
Re: file parsing
by dragonchild (Archbishop) on Jul 11, 2008 at 02:56 UTC
    In other words, you want to parse HTML. HTML::Parser is probably a good place to start.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: file parsing
by wfsp (Abbot) on Jul 11, 2008 at 12:16 UTC
    If you could show us a cut down example of your file and examples of what you need to extract it would be easier to give you a pointer.

    (and put it between <code>..</code> tags)

      Thank you for the replies monks! Here is a cut down example in response to wfsp's post.
      The code to remove begins here:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
      <head>
      <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

      And it continues with a bunch of code I would like to remove. My end marker for removal would be here:
      </b>
      </div>
      </td>
      </tr>
      </table>
      This bunch of code repeats numerous times in the file.
      After this code-to-remove, I have the code I would like to keep, tags and all, untouched by any parsing or modification.

      After this code-to-keep, begins the cycle of code-to-remove again, as above:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
      <head>
      <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

      and so on...
      I was reading up on and trying some code found on the internet and in this thread (thank you, Martin, for your post), but I can't seem to leave the code-to-keep untouched by the parser.

      Thank you
        Could still do with some more info. :-)

        Is there a pattern in what you want to keep? It would be easier that way round. Also in your "end marker" there is a closing div and a closing table. What do the opening tags look like? Are there any identifiable attributes?
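Going only by the two markers described so far, and without knowing more about the text to keep, one alternative to a line-by-line approach is to read the whole file into memory and cut out each span with a single non-greedy substitution. That leaves text outside the spans untouched even if it shares a line with a marker. This is a rough sketch under those assumptions: remove_spans is a hypothetical name, the file names are placeholders, and the file is assumed to fit comfortably in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): return $text with every
# span removed that runs from the XML declaration through the first
# </table> that follows it. Trailing spaces or tabs and one newline after
# the </table> are removed as well, so no empty line is left behind.
sub remove_spans {
    my ($text) = @_;
    $text =~ s{\Q<?xml version="1.0" encoding="UTF-8"?>\E.*?</table>[^\S\n]*\n?}{}gs;
    return $text;
}

# Usage (file names are placeholders): perl cut.pl input.html output.html
my ($infile, $outfile) = @ARGV;
if (defined $outfile) {
    open my $in, '<', $infile or die "Cannot read $infile: $!\n";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;
    open my $out, '>', $outfile or die "Cannot write $outfile: $!\n";
    print {$out} remove_spans($text);
    close $out or die "Cannot close $outfile: $!\n";
}
else {
    warn "usage: $0 INPUT OUTPUT\n";
}
```

The /s modifier lets the dot match newlines, and the non-greedy .*? stops at the first </table> after each declaration, so later tables in the text to keep are not affected.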