Since you give neither an example of the original text nor the result you would expect for it, it is difficult to give you a precise answer.
A few ideas though:
- if your input is plain text, not XML, use either Parse::RecDescent or just plain regexps, coupled with XML::Writer or a SAX writer (XML::Handler::YAWriter or XML::SAX::Writer),
- if your input is already XML, then you can use a SAX filter to isolate the bits you want to process further, and treat them as plain text, printing the result or emitting SAX events on it,
- if you know a bit about SGML, you could try writing an SGML DTD and taking advantage of the minimization features (where the parser infers the markup from the DTD, using the DTD structure and, for example, line returns as element delimiters) to see whether your text is already a valid SGML document, or can easily be made into one. Going from SGML to XML is then as simple as using sx (also called osx in some Linux distributions).
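As a rough illustration of the first idea (plain-text input, regexps plus XML::Writer), here is a minimal sketch. The input format (one item per line, blank lines ignored) and the helper name lines_to_list_xml are assumptions made for the example, not something taken from your data; the list and listitem element names are borrowed from the XML::Twig example further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helper: turns plain text (assumed: one item per line)
# into a <list> of <listitem> elements and returns the XML as a string.
sub lines_to_list_xml
  { my( $text)= @_;
    my $xml= '';
    # writing to a string reference captures the generated XML
    my $writer= XML::Writer->new( OUTPUT      => \$xml,
                                  DATA_MODE   => 1,
                                  DATA_INDENT => 2,
                                );
    $writer->startTag( 'list');
    foreach my $line (split /\n/, $text)
      { next unless $line=~ /\S/;       # skip empty lines
        $line=~ s/^\s+|\s+$//g;         # trim surrounding whitespace
        # dataElement takes care of escaping & and < in the text
        $writer->dataElement( listitem => $line);
      }
    $writer->endTag( 'list');
    $writer->end;
    return $xml;
  }

print lines_to_list_xml( "first item\nsecond & third item\n\nfourth item\n");
```

The point of going through XML::Writer rather than printing the tags yourself is that it escapes special characters and checks that the tags are properly nested.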
And finally, because no post of mine is complete without a shameless XML::Twig plug: if all you want is to wrap the lines of a list in the appropriate tags, then you can use something like this:
#!/usr/bin/perl -w
use strict;
use XML::Twig;

XML::Twig->new( # process just list elements
                twig_roots => { list => \&process_list },
                # output the rest as is
                twig_print_outside_roots => 1,
              )
         ->parse( \*DATA);

sub process_list
  { my( $t, $list)= @_;
    # wrap (non-empty) lines in a listitem element
    my @listitems= $list->split( qr/(^.+)\n/m => 'listitem');
    # add the p and extract tags within each listitem
    foreach my $listitem (@listitems)
      { $listitem->insert( 'p', 'extract'); }
    $list->print;
  }

__DATA__
<extract>
<line>
<p>
<extract>
<show>
<list>
first item
second item
third item
</list>
</show>
</extract>
</p>
</line>
</extract>