I'm trying to parse a text file and convert it to XML. The .txt file consists of a list of entries, separated by line breaks. So, two sample entries look like this:

Leyson, Captain Burr.  "With or Without Gadgets."  Boys' Life.  Nov 1949.  p. 6.  "An old-timer knew what he had to do in a jam.  He didn't need hundreds of those gadgets to guide him to safety."  @gauge %trivial

"The Battle Against Baldness."  Kiplinger's Personal Finance.  Feb 1949.  "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months.  Barbers pooh-pooh it as a threat, but sales are going strong."  @tool %american

Each entry sits on a single line and contains bibliographic data, a quotation from that source, and two sets of classification tags, primary and secondary: one set written as @tags and the other as %tags.
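To make that layout concrete, here is a minimal sketch of pulling the year and both tag sets out of one entry line. The helper name parse_entry is hypothetical, and the sketch assumes the year is the first standalone four-digit number starting with 18, 19, or 20, and that tags consist of word characters only. The sample line is an abbreviated copy of the second entry above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_entry is a hypothetical helper, not part of the original script.
# Assumptions: the first standalone 4-digit number beginning with 18, 19,
# or 20 is the year, and tags are made of word characters only.
sub parse_entry {
    my ($line) = @_;
    my ($year)   = $line =~ /\b((?:18|19|20)\d\d)\b/;  # undef if no year found
    my @keywords = $line =~ /\@(\w+)/g;                # every @tag on the line
    my @tags     = $line =~ /%(\w+)/g;                 # every %tag on the line
    return { year => $year, keywords => \@keywords, tags => \@tags };
}

# Abbreviated sample entry (quotation shortened for space).
my $sample = q{"The Battle Against Baldness."  Kiplinger's Personal Finance.  Feb 1949.  "A little home hair-cutter gadget."  @tool %american};

my $entry = parse_entry($sample);
print "year: $entry->{year}\n";
print "keywords: @{ $entry->{keywords} }\n";
print "tags: @{ $entry->{tags} }\n";
```

For the sample line this prints year: 1949, keywords: tool, and tags: american. A four-digit number that appears before the real date would be picked up instead, so the year pattern may need tightening for real data.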

The most important information for me to extract from each entry is year and tags. So, I came up with the following script:

#!/usr/bin/perl -w

my $year = ""; 

while (<>) {
    chomp;
    if ($_ eq "") {next;}
    elsif ($_ =~ /^\d\d\d\d$/) {
        $_ = $year;
    }
    else {
        s/\@(\w*)/ <keyword> $1 <\/keyword>/g;
        s/\%(\w*)/ <tag> $1 <\/tag>/g;
        print "<entry>$_ <year> $year </year> </entry>\n";
    }
}

The @tags and %tags are recognized just fine. The problem is that individual entries and years are not picked up. My program doesn't differentiate between entries: I get a single <entry> at the very beginning of the output and a single </entry> at the very end. Similarly, there's only one blank <year></year>, right before </entry>.

I realize there's probably a very simple solution to this, but I'm still at the circumference of a circle, knock-knock-joke stage of perl programming, so your expertise would be very much appreciated. Thanks!
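Two things in the script above look suspect. The assignment $_ = $year runs backwards (it overwrites the line with the empty $year), and the pattern /^\d\d\d\d$/ only matches a line that is nothing but a year, while the samples carry the year inside the entry. The single <entry> wrapping everything could mean the file uses CR-only line endings, so the whole file is read as one line; that is a guess, not something the post confirms. A minimal sketch under those assumptions follows. The helper names are hypothetical, the year pattern assumes the first standalone four-digit number starting with 18, 19, or 20, and the sample data is abbreviated from the entries above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# entry_to_xml is a hypothetical helper. Assumption: the first standalone
# 4-digit number beginning with 18, 19, or 20 is the publication year.
sub entry_to_xml {
    my ($line) = @_;
    my ($year) = $line =~ /\b((?:18|19|20)\d\d)\b/;
    $year = '' unless defined $year;

    my $body = xml_escape($line);                  # escape before adding markup
    $body =~ s{\@(\w+)}{<keyword>$1</keyword>}g;   # @tags become <keyword>
    $body =~ s{%(\w+)}{<tag>$1</tag>}g;            # %tags become <tag>
    return "<entry>$body <year>$year</year></entry>";
}

# Split on any line-ending style (CRLF, CR, or LF) and drop blank lines,
# so a file with CR-only endings is not treated as one giant line.
sub split_entries {
    my ($text) = @_;
    return grep { /\S/ } split /\r\n|\r|\n/, $text;
}

# Sample data, abbreviated from the post and joined with CR on purpose.
my $input = join "\r",
    q{Leyson, Captain Burr.  "With or Without Gadgets."  Boys' Life.  Nov 1949.  p. 6.  @gauge %trivial},
    q{},
    q{"The Battle Against Baldness."  Kiplinger's Personal Finance.  Feb 1949.  @tool %american};

print entry_to_xml($_), "\n" for split_entries($input);
```

To run this against a real file, the sample string could be replaced by slurping the input, for example my $input = do { local $/; <> };. The result is still a flat list of <entry> elements, so a root element would have to be printed around them to get a well-formed XML document.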

In reply to Converting a Text file to XML by monk8148n038
