Hi predrag,

It sounds like the trickiest part of your current solution is figuring out whether you're inside HTML markup or inside text, since tags obviously shouldn't be converted to Cyrillic. Unfortunately, parsing HTML correctly is a pretty difficult task (a humorous post about the topic), so I'd like to encourage you to take another look at one of the parser modules.

Two classic modules are HTML::Parser and HTML::TreeBuilder, but there are several others, such as Mojo::DOM. If the input is always XHTML, there's XML::Twig and many more XML-based modules. These modules generally break the HTML down into its structure: elements (<tags>) with their attributes, comments, and text. Some of the modules then represent the HTML as a Document Object Model (DOM), which is also worth reading a little about. It sounds like you only want to operate on text, and maybe on some elements' attributes (such as title="..." attributes).
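To give a rough idea of what the DOM approach can look like, here is a minimal sketch using Mojo::DOM. The sample HTML and the s/foo/bar/g substitution are just placeholders I made up for illustration; your real conversion would go in their place. It walks all nodes of the document and only touches those of type "text":

```perl
use warnings;
use strict;

use Mojo::DOM;

# Made-up sample input; in practice you'd read this from your file
my $dom = Mojo::DOM->new(
    '<p title="foo">foo <b>foo</b><script>var foo;</script></p>');

$dom->descendant_nodes
    ->grep(sub { $_->type eq 'text' })   # only text nodes
    ->each(sub {
        my $text = $_->content;
        # ### Your filter here (placeholder substitution) ###
        $text =~ s/foo/bar/g;
        $_->content($text);
    });

print "$dom\n";
```

As far as I know, Mojo::DOM gives the contents of script and style elements the node type "raw" rather than "text", so they should be left alone here, as is the title attribute.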

Operating only on text is relatively easy: for example, in an HTML::Parser solution, you could register a handler on the text event, which does the appropriate conversions, and register a default handler which just outputs everything else unchanged:

use warnings;
use strict;

use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    unbroken_text => 1 );
$p->handler(text => sub {
    my ($text) = @_;
    # ### Your filter here ###
    $text=~s/foo/bar/g;
    print $text;
}, 'text');
$p->handler(default => sub {
    print shift;
}, 'text');

my $infile = '/tmp/in.html';
my $outfile = '/tmp/out.html';

open my $out, '>', $outfile
    or die "open $outfile: $!";
# "select" redirects the "print"s
my $previous = select $out;
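$p->parse_file($infile)
    or die "parse $infile: $!";
# restore the previous default output handle before closing
select $previous;
close $out
    or die "close $outfile: $!";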
print "$infile -> $outfile\n";

Operating on attributes will require you to handle start tags (the opening tags of elements) as well, since that is where the attributes live. Note also that the same basic principle I described above applies to the other modules: they all break the HTML down into its components, so that you can operate on only the textual parts, leaving the others unchanged.
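Here is a rough sketch of how the attribute handling might look with HTML::Parser. The convert() function, the sample input, and the choice of only touching title attributes are my assumptions for the sake of the example. Tags without a title attribute are passed through exactly as written; tags with one are rebuilt, which means quoting and attribute-name case may come out normalized:

```perl
use warnings;
use strict;

use HTML::Parser;
use HTML::Entities qw(encode_entities);

# Hypothetical stand-in for your real conversion
sub convert { my ($s) = @_; $s =~ s/foo/bar/g; return $s }

my $out = '';
my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1 );
$p->handler(text => sub { $out .= convert(shift) }, 'text');
$p->handler(start => sub {
    my ($tagname, $attr, $attrseq, $text) = @_;
    # No title attribute: output the tag exactly as it was
    if (!exists $attr->{title}) { $out .= $text; return }
    $attr->{title} = convert($attr->{title});
    # Rebuild the tag. Attribute values arrive entity-decoded,
    # so they need to be encoded again on the way out.
    $out .= "<$tagname";
    for my $name (@$attrseq) {
        # by default a trailing "/" (as in <br />) shows up
        # as an attribute named "/"
        if ($name eq '/') { $out .= ' /'; next }
        $out .= sprintf ' %s="%s"', $name,
            encode_entities($attr->{$name}, '<>&"');
    }
    $out .= '>';
}, 'tagname, attr, attrseq, text');
$p->handler(default => sub { $out .= shift }, 'text');

# Made-up sample input
$p->parse('<p title="foo &amp; foo">foo <b class="foo">foo</b></p>');
$p->eof;
print "$out\n";
```

Note that the class="foo" attribute is deliberately left unchanged, since only title and the text are converted. Collecting the output in a string instead of printing it is just to keep the example self-contained; the select approach from above works here too.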

BTW, have you seen Lingua::Translit?

Hope this helps,
-- Hauke D

In reply to Re^3: Begginer's question: If loops one after the other. Is that code correct? by haukex
in thread Begginer's question: If loops one after the other. Is that code correct? by predrag
