Hello there, I have to compare HTML strings, and I need to ignore the presence of certain tags, so that two strings that differ only by one of those tags are considered identical. I'm already doing this successfully, but I have a serious performance issue: the script takes 4 hours to complete on a medium-sized file, which is not acceptable. To do this, I have created a sub-routine that returns true (1) if the strings are identical, or false (0) if they are different. It ignores the tags that we do not need to consider:
sub compare {
    # We ignore the standalone, opening and closing tags
    # <ph />, <bpt x="y">, <ept x="z">, <i />
    my ($cString1, $cString2) = @_;    # copies, so the caller's strings stay untouched

    # The optional slash goes after "<" so that closing tags such as </bpt> match too;
    # \b keeps "i" from also matching longer tag names such as <img>.
    s{</?(?:bpt|ept|ph|i)\b[^<>]*>}{}gi for $cString1, $cString2;

    return $cString1 eq $cString2 ? 1 : 0;
}
I use this sub-routine later in my code:
if (compare($unit{$x}{string1}, $string2)) {
    # Do some stuff
}
As we can see, I actually do the substitution on both strings before comparing them, and I suspect that is very heavy, so I would like to simply ignore the patterns instead of removing them. In any case, I do not want to modify any of the strings; this is for comparison only. Is there any way I could improve the performance? Thanks in advance! TA
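If the same strings are compared many times (for example inside nested loops), one possible way to cut the cost is to strip the tags only once per distinct string and reuse the result. The following is a minimal sketch under that assumption, not a measured fix: the names `strip_ignored_tags`, `compare_cached` and `%stripped_cache` are hypothetical, the `//=` operator needs Perl 5.10 or later, and the pattern assumes the optional slash is meant to follow `<` and that `i` should not match longer tag names such as `<img>`.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns a copy of the
# string with the ignorable tags removed. The argument itself is not modified.
sub strip_ignored_tags {
    my ($string) = @_;
    $string =~ s{</?(?:bpt|ept|ph|i)\b[^<>]*>}{}gi;
    return $string;
}

# Cache of already-stripped strings, keyed by the original string, so the
# substitution runs at most once per distinct string.
my %stripped_cache;

# Returns 1 if the two strings are equal once the ignorable tags are
# disregarded, 0 otherwise.
sub compare_cached {
    my ($cString1, $cString2) = @_;
    for my $string ($cString1, $cString2) {
        $stripped_cache{$string} //= strip_ignored_tags($string);
    }
    return $stripped_cache{$cString1} eq $stripped_cache{$cString2} ? 1 : 0;
}

# Usage: the two strings differ only by ignorable tags.
print compare_cached('Hello <ph id="1"/>world', 'Hello <i>world</i>'), "\n";
```

The trade-off is memory: the cache keeps one stripped copy per distinct input string, so it only helps when strings repeat across comparisons. If each string is compared exactly once, the substitution cost stays the same.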

In reply to Ignoring patterns when comparing strings by TravelAddict
