How to compare the data of two files (.xml and .html) using perl(regex)?

flora has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to compare the data of two files (.xml and .html) using perl(regex)? by choroba (Cardinal) on Sep 20, 2014 at 20:09 UTC
Without seeing the code and possibly even the data sample, we can't help you much. But the general advice is: don't use regular expressions to process XML and HTML, use parsers (e.g. XML::LibXML). لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: How to compare the data of two files (.xml and .html) using perl(regex)? by Laurent_R (Canon) on Sep 20, 2014 at 20:15 UTC
Hi flora, if you think about it, you will easily realize that it is almost impossible to answer such a general question without at least some specifics: samples of each type of file, the code that you have been trying and that apparently does not work properly, the logic or rules according to which data in the HTML can be matched with data in the XML files, etc. Update: alright, choroba was a bit faster than me, but we have essentially the same questions.
Re: How to compare the data of two files (.xml and .html) using perl(regex)? by graff (Chancellor) on Sep 21, 2014 at 21:31 UTC
Unlike the other folks who replied above, I ~~tried to extract the~~ seem to be replying to a version of the OP that contains perl code and data that you attempted to include in your post. First: Please update your post and place "code" tags around your sample data and code snippet, like this:

<code>     <- you need to type in these tags exactly as shown
# xml data:
<gname>abc</gname>    <-- then you can type these exactly as shown.
<pname>xyz</pname>
# html data:
<p>ABC</p>
<p><i>xyz</i></p>
</code>

and like this:

<code>
open(F2,"<F2>");
my $xml_list1="(.*)\.html";
# here the data enclosed inside the parentheses also appears when
# printed. I want say the file name is "abc.html" so i want to keep
# "abc" as interchangeable, so that i dont need to write/modify the
# code if any filename other than abc.html occurs.
close F2;
#print $xml_list1."\n";
foreach my $f (@filenames) {
#print $f."\n";
open(F1,"<F1>");
my $data=join("",<F1>);
close F1;
my $filename=substr($f,0,index($f,'.'));
my $xml_list=$filename.".xml";
while($xml_list=~m//ig)
# the code doesn't enter the while/if loop, seems that it finds some
# error in reading the filename $xml_list, but i tried using
# $xml_list1 too... but still the loop doesnt work.
</code>

Now that your data and snippet are more visible to people, I would say that there's still not enough for us to work with. Your sample data is too sparse, the code incomplete and makes no sense, and you don't give a clear explanation of what you're actually trying to accomplish. What should your script produce as output?

You might want to look at this code I posted a while ago, for using XPath expressions on a command line to extract portions of XML: Re: XPath command line utility... -- it might give you a starting point for extracting "gname/pname" elements from your xml file, and might work on the html data also.
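One part of the quoted snippet can be addressed without knowing anything more about the data: deriving `abc.xml` from `abc.html` so that the base name stays interchangeable. What follows is only a minimal sketch, not the OP's actual code. `xml_name_for` and `slurp` are hypothetical helper names, and the file names in the loop are made up for illustration. It uses the three-argument form of `open` with a lexical filehandle, because `open(F1,"<F1>")` tries to read a file literally named `F1>` rather than the file named in `$f`.

```perl
use strict;
use warnings;

# Hypothetical helper: map an HTML file name to the XML file name that
# shares its base name (abc.html -> abc.xml). Returns undef for names
# that do not end in .html or .htm.
sub xml_name_for {
    my ($html_file) = @_;
    my $xml_file = $html_file;
    # Replace only the trailing extension; s/// is false if nothing matched.
    return unless $xml_file =~ s/\.html?\z/.xml/i;
    return $xml_file;
}

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;    # no record separator: the next read returns the whole file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Made-up file names, only to show the mapping.
for my $f ('abc.html', 'reports/x.y.htm', 'notes.txt') {
    my $xml = xml_name_for($f);
    print defined $xml ? "$f -> $xml\n" : "$f skipped (not an HTML name)\n";
}
```

Anchoring the substitution at the end of the name means a file such as `x.y.htm` keeps its full base name, whereas `substr($f,0,index($f,'.'))` cuts at the first dot. In a real script, `slurp($f)` and `slurp(xml_name_for($f))` would supply the two strings to compare.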
Re^2: How to compare the data of two files (.xml and .html) using perl(regex)? by Laurent_R (Canon) on Sep 21, 2014 at 21:56 UTC
Well, graff, unlike the other folks who replied above (including myself), you saw a post that is totally different from what they saw (and from what I saw). There was not a single line of code in the original post, no sample data, no example, almost nothing... The post had just 2 lines. It is good that flora has now provided the information requested, but that information just wasn't there when the "other folks" replied. To flora: nothing against you, you are new here and are really welcome, but when you change your post, it is considered good practice to leave the original content unchanged and add new elements clearly as new stuff, or at least to explain what you have changed in the original content.
Re^3: How to compare the data of two files (.xml and .html) using perl(regex)? by graff (Chancellor) on Sep 22, 2014 at 01:04 UTC
Thanks for clarifying, Laurent_R -- I'm sorry to have ignorantly cast aspersions on my esteemed fellow monks. (I've been away for a while, but I should have known better.)
Re: How to compare the data of two files (.xml and .html) using perl(regex)? by Anonymous Monk on Sep 20, 2014 at 23:56 UTC
Without any code, sample input, or expected output, it's difficult to know exactly what you're looking for - see "I know what I mean. Why don't you?" Perhaps it's something like XML::Diff, XML::SemanticDiff, or xml_grep? Or roll your own with one of the many XML modules out there (XML::LibXML, XML::Twig, ...)
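As a rough idea of what "rolling your own" with XML::LibXML might look like, here is a minimal sketch built on the tiny data sample quoted elsewhere in this thread. Several things in it are assumptions, since the real files and the matching rules were never posted: the `<root>` wrapper around the XML, the `<html><body>` wrapper around the HTML, and the rule that a value "matches" when it equals the text of some `<p>` element, ignoring case.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample data modelled on the snippets in this thread. The wrapper
# elements are assumptions; the real document structure is unknown.
my $xml  = '<root><gname>abc</gname><pname>xyz</pname></root>';
my $html = '<html><body><p>ABC</p><p><i>xyz</i></p></body></html>';

my $xml_doc = XML::LibXML->load_xml(string => $xml);
# recover => 1 asks libxml2 to tolerate the sloppy markup common in HTML.
my $html_doc = XML::LibXML->load_html(string => $html, recover => 1);

# Index the text of every <p>, lower-cased so that "ABC" matches "abc".
# textContent includes the text of nested elements such as <i>.
my %in_html;
$in_html{ lc($_->textContent) } = 1 for $html_doc->findnodes('//p');

# Assumed rule: an XML value matches if it equals some paragraph's text.
my @report;
for my $node ($xml_doc->findnodes('//gname | //pname')) {
    my $text   = $node->textContent;
    my $status = $in_html{ lc $text } ? 'found' : 'missing';
    push @report, join(' ', $node->nodeName, $text, $status);
}
print "$_\n" for @report;
```

Because both files go through a parser, nested markup like `<p><i>xyz</i></p>` needs no special handling, which is the kind of case where a regex-based comparison tends to break.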
Re: How to compare the data of two files (.xml and .html) using perl(regex)? by Anonymous Monk on Sep 22, 2014 at 19:18 UTC
When you edit your node, please do not remove previous content, only append new content at the end with an appropriate "Update" marking. See also "How do I change/delete my post?" Your node is still incomplete as it is missing code (again!), and I cannot fully answer your question because I know I am missing information which I only caught a glimpse of earlier.

"Is there any other way to proceed other than using parsers?"

Theoretically, yes. Practically, no: one should not parse XML with regexes, but always use a parser. If the format of your XML is well known, you have full control over its source, and you know it will never change, then maybe, in some cases, parsing XML with regexes is excusable. Such cases are rare, and in practice parsing XML with regexes is a bad idea. What is the problem with using a parser?