HTML Parsing /Regex Qstn

sri1230 has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of div tags in a text file, as shown below. I am trying to parse out the different pieces inside each one. My output should be a hash that looks like the one below. What's the best way to do this? Can I use a regex, or HTML parsing of some sort? Thanks in advance for your help.

OUTPUT VAR1 = { 'http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html' => ['Java Architect - Technisource - Richmond, VA', 'Mon, 18 Jan 2010 02:17:16 GMT', 's degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies...'] etc }

INPUT FILE

<div><a class="titlefield" title="Java Architect - Technisource - Richmond, VA" href="http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html">Java Architect - Technisource - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 02:17:16 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... </div>

<div><a class="titlefield" title="Senior Java Developer - Technisource - Richmond, VA" href="http://xxx.com/Senior-Java-Developer-Technisource-Richmond-VA_c21a277e3f459c5334eea3c70a364463.html">Senior Java Developer - Technisource - Richmond, VA</a> <br/><span class="datefield">Thu, 21 Jan 2010 02:06:39 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... architecture experience and Java enterprise knowledge (JEE) - Passion for web technologies and experience... From Technisource - 21 Jan 2010 02:06:39 GMT - save job, email, more... </div>

<div><a class="titlefield" title="Senior Java Developer - TEKsystems - Richmond, VA" href="http://xxx.com/Senior-Java-Developer-TEKsystems-Richmond-VA_4655ae7f56afc20d39daa75b1592767a.html">Senior Java Developer - TEKsystems - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 06:13:02 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> Computer Science, Information Technology, or Business or related work experience/certification. 5+ years of relevant general Information Technology experience... From TEKSystems - 18 Jan 2010 06:13:02 GMT - save job, email, more... </div>


Replies are listed 'Best First'.
Re: HTML Parsing /Regex Qstn by bobr (Monk) on Jan 21, 2010 at 16:35 UTC
You can use something like the following code. You need to have HTML::Tree installed. `use HTML::TreeBuilder; use Data::Dump qw{dump}; my $tree = HTML::TreeBuilder->new_from_file("your_html_file"); my %output = (); for my $div ($tree->find("div")) { if(my $titlefield = $div->look_down(class => "titlefield")) { my $href = $titlefield->attr("href"); $output{$href} = [$titlefield->attr("title")]; my $date = ""; if(my $datefield = $div->look_down(class => "datefield")) { $date = $datefield->as_text(); } push @{$output{$href}}, $date; # ... } } dump { %output };` -- Roman
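For the third array element (the free text that sits directly inside the div), one possible approach is sketched below. It is only an illustration, not part of Roman's reply: it assumes HTML::Tree is installed, parses an inline string with `new_from_content` instead of a file so that it is self-contained, and the names `%jobs` and `$text` are made up for the example. It relies on `content_list` returning child elements as objects and bare text as plain strings.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input: the first <div> from the question, inlined for the example.
my $html = <<'HTML';
<div><a class="titlefield" title="Java Architect - Technisource - Richmond, VA" href="http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html">Java Architect - Technisource - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 02:17:16 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... </div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my %jobs;    # hypothetical name: href => [ title, date, text ]

for my $div ( $tree->look_down( _tag => 'div' ) ) {
    # Skip any div that has no title link.
    my $link = $div->look_down( _tag => 'a', class => 'titlefield' );
    next unless defined $link;

    my $date = $div->look_down( class => 'datefield' );

    # Child elements are references, bare text nodes are plain strings,
    # so keeping only the non-references yields the div's own text.
    my $text = join ' ', grep { !ref $_ } $div->content_list;
    $text =~ s/\s+/ /g;       # collapse runs of whitespace
    $text =~ s/^ | $//g;      # trim leading and trailing space

    $jobs{ $link->attr('href') } = [
        $link->attr('title'),
        defined $date ? $date->as_text : '',
        $text,
    ];
}

$tree->delete;    # release the parse tree
```

The text collected this way still ends with the "From ... - save job, email, more..." footer, so it would need trimming before it matches the desired output.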
Re^2: HTML Parsing /Regex Qstn by sri1230 (Novice) on Jan 21, 2010 at 16:53 UTC
Thanks, Roman. I will make sure I put the code tags in next time. Thanks again, all of you!
Re^3: HTML Parsing /Regex Qstn by planetscape (Chancellor) on Jan 22, 2010 at 01:02 UTC
Yes, do keep `<code>...</code>` tags in mind for next time... but there is absolutely no reason you can't go back right now and edit your prior post. HTH, planetscape
Re^3: HTML Parsing /Regex Qstn by sri1230 (Novice) on Jan 21, 2010 at 17:06 UTC
Roman, one more question: how do I get the content directly inside the div tag, the stuff that isn't in any of those inner tags?
Re^4: HTML Parsing /Regex Qstn by bobr (Monk) on Jan 21, 2010 at 17:38 UTC
Re^5: HTML Parsing /Regex Qstn by sri1230 (Novice) on Jan 21, 2010 at 21:47 UTC
Re: HTML Parsing /Regex Qstn by Utilitarian (Vicar) on Jan 21, 2010 at 15:06 UTC
+1 to what anon said above. We don't know whether you borked adding the input data or not, but assuming it's HTML in the file...

State your problem in simple terms

From an HTML fragment, you want to extract the href of an anchor tag and use it as the key for a hash, the value being an array of the body of the anchor tag, the date from the last line of the div, and the text after the line break up to the `/From $source - $date/`.

My advice

When attempting to parse HTML, you should rely on an existing module rather than a "bodge it yourself" regex in all but the most trivial cases. The HTML::Parse module will allow you to isolate the data; then it becomes trivial to grab what you want from each div.

`print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
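Once a parser has isolated the free text of a div, cutting it off at the "From $source - $date" marker is one of the small jobs where a regex is reasonable. The following is a rough sketch using core Perl only; the helper name `strip_footer` is hypothetical, and the pattern assumes the footer always has the shape seen in the sample data ("From <source> - <day> <month> <year> <time> GMT - ...").

```perl
use strict;
use warnings;

# Hypothetical helper: remove the trailing "From <source> - <date> GMT - ..."
# footer from text that has already been extracted from a div.
sub strip_footer {
    my ($text) = @_;

    # Assumes the date looks like "18 Jan 2010 02:17:16 GMT", as in the sample.
    $text =~ s/\s+From\s+.+?\s+-\s+\d{1,2}\s+\w{3}\s+\d{4}\s+\d\d:\d\d:\d\d\s+GMT\b.*\z//s;

    # Trim whatever whitespace is left at either end.
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Free text of the first div in the question.
my $raw = ' s degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... ';

my $description = strip_footer($raw);
print "$description\n";
```

If the word "From" followed by a date in that format could also occur inside a job description, the pattern would cut too early, so it is worth checking against real data before relying on it.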
Re: HTML Parsing /Regex Qstn by Anonymous Monk on Jan 21, 2010 at 14:44 UTC
Why aren't you putting your code and data into `<CODE></CODE>` tags?