sri1230 has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of div tags in a text file, as shown below. What I am trying to do is parse out the different pieces inside each div. My output should be a hash that looks like the OUTPUT below. What's the best way to do this? Can I use a regex, or HTML parsing of some sort? Thanks in advance for your help.

OUTPUT

$VAR1 = {
  'http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html' => [
    'Java Architect - Technisource - Richmond, VA',
    'Mon, 18 Jan 2010 02:17:16 GMT',
    's degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies...'
  ],
  etc
}

INPUT FILE

<div><a class="titlefield" title="Java Architect - Technisource - Richmond, VA" href="http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html">Java Architect - Technisource - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 02:17:16 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... </div>

<div><a class="titlefield" title="Senior Java Developer - Technisource - Richmond, VA" href="http://xxx.com/Senior-Java-Developer-Technisource-Richmond-VA_c21a277e3f459c5334eea3c70a364463.html">Senior Java Developer - Technisource - Richmond, VA</a> <br/><span class="datefield">Thu, 21 Jan 2010 02:06:39 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... architecture experience and Java enterprise knowledge (JEE) - Passion for web technologies and experience... From Technisource - 21 Jan 2010 02:06:39 GMT - save job, email, more... </div>

<div><a class="titlefield" title="Senior Java Developer - TEKsystems - Richmond, VA" href="http://xxx.com/Senior-Java-Developer-TEKsystems-Richmond-VA_4655ae7f56afc20d39daa75b1592767a.html">Senior Java Developer - TEKsystems - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 06:13:02 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> Computer Science, Information Technology, or Business or related work experience/certification. 5+ years of relevant general Information Technology experience... From TEKSystems - 18 Jan 2010 06:13:02 GMT - save job, email, more... </div>

Replies are listed 'Best First'.
Re: HTML Parsing /Regex Qstn
by bobr (Monk) on Jan 21, 2010 at 16:35 UTC
    You can use something like the following code. You need to have HTML::Tree installed.
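    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use Data::Dump qw{dump};

    my $tree = HTML::TreeBuilder->new_from_file("your_html_file");
    my %output = ();

    # Each listing lives in its own <div>
    for my $div ($tree->find("div")) {
        # The anchor carries both the hash key (href) and the title
        if (my $titlefield = $div->look_down(class => "titlefield")) {
            my $href = $titlefield->attr("href");
            $output{$href} = [$titlefield->attr("title")];

            # The date is optional, so default to an empty string
            my $date = "";
            if (my $datefield = $div->look_down(class => "datefield")) {
                $date = $datefield->as_text();
            }
            push @{$output{$href}}, $date;
            # ...
        }
    }

    # Pass a reference so the braces cannot be mistaken for a block
    dump \%output;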
    -- Roman
      Thanks, Roman. I will make sure I use the code tags next time. Thanks again to all of you!

        Yes, do keep <code>...</code> tags in mind for next time... but there is absolutely no reason you can't go back right now and edit your prior post.

        HTH,

        planetscape
        Roman - One more question. How do I get the content directly inside the div tag, i.e. the stuff that isn't in any of those inner tags?
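
        The thread does not record an answer to this follow-up, so here is one possible approach, offered as a sketch rather than a definitive solution. It relies on HTML::Element's content_list, which returns child elements as objects and text segments as plain strings. The helper name direct_text and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return only the text that sits directly inside
# an element, skipping anything wrapped in child tags such as <a> or <span>.
sub direct_text {
    my ($element) = @_;

    # content_list mixes element objects and plain strings;
    # the plain (non-reference) items are the text nodes.
    my $text = join ' ', grep { !ref } $element->content_list;

    $text =~ s/^\s+|\s+$//g;    # trim both ends
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    return $text;
}

# Sample markup modelled on the input file (an assumption, not real data)
my $sample_html =
      '<div><a class="titlefield" href="http://example.com/job.html">Some Job</a> <br/>'
    . '<span class="datefield">Mon, 18 Jan 2010 02:17:16 GMT</span> '
    . '<span class="labelfield">[Listed at Indeed.com]</span> <br/>'
    . ' Job description text... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... </div>';

my $sample_tree = HTML::TreeBuilder->new_from_content($sample_html);
for my $div ($sample_tree->find('div')) {
    print direct_text($div), "\n";
}
$sample_tree->delete;    # free the tree explicitly
```

        Inside the loop from the code above, something like push @{$output{$href}}, direct_text($div); would add that text as the third array element.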
Re: HTML Parsing /Regex Qstn
by Utilitarian (Vicar) on Jan 21, 2010 at 15:06 UTC
    +1 to what anon said above. We don't know whether you borked the input data while adding it, but assuming it's HTML in the file...

    State your problem in simple terms
    From an HTML fragment, you want to extract the href of an anchor tag and use it as the key of a hash. The value is an array holding the body of the anchor tag, the date from the last line of the div, and the text after the line break up to the /From $source - $date/ marker.

    My advice
    When attempting to parse HTML, you should rely on an existing module rather than a "bodge it yourself" regex in all but the most trivial cases. The HTML::Parse module will allow you to isolate the data; after that it becomes trivial to grab what you want from each div.
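
    To make that advice concrete, here is a minimal sketch of the whole extraction. It uses HTML::TreeBuilder (the module used in the other reply) rather than HTML::Parse, and it assumes every description ends at a "From <source> - <day> <month> <year> ..." footer, as in the three sample divs. The function name parse_listings is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper;

# Hypothetical helper: build { href => [ title, date, description ] }
# from a string of HTML shaped like the input file.
sub parse_listings {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %jobs;

    for my $div ($tree->find('div')) {
        # Skip any div that does not hold a job link
        my $link = $div->look_down(_tag => 'a', class => 'titlefield') or next;

        my $date_el = $div->look_down(_tag => 'span', class => 'datefield');
        my $date    = $date_el ? $date_el->as_text : '';

        # Text sitting directly in the div: content_list returns
        # plain strings for text and objects for child elements.
        my $description = join ' ', grep { !ref } $div->content_list;

        # Assumption: the footer looks like "From Dice - 18 Jan 2010 ...".
        # The greedy (.*) keeps everything before the last such footer.
        $description =~ s/^(.*)\s+From\s+.+?\s-\s\d{1,2}\s\w{3}\s\d{4}\s.*\z/$1/s;
        $description =~ s/\s+/ /g;         # collapse whitespace
        $description =~ s/^\s+|\s+$//g;    # trim both ends

        $jobs{ $link->attr('href') } = [ $link->as_text, $date, $description ];
    }

    $tree->delete;
    return \%jobs;
}

# Demo on the first div from the question
my $input = <<'HTML';
<div><a class="titlefield" title="Java Architect - Technisource - Richmond, VA" href="http://xxx.com/Java-Architect-Technisource-Richmond-VA_0ad856f5dc92e277ee34526dc9d3b973.html">Java Architect - Technisource - Richmond, VA</a> <br/><span class="datefield">Mon, 18 Jan 2010 02:17:16 GMT</span> <span class="labelfield">[Listed at Indeed.com]</span> <br/> s degree in Information Technology or related systems... technology standards Ability to keep up with industry trends, relevant system development technologies... From Dice - 18 Jan 2010 02:17:16 GMT - save job, email, more... </div>
HTML

print Dumper(parse_listings($input));
```

    The footer pattern is the fragile part: if the site changes its "From ..." line, the description is left untrimmed instead of being cut in the wrong place.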

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: HTML Parsing /Regex Qstn
by Anonymous Monk on Jan 21, 2010 at 14:44 UTC
    Why aren't you putting your code and data into <CODE></CODE> tags?