Find and replace

dvdauthority has asked for the wisdom of the Perl Monks concerning the following question:

(jeffa) Re: Find and replace by jeffa (Bishop) on Oct 18, 2001 at 19:27 UTC
Yes, Perl is an excellent choice for this problem. You will need to look into one of the many HTML parsers available on CPAN, such as HTML::Parser or my personal fav, HTML::TokeParser. Once you find the right 'keys' to extract the target info, I assume you will want to store the results somewhere, such as a database. For that, you will need to look into the DBI modules on CPAN.

If you are planning on using Perl to _convert_ each entire page into .asp - well, that's going to be a bit tougher. You are going to need some man hours any way you slice it. If the pages have a lot of commonality then the task will be easier, but you still need to plan this one out. My experience with porting scripts and such is that I issue a ton of clever Perl one-liner substitutions, only to find that I still have to make some changes by hand.

Good luck, and keep it legal ;)

jeffa
Re: Find and replace by alien_life_form (Pilgrim) on Oct 18, 2001 at 19:25 UTC
Greetings. You are probably looking for either HTML::Parser or its sibling, HTML::TreeBuilder.

Cheers
alf
Re: Find and replace by Rich36 (Chaplain) on Oct 18, 2001 at 20:02 UTC
What kind of solution are you looking at for interfacing with your data? Like alf and jeffa suggested, HTML::Parser or HTML::TokeParser would probably be the way to go to actually get the data. If you're using some kind of database solution, consider writing the parsed data to a file as columns (space-, tab-, or character-delimited fields) so that you can easily import the data into a table.

Rich36
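A minimal sketch of the delimited-file idea above, using only core Perl. The records, the column names (title, studio, reviewer) and the helper to_tsv_line are made up for illustration; in practice the records would come from whichever parser ends up extracting the reviews.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records, standing in for whatever the parser extracts.
my @reviews = (
    { title => 'Example Title One',  studio => 'Example Studio', reviewer => 'A. Reviewer' },
    { title => "Example\tTitle Two", studio => 'Another Studio', reviewer => 'B. Reviewer' },
);

# Assumed column order for the import file.
my @columns = qw(title studio reviewer);

# Turn one record into a single tab-delimited line.
# Tabs and newlines inside a field are collapsed to a space so they
# cannot be mistaken for delimiters; missing fields become empty strings.
sub to_tsv_line {
    my ($record, @cols) = @_;
    my @fields = map {
        my $value = defined $record->{$_} ? $record->{$_} : '';
        $value =~ s/[\t\r\n]+/ /g;
        $value;
    } @cols;
    return join("\t", @fields);
}

print join("\t", @columns), "\n";    # header row
print to_tsv_line($_, @columns), "\n" for @reviews;
```

Redirect the output to a file and most databases can bulk-import it directly. Tab is used here only because review text is less likely to contain tabs than commas or spaces.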
Re: Find and replace by Anonymous Monk on Oct 18, 2001 at 20:31 UTC
I don't think you'll be needing HTML::Parser: your HTML isn't really clean enough for that, and you don't want to deal with all the nasty font tags. A few patterns should get the job done much faster. I only tested with two reviews; here's what I've got so far:

    # find the place where the features start:
    $_ = <IN> until m/face=arial color=white/;

    # now read the dvd features
    FEATURE: while (<IN>) {
        last FEATURE if m/blcorn.jpg/ or m/td/;
        m/src="(.*).jpg">( )([^<]+)/ and print "$3: $1\n";
    }

    while ($_ = <IN>) {
        last if /<p>/;
    }
    $_ =~ s/.*<b>//;
    $_ =~ s/<.*//;
    print "Title: $_";

    $_ = <IN>;
    s/^\s*//;
    s/<.*//;
    print $_;    # that's the studio

    $_ = <IN>;
    s/^\s*//;
    $_ =~ s/Reviewed by: //;
    $_ =~ s|</font>||;
    $_ =~ s|</p>||;
    print $_;    # that's the reviewer

    # now for the text of the review:
    REVIEW: while ($_ = <IN>) {
        last REVIEW if m/<table/ or m/center/;
        $r .= $_;
    }

    # now some magic to clean up the font-soup:
    $r =~ s|<font[^>]*>||g;
    $r =~ s|</font>||g;

    # now some magic to turn <br> <br> into <p>,
    $r =~ s|<b>\s*<br>|<br><b>|gs;    # extra magic
    $r =~ s|<br>(\s*<br>)+|<p>|gs;

    # now we can recognize the headlines, and turn them into <h2>
    $r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|gs;
    print $r;
    close IN;

Aargh, data munging at its dirtiest. You should probably read davorg's book, Data Munging with Perl, and learn how to do this in a more organized, controllable way.

P.S. if you were serious about giving away dvds:

Brigitte Jellinek
Horus IT GmbH
Jakob Haringer Str. 8
5020 Salzburg
Austria
EUROPE
Re: Re: Find and replace by davorg (Chancellor) on Oct 19, 2001 at 12:56 UTC
There's a section in my book (section 8.2) that explains in detail exactly why you shouldn't parse HTML with regular expressions. HTML::TreeBuilder (which, I think, would be the most useful module for this task) goes to great lengths to try to make sense of invalid HTML. It'll handle just about anything that you throw at it.

--
<http://www.dave.org.uk>
"The first rule of Perl club is you don't talk about Perl club."
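A rough sketch of what the HTML::TreeBuilder approach can look like, assuming the module is installed from CPAN (it is not part of core Perl). The markup below is a made-up fragment with unclosed font and p tags, not the actual review pages, and extract_title is a hypothetical helper that assumes the title is the first bold element on the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, assumed to be installed

# Hypothetical fragment of sloppy markup: unclosed <font> and <p> tags.
my $html = <<'HTML';
<p><font face=arial color=white><b>Example Title</b><br>
Reviewed by: A. Reviewer
<p><font size=2>First paragraph of the review.
HTML

# Return the text of the first <b> element, or undef if there is none.
sub extract_title {
    my ($markup) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($markup);
    my $bold  = $tree->look_down(_tag => 'b');    # first match in scalar context
    my $title = $bold ? $bold->as_text : undef;
    $tree->delete;                                # free the parse tree
    return $title;
}

print "Title: ", extract_title($html), "\n";
```

The point of the sketch is that the lookup is written against the document structure (the first b element) rather than against the exact characters on a line, so stray font tags and missing close tags do not need special handling.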
Re: Re: Find and replace by dvdauthority (Initiate) on Oct 18, 2001 at 22:10 UTC
First off, thanks for all the help/input guys...guess my .asp comment wasn't that bad after all! ha. I do have most of my stuff stored in a SQL database, so I have that base covered. It has links, author, studio and so on. I'll start off by saying that I have no clue how Perl works and I have no experience in it. I am serious about the DVD thing. If this will work, it will save me three or four weeks of hard programming. Drop me a line and hopefully we can see if we can go somewhere from there.

Thanks again...
Matt
matt@dvdauthority.com
Re: Re: Re: Find and replace by alien_life_form (Pilgrim) on Oct 19, 2001 at 13:19 UTC
Greetings. If you really have no Perl clue/experience... you're in for three or four weeks of hard Perl learning, followed by three or four hours of not extremely hard programming. Of course, at the end of the day you'll end up as somebody who knows Perl, as opposed to somebody who knows how to cut and paste.... If you don't have that kind of time, a hired Perl gun may be your next best choice.

As for the issue at hand, I still stand by HTML::TreeBuilder - and not only for aesthetic reasons. But regardless, you should know that, when your review-cleaning code is ready, you could top everything off by using DBI to stuff the cleaned reviews into your SQL database - directly from Perl.

Cheers,
alf