PerlMonks  

html parse - concatenation

by Buckaroo Buddha (Scribe)
on Jun 05, 2000 at 18:12 UTC ( [id://16408]=perlquestion )

Buckaroo Buddha has asked for the wisdom of the Perl Monks concerning the following question:

Someone I know maintains a FAQ spread over multiple pages. Does anyone know of a Perl script that will allow him to "concatenate" all of them into one (working) page?

(Rather than just take what's given to me, I'll also write one off the top of my head. Please critique it or write a better one.)

# script requires an index file (pagelist.txt) with each
# html filename to be concatenated in it,
# one html filename per line
open(EACH_HTML_FILENAME, '<', 'pagelist.txt') || die "can't read pagelist.txt: $!";
open(UNIFIED_FILE, '>>', 'output.html') || die "can't append to output.html: $!";

# i could've used 'getopt' or @ARGV but this
# is being written with brevity in mind ...
# another thing i'd like to try to learn is
# automatically getting a directory list, parsing out
# each filename with the extensions in @ARGV and
# concatenating them

$first_pass = 1;
while ($thisfile = <EACH_HTML_FILENAME>) {
    chomp($thisfile);    # chomp edits in place; its return value is not the string
    open(THIS_FILE, '<', $thisfile) || die "can't read $thisfile: $!";

    # only the first file contributes the lines before its <body> tag
    $body_start = $first_pass;
    while (<THIS_FILE>) {
        if (!$body_start) {
            $body_start = 1 if /<body/i;
        }
        elsif (/<\/body>/i) {
            last;    # stop reading this file at its closing body tag
        }
        else {
            print {UNIFIED_FILE} $_;
        }
    } # END -> while (<THIS_FILE>)
    close(THIS_FILE);
    $first_pass = 0;
} # END -> while ($thisfile = <EACH_HTML_FILENAME>)
print {UNIFIED_FILE} "</body>\n";

Anyways ... that's it, off the top of my head. I think some criticism would help my coding skills, and this utility (if it works) would help this guy I know. (Actually, it would also help me, because I want to be able to download a copy of his FAQ on one page ;)
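
The comment in the script about building the file list from a directory could be sketched roughly as below. This is only an illustration under assumptions that are not in the post: the helper name files_with_extensions is made up, the directory defaults to the current one, and the pages are taken in plain name order, which may not be the order the FAQ is meant to be read in.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the plain files
# in $dir whose extension matches one of @exts, case-insensitively,
# sorted by name so the result is predictable.
sub files_with_extensions {
    my ($dir, @exts) = @_;
    my $alternatives = join '|', map { quotemeta($_) } @exts;
    opendir(my $dh, $dir) or die "can't open directory $dir: $!";
    my @names = sort grep { /\.(?:$alternatives)\z/i && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return map { "$dir/$_" } @names;
}

# The extensions come from @ARGV; 'html' and 'htm' are an assumed default.
my @wanted = @ARGV ? @ARGV : ('html', 'htm');
print "$_\n" for files_with_extensions('.', @wanted);
```

Saved under a name of your choosing and run with the extensions as arguments, it prints one path per line, which could be redirected into pagelist.txt.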

Replies are listed 'Best First'.
Re: html parse - concatenation
by swiftone (Curate) on Jun 05, 2000 at 18:31 UTC
    This step may be totally unnecessary. On a UNIX system, you can use the cat command to append all of these files into one file. On a Windows/DOS box, I believe the copy command supports the + feature (i.e. copy file1+file2+file3 file4)

    Another way to write this script would be:

    $flip=1;
    while(<>){
        $flip=0 if /<\/body/i;
        print $_ if $flip;
        $flip=1 if /<body/i;
    }
    print "</body>\n</html>\n";

    This is a quick and dirty script which assumes that the body tags are alone on the line (or at least that there is nothing there that should be preserved in the final version.)

    It would be called like "faqcat.pl file1.html file2.html filen.html > newfile.html"
    (Should work in both UNIX and Win/DOS)

    The next step is making sure that any internal links work correctly. If the files are merely text (i.e. non-linking HTML) this will be enough, but they probably have internal links to a table of contents and to different sections. I can't say how to correct those, because a lot depends on how the files are written. If the syntax is simple, you can do it with regexes; if it is more complex, you can do it with HTML::Parser.
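
    For the simple case, a regex fix might look something like the sketch below. It rests on assumptions about the FAQ that the thread does not confirm: links between pages are written as href="page.html#anchor" with double quotes, anchor names are unique across all pages, and the sub name localize_links and the file names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn href="page.html#anchor" into href="#anchor"
# when page.html is one of the merged files. Links to other files or
# other sites are left untouched.
sub localize_links {
    my ($html, @merged_files) = @_;
    my %merged = map { (lc($_) => 1) } @merged_files;
    $html =~ s{(href\s*=\s*")([^"#]+)(#[^"]*")}
              { $merged{lc($2)} ? "$1$3" : "$1$2$3" }gie;
    return $html;
}

# Made-up sample line: one link into a merged page, one external link.
my $line = '<a href="faq2.html#q7">Q7</a> <a href="http://example.com/x.html#top">x</a>';
print localize_links($line, 'faq1.html', 'faq2.html'), "\n";
```

    Links that name a page without a #anchor are not handled here, since where they should point in the merged page depends on how the FAQ is laid out.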

    Update: Some explanation on how the script works:
    The idea is to skip everything that follows a </body> and comes before the next <body>, while still printing the stuff before the first <body>.
    The <> operator automagically reads the next line from the file(s) named on the command line (or from standard input if none are given). See perlop.
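
    To watch the flag at work without any files on disk, the same logic can be applied to a list of lines. This is just a sketch: the sub name merge_bodies and the two tiny pages are made up, and it carries the same assumption that the body tags sit on lines of their own.

```perl
use strict;
use warnings;

# Same flag logic as the loop over <>, applied to a list of lines.
sub merge_bodies {
    my @lines = @_;
    my @kept;
    my $flip = 1;                          # on at the start, so the first head is kept
    for my $line (@lines) {
        $flip = 0 if $line =~ /<\/body/i;  # switch off at each closing body tag
        push @kept, $line if $flip;
        $flip = 1 if $line =~ /<body/i;    # switch back on after each opening body tag
    }
    return (@kept, "</body>\n", "</html>\n");
}

# Two made-up pages, in the order <> would deliver their lines.
my @pages = (
    "<html><head><title>one</title></head>\n", "<body>\n", "<p>first</p>\n", "</body>\n", "</html>\n",
    "<html><head><title>two</title></head>\n", "<body>\n", "<p>second</p>\n", "</body>\n", "</html>\n",
);
print merge_bodies(@pages);
```

    The output keeps the first page's head and opening body tag, then both paragraphs, then a single closing body and html tag; the second page's head and body tag are dropped.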

Re: html parse - concatenation
by Buckaroo Buddha (Scribe) on Jun 05, 2000 at 21:58 UTC
    i compiled your code (mine was big and clunky) and sent it off to the guy who wrote the faq and got this reply:
    Thanks Alex - works a treat. Go to http://go.to/ka7faq and be the first to download the printer friendly version!

    Paul

    ps. You get a mention in the Whats New page!
    the A-BIT KA7 FAQ

    i wrote him back and asked him to also thank swiftone & www.perlmonks.com ... so hopefully he'll get that up soon

    thanks for the help ... i think that this is really cool

      Aw shucks.

      Glad to be of service.
