Re: changing format of the first word of every line in an HTML doc
by ELISHEVA (Prior) on Jan 12, 2011 at 16:39 UTC
You are doing this the hard way! Parsing an HTML document with regular expressions is almost certainly doomed to bugginess, an abundance of edge cases, and inconsistent results. Even simple documents can trip one up, and MS Word documents dumped to HTML hardly qualify as simple. Please consider using an HTML parser package instead. It will get the interpretation of tags nested within tags right.
Some modules to look at: HTML::Parser, and more generally, search CPAN for HTML parsing tools.
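To illustrate the parser-based approach, here is a minimal sketch. It is written in Python with the standard library's html.parser rather than HTML::Parser, and it assumes that each "line" of the document is a <p> element, which may not match what Word actually emits. The names FirstWordBolder and bold_first_words are made up for this example.

```python
import re
from html.parser import HTMLParser


class FirstWordBolder(HTMLParser):
    """Re-emit HTML, wrapping the first word of each <p> in <b>...</b>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.pending = False  # True while a <p> still awaits its first word

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.pending = True
        # Reproduce the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.pending = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.pending:
            # Leading whitespace, first word, rest of the text chunk.
            m = re.match(r"(\s*)(\S+)(.*)", data, re.S)
            if m:
                data = f"{m.group(1)}<b>{m.group(2)}</b>{m.group(3)}"
                self.pending = False
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def bold_first_words(html):
    parser = FirstWordBolder()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(bold_first_words('<p><span class="x">Hello world</span> again</p>'))
```

Because the parser tracks the tags for you, the first word is found even when it sits inside nested markup such as a <span>. One known limitation of this sketch: a first word containing an entity (for example AT&amp;T) arrives in several pieces, so only the part before the entity gets wrapped.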
| [reply] |
Re: changing format of the first word of every line in an HTML doc
by Your Mother (Archbishop) on Jan 12, 2011 at 17:28 UTC
In addition to what ELISHEVA said about parsing, I'd offer that you're making a simple manual task into a more difficult manual task with an intermediate programming loop.
Double clicking on the first word in the Word doc and hitting Ctrl+b will do what you say you want and without the runaround of cutting, pasting, parsing, loading, copying, pasting. So, I'm wondering if this is really what you want (emphasize the first word of the doc) or you actually have a different requirement for which this is a stand-in.
If your requirement is real, I'd suggest doing something with OpenOffice or RTF instead so you could remove (all but one of) the manual steps; you'd probably have to export to RTF first or import to OO. Both could be scripted completely and Word will open either if saved correctly. This isn't really trivial though. Ctrl+b is, if a bit repetitive. :)
| [reply] |
Thanks for the reply. :) I've decided to write a program since the actual document is long, so manually selecting text and pressing Ctrl-b is really too cumbersome. I initially considered using RTF (I'm using OO to open the document, btw), but the formatting codes in the resulting file are much more confusing to write a parser for than the HTML code.
| [reply] |
Re: changing format of the first word of every line in an HTML doc
by cormanaz (Deacon) on Jan 12, 2011 at 20:04 UTC
If the objective is to wind up with a modified Word document, you might consider using Win32::OLE to manipulate the Word doc directly. This will give you an idea of how you can access the Word API via Perl. You would have to do some searching around and/or read up on the Word API to figure out how to do exactly what you want. It's a bit of a learning curve but worth it if you want to do this sort of thing often.
Good luck... Steve
| [reply] |
Re: changing format of the first word of every line in an HTML doc
by locked_user sundialsvc4 (Abbot) on Jan 13, 2011 at 01:30 UTC
A good general approach would be to save the document as .DOCX, which is a ZIP archive of XML files. Then, use an XML parsing package.
In particular, you want to look at “XPath expressions,” which allow you to traverse the entire XML data structure looking for what you need to find ... without writing the traversal code yourself.
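As a rough sketch of that idea, the following Python (standard library only) opens a .docx as a ZIP archive, parses word/document.xml, and uses XPath-style queries to pull the first word of each paragraph. The function name first_words is invented for this example, and treating each w:p paragraph as one "line" is an assumption. ElementTree supports only a subset of XPath, which happens to be enough here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r and w:t.
W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def first_words(docx):
    """Return the first word of each non-empty paragraph in a .docx file.

    `docx` may be a path or a file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    words = []
    for para in root.iterfind(".//w:p", W_NS):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", W_NS))
        parts = text.split()
        if parts:
            words.append(parts[0])
    return words
```

This only locates the words. Actually bolding them would mean splitting the first run (w:r) of each paragraph and giving the new run a bold run property, then writing the modified XML back into the archive, which is not shown here.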
| |
Thanks for this suggestion, sundial. =) This approach looks promising. I've done what you suggested: I saved the document as .DOCX and extracted document.xml from the compressed archive using a ZIP extraction tool. From a quick analysis of the XML file, I can identify where the first words of every line are. Can you recommend (free) software that would let me examine and modify the node structure, attributes, and elements of an XML file?
Thanks so much!
| [reply] |
Re: changing format of the first word of every line in an HTML doc
by ww (Archbishop) on Jan 12, 2011 at 20:55 UTC
And if you're planning to use M$Word to save the doc as .html
DON'T!
Saving as HTML in Microsoft Word produces truly and outlandishly bad (unnecessarily verbose, obscure, sometimes ill-formed) CSS.
| [reply] |
Don't you worry. I won't. Thanks for the advice. =)
| [reply] |
Re: changing format of the first word of every line in an HTML doc
by ambrus (Abbot) on Jan 13, 2011 at 09:29 UTC
Alternatively, you may try to use Winword itself to make these changes, with some macros or similar. After all, it already has commands for moving by paragraphs and words.
You might even just be able to record a macro with the macro recorder that
- Moves to the beginning of the next paragraph
- Selects one word to the right
- Changes the formatting
and then just execute that macro in a loop.
| [reply] |
Sorry if it's too much to ask, but do you happen to have macro code for doing this (or any resource I can read to learn this)? =) I've browsed the default macro list that comes with my MS Word 2007 but I can't find anything that suggests selecting a portion of text from a given line. Besides, I'm not familiar with the VB API for automating MS Office software. Thanks in advance!
| [reply] |
There's a macro recorder that lets you perform actions just as you normally would, and writes these actions to a macro as readable statements. You start recording a macro with it, then hit control-downarrow to go to the beginning of the next paragraph, then hit control-shift-rightarrow to select a word, then apply some formatting, then stop the macro recording. Play the macro a few times to make sure it works. Then open the macro for editing and try adding a loop around it.
| [reply] |