Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Amen grinder++. I whole heartedly agree.

I came to much the same conclusion in Being a heretic and going against the party line. after having only been using perl for a relatively short time. My experiences since have done little to change my mind.

Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.

The dictionary definition of parse is

  1. To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
  2. To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
    1. To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
    2. To make sense of; comprehend: I simply couldn't parse what you just said.

Whilst the dictionary definition of extract is:

  1. To draw or pull out, often with great force or effort: <cite>extract a wisdom tooth; used tweezers to extract the splinter.</cite>
  2. To obtain despite resistance: <cite>extract a promise.</cite>
  3. To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
  4. To remove for separate consideration or publication; excerpt.
    1. To derive or obtain (information, for example) from a source.
    2. To deduce (a principle or doctrine); construe (a meaning).
    3. To derive (pleasure or comfort) from an experience.
  5. Mathematics. To determine or calculate (the root of a number).

From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse, otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.

My final, and strongest argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then noone would expect me to find and use a module that understood those languages, just to extract the bits I needed.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Scraping HTML: orthodoxy and reality by BrowserUk
in thread Scraping HTML: orthodoxy and reality by grinder

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (9)
As of 2024-04-18 13:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found