in reply to HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..

I hope I understand you correctly. Here is what I would try:

1) Copy your files to some working directory.

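2) Run perl -i.old -ne 'print if $. == 999; close ARGV if eof' *.html in that directory. That extracts line 999 from each of your files. (Closing ARGV at the end of every file resets the line counter $., which would otherwise keep counting across files.)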

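3) As the files now contain only an HTML fragment, we make them complete HTML documents again like this: perl -i.bak -lne 'print "<html><body>$_</body></html>"' *.html (note the different backup suffix, so that the backups of the original files from step 2 are not overwritten).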

4) You now have a collection of HTML files, each consisting only of the former line 999, that you can parse further with whatever tool you want.
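
Steps 2 and 3 could also be combined in one small script. The following is only a minimal sketch using core Perl: the helper names extract_line and wrap_fragment and the output suffix .line999 are made up for illustration, and unlike the one-liners it writes new files instead of editing in place.

```perl
use strict;
use warnings;

# Return line number $wanted (1-based) of $file, or undef if the file is shorter.
sub extract_line {
    my ($file, $wanted) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $found;
    while (my $line = <$fh>) {
        # $. counts per filehandle, so it restarts for every file opened here
        if ($. == $wanted) {
            $found = $line;
            last;
        }
    }
    close $fh;
    return $found;
}

# Wrap an HTML fragment so that it becomes a complete document again.
sub wrap_fragment {
    my ($fragment) = @_;
    chomp $fragment;
    return "<html><body>$fragment</body></html>\n";
}

# Usage: perl extract999.pl *.html
for my $file (@ARGV) {
    my $line = extract_line($file, 999);
    unless (defined $line) {
        warn "$file has fewer than 999 lines, skipped\n";
        next;
    }
    my $out = "$file.line999";    # hypothetical output name
    open(my $out_fh, '>', $out) or die "Cannot write $out: $!";
    print {$out_fh} wrap_fragment($line);
    close $out_fh;
}
```

Files with fewer than 999 lines are reported and skipped, so a single malformed download does not silently produce an empty result.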


Replies are listed 'Best First'.
Re^2: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by ag4ve (Monk) on Oct 16, 2010 at 01:56 UTC
    I'd use HTML::TokeParser::Simple and DBI and put your line directly into your db. Also, if your files are in separate directories, look at find2perl (or File::Find). I too would advise against just relying on the line number. However, if you want to do that, it can be done just as easily straight from the command line:

        find . -type f -print0 | xargs -0 -I{} sed -n '999p' {} | while IFS= read -r string; do mysql -e "insert query $string"; done

    That said, the Perl would be faster, and TokeParser would allow for more reliable data. The Perl modules I stated above are pretty straightforward to use, but since I've just done this (from LWP and not from files), I can pretty much get you a template for this in a few minutes if you need.
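
A template along these lines might look roughly like the following. This is only a sketch under stated assumptions, not ag4ve's actual code: it assumes the wanted values sit in <td> cells, it uses an SQLite database (schools.db, via DBD::SQLite) instead of MySQL to stay self-contained, and the table name cells, its columns, and the helper names cell_texts and store_files are all made up for illustration.

```perl
use strict;
use warnings;
use DBI;
use HTML::TokeParser::Simple;

# Return the trimmed text of every <td> cell found in $html.
# Nested tables are not handled; entities are left undecoded.
sub cell_texts {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my (@cells, $in_cell);
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('td')) {
            $in_cell = 1;
            push @cells, '';
        }
        elsif ($token->is_end_tag('td')) {
            $in_cell = 0;
        }
        elsif ($in_cell && $token->is_text) {
            $cells[-1] .= $token->as_is;
        }
    }
    s/^\s+|\s+$//g for @cells;
    return @cells;
}

# Insert one row per cell: file name, cell position, cell text.
sub store_files {
    my ($dbh, @files) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS cells (file TEXT, position INTEGER, content TEXT)');
    my $sth = $dbh->prepare('INSERT INTO cells (file, position, content) VALUES (?, ?, ?)');
    for my $file (@files) {
        open(my $fh, '<', $file) or die "Cannot open $file: $!";
        my $html = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        my @cells = cell_texts($html);
        $sth->execute($file, $_, $cells[$_]) for 0 .. $#cells;
    }
}

# Usage: perl store_cells.pl *.html
if (@ARGV) {
    # Assumed database; swap the DSN for dbi:mysql:... to use MySQL instead
    my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '', { RaiseError => 1 });
    store_files($dbh, @ARGV);
    $dbh->disconnect;
}
```

The placeholders in the INSERT statement let DBI do the quoting, which avoids the problems of pasting the extracted text directly into an SQL string.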
      hello ag4ve, many thanks for the posting!


      I like your idea of using HTML::TokeParser::Simple and DBI. I have little experience with HTML::TokeParser::Simple, and doing this task on my own would go over my head, at least at the moment! Note: I have also had a look at morgon's ideas; that seems to be an appropriate way, too. But at the moment I have trouble getting the corresponding XPath expressions: I tried to determine the XPath expressions that need to be filled into the Perl program.

      @ag4ve: I think I would love to go your way and do it with HTML::TokeParser::Simple and DBI.

      I guess that I have to do it with the other items too, in order to get the full set of information that is wanted.
      See one of the example sites:

      http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

      In the grey-shaded block you see the wanted information: 17 lines are wanted. Note: I have 5000 different HTML files, all structured in the very same way!

      That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI. That would be great!!

      I look forward to hearing from you

      best regards perlbeginner1

      Note: see the 5000 sites, with all the info, on an official German government server:
      http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php

      Do a search with *.* and you get the result pages. Everything is available to the whole world, and as I work in the field of education, nothing is wrong with doing the parsing job!


      Note your code just looks great!! Really:


      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;

      #use real file name here
      open(my $fh, "<", "file.html") or die $!;
      $tree->parse_file($fh);

      my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
      print $name->as_text;
Re^2: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by Perlbeginner1 (Scribe) on Oct 15, 2010 at 23:59 UTC
    Hello Morgon, good evening! Many many thanks for the quick reply!

    You understood right! That is exactly what is wanted.

    I like your advice!

    The fourth step is the point where I am now! Now I have 5000 files, each with the one extracted line.

    You say: 4) You have now a collection of html-files consisting only of the previous line 999 that you can parse further with whatever tool you want.

    How do I proceed? Note that I want to store these results in a db. Can I go and replace some HTML elements to get CSV? Is this doable? Or should I parse the results with HTML::TokeParser? I have little experience, and I guess that this job goes over my head! ;-)

    I look forward to hearing from you
      You simply have to extract the data-fields now.

      There are several ways to do it.

      As we've gotten rid of a lot of crap by only using one line of the original HTML, you could use a regular expression here, but that is in general not a good idea for decomposing HTML.

      Personally I like HTML::TreeBuilder::XPath, which you would have to install from CPAN.

      Here is how you would then extract the name from one of the files with it:

      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;

      #use real file name here
      open(my $fh, "<", "file.html") or die $!;
      $tree->parse_file($fh);

      my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
      print $name->as_text;

      As you can see, you simply use an XPath expression to identify the node you want.

      So how to determine that?

      I use a Firefox plugin called XPather that allows you to simply click on an HTML element and extract the corresponding XPath.

      So you load the file you want to parse in Firefox, click on the stuff you want, get the xpath and use that in the perl-script.

      Hope that gets you started...
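
To apply the same idea to all files and to several fields, something like the following sketch could serve as a starting point. It assumes each file has already been reduced to the wrapped fragment; only the name expression comes from the example above, the %xpath_for table and the helper name extract_fields are made up for illustration, and the expressions for the remaining fields would have to be determined and added.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Field name => XPath expression. Only "name" is taken from the post above;
# add the expressions for the other wanted fields once they are known.
my %xpath_for = (
    name => '/html/body/table/tr[1]/td[2]',
);

# Parse one HTML string and return a hash of field name => text.
sub extract_fields {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    # findvalue returns the text of the matching nodes ('' if nothing matches)
    my %record = map { $_ => $tree->findvalue($xpath_for{$_}) } keys %xpath_for;
    $tree->delete;    # free the parse tree explicitly
    return %record;
}

# Usage: perl fields.pl *.html
# Prints one tab-separated line per file: file name, then the fields in sorted order.
for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my %record = extract_fields($html);
    print join("\t", $file, map { $record{$_} } sort keys %xpath_for), "\n";
}
```

Using findvalue instead of findnodes means a file in which an expression matches nothing yields an empty string rather than a method call on an undefined value.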

        Hello dear morgon:

        Many thanks for posting! Your code and your ideas are just great!! Really: I am almost convinced!


        use strict;
        use HTML::TreeBuilder::XPath;

        my $tree = HTML::TreeBuilder::XPath->new;

        #use real file name here
        open(my $fh, "<", "file.html") or die $!;
        $tree->parse_file($fh);

        my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
        print $name->as_text;

        I have had a look at your ideas, morgon, and that seems to be an appropriate way, too. But at the moment I have trouble getting the corresponding XPath expressions: I tried to determine the XPath expressions that need to be filled into the Perl program. I need some help here: the 5000 HTML files all look the same!


        See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

        In the grey-shaded block you see the wanted information: 17 lines are wanted. Note: I have 5000 different HTML files, all structured in the very same way!

        That means I would be happy to have a little template that can be run with HTML::TokeParser::Simple and DBI.

        That would be great!!

        Morgon - I would be happy to hear from you

        best regards perlbeginner1

        http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php do a search with *.* -> then you get the result pages