Extracting Data from a File

globaldre has asked for the wisdom of the Perl Monks concerning the following question:

Hello - I am a .NET developer with very limited Perl knowledge. I am attempting to read some values from several files and output the values as a CSV file. I can write an .exe in C# that accomplishes this, but due to security concerns I am not able to run an .exe on the server. I know Perl is already installed on the server, and running a Perl script wouldn't raise any concerns. I've been reading through several Perl books but haven't come across any examples of how to get this done. Any ideas/suggestions are greatly appreciated.

I have some files in a directory structure that looks like this:
1. - root
  1.1 - html
    1.1.1 - html2010
        file1.html
        file2.html
        file3.html
        etc
    1.1.2 - html2010
        file1.html
        file2.html
        file3.html

I need to read the content of the "description" and "keywords" meta tags (test1,test2,test3,testk1,testk2,etc) from each file and output it as a CSV file.

The html pages look something like this:

<!-- File 1 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page1</title>
 <meta name="description" content="test1,test2,test3,test4,test5"/>
 <meta name="keywords" content="testk1,testk2,testk3,testk4,testk5"/>
</head>
<body>
 Body of the page 1
</body>
</html>

<!-- File 2 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page2</title>
 <meta name="description" content="test6,test7,test8,test9,test10"/>
 <meta name="keywords" content="testk6,testk7,testk8,testk9,testk10"/>
</head>
<body>
 Body of the page 2
</body>
</html>

I have a very limited understanding of Perl. Feel free to make a recommendation if you think there is another scripting language that is more suitable for accomplishing the above task.


Replies are listed 'Best First'.
Re: Extracting Data from a File by Tanktalus (Canon) on Nov 11, 2010 at 15:17 UTC
Hopefully you can add/install modules, or those modules are already there. Note that you can bundle your code together in a number of ways, which was recently discussed here. That means the modules aren't strictly needed on the server: you can ship them with whatever you have, and they will either live with your code or get installed, whichever way you go.

CSV: Text::CSV_XS. Separating things by commas sounds straightforward (`join ',', @list`), but there are corner/edge cases that may crop up and will get you to throw your hands up in frustration. Text::CSV_XS handles those cases for you, both for parsing and writing.

Parsing HTML: if the files are well-formed XHTML, I prefer XML::Twig, but if not, check CPAN for HTML parsers. Again, you probably could regex the search for your meta names, but unless they are always in exactly the same format (all on one line, always with name before content, neither of which is strictly required by HTML), it will be painful. Let a module do the heavy work for you, and you should be able to ask for the meta element with a name attribute of 'description', then ask for the value of its 'content' key.

Looking for the HTML files: File::Find, which is included in the perl you're using, though I've seen others prefer File::Find::Rule, which is not included in the core Perl distribution (but may be installed on your server anyway).

Once you have all the documentation and modules, and you have your plan on how to "distribute" your code to the server, you can pull it all together. If you write it well, I'm guessing that your code will amount to 20-50 lines. That's it. With the contents of CPAN, I can't think of another scripting language that is better suited to what you're doing.
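To make the CSV corner cases mentioned above concrete, here is a minimal core-Perl sketch. The helpers `csv_field` and `csv_line` are hypothetical names invented for this illustration and are not part of Text::CSV_XS; they only cover the common quoting rules (embedded commas, quotes, and line breaks), so treat them as a demonstration of the problem rather than a substitute for the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): quote a single CSV field.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    # A field containing a comma, a double quote, or a line break must be
    # wrapped in double quotes, with embedded double quotes doubled.
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Hypothetical helper: build one CSV line from a list of fields.
sub csv_line {
    return join(',', map { csv_field($_) } @_);
}

my @row = ('file1.html', 'red, green and blue', 'the "best" page');

# Naive join: the comma inside the second field makes it look like two fields.
print join(',', @row), "\n";

# Quoted: a CSV reader will get the original three fields back.
print csv_line(@row), "\n";
```

The second line printed keeps `red, green and blue` as a single field, which is the kind of case a plain `join` gets wrong without any warning.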
Re^2: Extracting Data from a File by Corion (Patriarch) on Nov 11, 2010 at 15:19 UTC
If this is really about only extracting `<meta>` tags from HTML files, HTML::HeadParser is a limited parser written for that.
Re^2: Extracting Data from a File by globaldre (Initiate) on Nov 11, 2010 at 18:59 UTC
Could you give me an example of what it would look like? I am assuming File::Find and other similar modules are just packaged classes, used in the same way as in .NET or Java, correct? I apologize again for my lack of knowledge.
Re: Extracting Data from a File by PeterPeiGuo (Hermit) on Nov 11, 2010 at 16:33 UTC
Usually one should not try to parse HTML with a regexp, but in this particular case it is just fine to use a simple regexp to locate what you are looking for in the files, and this way you don't need to install extra packages. Assuming you have used regexps in C#, you shouldn't have any trouble using the Perl ones. The other thing you need is File::Find, so that you can traverse the directories. Since you have already done lots of research, just do a little bit more, and focus on regexps and File::Find.

Peter (Guo) Pei
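A minimal sketch of the regexp-plus-File::Find approach described above, using only core Perl. The helper names `meta_content` and `scan_dir` are hypothetical, and the regexp assumes the layout of the sample pages (double-quoted attributes, `name` before `content`); a page that deviates from that simply yields an empty value.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the content attribute of <meta name="$name">,
# or undef if the tag is absent or not in the assumed format.
sub meta_content {
    my ($html, $name) = @_;
    return $html =~ /<meta\s+name="\Q$name\E"\s+content="([^"]*)"/is ? $1 : undef;
}

# Hypothetical helper: return one [path, description, keywords] row for
# every .html/.htm file found under $dir.
sub scan_dir {
    my ($dir) = @_;
    my @rows;
    find({
        no_chdir => 1,    # do not chdir, so $_ holds the full path
        wanted   => sub {
            return unless -f $_ && /\.html?$/i;
            open my $fh, '<', $_ or die "Cannot open $_: $!";
            my $html = do { local $/; <$fh> };    # slurp the whole file
            close $fh;
            push @rows, [ $_,
                          meta_content($html, 'description'),
                          meta_content($html, 'keywords') ];
        },
    }, $dir);
    return @rows;
}

# Usage (assumed invocation): perl extract_meta.pl root > meta.csv
if (@ARGV) {
    for my $row (scan_dir($ARGV[0])) {
        my @fields = map {
            my $v = defined $_ ? $_ : '';
            $v =~ s/"/""/g;      # double any embedded quotes
            qq{"$v"};            # quote every field, since values contain commas
        } @$row;
        print join(',', @fields), "\n";
    }
}
```

Every field is quoted on output because the description and keywords values themselves contain commas. If the pages are not as uniform as the samples, a real parser such as the modules suggested elsewhere in this thread is the safer choice.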
Re: Extracting Data from a File by roho (Bishop) on Nov 11, 2010 at 18:58 UTC
Here is a sample program using the modules mentioned to process all HTML files in your directory structure, parse the HTML headers for description and keyword values, and write the results to CSV files. Please note that description and keyword values are written to separate CSV files, since it did not seem to make sense to mix them in one file. Hope this helps.

#!/usr/bin/perl
######################################################################
# Name: extract_sample.pl
# Desc: Sample program to extract HTML header data as CSV files.
######################################################################
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

######################################################################
# Create objects for each CSV file to be created.
######################################################################
my $csv1 = Text::CSV->new ( { binary => 1 } ) or die Text::CSV->error_diag();
my $csv2 = Text::CSV->new ( { binary => 1 } ) or die Text::CSV->error_diag();
$csv1->eol ("\n");
$csv2->eol ("\n");

######################################################################
# Open CSV files for output.
######################################################################
my $dfile = 'description.csv';
my $kfile = 'keyword.csv';
open my $fh1, ">:encoding(utf8)", "$dfile" or die "Error opening $dfile: $!";
open my $fh2, ">:encoding(utf8)", "$kfile" or die "Error opening $kfile: $!";

######################################################################
# Set directory (and sub-directories) for File::Find to search.
# no_chdir keeps the working directory fixed, so the relative path in
# $File::Find::name can be opened as-is, even for files in sub-directories.
######################################################################
my $dir = '.';
find ({ wanted => \&HTML_Files, no_chdir => 1 }, $dir);

close $fh1 or die "Error closing $dfile: $!";
close $fh2 or die "Error closing $kfile: $!";
exit;

######################################################################
# This subroutine is called for each file in the directories searched.
######################################################################
sub HTML_Files {
    Parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub Parse_HTML_Header {
    ###################################################################
    # The 'parse' method below expects the HTML to be in a variable,
    # so we slurp the file contents into $text.
    ###################################################################
    my $ifile = shift;
    open(my $fh0, '<', $ifile) or die "Error opening $ifile: $!\n";
    # 'local' restores $/ when the block ends, so later reads are unaffected.
    my $text = do { local $/; <$fh0> };
    close $fh0;

    ###################################################################
    # Parse HTML header.
    ###################################################################
    my $h = HTTP::Headers->new;
    my $p = HTML::HeadParser->new($h);
    $p->parse($text);

    ###################################################################
    # Write results to separate CSV files for description and keywords.
    ###################################################################
    for ($h->header_field_names) {
        my @values = split ',', $h->header($_);
        if (/description/i) {
            $csv1->print ($fh1, \@values);
        }
        elsif (/keywords/i) {
            $csv2->print ($fh2, \@values);
        }
    }
}

"Its not how hard you work, its how much you get done."
Re^2: Extracting Data from a File by globaldre (Initiate) on Nov 12, 2010 at 13:38 UTC
Thanks roho, I will give this a try and make any tweaks as needed. Appreciate it!
