I ran this code, essentially as suggested by james28909, against your data set. The approach has obvious flaws given the HTML structure: it returns hrefs that you don't care about, and the greedy match can run past the end of the URL (see the second line of the output). A module that parses the HTML would be better.

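#!/usr/bin/perl
use warnings;
use strict;

while (my $line = <DATA>)
{
   # The greedy (.*) captures up to the LAST double quote on the line,
   # which is why one match below runs past the end of the URL.
   my ($url) = $line =~ m/.*a href="(.*)".*/;
   next unless $url;
   print "$url\n";
}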

=Prints
javascript:history.back()
http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
/edgar/searchedgar/webusers.htm
http://www.sec.gov/
/edgar/searchedgar/webusers.htm
/edgar/searchedgar/companysearch.html
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
/cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany
/cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include

Process completed successfully
=cut



__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
.... abbreviated to reduce space .....
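To illustrate the point that a parsing module would be better, here is a minimal sketch using HTML::LinkExtor (from the HTML-Parser distribution on CPAN, not a core module). It is only one possible approach, not a tested solution for the full EDGAR page. The helper name anchor_hrefs and the sample string are my own assumptions for illustration; the sample is modeled on the banner line that tripped the regex above.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::LinkExtor;    # CPAN, part of the HTML-Parser distribution

# Hypothetical helper: return the href of every <a> tag in an HTML string.
sub anchor_hrefs {
    my ($html) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # LinkExtor also reports <img src>, <link href>, and so on;
            # keep only anchors that actually carry an href.
            push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
        }
    );
    $parser->parse($html);
    $parser->eof;
    return @urls;
}

# Assumed sample input, modeled on the line where the greedy regex over-matched.
my $sample = '<a href="http://www.sec.gov/index.htm">'
           . '<img src="/images/sealTop.gif" alt="SEC Seal" border="0"></a>';

print "$_\n" for anchor_hrefs($sample);
```

Because the parser understands tag and attribute boundaries, the img attributes no longer leak into the URL. As far as I know, the parser also decodes entities in attribute values, so an `&amp;` in an href should come back as a plain `&`. You would still need to filter out links you don't want, such as the javascript: one.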

In reply to Re^5: REGEX for url by Marshall
in thread REGEX for url by wrkrbeee
