
I ran this code, essentially as suggested by james28909, against your data set. The approach has obvious flaws with respect to HTML structure: it picks up hrefs that you don't care about, and one of the captures runs past the end of the URL (see the second line of the output). A module that actually parses the HTML would be better.

#!/usr/bin/perl
use warnings;
use strict;

while (my $line = <DATA>)
{
    # Greedy (.*) runs to the last double quote on the line.
    (my $url) = $line =~ m/.*a href="(.*)".*/;
    next unless $url;
    print "$url\n";
}

=Prints

javascript:history.back()
http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
/edgar/searchedgar/webusers.htm
http://www.sec.gov/
/edgar/searchedgar/webusers.htm
/edgar/searchedgar/companysearch.html
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
/cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany
/cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include
Process completed successfully

=cut

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off.
To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
.... abbreviated to reduce space.....
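
Short of switching to a real parser, the runaway second line of the output can probably be avoided by capturing with a negated character class instead of a greedy (.*), and by using /g so that more than one link per line is picked up. The following is only a sketch of that idea: extract_hrefs is a made-up helper name, and the sample line is reconstructed from the printed output above rather than copied from the original page. It still treats HTML as text, so it will miss single-quoted or unquoted attributes and will happily match links inside comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every href value found
# on <a ...> tags in one chunk of HTML.
sub extract_hrefs {
    my ($html) = @_;
    # "<a" plus whitespace keeps <link ...> and friends out;
    # [^"]* stops the capture at the closing quote of the attribute.
    return $html =~ /<a\s+[^>]*?\bhref="([^"]*)"/gi;
}

# Sample line reconstructed from the output shown above (an assumption).
my $sample = '<a href="http://www.sec.gov/index.htm">'
           . '<img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a>';

print "$_\n" for extract_hrefs($sample);
```

On that sample this prints just http://www.sec.gov/index.htm. For anything beyond a quick one-off, a parsing module such as HTML::LinkExtor or HTML::TokeParser (both from the HTML-Parser distribution on CPAN) is the more robust route.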