Re^3: REGEX for url

Re^4: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 21:28 UTC
Not sure if this helps, but the full text block, from <html> through </html>, appears below. I am just using $/ as a way to indicate the end of a record. I apologize for wasting your time.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
<div id="headerTop">
<div id="Nav"><a href="http://www.sec.gov/index.htm">Home</a> | <a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | <a href="javascript:history.back()">Previous Page</a></div>
<div id="seal"><a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>
<div id="secWordGraphic"><img src="/images/bannerTitle.gif" alt="SEC Banner" /></div>
</div>
<div id="headerBottom">
<div id="searchHome"><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a></div>
<div id="PageTitle">Filing Detail</div>
</div>
<!-- END BANNER -->
<!-- BEGIN BREADCRUMBS -->
<div id="breadCrumbs">
<ul>
<li><a href="http://www.sec.gov/">SEC Home</a> »</li>
<li><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a> »</li>
<li><a href="/edgar/searchedgar/companysearch.html">Company Search</a> »</li>
<li class="last">Current Page</li>
</ul>
</div>
<!-- END BREADCRUMBS -->
<div id="contentDiv">
<div id="formDiv">
<!-- START FILING DIV -->
<div id="formHeader">
<div id="formName">
<strong>Form 10-K</strong> - Annual report [Section 13 and 15(d), not S-K Item 405]
</div>
<div id="secNum">
<strong><acronym title="Securities and Exchange Commission">SEC</acronym> Accession <acronym title="Number">No.</acronym></strong> 0000927356-01-000365
</div>
</div>
<div class="formContent">
<div class="formGrouping">
<div class="infoHead">Filing Date</div>
<div class="info">2001-03-30</div>
<div class="infoHead">Accepted</div>
<div class="info">1995-09-28 00:00:00</div>
<div class="infoHead">Documents</div>
<div class="info">10</div>
</div>
<div class="formGrouping">
<div class="infoHead">Period of Report</div>
<div class="info">2000-12-30</div>
</div>
<div style="clear:both"></div>
</div>
<!-- END FILING DIV -->
<!-- START DOCUMENT DIV -->
<div style="padding: 0px 0px 4px 0px; font-size: 12px; margin: 0px 2px 0px 5px; width: 100%; overflow:hidden">
<p>Document Format Files</p>
<table class="tableFile" summary="Document Format Files">
<tr>
<th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
<th scope="col" style="width: 40%;">Description</th>
<th scope="col" style="width: 20%;">Document</th>
<th scope="col" style="width: 10%;">Type</th>
<th scope="col">Size</th>
</tr>
<tr>
<td scope="row">1</td>
<td scope="row">ANNUAL REPORT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>
<td scope="row">10-K</td>
<td scope="row">194594</td>
</tr>
<tr class="blueRow">
<td scope="row">2</td>
<td scope="row">EMPLOYMENT AGREEMENT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt">0002.txt</a></td>
<td scope="row">EX-10.6</td>
<td scope="row">18708</td>
</tr>
<tr>
<td scope="row">3</td>
<td scope="row">CHANGE IN TERMS AGREEMENT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt">0003.txt</a></td>
<td scope="row">EX-10.9</td>
<td scope="row">24380</td>
</tr>
<tr class="blueRow">
<td scope="row">4</td>
<td scope="row">FIRST AMENDMENT TO LEASE AGREEMENT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt">0004.txt</a></td>
<td scope="row">EX-10.12</td>
<td scope="row">15945</td>
</tr>
<tr>
<td scope="row">5</td>
<td scope="row">THIRD AMENDMENT TO LEASE</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt">0005.txt</a></td>
<td scope="row">EX-10.19</td>
<td scope="row">3127</td>
</tr>
<tr class="blueRow">
<td scope="row">6</td>
<td scope="row">FOURTH AMENDMENT TO LEASE</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt">0006.txt</a></td>
<td scope="row">EX-10.20</td>
<td scope="row">3887</td>
</tr>
<tr>
<td scope="row">7</td>
<td scope="row">FIFTH AMENDMENT TO LEASE</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt">0007.txt</a></td>
<td scope="row">EX-10.21</td>
<td scope="row">3980</td>
</tr>
<tr class="blueRow">
<td scope="row">8</td>
<td scope="row">SIXTH AMENDMENT TO LEASE</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt">0008.txt</a></td>
<td scope="row">EX-10.22</td>
<td scope="row">4017</td>
</tr>
<tr>
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>
<td scope="row">700</td>
</tr>
<tr class="blueRow">
<td scope="row">10</td>
<td scope="row">CONSENT OF INDEPENDENT PUBLIC ACCOUNTANTS</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt">0010.txt</a></td>
<td scope="row">EX-23.1</td>
<td scope="row">346</td>
</tr>
<tr>
<td scope="row"> </td>
<td scope="row">Complete submission text file</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">0000927356-01-000365.txt</a></td>
<td scope="row"> </td>
<td scope="row">272254</td>
</tr>
</table>
</div>
<!-- END DOCUMENT DIV -->
</div>
<!-- START FILER DIV -->
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
<span class="mailerAddress"> DRAPER UT 84020 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
<span class="mailerAddress"> DRAPER UT 84020 </span>
<span class="mailerAddress">8015728225</span>
</div>
<div class="companyInfo">
<span class="companyName">1 800 CONTACTS INC (Filer)
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050122&action=getcompany">0001050122 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>870571643</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>10-K</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=000-23633&action=getcompany"><strong>000-23633</strong></a> | Film No.: <strong>1587687</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&SIC=3827&owner=include">3827</a></b> Optical Instruments & Lenses<br />Assistant Director 10</p>
</div>
<div class="clear"></div>
</div>
<!-- END FILER DIV -->
</div>
</body>
</html>
Re^5: REGEX for url by Marshall (Canon) on Apr 25, 2016 at 22:24 UTC
I ran this code, essentially as suggested by james28909, against your data set. This approach has obvious flaws in terms of HTML structure, because it also picks up hrefs that you don't care about. A module to parse this would be better.

#!/usr/bin/perl
use warnings;
use strict;

while (my $line = <DATA>) {
    (my $url) = $line =~ m/.*a href="(.*)".*/;
    next unless $url;
    print "$url\n";
}

=Prints

javascript:history.back()
http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
/edgar/searchedgar/webusers.htm
http://www.sec.gov/
/edgar/searchedgar/webusers.htm
/edgar/searchedgar/companysearch.html
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
/cgi-bin/browse-edgar?CIK=0001050122&action=getcompany
/cgi-bin/browse-edgar?action=getcompany&SIC=3827&owner=include
Process completed successfully

=cut

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>EDGAR Filing Documents for 0000927356-01-000365</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
.... abbreviated to reduce space.....
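Short of a full parsing module, the unwanted hrefs can be filtered out by matching only the document links. This is a minimal sketch, assuming the links of interest are the ones under /Archives/edgar/data/ that end in .txt (an assumption based on the sample page). `archive_links` is a hypothetical helper name, and a regex like this still knows nothing about HTML structure.

```perl
use strict;
use warnings;

# Hypothetical helper: return every href value that points at a .txt file
# under /Archives/edgar/data/. [^"]* cannot run past the closing quote,
# so several links on one line are captured separately.
sub archive_links {
    my ($html) = @_;
    return $html =~ m{<a\s+href="(/Archives/edgar/data/[^"]*\.txt)"}g;
}

my $sample = <<'HTML';
<div id="Nav"><a href="http://www.sec.gov/index.htm">Home</a> | <a href="javascript:history.back()">Previous Page</a></div>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">0000927356-01-000365.txt</a></td>
HTML

# Prints only the two /Archives/ links; the Nav links are skipped.
print "$_\n" for archive_links($sample);
```

The navigation and breadcrumb links never match, because the pattern requires the /Archives/edgar/data/ prefix.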
Re^4: REGEX for url by wrkrbeee (Scribe) on Apr 25, 2016 at 21:09 UTC
Any idea why that would not work on my end? Maybe something that a rookie would do that an expert would not, or vice versa? Thank you for your time!
Re^5: REGEX for url by NetWallah (Canon) on Apr 25, 2016 at 21:19 UTC
This version is a little more robust: it works in both cases, with or without setting "$/". It can also handle multiple URLs.

use strict;
use warnings;
$/ = "</html>";
for (<DATA>) {
    print "$1\n" while /a href="(.*)"/g;
}
__DATA__
<td scope="row">9</td>
<td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365- 0009.txt">0009.txt</a></td>
<td scope="row">EX-21.1</td>
<td scope="row"><a href="/Another/URL/here.html">0009.txt</a></td>

This is not an optical illusion, it just looks like one.
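One caveat with a greedy `(.*)`: when a line carries more than one quoted attribute after the href, the capture runs to the last double quote on that line. The sketch below compares it with a `[^"]*` capture, which is a substitution for illustration rather than the code posted above. The sample line is the Nav line from the page in this thread.

```perl
use strict;
use warnings;

# Nav line from the sample page: three anchors on a single line.
my $line = '<a href="http://www.sec.gov/index.htm">Home</a> | '
         . '<a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | '
         . '<a href="javascript:history.back()">Previous Page</a>';

# Greedy: (.*) extends to the last double quote on the line,
# so the three anchors end up inside a single capture.
my @greedy = $line =~ /a href="(.*)"/g;

# Negated class: each capture stops at its own closing quote.
my @each = $line =~ /a href="([^"]*)"/g;

print scalar(@greedy), " greedy match, ", scalar(@each), " separate matches\n";
print "$_\n" for @each;
```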
Re^5: REGEX for url by ExReg (Priest) on Apr 25, 2016 at 22:07 UTC
Not able to check it on my machine, but wouldn't a /s be helpful here to be able to pass over the newlines? `print if s/.*a href="(.*)".*/$1/s;`
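For reference, a small sketch of what /s changes. It lets `.` match newlines, so the pattern can see the whole record, but the greedy `.*` on both sides then keeps only the last href in that record. The two-line record and the `[^"]*` global match are illustrations of mine, not taken from the posts above.

```perl
use strict;
use warnings;

# Two anchors, each on its own line of a single multi-line record.
my $record = qq{<a href="/first.txt">one</a>\n<a href="/second.txt">two</a>\n};

# Without /s, '.' stops at newlines: the match is confined to the first
# line, so only that line is rewritten and the second survives untouched.
(my $no_s = $record) =~ s/.*a href="(.*)".*/$1/;

# With /s, '.' crosses newlines: the leading .* swallows everything up to
# the last 'a href="', and the whole record collapses to that one URL.
(my $with_s = $record) =~ s/.*a href="(.*)".*/$1/s;

# Collecting every URL is easier with a global match than a substitution.
my @all = $record =~ /a href="([^"]*)"/g;

print "without /s: $no_s";
print "with /s: $with_s\n";
print "all: @all\n";
```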