Web pages grow like weeds in an untended garden. My web site now comprises 82 pages, with 1176 links. It's time to do some gardening. In particular, it's time to look for broken links, and either fix or remove them.
I'm certainly not going to go crawling through scores of pages by hand, clicking on links to see if they work. We need a program to do this. Yahoo lists programs that check web pages. I looked at some of these, but I didn't find any that were quite what I wanted, so I decided to write my own.
The result is linkcheck. To check the links on a page, we write
linkcheck http://my.isp.com/page.html
This will give us a report like
Checked 1 pages, 49 links
Found 0 broken links
linkcheck checks all the links on one page, but we want to check all the pages on a site. We can do this by recursively following links that we find to other web pages, and then checking those pages
linkcheck -r http://my.isp.com/page.html
Checked 144 pages, 1025 links
Found 3 broken links
If we follow every link that we find, we're liable to end up spidering the entire web. To avoid this, we only follow links to pages on our own site: my.isp.com. The -o flag tells linkcheck to also check (but not follow) offsite links

linkcheck -o -r http://my.isp.com/page.html
Checked 144 pages, 1131 links
Found 3 broken links
While it runs, linkcheck shows that it is making progress by displaying a twiddle: a spinner that cycles through the characters

| / - \

in the same location on the screen. With the -t flag, it prints a running progress report instead, of the form

"$Pages pages, $Links links, $Broken broken\r"

Output is written to stdout, while the twiddle displays on stderr. This allows us to redirect output and still see the twiddle. It also ensures that the twiddle is unbuffered, so that it displays in real time.
To make this into a usable program, we must also parse the command line, download web pages, parse the HTML that we download, manage URIs, and document the program.
Writing all this from the ground up would be a big job. Fortunately, we don't have to. Most of the heavy lifting has already been done by others, and made available to us in modules. Here are the modules used by linkcheck
Getopt::Std
HTML::Parser
LWP::UserAgent
Pod::Usage
URI
Using these modules, we can bolt together the completed application with only a few hundred lines of code. In the remainder of this article, we'll see how to do this.
Getopt::Std
Getopt::Std parses command line options. See Parsing Command Line Options with GetOpt:: for further discussion.
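As a sketch of how linkcheck might parse its switches (the getopt call matches the one in the main program later in this article; the flag letters -r, -o and -a come from the usage described below):

use Getopt::Std;

my %Options;
getopt('vt', \%Options);            # -v and -t take values; -r, -o, -a and other switches are boolean
print "recursive\n" if $Options{r}; # linkcheck -r ... sets $Options{r} to 1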
URI
URI manages URIs: each URI object represents a single URI. URI has many methods for constructing, manipulating, and analyzing URIs, but we need only a few. To create a URI object, we write
$uri = new URI 'http://my.isp.com/page1.html#section1';
We can resolve relative links with the new_abs constructor

$uri2 = new_abs URI 'page2.html', $uri; # http://my.isp.com/page2.html
Accessors extract the components of a URI
$uri->scheme;    # http
$uri->authority; # my.isp.com
$uri->fragment;  # section1
Passing an argument to an accessor sets that component. Empty components are represented as undef.

$uri->fragment('section2'); # http://my.isp.com/page1.html#section2
$uri->fragment(undef);      # http://my.isp.com/page1.html
The as_string() method returns the string representation of a URI object. as_string() is overloaded onto the stringize ("") operator; this means that we can use a URI object almost anywhere that we can use a string

print "$uri\n";
$Visited{$uri} = 1;
LWP::UserAgent
The simplest way to download a web page is with the LWP::Simple module

use LWP::Simple;
$content = get($uri);
The get() function returns the contents of the web page, or undef on failure. However, we need a bit more control than that, so we'll use the LWP::UserAgent module instead.
A user agent is any kind of HTTP client. LWP::UserAgent implements an HTTP client in Perl. To retrieve a web page, we create an LWP::UserAgent object, send an HTTP request, and receive the HTTP response.

$ua       = new LWP::UserAgent;
$request  = new HTTP::Request GET => $uri;
$response = $ua->request($request);
$response contains the contents of the web page
$content = $response->content;
If we only need the HTTP headers (for example, to check the existence or the modification date of a page), we can make a HEAD request, instead
$request = new HTTP::Request HEAD => $uri;
The request() method automatically handles redirects. We can recover the URL from which the page was ultimately retrieved as
$uri = $response->request->uri;
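To illustrate, here is a sketch of checking a single URL with a HEAD request; is_success() and status_line() are standard HTTP::Response methods, and the printed messages are just placeholders:

use LWP::UserAgent;
use HTTP::Request;

my $ua       = new LWP::UserAgent;
my $request  = new HTTP::Request HEAD => 'http://my.isp.com/page.html';
my $response = $ua->request($request);

if ($response->is_success)
{
    print "OK: ", $response->request->uri, "\n";     # the final URI, after any redirects
}
else
{
    print "Broken: ", $response->status_line, "\n";  # e.g. "404 Not Found"
}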
HTML::Parser
HTML::Parser parses web pages. We don't use HTML::Parser directly; rather, we create a subclass of it

use HTML::Parser;

package HTML::Parser::Links;
use base qw(HTML::Parser);
To parse a web page, we create an object of our subclass and pass the contents of the page to the parse method

$parser = new HTML::Parser::Links;
$parser->parse($content);
$parser->eof;
parse invokes methods in our subclass as callbacks. We only need one callback

sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;
parse calls start whenever it identifies the opening tag of an HTML element. The parameters are
$parser is the HTML::Parser::Links object.
$tag is the name of the tag: h1, a, strong, and so on.
%$attr is a hash of attribute/value pairs.
@$attrseq is a list of the attributes, in the order in which they appear in the tag.
$origtext is the original text of the tag.
We only care about a few tags and attributes. If we find a base tag, we capture the URL so that we can resolve relative links on that page
$tag eq 'base' and $base = $attr->{href};
When we find an a (anchor) tag, we capture either the href (for links)
$tag eq 'a' and $attr->{href} and $href = $attr->{href};
or the name (for fragments)
$tag eq 'a' and $attr->{name} and $name = $attr->{name};
Pod::Usage
Pod::Usage parses any POD text that it finds in the program source and prints it. This makes it easy to add usage and help facilities to a program.
pod2usage();           # print synopsis
pod2usage(VERBOSE=>1); # print synopsis and options
pod2usage(VERBOSE=>2); # print entire man page
pod2usage is typically called when there are errors on the command line, so it exits after printing the POD.
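The main program below calls a Help() routine before doing any work. A minimal sketch of such a routine (the -h flag and the exact logic are assumptions, not taken from linkcheck itself):

use Pod::Usage;

sub Help
{
    $Options{h} and pod2usage(VERBOSE=>2);  # assumed -h flag: print the entire man page
    @ARGV or pod2usage();                   # no pages on the command line: print the synopsis
}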
Module writers typically put their code into a package that is named after the module, to promote encapsulation and avoid name collisions. Conversely, package writers may put their code into a module, to make it available to other programs.
However, we can also embed packages directly in our program, simply by adding a package statement
package Spinner;
We use packages in our program to organize our code and to encapsulate the data that each component needs.
If we were writing modules, we would need to put each package in its own file, document it, and manage its external interface.
However, our packages are visible only within our program, so we needn't be so formal: we can create and use packages at our convenience. Here are the packages that we use within linkcheck
Spinner
HTML::Parser::Links
Page
Link
Spinner
Spinner displays the twiddle. Each call to Spin prints the next character in the sequence

| / - \

in the same location on the screen. Here is the complete package
package Spinner;
use vars qw($N @Spin);
@Spin = ('|', '/', '-', '\\');

sub Spin
{
    print STDERR $Spin[$N++], "\r";
    $N==4 and $N=0;
}
There's not much to it. $N, @Spin, and &Spin are all contained in the Spinner:: namespace. To advance the spinner, we call
Spinner::Spin();
It is tempting to use file-scoped lexicals instead of package variables
package Spinner;
my $N;
my @Spin = ('|', '/', '-', '\\');
If Spinner were a module, this would be fine; however, in our case it wouldn't actually provide any encapsulation. File scoping doesn't respect package declarations, so any file-scoped lexicals would share the same scope (and be subject to name collisions) with every other file-scoped lexical in the entire program.
HTML::Parser::Links
HTML::Parser::Links is our subclass of HTML::Parser. The code fragments shown above illustrate the base class interface. In our subclass, we have additional instance data, to represent the parsed HTML page, and accessors to return information about the page.
The new method is our constructor.
sub new
{
    my($class, $base) = @_;
    my $parser = new HTML::Parser;

    $parser->{base    } = $base;
    $parser->{links   } = [];
    $parser->{fragment} = {};

    bless $parser, $class
}
To create an HTML::Parser::Links object, we create an HTML::Parser object, add our own instance data to it, and bless it into our own class.
Here is the complete start method
sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;

    $tag eq 'base' and
        $parser->{base} = $attr->{href};

    $tag eq 'a' and $attr->{href} and do
    {
        my $base = $parser->{base};
        my $href = $attr->{href};
        my $uri  = new_abs URI $href, $base;
        push @{$parser->{links}}, $uri;
    };

    $tag eq 'a' and $attr->{name} and do
    {
        my $name = $attr->{name};
        $parser->{fragment}{$name} = 1;
    };
}
We only care about base and a tags. If we find a base element, we save the href so that we can resolve relative links. When we find a link, we create a new URI object and add it to the list of links. Finally, if we find a fragment, we add it to the fragment hash.
We have two accessors. $parser->links() returns a list of all the links on the page, and $parser->check_fragment($fragment) returns true iff $fragment exists on the page.
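Given the instance data created in new, these accessors need only a line or two each. Here is a sketch (the bodies are assumptions; the real methods may differ):

sub links
{
    my $parser = shift;
    $parser->{links}                  # the URI objects collected by start
}

sub check_fragment
{
    my($parser, $fragment) = @_;
    $parser->{fragment}{$fragment}    # true iff start saw an anchor with this name
}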
Page
The Page package retrieves and parses web pages. The web is multiply connected: there may be many links to a single web page. However, downloading pages over the network takes time, so we don't want to download any page more than once.
Page caches web pages in %Page::Content. The URL is the hash key, and the page content is the value. The first time we request a page, Page downloads it and caches the contents; any subsequent requests for the same page are satisfied from the cache, with no additional network activity.
The Page package also parses web pages. Parsing a page doesn't require network I/O, but it still takes time, and if we create and run a new parser for every fragment that we have to check, that time could be significant.
To avoid this, Page caches parsers in %Page::Parser. The hash key is the page URL, and the value is an HTML::Parser::Links object.
Here is the external interface for the Page package.

$page    = new Page $uri;
$uri     = $page->uri;
$links   = $page->links;
$content = get $page;
$parser  = parse $page;
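As an illustration of the caching, here is a sketch of what get might look like inside the Page package; the instance layout and the error handling are assumptions:

package Page;
use vars qw(%Content);

sub get
{
    my $page = shift;
    my $uri  = $page->{uri};          # assumed instance layout

    # repeat requests are satisfied from the cache; each page is downloaded at most once
    exists $Content{$uri} and return $Content{$uri};

    my $ua       = new LWP::UserAgent;
    my $request  = new HTTP::Request GET => $uri;
    my $response = $ua->request($request);

    $Content{$uri} = $response->is_success ? $response->content : undef;
}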
Link
The Link package checks the validity of a single link. Its external interface is very simple

$link = new Link $uri;
$ok   = $link->check;
Like the Page package, Link has some optimizations to avoid unnecessary operations. Checking links breaks down into two cases. If the link has a fragment
http://my.isp.com/page.html#section
then we have to download the entire page, parse it, and then verify that the fragment exists in the page. If the link has no fragment
http://my.isp.com/page.html
then we don't have to parse the page; in fact, we don't even have to download it: a HEAD request will tell us whether the page exists, and that's all we care about.
Internally, the check() method calls check_fragment() or check_base(), respectively, to handle these two cases. check_fragment() uses the Page package to download and parse the page, then it checks to see if the fragment exists in the page. check_base() issues a HEAD request directly to see if the page exists.
In either case, check() caches the results in %Link::Check, so we never have to check any link more than once.
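To make this concrete, here is a sketch of check() and its helpers, built only from operations introduced above; the method bodies and the instance layout are assumptions:

package Link;
use vars qw(%Check);

sub new
{
    my($class, $uri) = @_;
    bless { uri => $uri }, $class     # assumed instance layout
}

sub check
{
    my $link = shift;
    my $uri  = $link->{uri};

    # cache results in %Link::Check, so no link is checked more than once
    exists $Check{$uri} and return $Check{$uri};

    $Check{$uri} = $uri->fragment ? $link->check_fragment : $link->check_base;
}

sub check_base
{
    # a HEAD request tells us whether the page exists
    my $link     = shift;
    my $ua       = new LWP::UserAgent;
    my $request  = new HTTP::Request HEAD => $link->{uri};
    my $response = $ua->request($request);
    $response->is_success ? 1 : 0
}

sub check_fragment
{
    # download and parse the page, then look for the fragment
    my $link     = shift;
    my $uri      = $link->{uri};
    my $fragment = $uri->fragment;

    my $page_uri = $uri->clone;
    $page_uri->fragment(undef);       # the page URL, without the fragment

    my $page   = new Page $page_uri;
    my $parser = parse $page or return 0;
    $parser->check_fragment($fragment) ? 1 : 0
}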
With these packages in hand, we can write linkcheck in about 100 lines of code. Here is the main program
package main;

my %Options;
my %Checked;
my($Scheme, $Authority);
my($Pages, $Links, $Broken) = (0, 0, 0);

getopt('vt', \%Options);
Help();
CheckPages(@ARGV);
Summary();
%Options holds command line options. %Checked is a hash of checked URLs; we use it to avoid infinite recursion if there is a cycle of links on our web site. $Authority records the current site; we use it to identify onsite links. $Pages, $Links, and $Broken provide counts for Progress() and Summary().
CheckPages
@ARGV contains a list of pages to check. CheckPages() creates a URI object for each page, and calls CheckPage() on it.
sub CheckPages
{
    my @pages = @_;
    my @URIs  = map { new URI $_ } @pages;

    for my $uri (@URIs)
    {
        $Scheme    = $uri->scheme;
        $Authority = $uri->authority;
        CheckPage($uri);
    }
}
CheckPage
CheckPage() checks a single page.

sub CheckPage
{
    my $uri = shift;

    $Checked{$uri} and return;
    $Checked{$uri} = 1;
    $Pages++;
    Twiddle();
    print "PAGE $uri\n" if $Options{v} > 1;

    my $page  = new Page $uri;
    my $links = $page->links;
    defined $links or die "Can't get $uri\n";
    CheckLinks($page, $links);
}
After some housekeeping, it creates a new Page object, gets all the links on the page, and calls CheckLinks(). linkcheck checks for broken links, but the pages that the user specifies on the command line have to exist. If we can't download one, we die.
CheckLinks
CheckLinks() checks the links on a page.

sub CheckLinks
{
    my($page, $links) = @_;
    my @links;

    for my $link (@$links)
    {
        $link->scheme eq 'http' or next;
        my $on_site = $link->authority eq $Authority;
        $on_site or $Options{o} or next;
        $Links++;
        Twiddle();
        print "LINK $link\n" if $Options{v} > 2;

        Link->new($link)->check or do
        {
            Report($page, $link);
            next;
        };

        $on_site or next;
        $link->fragment(undef);
        push @links, $link;
    }

    $Options{r} or return;

    for my $link (@links)
    {
        CheckPage($link);
    }
}
The first loop checks the links. We only check HTTP links, and we only check offsite links if the -o flag is specified. The actual check is
Link->new($link)->check
If the check fails, we call Report().
If the check succeeds and the link is onsite, we strip its fragment and add it to @links. If the -r flag is specified, we fall through to the second loop and call CheckPage() on each onsite link.
Report() prints broken links, according to the -a and -v flags. Twiddle() advances a spinner or prints a progress report, according to the -t flag. Summary() prints a final count of checked pages, checked links, and broken links.
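For illustration, a Summary() that produces the report shown at the start of the article could be as simple as this (the real routine's wording may differ):

sub Summary
{
    print "Checked $Pages pages, $Links links\n";
    print "Found $Broken broken links\n";
}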
linkcheck 0.01
linkcheck 1.07
The power of packages like Page and Link isn't that they do anything very complex or sophisticated; rather, it is that once we have written them, we can use them without having to think about how they work.
Early versions of linkcheck didn't have the Page and Link packages. Instead, they cached pages and links in open code in the main program. The resulting program was intricate, fragile, and difficult to modify.
If we do want lexical variables in Spinner, we can enclose them, together with the subroutine that uses them, in a bare block; this limits their scope to the package code instead of the whole file

package Spinner;
{
    my $N;
    my @Spin = ('|', '/', '-', '\\');

    sub Spin
    {
        print STDERR $Spin[$N++], "\r";
        $N==4 and $N=0;
    }
}