Web pages grow like weeds in an untended garden. My web site now comprises 82 pages, with 1176 links. It's time to do some gardening. In particular, it's time to look for broken links, and either fix or remove them.
I'm certainly not going to go crawling through scores of pages by hand, clicking on links to see if they work. We need a program to do this. Yahoo lists programs that check web pages. I looked at some of these, but I didn't find any that did quite what I wanted, so I decided to write my own.
linkcheck

To check the links on a page, we write
linkcheck http://my.isp.com/page.html
This will give us a report like
Checked 1 pages, 49 links
Found 0 broken links
linkcheck checks all the links on one page, but we want to check all the pages on a site. We can do this by recursively following links that we find to other web pages, and then checking those pages
linkcheck -r http://my.isp.com/page.html
Checked 144 pages, 1025 links
Found 3 broken links
If we follow every link that we find, we're liable to end up spidering the entire web. To avoid this, we only follow links to pages on our own site: my.isp.com. The -o option additionally checks (but does not follow) offsite links

linkcheck -o -r http://my.isp.com/page.html
Checked 144 pages, 1131 links
Found 3 broken links
Checking a large site takes time, so linkcheck displays a twiddle while it runs: either a spinner (| / - \) or a running count ("$Pages pages, $Links links, $Broken broken\r"), according to the -t flag. Output is written to stdout, while the twiddle displays on stderr. This allows us to redirect output and still see the twiddle. It also ensures that the twiddle is unbuffered, so that it displays in real time.
To make this into a usable program, we must also parse command line options, download pages, parse HTML, and print documentation. Writing all this from the ground up would be a big job. Fortunately, we don't have to. Most of the heavy lifting has already been done by others, and made available to us in modules. Here are the modules used by linkcheck

Getopt::Std
HTML::Parser
LWP::UserAgent
Pod::Usage
URI

Using these modules, we can bolt together the completed application with only a few hundred lines of code. In the remainder of this article, we'll see how to do this.
Getopt::Std

Getopt::Std parses command line options. See Parsing Command Line Options with GetOpt:: for further discussion.
URI

URI manages URIs: each URI object represents a single URI. URI has many methods for constructing, manipulating, and analyzing URIs, but we need only a few. To create a URI object, we write
$uri = new URI 'http://my.isp.com/page1.html#section1';
We can resolve relative links with the new_abs constructor
$uri2 = new_abs URI 'page2.html', $uri; # http://my.isp.com/page2.html
Accessors extract the components of a URI
$uri->scheme;    # http
$uri->authority; # my.isp.com
$uri->fragment;  # section1
Passing an argument to an accessor sets that component. Empty components are represented as undef.
$uri->fragment('section2'); # http://my.isp.com/page1.html#section2
$uri->fragment(undef); # http://my.isp.com/page1.html
The as_string() method returns the string representation of a URI object. as_string() is overloaded onto the stringize ("") operator; this means that we can use a URI object almost anywhere that we can use a string
print "$uri\n";
$Visited{$uri} = 1;
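Putting these calls together, here is a small self-contained example using the article's sample addresses:

```perl
use strict;
use warnings;
use URI;

# Create a URI object
my $uri = URI->new('http://my.isp.com/page1.html#section1');

print $uri->scheme,    "\n";   # http
print $uri->authority, "\n";   # my.isp.com
print $uri->fragment,  "\n";   # section1

# Resolve a relative link against $uri
my $uri2 = URI->new_abs('page2.html', $uri);
print "$uri2\n";               # http://my.isp.com/page2.html

# Strip the fragment before using the URI as a page address
$uri->fragment(undef);
print "$uri\n";                # http://my.isp.com/page1.html
```

Note that the resolved link does not inherit the base URI's fragment: new_abs applies the usual relative-URI resolution rules.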
LWP::UserAgent

The simplest way to download a web page is with the LWP::Simple module

use LWP::Simple;
$content = get($uri);

The get() function returns the contents of the web page, or undef on failure. However, we need a bit more control than that, so we'll use the LWP::UserAgent module, instead.
A user agent is any kind of HTTP client. LWP::UserAgent implements an HTTP client in Perl. To retrieve a web page, we create an LWP::UserAgent object, send an HTTP request, and receive the HTTP response.
$ua = new LWP::UserAgent;
$request = new HTTP::Request GET => $uri;
$response = $ua->request($request);
$response contains the contents of the web page
$content = $response->content;
If we only need the HTTP headers—for example, to check the existence or the modification date of a page—we can make a HEAD request, instead
$request = new HTTP::Request HEAD => $uri;
The request() method automatically handles redirects. We can recover the URL from which the page was ultimately retrieved as
$uri = $response->request->uri;
HTML::Parser

HTML::Parser parses web pages. We don't use HTML::Parser directly; rather, we create a subclass of it
use HTML::Parser;
package HTML::Parser::Links;
use base qw(HTML::Parser);
To parse a web page, we create an object of our subclass and pass the contents of the page to the parse method
$parser = new HTML::Parser::Links;
$parser->parse($content);
$parser->eof;
parse invokes methods in our subclass as callbacks. We only need one callback
sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;
    ...
}
parse calls start whenever it identifies the opening tag of an HTML markup. The parameters are
$parser — our HTML::Parser::Links object
$tag — the name of the tag, e.g. h1, a, strong
$attr — a reference to a hash of attribute name/value pairs (%$attr)
$attrseq — a reference to an array of attribute names, in their original order (@$attrseq)
$origtext — the original text of the markup
We only care about a few tags and attributes. If we find a base tag, we capture the URL so that we can resolve relative links on that page
$tag eq 'base' and
$base = $attr->{href};
When we find an a (anchor) tag, we capture either the href (for links)
$tag eq 'a' and $attr->{href} and
$href = $attr->{href};
or the name (for fragments)
$tag eq 'a' and $attr->{name} and
$name = $attr->{name};
Pod::Usage

Pod::Usage parses any POD text that it finds in the program source and prints it. This makes it easy to add usage and help facilities to a program.
pod2usage();               # print synopsis
pod2usage(-verbose => 1);  # print synopsis and options
pod2usage(-verbose => 2);  # print entire man page
pod2usage is typically called when there are errors on the command line, so it exits after printing the POD.
Module writers typically put their code into a package that is named after the module, to promote encapsulation and avoid name collisions. Conversely, package writers may put their code into a module, to make it available to other programs.
However, we can also embed packages directly in our program, simply by adding a package statement
package Spinner;
We use packages in our program to get the same benefits of encapsulation. If we were writing modules, we would need to put each package in its own file and document its interface. However, our packages are visible only within our program, so we needn't be so formal: we can create and use packages at our convenience. Here are the packages that we use within linkcheck
Spinner
HTML::Parser::Links
Page
Link

Spinner

The Spinner package displays a simple spinner by printing the characters | / - \ in turn, each in the same location on the screen. Here is the complete package
package Spinner;
use vars qw($N @Spin);
@Spin = ('|', '/', '-', '\\');
sub Spin
{
print STDERR $Spin[$N++], "\r";
$N==4 and $N=0;
}
There's not much to it. $N, @Spin, and &Spin are all contained in the Spinner:: namespace. To advance the spinner, we call
Spinner::Spin();
It is tempting to use file-scoped lexicals instead of package variables
package Spinner;
my $N;
my @Spin = ('|', '/', '-', '\\');
If Spinner were a module, this would be fine; however, in our case it wouldn't actually provide any encapsulation. File-scoping doesn't respect package declarations, so any file-scoped lexicals would share the same namespace—and be subject to name collisions—with every other file-scoped lexical in the entire program.
HTML::Parser::Links

HTML::Parser::Links is our subclass of HTML::Parser. The code fragments shown above illustrate the base class interface. In our subclass, we have additional instance data, to represent the parsed HTML page, and accessors to return information about the page.
The new method is our constructor.
sub new
{
my($class, $base) = @_;
my $parser = new HTML::Parser;
$parser->{base}     = $base;
$parser->{links}    = [];
$parser->{fragment} = {};
bless $parser, $class
}
To create an HTML::Parser::Links object, we first create a plain HTML::Parser object, add our own instance data to it, and then bless it into our subclass.
Here is the complete start method
sub start
{
my($parser, $tag, $attr, $attrseq, $origtext) = @_;
$tag eq 'base' and
$parser->{base} = $attr->{href};
$tag eq 'a' and $attr->{href} and do
{
my $base = $parser->{base};
my $href = $attr->{href};
my $uri = new_abs URI $href, $base;
push @{$parser->{links}}, $uri;
};
$tag eq 'a' and $attr->{name} and do
{
my $name = $attr->{name};
$parser->{fragment}{$name} = 1;
};
}
We only care about base and a tags. If we find a base element, we save the href so that we can resolve relative links. When we find a link, we create a new URI object and add it to the list of links. Finally, if we find a fragment, we add it to the fragment hash.
We have two accessors.
$parser->links()
returns a list of all the links on the page.
$parser->check_fragment($fragment)
returns true iff $fragment exists on the page.
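The article doesn't list the bodies of these accessors, but given the instance data initialized in new, a minimal sketch might look like this (the object below is hand-built for illustration, rather than produced by parsing a page):

```perl
use strict;
use warnings;

package HTML::Parser::Links;

# Return a reference to the list of links collected by start()
sub links
{
    my $parser = shift;
    $parser->{links};
}

# True iff the page defines the named fragment
sub check_fragment
{
    my($parser, $fragment) = @_;
    $parser->{fragment}{$fragment};
}

package main;

# Exercise the accessors on a hand-built object
my $parser = bless { links    => ['http://my.isp.com/page2.html'],
                     fragment => { section1 => 1 } },
                   'HTML::Parser::Links';

print scalar @{$parser->links}, "\n";                            # 1
print $parser->check_fragment('section1') ? "yes" : "no", "\n";  # yes
```

Returning a reference from links matches how the main program later uses it: it tests the return value with defined and iterates over it with @$links.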
Page

The Page package retrieves and parses web pages. The web is multiply connected: there may be many links to a single web page. However, downloading pages over the network takes time, so we don't want to download any page more than once.
Page caches web pages in %Page::Content. The URL is the hash key, and the page content is the value. The first time we request a page, Page downloads it and caches the contents; any subsequent requests for the same page are satisfied from the cache, with no additional network activity.
The Page package also parses web pages. Parsing a page doesn't require network I/O, but it still takes time, and if we create and run a new parser for every fragment that we have to check, that time could be significant.
To avoid this, Page caches parsers in %Page::Parser. The hash key is the page URL, and the value is an HTML::Parser::Links object.
Here is the external interface for the Page package.
$page = new Page $uri;
$uri = $page->uri;
$links = $page->links;
$content = get $page;
$parser = parse $page;
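As a sketch of the caching idea, here is a stripped-down Page with the %Page::Content cache. The actual download (which linkcheck performs with LWP::UserAgent) is replaced by a stub, and the $Fetches counter is added here only to demonstrate that the network is hit at most once per URL:

```perl
use strict;
use warnings;

package Page;

our %Content;      # URL => page content
our $Fetches = 0;  # counts actual downloads (for illustration only)

sub new
{
    my($class, $uri) = @_;
    bless { uri => $uri }, $class;
}

sub uri { $_[0]->{uri} }

# Return the page content, downloading it at most once
sub get
{
    my $page = shift;
    my $uri  = $page->{uri};
    $Content{$uri} = Download($uri) unless exists $Content{$uri};
    $Content{$uri};
}

# Stub: the real version issues a GET request over the network
sub Download
{
    my $uri = shift;
    $Fetches++;
    "<html>dummy content for $uri</html>";
}

package main;

my $page = Page->new('http://my.isp.com/page.html');
$page->get;
$page->get;                   # second call is served from the cache
print $Page::Fetches, "\n";   # 1
```

%Page::Parser works the same way, with HTML::Parser::Links objects as the cached values.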
Link

The Link package checks the validity of a single link. Its external interface is very simple
$link = new Link $uri;
$ok = $link->check;
Like the Page package, Link has some optimizations to avoid unnecessary operations. Checking links breaks down into two cases. If the link has a fragment
http://my.isp.com/page.html#section
then we have to download the entire page, parse it, and then verify that the fragment exists in the page. If the link has no fragment
http://my.isp.com/page.html
then we don't have to parse the page; in fact, we don't even have to download it: a HEAD request will tell us whether the page exists, and that's all we care about.
Internally, the check() method calls check_fragment() or check_base(), respectively, to handle these two cases. check_fragment() uses the Page package to download and parse the page, then it checks to see if the fragment exists in the page. check_base() issues a HEAD request directly to see if the page exists.
In either case, check() caches the results in %Link::Check, so we never have to check any link more than once.
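A minimal sketch of this dispatch-and-cache logic might look like the following; check_fragment() and check_base() are stubbed out here, since the real versions need a network connection:

```perl
use strict;
use warnings;
use URI;

package Link;

our %Check;   # URL => result of the last check

sub new
{
    my($class, $uri) = @_;
    bless { uri => URI->new($uri) }, $class;
}

# Dispatch on the presence of a fragment, caching the result
sub check
{
    my $link = shift;
    my $uri  = $link->{uri};
    unless (exists $Check{$uri})
    {
        $Check{$uri} = defined $uri->fragment
                         ? $link->check_fragment
                         : $link->check_base;
    }
    $Check{$uri};
}

# Stubs: in linkcheck, check_fragment() downloads and parses the
# page via Page, and check_base() issues a HEAD request
sub check_fragment { 1 }
sub check_base     { 1 }

package main;

my $ok = Link->new('http://my.isp.com/page.html#section')->check;
print $ok ? "ok" : "broken", "\n";   # ok
```

Because URI objects stringify, they serve directly as keys in %Link::Check.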
With these packages in hand, we can write linkcheck in about 100 lines of code. Here is the main program
package main;
my %Options;
my %Checked;
my($Scheme, $Authority);
my($Pages, $Links, $Broken) = (0, 0, 0);
getopt('vt', \%Options);
Help();
CheckPages(@ARGV);
Summary();
%Options holds command line options. %Checked is a hash of checked URLs; we use it to avoid infinite recursion if there is a cycle of links on our web site. $Authority records the current site; we use it to identify onsite links. $Pages, $Links and $Broken provide counts for Progress() and Summary().
CheckPages

@ARGV contains a list of pages to check. CheckPages() creates a URI object for each page, and calls CheckPage() on it.
sub CheckPages
{
my @pages = @_;
my @URIs = map { new URI $_ } @pages;
for my $uri (@URIs)
{
$Scheme = $uri->scheme;
$Authority = $uri->authority;
CheckPage($uri);
}
}
CheckPage

CheckPage() checks a single page.
sub CheckPage
{
my $uri = shift;
$Checked{$uri} and return;
$Checked{$uri} = 1;
$Pages++;
Twiddle();
print "PAGE $uri\n" if $Options{v} > 1;
my $page = new Page $uri;
my $links = $page->links;
defined $links or
die "Can't get $uri\n";
CheckLinks($page, $links);
}
After some housekeeping, it creates a new Page object, gets all the links on the page, and calls CheckLinks().
linkcheck checks for broken links, but the pages that the user specifies on the command line have to exist. If we can't download one, we die.
CheckLinks

CheckLinks() checks the links on a page.
sub CheckLinks
{
my($page, $links) = @_;
my @links;
for my $link (@$links)
{
$link->scheme eq 'http' or next;
my $on_site = $link->authority eq $Authority;
$on_site or $Options{o} or next;
$Links++;
Twiddle();
print "LINK $link\n" if $Options{v} > 2;
Link->new($link)->check or do
{
Report($page, $link);
next;
};
$on_site or next;
$link->fragment(undef);
push @links, $link;
}
$Options{r} or return;
for my $link (@links)
{
CheckPage($link);
}
}
The first loop checks the links. We only check HTTP links, and we only check offsite links if the -o flag is specified. The actual check is
Link->new($link)->check
If the check fails, we call Report().
If the check succeeds and the link is onsite, we add it to @links. If the -r flag is specified, we fall through to the second loop and call CheckPage() on each onsite link.
Report() prints broken links, according to the -a and -v flags.
Twiddle() advances a spinner or prints a progress report, according to the -t flag.
Summary() prints a final count of checked pages, checked links, and broken links.
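A minimal Summary() consistent with the report format shown at the beginning of the article might look like this; in linkcheck the counters are the package globals $Pages, $Links, and $Broken, but this sketch takes them as arguments so that it stands alone:

```perl
use strict;
use warnings;

# Print the final report, in the format shown earlier
sub Summary
{
    my($Pages, $Links, $Broken) = @_;
    printf "Checked %d pages, %d links\n", $Pages, $Links;
    printf "Found %d broken links\n", $Broken;
}

Summary(144, 1025, 3);
# Checked 144 pages, 1025 links
# Found 3 broken links
```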
linkcheck 0.01 to linkcheck 1.07
The power of packages like Page and Link isn't that they do anything very complex or sophisticated; rather, it is that once we have written them, we can use them without having to think about how they work.
Early versions of linkcheck didn't have the Page and Link packages. Instead, they cached pages and links in open code in the main program. The resulting program was intricate, fragile, and difficult to modify.
We can, however, get genuine encapsulation for lexicals by enclosing them in a bare block

package Spinner;
{
my $N;
my @Spin = ('|', '/', '-', '\\');
sub Spin
{
print STDERR $Spin[$N++], "\r";
$N==4 and $N=0;
}
}