Web pages grow like weeds in an untended garden. My web site now comprises 82 pages, with 1176 links. It's time to do some gardening. In particular, it's time to look for broken links, and either fix or remove them.
I'm certainly not going to go crawling through scores of pages by hand, clicking on links to see if they work. We need a program to do this. Yahoo lists programs that check web pages. I looked at some of these, but I didn't find any that did quite what I wanted, so I decided to write my own.
linkcheck

To check the links on a page, we write
linkcheck http://my.isp.com/page.html
This will give us a report like
Checked 1 pages, 49 links
Found 0 broken links
linkcheck checks all the links on one page, but we want to check all the pages on a site. We can do this by recursively following links that we find to other web pages, and then checking those pages
linkcheck -r http://my.isp.com/page.html
Checked 144 pages, 1025 links
Found 3 broken links
If we follow every link that we find, we're liable to end up spidering the entire web. To avoid this, we only follow links to pages on our own site: my.isp.com. The -o option additionally checks (but does not follow) offsite links

linkcheck -o -r http://my.isp.com/page.html
Checked 144 pages, 1131 links
Found 3 broken links
Checking a large site takes time, so linkcheck displays a twiddle while it runs: either a spinner (| / - \) or a running count ("$Pages pages, $Links links, $Broken broken\r"), according to the -t flag. Output is written to stdout, while the twiddle displays on stderr. This allows us to redirect output and still see the twiddle. It also ensures that the twiddle is unbuffered, so that it displays in real time.
To make this into a usable program, we must also parse command line options, download pages, parse HTML, and print documentation. Writing all this from the ground up would be a big job. Fortunately, we don't have to. Most of the heavy lifting has already been done by others, and made available to us in modules. Here are the modules used by linkcheck

Getopt::Std
HTML::Parser
LWP::UserAgent
Pod::Usage
URI

Using these modules, we can bolt together the completed application with only a few hundred lines of code. In the remainder of this article, we'll see how to do this.
Getopt::Std

Getopt::Std parses command line options. See Parsing Command Line Options with GetOpt:: for further discussion.
URI

URI manages URIs: each URI object represents a single URI. URI has many methods for constructing, manipulating, and analyzing URIs, but we need only a few. To create a URI object, we write
$uri = new URI 'http://my.isp.com/page1.html#section1';
We can resolve relative links with the new_abs constructor
$uri2 = new_abs URI 'page2.html', $uri; # http://my.isp.com/page2.html
Accessors extract the components of a URI
$uri->scheme;    # http
$uri->authority; # my.isp.com
$uri->fragment;  # section1
Passing an argument to an accessor sets that component. Empty components are represented as undef.
$uri->fragment('section2'); # http://my.isp.com/page1.html#section2
$uri->fragment(undef); # http://my.isp.com/page1.html
The as_string() method returns the string representation of a URI object. as_string() is overloaded onto the stringize ("") operator; this means that we can use a URI object almost anywhere that we can use a string
print "$uri\n";
$Visited{$uri} = 1;
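Putting these calls together, here is a small self-contained example using the article's sample addresses:

```perl
use strict;
use warnings;
use URI;

# Create a URI object
my $uri = URI->new('http://my.isp.com/page1.html#section1');

print $uri->scheme,    "\n";   # http
print $uri->authority, "\n";   # my.isp.com
print $uri->fragment,  "\n";   # section1

# Resolve a relative link against $uri
my $uri2 = URI->new_abs('page2.html', $uri);
print "$uri2\n";               # http://my.isp.com/page2.html

# Strip the fragment before using the URI as a page address
$uri->fragment(undef);
print "$uri\n";                # http://my.isp.com/page1.html
```

Note that the resolved link does not inherit the base URI's fragment: new_abs applies the usual relative-URI resolution rules.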
LWP::UserAgent

The simplest way to download a web page is with the LWP::Simple module

use LWP::Simple;
$content = get($uri);

The get() function returns the contents of the web page, or undef on failure. However, we need a bit more control than that, so we'll use the LWP::UserAgent module, instead.
A user agent is any kind of HTTP client. LWP::UserAgent implements an HTTP client in Perl. To retrieve a web page, we create an LWP::UserAgent object, send an HTTP request, and receive the HTTP response.
$ua = new LWP::UserAgent;
$request = new HTTP::Request GET => $uri;
$response = $ua->request($request);
$response contains the contents of the web page
$content = $response->content;
If we only need the HTTP headers—for example, to check the existence or the modification date of a page—we can make a HEAD request, instead
$request = new HTTP::Request HEAD => $uri;
The request() method automatically handles redirects. We can recover the URL from which the page was ultimately retrieved as
$uri = $response->request->uri;
HTML::Parser

HTML::Parser parses web pages. We don't use HTML::Parser directly; rather, we create a subclass of it
use HTML::Parser;
package HTML::Parser::Links;
use base qw(HTML::Parser);
To parse a web page, we create an object of our subclass and pass the contents of the page to the parse method
$parser = new HTML::Parser::Links;
$parser->parse($content);
$parser->eof;
parse invokes methods in our subclass as callbacks. We only need one callback
sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;
    ...
}
parse calls start whenever it identifies the opening tag of an HTML markup. The parameters are
$parser — our HTML::Parser::Links object
$tag — the name of the tag, e.g. h1, a, strong
$attr — a reference to a hash of attribute name/value pairs (%$attr)
$attrseq — a reference to an array of attribute names, in their original order (@$attrseq)
$origtext — the original text of the markup
We only care about a few tags and attributes. If we find a base tag, we capture the URL so that we can resolve relative links on that page
$tag eq 'base' and
$base = $attr->{href};
When we find an a (anchor) tag, we capture either the href (for links)
$tag eq 'a' and $attr->{href} and
$href = $attr->{href};
or the name (for fragments)
$tag eq 'a' and $attr->{name} and
$name = $attr->{name};
Pod::Usage

Pod::Usage parses any POD text that it finds in the program source and prints it. This makes it easy to add usage and help facilities to a program.
pod2usage();               # print synopsis
pod2usage(-verbose => 1);  # print synopsis and options
pod2usage(-verbose => 2);  # print entire man page
pod2usage is typically called when there are errors on the command line, so it exits after printing the POD.
Module writers typically put their code into a package that is named after the module, to promote encapsulation and avoid name collisions. Conversely, package writers may put their code into a module, to make it available to other programs.
However, we can also embed packages directly in our program, simply by adding a package statement
package Spinner;
We use packages in our program to get the same benefits of encapsulation. If we were writing modules, we would need to put each package in its own file and document its interface. However, our packages are visible only within our program, so we needn't be so formal: we can create and use packages at our convenience. Here are the packages that we use within linkcheck
Spinner
HTML::Parser::Links
Page
Link

Spinner

The Spinner package displays a simple spinner by printing the characters | / - \ in turn, each in the same location on the screen. Here is the complete package
package Spinner;
use vars qw($N @Spin);
@Spin = ('|', '/', '-', '\\');
sub Spin
{
print STDERR $Spin[$N++], "\r";
$N==4 and $N=0;
}
There's not much to it. $N, @Spin, and &Spin are all contained in the Spinner:: namespace. To advance the spinner, we call
Spinner::Spin();
It is tempting to use file-scoped lexicals instead of package variables
package Spinner;
my $N;
my @Spin = ('|', '/', '-', '\\');
If Spinner were a module, this would be fine; however, in our case it wouldn't actually provide any encapsulation. File-scoping doesn't respect package declarations, so any file-scoped lexicals would share the same namespace—and be subject to name collisions—with every other file-scoped lexical in the entire program.
HTML::Parser::Links

HTML::Parser::Links is our subclass of HTML::Parser. The code fragments shown above illustrate the base class interface. In our subclass, we have additional instance data, to represent the parsed HTML page, and accessors to return information about the page.
The new method is our constructor.
sub new
{
my($class, $base) = @_;
my $parser = new HTML::Parser;
$parser->{base}     = $base;
$parser->{links}    = [];
$parser->{fragment} = {};
bless $parser, $class
}
To create an HTML::Parser::Links object, we first create a plain HTML::Parser object, add our own instance data to it, and then bless it into our subclass.
Here is the complete start method
sub start
{
my($parser, $tag, $attr, $attrseq, $origtext) = @_;
$tag eq 'base' and
$parser->{base} = $attr->{href};
$tag eq 'a' and $attr->{href} and do
{
my $base = $parser->{base};
my $href = $attr->{href};
my $uri = new_abs URI $href, $base;
push @{$parser->{links}}, $uri;
};
$tag eq 'a' and $attr->{name} and do
{
my $name = $attr->{name};
$parser->{fragment}{$name} = 1;
};
}
We only care about base and a tags. If we find a base element, we save the href so that we can resolve relative links. When we find a link, we create a new URI object and add it to the list of links. Finally, if we find a fragment, we add it to the fragment hash.
We have two accessors.
$parser->links()
returns a list of all the links on the page.
$parser->check_fragment($fragment)
returns true iff $fragment exists on the page.
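The article doesn't list the bodies of these accessors, but given the instance data initialized in new, a minimal sketch might look like this (the object below is hand-built for illustration, rather than produced by parsing a page):

```perl
use strict;
use warnings;

package HTML::Parser::Links;

# Return a reference to the list of links collected by start()
sub links
{
    my $parser = shift;
    $parser->{links};
}

# True iff the page defines the named fragment
sub check_fragment
{
    my($parser, $fragment) = @_;
    $parser->{fragment}{$fragment};
}

package main;

# Exercise the accessors on a hand-built object
my $parser = bless { links    => ['http://my.isp.com/page2.html'],
                     fragment => { section1 => 1 } },
                   'HTML::Parser::Links';

print scalar @{$parser->links}, "\n";                            # 1
print $parser->check_fragment('section1') ? "yes" : "no", "\n";  # yes
```

Returning a reference from links matches how the main program later uses it: it tests the return value with defined and iterates over it with @$links.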
Page

The Page package retrieves and parses web pages. The web is multiply connected: there may be many links to a single web page. However, downloading pages over the network takes time, so we don't want to download any page more than once.
Page caches web pages in %Page::Content. The URL is the hash key, and the page content is the value. The first time we request a page, Page downloads it and caches the contents; any subsequent requests for the same page are satisfied from the cache, with no additional network activity.
The Page package also parses web pages. Parsing a page doesn't require network I/O, but it still takes time, and if we create and run a new parser for every fragment that we have to check, that time could be significant.
To avoid this, Page caches parsers in %Page::Parser. The hash key is the page URL, and the value is an HTML::Parser::Links object.
Here is the external interface for the Page package.
$page = new Page $uri;
$uri = $page->uri;
$links = $page->links;
$content = get $page;
$parser = parse $page;
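As a sketch of the caching idea, here is a stripped-down Page with the %Page::Content cache. The actual download (which linkcheck performs with LWP::UserAgent) is replaced by a stub, and the $Fetches counter is added here only to demonstrate that the network is hit at most once per URL:

```perl
use strict;
use warnings;

package Page;

our %Content;      # URL => page content
our $Fetches = 0;  # counts actual downloads (for illustration only)

sub new
{
    my($class, $uri) = @_;
    bless { uri => $uri }, $class;
}

sub uri { $_[0]->{uri} }

# Return the page content, downloading it at most once
sub get
{
    my $page = shift;
    my $uri  = $page->{uri};
    $Content{$uri} = Download($uri) unless exists $Content{$uri};
    $Content{$uri};
}

# Stub: the real version issues a GET request over the network
sub Download
{
    my $uri = shift;
    $Fetches++;
    "<html>dummy content for $uri</html>";
}

package main;

my $page = Page->new('http://my.isp.com/page.html');
$page->get;
$page->get;                   # second call is served from the cache
print $Page::Fetches, "\n";   # 1
```

%Page::Parser works the same way, with HTML::Parser::Links objects as the cached values.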
Link

The Link package checks the validity of a single link. Its external interface is very simple
$link = new Link $uri;
$ok = $link->check;
Like the Page package, Link has some optimizations to avoid unnecessary operations. Checking links breaks down into two cases. If the link has a fragment
http://my.isp.com/page.html#section
then we have to download the entire page, parse it, and then verify that the fragment exists in the page. If the link has no fragment
http://my.isp.com/page.html
then we don't have to parse the page; in fact, we don't even have to download it: a HEAD request will tell us whether the page exists, and that's all we care about.
Internally, the check() method calls check_fragment() or check_base(), respectively, to handle these two cases. check_fragment() uses the Page package to download and parse the page, then it checks to see if the fragment exists in the page. check_base() issues a HEAD request directly to see if the page exists.
In either case, check() caches the results in %Link::Check, so we never have to check any link more than once.
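A minimal sketch of this dispatch-and-cache logic might look like the following; check_fragment() and check_base() are stubbed out here, since the real versions need a network connection:

```perl
use strict;
use warnings;
use URI;

package Link;

our %Check;   # URL => result of the last check

sub new
{
    my($class, $uri) = @_;
    bless { uri => URI->new($uri) }, $class;
}

# Dispatch on the presence of a fragment, caching the result
sub check
{
    my $link = shift;
    my $uri  = $link->{uri};
    unless (exists $Check{$uri})
    {
        $Check{$uri} = defined $uri->fragment
                         ? $link->check_fragment
                         : $link->check_base;
    }
    $Check{$uri};
}

# Stubs: in linkcheck, check_fragment() downloads and parses the
# page via Page, and check_base() issues a HEAD request
sub check_fragment { 1 }
sub check_base     { 1 }

package main;

my $ok = Link->new('http://my.isp.com/page.html#section')->check;
print $ok ? "ok" : "broken", "\n";   # ok
```

Because URI objects stringify, they serve directly as keys in %Link::Check.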
With these packages in hand, we can write linkcheck in about 100 lines of code. Here is the main program
package main;
my %Options;
my %Checked;
my($Scheme, $Authority);
my($Pages, $Links, $Broken) = (0, 0, 0);
getopt('vt', \%Options);
Help();
CheckPages(@ARGV);
Summary();
%Options holds command line options. %Checked is a hash of checked URLs; we use it to avoid infinite recursion if there is a cycle of links on our web site. $Authority records the current site; we use it to identify onsite links. $Pages, $Links and $Broken provide counts for Progress() and Summary().
CheckPages

@ARGV contains a list of pages to check. CheckPages() creates a URI object for each page, and calls CheckPage() on it.
sub CheckPages
{
my @pages = @_;
my @URIs = map { new URI $_ } @pages;
for my $uri (@URIs)
{
$Scheme = $uri->scheme;
$Authority = $uri->authority;
CheckPage($uri);
}
}
CheckPage

CheckPage() checks a single page.
sub CheckPage
{
my $uri = shift;
$Checked{$uri} and return;
$Checked{$uri} = 1;
$Pages++;
Twiddle();
print "PAGE $uri\n" if $Options{v} > 1;
my $page = new Page $uri;
my $links = $page->links;
defined $links or
die "Can't get $uri\n";
CheckLinks($page, $links);
}
After some housekeeping, it creates a new Page object, gets all the links on the page, and calls CheckLinks().
linkcheck checks for broken links, but the pages that the user specifies on the command line have to exist. If we can't download one, we die.
CheckLinks

CheckLinks() checks the links on a page.
sub CheckLinks
{
my($page, $links) = @_;
my @links;
for my $link (@$links)
{
$link->scheme eq 'http' or next;
my $on_site = $link->authority eq $Authority;
$on_site or $Options{o} or next;
$Links++;
Twiddle();
print "LINK $link\n" if $Options{v} > 2;
Link->new($link)->check or do
{
Report($page, $link);
next;
};
$on_site or next;
$link->fragment(undef);
push @links, $link;
}
$Options{r} or return;
for my $link (@links)
{
CheckPage($link);
}
}
The first loop checks the links. We only check HTTP links, and we only check offsite links if the -o flag is specified. The actual check is
Link->new($link)->check
If the check fails, we call Report().
If the check succeeds and the link is onsite, we add it to @links. If the -r flag is specified, we fall through to the second loop and call CheckPage() on each onsite link.
Report() prints broken links, according to the -a and -v flags.
Twiddle() advances a spinner or prints a progress report, according to the -t flag.
Summary() prints a final count of checked pages, checked links, and broken links.
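A minimal Summary() consistent with the report format shown at the beginning of the article might look like this; in linkcheck the counters are the package globals $Pages, $Links, and $Broken, but this sketch takes them as arguments so that it stands alone:

```perl
use strict;
use warnings;

# Print the final report, in the format shown earlier
sub Summary
{
    my($Pages, $Links, $Broken) = @_;
    printf "Checked %d pages, %d links\n", $Pages, $Links;
    printf "Found %d broken links\n", $Broken;
}

Summary(144, 1025, 3);
# Checked 144 pages, 1025 links
# Found 3 broken links
```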
linkcheck 0.01 to linkcheck 1.07
The power of packages like Page and Link isn't that they do anything very complex or sophisticated; rather, it is that once we have written them, we can use them without having to think about how they work.
Early versions of linkcheck didn't have the Page and Link packages. Instead, they cached pages and links in open code in the main program. The resulting program was intricate, fragile, and difficult to modify.
We can, however, get genuine encapsulation for lexicals by enclosing them in a bare block

package Spinner;
{
my $N;
my @Spin = ('|', '/', '-', '\\');
sub Spin
{
print STDERR $Spin[$N++], "\r";
$N==4 and $N=0;
}
}