Representing Sets in Perl

Sometimes, a program needs to represent a set of things. A set is not a native data type in Perl, but there are at least 4 modules on CPAN for managing sets of one kind or another. In addition, a hash can be pressed into service as a set of keys.

This article starts with a brief review of the mathematical definition of a set. Then it describes 5 different implementations of sets in Perl, with attention to

universe
implementation
performance
special features

Sets

Ideal

A set is a mathematical abstraction. It is usually defined as

a well defined collection of distinct objects

This definition has three important elements:

well defined: you can decide whether or not something is in the set
distinct: no duplications
nothing else: this means that if two sets have the same elements, then they are the same set, because there isn't anything else that could distinguish them.

Real

When we move from the abstract world of mathematics to the concrete world of programming, things get more complicated. Almost immediately, we face issues of universe, implementation, and performance.

Universe

Mathematicians build their entire world from the ground up out of sets: everything is a set. In particular, the elements of sets are also sets.

In Perl, we have a pre-existing world, already populated with things: numbers, strings, lists, hashes, references. So the first thing we have to decide is what can be an element of a set. Possibilities include

numbers
- real
- integer
strings
references
anything

The collection of all things that may be in a set is called the universe.

Implementation

Mathematicians describe sets; programmers have to implement them. To implement a set, we have to find a way to represent it on a computer, and then write code to carry out set operations on that representation. Different representations may be more or less difficult to operate on, be well or poorly suited to a given universe, and provide different performance characteristics.

Performance

Mathematical sets don't consume resources; sets that are implemented on computers do: principally memory and CPU. How much depends on both the universe and the chosen representation. Mathematical sets can be arbitrarily large, so any implementation will eventually encounter performance limitations.

Hashes

Perhaps the simplest way to represent a set in Perl is to record its elements as the keys of a hash. Hash keys satisfy the two explicit elements of the definition of a set:

you can find out whether a key is in a hash
keys are guaranteed unique

Given, say, a list of elements, we can make a set of them like this

%set = map { $_ => 1 } @list

Hash keys are strings, so the universe for this implementation is the collection of all strings. Perl will happily convert back and forth between strings and numbers, so we can consider the universe to include numbers, as well.

Testing for membership is trivial

$set{'foo'} and ...

and taking the union of two sets is easy enough

%u = map { $_ => 1 } keys %a, keys %b;

But taking an intersection is a bit more involved

%i = map { $_ => 1 } grep { $b{$_} } keys %a;

and constructing the complement of a set is clearly problematic.

If all we need to do is put some elements in a set and extract them later, then using a hash in open code is fine. If we want to do anything more complicated, we will probably want to write some subroutines to implement set operations. And if we're going to write subroutines, we should probably put them in a package, and if we're going to create a package, we might as well make it into a module and put it on CPAN...

`Set::Scalar`

It's been done. It's called Set::Scalar, and it is available on CPAN.

Implementation

Set::Scalar provides an object-oriented interface: each set is a represented by a Set::Scalar object. To create a new set, write

$set = new Set::Scalar(qw(a b c d));

Internally, the elements of the set are stored as hash keys, so the universe is the collection of all strings. There are methods for set operations and predicates

$u = union        $a $b
$i = intersection $a $b

subset  $a $b and ...
is_null $a    and ...

and there are operator overloads for many of the methods

$u = $a + $b        # union
$u = $a * $b        # intersection
$a < $b and ....    # proper subset

Set::Scalar uses an instance variable to indicate whether the keys in the hash are the strings that are in the set, or the strings that are not in the set. This allows it to represent the complement of a finite set, without recourse to an infinite amount of storage.

Performance

In general, Set::Scalar has the same performance characteristics as a hash in open code. The memory required for a set is proportional to the number of elements in the set. Insertion, deletion, and testing for an individual element all require unit time. The time required for most other set operations is linear in the number of elements.

Special Features

Print Representation

Set::Scalar provides an as_string() method, which returns a print representation of a set. The default format is

(a b c d)

The format can be controlled by setting display attributes. as_string() is overloaded onto the stringize ("") operator.

Recursive Sets

The documentation for Set::Scalar says that one set can contain a reference to another. However, the current implementation stringizes objects as they are entered into the hash. This means that the first set merely contains the print representation of the second, rather than a live reference to it.

Value Sets

When a set is represented by hash keys, the hash values are normally irrelevant. However, if a Set::Scalar object is constructed from a hash reference, then it records both the keys and values.

my $transcendentals = new Set::Scalar { e  => 2.71828
				        pi => 3.14159 };

Methods are provided for accessing keys and values.

`Set::Object`

Set::Object implements a set of Perl objects. In Perl, objects are accessed through references. The obvious implementation of a set of objects would therefore be a hash of object references. However, this won't work in Perl, because hash keys are strings. If you provide a reference as a hash key, Perl accepts it, but then uses its print representation instead. For example,

$ref = bless {}, 'foo';
%set = ($ref => 1);
print keys %set, "\n";

prints something like

foo=HASH(0x10015270)

As the Perl documentation disarmingly explains, it is not possible to convert the print representation back into a live reference, because the reference count has been lost.

Implementation

Nonetheless, the obvious implementation of a set of objects is a hash of object references, and that is how Set::Object implements it. It creates its own hash table. It implements the hash table using Perl arrays, and derives the hash key from the machine address of the reference. It carefully increments the reference count every time it inserts an object into its hash table, and decrements the reference count every time it removes one. Needless to say, all of this is done in XS code.

To the user, Set::Object presents a standard object-oriented interface. Each set is represented as a Set::Object object

$f = new Foo;
$b = new Bar;
$a = new Set::Object($f, $b);

Set::Object also provides the usual set operations and operator overloads

$u = union $a $b;
$i = $a * $b;

Performance

Since Set::Object is implemented as a hash, it has the same performance characteristics as a native hash: single element operations run in unit time, set operations run in linear time, and storage is proportional to the number of elements.

The Set::Object POD shows some benchmarks suggesting that element lookup time for Set::Object is comparable to that of a native hash, while element insertion is an order of magnitude faster. However, the real significance of Set::Object isn't its performance relative to a native hash, but the fact that it makes it possible to hash live references in the first place.

XS must die!

It might be possible to implement Set::Object in native Perl by using a hash and storing each reference as both key and value:

sub Set::Object::new
{
    my($class, @refs) = @_;
    my $hash = { map { $_ => $_ } @refs };
    bless $hash, $class
}

Then the stringized reference serves as a unique hash key, while the hash value preserves a copy of the live reference.

`Set::IntRange`

Set::IntRange implements a set of integers.

Implementation

The implementation is very simple: Set::IntRange allocates one bit for each integer. If the bit is set then the integer is in the set; if the bit is clear then the integer is not in the set.

Obviously, Set::IntRange can not allocate one bit for every integer. Actually, it allocates one bit for every integer in a finite range, which is specified when the set is created:

$set = Set::IntRange->new($lower,$upper);

Like the other Set:: modules, Set::IntRange provides set operations and operator overloads. One restriction on these operations is that all the operands must have the same range. Set::IntRange contains methods for reallocating a set with a different range, but it is up to the programmer to look after this; Set::IntRange won't do it automatically.

Performance

Set::IntRange is a front-end for Bit::Vector. Bit::Vector in turn is a Perl interface for a package of C subroutines that allocate and manage arrays of bits. The whole thing is built for speed. Typical applications include infinite-precision arithmetic and graph algorithms.

Each integer serves as its own index into the bit array, so single element operations run in unit time.

Asymptotically, the storage required for a set and the time required for set operations both increase linearly with the size of the set. However, at any point in time, a Set::IntRange object has some range, and given that range, space and time requirements are fixed. This may be an advantage or a disadvantage, depending on the application.

In any case, for a given set, Set::IntRange is likely to have much better performance than a hash implementation, because of the array representation and because of the underlying C code.

Special Features

Intervals

An interval is a contiguous set of integers. It is denoted

[lower, upper]

where lower is the first integer in the interval and upper is the last. This is a mathematical notation, and doesn't have anything to do with Perl arrays or references.

Set::IntRange provides methods for operating on intervals. For example,

$set->Interval_Fill($lower, $upper)

adds the interval [$lower, $upper] to $set.

Iterators

Set::IntRange has two iterators: Interval_Scan_inc and Interval_Scan_dec. These work in terms of intervals. To iterate over the elements of a set, write a loop like this

for ($start = $set->Min();
     ($lower, $upper) = $set->Interval_Scan_inc($start);
     $start = $upper + 1)
{
    ...
}

This loop executes once, not for each element in the set, but for each interval. In each iteration, the current interval is reported as [$lower, $upper]. If there is only one integer in the interval, then $upper and $lower are the same.

`Bit::Vector`

The BitVector method returns a reference to the underlying Bit::Vector object. The Bit::Vector module provides methods for operating on the bit array in many different and powerful ways.

`Set::IntSpan`

Set::IntSpan implements a set of integers. It was designed specifically to manage the article lists in .newsrc files. Article lists use a special format that compactly represents long runs of consecutive numbers:

alt.foo: 1-21,28,31
alt.bar: 1-14192,14194,14196-14221

Implementation

Set::IntSpan uses the .newsrc format internally. Rather than record each element in the set, it records each interval in the set. For example, the set

{ 1, 2, ..., 10 }

is represented by a list not of 10 numbers, but just 2: 1 and 10. The advantage of this becomes clear when we consider the set

{ 1, 2, ..., 1_000_000_000 }

This set contains a billion elements, but it is still represented internally by a list of 2 numbers.

Set::IntSpan presents a standard object-oriented interface, and provides the usual set operations.

$a = new Set::IntSpan "1-21,28,31";
$b = new Set::IntSpan [1..10];
$u = union     $a $b;
$i = intersect $a $b;

Performance

Set::IntSpan has rather different performance characteristics from the other Set:: modules. Storage and time requirements are all proportional to the number of intervals in the set.

For sets with a few large intervals, this can be a major performance advantage. For sets with many small intervals, it can be a major disadvantage. This example from the POD illustrates the issue:

$billion = new Set::IntSpan '0-1_000_000_000';   # OK
$odd     = grep_set { $_ & 1 } $billion;         # trouble
$even    = map_set  { $_ * 2 } $billion;         # double trouble

Special Features

Infinite Sets

In general, the complement of a finite set is an infinite set. This will usually cause some kind of trouble for any implementation that represents a set by representing each element in the set.

Set::IntSpan requires that the boundaries of each interval be finite. However, the first interval may have no lower bound, and the last may have no upper bound. This allows it to represent sets that extend to minus and plus infinity, respectively.

The utility of this isn't that these kinds of infinite sets are especially common or useful. Rather, it is that users can carry out set operations involving complements without getting tangled up in endpoint problems at INT_MAX.

Iterators

In addition to the grep_set and map_set operators shown above, Set::IntSpan has explicit iterators. For example, this loop

for (        $element=$set->first; 
     defined $element; 
     	     $element=$set->next) { ... }

will iterate over all the elements of $set, in order. In addition,

$set->Start($n)

sets the iterator to $n; this allows a loop to iterate over portions of $set, even if $set is infinite.

Set::Predicate

All of the set implementations discussed above have something in common: they represent sets by representing their elements. In contrast, mathematicians usually describe sets by giving a rule, or predicate, for deciding whether something is in the set.

{ n | n > 2  } 	the set of integers greater than 2
{ n | n%5==0 }	the set of integers that are multiples of 5

Representing a set like this seems simple enough. Each predicate is represented by a subroutine that returns true iff the predicate is true on its argument:

sub g2 { $_[0]    > 2 }
sub m5 { $_[0]%5 == 0 }

The set object just needs a reference to the subroutine, and then it can decide whether something is an element:

sub Set::Predicate::new		#create a new object
{
    my($class, $predicate) = @_;
    bless { predicate => $predicate }, $class
}

sub Set::Predicate::element	# return true iff $element is in $set
{
    my($set, $element) = @_;
    $set->{predicate}->($element)
}

Then we can write

$a = new Set::Predicate \&g2;
$b = new Set::Predicate \&m5;
$u = union $a $b;

Implementing union (and the rest of the set ops) for this representation is left as an exercise for the reader.

Notes

native: as it is, for example, in Pascal
defined: Real mathematicians don't define sets, they axiomatize them.
more complicated: or less, depending on your point of view
collection: For technical reasons, the universe may not actually be a set, so we refer to it as a collection, instead.
numbers: Of course, we can not then distinguish between a number and the string that represents it in the hash, but this is rarely necessary or desirable, at least in Perl.
current implementation: According to the author, this is fixed in Set::Scalar 0.9.
die!: Well, no. XS has a place. I just hate to see anyone coding XS unless it's absolutely necessary.
linearly: Storage requirements are linear. Time requirements are actually O(n*log(n)): n for the number of elements in the set, and log(n) for the propagation delay through the address decoders between CPU and RAM. If you regard memory access time as a fixed quantity, they you aren't really headed for infinity. Remember, asymptotes don't stop just because you've maxxed out your processor address space.

Steven W. McDougall / resume / swmcd@theworld.com / 1999 November 15