This article starts with a brief review of the mathematical definition of a set. Then it describes 5 different implementations of sets in Perl, with attention to
a well defined collection of distinct objectsThis definition has three important elements:
In Perl, we have a pre-existing world, already populated with things: numbers, strings, lists, hashes, references. So the first thing we have to decide is what can be an element of a set. Possibilities include
%set = map { $_ => 1 } @listHash keys are strings, so the universe for this implementation is the collection of all strings. Perl will happily convert back and forth between strings and numbers, so we can consider the universe to include numbers, as well.
Testing for membership is trivial
$set{'foo'} and ...and taking the union of two sets is easy enough
%u = map { $_ => 1 } keys %a, keys %b;But taking an intersection is a bit more involved
%i = map { $_ => 1 } grep { $b{$_} } keys %a;and constructing the complement of a set is clearly problematic.
If all we need to do is put some elements in a set and extract them later, then using a hash in open code is fine. If we want to do anything more complicated, we will probably want to write some subroutines to implement set operations. And if we're going to write subroutines, we should probably put them in a package, and if we're going to create a package, we might as well make it into a module and put it on CPAN...
Set::Scalar
Set::Scalar
, and it is available on CPAN.
Set::Scalar
provides an object-oriented interface: each set is a represented by a Set::Scalar
object. To create a new set, write
$set = new Set::Scalar(qw(a b c d));Internally, the elements of the set are stored as hash keys, so the universe is the collection of all strings. There are methods for set operations and predicates
$u = union $a $b $i = intersection $a $b subset $a $b and ... is_null $a and ...and there are operator overloads for many of the methods
$u = $a + $b # union $u = $a * $b # intersection $a < $b and .... # proper subset
Set::Scalar
uses an instance variable to indicate whether the keys in the hash are the strings that are in the set, or the strings that are not in the set. This allows it to represent the complement of a finite set, without recourse to an infinite amount of storage.
Set::Scalar
has the same performance characteristics as a hash in open code. The memory required for a set is proportional to the number of elements in the set. Insertion, deletion, and testing for an individual element all require unit time. The time required for most other set operations is linear in the number of elements.
Set::Scalar
provides an as_string()
method, which returns a print representation of a set. The default format is
(a b c d)The format can be controlled by setting display attributes.
as_string()
is overloaded onto the stringize (""
) operator.
Set::Scalar
says that one set can contain a reference to another. However, the current implementation stringizes objects as they are entered into the hash. This means that the first set merely contains the print representation of the second, rather than a live reference to it.
Set::Scalar
object is constructed from a hash reference, then it records both the keys and values.
my $transcendentals = new Set::Scalar { e => 2.71828 pi => 3.14159 };Methods are provided for accessing keys and values.
Set::Object
Set::Object
implements a set of Perl objects. In Perl, objects are accessed through references. The obvious implementation of a set of objects would therefore be a hash of object references. However, this won't work in Perl, because hash keys are strings. If you provide a reference as a hash key, Perl accepts it, but then uses its print representation instead. For example,
$ref = bless {}, 'foo'; %set = ($ref => 1); print keys %set, "\n";prints something like
foo=HASH(0x10015270)As the Perl documentation disarmingly explains, it is not possible to convert the print representation back into a live reference, because the reference count has been lost.
Set::Object
implements it. It creates its own hash table. It implements the hash table using Perl arrays, and derives the hash key from the machine address of the reference. It carefully increments the reference count every time it inserts an object into its hash table, and decrements the reference count every time it removes one. Needless to say, all of this is done in XS code.
To the user, Set::Object
presents a standard object-oriented interface. Each set is represented as a Set::Object
object
$f = new Foo; $b = new Bar; $a = new Set::Object($f, $b);
Set::Object
also provides the usual set operations and operator overloads
$u = union $a $b; $i = $a * $b;
Set::Object
is implemented as a hash, it has the same performance characteristics as a native hash: single element operations run in unit time, set operations run in linear time, and storage is proportional to the number of elements.
The Set::Object
POD shows some benchmarks suggesting that
element lookup time for Set::Object
is comparable to that
of a native hash, while element insertion is an order of
magnitude faster. However, the real significance of
Set::Object
isn't its performance relative to a native
hash, but the fact that it makes it possible to hash live references
in the first place.
Set::Object
in native Perl by using a hash and storing each reference as both key and value:
sub Set::Object::new { my($class, @refs) = @_; my $hash = { map { $_ => $_ } @refs }; bless $hash, $class }Then the stringized reference serves as a unique hash key, while the hash value preserves a copy of the live reference.
Set::IntRange
Set::IntRange
implements a set of integers.
Set::IntRange
allocates one bit for each integer. If the bit is set then the integer is in the set; if the bit is clear then the integer is not in the set.
Obviously, Set::IntRange
can not allocate one bit for every integer. Actually, it allocates one bit for every integer in a finite range, which is specified when the set is created:
$set = Set::IntRange->new($lower,$upper);Like the other
Set::
modules, Set::IntRange
provides set operations and operator overloads. One restriction on these operations is that all the operands must have the same range. Set::IntRange
contains methods for reallocating a set with a different range, but it is up to the programmer to look after this; Set::IntRange
won't do it automatically.
Set::IntRange
is a front-end for Bit::Vector
. Bit::Vector
in turn is a Perl interface for a package of C subroutines that allocate and manage arrays of bits. The whole thing is built for speed. Typical applications include infinite-precision arithmetic and graph algorithms.
Each integer serves as its own index into the bit array, so single element operations run in unit time.
Asymptotically, the storage required for a set and the time required
for set operations both increase linearly with the size of the
set. However, at any point in time, a Set::IntRange
object has some range, and given that range, space and time requirements are
fixed. This may be an advantage or a disadvantage, depending on the
application.
In any case, for a given set, Set::IntRange
is likely to have much better performance than a hash implementation, because of the array representation and because of the underlying C code.
[lower, upper]where
lower
is the first integer in the interval and upper
is the last. This is a mathematical notation, and doesn't have anything to do with Perl arrays or references.
Set::IntRange
provides methods for operating on intervals. For example,
$set->Interval_Fill($lower, $upper)adds the interval
[$lower, $upper]
to $set
.
Set::IntRange
has two iterators: Interval_Scan_inc
and Interval_Scan_dec
. These work in terms of intervals. To iterate over the elements of a set, write a loop like this
for ($start = $set->Min(); ($lower, $upper) = $set->Interval_Scan_inc($start); $start = $upper + 1) { ... }This loop executes once, not for each element in the set, but for each interval. In each iteration, the current interval is reported as
[$lower, $upper]
.
If there is only one integer in the interval,
then $upper
and $lower
are the same.
Bit::Vector
BitVector
method returns a reference to the underlying Bit::Vector
object. The Bit::Vector
module provides methods for operating on the bit array in many different and powerful ways.
Set::IntSpan
Set::IntSpan
implements a set of integers. It was designed specifically to manage the article lists in .newsrc
files.
Article lists use a special format that compactly represents long runs of consecutive numbers:
alt.foo: 1-21,28,31 alt.bar: 1-14192,14194,14196-14221
Set::IntSpan
uses the .newsrc format internally. Rather than record each element in the set, it records each interval in the set. For example, the set
{ 1, 2, ..., 10 }is represented by a list not of 10 numbers, but just 2: 1 and 10. The advantage of this becomes clear when we consider the set
{ 1, 2, ..., 1_000_000_000 }This set contains a billion elements, but it is still represented internally by a list of 2 numbers.
Set::IntSpan
presents a standard object-oriented interface, and provides the usual set operations.
$a = new Set::IntSpan "1-21,28,31"; $b = new Set::IntSpan [1..10]; $u = union $a $b; $i = intersect $a $b;
Set::IntSpan
has rather different performance characteristics from the other Set::
modules. Storage and time requirements are all proportional to the number of intervals in the set.
For sets with a few large intervals, this can be a major performance advantage. For sets with many small intervals, it can be a major disadvantage. This example from the POD illustrates the issue:
$billion = new Set::IntSpan '0-1_000_000_000'; # OK $odd = grep_set { $_ & 1 } $billion; # trouble $even = map_set { $_ * 2 } $billion; # double trouble
Set::IntSpan
requires that the boundaries of each interval be finite. However, the first interval may have no lower bound, and the last may have no upper bound. This allows it to represent sets that extend to minus and plus infinity, respectively.
The utility of this isn't that these kinds of infinite sets are especially common or useful. Rather, it is that users can carry out set operations involving complements without getting tangled up in endpoint problems at INT_MAX
.
grep_set
and map_set
operators shown above, Set::IntSpan
has explicit iterators. For example, this loop
for ( $element=$set->first; defined $element; $element=$set->next) { ... }will iterate over all the elements of
$set
, in order. In addition,
$set->Start($n)sets the iterator to
$n
; this allows a loop to iterate over portions of $set
, even if $set
is infinite.
{ n | n > 2 } the set of integers greater than 2 { n | n%5==0 } the set of integers that are multiples of 5Representing a set like this seems simple enough. Each predicate is represented by a subroutine that returns true iff the predicate is true on its argument:
sub g2 { $_[0] > 2 } sub m5 { $_[0]%5 == 0 }The set object just needs a reference to the subroutine, and then it can decide whether something is an element:
sub Set::Predicate::new #create a new object { my($class, $predicate) = @_; bless { predicate => $predicate }, $class } sub Set::Predicate::element # return true iff $element is in $set { my($set, $element) = @_; $set->{predicate}->($element) }Then we can write
$a = new Set::Predicate \&g2; $b = new Set::Predicate \&m5; $u = union $a $b;Implementing
union
(and the rest of the set ops) for this representation is left as an exercise for the reader.
Set::Scalar 0.9
.