Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.
This article is in five parts
November | Introduction | motivation, definitions, examples |
December | Architecture | the Perl interpreter, calling conventions, data representation |
January | Tools | h2xs, xsubpp, DynaLoader |
February | Modules | Math::Ackermann , Set::Bit |
March | Align::NW |
Needleman-Wunsch global optimal sequence alignment |
First, we discuss some design alternatives for XS modules. Then we present examples that illustrate different designs.
Perl does not have classes or objects per se. Rather, it has a collection of features that conspire to support an object-oriented (OO) programming model. These include
use
bless
->
XS is just one more feature in the mix, and there are different ways we can use it. In particular, XS modules can represent data in either C or Perl, and can have methods written in either C or Perl.
If we represent object data in C, then methods written in C—that is, xsubs—have native access to the data. Methods written in Perl must access object data through accessor methods.
If we represent object data in Perl, then methods written in Perl have native access to the data. Methods written in C must access object data through the Perl C API.
This table summarizes the four possibilities for data representation and access.
Data | |||
---|---|---|---|
Perl | C | ||
Method | Perl | native | accessor method |
xsub | Perl C API | native |
Representing data in Perl may be simpler; representing data in C may be necessary for performance. We give an example of each approach in the remainder of this article.
Math::Ackermann
A(0 , n ) = n + 1 A(m+1, 0 ) = A(m, 1) A(m+1, n+1) = A(m, A(m+1, n))
It is perhaps unparalleled in consuming enormous amounts of RAM and CPU with very little code, while still doing something of at least nominal utility.
Because this is such a CPU-intensive function, we would like to write a module that does two things
Caching values is easy in Perl.
package Math::Ackermann; sub new { my $class = shift; bless [], $class } sub compute { my($ackermann, $m, $n) = @_; defined $ackermann->[$m][$n] or $ackermann->[$m][$n] = A($m, $n); $ackermann->[$m][$n] }
Computing the value in C is easy and efficient.
int A(int m, int n) { if (m==0) return n+1; if (n==0) return A(m-1, 1); return A(m-1, A(m, n-1)); }Here is the XS code to connect the C subroutine to the Perl module
int A(m, n) int m int n
Now we can compute values like this
my $ackermann = new Math::Ackermann; print $ackermann->compute(1, 1); # computes value print $ackermann->compute(2, 2); # computes value print $ackermann->compute(1, 1), # returns cached value
In this design, data is represented in Perl, and computation is done in C. This is a good allocation of resources, because data access time is small compared to computation time. Also, the volume of data is very small: 2 integers on input; one integer on output, so storage efficiency is not an issue.
Because Math::Ackermann
represents instance data in Perl, routines that are written in Perl, such as new()
and compute()
, have native access to that data. Routines that are written in C, such as A()
, must access data through the Perl C API. The XS routine shown above describes the calling sequence for A()
, and xsubpp
generates the necessary calls to the Perl C API.
Next, we'll look at a Perl module that represents data in C.
Set::Bit
Set::Bit
represents a set of integers in the range 0..n
It allocates one bit of storage for each integer in the range. If the bit is set then the integer is in the set; if the bit
is clear then the integer is not in the set.
The computation involved in set operations is trivial: setting and clearing bits to add or remove integers from a set; and'ing and or'ing bits to compute unions and intersections. Because of this, overall performance is dominated by data access time: how long it takes to locate a particular bit in a particular set. In addition, sets may be arbitrarily large, so storage efficiency will inevitably become an issue.
To improve performance, we represent Set::Bit
objects as C structs
typedef struct { int nBits; /* range of set: 0..nBits-1 */ int nInts; /* number of integers */ int *pInts; /* vector of integers */ } Vector;
and we write C subroutines to manipulate these structs
Vector *new (int nBits); void DESTROY (Vector *pVector); void dump (Vector *pVector); int top (Vector *pVector); void insert (Vector *pVector, int n); void Remove (Vector *pVector, int n); int member (Vector *pVector, int n); Vector *Union (Vector *pA, Vector *pB); Vector *intersect(Vector *pA, Vector *pB);
The name Vector
refers to the vector of int
s that holds the bits. To create a Set::Bit
object, we call new()
Vector *new(int nBits) { Vector *pVector = (Vector *) malloc(sizeof (Vector)); pVector->nBits = nBits; pVector->nInts = (nBits + sizeof(int) - 1) / sizeof(int); pVector->pInts = (int *) calloc(pVector->nInts, sizeof(int)); return pVector; }
The new()
routine
Vector
structint
s required to hold all the bits in the rangeint
s to hold the bitsVector
.
To add an integer to a set, we call insert()
.
void insert(Vector *pVector, int n) { int q, r; q = n / sizeof (int); r = n % sizeof (int); pVector->pInts[q] |= 1<<r; }
When we are done with a set, we call DESTROY()
to free the storage.
void DESTROY(Vector *pVector) { free(pVector->pInts); free(pVector); }
The other subroutines have similar implementations. All the C declarations and definitions are in
A()
as an xsub in the Math::Ackermann
package, but the Math::Ackermann
object was an ordinary Perl array.
We will install the C subroutines in vector.c
as xsubs in the Set::Bit
package, but we want to do more than that. We don't want any kind of Perl data structure to be the Set::Bit
object. Instead, we want the C-language Vector
struct to be the Set::Bit
object.
We have not written a module where the object itself is a C struct, and we need to learn some new XS tricks to make it work. If we present the solution all at once, it will likely seem complex and opaque. Instead, we will discover the solution by trial and error.
h2xs
h2xs
.../development>h2xs -A -n Set::Bit .../development>cp vector.c Set/Bit/ .../development>cp vector.h Set/Bit/
We add an OBJECT
key to the argument list of WriteMakefile()
in Makefile.PL
'OBJECT' => 'Bit.o vector.o'
We #include vector.h
in Bit.xs
, and add a PROTOTYPES
directive. Bit.xs
now looks like this
#include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "vector.h" MODULE = Set::Bit PACKAGE = Set::Bit PROTOTYPES: ENABLE
new
new
, like this
Vector * new(nBits) int nBits
Now, we run
.../development/Set/Bit>perl Makefile.PL .../development/Set/Bit>make
and immediately get
Error: 'Vector *' not in typemap in Bit.xs, line 12
make
invokes xsubpp
, and xsubpp
generates code to translate between C and Perl data representations. The default type map in
/usr/local/lib/perl5/version/ExtUtils/typemap
contains entries for translating many built-in C data types. But Vector *
isn't a built-in C data type, and the default typemap certainly hasn't an entry for it.
To tell xsubpp
how to translate a Vector *
, we create a local typemap in the module directory, and add an entry to it for Vector *
In the most general case, we would need to make up a new XS type, and add entries to the TYPEMAP, INPUT, and OUTPUT sections. If we called our XS type T_VECTOR
, then our typemap file would look something like
TYPEMAP Vector * T_VECTOR INPUT T_VECTOR # Code to translate a Perl scalar to a Vector * OUTPUT T_VECTOR # Code to translate a Vector * to a Perl scalar
T_PTROBJ
T_PTROBJ
, together with code fragments for it in the INPUT and OUTPUT sections. To use T_PTROBJ
, all we have to do is create a
.../development/Set/Bit/typemapfile and put this line
Vector * T_PTROBJ
in it. This tells xsubpp
to use the code fragments for T_PTROBJ
whenever it needs to translate between a Vector *
and a Perl scalar.
T_PTROBJ
was created expressly for our purpose: to make C structs into Perl objects. If we now run make
, we get a clean compile. But before we try to run our code, let's look at Bit.c
and find out what xsubpp
has (and hasn't) done for us.
sv_setref_pv
xsubpp
to connect new()
to Perl
XS(XS_Set__Bit_new) { dXSARGS; if (items != 1) croak("Usage: Set::Bit::new(nBits)"); { int nBits = (int)SvIV(ST(0)); Vector *RETVAL; RETVAL = new(nBits); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL); } XSRETURN(1); }
We saw most of this code last month. The glue routine extracts nBits
from the Perl stack, and declares RETVAL
as a Vector *
. Then it calls new()
, and assigns the return value to RETVAL
. Next, it creates a new mortal scalar at ST(0)
on the Perl stack.
Now comes the magic: the glue routine calls sv_setref_pv
to return RETVAL
to Perl.
sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL);
perlguts documents sv_setref_pv
like this
SV* sv_setref_pv(SV* rv, char* classname, PV iv);
Copies the pointer value (the address, not the string!) into an SV whose reference is rv. SV is blessed if classname is non-null.
SV stands for Scalar Value; an SV is the internal Perl data structure that represents a scalar. When we call sv_setref_pv()
, it
RETVAL
VectorPtr
packageST(0)
a reference to the SV
Finally, the mortal at ST(0)
is the return value of the new()
method.
new()
is a reference to a blessed scalar that holds a pointer to our C-language Vector
struct. However, the scalar is blessed into the VectorPtr
package, not the Set::Bit
package.
This is a bit odd. Perl methods that function as constructors customarily return references to objects that are blessed into the method's own package. The
MODULE = Set::Bit PACKAGE = Set::Bit
line in Bit.xs
causes all our xsubs to be installed in the Set::Bit
package. But Set::Bit::new()
returns a reference to an object that is blessed into the VectorPtr
package. This means, for example, that we could write
$set = Set::Bit::new(100);and get a reference to a valid Perl object, but we couldn't write
$set->insert(42);
because $$set
is blessed into the VectorPtr
package, and the VectorPtr
package has no insert()
method. The insert()
method is in the Set::Bit
package, along with the new()
method.
There are various ways around this problem. We could leave new()
in the Set::Bit
package, and add another PACKAGE
directive to install the rest of our xsubs in the VectorPtr
package
MODULE = Set::Bit PACKAGE = VectorPtr
Or we could rename the whole module VectorPtr
.
But these solutions are ugly and awkward. Module names shouldn't be dictated by the names of our C structs. Module names should be free parameters in our design, chosen to organize our libraries and to make our code read naturally. We very much want to get our object blessed into the Set::Bit
package.
$ntype
sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL);
To control the package that our objects are blessed into, we need to understand where the string "VectorPtr"
comes from. If we look in the OUTPUT section of the default typemap, we find this entry
T_PTROBJ sv_setref_pv($arg, \"${ntype}\", (void*)$var);
So xsubpp
emits code that blesses the object into the package named by $ntype
. As discussed last month, $ntype
is the type of the C variable that is being converted. In this case, the C variable is RETVAL
, and its type is Vector *
.
Vector *
is an example of a C type name that is not a valid Perl package name. To accommodate this, xsubpp
silently substitutes the string "Ptr"
for the *
before emitting $ntype
as a package name. This is a general mechanism; for example, the C type name Vector * *
would be converted to the Perl package name VectorPtrPtr
.
RETVAL
RETVAL
is the C variable that xsubpp
declares to hold the return value of our C subroutine; therefore, its type must match the return type of our C subroutine. xsubpp
declares RETVAL
with the type that we specify for our XS routine in Bit.xs
. Thus, the XS code
Vector * new(nBits) int nBits
in Bit.xs
leads to the C-language declaration
Vector * RETVAL;
in Bit.c
.
This means that the return type of our XS routine determines the package that our object is blessed into. In order to bless our object into the Set::Bit
package, we must write our XS routine like this
Set::Bit new(nBits) int nBits
It seems that this cannot stand, because
Set::Bit
is not the return type of the C-language new()
routineSet::Bit
is not a valid C type namebut we can get around both problems.
xsubpp
handles the second problem for us. It squashes colons in $ntype
to underscores whenever it emits $ntype
as the type of a C variable. So the XS code shown above leads to this declaration for RETVAL
in Bit.c
Set__Bit RETVAL;
which is valid C code.
To handle the first problem, we simply add a typedef
to Bit.xs
, like this
typedef Vector *Set__Bit;
xsubpp
passes this typedef
through to Bit.c
unchanged. Now the declaration of RETVAL
is valid and correct.
xsubpp
doesn't squash colons when it emits $ntype
as a Perl package name, so the call to sv_setref_pv
will be
sv_setref_pv(ST(0), "Set::Bit", (void*)RETVAL);
This will bless our object into the Set::Bit
package, as we desire.
Bit.xs
looks like this
#include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "vector.h" typedef Vector *Set__Bit; MODULE = Set::Bit PACKAGE = Set::Bit PROTOTYPES: ENABLE Set::Bit new(nBits) int nBits
If we now run
.../development/Set/Bit>makewe get
Error: 'Set::Bit' not in typemap in Bit.xs, line 14
The problem is that xsubpp
is trying to find the return type of our XS routine in the typemap. We made an entry in the local typemap for Vector *
, but then we changed the return type of our XS routine from Vector *
to Set::Bit
. We need to change the entry in the local typemap accordingly
Set::Bit T_PTROBJ
.../development/Set/Bit>make
and get a clean compile. Now, we try to create a Set::Bit
object. We add this line to test.pl
my $set = new Set::Bit 100;
and run
.../development/Set/Bit>make test
The output is
1..1 ok 1 Usage: Set::Bit::new(nBits) at test.pl line 21.
The ok 1
means that the module loaded successfully. The usage message means that we called the new()
method with the wrong number of arguments.
With a little thought, we can identify the problem. We invoked new()
with method call syntax. In Perl, a method call
new Set::Bit 100
ultimately becomes a subroutine call with the package name passed as the first argument
Set::Bit::new("Set::Bit", 100)
To accommodate this, we need to change our XS routine to take two arguments: a string and an integer
Set::Bit new(package, nBits) char *package int nBits
If we run xsubpp
on this XS routine, it will emit C code to call new()
with both arguments
RETVAL = new(package, nBits);
We could change our C routine to take both arguments, and then ignore the package name. However, I try to avoid polluting my interfaces with implementation artifacts. Instead, we add a CODE:
directive to the XS routine, and write the call to new()
ourselves.
CODE: RETVAL = new(nBits);
When we use the CODE:
directive, we also need an OUTPUT:
directive to tell xsubpp
what the return value is
OUTPUT: RETVAL
Our XS routine is now
Set::Bit new(package, nBits) char *package int nBits CODE: RETVAL = new(nBits); OUTPUT: RETVAL
When we run
.../development/Set/Bit>make test
we get a clean compile, and the output is
1..1 ok 1
showing that our call to new()
succeeded.
Set::Bit
Objectnew()
calls sv_setref_pv()
, and
sv_setref_pv()
creates a Perl scalar. This scalar is the Set::Bit
object. Here are three crucial facts regarding it
Vector
structSet::Bit
packageSet::Bit::new()
returns a reference to itHere is a picture that illustrates this
$set = new Set::Bit 100; reference Name: $set RV: ----------->scalar Package: Set::Bit IV: ----------------->Vector { }
RV
stands for Reference Value. IV
stands for Integer Value.
Because we have a reference to the Set::Bit
object, we can make method calls on it. Whenever we make a method call, the reference is passed as the first argument to the method.
Methods may be written in either Perl or C. A method can dereference its first argument to get the value of the object. The value of a Set::Bit
object is the machine address of a Vector
struct.
There isn't much that a Perl method can do with the address of a Vector
struct, except maybe print it.
sub Set::Bit::print_address { my $set = shift; printf "%p", $$set; }
A C method can pass the address of a Vector
struct to any of the C-language subroutines defined in vector.c
. Let's look at how this is done.
insert
insert
method inserts an integer into a set
$set = new Set::Bit 100; $set->insert(42);
Here is the XS code to install insert()
in the Set::Bit
package
void insert(pVector, n) Set::Bit pVector int n
and here is the glue routine that xsubpp
emits to call insert()
XS(XS_Set__Bit_insert) { dXSARGS; if (items != 2) croak("Usage: Set::Bit::insert(pVector, n)"); { Set__Bit pVector; int n = (int)SvIV(ST(1)); if (sv_derived_from(ST(0), "Set::Bit")) { IV tmp = SvIV((SV*)SvRV(ST(0))); pVector = (Set__Bit) tmp; } else croak("pVector is not of type Set::Bit"); insert(pVector, n); } XSRETURN_EMPTY; }
Let's look at the C code to see how we get our Vector *
back from Perl.
The XS routine declares pVector
to be of type Set::Bit
. As before, xsubpp
squashes the colons to underscores for us, emitting the C declaration
Set__Bit pVector;
and the typedef in Bit.xs
makes Set__Bit
an alias for Vector *
.
The first argument to insert()
is a reference to a Set::Bit
object. This argument appears on the Perl stack at ST(0)
.
sv_derived_from()
verifies that ST(0)
is indeed a reference to a Set::Bit
object. SvRV()
dereferences ST(0)
, returning the scalar that was created earlier by sv_setref_pv()
. SvIV()
extracts the value of that scalar as an Integer Value (IV). This value is the Vector *
that was stored there by sv_setref_pv()
.
The glue routine stores the IV in tmp
, and then assigns it to pVector
on the next line, with a typecast to quiet the compiler. Finally, it calls the C-language insert()
routine, passing it the Vector *
in pVector
.
member()
int member(pVector, n) Set::Bit pVector int n
and intersect()
Set::Bit intersect(pA, pB) Set::Bit pA Set::Bit pB
In each case, the first argument is a reference to the object on which the method was invoked. intersect()
creates and returns a new Set::Bit
object; the mechanics of this are the same as those described above for new()
.
union
union()
is a bit different. We can't have a C subroutine named union
, because union
is a keyword in C. I named the C subroutine Union
, but I want the Perl method to be named union
, so that the Perl interface will have a consistent naming scheme. With just a little more XS code, we can name our method and call it, too.
Set::Bit union(pA, pB) Set::Bit pA Set::Bit pB CODE: RETVAL = Union(pA, pB); OUTPUT: RETVAL
We name the XS routine union
, and then add a CODE:
directive so that we can write the C call to Union
.
DESTROY
Set::Bit
object goes away, Perl deletes the object. The object itself is a Perl scalar: Perl allocated the memory for it in the call to sv_setref_pv()
, and Perl will free that memory.
The Set::Bit
object holds a pointer to a Vector
struct. Perl doesn't know anything about the Vector
struct. We allocated it, and we are responsible for freeing it. We do this in the DESTROY
method.
Perl calls the DESTROY
method for us after the last reference to the Set::Bit
object goes away, and before it deletes the Set::Bit
object. Aside from the fact that Perl calls DESTROY()
automatically, it is an ordinary method. Here is the XS code for DESTROY()
.
void DESTROY(pVector) Set::Bit pVector
As with the other instance methods, the glue routine extracts the Vector *
from the Perl stack and passes it to our C-language DESTROY()
routine. Our C code then frees the storage for the Vector
.
Vector *
. However, we can use accessor methods (written in C) to get instance data from an object, and then operate on that data in Perl.
member()
is an accessor method: it returns true iff its argument is an element of the set. With the member()
method, we can write other methods in Perl. For example, the elements()
method returns a list of the elements of a set
sub elements { my $set = shift; grep { $set->member($_) } 0..$set->top }
and the print()
method returns a string containing a print representation of a set
sub print { my $set = shift; join ", ", $set->elements }
Given a suitable set of accessor methods, the choice of whether to implement other methods in C or in Perl can be made on the basis of performance and convenience.
Set::Bit
object to be the C-language Vector
struct, rather than a Perl data object. It didn't work out that way. The Set::Bit
object is indeed a Perl data object: it is the scalar created by sv_setref_pv()
.
Nonetheless, the Set::Bit
object gives us the essential features of a C-language object
Vector *
, passed as the first argument
At the same time, the Set::Bit
object gives us the flexibility to write methods in Perl. It is, perhaps, the best of both worlds.
Math::Ackermann
Here are sources and distribution for Set::Bit
Align::NW
Align::NW
module does Needleman-Wunsch global optimal
sequence alignment. It represents data in both C and Perl, and has
methods written in both C and Perl. We'll discuss the techniques used
to implement it—next month.
Mr. E.V. Lambert of Homeleigh, The Burrows, Oswestly, has presented us with a poser. We do not know which bush he is behind, but we can soon find out.- Monty Python, How Not to be Seen
<BOOM>
<BOOM>
<BOOM> <yyaaahhhhh!>
Yes, it was the middle one.
Bit.xs
.