Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.

XS Mechanics

This article is about XS. It explains what it is, why it is, how it works, and how to use it. It includes a complete, working example of an XS module, and a stub module that you can use as a starting point for your own code. It is an express goal of this article to provide the background and information necessary for you to write your own XS modules.

This article is in five parts

November Introduction motivation, definitions, examples

December Architecture the Perl interpreter, calling conventions, data representation

January Tools h2xs, xsubpp, DynaLoader

February Modules Math::Ackermann, Set::Bit

March Align::NW Needleman-Wunsch global optimal sequence alignment

Concepts

I claimed last month that XS must be learned from the bottom up. The bottom turns out to be the von Neumann architecture for stored-program computers, and it's a long climb up from there. Instead of taking that path, we'll start at the top, and work our way down by analysis. This will give us the concepts that we need to understand XS.

From the top, the problem is how to call a C subroutine from Perl. In order to do this, two things must happen:

control must pass from Perl to C (and back)
data must pass from Perl to C (and back)

We'll refer to these as control flow and data flow, respectively. In order to understand control and data flow, we must first understand

C data representation
Perl data representation
C program execution
Perl program execution

Finally, underlying all this is the fact that ultimately connects Perl to C: the Perl interpreter is a C program. We'll look at these ideas in more detail below.

Data representation

We'll start by describing data representation in C and Perl.

C Data representation

C programs generally represent data in native machine formats. These are compact, efficient, and minimal. Here are some typical C data formats

Data type	Format
character	a single byte
integer	32-bit 2s complement
double	80 bits, allocated to sign, exponent and mantissa
function pointer	address of the function entry point

Complex data types are represented as aggregates of simple data types. For example, the C type

int x[2];

is represented by 8 bytes in memory: 4 bytes for x[0], followed immediately by 4 bytes for x[1]. Similarly, the C type

struct S
{
    int	  a;
    char  b[4];
}

is represented by 8 bytes in memory: 4 bytes for a, followed immediately by 4 bytes for the 4 elements of b.

This sort of data representation has advantages and disadvantages.

Advantages

The biggest advantage is efficiency. Data occupies the least amount of storage possible: an 8-bit character occupies 8 bits; an array of 2 32-bit integers occupies 64 bits.

Also, the data is stored in the formats that the CPU uses. This makes C code run fast, because the CPU can operate directly on the C representation, without doing any conversion or translation.

Other advantages of C data representation include simplicity and transparency.

Disadvantages

The disadvantage of C data representation is that data doesn't describe itself: it contains no metadata.

For example, given 8 bytes in memory, there is no way to tell—from the bytes themselves—whether they contain an instance of int x[2], or struct S, or some other data type, or no meaningful data at all.

All the information about data is implicit in the executable code of the program. The thing that makes 8 bytes of memory a struct S is the fact that the program treats it as such: the fact that the program puts an int in the first 4 bytes, and 1 character in each of the second 4 bytes.

Two specific disadvantages of C data representation are

it tends to push low-level data management issues up into the application
it makes debugging harder

Perl Data representation

In Perl, data knows about itself. This makes Perl data representation much more complex than C data representation. A complete description of Perl's internal data representation is far beyond the scope of this article. Instead, I'll give a sketch of the general approach, and mention just a few of the data types that appear in the Perl C API.

Every data object in a Perl program is represented internally by a C struct. (Remember, the Perl interpreter is a C program.) For example, a scalar looks something like this

typedef enum 
{
    IOK = 0x01,		/* has valid integer value */
    POK = 0x02,		/* has valid string  value */
} Flags;

struct Scalar
{
    int   refcnt;	/* how many references to us 	*/
    Flags flags;	/* what we are			*/
    char *pv;		/* pointer to malloc'd string	*/
    int	  cur;		/* length of pv as a C string	*/
    int	  len;		/* allocated size of pv		*/
    int	  iv;		/* integer value		*/
};

The Scalar struct allows the interpreter to manage the type information for each scalar, so that programmers don't have to. For example, suppose we write

my $x = 42;

When the interpreter executes this statement, it

allocates a Scalar struct
sets the refcnt field to 1
sets the iv field to 42
sets the IOK flag
clears the POK flag

If we later write

print "$x";

the interpreter

allocates space for pv
calls sprintf(pv, "%d", iv) to convert iv to a string
sets the POK flag

refcnt starts out at 1 because $x refers to the scalar. Whenever a reference to the scalar is created, the interpreter increments refcnt; whenever a reference to the scalar goes away, the interpreter decrements refcnt. When the last reference to the scalar (including $x) goes away, refcnt goes to zero and the interpreter frees the Scalar struct.

Arrays are represented by a struct (declared in av.h). Hashes are represented by a struct (declared in hv.h). Code refs are represented by a struct (declared in cv.h). Every data object in Perl is represented by a struct of one sort or another.

Advantages

This kind of data representation has some big advantages for the programmer.

It frees the programmer from data typing. The interpreter knows the type of every data object; it converts types as necessary.

It also frees the programmer from storage allocation. The interpreter knows the size and location of every data object. It allocates, resizes and frees them as necessary. No memory leaks, no wild or dangling pointers. No C programmer's disease.

Disadvantages

This kind of data representation has one big disadvantage: efficiency. All that meta-data consumes bytes of storage, and the interpreter consumes CPU cycles as it navigates the data structures, checking flags and converting data types on the fly. Every program is different, but it is not unusual for Perl data to require an order of magnitude more RAM and CPU than the corresponding C data. This is one of the primary motivations for writing XS modules.

PerlGuts

XS code requires access to Perl data objects, but it does not manipulate the underlying C structs directly. Neither, for the most part, does the Perl interpreter. Instead, access to Perl data objects is provided by a large collection of macros and subroutines. These are declared in the various .h files in /usr/local/lib/perl/version/architecture/CORE/, and are documented in perlguts. They constitute Perl's C language Application Programming Interface (API).

If you plan to write XS code, you should read perlguts to get an idea of how the API is organized, and what kinds of facilities it provides. When you actually write XS, you need to have perlguts at hand as a reference.

Program Execution

Now, we'll look at how C and Perl execute code.

C Program Execution

Like C data representation, C program execution is simple, minimal, and efficient. The C compiler translates C program text into a sequence of native machine instructions. The CPU executes these instructions in hardware.

A dedicated CPU register, called the program counter (PC), holds the address of the next instruction to execute. In normal control flow, the CPU increments the PC as each instruction is executed. This causes the CPU to execute instructions in the order that they appear in memory.

C control structures, such as for, while, and if () then; else;, require the CPU to execute jumps: from the bottom of a loop back to the top; around then clauses and else clauses that aren't taken. A jump instruction contains the address of the next instruction to execute; this is called the target of the jump. When the CPU executes a jump, it loads the target address directly into the PC, and execution continues there.

Subroutine calls

Subroutine calls are a special kind of jump. Most processors provide some hardware support for subroutine calls, including

a call instruction
a return instruction
a stack in memory, called the processor stack
a CPU register dedicated as a stack pointer (SP)

The first instruction of a subroutine is called its entry point. A call instruction contains the address of the entry point.

To execute a call instruction, the CPU first increments the PC to obtain the address of the next instruction. This is called the return address. However, it does not execute the next instruction. Instead, it pushes the return address onto the processor stack. Then, it loads the entry point from the call instruction into the PC and begins executing the subroutine.

At the end of the subroutine is a return instruction. When the CPU executes a return instruction, it pops the return address off of the stack, loads it into the PC, and resumes execution with the instruction that follows the original call instruction.

The CPU keeps return addresses on a stack so that it can handle nested—and especially, recursive—subroutine calls.

Parameter passing

In addition to calling and returning from subroutines, programs want to pass parameters into subroutines, and get return values back.

Parameters are passed on the same stack as the return address. The compiler generates code that evaluates each parameter and pushes the parameter value onto the stack. After all the parameters are on the stack, the call instruction is issued.

The subroutine can't pop its parameters off the stack: if it did, it would lose the return address. However, it knows the location of each parameter as an offset from the stack pointer, so it can access its parameters directly on the stack.

A C subroutine can return only one value. For simplicity and efficiency, the subroutine may pass the return value from the subroutine back to the caller in a CPU register. Alternatively, the caller can leave space for the return value on the stack. This is necessary if the return value is too big to fit into a register.

Perl Program execution

The execution of a C program divides cleanly into two phases

the compiler translates the source code into machine code
the CPU executes the machine code

The execution of a Perl program divides into two similar phases

the interpreter translates the source code into a syntax tree
the interpreter executes the syntax tree

However, the division between the two phases is not so clean

BEGIN blocks execute code during the translation phase
eval and s///e translate code during the execution phase.

The ability to cross the line between translation and execution is a powerful feature of Perl.

The Syntax Tree

The Perl interpreter translates Perl source code into a syntax tree. The nodes of the tree are called opcodes. These are not the opcodes of the CPU instruction set. Rather, they are C structs that represent the primitive operations in the Perl language. There are over 300 opcodes, representing

all the built-in functions, such as cos, open, and sort
all the operators, such as cmp, +, and <<
control structures, such as loops and conditionals
subroutine calls

The children of a node represent its operands: the condition of an if statement; numbers to be added; arguments to a subroutine.

The Stack Machine

After the interpreter builds the syntax tree, it executes the program by traversing the nodes of the tree. This is called walking the tree.

The interpreter walks the tree in postfix order. This means that it walks all the children of a node before it walks the node itself. Since the children of a node may have children of their own, this is a recursive process.

Walking a node typically yields a value; for this reason, we also speak of evaluating a node. The interpreter keeps these values on a stack. We'll call this the Perl stack, to distinguish it from the processor stack mentioned earlier.

The things on the Perl stack are not the C structs that represent Perl data, but pointers to those structs. This is important both for efficiency, and for obtaining the correct semantics for Perl programs. However, we'll still speak of the things on the Perl stack as values.

To evaluate a node, the interpreter pops the operands for the node off of the stack, carries out whatever operation is specified by the node, and then pushes the resulting value back onto the stack. In this way, the values of child nodes become operands for their parent. Programs that manage data on a stack in this way are sometimes called stack machines.

Example

Suppose we have the Perl statement

	$x = $y + 3;

This is parsed into a syntax tree like this

	   =
	  / \
	$x   +
	    / \
	  $y   3

This tree has 5 nodes. The table below shows their order of evaluation, and the contents of the stack after each node is evaluated.

Step 1 2 3 4 5
Node $x $y 3 + =
Stack \$x 42
\$x 3
42
\$x 45
\$x

Here is what happens in each step.

Since the = node will assign to $x, the interpreter pushes a reference to $x, not its value.
The value of $y turns out to be 42.
3 evaluates to 3.
The + node pops the values 3 and 42, adds them, and pushes the sum onto the stack.
The = pops the value 45 and the reference to $x, and carries out the assignment. Now the stack is empty.

Subroutine calls

Now that we understand something about Perl data and execution, we can look in detail at how Perl makes subroutine calls.

Perl Subroutines

The Perl interpreter creates a separate syntax tree for each subroutine in the program. Each syntax tree is managed by a code reference. A code reference is the thing that you get when you write

my $coderef = sub { ... }

sub foo { ... }
my $coderef = \&foo;

in Perl. Internally, a code reference is represented by (what else?) a C struct. Among other things, the code reference has a pointer to the root of the syntax tree for the subroutine. We'll call this the root pointer.

A subroutine call in the program source is represented in the syntax tree by a fragment that looks like this

	    entersub
	       |
	  +----+--...---+
	  |    |        |
	arg1 arg2 ... argN

The entersub opcode transfers control to the called subroutine. Its children evaluate the arguments of the call.

To execute a subroutine call, the interpreter first walks each child node, and pushes the result onto the Perl stack. When all the arguments are on the stack, the interpreter walks the entersub node.

An entersub opcode holds a pointer to a code reference. When the interpreter walks an entersub, it follows this pointer to the code reference, and then follows the root pointer in the code reference to the syntax tree for the subroutine. Then it executes the subroutine.

As mentioned above, the things on the Perl stack are not the C structs that represent the arguments of the subroutine, but rather, pointers to those structs. This is the meaning of the statement in perlsub that

The array @_ is a local array, but its elements are aliases for the actual scalar parameters.

In other words, Perl passes parameters by reference, like FORTRAN, and unlike C.

If the subroutine returns any values, it pushes pointers to them onto the stack, in the same locations where its parameters were. After the subroutine returns, the caller retrieves the return values from the stack.

eXternal Subroutines

One of the other things that a code reference has is a C function pointer. In other words, a code reference has a field that contains the address of the entry point of a compiled C subroutine. We'll call this the xsub pointer. The C subroutine itself is called an xsub.

When the interpreter executes an entersub opcode, it first checks the xsub pointer in the code reference. If the xsub pointer is null, it follows the root pointer to the syntax tree for the subroutine and walks it, as described above.

If the xsub pointer is not null, then the interpreter ignores the root pointer. Instead, it gets the address of the xsub from the xsub pointer, and calls the xsub. This is an eXternal Subroutine call. This how control passes from Perl to C. This is XS.

Loading, Linking, and Installation

We've seen how Perl can call a C subroutine through the xsub pointer in a code reference. Now we need to understand how Perl gets hold of a C subroutine in the first place.

In order for a C subroutine to become an xsub, three things must happen

loading: the subroutine has to be loaded into memory
linking: the Perl interpreter has to find its entry point
installation: the interpreter has to set the xsub pointer in a code reference to the entry point of the subroutine

Installation

Installation is the easiest. The Perl C API includes a routine

newXS(char *name, void (*fp)())

Given the name of a Perl subroutine in name, and the address of the entry point of a C subroutine in fp, newXS will install fp as the xsub pointer in the code reference for name. Once this happens, any Perl code that calls name() will invoke the C subroutine.

name can be in any package. To install a subroutine called new() in the package Align::NW, we pass the string "Align::NW::new" for name. If the Align::NW package implements a class, this has the effect of making the C subroutine a method of that class.

Loading and Linking

In order to understand loading and linking, we have to know where C subroutines come from.

C subroutines begin as source code in source files. The compiler translates source files into object files, and the linker combines object files to produce a link library. A link library contains

the executable code for each subroutine
a symbol table that maps the name of each subroutine to the offset of its entry point in the library

Static vs. Dynamic

At this point, we have two alternatives: static linking and dynamic linking.

Static linking requires us to rebuild the Perl interpreter. We compile the Perl source files into object files, and then link the Perl object files with our library to create a new version of the Perl executable.

With static linking, our C subroutines become a permanent part of the Perl interpreter. They are loaded into memory as part of the executable image whenever the interpreter runs. The interpreter knows where the entry points are because the linker resolves their addresses when it builds the executable.

Dynamic linking allows us to use the Perl interpreter unmodified. If a Perl program needs an xsub, it loads the library at run time and looks up the subroutine entry point in the symbol table. Different programs can install different xsubs.

We usually prefer dynamic linking, precisely because it allows us to write new XS modules without having to rebuild the Perl interpreter. However, XS modules can be statically linked. This is sometimes done for efficiency. A few modules (such as the DynaLoader) require static linking. Prior to Perl5, all language extensions required static linking.

Dynamic Linking

In principle, dynamic linking is simple. A link library is a file on disk. If we know its name, we can open it and read it into memory. If we know the name of a subroutine in the library, we can look it up in the symbol table and find its entry point.

In practice, dynamic linking is complex. One reason is that the internal format of link libraries is system dependent. Most systems provide facilities to load libraries and look up entry points. However, these facilities are also—you guessed it—system dependent. They all do more or less the same thing, but the names of the system calls and the precise semantics vary.

Even the names of the libraries are system dependent. In Unix systems, they are called shared objects, and are identified by a .so extension. In Windows, they are called Dynamic Link Libraries and are identified by a .dll extension. On Macintosh, they are called shared libraries, and are (usually) identified by a finder type of shlb.

Another issue is that many systems treat dynamic link libraries specially. For example, they may use virtual memory to map them into shared memory pages. This makes it essential that we use the proper facilities on each system for dynamic linking.

Finally, the system calls that do dynamic linking, as well as the newXS() call, are C language calls. This means that we have to call them from C code, not Perl code. We need an xsub to install an xsub.

Dynaloader

Fortunately, all this is done for us by a Perl module called the DynaLoader. When we write an XS module, our module inherits from DynaLoader. When our module loads, it makes a single call to the DynaLoader::bootstrap method. bootstrap locates our link libraries, loads them, finds our entry points, and makes the appropriate newXS() calls.

DynaLoader handles system dependencies by brute force. If you look in the Perl sources, you'll find about 10 different versions of the Dynaloader XS code. The Configure script chooses the right version for each system when the Perl interpreter is built.

Tracing the data paths through the Dynaloader to understand how it installs xsubs is left as an exercise for the reader.

Parameter Passing

The XS mechanism described above shows how Perl can call a C subroutine. However, we want to do more than that. We also want to pass parameters from Perl into C, and return values from C to Perl.

The problem is that the Perl interpreter puts parameters on the Perl stack, but a C subroutine expects to find parameters on the processor stack. Furthermore, the Perl parameters are pointers to C structs that represent Perl data, while C code expects parameters to be values in native hardware formats.

To solve this problem, something has to convert between Perl and C data representations. The Perl interpreter doesn't, so the xsub has to. Typically, the xsub uses facilities in the Perl C API to get parameters from the Perl stack and convert them to C data values. To return a value, the xsub creates a Perl data object and leaves a pointer to it on the Perl stack.

Coding directly to the Perl C API is hard. Furthermore, we often want to call existing C subroutines from Perl: subroutines that don't know about the Perl C API, and that do expect to find their parameters on the processor stack. We'll refer to an existing C subroutine as a target routine.

Rather than drag the Perl C API into all our C code, we usually write glue routines. The Perl interpreter calls the glue routine as an xsub. The glue routine converts the Perl parameters to C data values, and then calls the target routine, passing it the C data values as parameters on the processor stack. When the target routine returns, the glue routine creates a Perl data object to represent its return value, and pushes a pointer to that object onto the Perl stack. Finally, the glue routine returns control to the Perl interpreter.

Glue routines provide some structure for the data flow problem, but they are still hard to write. So we don't. Instead, we write XS code. XS is, more or less, a macro language. It allows us to declare target routines, and specify the correspondence between Perl and C data types. A program called xsubpp reads XS code and generates glue routines in straight C.

Much of the work, and much of the wizardry, of XS has to do with figuring out how to write XS code so that xsubpp will do the right thing. We'll talk about this in great detail—next month.

NOTES

a C program

There is a project to rewrite the Perl interpreter in C++. This may require technical changes to XS, but it shouldn't affect the overall architecture.

something like

To see what it really looks like, look at /usr/local/lib/perl/version/architecture/CORE/sv.h

is created

for example, by writing \$x

goes away

either by being reassigned or by passing out of scope

memory leaks

Circular structures, e.g. $x=\$x, prevent reference counts from going to zero. If the programmer does not break the cycle explicitly, the memory is not freed until the program terminates.

order of magnitude more

But don't underestimate the power that Perl brings to the table. I once worked on some database access code that had been implemented in both C and Perl. It was easier to make the relevant optimizations in Perl, and in the end, the Perl implementation ran faster than the C implementation.

increments

More precisely, the CPU adds the size of the current instruction to the PC to advance it to the next instruction. On many processors, different instructions have different sizes.

recursive

The IBM 360/370/3033/3081 series computers had no processor stack and no dedicated stack pointer. The CPU saved return addresses in general-purpose registers. Compilers typically copied the return address into a fixed memory location associated with the called routine. This supported nested, but not recursive, subroutine calls. It was adequate for languages like assembler, FORTRAN, and COBOL; it was less than adequate for languages like PL/I and C.

parameter value

This is what it means to pass parameters by value.

it knows

It knows because the compiler compiled the offsets into it.

root

It also has a pointer to the node where execution of the subroutine begins, which may not be the root node. We gloss over this distinction.

code reference

The indirection through the code reference is essential. The entersub node can't hold the root pointer itself, because someone might write

sub One { print "One\n" }
One();
eval 'sub One { print "Two\n" }';
One();

The interpreter can find the code reference through the symbol table and reassign the root pointer during the execution phase. In contrast, the interpreter has no way of finding an entersub node after the translation phase ends.

same locations

adjusting the stack pointer, if necessary

Prior to Perl5

It was actually worse than this. Prior to Perl5, all language extensions required changes to the Perl sources. Modifying language processors is never easy, so this was only done for Big Important Things, like database interfaces (sybperl, oraperl). Even then, it was considered unsatisfactory. One of the design goals of Perl5 was to allow people to extend the language without touching the core language processor.

exercise for the reader

extra credit: Figure out how the DynaLoader xsubs are installed.

XS Mechanics by Steven W. McDougall is licensed under a Creative Commons Attribution 3.0 Unported License.

Steven W. McDougall / resume / swmcd@theworld.com / 2000 February 11

November	Introduction	motivation, definitions, examples
December	Architecture	the Perl interpreter, calling conventions, data representation
January	Tools	h2xs, xsubpp, DynaLoader
February	Modules	`Math::Ackermann`, `Set::Bit`
March	`Align::NW`	Needleman-Wunsch global optimal sequence alignment