Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.
This article is in five parts
|motivation, definitions, examples
|the Perl interpreter, calling conventions, data representation
|h2xs, xsubpp, DynaLoader
|Needleman-Wunsch global optimal sequence alignment
Narrowly, XS is the name of the glue language that is used to specify the subroutine interfaces and data conversions necessary to call C from Perl. More broadly, XS comprises a system of programs and facilities that work together to make this happen:
DynaLoader, and the XS language itself. We'll talk about all these later on.
More generally, Perl is an applications programming language. It provides powerful facilities, such as automatic data typing, automatic memory management, hash tables, and regular expressions. These make it easy to bolt together applications, without having to attend to every little detail. The tradeoff is that these facilities have substantial run-time costs.
In contrast, C and C++ are examples of systems programming languages. They provide control over every CPU cycle and every byte, so that inner loops can be fast and critical data structures can be small. The tradeoff is that you have to program every CPU cycle and every byte in the entire program: even parts that aren't processor bound.
XS lets us have the best of both worlds. With XS, we can use Perl for the bulk of our code, and C for just those parts that require fine control over system resources.
If you just want to get the job done, consider using the Simplified Wrapper and Interface Generator (SWIG). SWIG is a software development tool that connects assorted application programming languages, such as Perl, Python, and Tcl, to assorted system programming languages, such as C, C++, and Objective-C.
SWIG is very easy to use. At its simplest, you just hand it your
.c file, tell it what your application language is, and it does the rest. Here's an example, adapted from the SWIG documentation:
unix> swig -perl5 -module example example.c unix> gcc -c example.c example_wrap.c unix> ld -G example.o example_wrap.o -o example.so unix> perl5.005 use example; print example::factorial(4), "\n"; <ctrl-d> 24
I could write a tutorial on SWIG, but it would be superfluous: SWIG already has extensive documentation. SWIG is available online, it is free, and it works. If you just want to get the job done, this SWIG is for you.
The first is that the core Perl docs, such as perlxs and perlguts, tacitly assume that you already understand XS. Accordingly, they omit or gloss over crucial assumptions and background information. This sounds bad, but it is actually rather common in the Unix world.
The second is that you can't learn XS. Not as such. Not from the top down. This problem is much more profound than the first, and it stems not from any inadequacy in the documentation, but from what XS is—and isn't.
The Perl docs refer to XS as a language, but it isn't. XS is a
collection of macros. The XS langauge processor is a program called
xsubpp, where pp is short for PreProcessor, and PreProcessor is a polite term for macro expander.
xsubpp expands XS macros into the bits of C code necessary to connect the Perl interpreter to your C-language subroutines.
Because XS isn't a language, it lacks structure. The underlying C code has structure, but you can't see it, because it is hidden behind the macros. This makes it virtually impossible to learn XS on its own terms.
Once you understand all this, you don't strictly need XS: you can code directly to the Perl C API, and your C code will link and run under the Perl interpreter.
If you do code directly to the Perl C API, you will find that it is difficult, error-prone, tedious, and repetitive. You keep writing the same little bits of code to move parameters on and off the Perl stack; to convert data from Perl's internal representation to C variables; to check for null pointers and other Bad Things. When you make a mistake, you don't get bad output: you crash the interpreter.
Now you understand XS.
My favorite number cruncher used to be the Fast Fourier Transform, but as I think of it now, it seems—well—dated. It's so classical, so linear, so old-tech. Besides, it runs in O(n*log(n)), which is almost tractable in Perl.
Instead, I'm going to code up the Needleman-Wunsch (NW) dynamic programming algorithm for global optimal sequence alignment. Sequence alignment is an important problem in the bleeding-edge field of genomics. Here is
Sequence alignment is a combinatorial problem, and naive algorithms run in exponential time. The Needleman-Wunsch algorithm runs in (more or less) O(n^3), which is still bad enough that the genomics community uses specialized hardware and networked databases to do their alignments.
As a benchmark, I aligned 2 sequences of 200 characters each. This is a rather modest problem by the standards of genomics. The Perl implementation aligns them in something like 200 or 400 seconds. The exact time is irrelevant: it takes longer than I am willing to wait.
The O(n^3) step in the NW algorithm is filling in the score matrix; everything else runs in linear time. I wrote a C program that fills in the score matrix.
It runs the benchmark 200x200 alignment in 3 seconds, or about 100 times faster than the Perl implementation.
I don't want to rewrite the rest of the Perl implementation in C. Parts of the algorithm are intricate, and it relies heavily on Perl for housekeeping and memory management. It is the sort of code that is a joy in Perl and a burden in C.
Instead, I want to use the C implementation to fill in the score matrix, use the Perl implementation for everything else, and use XS to call from one to the other. In the next four parts of this article, we'll see how to do that.
There is a lot of material to cover, but either you understand the architecture beneath XS, or you are crunchy and good with ketchup.
awk(1) man page.