Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.
This article is in five parts
November | Introduction | motivation, definitions, examples |
December | Architecture | the Perl interpreter, calling conventions, data representation |
January | Tools | h2xs, xsubpp, DynaLoader |
February | Modules | Math::Ackermann , Set::Bit |
March | Align::NW |
Needleman-Wunsch global optimal sequence alignment |
Align::NW
Align::NW
, an XS module to do global optimal sequence alignment.
Given two sequences
abcdeffh bdefghijkl
we want to find the best way to align them over their entire lengths: something like this
abcdeffh | ||| | b.defghijkl
This alignment has 5 matches, one mismatch, and one gap of length 1. We score the alignment according to a payoff matrix
match | 4 |
mismatch | -3 |
gap open | -2 |
gap extend | -1 |
and find that the alignment has a score of
5*4 + 1*(-3) + 1*(-2-1) = 20 - 3 - 3 = 14
Out of all possible alignments, we want to find the one with the highest score for a given payoff matrix.
We'll refer to the sequences as A
and B
, and to their lengths as lA
and lB
, respectively. The number of possible alignments is exponential in lA
and lB
, so enumerating and scoring them all in order to find the best one is obviously intractable.
b | c | d | e | |
a | ||||
b | ||||
c | ||||
d |
Then it fills in the matrix with scores, working from top left to bottom right.
b | c | d | e | |
a | -3 | -3 | -3 | -3 |
b | 4 | 1 | 0 | -1 |
c | 1 | 8 | 5 | 4 |
d | 0 | 5 | 12 | 9 |
Finally, it searches for the highest scoring cell on the bottom and right edges, and backtracks from that cell through the matrix to recover the alignment.
abcd ||| bcde
The rules for scoring the matrix and backtracking aren't too difficult to explain, but we needn't concern ourselves with them. Interested readers can glean them from the Align::NW
sources, or read them directly in the references given in the module POD. Proving that these rules actually lead to the optimal alignment is harder: Needleman and Wunsch got their names on the algorithm for doing that.
Align::NW
.
A Perl implementation of an XS module isn't always feasible, but
having one can be very useful. It allows us to address design issues in Perl, without getting tangled up in C language coding. Here is a straight Perl implementation of Align::NW
.
The new
method creates a new Align::NW
object, like this
my $nw = { a => $a, b => $b, rows => $rows, cols => $cols, dp => $dp, payoff => $payoff, options => \%options };
$a
and $b
are the two sequences to be aligned. $dp
is the score matrix; it is indexed like this
$cell = $dp->[$row][$col];
The entries in the score matrix are cells. Each cell looks like this
$cell = { row => $row, col => $col, score => $score, prev => $prev };
A cell knows its own row and column. It also has a score, and a reference to a previous cell in the matrix. The prev
references are used to backtrack through the matrix after all the scores are filled in.
score
method fills in the matrix. There are lA*lB
cells in the matrix. Furthermore, computing the score for each cell requires a loop over rows and a loop over columns, so filling in the score matrix requires O(lA*lB*(lA+lB))
operations. This is much better than exponential, but still bad enough that it is worth implementing score
in C.
The align
method backtracks through the matrix and generates the alignment. Backtracking is conceptually simple, but rather intricate to implement. Furthermore, it runs in O(lA+lB)
. Implementing align
in C would be difficult, and wouldn't improve performance.
score
method, together with a main()
to drive it. It doesn't align the sequences; it doesn't handle input or output; it doesn't have options. All it does is fill in the score matrix.
Converting a method from Perl to C raises two distinct issues
We could convert the code and not the data. C code can operate directly on Perl data: that's what the Perl C API is for. However, accessing Perl data is expensive, whether the Perl interpreter does it or our own C code does it. To realize big performance gains in the NW algorithm, we need to convert the data to C, as well.
typedef struct { int match; int mismatch; int open; int extend; } Payoff; typedef struct Cell { int row; int col; int score; struct Cell *prev; } Cell; typedef struct { Payoff payoff; char *pszA; char *pszB; int nRows; int nCols; Cell **ppCell; } NW;
The Cell
and NW
structs absolutely have to be in C for performance. The Payoff
struct doesn't: it's values are constant, so we could leave it in Perl and have score
make a local copy.
If we leave Payoff
in Perl, then score
has to access it through the Perl C API; if we convert it to C, then our Perl code has to access it through an XS interface. This is ultimately a question of style and convenience. In this implementation, I converted Payoff
to C.
Payoff
struct.
Payoff *new (int match, int mismatch, int open, int extend); void DESTROY(Payoff *pPayoff);
Since we are coding in C, and not Perl, we are responsible for our own memory management. The new
routine allocates storage for the struct and initializes it; the DESTROY
routine frees the storage.
The interface for the Matrix
struct is similar
Matrix *new (char *pszA, char *pszB, Payoff *pPayoff); void DESTROY(Matrix *pMatrix); void score (Matrix *pMatrix); Cell *cell (Matrix *pMatrix, int nRow, int nCol); void dump (Matrix *pMatrix);
In addition to new
and DESTROY
routines, it has score
to fill in the score matrix, cell
to return a pointer to a particular cell in the matrix, and dump
for debugging.
The interface for the Cell
struct is very simple. The new
routine for the Matrix
struct also allocates and frees the cells in the matrix. All that remains for the Cell
interface is accessors
int row (Cell *pCell); int col (Cell *pCell); int score(Cell *pCell); Cell *prev (Cell *pCell);
These interfaces can't stand as written, because there are two routines named new
and two named DESTROY
. We can fix this with the common C hack of prefixing each routine with the name of the struct on which it operates
Payoff *payoff_new (int match, int mismatch, int open, int extend); void payoff_DESTROY(Payoff *pPayoff); int cell_row (Cell *pCell); int cell_col (Cell *pCell); int cell_score (Cell *pCell); Cell *cell_prev (Cell *pCell); Matrix *matrix_new (char *pszA, char *pszB, Payoff *pPayoff); void matrix_DESTROY(Matrix *pMatrix); void matrix_score (Matrix *pMatrix); void matrix_dump (Matrix *pMatrix); Cell *matrix_cell (Matrix *pMatrix, int nRow, int nCol);
Align::NW
. To improve performance, we decided to reimplement the score
method in C. The score
method needs to bring its data along with it, so we created three C language structs. Each struct needs an interface so that we can manage it from Perl. Among them, these interfaces have 11 entry points. Each entry point will need XS code to connect it to Perl.
We could simply install all 11 entry points in the Align::NW
package, but that lacks structure. Instead, we'll treat each struct as an object, and regard the subroutines that operate on them as methods.
In Perl, a package provides a namespace for the methods of a class. We create a separate package for each struct, and install the routines for each struct in the corresponding package. These structs are all part of the Align::NW
module, so we nest the packages in the Align::NW
namespace
Align::NW::Matrix
Align::NW::Cell
Align::NW::Payoff
h2xs
.../development>h2xs -A -n Align::NW .../development>cp score.c Align/NW/ .../development>cp score.h Align/NW/
h2xs
generates NW.xs
for us
#include "EXTERN.h" #include "perl.h" #include "XSUB.h" MODULE = Align::NW PACKAGE = Align::NW
and we add
#include "score.h"
to bring in our header file.
MODULE
and PACKAGE
MODULE
and PACKAGE
directives. The MODULE
is correct as generated. The PACKAGE
directive is not. We actually need three different PACKAGE
directives, one for each of our C structs. We also enable prototypes for each package.
MODULE = Align::NW PACKAGE = Align::NW::Matrix PROTOTYPES: ENABLE MODULE = Align::NW PACKAGE = Align::NW::Cell PROTOTYPES: ENABLE MODULE = Align::NW PACKAGE = Align::NW::Payoff PROTOTYPES: ENABLE
The xsubs for each package will be placed after the corresponding PACKAGE
directive. Even though we are writing the Align::NW
module, we don't need an Align::NW
PACKAGE
directive, because we aren't installing any xsubs in the Align::NW
package.
xsubpp
how to convert between our C data types and Perl data types. A review of the C interface shows that we aren't actually passing any structs across the interface, but rather pointers to structs. As discussed last month, we use the XS type T_PTROBJ
to convert C struct pointers to Perl objects
Cell * T_PTROBJ Matrix * T_PTROBJ Payoff * T_PTROBJ
matrix_new
is
Matrix *matrix_new(char *pszA, char *pszB, Payoff *pPayoff);
In XS, we write this as
Matrix * matrix_new(pszA, pszB, pPayoff) char * pszA char * pszB Payoff * pPayoff
As written, this xsub will work, but it won't do everything that we want.
matrix_new
as a constructor
$matrix = matrix_new Align::NW::Matrix $a, $b, $payoff;
When we do this, Perl passes the package name as the first argument. We need to declare this argument, and then add a CODE
directive so that we can ignore it
Matrix * matrix_new(package, pszA, pszB, pPayoff) char * package char * pszA char * pszB Payoff * pPayoff CODE: RETVAL = matrix_new(pszA, pszB, pPayoff); OUTPUT: RETVAL
xsubpp
will generate code to install matrix_new
in the Align::NW::Matrix
package, because we placed the matrix_new
xsub after the
MODULE = Align::NW PACKAGE = Align::NW::Matrixline in
NW.xs
. However, the pointer that matrix_new
returns is blessed into a package that is named after the return type of the matrix_new
xsub. As written, that type is Matrix *
, and the corresponding Perl package is MatrixPtr
, not Align::NW::Matrix
.
There are two issues here
Align::NW::
prefixPtr
suffix
For an external interface—one used by applications—I would want to suppress the Ptr
suffix. It adds visual clutter and exposes an implementation detail. However, this is an internal interface: it is used only by the Align::NW
module. In this context, the Ptr
suffix isn't entirely extraneous: it helps us remember what sort of objects we are dealing with. I'll let it stand.
We definitely want the Align::NW::
prefix. Otherwise, we pollute the top-level package namespace with the names of our C structs.
Taking both the prefix and the suffix, we get a package name of Align::NW::MatrixPtr
. To get our Matrix *
blessed into this package, we declare the xsub like this
Align::NW::Matrix * matrix_new(package, pszA, pszB, pPayoff) ...
xsubpp
will convert this to the C language declaration
Align__NW__Matrix *matrix_new(....);
We can accommodate this in our C code by adding a typedef to score.h
typedef Matrix Align__NW__Matrix;
We want our methods to be in the same package as our objects, so we change the PACKAGE
directive to match
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr
All the same arguments apply to the Score
and Payoff
structs, and we use corresponding typedef
s and PACKAGE
directives for them.
$matrix = matrix_new Align::NW::MatrixPtr $a, $b, $payoff;
the matrix_
prefix is ugly and redundant. Furthermore, we have a method named matrix_DESTROY
that serves as a destructor. This won't work at all: Perl expects the destructor for an Align::NW::MatrixPtr
object to be named Align::NW::MatrixPtr::DESTROY
, not Align::NW::MatrixPtr::matrix_DESTROY
.
We use a PREFIX
directive to suppress the matrix_
prefix in Perl code. The PREFIX
directive goes on the same line as the MODULE
and PACKAGE
directives. If we write
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr PREFIX = matrix_
then xsubpp
deletes the string matrix_
from the beginning of each xsub name before it installs it. The name of our constructor is then Align::NW::MatrixPtr::new
, and we can write
$matrix = new Align::NW::MatrixPtr $a, $b, $payoff;
Here are the final MODULE
lines for our three C structs
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr PREFIX = matrix_ MODULE = Align::NW PACKAGE = Align::NW::CellPtr PREFIX = cell_ MODULE = Align::NW PACKAGE = Align::NW::PayoffPtr PREFIX = payoff_
The remaining xsubs are straightforward. Align::NW::PayoffPtr::payoff_new
is a constructor, so it needs a CODE
directive, like Align::NW::MatrixPtr::matrix_new
. For the other xsubs, all we have to write is the signature.
Align::NW::
prefixes and accepted the Ptr
suffix. We need to change the entries in the typemap to match
Align::NW::Cell * T_PTROBJ Align::NW::Matrix * T_PTROBJ Align::NW::Payoff * T_PTROBJ
OBJECT
key to the WriteMakefile()
call in Makefile.PL
OBJECT => 'NW.o score.o',
and do
.../development/Align/NW>perl Makefile.PL .../development/Align/NW>make test
The results should be
t/nw................ok All tests successful.
Align::NW
Math::Ackermann
, Bit::Vector
, and Align::NW
illustrate three different ways to write XS modules. All of them can serve as models for your modules.
dynamic programming
Align::NW
Align::
is probably too generic a term for a top-level name in the CPAN, but it is adequate for our purposes. If we implemented the related Smith-Waterman algorithm for local optimal sequence alignment, we could call it Align::SW
.
dp
dp
stands for dynamic programming.