Randomly shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. In many cases, biologists need sophisticated shuffling tools that preserve not only the counts of distinct letters but also higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts. We present a sequence analysis tool named uShuffle for generating uniform random permutations of biological sequences (such as DNAs, RNAs, and proteins) that preserve the exact k-let counts. The uShuffle tool implements the latest variant of the Euler algorithm and uses Wilson's algorithm in the crucial step of arborescence generation. It is carefully engineered and extremely efficient. The uShuffle tool achieves maximum flexibility by allowing arbitrary alphabet size and let size. It can be used as a command-line program, a web application, or a utility library. Source code in C, Java, and C#, and integration instructions for Perl and Python are provided.
To compile the uShuffle C library and the command-line tool in a Unix-like environment, download the three files main.c, ushuffle.c, and ushuffle.h, then run the following commands:
$ gcc -O3 -c ushuffle.c
$ gcc -o main.exe -O3 main.c ushuffle.o
The uShuffle C library (ushuffle.c and ushuffle.h) exports the following functions as the programming interface:
void shuffle(const char *s, char *t, int l, int k);
void shuffle1(const char *s, int l, int k);
void shuffle2(char *t);
typedef long (*randfunc_t)();
void set_randfunc(randfunc_t randfunc);
The function shuffle accepts four parameters: s is the sequence to be permuted, t is the output random sequence, l is the length of s, and k is the let size. The function shuffle simply calls shuffle1 first and shuffle2 next. The function shuffle1 implements the construction of the directed multigraph. The function shuffle2 implements the loop-erased random walks in the directed multigraph and the generation of the random sequence. The function set_randfunc sets the random number generator.
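To make the meaning of "preserving the k-let counts" concrete, the following is a minimal sketch of a check that two sequences of the same length contain the same multiset of k-lets. The helper same_klet_counts is hypothetical and is not part of the uShuffle library; its parameters s, t, l, and k are named after those of shuffle so that it can be applied to an input sequence and the output that shuffle writes into t.

```c
#include <stdlib.h>
#include <string.h>

/* Let size read by the comparator. It is file-scope because the qsort
   comparator receives no context argument; this makes the sketch
   non-reentrant. */
static int klet_size;

/* Compare two k-lets, each given as a pointer into a sequence. */
static int cmp_klet(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    return memcmp(x, y, (size_t)klet_size);
}

/* Hypothetical helper, NOT part of uShuffle.
   Returns 1 if the length-l sequences s and t contain exactly the same
   k-lets with the same multiplicities, 0 if they differ, and -1 if
   memory allocation fails. */
int same_klet_counts(const char *s, const char *t, int l, int k)
{
    int n = l - k + 1;   /* number of k-lets in a sequence of length l */
    int i, same = 1;
    const char **ps, **pt;

    if (k < 1 || n < 1)
        return 1;        /* no k-lets to compare */

    ps = malloc((size_t)n * sizeof *ps);
    pt = malloc((size_t)n * sizeof *pt);
    if (ps == NULL || pt == NULL) {
        free(ps);
        free(pt);
        return -1;
    }

    /* Each k-let is identified by a pointer to its first letter. */
    for (i = 0; i < n; i++) {
        ps[i] = s + i;
        pt[i] = t + i;
    }

    /* Sort both lists of k-lets, then compare them position by position. */
    klet_size = k;
    qsort(ps, (size_t)n, sizeof *ps, cmp_klet);
    qsort(pt, (size_t)n, sizeof *pt, cmp_klet);
    for (i = 0; i < n && same; i++)
        same = (memcmp(ps[i], pt[i], (size_t)k) == 0);

    free(ps);
    free(pt);
    return same;
}
```

For example, "AGATA" is a permutation of "ATAGA" with the same doublets (AT, TA, AG, GA) but different triplets, so the check succeeds for k = 2 and fails for k = 3. Sorting pointers keeps the sketch independent of the alphabet size, at the cost of O(l log l) comparisons.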
The executable file main.exe is the command-line tool. It has four options:
-s [string]     specifies the input sequence
-n [number]     specifies the number of random sequences to generate
-k [number]     specifies the let size to preserve
-seed [number]  specifies the seed for the random number generator
Note that the command-line tool is meant to be a minimal example of using the uShuffle C library in your own program. If you need more functions such as file input in FASTA format, please feel free to extend it with additional options.
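As one possible starting point for such an extension, the sketch below reads the first record of a FASTA file into a single string. The function read_first_fasta is hypothetical and is not part of uShuffle; it assumes the usual FASTA layout, in which a header line beginning with '>' is followed by one or more lines of sequence letters.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of uShuffle.
   Reads the first record of a FASTA stream and returns its letters as a
   malloc'ed, NUL-terminated string that the caller must free.
   Header lines (starting with '>') and all whitespace are dropped.
   Returns NULL if memory allocation fails. */
char *read_first_fasta(FILE *fp)
{
    size_t len = 0, cap = 1024;
    char *seq = malloc(cap);
    int c, at_line_start = 1, in_record = 0;

    if (seq == NULL)
        return NULL;

    while ((c = fgetc(fp)) != EOF) {
        if (at_line_start && c == '>') {
            /* A header after some sequence data starts the next record. */
            if (in_record || len > 0)
                break;
            in_record = 1;
            /* Skip the rest of the header line. */
            while ((c = fgetc(fp)) != EOF && c != '\n')
                ;
            at_line_start = 1;
            continue;
        }
        at_line_start = (c == '\n');
        if (isspace(c))
            continue;
        /* Grow the buffer so that the letter and the final NUL fit. */
        if (len + 1 >= cap) {
            char *tmp;
            cap *= 2;
            tmp = realloc(seq, cap);
            if (tmp == NULL) {
                free(seq);
                return NULL;
            }
            seq = tmp;
        }
        seq[len++] = (char)c;
    }
    seq[len] = '\0';
    return seq;
}
```

The returned string and its length (from strlen) can then be passed as the s and l arguments of shuffle, with an output buffer of the same length for t. Handling of multiple records, lowercase letters, or ambiguity codes is left out of this sketch.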
To compile the Java program and run it locally in the applet viewer, download the three files UShuffle.java, ushuffle.manifest, and ushuffle.html, then run the following commands:
$ javac -deprecation *.java
$ jar -cmf ushuffle.manifest ushuffle.jar *.class
$ appletviewer -J-mx1000m ushuffle.html
Note that the -J-mx1000m option for appletviewer sets the Java heap size. To learn more about this, follow the link How to Increase Java Heap Size on Windows Platform.
The Java program can also be run as a command-line tool with exactly the same options as the C command-line tool. The public methods for the uShuffle Java library have slightly different signatures:
public void shuffle(char[] s, char[] t, int l, int k);
public void shuffle1(char[] s, int l, int k);
public void shuffle2(char[] t);
public void set_randfunc(Random rand);
The C# program UShuffle.cs is ported from the Java program; it is almost identical to the Java version except that it doesn't have an applet interface.
To build the Perl module, download and unzip the file perl.zip, then run the command make. To verify that the module is built properly, run the command perl test from the directory Ushuffle. To install the module on your system, run the command make install from the directory Ushuffle.
To build and install the Python module, follow the link How to Use uShuffle in Python.
To repeat the experiments described in our paper, download and unzip the file experiments.zip, then follow the instructions in the README file.
Minghui Jiang, James Anderson, Joel Gillespie, and Martin Mayne. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics, 9:192, 2008.
From left to right: James, Martin, Joel, and Minghui. The eight letters shown on the four sheets of paper are a permutation of "uShuffle".
Last modified: Thu Apr 29 10:22:24 MDT 2010