diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b4cdfa7c90f3ccb39a01372d7892fe597336463d --- /dev/null +++ b/README.md @@ -0,0 +1,319 @@ +ReAS: Recovery of Ancestral Sequences for Transposable Elements from the Unassembled Reads of a Whole Genome Shotgun + +Version: 2.02 + +Ruiqiang Li + +lirq@genomics.org.cn +Bioinformatics Department of Beijing Genomics Institute +Beijing, 101300 +China + +lirq@bmb.sdu.dk +Department of Biochemistry and Molecular Biology +University of Southern Denmark +Campusvej 55 +DK-5230 Odense M, Denmark + +1. Introduction: + +ReAS aims to recover ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. + +Transposable elements (TEs) make up a significant proportion of many eukaryotic genomes, totaling almost 45% for human, and +50% to 80% for rice, maize, and wheat. They play important evolutionary roles, and can be essential tools for genome +analyses. RepeatMasker (A. Smit and P. Green, unpublished data) is one of the most commonly used algorithms for detecting +TEs. It relies on a comparison to known TEs from libraries like Repbase, which represents many years of manual labor. +Computational tools like REPuter, RECON, and RepeatGluer automate this process. However, almost all of the genomes sequenced +today employ a whole genome shotgun (WGS) method that is incapable of assembling the most recent TEs, and any efforts to +force such an assembly together generally increase the probability of assembly errors. For example, in the mouse and rice +genomes, 15% and 14% of the reads were left out of the assemblies, respectively. Some of the unassembled reads are due to +centromeres or telomeres, but we know in rice that many are recent TEs. Unassembled reads are the most informative reads for +TE recovery as they are the least diverged from their ancestral sequences. Despite this fact, all of the above tools were +only tested on assembled genomes and it is not clear how effectively or efficiently they might incorporate the information in +the unassembled reads. Hence, we developed a new algorithm, ReAS (from ����recovery of ancestral sequences����), to produce +the requisite TE library using only the unassembled reads of a WGS. + +ReAS aims for transposable elements, but as a program, it will recognize all repeat elements which over represented in the +read set. There will of course be some artifacts. WGS could not be really random, some unique regions will be assembled as +repeats when sampling depth is over the threshold. Also some recently duplicated gene families or genome fragments may be assembled +out. rRNAs, centromere and telomere units were unavoidable. So please keep in mind that not all ReAS assembled sequences are +TEs. But good news is that after we tested in rice, seems very few coding genes were involved. And for rRNAs, +centromeres, telomeres and pseudogenes, they are also taken as repeats and should be masked during annotation. + +2. Algorithm: + +ReAS starts with WGS raw reads. It used K-mer strategy to define MDRs (mathematically defined repeats), then try to assemble +these MDRs together into BDRs (biologically defined repeats) with accurate boundaries. + +ReAS v1 selected initial K-mer as a bait, then retrieve all reads containning the K-mer, and trim the reads down to 100-bp +fragments centered at the K-mer. The fragments were clustered together according to pairwise similarity. Multi-alignment tool +ClustalW was used to get consensus for the groups of fragments. High-depth k-mers at the ends of assembled consensus sequence +were used to retrieve reads again. So the consensus was extended step by step at both ends until reaching unique region that +no further extensions are possible. Each time, we extend 100bp, after getting ends, a finishing step is used to get the last +few bases. For details, please refer to our ReAS paper: + +Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the +unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43. + +ReAS v1 is an integrated program, which will read the whole read set and the list of high-depth kmers into memory at the +beginning of assembly. Therefor, it's efficient. As we tested in rice, assembling repeats for a typical eukaryotic +genome (400MB~3GB; 6X WGS; repeat content: ~40%) will need only about 30CPU days. And ReAS realized multi-process and internally, +so actually it needs only two days in our IBM690 supercomputer with 20 parallel CPUs. But the problem is It +requires a huge memory (at least 50GB for genome size about 400MB like rice), and the parallel strategy is only available on +MPI structure supercomputers. + +Hence, we changed some of the strategies, and developed ReAS version 2. It's an incompact pipeline, comprised of a set of +independent programs. We provided some alternative choices for some of the steps, so that people can decide according to +their computer system. In all, it has following steps: + +(1) set D depth threshold, generate a list of high-depth K-mers from the read set; +(2) pick out reads which contains a certain number (M) of high-depth K-mers, we think these reads contains repeat fragments; +(3) do pairwise alignment among the reads, for each read, map all alignment hits on it, then cut the high-depth segments out +according to a presetted depth threshold; +(4) refine pairwise alignment among the high-depth read segments to get overlapping bigraph among the repeat segments; +(5) path searching and consensus assembly from the network. + +Pairwise alignment tool could be cross_match, blastn, patternhunter, or any others. Researchers can use their favorite tools, +only need to convert the output into our standard format. As our experience, we strongly suggest to use cross_match, it has +the best performance and efficiency for read sequences alignment. We used MUSCLE (http://www.drive5.com/muscle/) as multi-alignment +tool to assemble consensus sequences. + +3. Compiling and installation: + +ReAS v2 is a set of C++ programs and perl scripts. You can simply type "make" or "gmake" to compile C++ codes under the source code +directory, then type "make install" to copy perl scripts and C++ excutable files into bin directory. v3.2.1 or higher version of gcc +is needed. + +The programs have to be compiled in 64 bit. For ReAS is specially designed for big genomes which has a lot of repeats, it will always +need to handle huge reads set, so 64 bit is needed to support >4GB memory. Althrough the program is also runnable under 32 bit compilation, +the result will be seriously wrong. Please change the options in "Makefile" according to your linux/unix operational system. +For example, we should set "-mpowerpc64 -maix64" for IBM AIX system, and set nothing for normal 64 bit linux system. + +Don't forget to add "bin/" into your searching path "$PATH". + +4. External programs needed: + +As we need cross_match, blastn or patternhunter to perform pairwise alignment, And need MUSCLE (http://www.drive5.com/muscle/) +to do multi-alignment. There are not included in ReAS package, So please install them independently. Dust is used to mask +low-complexity repeats, it's freely available from http://blast.wustl.edu/pub/ . + +5. Pipeline and programs: + +The whole pipeline starts with raw sequencing reads as the only input dataset. And final output is a file containing consensus +sequences for all repeats. We call them "reads.fa" and "consensus.fa" respectively on following introduction. + +ReAS has an interface shell to run the whole pipeline. You can also try to run the pipeline step by step manually so that to +adjust parameters whenever as needed. + +The integrated pipeline: +======================== + +reas_all.pl + +Usage : ./reas_all.pl [options] + -read STRING reads file, fasta format + -k INT k-mer size, default=17 + -d INT depth threshold, default=6 + -m INT threshold of number of high-depth k-mers when defining repeat, default=1 + -subset FLOAT randomly pick up a fraction of reads for assembly, default=1(all) + -n INT split the data file into number of pieces, default=1 + -pa INT number of parallel processors, default=1 + -bound STRING output repeat boundaries file, default="seg.bk" + -link STRING output bigraph linkage file, default="seg.link" + -multi INT a subset for multi-alignment, default=-1(all), suggest 1000 if set + -output STRING output final consensus file, default="consensus.fa" + -prefix STRING prefix of consensus sequence name, default="" + -log output log details, default=[not] + + +Advanced options: + + -size INT size threshold of pairwise hits, default=100 + -ident FLOAT identity threshold of pairwise hits, default=0.6 + -nonlcs INT size threshold of non-LCS overlap, default=50 + -end INT size of remaining sequence at ends when defining breakpoints, default=50bp + -min_depth INT minimal depth at extension, default=depth/3 + -min_extend INT minimal extension size, default=50 + -max_extend INT maximum extension size, default=200 + + -h help + +notes: +(a) Unless you understand the details of this program, please don't set "Advanced options"; +(b) If the reads file is very large, please set a bigger "-n" to split the file into pieces, or memory could be a problem for cross_match +alignment; +(c) "-n" should be equal or larger than "-pa", or at most "-n" cpus will be used even you have more CPUs. + +Individual programs: +=================== + +(1) screen_vector.pl, CleanData.pl, rename.pl +The only input for ReAS assembly is random whole genome shotgun (WGS) read sequences. vector sequences should be pre-masked, and low quality +sequences at read ends should be trimmed. After that, you can normally type following command to prepare data for ReAS: + +cat reads.fa |CleanData.pl |rename.pl >reads.clean.fa + +screen_vector.pl is an additional program help you to screen vector sequences prior to assembly. +vector library can be downloaded from NCBI ftp ftp.ncbi.nih.gov under /pub/UniVec/UniVec_Core + +'X's and 'N's at both ends of reads will be trimmed. Reads with size < 100bp will be filtered out. So if your reads are very +short, like those generated by 454 sequencing technology, be sure to set proper threshold to filter small fragments. Description +infromation on title will be cleaned up, and reads will be renamed with ordered numbers. + +(2) kmer_num +kmer_num is a program from our RePS package. It will count the frequency of each K-mer in the reads set, K should be set +according to genome size. 4^K must exceed genome size, but large K will tax memory capacity and reduce sensitivity. It will +store the frequence of each K-mer as one byte in RAM, and finally output into a binary file. + +(3) kmer2reads +kmer2reads is also adapted from the program in RePS package. kmer2reads will output mer-read index information for k-mers whose +frequence are over the D depth threshold. D is a key parameter, for unbiased WGS cloning, the distribution of K-mer frequence +in unique region should fit Poisson distribution with peak at sequencing coverage. So for false-positive rate nearly 0.1%, +the threshold should be 7, 11 and 14 for 2X, 4X, and 6X coverage, respectively. But in practise, WGS couldn't be perfectly +random, and we are interested in high-depth repeats, so we's better set the parameters a little higher as our experience to avoid false positives. +Normally, we set the threshold at 3 times of sequencing coverage. So that it will only assemble repeats with more than 3 copies on the genome. +The package also provides a smaller program "DepthThreshold_kmer" to calculate the parameter. + +(4) N_mers +N_mers will take kmer2reads output as input. It counts how many high-depth K-mers each read have. You should set a threshold M +(30~100 for typical reads) to decide whether a read contains a repeat fragment or not. Then get the subset of repeat containing reads +to perform analysis. (2), (3) and (4) could run as a pipe like: + +kmer_num -k 17 reads.clean.fa |kmer2reads -d 40 reads.clean.fa |N_mers -M 50 >K17D40M50.N_mers + +"K17D40M50.N_mers" contains two columns: reads ID; number of high-depth k-mers. You can set a lower M (M=1) to keep all the +information when running the pipeline, and try to filter the result by different Ms later using command like: + +cat reads.N_mers | awk '$2>30' + +(5) cat reads.N_mers |cut -f1 >reads.id + +If the reads set is very large (>500,000), it will cost a lot of time for self pairwise alignment. So we can randomly pick up a subset +to do assembly. The command line is like: + +cat reads.id | RandomList.pl 0.5 >reads.sub.id + +then get the sequences of listed reads: + +cat reads.clean.fa | pickListSeq.pl reads.sub.id >reads.sub.fa + +Step (2)-(5) is to get a subset of repeat containing reads to perform assembly. It will save a lot of computational time but of course +also lose some sensitivity. So these steps are not neccessary, if you have enough computational resource, you can go to step (6) directly +with "reads.clean.fa". + +(6) run_HighDseg.pl +It's a small pipeline to perform pairwise alignment and cut high-depth repeat fragments from reads. We used a shell +"multi_run2.sh" to submit jobs on MPI structure supercomputers, you can use your task submitting programs instead of it. Please +revise the script directly at line 72. It provides several choices for the pairwise alignment tool. We suggest you to use cross_match. +Be remember to set the right D depth threshold here, especially when you randomly picked up a fraction of reads, D should be adjusted +synchronously. + +run_HighDseg.pl -r reads.sub.fa -n 1000 -a 20 -b seg.bk -d 10 + +Faster_HighDseg is a faster version. Normally it will significantly reduce computational time up to >1000 times. The maximum number of +pairwise alignment performed in run_HighDseg.pl is O(num_reads^2), but will reduce to O(num_reads*depth_threshold*10). But it read all +data into memorey and used pthread technology, so need a huge memory. + +(7) cutSeg.pl +It will cut the fragments out from reads according to repeat fragment boundary information get from (6). + +cat seg.bk | cutSeg.pl reads.sub.fa >seg.fa + +(8) run_SegLink.pl +Refine pairwise alignment for the repeat segments, and get linkage information among them. Alignment hits of low-complexity sequence (LCS) +will be removed. Default will be binary output. It will significantly reduce output size. + +run_SegLink.pl -r seg.fa -n 1000 -a 20 + +It will generate file "seg.link" + +Faster_SegLink.pl is a faster version, used similar strategy as Faster_HighDseg. It's rather fast, but needs a huge memory. + +(9) DoAssembly +Perform assembly for all repeat segments according to alignment linkage information. + +DoAssembly -r seg.fa -l seg.link -b seg.bk -d 10 -o consensus.fa -p "reas_" -s >log + +(10) join_fragments.pl + +During the process of assembly, I set very stringent parameters to avoid misassemblies. Based on test, I found ReAS can assemble almost all +Highly abundant TEs out, but the result is a little fragmental, so here have a step to further join consensus sequences according to overlap +information. + +(11) rmRedundant.pl + +remove redundance among the assembled consensus sequences. + + +Some other programs are also usable for some alternative strategy: +================================================================== + +(1) "Clustering" will perform clustering based on program (7) result, then split linkage and sequences into each +group. Correspondingly, "run_DoAssembly.pl" is a batch shell of DoAssembly. + +(2) "DepthThreshold_kmer" and "DepthThreshold_overlap" are used to estimate D threshold. + + +Some more programs for quality control: +======================================= + +(1) vsKnownLib.pl +Estimation of completeness of assembly by aligning ReAS TEs with known TEs. + +cross_match -masklevel 101 ReAS_TE.fa Known_TE.fa |ConverAlign.pl >ReAS_Known.align +vsKnownLib.pl Known.size ReAS_Known.align >estimate.out + +(2) LibOverlap +Comparison of the overlap of genome RepeatMasker results by two libraries + +(3) geneMask.pl +Calculate fraction of exon and intron region masked for each known gene + +(4) statDepth.pl +Average, minimal and maximum depth of each TE on the reads dataset. + +(5) anno_vsProtein.pl +Annotate the assembled repeat sequences by aligning with known TE related proteins, including gag, pol, reverse transcriptase, ribonuclease H, +integrase enzyme, transposases, envolope proteins. + +More programs to annotate the assembled repeats: + +(be added soon. ...) + + +Some notes regarding memory saving: +=================================== +ReAS is handling large scale reads data, so memory could be a big problem when facing extreme situation. Here are some suggestions +regarding how to make ReAS runnable with less memory demand: + +(1) Randomly pick up a small subset of reads for ReAS assembly. Not like genome assembly, which always need at least 6x reads to get +a draft assembly, ReAS is aiming for abundant repeat sequences, which always have many copies on a genome, so we can even capture most +repeats for a survey sequencing lower than 1x. Also reducing input size will greatly save cpu time; + +(2) Set a larger "-n" to split sequence file into many pieces. Cross_match will save alignment hits in memory, and output after all alignments +finished, so split the file into pieces will significantly reduce memory requirement of cross_match; + +(3) Clustering and then do assembly for each cluster respectively; + +(4) Also, we can try to set a high depth threshold and get a small fraction of reads to assemble highly abundant repeat first, then remove +reads belonging to these TEs, and set lower depth threshold to perform assembly in the remaining reads; + +(5) Finally, some TEs in the genome are known already, we can remove reads belonging to these TEs prior to assembly. +6. Acknowledges: +kmer_num, kmer2reads are copied from our genome assembler RePS, which were written by Heng Li, Shengting Li, and Yong +Zhang. + +7. Citation: + +ReAS freely available for non-commercial users through ReAS@genomics.org.cn. If you use the program, please cite our paper: +Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the +unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43. + + +8. Bug report + +Although I have done a lot of test, I'm sure there must still be some undetected bugs. So I really hope you can point them out to me +whenever you find something strange look like bugs. You can report bug to ReAS@genomics.org.cn or direct to my E-mail lirq@genomics.org.cn +Thanks very much. + +Thanks for interested in our ReAS. \ No newline at end of file