Copied readme to root

46ba326f · Haarst, Jan van · 0d7a0175 · 46ba326f
Commit 46ba326f authored 8 years ago by Haarst, Jan van
--- a/README.md
+++ b/README.md
+ReAS: Recovery of Ancestral Sequences for Transposable Elements from the Unassembled Reads of a Whole Genome Shotgun
+
+Version: 2.02
+
+Ruiqiang Li
+
+lirq@genomics.org.cn
+Bioinformatics Department of Beijing Genomics Institute
+Beijing, 101300
+China
+
+lirq@bmb.sdu.dk
+Department of Biochemistry and Molecular Biology
+University of Southern Denmark
+Campusvej 55
+DK-5230 Odense M, Denmark
+
+1. Introduction:
+
+ReAS aims to recover ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun.
+
+Transposable elements (TEs) make up a significant proportion of many eukaryotic genomes, totaling almost 45% for human, and 
+50% to 80% for rice, maize, and wheat. They play important evolutionary roles, and can be essential tools for genome 
+analyses. RepeatMasker (A. Smit and P. Green, unpublished data) is one of the most commonly used algorithms for detecting 
+TEs. It relies on a comparison to known TEs from libraries like Repbase, which represents many years of manual labor. 
+Computational tools like REPuter, RECON, and RepeatGluer automate this process. However, almost all of the genomes sequenced 
+today employ a whole genome shotgun (WGS) method that is incapable of assembling the most recent TEs, and any efforts to 
+force such an assembly together generally increase the probability of assembly errors. For example, in the mouse and rice 
+genomes, 15% and 14% of the reads were left out of the assemblies, respectively. Some of the unassembled reads are due to 
+centromeres or telomeres, but we know in rice that many are recent TEs. Unassembled reads are the most informative reads for 
+TE recovery as they are the least diverged from their ancestral sequences. Despite this fact, all of the above tools were 
+only tested on assembled genomes and it is not clear how effectively or efficiently they might incorporate the information in 
+the unassembled reads. Hence, we developed a new algorithm, ReAS (from ����recovery of ancestral sequences����), to produce 
+the requisite TE library using only the unassembled reads of a WGS.
+
+ReAS aims for transposable elements, but as a program, it will recognize all repeat elements which over represented in the 
+read set. There will of course be some artifacts. WGS could not be really random, some unique regions will be assembled as 
+repeats when sampling depth is over the threshold. Also some recently duplicated gene families or genome fragments may be assembled 
+out. rRNAs, centromere and telomere units were unavoidable. So please keep in mind that not all ReAS assembled sequences are 
+TEs. But good news is that after we tested in rice, seems very few coding genes were involved. And for rRNAs, 
+centromeres, telomeres and pseudogenes, they are also taken as repeats and should be masked during annotation.
+
+2. Algorithm:
+
+ReAS starts with WGS raw reads. It used K-mer strategy to define MDRs (mathematically defined repeats), then try to assemble 
+these MDRs together into BDRs (biologically defined repeats) with accurate boundaries.
+
+ReAS v1 selected initial K-mer as a bait, then retrieve all reads containning the K-mer, and trim the reads down to 100-bp 
+fragments centered at the K-mer. The fragments were clustered together according to pairwise similarity. Multi-alignment tool 
+ClustalW was used to get consensus for the groups of fragments. High-depth k-mers at the ends of assembled consensus sequence 
+were used to retrieve reads again. So the consensus was extended step by step at both ends until reaching unique region that 
+no further extensions are possible. Each time, we extend 100bp, after getting ends, a finishing step is used to get the last 
+few bases. For details, please refer to our ReAS paper:
+
+Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the 
+unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43.
+
+ReAS v1 is an integrated program, which will read the whole read set and the list of high-depth kmers into memory at the 
+beginning of assembly. Therefor, it's efficient. As we tested in rice, assembling repeats for a typical eukaryotic 
+genome (400MB~3GB; 6X WGS; repeat content: ~40%) will need only about 30CPU days. And ReAS realized multi-process and internally, 
+so actually it needs only two days in our IBM690 supercomputer with 20 parallel CPUs. But the problem is It 
+requires a huge memory (at least 50GB for genome size about 400MB like rice), and the parallel strategy is only available on 
+MPI structure supercomputers.
+
+Hence, we changed some of the strategies, and developed ReAS version 2. It's an incompact pipeline, comprised of a set of 
+independent programs. We provided some alternative choices for some of the steps, so that people can decide according to 
+their computer system. In all, it has following steps:
+
+(1) set D depth threshold, generate a list of high-depth K-mers from the read set;
+(2) pick out reads which contains a certain number (M) of high-depth K-mers, we think these reads contains repeat fragments;
+(3) do pairwise alignment among the reads, for each read, map all alignment hits on it, then cut the high-depth segments out 
+according to a presetted depth threshold;
+(4) refine pairwise alignment among the high-depth read segments to get overlapping bigraph among the repeat segments;
+(5) path searching and consensus assembly from the network.
+
+Pairwise alignment tool could be cross_match, blastn, patternhunter, or any others. Researchers can use their favorite tools, 
+only need to convert the output into our standard format. As our experience, we strongly suggest to use cross_match, it has 
+the best performance and efficiency for read sequences alignment. We used MUSCLE (http://www.drive5.com/muscle/) as multi-alignment
+tool to assemble consensus sequences.
+
+3. Compiling and installation:
+
+ReAS v2 is a set of C++ programs and perl scripts. You can simply type "make" or "gmake" to compile C++ codes under the source code 
+directory, then type "make install" to copy perl scripts and C++ excutable files into bin directory. v3.2.1 or higher version of gcc
+is needed.
+
+The programs have to be compiled in 64 bit. For ReAS is specially designed for big genomes which has a lot of repeats, it will always
+need to handle huge reads set, so 64 bit is needed to support >4GB memory. Althrough the program is also runnable under 32 bit compilation,
+the result will be seriously wrong. Please change the options in "Makefile" according to your linux/unix operational system. 
+For example, we should set "-mpowerpc64 -maix64" for IBM AIX system, and set nothing for normal 64 bit linux system.
+
+Don't forget to add "bin/" into your searching path "$PATH".
+
+4. External programs needed:
+
+As we need cross_match, blastn or patternhunter to perform pairwise alignment, And need MUSCLE (http://www.drive5.com/muscle/) 
+to do multi-alignment. There are not included in ReAS package, So please install them independently. Dust is used to mask 
+low-complexity repeats, it's freely available from http://blast.wustl.edu/pub/ .
+
+5. Pipeline and programs:
+
+The whole pipeline starts with raw sequencing reads as the only input dataset. And final output is a file containing consensus
+sequences for all repeats. We call them "reads.fa" and "consensus.fa" respectively on following introduction.
+
+ReAS has an interface shell to run the whole pipeline. You can also try to run the pipeline step by step manually so that to
+adjust parameters whenever as needed.
+
+The integrated pipeline:
+========================
+
+reas_all.pl
+
+Usage : ./reas_all.pl [options]
+	-read	STRING	reads file, fasta format
+	-k	INT	k-mer size, default=17
+	-d	INT	depth threshold, default=6
+	-m	INT	threshold of number of high-depth k-mers when defining repeat, default=1
+	-subset	FLOAT	randomly pick up a fraction of reads for assembly, default=1(all)
+	-n	INT	split the data file into number of pieces, default=1
+	-pa	INT	number of parallel processors, default=1
+	-bound	STRING	output repeat boundaries file, default="seg.bk"
+	-link	STRING	output bigraph linkage file, default="seg.link"
+	-multi	INT	a subset for multi-alignment, default=-1(all), suggest 1000 if set
+	-output	STRING	output final consensus file, default="consensus.fa"
+	-prefix	STRING	prefix of consensus sequence name, default=""
+	-log		output log details, default=[not]
+
+
+Advanced options:
+
+	-size	INT	size threshold of pairwise hits, default=100
+	-ident	FLOAT	identity threshold of pairwise hits, default=0.6
+	-nonlcs	INT	size threshold of non-LCS overlap, default=50
+	-end	INT	size of remaining sequence at ends when defining breakpoints, default=50bp
+	-min_depth	INT	minimal depth at extension, default=depth/3
+	-min_extend	INT	minimal extension size, default=50
+	-max_extend	INT	maximum extension size, default=200
+
+	-h	help
+
+notes:
+(a) Unless you understand the details of this program, please don't set "Advanced options";
+(b) If the reads file is very large, please set a bigger "-n" to split the file into pieces, or memory could be a problem for cross_match
+alignment;
+(c) "-n" should be equal or larger than "-pa", or at most "-n" cpus will be used even you have more CPUs.
+
+Individual programs:
+===================
+
+(1) screen_vector.pl, CleanData.pl, rename.pl
+The only input for ReAS assembly is random whole genome shotgun (WGS) read sequences. vector sequences should be pre-masked, and low quality
+sequences at read ends should be trimmed. After that, you can normally type following command to prepare data for ReAS:
+
+cat reads.fa |CleanData.pl |rename.pl >reads.clean.fa
+
+screen_vector.pl is an additional program help you to screen vector sequences prior to assembly.
+vector library can be downloaded from NCBI ftp ftp.ncbi.nih.gov under /pub/UniVec/UniVec_Core
+
+'X's and 'N's at both ends of reads will be trimmed. Reads with size < 100bp will be filtered out. So if your reads are very
+short, like those generated by 454 sequencing technology, be sure to set proper threshold to filter small fragments. Description
+infromation on title will be cleaned up, and reads will be renamed with ordered numbers.
+
+(2) kmer_num
+kmer_num is a program from our RePS package. It will count the frequency of each K-mer in the reads set, K should be set 
+according to genome size. 4^K must exceed genome size, but large K will tax memory capacity and reduce sensitivity. It will 
+store the frequence of each K-mer as one byte in RAM, and finally output into a binary file.
+
+(3) kmer2reads
+kmer2reads is also adapted from the program in RePS package. kmer2reads will output mer-read index information for k-mers whose 
+frequence are over the D depth threshold. D is a key parameter, for unbiased WGS cloning, the distribution of K-mer frequence 
+in unique region should fit Poisson distribution with peak at sequencing coverage. So for false-positive rate nearly 0.1%, 
+the threshold should be 7, 11 and 14 for 2X, 4X, and 6X coverage, respectively. But in practise, WGS couldn't be perfectly 
+random, and we are interested in high-depth repeats, so we's better set the parameters a little higher as our experience to avoid false positives.
+Normally, we set the threshold at 3 times of sequencing coverage. So that it will only assemble repeats with more than 3 copies on the genome.
+The package also provides a smaller program "DepthThreshold_kmer" to calculate the parameter.
+
+(4) N_mers
+N_mers will take kmer2reads output as input. It counts how many high-depth K-mers each read have. You should set a threshold M 
+(30~100 for typical reads) to decide whether a read contains a repeat fragment or not. Then get the subset of repeat containing reads
+to perform analysis. (2), (3) and (4) could run as a pipe like:
+
+kmer_num -k 17 reads.clean.fa |kmer2reads -d 40 reads.clean.fa |N_mers -M 50 >K17D40M50.N_mers
+
+"K17D40M50.N_mers" contains two columns: reads ID; number of high-depth k-mers. You can set a lower M (M=1) to keep all the
+information when running the pipeline, and try to filter the result by different Ms later using command like:
+
+cat reads.N_mers | awk '$2>30' 
+
+(5) cat reads.N_mers |cut -f1 >reads.id
+
+If the reads set is very large (>500,000), it will cost a lot of time for self pairwise alignment. So we can randomly pick up a subset
+to do assembly. The command line is like:
+
+cat reads.id | RandomList.pl 0.5 >reads.sub.id
+
+then get the sequences of listed reads:
+
+cat reads.clean.fa | pickListSeq.pl reads.sub.id >reads.sub.fa
+
+Step (2)-(5) is to get a subset of repeat containing reads to perform assembly. It will save a lot of computational time but of course
+also lose some sensitivity. So these steps are not neccessary, if you have enough computational resource, you can go to step (6) directly
+with "reads.clean.fa".
+
+(6) run_HighDseg.pl
+It's a small pipeline to perform pairwise alignment and cut high-depth repeat fragments from reads. We used a shell 
+"multi_run2.sh" to submit jobs on MPI structure supercomputers, you can use your task submitting programs instead of it. Please 
+revise the script directly at line 72. It provides several choices for the pairwise alignment tool. We suggest you to use cross_match.
+Be remember to set the right D depth threshold here, especially when you randomly picked up a fraction of reads, D should be adjusted
+synchronously.
+
+run_HighDseg.pl -r reads.sub.fa -n 1000 -a 20 -b seg.bk -d 10
+
+Faster_HighDseg is a faster version. Normally it will significantly reduce computational time up to >1000 times. The maximum number of 
+pairwise alignment performed in run_HighDseg.pl is O(num_reads^2), but will reduce to O(num_reads*depth_threshold*10). But it read all 
+data into memorey and used pthread technology, so need a huge memory.
+
+(7) cutSeg.pl
+It will cut the fragments out from reads according to repeat fragment boundary information get from (6).
+
+cat seg.bk | cutSeg.pl reads.sub.fa >seg.fa
+
+(8) run_SegLink.pl
+Refine pairwise alignment for the repeat segments, and get linkage information among them. Alignment hits of low-complexity sequence (LCS)
+will be removed. Default will be binary output. It will significantly reduce output size.
+
+run_SegLink.pl -r seg.fa -n 1000 -a 20
+
+It will generate file "seg.link"
+
+Faster_SegLink.pl is a faster version, used similar strategy as Faster_HighDseg. It's rather fast, but needs a huge memory.
+
+(9) DoAssembly
+Perform assembly for all repeat segments according to alignment linkage information.
+
+DoAssembly -r seg.fa -l seg.link -b seg.bk -d 10 -o consensus.fa -p "reas_" -s >log
+
+(10) join_fragments.pl
+
+During the process of assembly, I set very stringent parameters to avoid misassemblies. Based on test, I found ReAS can assemble almost all
+Highly abundant TEs out, but the result is a little fragmental, so here have a step to further join consensus sequences according to overlap
+information.
+
+(11) rmRedundant.pl
+
+remove redundance among the assembled consensus sequences.
+
+
+Some other programs are also usable for some alternative strategy:
+==================================================================
+
+(1) "Clustering" will perform clustering based on program (7) result, then split linkage and sequences into each 
+group. Correspondingly, "run_DoAssembly.pl" is a batch shell of DoAssembly.
+
+(2) "DepthThreshold_kmer" and "DepthThreshold_overlap" are used to estimate D threshold.
+
+
+Some more programs for quality control:
+=======================================
+
+(1) vsKnownLib.pl
+Estimation of completeness of assembly by aligning ReAS TEs with known TEs.
+
+cross_match -masklevel 101 ReAS_TE.fa Known_TE.fa |ConverAlign.pl >ReAS_Known.align
+vsKnownLib.pl Known.size ReAS_Known.align >estimate.out
+
+(2) LibOverlap
+Comparison of the overlap of genome RepeatMasker results by two libraries
+
+(3) geneMask.pl
+Calculate fraction of exon and intron region masked for each known gene
+
+(4) statDepth.pl
+Average, minimal and maximum depth of each TE on the reads dataset.
+
+(5) anno_vsProtein.pl
+Annotate the assembled repeat sequences by aligning with known TE related proteins, including gag, pol, reverse transcriptase, ribonuclease H,
+integrase enzyme, transposases, envolope proteins.
+
+More programs to annotate the assembled repeats:
+
+(be added soon. ...)
+
+
+Some notes regarding memory saving:
+===================================
+ReAS is handling large scale reads data, so memory could be a big problem when facing extreme situation. Here are some suggestions
+regarding how to make ReAS runnable with less memory demand:
+
+(1) Randomly pick up a small subset of reads for ReAS assembly. Not like genome assembly, which always need at least 6x reads to get
+a draft assembly, ReAS is aiming for abundant repeat sequences, which always have many copies on a genome, so we can even capture most
+repeats for a survey sequencing lower than 1x. Also reducing input size will greatly save cpu time;
+
+(2) Set a larger "-n" to split sequence file into many pieces. Cross_match will save alignment hits in memory, and output after all alignments
+finished, so split the file into pieces will significantly reduce memory requirement of cross_match;
+
+(3) Clustering and then do assembly for each cluster respectively;
+
+(4) Also, we can try to set a high depth threshold and get a small fraction of reads to assemble highly abundant repeat first, then remove
+reads belonging to these TEs, and set lower depth threshold to perform assembly in the remaining reads;
+
+(5) Finally, some TEs in the genome are known already, we can remove reads belonging to these TEs prior to assembly.
+6. Acknowledges:
+kmer_num, kmer2reads are copied from our genome assembler RePS, which were written by Heng Li, Shengting Li, and Yong 
+Zhang.
+
+7. Citation:
+
+ReAS freely available for non-commercial users through ReAS@genomics.org.cn. If you use the program, please cite our paper:
+Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the 
+unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43.
+
+
+8. Bug report
+
+Although I have done a lot of test, I'm sure there must still be some undetected bugs. So I really hope you can point them out to me 
+whenever you find something strange look like bugs. You can report bug to ReAS@genomics.org.cn or direct to my E-mail lirq@genomics.org.cn
+Thanks very much.
+
+Thanks for interested in our ReAS.
\ No newline at end of file