Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
R
reas
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Deploy
Releases
Container registry
Model registry
Monitor
Incidents
Analyze
Value stream analytics
Contributor analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Haarst, Jan van
reas
Commits
46ba326f
Commit
46ba326f
authored
8 years ago
by
Haarst, Jan van
Browse files
Options
Downloads
Patches
Plain Diff
Copied readme to root
parent
0d7a0175
No related branches found
No related tags found
No related merge requests found
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
README.md
+319
-0
319 additions, 0 deletions
README.md
with
319 additions
and
0 deletions
README.md
0 → 100644
+
319
−
0
View file @
46ba326f
ReAS: Recovery of Ancestral Sequences for Transposable Elements from the Unassembled Reads of a Whole Genome Shotgun
Version: 2.02
Ruiqiang Li
lirq@genomics.org.cn
Bioinformatics Department of Beijing Genomics Institute
Beijing, 101300
China
lirq@bmb.sdu.dk
Department of Biochemistry and Molecular Biology
University of Southern Denmark
Campusvej 55
DK-5230 Odense M, Denmark
1.
Introduction:
ReAS aims to recover ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun.
Transposable elements (TEs) make up a significant proportion of many eukaryotic genomes, totaling almost 45% for human, and
50% to 80% for rice, maize, and wheat. They play important evolutionary roles, and can be essential tools for genome
analyses. RepeatMasker (A. Smit and P. Green, unpublished data) is one of the most commonly used algorithms for detecting
TEs. It relies on a comparison to known TEs from libraries like Repbase, which represents many years of manual labor.
Computational tools like REPuter, RECON, and RepeatGluer automate this process. However, almost all of the genomes sequenced
today employ a whole genome shotgun (WGS) method that is incapable of assembling the most recent TEs, and any efforts to
force such an assembly together generally increase the probability of assembly errors. For example, in the mouse and rice
genomes, 15% and 14% of the reads were left out of the assemblies, respectively. Some of the unassembled reads are due to
centromeres or telomeres, but we know in rice that many are recent TEs. Unassembled reads are the most informative reads for
TE recovery as they are the least diverged from their ancestral sequences. Despite this fact, all of the above tools were
only tested on assembled genomes and it is not clear how effectively or efficiently they might incorporate the information in
the unassembled reads. Hence, we developed a new algorithm, ReAS (from ����recovery of ancestral sequences����), to produce
the requisite TE library using only the unassembled reads of a WGS.
ReAS aims for transposable elements, but as a program, it will recognize all repeat elements which over represented in the
read set. There will of course be some artifacts. WGS could not be really random, some unique regions will be assembled as
repeats when sampling depth is over the threshold. Also some recently duplicated gene families or genome fragments may be assembled
out. rRNAs, centromere and telomere units were unavoidable. So please keep in mind that not all ReAS assembled sequences are
TEs. But good news is that after we tested in rice, seems very few coding genes were involved. And for rRNAs,
centromeres, telomeres and pseudogenes, they are also taken as repeats and should be masked during annotation.
2.
Algorithm:
ReAS starts with WGS raw reads. It used K-mer strategy to define MDRs (mathematically defined repeats), then try to assemble
these MDRs together into BDRs (biologically defined repeats) with accurate boundaries.
ReAS v1 selected initial K-mer as a bait, then retrieve all reads containning the K-mer, and trim the reads down to 100-bp
fragments centered at the K-mer. The fragments were clustered together according to pairwise similarity. Multi-alignment tool
ClustalW was used to get consensus for the groups of fragments. High-depth k-mers at the ends of assembled consensus sequence
were used to retrieve reads again. So the consensus was extended step by step at both ends until reaching unique region that
no further extensions are possible. Each time, we extend 100bp, after getting ends, a finishing step is used to get the last
few bases. For details, please refer to our ReAS paper:
Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the
unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43.
ReAS v1 is an integrated program, which will read the whole read set and the list of high-depth kmers into memory at the
beginning of assembly. Therefor, it's efficient. As we tested in rice, assembling repeats for a typical eukaryotic
genome (400MB~3GB; 6X WGS; repeat content: ~40%) will need only about 30CPU days. And ReAS realized multi-process and internally,
so actually it needs only two days in our IBM690 supercomputer with 20 parallel CPUs. But the problem is It
requires a huge memory (at least 50GB for genome size about 400MB like rice), and the parallel strategy is only available on
MPI structure supercomputers.
Hence, we changed some of the strategies, and developed ReAS version 2. It's an incompact pipeline, comprised of a set of
independent programs. We provided some alternative choices for some of the steps, so that people can decide according to
their computer system. In all, it has following steps:
(1) set D depth threshold, generate a list of high-depth K-mers from the read set;
(2) pick out reads which contains a certain number (M) of high-depth K-mers, we think these reads contains repeat fragments;
(3) do pairwise alignment among the reads, for each read, map all alignment hits on it, then cut the high-depth segments out
according to a presetted depth threshold;
(4) refine pairwise alignment among the high-depth read segments to get overlapping bigraph among the repeat segments;
(5) path searching and consensus assembly from the network.
Pairwise alignment tool could be cross_match, blastn, patternhunter, or any others. Researchers can use their favorite tools,
only need to convert the output into our standard format. As our experience, we strongly suggest to use cross_match, it has
the best performance and efficiency for read sequences alignment. We used MUSCLE (http://www.drive5.com/muscle/) as multi-alignment
tool to assemble consensus sequences.
3.
Compiling and installation:
ReAS v2 is a set of C++ programs and perl scripts. You can simply type "make" or "gmake" to compile C++ codes under the source code
directory, then type "make install" to copy perl scripts and C++ excutable files into bin directory. v3.2.1 or higher version of gcc
is needed.
The programs have to be compiled in 64 bit. For ReAS is specially designed for big genomes which has a lot of repeats, it will always
need to handle huge reads set, so 64 bit is needed to support >4GB memory. Althrough the program is also runnable under 32 bit compilation,
the result will be seriously wrong. Please change the options in "Makefile" according to your linux/unix operational system.
For example, we should set "-mpowerpc64 -maix64" for IBM AIX system, and set nothing for normal 64 bit linux system.
Don't forget to add "bin/" into your searching path "$PATH".
4.
External programs needed:
As we need cross_match, blastn or patternhunter to perform pairwise alignment, And need MUSCLE (http://www.drive5.com/muscle/)
to do multi-alignment. There are not included in ReAS package, So please install them independently. Dust is used to mask
low-complexity repeats, it's freely available from http://blast.wustl.edu/pub/ .
5.
Pipeline and programs:
The whole pipeline starts with raw sequencing reads as the only input dataset. And final output is a file containing consensus
sequences for all repeats. We call them "reads.fa" and "consensus.fa" respectively on following introduction.
ReAS has an interface shell to run the whole pipeline. You can also try to run the pipeline step by step manually so that to
adjust parameters whenever as needed.
The integrated pipeline:
========================
reas_all.pl
Usage : ./reas_all.pl [options]
-read STRING reads file, fasta format
-k INT k-mer size, default=17
-d INT depth threshold, default=6
-m INT threshold of number of high-depth k-mers when defining repeat, default=1
-subset FLOAT randomly pick up a fraction of reads for assembly, default=1(all)
-n INT split the data file into number of pieces, default=1
-pa INT number of parallel processors, default=1
-bound STRING output repeat boundaries file, default="seg.bk"
-link STRING output bigraph linkage file, default="seg.link"
-multi INT a subset for multi-alignment, default=-1(all), suggest 1000 if set
-output STRING output final consensus file, default="consensus.fa"
-prefix STRING prefix of consensus sequence name, default=""
-log output log details, default=[not]
Advanced options:
-size INT size threshold of pairwise hits, default=100
-ident FLOAT identity threshold of pairwise hits, default=0.6
-nonlcs INT size threshold of non-LCS overlap, default=50
-end INT size of remaining sequence at ends when defining breakpoints, default=50bp
-min_depth INT minimal depth at extension, default=depth/3
-min_extend INT minimal extension size, default=50
-max_extend INT maximum extension size, default=200
-h help
notes:
(a) Unless you understand the details of this program, please don't set "Advanced options";
(b) If the reads file is very large, please set a bigger "-n" to split the file into pieces, or memory could be a problem for cross_match
alignment;
(c) "-n" should be equal or larger than "-pa", or at most "-n" cpus will be used even you have more CPUs.
Individual programs:
===================
(1) screen_vector.pl, CleanData.pl, rename.pl
The only input for ReAS assembly is random whole genome shotgun (WGS) read sequences. vector sequences should be pre-masked, and low quality
sequences at read ends should be trimmed. After that, you can normally type following command to prepare data for ReAS:
cat reads.fa |CleanData.pl |rename.pl >reads.clean.fa
screen_vector.pl is an additional program help you to screen vector sequences prior to assembly.
vector library can be downloaded from NCBI ftp ftp.ncbi.nih.gov under /pub/UniVec/UniVec_Core
'X's and 'N's at both ends of reads will be trimmed. Reads with size < 100bp will be filtered out. So if your reads are very
short, like those generated by 454 sequencing technology, be sure to set proper threshold to filter small fragments. Description
infromation on title will be cleaned up, and reads will be renamed with ordered numbers.
(2) kmer_num
kmer_num is a program from our RePS package. It will count the frequency of each K-mer in the reads set, K should be set
according to genome size. 4^K must exceed genome size, but large K will tax memory capacity and reduce sensitivity. It will
store the frequence of each K-mer as one byte in RAM, and finally output into a binary file.
(3) kmer2reads
kmer2reads is also adapted from the program in RePS package. kmer2reads will output mer-read index information for k-mers whose
frequence are over the D depth threshold. D is a key parameter, for unbiased WGS cloning, the distribution of K-mer frequence
in unique region should fit Poisson distribution with peak at sequencing coverage. So for false-positive rate nearly 0.1%,
the threshold should be 7, 11 and 14 for 2X, 4X, and 6X coverage, respectively. But in practise, WGS couldn't be perfectly
random, and we are interested in high-depth repeats, so we's better set the parameters a little higher as our experience to avoid false positives.
Normally, we set the threshold at 3 times of sequencing coverage. So that it will only assemble repeats with more than 3 copies on the genome.
The package also provides a smaller program "DepthThreshold_kmer" to calculate the parameter.
(4) N_mers
N_mers will take kmer2reads output as input. It counts how many high-depth K-mers each read have. You should set a threshold M
(30~100 for typical reads) to decide whether a read contains a repeat fragment or not. Then get the subset of repeat containing reads
to perform analysis. (2), (3) and (4) could run as a pipe like:
kmer_num -k 17 reads.clean.fa |kmer2reads -d 40 reads.clean.fa |N_mers -M 50 >K17D40M50.N_mers
"K17D40M50.N_mers" contains two columns: reads ID; number of high-depth k-mers. You can set a lower M (M=1) to keep all the
information when running the pipeline, and try to filter the result by different Ms later using command like:
cat reads.N_mers | awk '$2>30'
(5) cat reads.N_mers |cut -f1 >reads.id
If the reads set is very large (>500,000), it will cost a lot of time for self pairwise alignment. So we can randomly pick up a subset
to do assembly. The command line is like:
cat reads.id | RandomList.pl 0.5 >reads.sub.id
then get the sequences of listed reads:
cat reads.clean.fa | pickListSeq.pl reads.sub.id >reads.sub.fa
Step (2)-(5) is to get a subset of repeat containing reads to perform assembly. It will save a lot of computational time but of course
also lose some sensitivity. So these steps are not neccessary, if you have enough computational resource, you can go to step (6) directly
with "reads.clean.fa".
(6) run_HighDseg.pl
It's a small pipeline to perform pairwise alignment and cut high-depth repeat fragments from reads. We used a shell
"multi_run2.sh" to submit jobs on MPI structure supercomputers, you can use your task submitting programs instead of it. Please
revise the script directly at line 72. It provides several choices for the pairwise alignment tool. We suggest you to use cross_match.
Be remember to set the right D depth threshold here, especially when you randomly picked up a fraction of reads, D should be adjusted
synchronously.
run_HighDseg.pl -r reads.sub.fa -n 1000 -a 20 -b seg.bk -d 10
Faster_HighDseg is a faster version. Normally it will significantly reduce computational time up to >1000 times. The maximum number of
pairwise alignment performed in run_HighDseg.pl is O(num_reads^2), but will reduce to O(num_reads
*depth_threshold*
10). But it read all
data into memorey and used pthread technology, so need a huge memory.
(7) cutSeg.pl
It will cut the fragments out from reads according to repeat fragment boundary information get from (6).
cat seg.bk | cutSeg.pl reads.sub.fa >seg.fa
(8) run_SegLink.pl
Refine pairwise alignment for the repeat segments, and get linkage information among them. Alignment hits of low-complexity sequence (LCS)
will be removed. Default will be binary output. It will significantly reduce output size.
run_SegLink.pl -r seg.fa -n 1000 -a 20
It will generate file "seg.link"
Faster_SegLink.pl is a faster version, used similar strategy as Faster_HighDseg. It's rather fast, but needs a huge memory.
(9) DoAssembly
Perform assembly for all repeat segments according to alignment linkage information.
DoAssembly -r seg.fa -l seg.link -b seg.bk -d 10 -o consensus.fa -p "reas_" -s >log
(10) join_fragments.pl
During the process of assembly, I set very stringent parameters to avoid misassemblies. Based on test, I found ReAS can assemble almost all
Highly abundant TEs out, but the result is a little fragmental, so here have a step to further join consensus sequences according to overlap
information.
(11) rmRedundant.pl
remove redundance among the assembled consensus sequences.
Some other programs are also usable for some alternative strategy:
==================================================================
(1) "Clustering" will perform clustering based on program (7) result, then split linkage and sequences into each
group. Correspondingly, "run_DoAssembly.pl" is a batch shell of DoAssembly.
(2) "DepthThreshold_kmer" and "DepthThreshold_overlap" are used to estimate D threshold.
Some more programs for quality control:
=======================================
(1) vsKnownLib.pl
Estimation of completeness of assembly by aligning ReAS TEs with known TEs.
cross_match -masklevel 101 ReAS_TE.fa Known_TE.fa |ConverAlign.pl >ReAS_Known.align
vsKnownLib.pl Known.size ReAS_Known.align >estimate.out
(2) LibOverlap
Comparison of the overlap of genome RepeatMasker results by two libraries
(3) geneMask.pl
Calculate fraction of exon and intron region masked for each known gene
(4) statDepth.pl
Average, minimal and maximum depth of each TE on the reads dataset.
(5) anno_vsProtein.pl
Annotate the assembled repeat sequences by aligning with known TE related proteins, including gag, pol, reverse transcriptase, ribonuclease H,
integrase enzyme, transposases, envolope proteins.
More programs to annotate the assembled repeats:
(be added soon. ...)
Some notes regarding memory saving:
===================================
ReAS is handling large scale reads data, so memory could be a big problem when facing extreme situation. Here are some suggestions
regarding how to make ReAS runnable with less memory demand:
(1) Randomly pick up a small subset of reads for ReAS assembly. Not like genome assembly, which always need at least 6x reads to get
a draft assembly, ReAS is aiming for abundant repeat sequences, which always have many copies on a genome, so we can even capture most
repeats for a survey sequencing lower than 1x. Also reducing input size will greatly save cpu time;
(2) Set a larger "-n" to split sequence file into many pieces. Cross_match will save alignment hits in memory, and output after all alignments
finished, so split the file into pieces will significantly reduce memory requirement of cross_match;
(3) Clustering and then do assembly for each cluster respectively;
(4) Also, we can try to set a high depth threshold and get a small fraction of reads to assemble highly abundant repeat first, then remove
reads belonging to these TEs, and set lower depth threshold to perform assembly in the remaining reads;
(5) Finally, some TEs in the genome are known already, we can remove reads belonging to these TEs prior to assembly.
6.
Acknowledges:
kmer_num, kmer2reads are copied from our genome assembler RePS, which were written by Heng Li, Shengting Li, and Yong
Zhang.
7.
Citation:
ReAS freely available for non-commercial users through ReAS@genomics.org.cn. If you use the program, please cite our paper:
Li R, Ye J, Li S, Wang J, Han Y, et al (2005) ReAS: Recovery of ancestral sequences for transposable elements from the
unassembled reads of a whole genome shotgun. PLoS Comput Biol 1(4): e43.
8.
Bug report
Although I have done a lot of test, I'm sure there must still be some undetected bugs. So I really hope you can point them out to me
whenever you find something strange look like bugs. You can report bug to ReAS@genomics.org.cn or direct to my E-mail lirq@genomics.org.cn
Thanks very much.
Thanks for interested in our ReAS.
\ No newline at end of file
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment