Commit 394333af
Authored 1 year ago by Balázs Brankovics
refactor sam-similarity.pl
Parent: 29cfaf59
Branch: main
Showing 1 changed file with 21 additions and 23 deletions
src/sam-similarity.pl
@@ -31,7 +31,7 @@ my $info = {
 # - none => STDIN
 # - one => only SAM file
 # - three
-my ($sam, $rfasta, $qfasta) = @ARGV;
+my ($sam, $ref_fasta, $query_fasta) = @ARGV;
 # Check input files
 my $error = "";
@@ -46,19 +46,18 @@ if (@ARGV) {
         $error .= "ERROR: If you wish to specify FASTA files, please specify both reference and query files.\n";
     } elsif (@ARGV == 3) {
         # Check the FASTA files
-        if (! -e $rfasta || -z $rfasta) {
-            $error .= "ERROR: Reference FASTA file ('$rfasta') is empty or missing\n";
+        if (! -e $ref_fasta || -z $ref_fasta) {
+            $error .= "ERROR: Reference FASTA file ('$ref_fasta') is empty or missing\n";
         }
-        if (! -e $qfasta || -z $qfasta) {
-            $error .= "ERROR: Query FASTA file ('$qfasta') is empty or missing\n";
+        if (! -e $query_fasta || -z $query_fasta) {
+            $error .= "ERROR: Query FASTA file ('$query_fasta') is empty or missing\n";
         }
     }
 }
 # Print help or errors if needed
 biointbasics::print_help(\@ARGV, $info, $error);
-my $files = 1;
 # Store length info based on SAM file for IDs
 my %ref;
 my %query;
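The two FASTA checks in this hunk repeat the same test with a different label. A minimal sketch of how they could be folded into one helper follows; `check_fasta` is a hypothetical name that does not appear in the commit, and the file name in the usage line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): return an error message
# for a missing or empty FASTA file, or an empty string if it looks usable.
sub check_fasta {
    my ($label, $path) = @_;
    if (!-e $path || -z $path) {
        return "ERROR: $label FASTA file ('$path') is empty or missing\n";
    }
    return "";
}

# Usage: collect messages the same way the script builds up $error.
my $error = "";
$error .= check_fasta('Reference', '/nonexistent/reference.fasta');
print STDERR $error if $error;
```

The helper keeps the `-e`/`-z` tests and the message wording from the diff, so the output would stay the same if it were swapped in.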
@@ -73,7 +72,6 @@ my %q;
 # Keep all the individual identity scores
 my @scores;
-my $samfh;
 # Read from STDIN if no file is specified or '-' is used
 if (!$sam || $sam eq '-') {
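The rest of the STDIN fallback is collapsed in this view. As a rough sketch of the pattern the comment describes, assuming the handle is only read from afterwards, a helper could return either STDIN or a freshly opened file; `open_sam` is a hypothetical name, not something the commit defines.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): return a read handle for
# the SAM input, falling back to STDIN when no file or '-' is given.
sub open_sam {
    my ($sam) = @_;
    if (!$sam || $sam eq '-') {
        return \*STDIN;
    }
    open(my $fh, '<', $sam) or die "ERROR: cannot open '$sam': $!\n";
    return $fh;
}

# Usage (left as a comment so the sketch does not block on STDIN):
#   my $samfh = open_sam($ARGV[0]);
#   while (my $line = <$samfh>) { ... }
```

Returning a lexical handle from a sub is one way to avoid declaring the handle before the `if`; whether the refactored script does this is not visible in the diff.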
@@ -118,8 +116,8 @@ while(<$samfh>) {
 # Merge overlapping or continuous hits before calculating (breadth) coverage
 &merge_overlapping(\%r);
 &merge_overlapping(\%q);
-my $covr = &calc_coverage(\%r);
-my $covq = &calc_coverage(\%q);
+my $ref_cov = &calc_coverage(\%r);
+my $query_cov = &calc_coverage(\%q);
 # Report ANI if there were any hits => denominator > 0
 unless ($denom) {
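The bodies of `merge_overlapping` and `calc_coverage` are collapsed in this diff, so the following is only a sketch of the general technique the comment names: merge overlapping or adjacent hits, then sum the merged lengths to get breadth of coverage. The data layout (sequence ID mapped to a list of 1-based, inclusive `[start, end]` pairs) and the names `merge_intervals` and `breadth_coverage` are assumptions, not taken from the script.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): sequence ID => list of [start, end] hits,
# 1-based and inclusive.
sub merge_intervals {
    my ($hits) = @_;
    for my $id (keys %$hits) {
        my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
                     @{ $hits->{$id} };
        my @merged;
        for my $iv (@sorted) {
            if (@merged && $iv->[0] <= $merged[-1][1] + 1) {
                # Overlapping or directly adjacent: extend the previous interval
                $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
            } else {
                push @merged, [@$iv];
            }
        }
        $hits->{$id} = \@merged;
    }
    return;
}

# Sum the lengths of the (already merged) intervals over all sequences.
sub breadth_coverage {
    my ($hits) = @_;
    my $covered = 0;
    for my $id (keys %$hits) {
        $covered += $_->[1] - $_->[0] + 1 for @{ $hits->{$id} };
    }
    return $covered;
}

my %ref_hits = (chr1 => [ [1, 100], [50, 150], [151, 200], [300, 350] ]);
merge_intervals(\%ref_hits);
my $ref_cov = breadth_coverage(\%ref_hits);
print "$ref_cov\n";    # prints 251: [1,200] and [300,350]
```

Merging first matters because summing raw hit lengths would count positions 50 to 100 twice in this example.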
@@ -137,44 +135,44 @@ for (@sort) {
 print "(Range of similarity scores [$sort[0], $sort[-1]]; arithmetic mean " . ($sum / scalar @sort) . ")\n";
 # Get full lengths for coverage info
-my ($len_r, $len_q);
+my ($ref_len, $query_len);
 # Use FASTA files if possible
-if ($rfasta && $qfasta) {
+if ($ref_fasta && $query_fasta) {
     # store seq info hash: ID -> sequence
     my $ref_seq = {};
     my $query_seq = {};
-    biointbasics::read_fasta($ref_seq, [], $rfasta);
-    biointbasics::read_fasta($query_seq, [], $qfasta);
+    biointbasics::read_fasta($ref_seq, [], $ref_fasta);
+    biointbasics::read_fasta($query_seq, [], $query_fasta);
     # Add up sequence lengths
     for (values %$ref_seq) {
-        $len_r += length($_);
+        $ref_len += length($_);
     }
     for (values %$query_seq) {
-        $len_q += length($_);
+        $query_len += length($_);
     }
 } else {
     # Get length info from SAM data
     for (values %ref) {
-        $len_r += $_;
+        $ref_len += $_;
     }
     for (values %query) {
-        $len_q += $_;
+        $query_len += $_;
     }
     # Warn user
     print STDERR "WARNING: SAM file is used to calculate reference and query length. This may be inacurate if it was filtered.\n";
 }
 # Print coverage info
-print "R covered: $covr";
-print " (", (sprintf '%.2f', (100 * $covr / $len_r)), "%)";
+print "R covered: $ref_cov";
+print " (", (sprintf '%.2f', (100 * $ref_cov / $ref_len)), "%)";
 print "\n";
-print "Q covered: $covq";
-print " (", (sprintf '%.2f', (100 * $covq / $len_q)), "%)";
+print "Q covered: $query_cov";
+print " (", (sprintf '%.2f', (100 * $query_cov / $query_len)), "%)";
 print "\n";
 # Print summary in 2 line TSV
 print "# ", join("\t", "Similarity", "Identical", "Aligned", "A-covered", "B-covered", "A-length", "B-length"), "\n";
-print join("\t", $numer / $denom, $numer, $denom, $covr, $covq, $len_r, $len_q), "\n";
+print join("\t", $numer / $denom, $numer, $denom, $ref_cov, $query_cov, $ref_len, $query_len), "\n";
 #===SUBROUTINES=================================================================
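The coverage lines above divide by `$ref_len` and `$query_len`, which stay undefined if no lengths were collected. A small sketch of how the percentage could be computed with a guard is shown below; `percent_covered` is a hypothetical helper and the `'NA'` fallback is an assumption, not behaviour of the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): format covered/length as a
# percentage with two decimals, or 'NA' when the length is zero or undefined.
sub percent_covered {
    my ($covered, $length) = @_;
    return 'NA' unless $length;
    return sprintf '%.2f', 100 * $covered / $length;
}

# Usage with made-up numbers:
print "R covered: 251 (", percent_covered(251, 1000), "%)\n";
# prints "R covered: 251 (25.10%)"
```

Keeping the `sprintf '%.2f'` format from the diff means the output is unchanged whenever a length is available.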