The SeqSphere+ consensus caller is used to calculate a consensus

  • for read alignments resulting of SKESA assemblies that were remapped with BWA
  • for read alignments resulting of SPAdes assemblies that were remapped with BWA
  • for read alignments created by reference mapping with BWA
  • for correcting the consensus of Velvet assemblies

Consensus Base

  1. Get winner base:
    • Get for each strand the winner base with the highest sum of base qualities.
    • If gap appears more often in the reads than any of the other bases, a gap is returned.
    • If the two strand winner bases are not equal, mix both strands together, and get again the winner base with the highest sum of base qualities.
  2. Ambiguity heck:
    • If sum of base quality of the winner base for both strands reaches the read support threshold (default: 60%) related to the sum of all base qualities, the winner base is taken as potential consensus base.
    • Else the ambiguity N is called as potential consensus base.
  3. Cpverage check:
    • If the coverage strands is below the coverage threshold (default: 0) , an N is called.


For remapping alignments of SPAdes assemblies and for reference mappings that have an estimated average genome coverage below 50, the coverage threshold is set to 5. Else no threshold is set, so minimum coverage is required. Both thresholds can be can be configured in the short-read assembler settings of a pipeline script.


Consensus Base Quality

The consensus base qualities are caluclated from the two best base qualities for each strand (based on a MIRA-like process):

For each strand, the best quality is taken.
Then 10% of the next best base quality is added.
The total of both strand qualities gives the final quality, with cap 90.
Consensus gaps have always quality 0.

Example:
forward:
    A 30   -> 30     \
    A 20   ->  2     /  -> 32  \
    A 20                        \
    A 20                         \
                                   ->  60     
reverse:                         /
    A 26   -> 26     \          /
    A 20   ->  2     /  -> 28  /
    A 15                         

Uncovered Reference Genome Positions in BAM Files

If the reference genome has regions that are uncovered in the genome under study, those gaps are deleted in the resulting consensus sequence and juxtapositional covered regions are concatenated. This rule applies also for FASTA files that are exported.

Filtering Reads in BAM Files

If one of the following flags are found, the reads are skipped when calling the consensus, and are not imported into samples:

  • Read marked as unmapped
  • Read mapping quality is below 10 (configurable)
  • Read has not a single best hit (i.e., X0 flag is not equal to 1)
  • Read fails vendor quality check
  • Read marked as duplicate (however, marking duplicates is not performed within SeqSphere+)
  • Read alignment start is 0, length is 0 or name is *