Overview

Mash Screen (citation) is a tool for fast contamination screening using the MinHash algorithm. For this purpose, SeqSphere+ comes with a Mash reference database (sketch size s=1,000, k-mer size k=21) that contains all prokaryotic NCBI Genome entries with status complete or chromosome, filtered for taxonomically reliable genus and species information.

Important: Mash Screen requires Linux, or Windows with the Windows Subsystem for Linux installed.

Pipeline Contamination Check (Mash Screen)

If the option Perform contamination check (Mash Screen) is enabled in the pipeline script, Mash Screen is started to find the closest matches for the FASTA assembly contigs of a Sample in the Mash reference database. From FASTA files, contamination with a second species can only be detected reliably above a ratio of 9 percent. If multiple different species are found above the predefined thresholds (Identity >=0.95, Shared-hashes >=100), a potential contamination is reported and a warning is written to the pipeline log. Up to seven fields of the procedure statistics are filled with the results of the contamination check:
* only filled if potential contamination found

The Top Species Match is always filled, even if the top match does not reach the thresholds. All fields can be exported, shown in a comparison table, and viewed in the procedure details panel of a sample. If a contamination is found, it is highlighted as a warning in this panel. The first two fields are default fields when creating a comparison table. Closely related species that are defined in Mash equivalency groups are not treated as contamination.

Tools Menu Contamination Check (Mash Screen)

The menu function Tools | Genome Utilities | Contamination Check (Mash Screen) can be used to screen for contaminants in a read file (FASTQ) or an assembly contigs file (FASTA/GB/BAM/ACE). For read data, using the forward reads file is recommended. From FASTQ files, contamination with a second species can be detected reliably above a ratio of 1 percent. When the dialog is confirmed with Start, Mash Screen is started to find matches for the query in the Mash reference database. The resulting matches are filtered by thresholds for Identity and Shared-hashes; the default thresholds (Identity >=0.95, Shared-hashes >=100) can be changed. By default, the option 'Winner takes all' is enabled to remove redundancy in the result. If this option is disabled, every matching strain of the same species in the reference database is reported. The result is shown in a dialog window containing a table with all matches above the defined thresholds. The table can be exported and has the following columns:
|