Overview

Mash Distance (citation) is software for fast genome distance estimation using the MinHash algorithm. In SeqSphere+, Mash Distance is used for rapid species identification and for automatic project choosing in the pipeline. For this purpose, SeqSphere+ comes with a Mash reference database (sketch size s=1,000, k-mer size k=21) that contains all prokaryotic NCBI RefSeq Genome entries with status complete or chromosome, filtered for taxonomically reliable genus and species information.

Important: Mash Distance requires Linux, or Windows with the Windows Subsystem for Linux installed.

Pipeline Automatic Project Choosing (Mash Distance)

In the section Define Projects of a pipeline script, the option Automatically choose project (Mash Distance) can be enabled to choose the correct project for a processed sample automatically. If enabled, Mash Distance is started to find the closest match for the input file of a sample (FASTA/GB/BAM/ACE or first FASTQ file) in the Mash reference database. If a match that meets the thresholds (Mash-distance <=0.1, Matching-hashes >=100) is found, the project that provides task templates for that species is chosen automatically. This option can only be used if only one project per species is defined in the pipeline. If no match meets the thresholds, or if the pipeline script contains no project for a matching species, processing of the sample fails and the pipeline continues with the next sample. To handle closely related and/or synonymous species, Mash equivalency groups have been defined for some species.

Distinction between Escherichia and Shigella

If the Top Match of the Mash species identification is Escherichia or Shigella, the function tries to further distinguish between these closely related genera using the presence of ipaH and Shigella-specific STs as defined by ShigaPass [PubMed 36951906].
If the genus cannot be determined this way, the species will be assigned based on the If species identification fails settings in the pipeline script.

Tools Menu Identification (Mash Distance)

The menu function Tools | Genome Utilities | Identification (Mash Distance) can be used for rapid identification of a read file (FASTQ) or an assembly contigs file (FASTA/GB/BAM/ACE). For read data, using the forward reads file is recommended. When the dialog is confirmed with Start, Mash Distance is started to find matches for the query in the Mash reference database. The resulting matches are filtered by thresholds for Mash-distance and Matching-hashes. The default thresholds (Mash-distance <=0.1, Matching-hashes >=100) can be changed. The result is shown in a dialog window containing an exportable table with all matches that meet the defined thresholds. After selecting a row, the entry can be browsed at the NCBI website via right-click.

The table has the following columns:
|
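The threshold filtering described for the identification function can be illustrated with a small sketch. It assumes the tab-separated output of the standalone `mash dist` command (Reference-ID, Query-ID, Mash-distance, P-value, Matching-hashes) and applies the default thresholds named in the text. How SeqSphere+ performs this step internally is not documented here, so the function names, the `MashHit` class, and the sample lines below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

# Default thresholds as stated in the text.
MAX_DISTANCE = 0.1
MIN_MATCHING_HASHES = 100


@dataclass
class MashHit:
    """One reference match (hypothetical helper type for this sketch)."""
    reference_id: str
    distance: float
    p_value: float
    matching_hashes: int
    sketch_size: int


def parse_mash_dist_line(line: str) -> MashHit:
    """Parse one tab-separated line of `mash dist` output."""
    reference_id, _query_id, distance, p_value, hashes = line.rstrip("\n").split("\t")
    matched, total = hashes.split("/")  # e.g. "620/1000"
    return MashHit(reference_id, float(distance), float(p_value), int(matched), int(total))


def filter_hits(lines, max_distance=MAX_DISTANCE, min_hashes=MIN_MATCHING_HASHES):
    """Keep hits that meet both thresholds, closest match first."""
    hits = [parse_mash_dist_line(line) for line in lines if line.strip()]
    passing = [
        hit for hit in hits
        if hit.distance <= max_distance and hit.matching_hashes >= min_hashes
    ]
    return sorted(passing, key=lambda hit: hit.distance)


# Made-up example lines; the identifiers and values are illustrative only.
SAMPLE_LINES = [
    "ref_A.fna\tquery.fasta\t0.0127\t0\t620/1000",
    "ref_B.fna\tquery.fasta\t0.0909\t1e-30\t80/1000",   # too few matching hashes
    "ref_C.fna\tquery.fasta\t0.00124\t0\t950/1000",
]

if __name__ == "__main__":
    for hit in filter_hits(SAMPLE_LINES):
        print(hit.reference_id, hit.distance, f"{hit.matching_hashes}/{hit.sketch_size}")
```

In this sketch the first element of the returned list plays the role of the closest match; a sample with an empty result would correspond to the case where no match meets the thresholds.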