Benchmarking
This page contains information regarding test datasets as well as benchmarking statistics. The current Autometa pipeline was compared against its previous version as well as other binning pipelines.
Test datasets
Simulated
Communities were simulated using ART, a sequencing read simulator, with genomes retrieved at random from a collection of 3000 bacteria. Genomes were retrieved until the provided total length was reached.
For example, -l 1250 translates to 1250Mbp as the sum of the total lengths of all bacterial genomes retrieved.
# Work out coverage level for art_illumina
# C = [(LN)/G]/2
# C = coverage
# L = read length (total of paired reads)
# N = number of reads
# G = genome size in bp
# -p : indicate a paired-end read simulation or to generate reads from both ends of amplicons
# -ss : HS25 -> HiSeq 2500 (125bp, 150bp)
# -f : fold of read coverage simulated or number of reads/read pairs generated for each amplicon
# -m : the mean size of DNA/RNA fragments for paired-end simulations
# -s : the standard deviation of DNA/RNA fragment size for paired-end simulations.
# -l : the length of reads to be simulated
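# reads and length must be set beforehand; length is the community size in Mbp
$ coverage=$(echo "scale=2; (250 * ${reads}) / (${length} * 1000000)" | bc)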
$ art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
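The coverage calculation above can be wrapped in a small function. This is a minimal sketch, not part of Autometa: the name coverage_for is hypothetical, and the read count of 1000000 is only an illustrative value. It uses awk for the floating-point arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Autometa or ART).
# Mirrors the formula above: (250 * reads) / (length * 1000000),
# where 250 is the combined length of one 2 x 125 bp read pair.
coverage_for() {
    local reads="$1"       # value used for "reads" in the formula above
    local length_mbp="$2"  # community size in Mbp, e.g. 78.125
    awk -v n="$reads" -v l="$length_mbp" \
        'BEGIN { printf "%.2f\n", (250 * n) / (l * 1000000) }'
}

# Illustrative values only: 1000000 reads for the 78.125Mbp community
coverage=$(coverage_for 1000000 78.125)
echo "fold coverage: ${coverage}"
```

The resulting value can then be passed to art_illumina through -f, as in the command above.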
| Community | Num. Genomes | Num. Control Sequences |
|---|---|---|
| 78.125Mbp | 21 | 4,044 |
| 156.25Mbp | 38 | 3,573 |
| 312.50Mbp | 85 | 7,708 |
| 625Mbp | 166 | 17,590 |
| 1250Mbp | 319 | 41,507 |
| 2500Mbp | 656 | 67,702 |
| 5000Mbp | 1,288 | 140,529 |
| 10000Mbp | 2,638 | 285,262 |
You can download all the simulated communities using this link. Individual communities can be downloaded using the links in the table above.
For more information on simulated communities, check the README.md located in the simulated_communities directory.
Synthetic
Fifty-one bacterial isolates were assembled into synthetic communities, which we have titled MIX51.
The initial synthetic community was prepared from a mixture of these fifty-one isolates, and its DNA was extracted for sequencing, assembly and binning.
You can download the MIX51 community using this link.
Download datasets
Using autometa built-in module
Todo
Address Issue #110 and add steps here.
Using command line
You can download the individual assemblies of the different datasets from the command line with gdown. If you installed Autometa using conda, gdown should already be installed. If not, you can install it using conda install -c conda-forge gdown or pip install gdown.
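Before downloading, it can help to confirm that gdown is on your PATH. The following is a minimal sketch. The helper name have_cmd is hypothetical, and the install hints simply repeat the commands given above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd gdown; then
    echo "gdown is installed"
else
    echo "gdown not found. Install it with one of:"
    echo "  conda install -c conda-forge gdown"
    echo "  pip install gdown"
fi
```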
Example for the 78Mbp community
- Navigate to the 78Mbp community dataset using the link mentioned above.
- Get the file ID by navigating to any of the files and right clicking, then selecting the get link option. This will have a copy link button that you should use. The link for the metagenome assembly (i.e. metagenome.fna.gz) should look like this: https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing
- The file ID is the text between file/d/ and the next forward slash, e.g.:
# Pasted from copy link button:
https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing
#                 begin file ID ^ ------------------------------^ end file ID
- Copy the file ID
- Now that you have the file ID, you can either pass the ID to gdown with --id or pass a URL with the drive.google.com prefix. Both should work.
file_id="15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y"
gdown --id "${file_id}" -O metagenome.fna.gz
# or (quote the URL so the shell does not interpret the ? character)
gdown "https://drive.google.com/uc?id=${file_id}" -O metagenome.fna.gz
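Instead of copying the file ID by hand, it can be extracted from the pasted link with shell parameter expansion. This is a minimal sketch. The function name file_id_from_link is hypothetical, and it assumes the link has the file/d/<ID>/ layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the file ID out of a Google Drive share link.
# Assumes the link contains ".../file/d/<ID>/..." as in the example above.
file_id_from_link() {
    local link="$1"
    local rest="${link#*/file/d/}"   # drop everything up to and including "file/d/"
    printf '%s\n' "${rest%%/*}"      # keep the text before the next "/"
}

link="https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing"
file_id=$(file_id_from_link "$link")
echo "${file_id}"
```

The resulting file_id variable can be used directly in the gdown commands above.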
Note
Unfortunately, gdown does not currently support downloading entire directories from Google Drive. An open pull request on the gdown repository addresses this issue. We are keeping a close eye on it and will update this documentation when it is merged.
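Until then, one possible workaround is to collect the file IDs you need in a text file and loop over them. This is a minimal sketch and not an Autometa feature. The function download_listed_files, the DRY_RUN switch and the file_ids.txt layout (file ID, then output name) are all hypothetical. The only ID listed is the metagenome assembly ID from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download every file listed in a two-column text file
# (file ID, then output name, separated by whitespace).
# With DRY_RUN=1 the gdown commands are printed instead of executed.
download_listed_files() {
    local list="$1"
    local file_id out
    while read -r file_id out; do
        # Skip blank lines and comment lines
        case "$file_id" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "gdown --id ${file_id} -O ${out}"
        else
            gdown --id "${file_id}" -O "${out}"
        fi
    done < "$list"
}

# Example list containing the metagenome assembly ID from the example above
printf '%s\n' "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y metagenome.fna.gz" > file_ids.txt

# Print the commands first; unset DRY_RUN to perform the downloads
DRY_RUN=1
download_listed_files file_ids.txt
```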
Benchmarks
Todo
Add the Benchmarking statistics