autometa.common.external package¶
Submodules¶
autometa.common.external.bedtools module¶
Script containing wrapper functions for bedtools.
-
autometa.common.external.bedtools.genomecov(ibam, lengths, out, force=False)¶ Run bedtools genomecov with input ibam and lengths to retrieve metagenome coverages.
Parameters: - ibam (str) – </path/to/indexed/BAM/file.ibam>. Note: BAM must be sorted by position.
- lengths (str) – </path/to/genome/lengths.tsv> tab-delimited cols=[contig,length]
- out (str) – </path/to/alignment.bed>. The bedtools genomecov output is a tab-delimited file with the following columns: (1) chromosome, (2) depth of coverage, (3) number of bases on the chromosome with that coverage, (4) size of the chromosome, (5) fraction of bases on that chromosome with that coverage. See also: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html
- force (bool) – force overwrite of out if it already exists (default is False).
Returns: </path/to/alignment.bed>
Return type: str
Raises: - FileExistsError – out file already exists and force is False
- OSError – Why the exception is raised.
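The overwrite guard and the command line described for genomecov can be sketched as follows. This is a minimal illustration, not Autometa's implementation: `genomecov_cmd` is a hypothetical helper, and the sketch only assumes that `bedtools genomecov -ibam <bam> -g <lengths>` prints its table to standard output.

```python
import os
from typing import List


def genomecov_cmd(ibam: str, lengths: str, out: str, force: bool = False) -> List[str]:
    """Return the bedtools command, refusing to clobber `out` unless forced.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(out) and not force:
        raise FileExistsError(f"{out} already exists; pass force=True to overwrite")
    # bedtools prints the coverage table to stdout, so the caller is expected
    # to redirect stdout into `out` when running this command.
    return ["bedtools", "genomecov", "-ibam", ibam, "-g", lengths]
```

A caller would then run the returned list with `subprocess.run(cmd, stdout=handle, check=True)`, where `handle` is `out` opened for writing.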
-
autometa.common.external.bedtools.main()¶
-
autometa.common.external.bedtools.parse(bed, out=None, force=False)¶ Calculate coverages from bed file.
Parameters: - bed (str) – </path/to/file.bed>
- out (str) – if provided will write to out. I.e. </path/to/coverage.tsv>
- force (bool) – force overwrite of out if it already exists (default is False).
Returns: index=’contig’, col=’coverage’
Return type: pd.DataFrame
Raises: - ValueError – out incorrectly formatted to be read as pandas DataFrame.
- FileNotFoundError – bed does not exist
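To make the coverage calculation concrete, here is a small standard-library sketch. It assumes that coverage is the length-weighted mean depth, i.e. the sum of depth × bases divided by contig size, and that the genome-wide summary rows written by bedtools are skipped. `coverage_from_genomecov` is a hypothetical name; the real parse function returns a pandas DataFrame and may compute coverage differently.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable


def coverage_from_genomecov(lines: Iterable[str]) -> Dict[str, float]:
    """Compute mean depth per contig from bedtools genomecov histogram rows."""
    depth_sum = defaultdict(int)  # running sum of depth * bases per contig
    length = {}                   # contig size, taken from column 4
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        contig, depth, bases, size = row[0], int(row[1]), int(row[2]), int(row[3])
        if contig == "genome":
            # bedtools appends genome-wide summary rows; they are not a contig.
            continue
        depth_sum[contig] += depth * bases
        length[contig] = size
    return {contig: depth_sum[contig] / length[contig] for contig in length}


# Illustrative input: 20 bases at depth 0 and 80 bases at depth 5 on a 100 bp contig.
example = io.StringIO(
    "contig_1\t0\t20\t100\t0.2\n"
    "contig_1\t5\t80\t100\t0.8\n"
    "genome\t0\t20\t100\t0.2\n"
)
coverages = coverage_from_genomecov(example)
```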
autometa.common.external.bowtie module¶
Script containing wrapper functions for bowtie2.
-
autometa.common.external.bowtie.align(db: str, sam: str, fwd_reads: List[T] = None, rev_reads: List[T] = None, se_reads: List[T] = None, cpus: int = 0, **kwargs) → str¶ Align reads to bowtie2-index db (at least one *_reads argument is required).
Parameters: - db (str) – </path/to/prefix/bowtie2/database>. I.e. db.{#}.bt2
- sam (str) – </path/to/out.sam>
- fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]
- rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]
- se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]
- cpus (int, optional) – Num. processors to use (the default is 0).
- **kwargs (dict, optional) – Additional optional arguments to supply to bowtie2. Each entry must be in the format key=flag, value=flag-value.
Returns: </path/to/out.sam>
Return type: str
Raises: ChildProcessError– bowtie2 failed
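The way the arguments of align could map onto a bowtie2 command line can be sketched as below. `bowtie2_align_cmd` is a hypothetical helper, not Autometa's code. It assumes that read lists are passed to `-1`, `-2` and `-U` as comma-separated values, that `cpus=0` means the `-p` option is simply omitted, and that each extra keyword is appended as a flag followed by its value.

```python
from typing import List, Optional


def bowtie2_align_cmd(
    db: str,
    sam: str,
    fwd_reads: Optional[List[str]] = None,
    rev_reads: Optional[List[str]] = None,
    se_reads: Optional[List[str]] = None,
    cpus: int = 0,
    **kwargs,
) -> List[str]:
    """Assemble a bowtie2 command line (hypothetical helper, for illustration)."""
    if not (fwd_reads or rev_reads or se_reads):
        raise ValueError("at least one *_reads argument is required")
    cmd = ["bowtie2", "-x", db, "-S", sam]
    if cpus:
        cmd += ["-p", str(cpus)]
    # bowtie2 accepts comma-separated file lists for -1, -2 and -U.
    for flag, reads in (("-1", fwd_reads), ("-2", rev_reads), ("-U", se_reads)):
        if reads:
            cmd += [flag, ",".join(reads)]
    # Extra options: each key is a flag and each value is that flag's value.
    for flag, value in kwargs.items():
        cmd += [flag, str(value)]
    return cmd


cmd = bowtie2_align_cmd(
    "idx/assembly",
    "out.sam",
    fwd_reads=["r1.fastq"],
    rev_reads=["r2.fastq"],
    cpus=4,
    **{"--seed": 42},
)
```

Because flags such as `--seed` are not valid Python identifiers, they have to be supplied by unpacking a dict, as in the call above.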
-
autometa.common.external.bowtie.build(assembly: str, out: str) → str¶ Build bowtie2 index.
Parameters: - assembly (str) – </path/to/assembly.fasta>
- out (str) – </path/to/output/database> Note: Indices written will resemble </path/to/output/database.{#}.bt2>
Returns: </path/to/output/database>
Return type: str
Raises: ChildProcessError– bowtie2-build failed
-
autometa.common.external.bowtie.main()¶
-
autometa.common.external.bowtie.run(cmd: str) → bool¶ Run cmd via subprocess.
Parameters: cmd (str) – Executable input str
Returns: True if no returncode from subprocess.call else False
Return type: bool
autometa.common.external.diamond module¶
Class and functions related to running diamond on metagenome sequences
-
autometa.common.external.diamond.blast(fasta: str, database: str, outfpath: str, blast_type: str = 'blastp', evalue: float = 1e-05, maxtargetseqs: int = 200, cpus: int = 2, tmpdir: str = None, force: bool = False, verbose: bool = False) → str¶ Performs a diamond blastp or blastx search of the query sequences against a diamond-formatted database.
Parameters: - fasta (str) – Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP and nucleotide sequences in case of BLASTX
- database (str) – Path to diamond formatted database
- outfpath (str) – Path to output file
- blast_type (str, optional) – blastp to align protein query sequences against a protein reference database, blastx to align translated DNA query sequences against a protein reference database, by default ‘blastp’
- evalue (float, optional) – cutoff e-value to count hit as significant, by default float(‘1e-5’)
- maxtargetseqs (int, optional) – max number of target sequences to retrieve per query by diamond, by default 200
- cpus (int, optional) – Number of processors to be used, by default uses all the processors of the system
- tmpdir (str, optional) – Path to temporary directory. By default, same as the output directory
- force (bool, optional) – overwrite existing diamond results, by default False
- verbose (bool, optional) – log progress to terminal, by default False
Returns: Path to BLAST results
Return type: str
Raises: - FileNotFoundError – fasta file does not exist
- ValueError – provided blast_type is not ‘blastp’ or ‘blastx’
- subprocess.CalledProcessError – Failed to run blast
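The blast_type validation and the parameters listed for blast can be illustrated with a command-building sketch. `diamond_cmd` is a hypothetical helper and the exact options Autometa passes are an assumption. The flags used here (`--query`, `--db`, `--out`, `--evalue`, `--max-target-seqs`, `--threads`, `--outfmt`, `--tmpdir`) are standard diamond options, and tabular output (`--outfmt 6`) is chosen only for the example.

```python
from typing import List, Optional


def diamond_cmd(
    fasta: str,
    database: str,
    outfpath: str,
    blast_type: str = "blastp",
    evalue: float = 1e-5,
    maxtargetseqs: int = 200,
    cpus: int = 2,
    tmpdir: Optional[str] = None,
) -> List[str]:
    """Build a diamond search command (hypothetical helper, for illustration)."""
    if blast_type not in {"blastp", "blastx"}:
        raise ValueError(f"blast_type must be 'blastp' or 'blastx', got {blast_type!r}")
    cmd = [
        "diamond", blast_type,
        "--query", fasta,
        "--db", database,
        "--out", outfpath,
        "--evalue", str(evalue),
        "--max-target-seqs", str(maxtargetseqs),
        "--threads", str(cpus),
        "--outfmt", "6",  # assumption: BLAST-style tabular output
    ]
    if tmpdir:
        cmd += ["--tmpdir", tmpdir]
    return cmd
```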
-
autometa.common.external.diamond.makedatabase(fasta: str, database: str, cpus: int = 2) → str¶ Creates a database against which the query sequence would be blasted
Parameters: - fasta (str) – Path to fasta file whose database needs to be made e.g. ‘<path/to/fasta/file>’
- database (str) – Path to the output diamond formatted database file e.g. ‘<path/to/database/file>’
- cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system
Returns: Path to diamond formatted database
Return type: str
Raises: subprocess.CalledProcessError– Failed to create diamond formatted database
-
autometa.common.external.diamond.parse(results: str, bitscore_filter: float = 0.9, verbose: bool = False) → dict¶ Retrieve diamond results from output table
Parameters: - results (str) – Path to BLASTP output file in outfmt9
- bitscore_filter (0 < float <= 1, optional) – Bitscore filter applied to each sseqid, by default 0.9 Used to determine whether the bitscore is above a threshold value. For example, if it is 0.9 then only bitscores >= 0.9 * the top bitscore are accepted
- verbose (bool, optional) – log progress to terminal, by default False
Returns: {qseqid: {sseqid, sseqid, …}, …}
Return type: dict
Raises: - FileNotFoundError – diamond results table does not exist
- ValueError – bitscore_filter value is not a float or not in range of 0 to 1
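The bitscore filter can be sketched with the standard library alone. The sketch makes several assumptions that the documentation does not confirm: the results are 12-column BLAST-style tabular rows with the bitscore in the last column, the top bitscore is taken per query sequence, and lines starting with `#` are comments. `filter_hits` is a hypothetical name.

```python
import csv
from typing import Dict, Iterable, List, Set, Tuple


def filter_hits(lines: Iterable[str], bitscore_filter: float = 0.9) -> Dict[str, Set[str]]:
    """Keep, per query, the subjects scoring >= bitscore_filter * top bitscore."""
    if not isinstance(bitscore_filter, float) or not 0 < bitscore_filter <= 1:
        raise ValueError("bitscore_filter must be a float in the range (0, 1]")
    hits: Dict[str, List[Tuple[str, float]]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Assumed column layout: qseqid first, sseqid second, bitscore last.
        qseqid, sseqid, bitscore = row[0], row[1], float(row[-1])
        hits.setdefault(qseqid, []).append((sseqid, bitscore))
    kept: Dict[str, Set[str]] = {}
    for qseqid, scored in hits.items():
        threshold = bitscore_filter * max(score for _, score in scored)
        kept[qseqid] = {sseqid for sseqid, score in scored if score >= threshold}
    return kept


# Illustrative rows: only the first, second and last columns matter here.
rows = [
    "q1\tsA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200.0",
    "q1\tsB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-45\t185.0",
    "q1\tsC\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t100.0",
]
result = filter_hits(rows)
```

With the default filter of 0.9 and a top bitscore of 200, the threshold is about 180, so sA and sB are kept and sC is dropped.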
autometa.common.external.hmmer module¶
Functions related to running hmmer on metagenome sequences
-
autometa.common.external.hmmer.annotate_parallel(orfs, hmmdb, outfpath, cpus, seed=42)¶
-
autometa.common.external.hmmer.annotate_sequential(orfs, hmmdb, outfpath, cpus, seed=42)¶
-
autometa.common.external.hmmer.filter_markers(infpath, outfpath, cutoffs, orfs=None, force=False)¶ Filter markers from hmmscan output table that are above cutoff values.
Parameters: - infpath (str) – </path/to/hmmscan.tsv>
- outfpath (str) – </path/to/output.markers.tsv>
- cutoffs (str) – </path/to/cutoffs.tsv>
- orfs (str, optional) – Default will attempt to translate recovered qseqids to contigs </path/to/prodigal/called/orfs.fasta>
- force (bool, optional) – Overwrite existing outfpath (the default is False).
Returns: </path/to/output.markers.tsv>
Return type: str
Raises: - FileNotFoundError – infpath or cutoffs not found
- FileExistsError – outfpath already exists and force=False
- AssertionError – No returned markers pass the cutoff thresholds. I.e. final df is empty.
-
autometa.common.external.hmmer.hmmpress(fpath)¶ Runs hmmpress on fpath.
Parameters: fpath (str) – </path/to/kingdom.markers.hmm>
Returns: </path/to/hmmpressed/kingdom.markers.hmm>
Return type: str
Raises: - FileNotFoundError – fpath not found.
- subprocess.CalledProcessError – hmmpress failed
-
autometa.common.external.hmmer.hmmscan(orfs, hmmdb, outfpath, cpus=0, force=False, parallel=True, gnu_parallel=False, seed=42)¶ Runs hmmscan on dataset ORFs and provided hmm database.
Note
Only one of parallel and gnu_parallel may be provided as True
Parameters: - orfs (str) – </path/to/orfs.faa>
- hmmdb (str) – </path/to/hmmpressed/database.hmm>
- outfpath (str) – </path/to/output.hmmscan.tsv>
- cpus (int, optional) – Num. cpus to use. 0 will run as many cpus as possible (the default is 0).
- force (bool, optional) – Overwrite existing outfpath (the default is False).
- parallel (bool, optional) – Will use multithreaded parallelization offered by hmmscan (the default is True).
- gnu_parallel (bool, optional) – Will parallelize hmmscan using GNU parallel (the default is False).
- seed (int, optional) – set RNG seed to <n> (if 0: one-time arbitrary seed) (the default is 42).
Returns: </path/to/output.hmmscan.tsv>
Return type: str
Raises: - ValueError – Both parallel and gnu_parallel were provided as True
- FileExistsError – outfpath already exists
- subprocess.CalledProcessError – hmmscan failed
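The mutually exclusive parallel and gnu_parallel options and the hmmscan arguments can be sketched as follows. `hmmscan_cmd` is a hypothetical helper that covers only the plain hmmscan invocation; the GNU parallel path is not modelled. It assumes that `cpus=0` leaves `--cpu` unset so HMMER falls back to its own default. `--seed`, `--tblout` and `--cpu` are standard hmmscan options.

```python
from typing import List


def hmmscan_cmd(
    orfs: str,
    hmmdb: str,
    outfpath: str,
    cpus: int = 0,
    parallel: bool = True,
    gnu_parallel: bool = False,
    seed: int = 42,
) -> List[str]:
    """Build an hmmscan command (hypothetical helper, for illustration)."""
    if parallel and gnu_parallel:
        raise ValueError("only one of parallel and gnu_parallel may be True")
    cmd = ["hmmscan", "--seed", str(seed), "--tblout", outfpath]
    if cpus:
        cmd += ["--cpu", str(cpus)]
    # hmmscan takes the pressed HMM database first, then the sequence file.
    cmd += [hmmdb, orfs]
    return cmd
```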
-
autometa.common.external.hmmer.main()¶
autometa.common.external.prodigal module¶
Functions to retrieve orfs from provided assembly using prodigal
-
autometa.common.external.prodigal.aggregate_orfs(search_str: str, outfpath: str) → None¶
-
autometa.common.external.prodigal.annotate_parallel(assembly: str, prots_out: str, nucls_out: str, cpus: int) → None¶
-
autometa.common.external.prodigal.annotate_sequential(assembly: str, prots_out: str, nucls_out: str) → None¶
-
autometa.common.external.prodigal.contigs_from_headers(fpath: str) → Mapping[str, str]¶ Get ORF id to contig id translations using prodigal assigned ID from description.
First determines whether the whole ID from the description (e.g. ID=3495691_2) is present in the header. “3495691_2” represents the 2nd gene in the 3,495,691st sequence.
Example
#: prodigal versions < 2.6 record
>>> record.id
'k119_1383959_3495691_2'
>>> record.description
'k119_1383959_3495691_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None'
>>> record.description.split('#')[-1].split(';')[0].strip()
'ID=3495691_2'
>>> orf_id = '3495691_2'
>>> record.id.replace(f'_{orf_id}', '')
'k119_1383959'
#: prodigal versions >= 2.6 record
>>> record.id
'k119_1383959_2'
>>> record.id.rsplit('_',1)[0]
'k119_1383959'
Parameters: fpath (str) – </path/to/prodigal/called/orfs.fasta>
Returns: contigs translated from prodigal ORF description. {orf_id:contig_id, …}
Return type: dict
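The two header styles in the example can be handled by one function, sketched here on plain header strings so that it runs without Biopython. `contigs_from_header_lines` is a hypothetical name, and the use of a suffix check in place of `str.replace` is a choice made for this sketch, not necessarily what Autometa does.

```python
from typing import Dict, Iterable


def contigs_from_header_lines(headers: Iterable[str]) -> Dict[str, str]:
    """Map each prodigal ORF record id to its contig id."""
    translations = {}
    for header in headers:
        header = header.lstrip(">").strip()
        record_id = header.split()[0]
        # Last '#'-separated field starts with e.g. 'ID=3495691_2'.
        id_field = header.split("#")[-1].split(";")[0].strip()
        orf_id = id_field.replace("ID=", "")
        suffix = f"_{orf_id}"
        if record_id.endswith(suffix):
            # Older style: the full ID from the description is in the header.
            contig_id = record_id[: -len(suffix)]
        else:
            # Newer style: only the gene number is appended to the contig id.
            contig_id = record_id.rsplit("_", 1)[0]
        translations[record_id] = contig_id
    return translations


old_style = ">k119_1383959_3495691_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;start_type=ATG"
new_style = ">k119_1383959_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;start_type=ATG"
mapping = contigs_from_header_lines([old_style, new_style])
```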
-
autometa.common.external.prodigal.main()¶
-
autometa.common.external.prodigal.orf_records_from_contigs(contigs: Union[List[T], Set[T]], fpath: str) → List[SeqRecord]¶ Retrieve list of ORF records from contigs. Prodigal annotated ORFs are required as the input fpath.
Parameters: - contigs (iterable) – iterable of contigs from which to retrieve ORFs
- fpath (str) – </path/to/prodigal/called/orfs.fasta>
Returns: ORF SeqIO.SeqRecords from provided contigs. i.e. [SeqRecord, …]
Return type: list
-
autometa.common.external.prodigal.run(assembly: str, nucls_out: str, prots_out: str, force: bool = False, cpus: int = 0) → Tuple[str, str]¶ Calls ORFs from provided input assembly
Parameters: - assembly (str) – </path/to/assembly.fasta>
- nucls_out (str) – </path/to/nucls.out>
- prots_out (str) – </path/to/prots.out>
- force (bool) – overwrite nucls_out and prots_out if they already exist (the default is False).
- cpus (int) – num cpus to use. Default (cpus=0) will run as many `cpus` as possible
Returns: (nucls_out, prots_out)
Return type: 2-Tuple
Raises: - FileExistsError – nucls_out or prots_out already exists
- subprocess.CalledProcessError – prodigal failed
- ChildProcessError – nucls_out or prots_out not written
- IOError – nucls_out or prots_out incorrectly formatted
autometa.common.external.samtools module¶
Script containing wrapper functions for samtools
-
autometa.common.external.samtools.main()¶
-
autometa.common.external.samtools.sort(sam, bam, cpus=2)¶ Views then sorts sam file by leftmost coordinates and outputs to bam.
Parameters: - sam (str) – </path/to/alignment.sam>
- bam (str) – </path/to/output/alignment.bam>
- cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system
Raises: - TypeError – cpus must be an integer greater than zero
- FileNotFoundError – Specified path is incorrect or the file is empty
- ExternalToolError – Samtools did not run successfully, returns subprocess traceback and command run
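The "view then sort" step can be sketched as two commands joined by a pipe. `samtools_sort_cmds` is a hypothetical helper and the exact flags Autometa uses are an assumption. `samtools view -b` writes BAM to standard output, and `samtools sort -o <bam> -` reads from standard input and sorts by leftmost coordinate by default.

```python
from typing import List, Tuple


def samtools_sort_cmds(sam: str, bam: str, cpus: int = 2) -> Tuple[List[str], List[str]]:
    """Return (view_cmd, sort_cmd); view's stdout should be piped into sort."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(cpus, bool) or not isinstance(cpus, int) or cpus <= 0:
        raise TypeError(f"cpus must be an integer greater than zero, got {cpus!r}")
    view_cmd = ["samtools", "view", "-@", str(cpus), "-b", sam]
    sort_cmd = ["samtools", "sort", "-@", str(cpus), "-o", bam, "-"]
    return view_cmd, sort_cmd
```

One way to connect the two is to start the first command with `subprocess.Popen(view_cmd, stdout=subprocess.PIPE)` and pass its `stdout` as `stdin` to the second.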