autometa.common package¶

Subpackages¶

autometa.common.external package

Submodules¶

autometa.common.coverage module¶

Calculates coverage of contigs

autometa.common.coverage.from_spades_names(records)¶

Retrieve coverages from SPAdes scaffolds headers.

Example SPAdes header : NODE_83_length_162517_cov_224.639

Parameters:	records (list) – [SeqRecord,…]
Returns:	index=contig, name=’coverage’, dtype=float
Return type:	pd.Series
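
The header parsing described above can be sketched as follows. This is a minimal illustration, not Autometa's implementation: the helper name `coverage_from_spades_id` is hypothetical, the regular expression assumes headers shaped like the example above, and a plain dict stands in for the pd.Series to keep the sketch dependency-free.

```python
import re

# Assumed layout: the k-mer coverage follows "_cov_" in the record id,
# as in the example header NODE_83_length_162517_cov_224.639.
SPADES_COV = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def coverage_from_spades_id(record_id):
    """Return the coverage encoded in a SPAdes-style record id as a float."""
    match = SPADES_COV.search(record_id)
    if match is None:
        raise ValueError(f"no SPAdes coverage found in {record_id!r}")
    return float(match.group(1))


record_ids = ["NODE_83_length_162517_cov_224.639"]
# Map each contig id to its coverage (stand-in for the pd.Series).
coverages = {rid: coverage_from_spades_id(rid) for rid in record_ids}
```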

autometa.common.coverage.get(fasta, out, from_spades=False, fwd_reads=None, rev_reads=None, se_reads=None, sam=None, bam=None, lengths=None, bed=None, cpus=1)¶

Get coverages for the assembly fasta file from the provided files. If the metagenome assembly was generated by SPAdes, the k-mer coverages recorded in each contig’s header can be used instead by specifying from_spades=True.

If from_spades=False, one of the following must be provided: fwd_reads and rev_reads and/or se_reads, sam, bam, or bed.

Notes

Coverage calculation begins from the first provided input, checked in the following order:

bed
bam
sam
fwd_reads and rev_reads and se_reads

Event sequence to calculate contig coverages:

align reads to generate alignment.sam
sort samfile to generate alignment.bam
calculate assembly coverages to generate alignment.bed
calculate contig coverages to generate coverage.tsv

Parameters:
fasta (str) – </path/to/assembly.fasta>
out (str) – </path/to/output/coverages.tsv>
from_spades (bool, optional) – If True, will attempt to parse record ids for coverage information. This is only compatible with SPAdes assemblies (the default is False).
fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]
rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]
se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]
sam (str, optional) – </path/to/alignments.sam>
bam (str, optional) – </path/to/alignments.bam>
lengths (str, optional) – </path/to/lengths.tsv>
bed (str, optional) – </path/to/alignments.bed>
cpus (int, optional) – Number of cpus to use for coverage calculation.
Returns:	index=contig cols=[‘coverage’]
Return type:	pd.DataFrame
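
The input precedence listed in the notes can be sketched as a small dispatch function. The name `pick_starting_input` is hypothetical and the sketch only decides where the calculation would start; it does not run any alignment. Treating "paired reads and/or single-end reads" as sufficient is an assumption drawn from the requirement stated above.

```python
def pick_starting_input(bed=None, bam=None, sam=None,
                        fwd_reads=None, rev_reads=None, se_reads=None):
    """Return the name of the first usable input, in the documented order."""
    if bed:
        return "bed"
    if bam:
        return "bam"
    if sam:
        return "sam"
    # Assumption: paired reads and/or single-end reads are enough to align.
    if (fwd_reads and rev_reads) or se_reads:
        return "reads"
    raise ValueError("provide bed, bam, sam or reads when from_spades=False")


# A bed file takes priority even when a bam file is also given.
start = pick_starting_input(bed="alignments.bed", bam="alignments.bam")
```

Starting from the most processed file skips the earlier steps of the event sequence (alignment, sorting, bed generation).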

autometa.common.coverage.main()¶

autometa.common.coverage.make_length_table(fasta, out)¶

Writes a tab-delimited length table to out given an input fasta.

Parameters:	fasta (str) – </path/to/assembly.fasta> out (str) – </path/to/lengths.tsv>
Returns:	</path/to/lengths.tsv>
Return type:	str
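
A length table of this kind could be produced along these lines. This is a sketch under assumptions: the function name `write_length_table`, the `contig` and `length` column names, and the presence of a header row are illustrative and not confirmed by the source.

```python
def write_length_table(fasta, out):
    """Write a tab-delimited table of sequence lengths and return its path."""
    lengths = {}
    current = None
    with open(fasta) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record id is the first whitespace-delimited token.
                current = line[1:].split()[0]
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    with open(out, "w") as handle:
        handle.write("contig\tlength\n")  # assumed column names
        for contig, length in lengths.items():
            handle.write(f"{contig}\t{length}\n")
    return out
```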

autometa.common.coverage.normalize(df)¶

autometa.common.exceptions module¶

File containing customized AutometaErrors for more specific exception handling

exception autometa.common.exceptions.AutometaError¶

Bases: Exception

Base class for Autometa Errors.

exception autometa.common.exceptions.BinningError¶

Bases: autometa.common.exceptions.AutometaError

BinningError exception class.

Exception called when issues arise during or after the binning process.

This is usually a result of no clusters being recovered.

exception autometa.common.exceptions.ChecksumMismatchError¶

Bases: autometa.common.exceptions.AutometaError

ChecksumMismatchError exception class

Exception called when checksums do not match.

exception autometa.common.exceptions.DatabaseOutOfSyncError(value)¶

Bases: autometa.common.exceptions.AutometaError

Raised when the NCBI database files nodes.dmp, names.dmp and merged.dmp are out of sync with each other.

__str__()¶

Overrides __str__ to return the text message supplied when the error was raised, rather than the default message of the base exception.

Returns:	Message written alongside raising the exception
Return type:	str
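
The `__str__` override can be illustrated with stand-in classes. The two classes below only mimic the documented names so the sketch runs without Autometa installed; the attribute name `value` follows the constructor signature above, and the message text is made up.

```python
class AutometaError(Exception):
    """Stand-in for autometa.common.exceptions.AutometaError."""


class DatabaseOutOfSyncError(AutometaError):
    """Stand-in that keeps the message and returns it from __str__."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value

    def __str__(self):
        # Return exactly the text supplied when the error was raised.
        return str(self.value)


try:
    raise DatabaseOutOfSyncError("nodes.dmp and merged.dmp disagree")
except AutometaError as err:
    message = str(err)
```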

exception autometa.common.exceptions.ExternalToolError(cmd, err)¶

Bases: autometa.common.exceptions.AutometaError

Raised when samtools sort is not executed properly.

Parameters:	AutometaError (class) – Base class for other exceptions

exception autometa.common.exceptions.TableFormatError¶

Bases: autometa.common.exceptions.AutometaError

TableFormatError exception class.

Exception called when Table format is incorrect.

This is usually a result of a table missing the ‘contig’ column as this is often used as the index.
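
A check of the kind described above might look like the following. It is a sketch only: the classes are stand-ins for the documented exceptions, and `read_contig_table` is a hypothetical helper, not part of Autometa.

```python
import csv
import io


class AutometaError(Exception):
    """Stand-in for the documented base class."""


class TableFormatError(AutometaError):
    """Stand-in for the documented TableFormatError."""


def read_contig_table(handle):
    """Read a tab-delimited table keyed by its 'contig' column."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "contig" not in (reader.fieldnames or []):
        raise TableFormatError("table is missing the 'contig' column")
    return {row["contig"]: row for row in reader}


table = read_contig_table(io.StringIO("contig\tcoverage\nc1\t2.5\n"))
```

Because every custom error derives from AutometaError, callers can catch the base class to handle any of them in one place.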

autometa.common.kmers module¶

autometa.common.markers module¶

Autometa Marker class consisting of various methods to annotate sequences with marker sets depending on sequence set taxonomy

autometa.common.markers.get(kingdom: str, orfs: str, dbdir: str, scans: str = None, out: str = None, force: bool = False, format: str = 'wide', cpus: int = 2, parallel: bool = True, gnu_parallel: bool = False, seed: int = 42) → pandas.core.frame.DataFrame¶

Retrieve contigs’ markers from markers database that pass cutoffs filter.

Parameters:
kingdom (str) – kingdom to annotate markers, choices = [‘bacteria’, ‘archaea’]
orfs (str) – Path to amino-acid ORFs file
dbdir (str) – Directory containing the hmmpressed marker genes database files.
scans (str, optional) – Path to existing hmmscan table to filter by cutoffs
out (str, optional) – Path to write annotated markers table.
force (bool, optional) – Whether to overwrite existing out file path, by default False.
format (str, optional) – One of:
wide - returns wide dataframe of contig PFAM counts (default)
long - returns long dataframe of contig PFAM counts
list - returns list of pfams for each contig
counts - returns count of pfams for each contig
cpus (int, optional) – Number of cores to use if running in parallel, by default 2 (per the signature above).
parallel (bool, optional) – Whether to run hmmscan using its parallel option, by default True.
gnu_parallel (bool, optional) – Whether to run hmmscan using gnu parallel, by default False.
seed (int, optional) – Seed to pass into hmmscan for determinism, by default 42.
Returns:
wide - pd.DataFrame(index_col=contig, columns=[PFAM,…])
long - pd.DataFrame(index_col=contig, columns=[‘sacc’,’count’])
list - {contig:[pfam,pfam,…],contig:[…],…}
counts - {contig:count, contig:count,…}
Return type:	pd.DataFrame or dict
Raises:	`ValueError` – Why the exception is raised.

autometa.common.markers.load(fpath, format='wide')¶

Read markers table into specified format.

Parameters:	fpath (str) – </path/to/kingdom.markers.tsv> format (str, optional) – wide - index=contig, cols=[domain sacc,..] (default) long - index=contig, cols=[‘sacc’,’count’] list - {contig:[sacc,…],…} counts - {contig:len([sacc,…]), …}
Returns:	wide - index=contig, cols=[domain sacc,..] (default) long - index=contig, cols=[‘sacc’,’count’] list - {contig:[sacc,…],…} counts - {contig:len([sacc,…]), …}
Return type:	pd.DataFrame or dict
Raises:	`FileNotFoundError` – Provided fpath does not exist `ValueError` – Provided format is not in choices: choices = wide, long, list or counts
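
The relationship between the formats can be sketched in plain Python. The input layout (one `(contig, sacc)` pair per marker hit), the placeholder accessions `PF_A` and `PF_B`, and the helper names are all assumptions; nested dicts stand in for the DataFrames that the wide and long formats return.

```python
from collections import Counter, defaultdict

# Hypothetical marker hits: one (contig, sacc) pair per hit.
hits = [
    ("contig_1", "PF_A"),
    ("contig_1", "PF_B"),
    ("contig_1", "PF_A"),
    ("contig_2", "PF_B"),
]


def to_list(rows):
    """list format: {contig: [sacc, ...], ...}"""
    out = defaultdict(list)
    for contig, sacc in rows:
        out[contig].append(sacc)
    return dict(out)


def to_counts(rows):
    """counts format: {contig: len([sacc, ...]), ...}"""
    return {contig: len(saccs) for contig, saccs in to_list(rows).items()}


def to_wide(rows):
    """wide format stand-in: {contig: {sacc: count, ...}, ...}"""
    wide = defaultdict(Counter)
    for contig, sacc in rows:
        wide[contig][sacc] += 1
    return {contig: dict(counter) for contig, counter in wide.items()}
```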

autometa.common.markers.main()¶

autometa.common.metagenome module¶

Script containing Metagenome class for general handling of metagenome assembly

class autometa.common.metagenome.Metagenome(assembly)¶

Bases: object

Autometa Metagenome Class.

Parameters:	assembly (str) – </path/to/metagenome/assembly.fasta>

sequences¶

[seq,…]

Type:	list

seqrecords¶

[SeqRecord,…]

Type:	list

nseqs¶

Number of sequences in assembly.

Type:	int

length_weighted_gc¶

Length weighted average GC% of assembly.

Type:	float

size¶

Total assembly size in bp.

Type:	int

largest_seq¶

id of longest sequence in assembly

Type:	str

* self.fragmentation_metric()

* self.describe()

* self.length_filter()

describe() → pandas.core.frame.DataFrame¶

Return dataframe of details.

# assembly : Assembly input into Metagenome(…) [index column]
# nseqs : Number of sequences in assembly
# size : Size or total sum of all sequence lengths
# N50 :
# N10 :
# N90 :
# length_weighted_gc_content : Length weighted average GC content
# largest_seq : Largest sequence in assembly

Returns:
Return type:	pd.DataFrame

fragmentation_metric(quality_measure: float = 0.5) → int¶

Describes the quality of assembled genomes that are fragmented in contigs of different length.

Note

For more information see this metagenomics wiki from Matthias Scholz

Parameters:	quality_measure (0 < float < 1) – Fraction of the total assembly length that must be covered (the default is .50). I.e. the default measure is N50, but .1 could be used for N10 or .9 for N90.
Returns:	Minimum contig length to cover quality_measure of genome (i.e. length weighted median)
Return type:	int
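
The metric can be sketched from a list of contig lengths. This is a generic N50-style calculation written under the definition given above, not Autometa's code; the function name `nx_length` is hypothetical.

```python
def nx_length(lengths, quality_measure=0.5):
    """Return the contig length at which the running total of lengths,
    taken from longest to shortest, first reaches quality_measure of
    the assembly size."""
    if not 0 < quality_measure < 1:
        raise ValueError("quality_measure must be between 0 and 1")
    target = sum(lengths) * quality_measure
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    raise ValueError("lengths must not be empty")


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total size is 54
n50 = nx_length(lengths)                # 10 + 9 + 8 = 27 reaches half
```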

gc_content() → pandas.core.frame.DataFrame¶

Retrieves GC content from sequences in assembly

Returns:	index=”contig”, columns=[“gc_content”,”length”]
Return type:	pd.DataFrame

largest_seq

Retrieve the name of the largest sequence in the provided assembly.

Returns:	record ID of the largest sequence in assembly.
Return type:	str

length_filter(out: str, cutoff: int = 3000, force: bool = False)¶

Filters sequences by length with provided cutoff.

Note

A WARNING will be emitted and the original metagenome will be returned if no contigs pass the length filter cutoff.

Parameters:	out (str) – Path to write length filtered output fasta file. cutoff (int, optional) – Sequences with lengths greater than or equal to cutoff are retained (the default is 3000). force (bool, optional) – Overwrite existing out file (the default is False).
Returns:	autometa Metagenome object with only assembly sequences above the cutoff threshold.
Return type:	Metagenome
Raises:	`TypeError` – cutoff value must be a float or integer `ValueError` – cutoff value must be a positive real number `FileExistsError` – filepath consisting of sequences that passed filter already exists
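
The filtering rule and its argument checks can be sketched on an in-memory mapping. The helper `filter_by_length` is hypothetical and works on a `{name: sequence}` dict rather than writing a fasta file or returning a Metagenome object.

```python
def filter_by_length(records, cutoff=3000):
    """Keep sequences whose length is greater than or equal to cutoff."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(cutoff, bool) or not isinstance(cutoff, (int, float)):
        raise TypeError("cutoff value must be a float or integer")
    if cutoff <= 0:
        raise ValueError("cutoff value must be a positive real number")
    return {name: seq for name, seq in records.items() if len(seq) >= cutoff}


records = {"long_contig": "A" * 10, "short_contig": "A" * 3}
kept = filter_by_length(records, cutoff=4)
```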

length_weighted_gc

Retrieve the length weighted average GC percentage of provided assembly.

Returns:	GC percentage weighted by contig length.
Return type:	float
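
A length-weighted GC percentage can be computed as sketched below. The helper names are hypothetical and plain strings stand in for the assembly's sequences; the point is that each contig's GC% is weighted by its length, so long contigs dominate the average.

```python
def gc_percent(seq):
    """Return the GC percentage of one sequence (0.0 if it is empty)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def length_weighted_gc(sequences):
    """Average the per-contig GC% using contig lengths as weights."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        raise ValueError("sequences must contain at least one base")
    return sum(gc_percent(seq) * len(seq) for seq in sequences) / total


# One short GC-rich contig and one longer AT-only contig.
sequences = ["GGCC", "ATATATAT"]
weighted = length_weighted_gc(sequences)
```

An unweighted mean of the two contigs would be 50%, whereas the weighted value is pulled toward the longer contig.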

nseqs

Retrieve the number of sequences in provided assembly.

Returns:	Number of sequences parsed from assembly
Return type:	int

seqrecords

Retrieve SeqRecord objects from provided assembly.

Returns:	[SeqRecord, SeqRecord, …]
Return type:	list

sequences

Retrieve the sequences from provided assembly.

Returns:	[seq, seq, …]
Return type:	list

size

Retrieve the summation of sizes for each contig in the provided assembly.

Returns:	Total summation of contig sizes in assembly
Return type:	int

autometa.common.metagenome.main()¶

autometa.common.utilities module¶

File containing common utilities functions to be used by Autometa scripts.

autometa.common.utilities.calc_checksum(fpath: str) → str¶

Retrieve md5 checksum from provided fpath.

See: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file

Parameters:	fpath (str) – </path/to/file>
Returns:	space-delimited hexdigest of fpath using md5sum and basename of fpath, e.g. ‘hash filename\n’
Return type:	str
Raises:	`FileNotFoundError` – Provided fpath does not exist `TypeError` – fpath is not a string
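
A checksum line of the documented shape can be produced with the standard library. This is a sketch, not the library's code: the function name `md5_checksum_line` and the `block_size` parameter are assumptions, and the output format follows the ‘hash filename’ description above.

```python
import hashlib
import os


def md5_checksum_line(fpath, block_size=65536):
    """Return '<md5 hexdigest> <basename>' followed by a newline."""
    if not isinstance(fpath, str):
        raise TypeError(f"fpath must be a string, got {type(fpath).__name__}")
    if not os.path.isfile(fpath):
        raise FileNotFoundError(fpath)
    digest = hashlib.md5()
    with open(fpath, "rb") as handle:
        # Read in blocks so large files are never held fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return f"{digest.hexdigest()} {os.path.basename(fpath)}\n"
```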

autometa.common.utilities.file_length(fpath: str, approximate: bool = False) → int¶

Retrieve the number of lines in fpath

See: https://stackoverflow.com/q/845058/13118765

Parameters:	fpath (str) – Path to the file whose lines will be counted. approximate (bool) – If True, will approximate the length of the file from the file size.
Returns:	Number of lines in fpath
Return type:	int
Raises:	`FileNotFoundError` – provided fpath does not exist
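
An exact line count can be sketched as follows. The helper name `count_lines` is hypothetical, and the size-based approximation mentioned for `approximate=True` is not shown because the source does not describe how it is derived.

```python
import os


def count_lines(fpath):
    """Return the number of lines in fpath by iterating over the file."""
    if not os.path.exists(fpath):
        raise FileNotFoundError(fpath)
    # Binary mode avoids decoding errors on files with unknown encodings.
    with open(fpath, "rb") as handle:
        return sum(1 for _ in handle)
```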

autometa.common.utilities.gunzip(infpath: str, outfpath: str, delete_original: bool = False, block_size: int = 65536) → str¶

Decompress gzipped infpath to outfpath and write checksum of outfpath upon successful decompression.

Parameters:	infpath (str) – </path/to/file.gz> outfpath (str) – </path/to/file> delete_original (bool) – Will delete the original file after successfully decompressing infpath (Default is False). block_size (int) – Amount of infpath to read in to memory before writing to outfpath (Default is 65536 bytes).
Returns:	</path/to/file>
Return type:	str
Raises:	`FileExistsError` – outfpath already exists and is not empty
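
Block-wise decompression can be sketched with the standard `gzip` module. The function name `gunzip_blocks` is hypothetical; the sketch omits the checksum writing and the delete_original option described above.

```python
import gzip
import os


def gunzip_blocks(infpath, outfpath, block_size=65536):
    """Decompress infpath to outfpath, reading block_size bytes at a time."""
    if os.path.exists(outfpath) and os.path.getsize(outfpath) > 0:
        raise FileExistsError(outfpath)
    with gzip.open(infpath, "rb") as src, open(outfpath, "wb") as dst:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: src.read(block_size), b""):
            dst.write(block)
    return outfpath
```

Reading in fixed-size blocks keeps memory use bounded regardless of the size of the decompressed file.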

autometa.common.utilities.make_pickle(obj: Any, outfpath: str) → str¶

Serialize a python object (obj) to outfpath. Note: Opposite of unpickle()

Parameters:	obj (any) – Python object to serialize to outfpath. outfpath (str) – </path/to/pickled/file>.
Returns:	</path/to/pickled/file.pkl>
Return type:	str
Raises:	`ExceptionName` – Why the exception is raised.

autometa.common.utilities.read_checksum(fpath: str) → str¶

Read checksum from provided checksum formatted fpath.

Note: See write_checksum for how a checksum file is generated.

Parameters:	fpath (str) – </path/to/file.md5>
Returns:	checksum retrieved from fpath.
Return type:	str
Raises:	`TypeError` – Provided fpath was not a string. `FileNotFoundError` – Provided fpath does not exist.

autometa.common.utilities.tarchive_results(outfpath: str, src_dirpath: str) → str¶

Generate a tar archive of Autometa Results

See: https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python

Parameters:	outfpath (str) – </path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz> src_dirpath (str) – </path/to/directory/to/archive>
Returns:	</path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz>
Return type:	str
Raises:	`FileExistsError` – outfpath already exists
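
Creating a gzipped tar archive of a directory can be sketched with the standard `tarfile` module. The function name `make_tarball` is hypothetical, and storing the directory under its own basename inside the archive is a design assumption.

```python
import os
import tarfile


def make_tarball(outfpath, src_dirpath):
    """Write src_dirpath into a gzipped tar archive at outfpath."""
    if os.path.exists(outfpath):
        raise FileExistsError(outfpath)
    # normpath drops any trailing slash so basename is never empty.
    arcname = os.path.basename(os.path.normpath(src_dirpath))
    with tarfile.open(outfpath, "w:gz") as tar:
        tar.add(src_dirpath, arcname=arcname)
    return outfpath
```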

autometa.common.utilities.timeit(func: function) → function¶

Measure a function's run time (to be used as a decorator), i.e. apply it with Python’s decorator syntax when defining a function.

Example

@timeit
def your_function(args):
    ...

Notes

See: https://docs.python.org/2/library/functools.html#functools.wraps

Parameters:	func (function) – function to decorate with a timer
Returns:	timer decorated func.
Return type:	function
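
A timing decorator built on `functools.wraps` can be sketched as below. The name `timed` and the choice to print the elapsed time are assumptions; the source does not say how Autometa reports the measurement.

```python
import functools
import time


def timed(func):
    """Decorator that reports how long each call to func takes."""

    @functools.wraps(func)  # keeps func's __name__ and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} s")
        return result

    return wrapper


@timed
def add(a, b):
    """Add two numbers."""
    return a + b
```

Without `functools.wraps`, the decorated function would report the wrapper's name and docstring instead of its own.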

autometa.common.utilities.unpickle(fpath: str) → Any¶

Load a serialized fpath from make_pickle().

Parameters:	fpath (str) – </path/to/file.pkl>.
Returns:	Python object that was serialized to file via make_pickle()
Return type:	any
Raises:	`ExceptionName` – Why the exception is raised.
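
The round trip between the two pickling helpers can be sketched with the standard `pickle` module. The names `save_object` and `load_object` are hypothetical stand-ins for make_pickle() and unpickle().

```python
import pickle


def save_object(obj, outfpath):
    """Serialize obj to outfpath and return the path."""
    with open(outfpath, "wb") as handle:
        pickle.dump(obj, handle)
    return outfpath


def load_object(fpath):
    """Load the object previously written by save_object."""
    with open(fpath, "rb") as handle:
        return pickle.load(handle)
```

As with any pickle file, only load files from a trusted source, since unpickling can execute arbitrary code.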

autometa.common.utilities.untar(tarchive: str, outdir: str, member: str = None) → str¶

Decompress a tar archive (may be gzipped or bzipped). Passing in member requires that an outdir also be provided.

See: https://docs.python.org/3.8/library/tarfile.html#module-tarfile

Parameters:	tarchive (str) – </path/tarchive.tar.[compression]> outdir (str) – </path/to/output/directory> member (str, optional) – member file to extract.
Returns:	</path/to/extracted/member/file> if member else </path/to/output/directory>
Return type:	str
Raises:	`IsADirectoryError` – outdir already exists `ValueError` – tarchive is not a tar archive `KeyError` – member was not found in tarchive
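
Extracting a single member can be sketched with the standard `tarfile` module. The function name `extract_member` is hypothetical, and the sketch copies the member's bytes itself instead of calling an extract method, which is a design choice of this example rather than something the source states.

```python
import os
import shutil
import tarfile


def extract_member(tarchive, outdir, member):
    """Copy one regular file out of tarchive into outdir."""
    if not tarfile.is_tarfile(tarchive):
        raise ValueError(f"{tarchive} is not a tar archive")
    os.makedirs(outdir, exist_ok=True)
    outfpath = os.path.join(outdir, os.path.basename(member))
    # Mode "r" lets tarfile detect gzip or bzip2 compression itself.
    with tarfile.open(tarchive) as tar:
        info = tar.getmember(member)  # raises KeyError if not found
        src = tar.extractfile(info)
        if src is None:
            raise ValueError(f"{member} is not a regular file")
        with src, open(outfpath, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return outfpath
```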

autometa.common.utilities.write_checksum(infpath: str, outfpath: str) → str¶

Calculate checksum for infpath and write to outfpath.

Parameters:	infpath (str) – </path/to/input/file> outfpath (str) – </path/to/output/checksum/file>
Returns:	Description of returned object.
Return type:	NoneType
Raises:	`FileNotFoundError` – Provided infpath does not exist `TypeError` – infpath or outfpath is not a string
