autometa.common package

Submodules

autometa.common.coverage module

Calculates coverage of contigs

autometa.common.coverage.from_spades_names(records)

Retrieve coverages from SPAdes scaffolds headers.

Example SPAdes header: NODE_83_length_162517_cov_224.639

Parameters:records (list) – [SeqRecord,…]
Returns:index=contig, name=’coverage’, dtype=float
Return type:pd.Series
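
As a minimal sketch of the header parsing described above, the coverage can be pulled out of the `_cov_` field with a regular expression. The names `SPADES_COV_PATTERN` and `coverage_from_spades_name` are hypothetical and not part of Autometa. The sketch works on plain strings and returns a dict, whereas the documented function takes SeqRecord objects and returns a pd.Series.

```python
import re

# Matches the "_cov_<number>" field of a SPAdes contig name.
SPADES_COV_PATTERN = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def coverage_from_spades_name(name):
    """Return the k-mer coverage embedded in a SPAdes contig name."""
    match = SPADES_COV_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no coverage field found in {name!r}")
    return float(match.group(1))


names = ["NODE_83_length_162517_cov_224.639"]
coverages = {name: coverage_from_spades_name(name) for name in names}
print(coverages)
```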
autometa.common.coverage.get(fasta, out, from_spades=False, fwd_reads=None, rev_reads=None, se_reads=None, sam=None, bam=None, lengths=None, bed=None, cpus=1)

Get coverages for an assembly fasta file using the provided files. If the metagenome assembly was generated by SPAdes, the k-mer coverages recorded in each contig’s header can be used instead by specifying from_spades=True.

If from_spades=False, at least one of the following must be provided: fwd_reads and rev_reads and/or se_reads, sam, bam, or bed.

Notes

Will begin coverage calculation based on files provided checking in the following order:

  1. bed
  2. bam
  3. sam
  4. fwd_reads and rev_reads and/or se_reads

Event sequence to calculate contig coverages:

  1. align reads to generate alignment.sam
  2. sort samfile to generate alignment.bam
  3. calculate assembly coverages to generate alignment.bed
  4. calculate contig coverages to generate coverage.tsv
Parameters:
  • fasta (str) – </path/to/assembly.fasta>
  • out (str) – </path/to/output/coverages.tsv>
  • from_spades (bool, optional) – If True, will attempt to parse record ids for coverage information. This is only compatible with SPAdes assemblies (the default is False).
  • fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]
  • rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]
  • se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]
  • sam (str, optional) – </path/to/alignments.sam>
  • bam (str, optional) – </path/to/alignments.bam>
  • lengths (str, optional) – </path/to/lengths.tsv>
  • bed (str, optional) – </path/to/alignments.bed>
  • cpus (int, optional) – Number of cpus to use for coverage calculation.
Returns:

index=contig cols=[‘coverage’]

Return type:

pd.DataFrame
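
The lookup order in the Notes can be sketched as a small dispatch function. This is an illustration only, not Autometa's implementation: `choose_start` and `REMAINING_STEPS` are hypothetical names. The mapping of starting points to remaining steps is inferred from the event sequence listed above, where each step produces the input of the next.

```python
def choose_start(bed=None, bam=None, sam=None,
                 fwd_reads=None, rev_reads=None, se_reads=None):
    """Return the first usable input, checked in the documented order."""
    if bed:
        return "bed"
    if bam:
        return "bam"
    if sam:
        return "sam"
    if (fwd_reads and rev_reads) or se_reads:
        return "reads"
    raise ValueError("provide reads, sam, bam, or bed when from_spades=False")


# Steps still to run for each starting point (inferred, see lead-in).
REMAINING_STEPS = {
    "reads": ["align", "sort", "bed", "coverage"],
    "sam": ["sort", "bed", "coverage"],
    "bam": ["bed", "coverage"],
    "bed": ["coverage"],
}

start = choose_start(bam="alignment.bam", sam="alignment.sam")
print(start, REMAINING_STEPS[start])
```

When several files are supplied, the one furthest along the pipeline wins, so the earlier steps are skipped.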

autometa.common.coverage.main()
autometa.common.coverage.make_length_table(fasta, out)

Writes a tab-delimited length table to out given an input fasta.

Parameters:
  • fasta (str) – </path/to/assembly.fasta>
  • out (str) – </path/to/lengths.tsv>
Returns:

</path/to/lengths.tsv>

Return type:

str
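
A length table of this kind can be sketched with the standard library alone. The function name `write_length_table` and the header names `contig` and `length` are assumptions made for illustration; the actual column names written by Autometa are not stated here.

```python
import os
import tempfile


def write_length_table(fasta, out):
    """Write one 'contig<TAB>length' row per FASTA record and return out."""
    lengths = {}
    name = None
    with open(fasta) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the record id.
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    with open(out, "w") as handle:
        handle.write("contig\tlength\n")  # header names are an assumption
        for contig, length in lengths.items():
            handle.write(f"{contig}\t{length}\n")
    return out


with tempfile.TemporaryDirectory() as tmpdir:
    fasta = os.path.join(tmpdir, "assembly.fasta")
    with open(fasta, "w") as handle:
        handle.write(">contig_1\nACGT\nAC\n>contig_2\nGGG\n")
    table = write_length_table(fasta, os.path.join(tmpdir, "lengths.tsv"))
    with open(table) as handle:
        rows = handle.read().splitlines()
print(rows)
```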

autometa.common.coverage.normalize(df)

autometa.common.exceptions module

File containing customized AutometaErrors for more specific exception handling

exception autometa.common.exceptions.AutometaError

Bases: Exception

Base class for Autometa Errors.

exception autometa.common.exceptions.BinningError

Bases: autometa.common.exceptions.AutometaError

BinningError exception class.

Exception called when issues arise during or after the binning process.

This is usually a result of no clusters being recovered.

exception autometa.common.exceptions.ChecksumMismatchError

Bases: autometa.common.exceptions.AutometaError

ChecksumMismatchError exception class

Exception called when checksums do not match.

exception autometa.common.exceptions.DatabaseOutOfSyncError(value)

Bases: autometa.common.exceptions.AutometaError

Raised when the NCBI database files nodes.dmp, names.dmp and merged.dmp are out of sync with each other.

Parameters:AutometaError (class) – Base class for other exceptions

__str__()

Operator overloading to return the text message written while raising the error, rather than the message of __str__ by base exception.

Returns:Message written alongside raising the exception
Return type:str

exception autometa.common.exceptions.ExternalToolError(cmd, err)

Bases: autometa.common.exceptions.AutometaError

Raised when samtools sort is not executed properly.

Parameters:AutometaError (class) – Base class for other exceptions
exception autometa.common.exceptions.TableFormatError

Bases: autometa.common.exceptions.AutometaError

TableFormatError exception class.

Exception called when Table format is incorrect.

This is usually a result of a table missing the ‘contig’ column as this is often used as the index.
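
Because every exception above derives from AutometaError, callers can catch a specific error first and fall back to the base class. The sketch below defines local stand-in classes that mirror the documented hierarchy rather than importing the package. `check_table` and `handle` are hypothetical helpers.

```python
class AutometaError(Exception):
    """Stand-in for the documented base class."""


class BinningError(AutometaError):
    """Stand-in: raised when issues arise during or after binning."""


class TableFormatError(AutometaError):
    """Stand-in: raised when a table format is incorrect."""


def check_table(columns):
    # Hypothetical validation helper, not part of Autometa.
    if "contig" not in columns:
        raise TableFormatError("table is missing the 'contig' column")
    return True


def handle(columns):
    try:
        check_table(columns)
    except TableFormatError as err:
        # Most specific handler first.
        return f"format problem: {err}"
    except AutometaError as err:
        # Catches any other error derived from the base class.
        return f"other Autometa problem: {err}"
    return "ok"


print(handle(["contig", "coverage"]))
print(handle(["coverage"]))
```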

autometa.common.kmers module

autometa.common.markers module

Autometa Marker class consisting of various methods to annotate sequences with marker sets depending on sequence set taxonomy

autometa.common.markers.get(kingdom: str, orfs: str, dbdir: str, scans: str = None, out: str = None, force: bool = False, format: str = 'wide', cpus: int = 2, parallel: bool = True, gnu_parallel: bool = False, seed: int = 42) → pandas.core.frame.DataFrame

Retrieve contigs’ markers from markers database that pass cutoffs filter.

Parameters:
  • kingdom (str) – kingdom to annotate markers choices = [‘bacteria’, ‘archaea’]
  • orfs (str) – Path to amino-acid ORFs file
  • dbdir (str) – Directory containing the hmmpressed marker gene database files.
  • scans (str, optional) – Path to existing hmmscan table to filter by cutoffs
  • out (str, optional) – Path to write annotated markers table.
  • force (bool, optional) – Whether to overwrite existing out file path, by default False.
  • format (str, optional) –
    • wide - returns wide dataframe of contig PFAM counts (default)
    • long - returns long dataframe of contig PFAM counts
    • list - returns list of pfams for each contig
    • counts - returns count of pfams for each contig
  • cpus (int, optional) – Number of cores to use if running in parallel, by default 2.
  • parallel (bool, optional) – Whether to run hmmscan using its parallel option, by default True.
  • gnu_parallel (bool, optional) – Whether to run hmmscan using gnu parallel, by default False.
  • seed (int, optional) – Seed to pass into hmmscan for determinism, by default 42.
Returns:

  • wide - pd.DataFrame(index_col=contig, columns=[PFAM,…])
  • long - pd.DataFrame(index_col=contig, columns=[‘sacc’,’count’])
  • list - {contig:[pfam,pfam,…],contig:[…],…}
  • counts - {contig:count, contig:count,…}

Return type:

pd.DataFrame or dict

Raises:

ValueError – Why the exception is raised.

autometa.common.markers.load(fpath, format='wide')

Read markers table into specified format.

Parameters:
  • fpath (str) – </path/to/kingdom.markers.tsv>
  • format (str, optional) –
    • wide - index=contig, cols=[domain sacc,..] (default)
    • long - index=contig, cols=[‘sacc’,’count’]
    • list - {contig:[sacc,…],…}
    • counts - {contig:len([sacc,…]), …}
Returns:

  • wide - index=contig, cols=[domain sacc,..] (default)
  • long - index=contig, cols=[‘sacc’,’count’]
  • list - {contig:[sacc,…],…}
  • counts - {contig:len([sacc,…]), …}

Return type:

pd.DataFrame or dict

Raises:
  • FileNotFoundError – Provided fpath does not exist
  • ValueError – Provided format is not in choices: choices = wide, long, list or counts
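
To make the four formats concrete, the sketch below derives each shape from a small list of marker hits using plain Python containers in place of pandas objects. The hit data and all function names are hypothetical. Whether the real `list` format keeps repeated accessions is an assumption, chosen so that `counts` equals the length of each list as documented.

```python
from collections import Counter, defaultdict

# Hypothetical marker hits, one (contig, sacc) pair per annotated ORF.
hits = [
    ("contig_1", "PF00001.1"),
    ("contig_1", "PF00001.1"),
    ("contig_1", "PF00002.1"),
    ("contig_2", "PF00001.1"),
]


def as_list(hits):
    """{contig: [sacc, ...]}"""
    out = defaultdict(list)
    for contig, sacc in hits:
        out[contig].append(sacc)
    return dict(out)


def as_counts(hits):
    """{contig: len([sacc, ...])}"""
    return {contig: len(saccs) for contig, saccs in as_list(hits).items()}


def as_long(hits):
    """One (contig, sacc, count) row per contig and accession."""
    tally = Counter(hits)
    return [(contig, sacc, n) for (contig, sacc), n in sorted(tally.items())]


def as_wide(hits):
    """{contig: {sacc: count}} with a column for every accession seen."""
    saccs = sorted({sacc for _, sacc in hits})
    wide = {}
    for contig, per_contig in as_list(hits).items():
        tally = Counter(per_contig)
        wide[contig] = {sacc: tally.get(sacc, 0) for sacc in saccs}
    return wide


def convert(hits, format="wide"):
    choices = {"wide": as_wide, "long": as_long,
               "list": as_list, "counts": as_counts}
    if format not in choices:
        raise ValueError(f"{format} is not in choices: {', '.join(choices)}")
    return choices[format](hits)


for fmt in ("wide", "long", "list", "counts"):
    print(fmt, convert(hits, fmt))
```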
autometa.common.markers.main()

autometa.common.metagenome module

Script containing Metagenome class for general handling of metagenome assembly

class autometa.common.metagenome.Metagenome(assembly)

Bases: object

Autometa Metagenome Class.

Parameters:assembly (str) – </path/to/metagenome/assembly.fasta>
sequences

[seq,…]

Type:list
seqrecords

[SeqRecord,…]

Type:list
nseqs

Number of sequences in assembly.

Type:int
length_weighted_gc

Length weighted average GC% of assembly.

Type:float
size

Total assembly size in bp.

Type:int
largest_seq

id of longest sequence in assembly

Type:str
* self.fragmentation_metric()
* self.describe()
* self.length_filter()
describe() → pandas.core.frame.DataFrame

Return dataframe of details.

  • assembly : Assembly input into Metagenome(…) [index column]
  • nseqs : Number of sequences in assembly
  • size : Size or total sum of all sequence lengths
  • N50 :
  • N10 :
  • N90 :
  • length_weighted_gc_content : Length weighted average GC content
  • largest_seq : Largest sequence in assembly
Returns:
Return type:pd.DataFrame
fragmentation_metric(quality_measure: float = 0.5) → int

Describes the quality of assembled genomes that are fragmented in contigs of different length.

Note

For more information see this metagenomics wiki from Matthias Scholz

Parameters:quality_measure (0 < float < 1) – Fraction of the total assembly length that must be covered (the default is 0.50). The default corresponds to N50; use 0.1 for N10 or 0.9 for N90.
Returns:Minimum contig length to cover quality_measure of genome (i.e. length weighted median)
Return type:int
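
The metric can be sketched as follows, using the common N50-style definition: sort contig lengths from longest to shortest and return the length at which the running total first reaches the requested fraction of the assembly size. `nx_metric` is a hypothetical stand-alone function operating on a list of lengths, not the method itself.

```python
def nx_metric(lengths, quality_measure=0.5):
    """Return the contig length at which the running total of the
    longest-first sorted lengths reaches quality_measure of the total."""
    if not 0 < quality_measure < 1:
        raise ValueError("quality_measure must be between 0 and 1 (exclusive)")
    target = sum(lengths) * quality_measure
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    raise ValueError("lengths must contain at least one contig length")


lengths = [10, 8, 5, 3, 2]  # total size is 28
n50 = nx_metric(lengths)        # running totals 10, 18: 18 >= 14
n10 = nx_metric(lengths, 0.1)   # the longest contig alone covers 10%
n90 = nx_metric(lengths, 0.9)   # running totals 10, 18, 23, 26: 26 >= 25.2
print(n50, n10, n90)
```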
gc_content() → pandas.core.frame.DataFrame

Retrieves GC content from sequences in assembly

Returns:index=”contig”, columns=[“gc_content”,”length”]
Return type:pd.DataFrame
largest_seq

Retrieve the name of the largest sequence in the provided assembly.

Returns:record ID of the largest sequence in assembly.
Return type:str
length_filter(out: str, cutoff: int = 3000, force: bool = False)

Filters sequences by length with provided cutoff.

Note

A WARNING will be emitted and the original metagenome will be returned if no contigs pass the length filter cutoff.

Parameters:
  • out (str) – Path to write length filtered output fasta file.
  • cutoff (int, optional) – Minimum sequence length; sequences with length greater than or equal to cutoff are retained (the default is 3000).
  • force (bool, optional) – Overwrite existing out file (the default is False).
Returns:

Autometa Metagenome object containing only the assembly sequences at or above the cutoff threshold.

Return type:

Metagenome

Raises:
  • TypeError – cutoff value must be a float or integer
  • ValueError – cutoff value must be a positive real number
  • FileExistsError – filepath consisting of sequences that passed filter already exists
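
The filtering rule and its documented checks can be sketched on plain (name, sequence) pairs. `filter_by_length` is a hypothetical helper. It leaves out the file writing and the FileExistsError handling, and returns a list instead of a Metagenome object.

```python
import warnings


def filter_by_length(records, cutoff=3000):
    """Keep (name, sequence) pairs whose sequence length is >= cutoff."""
    if not isinstance(cutoff, (int, float)):
        raise TypeError("cutoff value must be a float or integer")
    if cutoff <= 0:
        raise ValueError("cutoff value must be a positive real number")
    kept = [(name, seq) for name, seq in records if len(seq) >= cutoff]
    if not kept:
        # Mirrors the documented note: warn and fall back to the input.
        warnings.warn("no contigs pass the length filter cutoff")
        return list(records)
    return kept


records = [("contig_1", "ACGTACGT"), ("contig_2", "AC")]
kept = filter_by_length(records, cutoff=8)
print(kept)
```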
length_weighted_gc

Retrieve the length weighted average GC percentage of provided assembly.

Returns:GC percentage weighted by contig length.
Return type:float
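
A length weighted average gives longer contigs proportionally more influence than a plain mean of per-contig GC values. The sketch below illustrates the arithmetic on raw strings. `gc_percent` and `weighted_gc` are hypothetical helpers and may differ in detail from how the property is computed.

```python
def gc_percent(seq):
    """GC percentage of a single sequence."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def weighted_gc(seqs):
    """Average GC percentage, weighting each sequence by its length."""
    total_length = sum(len(seq) for seq in seqs)
    return sum(gc_percent(seq) * len(seq) for seq in seqs) / total_length


seqs = ["GGCC", "ATATATAT"]  # 100% GC over 4 bp, 0% GC over 8 bp
plain_mean = sum(gc_percent(seq) for seq in seqs) / len(seqs)
weighted = weighted_gc(seqs)
print(plain_mean, weighted)
```

Here the plain mean is 50, while the weighted value is 400 / 12 because the longer, GC-free sequence counts twice as much.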
nseqs

Retrieve the number of sequences in provided assembly.

Returns:Number of sequences parsed from assembly
Return type:int
seqrecords

Retrieve SeqRecord objects from provided assembly.

Returns:[SeqRecord, SeqRecord, …]
Return type:list
sequences

Retrieve the sequences from provided assembly.

Returns:[seq, seq, …]
Return type:list
size

Retrieve the summation of sizes for each contig in the provided assembly.

Returns:Total summation of contig sizes in assembly
Return type:int
autometa.common.metagenome.main()

autometa.common.utilities module

File containing common utilities functions to be used by Autometa scripts.

autometa.common.utilities.calc_checksum(fpath: str) → str

Retrieve md5 checksum from provided fpath.

See: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file

Parameters:fpath (str) – </path/to/file>
Returns:space-delimited hexdigest of fpath using md5sum and basename of fpath, e.g. ‘hash filename’
Return type:str
Raises:
  • FileNotFoundError – Provided fpath does not exist
  • TypeError – fpath is not a string
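
One way to produce such a checksum string is to feed the file to hashlib in blocks. This is a sketch under assumptions: `md5_checksum` and its `block_size` parameter are hypothetical, and the exact whitespace of the returned string (for example, a trailing newline) may differ from Autometa's output.

```python
import hashlib
import os
import tempfile


def md5_checksum(fpath, block_size=65536):
    """Return '<md5 hexdigest> <basename>' for fpath."""
    if not isinstance(fpath, str):
        raise TypeError(f"{fpath!r} is not a string")
    if not os.path.exists(fpath):
        raise FileNotFoundError(fpath)
    digest = hashlib.md5()
    with open(fpath, "rb") as handle:
        # Read fixed-size blocks so large files are never fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return f"{digest.hexdigest()} {os.path.basename(fpath)}"


with tempfile.TemporaryDirectory() as tmpdir:
    fpath = os.path.join(tmpdir, "example.txt")
    with open(fpath, "wb") as handle:
        handle.write(b"ACGT\n")
    checksum = md5_checksum(fpath)
print(checksum)
```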
autometa.common.utilities.file_length(fpath: str, approximate: bool = False) → int

Retrieve the number of lines in fpath

See: https://stackoverflow.com/q/845058/13118765

Parameters:
  • fpath (str) – </path/to/file>
  • approximate (bool) – If True, will approximate the length of the file from the file size.
Returns:

Number of lines in fpath

Return type:

int

Raises:

FileNotFoundError – provided fpath does not exist

autometa.common.utilities.gunzip(infpath: str, outfpath: str, delete_original: bool = False, block_size: int = 65536) → str

Decompress gzipped infpath to outfpath and write checksum of outfpath upon successful decompression.

Parameters:
  • infpath (str) – </path/to/file.gz>
  • outfpath (str) – </path/to/file>
  • delete_original (bool) – Will delete the original file after successfully decompressing infpath (Default is False).
  • block_size (int) – Amount of infpath to read in to memory before writing to outfpath (Default is 65536 bytes).
Returns:

</path/to/file>

Return type:

str

Raises:

FileExistsError – outfpath already exists and is not empty
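
Block-wise decompression of this kind can be sketched with the standard gzip module. `gunzip_sketch` is a hypothetical stand-in that follows the documented parameters but omits the checksum writing step.

```python
import gzip
import os
import tempfile


def gunzip_sketch(infpath, outfpath, delete_original=False, block_size=65536):
    """Decompress infpath to outfpath, reading block_size bytes at a time."""
    if os.path.exists(outfpath) and os.path.getsize(outfpath) > 0:
        raise FileExistsError(outfpath)
    with gzip.open(infpath, "rb") as src, open(outfpath, "wb") as dst:
        for block in iter(lambda: src.read(block_size), b""):
            dst.write(block)
    if delete_original:
        os.remove(infpath)
    return outfpath


with tempfile.TemporaryDirectory() as tmpdir:
    gz_path = os.path.join(tmpdir, "contigs.fasta.gz")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b">contig_1\nACGT\n")
    out_path = gunzip_sketch(gz_path, os.path.join(tmpdir, "contigs.fasta"))
    with open(out_path, "rb") as handle:
        decompressed = handle.read()
print(decompressed)
```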

autometa.common.utilities.make_pickle(obj: Any, outfpath: str) → str

Serialize a python object (obj) to outfpath. Note: Opposite of unpickle()

Parameters:
  • obj (any) – Python object to serialize to outfpath.
  • outfpath (str) – </path/to/pickled/file>.
Returns:

</path/to/pickled/file.pkl>

Return type:

str

Raises:

ExceptionName – Why the exception is raised.

autometa.common.utilities.read_checksum(fpath: str) → str

Read checksum from provided checksum formatted fpath.

Note: See write_checksum for how a checksum file is generated.

Parameters:

fpath (str) – </path/to/file.md5>

Returns:

checksum retrieved from fpath.

Return type:

str

Raises:
  • TypeError – Provided fpath was not a string.
  • FileNotFoundError – Provided fpath does not exist.
autometa.common.utilities.tarchive_results(outfpath: str, src_dirpath: str) → str

Generate a tar archive of Autometa Results

See: https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python

Parameters:
  • outfpath (str) – </path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz>
  • src_dirpath (str) – </paths/to/directory/to/archive>
Returns:

</path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz>

Return type:

str

Raises:

FileExistsError – outfpath already exists
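
A gzipped tar archive of a results directory can be sketched with the standard tarfile module. `make_tarchive` is a hypothetical name, and storing the directory under its own basename inside the archive is an assumption about the desired layout.

```python
import os
import tarfile
import tempfile


def make_tarchive(outfpath, src_dirpath):
    """Write src_dirpath into a gzipped tar archive at outfpath."""
    if os.path.exists(outfpath):
        raise FileExistsError(outfpath)
    # normpath drops any trailing separator so basename is never empty.
    arcname = os.path.basename(os.path.normpath(src_dirpath))
    with tarfile.open(outfpath, "w:gz") as tar:
        tar.add(src_dirpath, arcname=arcname)
    return outfpath


with tempfile.TemporaryDirectory() as tmpdir:
    results = os.path.join(tmpdir, "results")
    os.mkdir(results)
    with open(os.path.join(results, "bins.tsv"), "w") as handle:
        handle.write("contig\tcluster\n")
    archive = make_tarchive(os.path.join(tmpdir, "results.tar.gz"), results)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
print(members)
```

The archive is written outside the source directory so that it does not end up containing itself.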

autometa.common.utilities.timeit(func: function) → function

Time a function’s run time (to be used as a decorator), i.e. apply it with Python’s decorator syntax when defining a function.

Example

@timeit
def your_function(args):
    ...

Notes

See: https://docs.python.org/2/library/functools.html#functools.wraps

Parameters:func (function) – function to decorate with a timer
Returns:timer decorated func.
Return type:function
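
A timing decorator of this kind is commonly built with functools.wraps, which the Notes link to, so that the decorated function keeps its name and docstring. The sketch below is one possible shape: `timeit_sketch` is a hypothetical name, and reporting with print is an assumption about how the elapsed time is surfaced.

```python
import functools
import time


def timeit_sketch(func):
    """Report how long func takes each time it is called."""
    @functools.wraps(func)
    def timer(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result
    return timer


@timeit_sketch
def add(a, b):
    """Add two numbers."""
    return a + b


total = add(1, 2)
```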
autometa.common.utilities.unpickle(fpath: str) → Any

Load a serialized fpath from make_pickle().

Parameters:fpath (str) – </path/to/file.pkl>.
Returns:Python object that was serialized to file via make_pickle()
Return type:any
Raises:ExceptionName – Why the exception is raised.
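
The make_pickle() and unpickle() pair amounts to a serialization round trip, which can be sketched with the standard pickle module. The function names below are hypothetical stand-ins. Note that loading a pickle can execute arbitrary code, so only files from a trusted source should be unpickled.

```python
import os
import pickle
import tempfile


def make_pickle_sketch(obj, outfpath):
    """Serialize obj to outfpath and return the path."""
    with open(outfpath, "wb") as handle:
        pickle.dump(obj, handle)
    return outfpath


def unpickle_sketch(fpath):
    """Load the object previously serialized to fpath."""
    with open(fpath, "rb") as handle:
        return pickle.load(handle)


original = {"contig_1": 224.639, "contig_2": 12.5}
with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = make_pickle_sketch(original, os.path.join(tmpdir, "cov.pkl"))
    restored = unpickle_sketch(pkl_path)
print(restored == original)
```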
autometa.common.utilities.untar(tarchive: str, outdir: str, member: str = None) → str

Decompress a tar archive (may be gzipped or bzipped). Passing in member requires that an outdir also be provided.

See: https://docs.python.org/3.8/library/tarfile.html#module-tarfile

Parameters:
  • tarchive (str) – </path/tarchive.tar.[compression]>
  • outdir (str) – </path/to/output/directory>
  • member (str, optional) – member file to extract.
Returns:

</path/to/extracted/member/file> if member else </path/to/output/directory>

Return type:

str

Raises:
  • IsADirectoryError – outdir already exists
  • ValueError – tarchive is not a tar archive
  • KeyError – member was not found in tarchive
autometa.common.utilities.write_checksum(infpath: str, outfpath: str) → str

Calculate checksum for infpath and write to outfpath.

Parameters:
  • infpath (str) – </path/to/input/file>
  • outfpath (str) – </path/to/output/checksum/file>
Returns:

Description of returned object.

Return type:

NoneType

Raises:
  • FileNotFoundError – Provided infpath does not exist
  • TypeError – infpath or outfpath is not a string

Module contents