squid.mave

Module Contents

Classes

InSilicoMAVE

Module for performing in silico MAVE.

Functions

random_shuffle(seq[, alphabet, num_shufs, rng])

Creates random shuffles in which each character is equally probable at every position.

dinuc_shuffle(seq[, num_shufs, rng])

Creates shuffles of the given sequence, in which dinucleotide frequencies are preserved.

Attributes

seq_length

class squid.mave.InSilicoMAVE(mut_generator, mut_predictor, seq_length, mut_window=None, context_agnostic=False, inter_window=None, save_window=None, alphabet=['A', 'C', 'G', 'T'])[source]

Module for performing in silico MAVE.

Parameters:
  • mut_generator (class) – Module for performing random mutagenesis.

  • mut_predictor (class) – Module for inferring model predictions.

  • seq_length (int) – Full length L of input sequence.

  • mut_window ([int, int]) – Index of start and stop position along sequence to probe; i.e., [start, stop], where start < stop and both entries satisfy 0 <= int <= L.

  • context_agnostic (bool) – Option for generating global neighborhoods, such that the sequence surrounding a conserved pattern of interest is randomly mutated across the in silico MAVE dataset.

  • inter_window ([int, int] or [[int, int], [int, int], ...]) – Index of start and stop position of each inter-site window, where each window defines the boundaries of the sequence in between two sites of interest (optional; used with ‘context_agnostic’).

  • save_window ([int <= mut_window[0], int >= mut_window[1]]) – Window used for delimiting sequences that are exported in the ‘x_mut’ array; if used, the ‘save_window’ interval must be equal to or larger than ‘mut_window’, and if larger, ‘save_window’ must contain the interval ‘mut_window’ entirely.

  • alphabet (list) – The alphabet used to determine the C characters in the logo, given as a list of strings; e.g., [‘A’, ‘C’, ‘G’, ‘T’] for DNA.
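The constraints on ‘mut_window’ and ‘save_window’ described above can be checked before constructing the class. The following is a minimal sketch only; check_windows is a hypothetical helper written for illustration and is not part of squid.mave.

```python
def check_windows(seq_length, mut_window, save_window=None):
    """Return True if the windows satisfy the documented constraints.

    Hypothetical helper for illustration; not part of squid.mave.
    """
    start, stop = mut_window
    # mut_window: start < stop, and both entries lie in [0, L]
    if not (0 <= start < stop <= seq_length):
        raise ValueError(f"invalid mut_window {mut_window} for L={seq_length}")
    if save_window is not None:
        save_start, save_stop = save_window
        # save_window must contain the mut_window interval entirely
        if not (0 <= save_start <= start and stop <= save_stop <= seq_length):
            raise ValueError(
                f"save_window {save_window} does not contain mut_window {mut_window}"
            )
    return True


# A 249-long sequence probed at [100, 150] and exported over [80, 170]
check_windows(249, [100, 150], save_window=[80, 170])
```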

generate(x, num_sim, seed=None, verbose=1)[source]

Randomly mutate segments in a set of one-hot DNA sequences.

Parameters:
  • x (torch.Tensor) – One-hot sequence (shape: (L,A)).

  • num_sim (int) – Number of sequences to mutagenize for in silico MAVE.

  • seed (int, optional) – Sets the random number seed.

Returns:

  • x_mut (numpy.ndarray) – Sequences simulated by mut_generator.

  • y_mut (numpy.ndarray) – Inferred predictions for sequences (shape: (N,1)).
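To make the shapes of ‘x_mut’ and ‘y_mut’ concrete, here is a self-contained NumPy sketch of the general idea: mutate positions inside a window, then score every mutant. It is an illustration under stated assumptions, not the implementation of generate. The names toy_generate, mut_rate and count_c are hypothetical, plain NumPy arrays stand in for the torch.Tensor input, and the toy predictor simply counts the character at index 1.

```python
import numpy as np


def toy_generate(x, num_sim, mut_window, predict, mut_rate=0.1, seed=None):
    """Illustrative stand-in for a mutate-then-predict loop (not squid's code)."""
    rng = np.random.default_rng(seed)
    seq_len, num_chars = x.shape
    start, stop = mut_window
    # Start from num_sim copies of the wild-type sequence: shape (N, L, A)
    x_mut = np.repeat(x[np.newaxis], num_sim, axis=0)
    # Pick positions to resample inside the window, and a new character for each
    mask = rng.random((num_sim, stop - start)) < mut_rate
    new_idx = rng.integers(0, num_chars, size=(num_sim, stop - start))
    eye = np.eye(num_chars, dtype=x.dtype)
    window = x_mut[:, start:stop, :]  # basic slice, so this is a view into x_mut
    # Resampled positions may draw the original character again
    window[mask] = eye[new_idx[mask]]
    # One scalar prediction per sequence: shape (N, 1)
    y_mut = np.asarray(predict(x_mut)).reshape(num_sim, 1)
    return x_mut, y_mut


x = np.eye(4)[[0, 1, 2, 3] * 5]                      # toy one-hot sequence, shape (20, 4)
count_c = lambda batch: batch[:, :, 1].sum(axis=1)   # toy "model": count of character 1
x_mut, y_mut = toy_generate(x, num_sim=8, mut_window=[5, 15], predict=count_c, seed=0)
print(x_mut.shape, y_mut.shape)  # (8, 20, 4) (8, 1)
```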

pad_seq(x_mut, x, start_position, stop_position, save_window)[source]

Function to pad mutated sequences on both sides with the original unmutated context.

Parameters:
  • x_mut (numpy.ndarray) – Sequences with randomly mutated segments with length l < L defined by l = stop_position - start_position (shape: (N,l,C)).

  • x (torch.Tensor) – Batch of one-hot sequences (shape: (L,A)).

  • start_position (int) – Index of start position along sequence to probe.

  • stop_position (int) – Index of stop position along sequence to probe.

Returns:

Sequences with randomly mutated segments, padded to correct shape with the original unmutated context (shape: (N,L,C)).

Return type:

numpy.ndarray
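The relationship between the delimited segment (length l = stop_position - start_position) and the padded output (length L) can be sketched with plain NumPy slicing and concatenation. This is an assumption-laden illustration, not the library code: pad_with_context is a hypothetical name, NumPy arrays stand in for the torch.Tensor input, and ‘save_window’ is ignored.

```python
import numpy as np


def pad_with_context(x_mut, x, start_position, stop_position):
    """Re-attach the unmutated flanks of x to each mutated segment (sketch)."""
    n = x_mut.shape[0]
    # Repeat the left and right flanks of the original sequence N times
    left = np.tile(x[:start_position], (n, 1, 1))    # (N, start, C)
    right = np.tile(x[stop_position:], (n, 1, 1))    # (N, L - stop, C)
    return np.concatenate([left, x_mut, right], axis=1)  # (N, L, C)


x = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3]]     # toy one-hot sequence, L = 8
segment = x[2:6]                            # delimited region, l = 6 - 2 = 4
x_mut = np.tile(segment[::-1], (3, 1, 1))   # stand-in "mutants": the reversed segment
padded = pad_with_context(x_mut, x, 2, 6)
print(padded.shape)  # (3, 8, 4)
```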

pad_seq_random(x_mut, x, start_position, stop_position, dinuc=False, inter=False)[source]

Function to pad mutated sequences on both sides with random DNA.

Parameters:
  • x_mut (numpy.ndarray) – Sequences with randomly mutated segments with length l < L defined by l = stop_position - start_position (shape: (N,l,C)).

  • x (torch.Tensor) – Batch of one-hot sequences (shape: (L,A)).

  • start_position (int) – Index of start position along sequence to probe.

  • stop_position (int) – Index of stop position along sequence to probe.

  • dinuc (bool) – Perform mutagenesis by random shuffle (False) or dinucleotide shuffle (True).

  • inter (bool) – Pad sequence to the left and right of ‘mut_window’ (False) or within ‘inter_window’ (True).

Returns:

Sequences with randomly mutated segments, padded to correct shape with random DNA (shape: (N,L,C)).

Return type:

numpy.ndarray
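For comparison with context padding, the sketch below fills both flanks with uniformly random one-hot characters, roughly corresponding to the dinuc=False, inter=False case. It is illustrative only: pad_with_random is a hypothetical name, it takes the full length directly rather than the original sequence ‘x’, and it uses a numpy.random.Generator as its random source.

```python
import numpy as np


def pad_with_random(x_mut, seq_length, start_position, stop_position, rng=None):
    """Pad mutated segments on both sides with random one-hot DNA (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n, _, num_chars = x_mut.shape
    eye = np.eye(num_chars, dtype=x_mut.dtype)
    # Draw a character index for every flank position, then one-hot encode it
    left = eye[rng.integers(0, num_chars, size=(n, start_position))]
    right = eye[rng.integers(0, num_chars, size=(n, seq_length - stop_position))]
    return np.concatenate([left, x_mut, right], axis=1)  # (N, L, C)


segments = np.eye(4)[np.zeros((5, 10), dtype=int)]   # five all-'A' segments, l = 10
padded_random = pad_with_random(segments, 30, 8, 18, rng=np.random.default_rng(0))
print(padded_random.shape)  # (5, 30, 4)
```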

delimit_range(x, start_position, stop_position)[source]

Function to delimit sequence to a specific region.

Parameters:
  • x (torch.Tensor) – Batch of one-hot sequences (shape: (L,A)).

  • start_position (int) – Index of start position along sequence to probe.

  • stop_position (int) – Index of stop position along sequence to probe.

Returns:

Delimited sequences with length l < L defined by l = stop_position - start_position (shape: (N,l,C)).

Return type:

numpy.ndarray

squid.mave.random_shuffle(seq, alphabet=['A', 'C', 'G', 'T'], num_shufs=None, rng=None)[source]

Creates random shuffles in which each character is equally probable at every position.

Parameters:
  • seq (ndarray) – one-hot encoding of sequence

  • num_shufs (int) – the number of shuffles to create; if unspecified, only one shuffle will be created

Returns:

ndarray of shuffled versions of ‘seq’ (shape=(N,L,D)), also one-hot encoded. If ‘num_shufs’ is not specified, the first dimension N is omitted (i.e., an L x D array is returned).

Return type:

ndarray
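The return-shape convention described above (an L x D array when ‘num_shufs’ is unspecified, otherwise N x L x D) can be illustrated with a short sketch. The function name equiprobable_one_hot is hypothetical, and the use of numpy.random.Generator is an assumption made for this example; this is not the library's implementation.

```python
import numpy as np


def equiprobable_one_hot(seq, num_shufs=None, rng=None):
    """Draw one-hot sequences with each character equally likely per position (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    length, depth = seq.shape
    n = 1 if num_shufs is None else num_shufs
    # Independent uniform draw of a character index for every position
    idx = rng.integers(0, depth, size=(n, length))
    out = np.eye(depth, dtype=seq.dtype)[idx]      # (N, L, D)
    # Drop the leading dimension when num_shufs was not given
    return out[0] if num_shufs is None else out


seq = np.eye(4)[[0, 1, 2, 3, 0]]
single = equiprobable_one_hot(seq, rng=np.random.default_rng(0))
batch = equiprobable_one_hot(seq, num_shufs=3, rng=np.random.default_rng(0))
print(single.shape, batch.shape)  # (5, 4) (3, 5, 4)
```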

squid.mave.dinuc_shuffle(seq, num_shufs=None, rng=None)[source]

Creates shuffles of the given sequence, in which dinucleotide frequencies are preserved.

Parameters:
  • seq (str or ndarray) – either a string of length L, or an L x D NumPy array of one-hot encodings

  • num_shufs (int) – the number of shuffles to create, N; if unspecified, only one shuffle will be created

  • rng (numpy.random.RandomState) – a NumPy RandomState object to use for performing shuffles

Returns:

  • list (if ‘seq’ is string) – List of N strings of length L, each one being a shuffled version of ‘seq’

  • ndarray (if ‘seq’ is ndarray) – ndarray of shuffled versions of ‘seq’ (shape=(N,L,D)), also one-hot encoded. If ‘num_shufs’ is not specified, the first dimension N is omitted (i.e., a single string or an L x D array is returned).
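What "dinucleotide frequencies are preserved" means can be shown on plain strings. The sketch below uses one well-known approach: for each character, shuffle the order of its successors while keeping its final successor in place, then re-walk the sequence. Whether dinuc_shuffle works this way internally is not stated on this page, so treat it as an assumption; dinuc_counts and dinuc_shuffle_str are hypothetical names, and only string input is handled.

```python
from collections import Counter

import numpy as np


def dinuc_counts(seq):
    """Count overlapping dinucleotides in a string."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def dinuc_shuffle_str(seq, rng=None):
    """Shuffle a string while preserving its dinucleotide counts (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    if len(seq) < 2:
        return seq
    # For each character, record the positions that follow its occurrences
    successors = {}
    for i in range(len(seq) - 1):
        successors.setdefault(seq[i], []).append(i + 1)
    # Shuffle every successor list except its last entry, which stays last
    for char in list(successors):
        head, last = successors[char][:-1], successors[char][-1:]
        order = rng.permutation(len(head))
        successors[char] = [head[j] for j in order] + last
    # Re-walk the sequence, taking each character's successors in the new order
    used = Counter()
    ind = 0
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        char = seq[ind]
        ind = successors[char][used[char]]
        used[char] += 1
        out.append(seq[ind])
    return "".join(out)


original = "ACGTTGCAACGGTTAACC"
shuffled = dinuc_shuffle_str(original, rng=np.random.default_rng(0))
print(dinuc_counts(shuffled) == dinuc_counts(original))  # True
```

A plain permutation of the characters would keep the single-nucleotide composition but generally not the dinucleotide counts, which is why the successor-based walk is used here.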

squid.mave.seq_length = 249[source]