`squid.utils`

Module Contents

`arr2pd`(x[, alphabet])	Function to convert a NumPy array to a pandas DataFrame with proper column headings.
`oh2seq`(one_hot[, alphabet])	Function to convert one-hot encoding to a sequence.
`seq2oh`(seq[, alphabet])	Function to convert a sequence to one-hot encoding.
`fix_gauge`(x, gauge[, wt, r])	Function to fix the gauge for an attribution matrix.

squid.utils.arr2pd(x, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert a NumPy array to a pandas DataFrame with proper column headings.

Parameters:

x (numpy.ndarray) – One-hot encoding or attribution map (shape : (L,C)).
alphabet (list) – Alphabet defining the C characters used as column headings, where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.

Returns:

x – Dataframe corresponding to the input array.

Return type:

pandas.DataFrame
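
As an illustration of the conversion described above, here is a minimal sketch. `arr_to_dataframe` is a hypothetical stand-in written for this example, not the library's implementation; it assumes the only work involved is labelling the C columns of an (L,C) array with the alphabet characters.

```python
import numpy as np
import pandas as pd


def arr_to_dataframe(x, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for arr2pd: label the C columns of an (L, C) array."""
    x = np.asarray(x)
    # The array must have one column per alphabet character.
    if x.ndim != 2 or x.shape[1] != len(alphabet):
        raise ValueError("expected an array of shape (L, C) with C == len(alphabet)")
    return pd.DataFrame(x, columns=list(alphabet))


# One-hot encoding of a length-3 DNA sequence (rows: positions, columns: A, C, G, T).
one_hot = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1]])
df = arr_to_dataframe(one_hot)
```

With the real function, the equivalent call would presumably be `squid.utils.arr2pd(one_hot)`, using the default DNA alphabet from the signature above.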

squid.utils.oh2seq(one_hot, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert one-hot encoding to a sequence.

Parameters:

one_hot (numpy.ndarray) – Input one-hot encoding of sequence (shape : (L,C)).
alphabet (list) – Alphabet defining the C characters, where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.

Returns:

seq – Input sequence with length L.

Return type:

str
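
The decoding step can be sketched as follows. `one_hot_to_sequence` is a hypothetical helper written for this example, not the library's code; it assumes each row of the (L,C) array has exactly one non-zero entry, so the column index of the row maximum identifies the character.

```python
import numpy as np


def one_hot_to_sequence(one_hot, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for oh2seq: decode an (L, C) one-hot array to a string."""
    one_hot = np.asarray(one_hot)
    # Column index of the single 1 in each row gives the character at that position.
    indices = one_hot.argmax(axis=1)
    return "".join(alphabet[i] for i in indices)


one_hot = np.array([[0, 1, 0, 0],   # C
                    [1, 0, 0, 0],   # A
                    [0, 0, 0, 1]])  # T
seq = one_hot_to_sequence(one_hot)  # "CAT"
```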

squid.utils.seq2oh(seq, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert a sequence to one-hot encoding.

Parameters:

seq (str) – Input sequence with length L.
alphabet (list) – Alphabet defining the C characters, where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.

Returns:

one_hot – One-hot encoding corresponding to input sequence (shape : (L,C)).

Return type:

numpy.ndarray
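
The encoding step can be sketched as follows. `sequence_to_one_hot` is a hypothetical helper written for this example, not the library's code; the `float32` dtype and the error raised for characters outside the alphabet are assumptions of the sketch.

```python
import numpy as np


def sequence_to_one_hot(seq, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for seq2oh: encode a string as an (L, C) one-hot array."""
    # Map each alphabet character to its column index.
    index = {char: i for i, char in enumerate(alphabet)}
    one_hot = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, char in enumerate(seq):
        # Raises KeyError if the sequence contains a character outside the alphabet.
        one_hot[pos, index[char]] = 1.0
    return one_hot


one_hot = sequence_to_one_hot("GATTACA")  # shape (7, 4)
```

Each row contains a single 1 in the column of the character at that position, so the result has shape (L,C) with L the sequence length and C the alphabet size.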

squid.utils.fix_gauge(x, gauge, wt=None, r=0.1)[source]

Function to fix the gauge for an attribution matrix.

Parameters:

x (numpy.ndarray) – Attribution scores for a sequence-of-interest (shape : (L,C)).
gauge (str) – Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.

'uniform' : hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).
'empirical' : uses an empirical distribution computed from the training data.
'consensus' : wild-type gauge using the training data consensus sequence.
'default' : default gauge (no change).

wt (numpy.ndarray) – Wild-type sequence (one-hot encoding) for 'wildtype' or 'empirical' gauge (shape : (L,C)).
r (float) – For the 'empirical' gauge, the probability of mutation used during generation of the in silico MAVE dataset (should match the user-defined 'mut_rate').

Returns:

OH – Gauge-fixed attribution scores corresponding to the input x (shape : (L,C)).

Return type:

numpy.ndarray
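
To make the gauge modes more concrete, here is a minimal sketch for a purely additive (L,C) matrix. It is not the library's implementation. `fix_gauge_sketch` is a hypothetical function, and it rests on these assumptions: every character is treated as observed at every position for 'uniform'; the 'empirical' distribution puts probability 1 - r on the wild-type character and spreads r evenly over the other C - 1 characters; and the reference sequence for 'consensus' is passed in as the one-hot array `wt`. In each mode the probability-weighted mean of every row is subtracted from that row.

```python
import numpy as np


def fix_gauge_sketch(x, gauge, wt=None, r=0.1):
    """Hypothetical sketch of gauge fixing for an additive (L, C) attribution matrix."""
    x = np.asarray(x, dtype=float)
    L, C = x.shape
    if gauge == "default":
        return x.copy()  # no change
    if gauge == "uniform":
        # Assumption: all C characters are observed at every position.
        p = np.full((L, C), 1.0 / C)
    elif gauge == "empirical":
        # Assumption: wild-type character kept with probability 1 - r,
        # remaining probability r shared equally among the other characters.
        wt = np.asarray(wt, dtype=float)
        p = wt * (1.0 - r) + (1.0 - wt) * r / (C - 1)
    elif gauge == "consensus":
        # Assumption: the reference sequence is supplied as a one-hot array.
        p = np.asarray(wt, dtype=float)
    else:
        raise ValueError(f"unknown gauge: {gauge!r}")
    # Subtract the p-weighted mean of each row so that it becomes zero.
    return x - (p * x).sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # toy attribution scores, L=5, C=4
wt = np.eye(4)[[0, 1, 2, 3, 0]]      # one-hot reference sequence "ACGTA"

x_uniform = fix_gauge_sketch(x, "uniform")
x_consensus = fix_gauge_sketch(x, "consensus", wt=wt)
x_empirical = fix_gauge_sketch(x, "empirical", wt=wt, r=0.1)
```

In this sketch, the 'uniform' result has rows that sum to zero, and the 'consensus' result is zero at every reference character, so the remaining entries read as effects relative to the reference sequence.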