API

class tmap.tda.mapper.Mapper(verbose=1)[source]

Implement the TDA mapper framework for microbiome analysis

filter(data, lens=None)[source]
Parameters:
  • data (numpy.ndarray/pandas.DataFrame) –
  • lens (list) – List of instances of classes inherited from tmap.tda.filter.Filter.

Input data may need to be imputed to remove np.inf or np.nan values, or an error will be raised in the fit step. It is recommended to scale the original data with MinMaxScaler and to check the completeness of the data.

Project/filter high-dimensional data points using the specified lens. If multiple filters are provided, their output arrays are simply concatenated along axis 1.

Finally, you will get an ndarray with shape (n_data, sum(n_components of all lenses)).
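The lens concatenation can be sketched with scikit-learn transformers standing in for tmap's lens classes (a hypothetical illustration, not tmap's own code):

```python
# Sketch of the filter step, assuming sklearn transformers stand in for
# tmap lenses: each lens projects the data, and the outputs of multiple
# lenses are concatenated along axis 1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

data = np.random.RandomState(0).rand(100, 20)

# Two hypothetical lenses: a 2-component PCA and a 1-component MDS.
lenses = [PCA(n_components=2), MDS(n_components=1, random_state=0)]
projected = np.concatenate([lens.fit_transform(data) for lens in lenses], axis=1)

# Shape is (n_data, sum of n_components over all lenses).
print(projected.shape)  # (100, 3)
```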

map(data, cover, clusterer=DBSCAN(min_samples=1))[source]

Map the point cloud with the projection data and return a TDA graph.

Parameters:
  • data (numpy.ndarray/pandas.DataFrame) – The number of rows of data must equal that of the data passed to Cover
  • cover (tmap.tda.cover.Cover) –
  • clusterer (sklearn.cluster) –
Returns:

A dictionary with the keys described below.

During the process, progress information is printed depending on verbose.

Basically, it iterates over all hypercubes generated by the cover and clusters the samples within each hypercube into several nodes using the provided clusterer. Unclassified samples are dropped, and only clustered samples are kept. Nodes are named by a running counter during iteration; custom node naming is currently not supported.
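The per-hypercube clustering step can be sketched with plain scikit-learn DBSCAN and hypothetical toy data (not tmap's own code):

```python
# Sketch of the per-hypercube clustering step: samples falling in one
# hypercube are clustered with DBSCAN, noise points (label -1) are
# dropped, and each remaining cluster becomes a node.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Two tight blobs plus one far-away outlier inside a single hypercube.
cube_samples = np.vstack([
    rng.normal(0.0, 0.05, size=(10, 2)),
    rng.normal(1.0, 0.05, size=(10, 2)),
    [[5.0, 5.0]],
])

labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(cube_samples)
nodes = {f"node_{k}": np.where(labels == k)[0].tolist()
         for k in set(labels) if k != -1}  # drop unclassified samples

print(len(nodes))  # 2 nodes; the outlier is dropped
```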

The resulting graph is a dictionary containing multiple keys and corresponding values. To better understand their meanings, each key is described below.

  1. nodes: Another dictionary storing the mapping between nodes and samples. Keys are node names; values are lists of the corresponding sample indices.
  2. edges: A list of 2-tuples indicating edges between nodes.
  3. adj_matrix: A square DataFrame indexed by node IDs. Its elements indicate whether pairs of vertices are adjacent in the graph. (Unweighted)
  4. sample_names: A list of sample names taken from the index of the provided data. If ‘index’ is not in dir(data), it is replaced with a range over the number of rows of data.
  5. node_keys: A list of ordered node IDs.
  6. node_positions: A dictionary with nodes as keys and node positions as values. Depending on the shape of cover.data, the position of a node is simply the average, in cover.data, over all samples within that node.
  7. node_sizes: A dictionary with nodes as keys and the number of samples within each node as values.
  8. params: A dictionary storing the parameters of the cover and the clusterer
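How the nodes, edges and adj_matrix entries fit together can be sketched with a hypothetical toy graph, assuming the usual Mapper rule that two nodes are connected whenever they share at least one sample:

```python
# Sketch of the relation between `nodes`, `edges` and `adj_matrix`,
# using hypothetical toy data (not produced by tmap).
import pandas as pd

nodes = {0: [0, 1, 2], 1: [2, 3], 2: [4, 5]}        # node_ID -> sample indices
node_keys = sorted(nodes)

# Two nodes are connected when they share at least one sample.
edges = [(a, b) for i, a in enumerate(node_keys) for b in node_keys[i + 1:]
         if set(nodes[a]) & set(nodes[b])]

# Unweighted square adjacency matrix indexed by node IDs.
adj_matrix = pd.DataFrame(0, index=node_keys, columns=node_keys)
for a, b in edges:
    adj_matrix.loc[a, b] = adj_matrix.loc[b, a] = 1

print(edges)  # [(0, 1)]
```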

In the future, a structured graph class will be implemented and returned as the result of Mapper.

verbose = None

If verbose is greater than 1, detailed information is printed.

class tmap.tda.cover.Cover(projected_data, resolution=10, overlap=0.5)[source]

Cover the projected data

Parameters:
  • projected_data (numpy.ndarray/pandas.DataFrame) – Normally, projected_data should be the data transformed by MDS, t-SNE or PCA. It determines how the original point cloud is partitioned.
  • resolution (integer) – The number of partitions along each axis of projected_data
  • overlap (float) – overlap must be greater than 0. It determines how far each partition is expanded. If overlap equals 0.5, each partition (including the first and the last) is expanded by 0.5 times its original width.
hypercubes

Generate hypercubes (covering) using a generator function

Returns:A mask for the projected_data. Each row is a list of booleans indicating which samples lie within the current cube. The number of rows of hypercubes equals the number of partitions.
Return type:numpy.ndarray
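The covering logic can be sketched in one dimension, assuming the resolution/overlap semantics described above (a hypothetical helper, not tmap's implementation):

```python
# Sketch of a 1-D covering with overlapping intervals: the axis is split
# into `resolution` partitions, each expanded by `overlap` times its
# original width; each row of the returned mask marks the samples inside
# one interval.
import numpy as np

def hypercube_mask_1d(projected, resolution=10, overlap=0.5):
    lo, hi = projected.min(), projected.max()
    width = (hi - lo) / resolution
    pad = width * overlap
    starts = lo + width * np.arange(resolution)
    mask = np.array([(projected >= s - pad) & (projected <= s + width + pad)
                     for s in starts])
    return mask  # shape: (resolution, n_samples)

x = np.linspace(0, 1, 50)
mask = hypercube_mask_1d(x, resolution=5, overlap=0.5)
print(mask.shape)                    # (5, 50)
print(mask.sum(axis=0).min() >= 1)   # every sample is covered: True
```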
class tmap.tda.metric.Metric(metric='euclidean')[source]

metric + data -> distance matrix

Define a distance metric and transform data points into a distance matrix.

Parameters:metric (str) –

metric specifies a distance metric. For example:

  • cosine
  • euclidean
  • hamming
  • minkowski
  • precomputed: for precomputed distance matrix.
fit_transform(data)[source]

Create and return a distance matrix based on the specified metric.

Parameters:data (np.ndarray/pd.DataFrame) – raw data or a precomputed distance matrix.
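A minimal sketch of this step, assuming scipy's pdist stands in for the internals (not tmap's own code):

```python
# Sketch of the metric -> distance-matrix step: a named metric produces
# a square distance matrix, while "precomputed" passes the input through
# unchanged.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def fit_transform(data, metric="euclidean"):
    if metric == "precomputed":
        return np.asarray(data)  # input already is a distance matrix
    return squareform(pdist(data, metric=metric))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
D = fit_transform(X, metric="euclidean")
print(D)  # symmetric 2x2 matrix with 5.0 off-diagonal
```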
class tmap.tda.plot.Color(target, dtype='numerical', target_by='sample')[source]

Map colors to target values for TDA network visualization

  • If target_by is set to “sample”, the original data (instead of SAFE scores) is used to colorize the nodes on the graph.
  • If target_by is set to “node”, SAFE scores are used to colorize the nodes on the graph, so target must be a dictionary generated by the SAFE_batch function. The single-feature function SAFE also creates a dict that can be used.

Basically, the code assigns red to the highest values and blue to the lowest values. Before colors are assigned, the target is split into 4 parts with np.percentile, and each part is scaled with its own upper and lower boundaries.

The color bar normally has 4 distinct parts, but parts can easily go missing when skewed data distorts the percentile values.
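The percentile-based rescaling described above can be sketched with a hypothetical helper (not tmap's code):

```python
# Sketch of percentile-based rescaling: values are split into four parts
# at the 25th, 50th and 75th percentiles, and each part is rescaled into
# its own quarter of [0, 1] before being fed to a blue-to-red colormap.
import numpy as np

def percentile_scale(target):
    target = np.asarray(target, dtype=float)
    cuts = np.percentile(target, [0, 25, 50, 75, 100])
    scaled = np.zeros_like(target)
    for i in range(4):
        lo, hi = cuts[i], cuts[i + 1]
        part = (target >= lo) & (target <= hi)
        span = (hi - lo) or 1.0                     # guard a degenerate part
        scaled[part] = (i + (target[part] - lo) / span) / 4.0
    return scaled  # 0 -> blue end, 1 -> red end of a colormap

vals = percentile_scale([1, 2, 3, 100])
print(vals.min(), vals.max())  # 0.0 1.0
```

Note how the skewed value 100 still lands at the top of its own quarter instead of compressing the other three parts, which is the point of the per-part scaling.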

Parameters:
  • target (list/np.ndarray/pd.Series/dict) – target values for samples or nodes
  • dtype (str) – type of target values, “numerical” or “categorical”
  • target_by (str) – target type of “sample” or “node”
get_colors(nodes, cmap=None)[source]
Parameters:
  • nodes (dict) – nodes from graph
  • cmap – not implemented yet. For now, it only accepts a manually assigned dict like {sample1: color1, sample2: color2, …}
Returns:

node colors keyed by node ID, and the color map of the target values

Return type:

tuple (the first element is a dict {node_ID: node_color}; the second is a tuple (node_ID_index, node_color))

get_sample_colors(cmap=None)[source]
Parameters:
  • cmap – not implemented yet. For now, it only accepts a manually assigned dict like {sample1: color1, sample2: color2, …}
Returns:

sample colors with keys, and the color map of the target values

Return type:

tuple (the first element is a dict {node_ID: node_color}; the second is a tuple (node_ID_index, node_color))

tmap.tda.plot.show(graph, color=None, fig_size=(10, 10), node_size=10, edge_width=2, mode='spring', notshow=False, **kwargs)[source]

Network visualization of TDA mapper

Using matplotlib as the basic engine, it is easy to add a title or other elements.

Parameters:
  • graph (tmap.tda.Graph.Graph) –
  • color (Color/str) – A tmap.tda.plot.Color instance or simply a color string.
  • fig_size (tuple) – height and width
  • node_size (int) – With a given node_size, all nodes are scaled as node_size/max(node_sizes) * node_size ** 2. The size of a node also depends on the biggest node, which contains the maximum number of samples.
  • edge_width (int) – Line width of edges.
  • mode (str/None) – Currently, spring layout is the only supported style.
  • strength (float) – Optimal distance between nodes. If None, the distance is set to 1/sqrt(n) where n is the number of nodes. Increase this value to move nodes farther apart.
Returns:
plt.figure

tmap.tda.plot.vis_progressX(graph, simple=False, mode='file', color=None, _color_SAFE=None, min_size=10, max_size=40, **kwargs)[source]

To dynamically visualize the tmap construction process, this function renders an interactive plotly graph with a slider that presents the process from ordination to graph step by step. Currently, there is no API for overriding the number of steps from ordination to graph; it may be implemented in the future.

If you want to draw a simple graph with edges and nodes instead of the process, try the simple param.

This visualization function is mainly based on plotly, an interactive Python graphing library. The mode param provides multiple return types for different purposes. Three modes are available: “file”, which returns an html file created by plotly; “obj”, which returns a reusable Python dict object; and “web”, which is normally used in a notebook and makes inline visualization possible.

The color handling of this function is a little complex because of the multiple sub-figures. Currently, it uses the tmap.tda.plot.Color class to auto-generate colors from a given array. More details on how colors are auto-generated can be found in the annotations of tmap.tda.plot.Color.

In this function, two kinds of color need to be handled.

  • First, the colors and displayed text values of sample points follow the given color param. color can be any array representing some measurement of nodes or samples; it does not have to be a SAFE score.
  • Second, the _color_SAFE param should be a Color built from a nodes-length array, which is normally a SAFE score.
Parameters:
  • graph (tmap.tda.Graph.Graph) –
  • mode (str) – [file|obj|web]
  • simple (bool) –
  • color
  • _color_SAFE
  • kwargs
Returns:

tmap.netx.SAFE.SAFE_batch(graph, metadata, n_iter=1000, nr_threshold=0.5, neighborhoods=None, shuffle_by='node', _mode='enrich', agg_mode='sum', verbose=1, name=None, **kwargs)[source]

Entry point of SAFE analysis. Maps sample metadata to node-associated values (using means) and performs batch SAFE analysis for multiple features.

For more information, you should see How tmap work

Parameters:
  • graph (tmap.tda.Graph.Graph) –
  • metadata (np.ndarray/pd.DataFrame) –
  • n_iter (int) – Number of permutations. For features with skewed values, it should be higher in order to stabilize the resulting SAFE scores.
  • nr_threshold (float) – Float in the range [0, 100]. The percentile threshold used to cut path distances
  • neighborhoods
  • shuffle_by
  • _mode
  • agg_mode
  • verbose
Returns:

return dict {feature: {node_ID: p-value (FDR-corrected)}}.
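The permutation idea behind the SAFE score can be illustrated with a toy sketch (hypothetical neighborhoods and values, not tmap's implementation):

```python
# Toy sketch of SAFE-style permutation testing: node values are shuffled
# n_iter times, the observed neighborhood sum is compared against the
# shuffled ones, and the empirical p-value reflects how enriched each
# neighborhood is.
import numpy as np

rng = np.random.RandomState(0)
neighborhoods = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}  # node -> neighbor nodes
values = np.array([5.0, 4.0, 0.1])                    # node-mapped feature

observed = {n: values[nbrs].sum() for n, nbrs in neighborhoods.items()}
n_iter = 1000
counts = {n: 0 for n in neighborhoods}
for _ in range(n_iter):
    shuffled = rng.permutation(values)                # shuffle by node
    for n, nbrs in neighborhoods.items():
        if shuffled[nbrs].sum() >= observed[n]:
            counts[n] += 1

p_values = {n: counts[n] / n_iter for n in neighborhoods}
print(p_values[0] < p_values[2])  # neighborhood 0 is more enriched: True
```

A larger n_iter tightens the resolution of these empirical p-values, which is why skewed features need more permutations.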

tmap.netx.coenrichment_analysis.pairwise_coenrichment(graph, safe_scores, n_iter=5000, p_value=0.05, _pre_cal_enriched=None, verbose=1)[source]

Pairwise calculation of co-enrichment for each feature found in safe_scores. If _pre_cal_enriched is given, n_iter and p_value are ignored. Otherwise, you should set n_iter and p_value to match the params you passed to the SAFE algorithm.

Parameters:
  • graph (tmap.tda.Graph.Graph) –
  • safe_scores (pd.DataFrame) – A SAFE score output from SAFE_batch, which must contain all values that occur at fea.
  • n_iter (int) – Permutation times used at SAFE_batch.
  • p_value (float) – The p-value to determine the enriched nodes.
  • _pre_cal_enriched (dict) – A pre-calculated enriched_centroid comprising all necessary features; providing it saves time.
  • verbose
Returns:

tmap.netx.coenrichment_analysis.coenrichment_for_nodes(graph, nodes, enriched_centroid, name, safe_scores=None, SAFE_pvalue=None, _filter=True, mode='both')[source]

Co-enrichment main function. Given a feature and its enriched nodes, a contingency table can be constructed when comparing against another feature and its enriched nodes. To statistically test the association between different features, a Fisher exact test is applied to each constructed contingency table. The Fisher exact test only considers the association between two classifications, not the ratio. For co-enrichment, an accessory function called is_enriched is also implemented to further judge whether the enrichment ratio is bigger than the non-enrichment ratio.

Besides the global co-enrichment, several local enrichments were observed. Using the networkx algorithm for finding connected components, the nodes of each local enrichment can be extracted from the global enrichment.

Because of the complex combinations when comparing local enrichments, two different contingency tables are constructed.

The contingency table comparing a feature’s enriched nodes in this component or in other components against the enriched/non-enriched nodes of another feature is shown below.

fea                     fea this comp enriched nodes   fea other comp enriched nodes
o_f_enriched_nodes      s1                             s2
o_f_non-enriched_nodes  s3                             s4

The other contingency table, comparing a feature’s enriched nodes within a specific component or its non-enriched nodes against the enriched/non-enriched nodes of another feature, is shown below.

fea                     fea this comp enriched nodes   fea non-enriched nodes
o_f_enriched_nodes      s1                             s2
o_f_non-enriched_nodes  s3                             s4
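The Fisher exact test applied to such a contingency table can be sketched with scipy, using hypothetical node counts s1..s4:

```python
# Sketch of the Fisher exact test on one contingency table; the counts
# s1..s4 are hypothetical numbers of nodes in each cell.
from scipy.stats import fisher_exact

s1, s2, s3, s4 = 8, 1, 2, 9
oddsratio, p_value = fisher_exact([[s1, s2], [s3, s4]])
print(oddsratio)  # 36.0, i.e. (s1*s4)/(s2*s3)

# is_enriched-style check (hypothetical): is the enrichment ratio in the
# first column bigger than in the second?
is_enriched = (s1 / (s1 + s3)) > (s2 / (s2 + s4))
print(is_enriched)  # True
```

Because the Fisher exact test only measures association, the extra ratio check distinguishes co-enrichment from co-depletion.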

For convenient calculation, three different modes [both|global|local] can be chosen.

The output contains two kinds of dictionaries. One dict records the correlation information between different features; the other records the raw contingency table info for each comparison, as lists of nodes representing s1, s2, s3 and s4.

If mode equals ‘both’, it outputs the global correlative dict, the local correlative dict and metainfo. If mode equals ‘global’, it outputs the global correlative dict and metainfo. If mode equals ‘local’, it outputs the local correlative dict and metainfo.

For the global correlative dict, keys are compared features and values are tuples (oddsratio, pvalue) from the Fisher exact test. For the local correlative dict, keys are tuples of (component index, component size, feature) and values are the same as in the global correlative dict.

The remaining metainfo is a dictionary sharing the same keys as the global/local correlative dicts but containing the contingency table info.
Parameters:
  • graph (tmap.tda.Graph.Graph) – tmap constructed graph
  • nodes (list) – a list of nodes you want to process from specific feature
  • name (str) – feature name, which doesn’t need to exist in enriched_centroid
  • enriched_centroid (dict) – the enriched_centroid output from get_significant_nodes
  • SAFE_pvalue (float) – None or a threshold for the SAFE score. If None, the nodes are not filtered.
  • safe_scores (pd.DataFrame) – A DataFrame storing SAFE scores for filtering the nodes. To filter the nodes, both safe_scores and SAFE_pvalue must be given.
  • mode (str) – [both|global|local]
Returns: