scmagnify.tools.FuncEnrich

class scmagnify.tools.FuncEnrich(gene_sets, geneset_col='geneset', genesymbol_col='genesymbol')

Performs Over-Representation Analysis (ORA) to identify enriched biological pathways or gene sets from a given list of genes.

Parameters:

gene_sets (str | DataFrame) – The source of gene sets. Can be:
- A pre-loaded long-format pandas DataFrame.
- A full path to a .gmt file.
- The name of a built-in gene set (e.g., 'msigdb_gobp'), which will be loaded from the package's default data directory.
geneset_col (str (default: 'geneset')) – If gene_sets is a DataFrame, this specifies the column containing the gene set names.
genesymbol_col (str (default: 'genesymbol')) – If gene_sets is a DataFrame, this specifies the column containing the gene symbols.
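
The long-format layout mentioned above can be sketched as one row per (gene set, gene symbol) pair. The set names and genes below are made up for illustration. The constructor call is left commented out: it follows the signature documented here but is not executed in this sketch.

```python
import pandas as pd

# Hypothetical gene sets, used only to illustrate the long-format layout.
example_sets = {
    "T_CELL_ACTIVATION": ["CD3E", "CD28", "LCK"],
    "KEGG_GLYCOLYSIS": ["HK1", "PFKM"],
}

# One row per (gene set, gene symbol) pair, using the default column names.
rows = [(name, gene) for name, genes in example_sets.items() for gene in genes]
gene_sets_df = pd.DataFrame(rows, columns=["geneset", "genesymbol"])
print(gene_sets_df)

# Per the signature above, such a table could then be passed to the class:
# from scmagnify.tools import FuncEnrich
# fe = FuncEnrich(gene_sets_df, geneset_col="geneset", genesymbol_col="genesymbol")
```

If the DataFrame uses other column names, geneset_col and genesymbol_col are the parameters that point the class at them.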

`add_genesets`(new_sets[, geneset_col, ...])	Adds new gene sets to the object from a dictionary or DataFrame.
`filter_genesets`(pattern[, case, regex, inplace])	Filters the gene sets based on a keyword or regular expression.
`get_overlap_genes`(terms[, sortby, n_top])	Retrieves the overlapping genes for specified enriched terms.
`run_ora`(gene_list[, n_background, top_n_results])	Performs Over-Representation Analysis (ORA) on a given list of genes.

FuncEnrich.add_genesets(new_sets, geneset_col='geneset', genesymbol_col='genesymbol', inplace=True)

Adds new gene sets to the object from a dictionary or DataFrame.

If any of the new gene set names already exist in the object, they will be overwritten by the new definitions.

Parameters:

new_sets (dict | DataFrame) – The new gene sets to add. Can be:
- A dictionary where keys are gene set names and values are lists of gene symbols (e.g., {'MY_SET': ['GENE1', 'GENE2']}).
- A long-format pandas DataFrame.
geneset_col (str (default: 'geneset')) – If new_sets is a DataFrame, this specifies the column with gene set names.
genesymbol_col (str (default: 'genesymbol')) – If new_sets is a DataFrame, this specifies the column with gene symbols.
inplace (bool (default: True)) – If True, modifies the current object directly. If False, returns a new FuncEnrich object with the added gene sets.

Return type:

FuncEnrich | None

Returns:

If inplace=False, a new FuncEnrich object containing the added gene sets. If inplace=True, None.
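
The overwrite rule described for add_genesets behaves like a dictionary merge in which the new definitions win. The sketch below uses plain dictionaries with made-up names to show that behaviour. It is an illustration of the documented semantics, not the library's internal code, and the method calls are shown only as comments.

```python
# Hypothetical existing and new gene sets (names and genes are illustrative).
existing = {"SET_A": ["GENE1", "GENE2"], "SET_B": ["GENE3"]}
new_sets = {"SET_B": ["GENE4", "GENE5"], "SET_C": ["GENE6"]}

# Later keys win in a dict merge, so SET_B takes its new definition,
# mirroring the documented "overwritten by the new definitions" rule.
merged = {**existing, **new_sets}
print(merged)

# With a real object, the calls would follow the signature above:
# fe.add_genesets(new_sets)                       # inplace=True, returns None
# fe2 = fe.add_genesets(new_sets, inplace=False)  # returns a new FuncEnrich
```

Note that with the default inplace=True the return value is None, so assigning the result to a variable only makes sense when inplace=False.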

FuncEnrich.filter_genesets(pattern, case=False, regex=True, inplace=True)

Filters the gene sets based on a keyword or regular expression.

This method allows you to narrow down the analysis to a subset of gene sets (e.g., only those related to ‘T_CELL’ or ‘KEGG_’).

Parameters:

pattern (str) – The keyword or regular expression pattern to search for in gene set names.
case (bool (default: False)) – If True, the pattern matching is case-sensitive.
regex (bool (default: True)) – If True, treats the pattern as a regular expression. If False, treats it as a literal string.
inplace (bool (default: True)) – If True, modifies the current object directly. If False, returns a new FuncEnrich object with the filtered gene sets.

Return type:

FuncEnrich | None

Returns:

If inplace=False, a new FuncEnrich object containing only the matching gene sets. If inplace=True, None.
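
How the case and regex flags interact can be sketched with Python's standard re module. The helper match_names is hypothetical and only imitates the documented flags. It assumes that a gene set is kept when the pattern matches anywhere in its name, which the text above does not state explicitly.

```python
import re

def match_names(names, pattern, case=False, regex=True):
    """Return the names that contain `pattern` (illustrative helper, not library code)."""
    flags = 0 if case else re.IGNORECASE
    # A literal search is a regex search on the escaped pattern.
    compiled = re.compile(pattern if regex else re.escape(pattern), flags)
    return [name for name in names if compiled.search(name)]

# Made-up gene set names for the demonstration.
names = ["KEGG_GLYCOLYSIS", "GOBP_T_CELL_ACTIVATION", "kegg_apoptosis", "HALLMARK_HYPOXIA"]

kegg_any_case = match_names(names, "^kegg_")                # defaults: case=False, regex=True
kegg_exact_case = match_names(names, "^KEGG_", case=True)   # case-sensitive regex
literal_caret = match_names(names, "^KEGG_", regex=False)   # "^" treated as a plain character
print(kegg_any_case, kegg_exact_case, literal_caret)
```

With regex=False the caret is searched for literally, so a pattern written as a regular expression can silently match nothing.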

FuncEnrich.get_overlap_genes(terms, sortby=None, n_top=5)

Retrieves the overlapping genes for specified enriched terms.

Parameters:

terms (list[str]) – A list of enriched term names for which to retrieve overlapping genes.
sortby (Optional[DataFrame] (default: None)) – An optional DataFrame with gene symbols as the index and a numeric column to sort the overlapping genes by (e.g., log fold change).
n_top (int | None (default: 5)) – If sortby is provided, the number of top-ranked genes to return for each term. If None or <= 0, all overlapping genes are returned without sorting.

Return type:

dict[str, list[str]]

Returns:

A dictionary where keys are term names and values are lists of overlapping gene symbols.
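
The idea behind sortby and n_top can be sketched with plain Python. The helper top_overlap is hypothetical, a dictionary of scores stands in for the sortby DataFrame, and ranking larger scores first is an assumption, since the sort direction is not stated above.

```python
def top_overlap(query_genes, set_genes, scores=None, n_top=5):
    """Illustrative helper: intersect two gene lists, optionally rank by score."""
    set_lookup = set(set_genes)
    overlap = [g for g in query_genes if g in set_lookup]  # keeps query order
    if scores is None or n_top is None or n_top <= 0:
        return overlap  # no ranking requested: return every overlapping gene
    # Assumption: larger scores rank first; unscored genes sort last.
    ranked = sorted(overlap, key=lambda g: scores.get(g, float("-inf")), reverse=True)
    return ranked[:n_top]

# Made-up query list, term membership, and log fold changes.
query = ["CD3E", "HK1", "LCK", "CD28", "ACTB"]
term_genes = ["CD28", "CD3E", "LCK", "ZAP70"]
lfc = {"CD3E": 1.2, "LCK": 2.5, "CD28": 0.4}

top_two = top_overlap(query, term_genes, scores=lfc, n_top=2)
all_genes = top_overlap(query, term_genes, scores=lfc, n_top=None)
print(top_two, all_genes)
```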

FuncEnrich.run_ora(gene_list, n_background=None, top_n_results=10)

Performs Over-Representation Analysis (ORA) on a given list of genes.

Parameters:

gene_list (list | Series | Index) – A list, Series, or Index of significant gene symbols to be tested for enrichment.
n_background (Optional[int] (default: None)) – The total number of genes in the background universe. If None, the background is defined as all unique genes present in the loaded gene_sets. It is highly recommended to provide the total number of genes detected in your experiment.
top_n_results (int (default: 10)) – The number of top enriched terms to display in a summary table after the run.

Return type:

DataFrame

Returns:

A DataFrame containing the ORA results, sorted by 'Combined score'.
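
To see why the choice of n_background matters, the sketch below computes a one-sided hypergeometric p-value, a common choice for ORA. The text above does not say which test run_ora uses or how 'Combined score' is computed, so the function ora_pvalue and all counts are illustrative assumptions rather than the library's method.

```python
from math import comb

def ora_pvalue(k, n_query, n_set, n_background):
    """P(overlap >= k) under a hypergeometric null (illustrative, not library code)."""
    upper = min(n_query, n_set)
    # Count the ways to draw at least k set members among n_query genes.
    tail = sum(
        comb(n_set, i) * comb(n_background - n_set, n_query - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_background, n_query)

# Hypothetical counts: 5 of 20 query genes fall in a 10-gene set.
p_small_universe = ora_pvalue(k=5, n_query=20, n_set=10, n_background=100)
p_large_universe = ora_pvalue(k=5, n_query=20, n_set=10, n_background=20000)
print(p_small_universe, p_large_universe)
```

The same overlap looks far more significant against the larger universe, so an unrealistic background size can inflate or deflate the results. This is consistent with the recommendation to pass the number of genes detected in the experiment.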