Use this panel to select gene sets of a collection to be analyzed:
--- Availability of collections is specific to species.
--- Filter gene sets by specifying minimum and maximum number of genes per set.
--- To save run time or use more statistical methods, select individual gene sets by highlighting rows.
--- Check the "Choose multiple collections" box to select gene sets from multiple collections.
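The size filter described above can be sketched as follows. This is only an illustration, not GSA Genie's own code: the function name `filter_gene_sets`, its parameters, and the toy collection are all hypothetical.

```python
def filter_gene_sets(gene_sets, min_size=1, max_size=None):
    """Keep gene sets whose number of unique genes lies in [min_size, max_size]."""
    kept = {}
    for name, genes in gene_sets.items():
        size = len(set(genes))  # count each gene once
        if size >= min_size and (max_size is None or size <= max_size):
            kept[name] = genes
    return kept


# Hypothetical collection: gene set name -> list of gene identifiers
collection = {
    "set_a": ["1", "2", "3"],
    "set_b": ["1"],
    "set_c": [str(i) for i in range(600)],
}
selected = filter_gene_sets(collection, min_size=2, max_size=500)
print(sorted(selected))
```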
Use this option to run GSA on a gene list selected from the background, or all known genes.
--- Only applicable to GSA methods testing over-representation of gene sets in a given gene list, such as Fisher's exact or chi-square test.
--- Prepare the gene list (one gene per line) using one of the 3 identifier types:
1. NCBI gene ID: numbers only; no letters or other symbols (ex. 2308).
2. Official gene symbol: case-sensitive; no synonyms (ex. FOXO1).
3. Ensembl gene ID: ENSG, ENSMUSG, etc. followed by numbers (ex. ENSG00000150907).
--- Copy/paste the gene list into the text box on the right and upload it. The number of recognized gene identifiers will be reported.
--- Uploading background genes is optional. The GSA background can be all known genes of a species or all genes in a gene set collection.
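As a rough illustration of the three identifier formats, the sketch below cleans a pasted gene list and guesses the type of each entry from its shape. The patterns and function names are assumptions made for this example. GSA Genie's actual recognition presumably checks identifiers against its gene annotation, which this sketch does not do. In particular, anything that is not numeric or Ensembl-shaped is only a candidate gene symbol.

```python
import re

NCBI_ID = re.compile(r"^\d+$")               # numbers only, e.g. 2308
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*G\d+$")  # ENSG..., ENSMUSG..., etc.


def guess_id_type(identifier):
    """Guess the identifier type from its shape alone (heuristic)."""
    token = identifier.strip()
    if NCBI_ID.match(token):
        return "ncbi"
    if ENSEMBL_ID.match(token):
        return "ensembl"
    return "symbol"  # candidate symbol; not validated against any database


def parse_gene_list(text):
    """Read one gene per line, dropping blank lines and duplicates."""
    genes = []
    for line in text.splitlines():
        token = line.strip()
        if token and token not in genes:
            genes.append(token)
    return genes


pasted = "2308\nFOXO1\nENSG00000150907\n\n2308\n"
genes = parse_gene_list(pasted)
types = [guess_id_type(g) for g in genes]
print(list(zip(genes, types)))
```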
Use this option to run GSA on gene-level statistics, such as p values or mean differences.
--- To test over-representation of gene sets, use the filter on the right to select top genes.
--- Prepare a table:
1. Each row has one gene, and the first column must contain unique gene identifiers of one of 3 types: NCBI gene ID, official gene symbol, or Ensembl gene ID.
2. The table has 2 or more columns, and its first row must contain unique column names.
3. At least one column must be a numeric vector of gene-level statistics.
--- Save the table in a file with one of these extensions: .txt, .tsv, .csv, .rds, .rdata, .rda, .xlsx, .xls, or .html, and upload it for GSA.
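The table requirements above can be checked before upload with a small script. The sketch below handles delimited text only (for example .csv or .tsv content). The function names and messages are hypothetical and do not reflect how GSA Genie itself validates uploads.

```python
import csv
import io


def is_number(text):
    """True if the cell can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_table(text, delimiter=","):
    """Return (problems, numeric_columns) for a delimited table given as text."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return ["table is empty"], []
    header, body = rows[0], rows[1:]
    problems = []
    if len(header) < 2:
        problems.append("table needs 2 or more columns")
    if len(set(header)) != len(header):
        problems.append("column names are not unique")
    ids = [r[0] for r in body]
    if len(set(ids)) != len(ids):
        problems.append("gene identifiers in the first column are not unique")
    # A column counts as numeric only if every row has a numeric value in it.
    numeric = [
        name
        for j, name in enumerate(header[1:], start=1)
        if body and all(len(r) > j and is_number(r[j]) for r in body)
    ]
    if not numeric:
        problems.append("no numeric column of gene-level statistics")
    return problems, numeric


good = "gene,p_value,direction\n2308,0.01,up\n7157,0.20,down\n"
problems, numeric = check_table(good)
print(problems, numeric)
```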
For each gene set, over-representation GSA tests whether n1/N1 >> n0/N0, where N1 is the size of the user-specified gene list, which is a subset of the background; n1 is the number of genes in both the user's list and the gene set; N0 is the number of background genes not in the user's list; and n0 is the number of genes in the gene set but not in the user's list.
--- The gene sets to be tested could include the whole table from Step 1, or the ones highlighted by the user.
--- There are 3 options of background genes: provided by the user from Step 2 when available; all known genes of the species; or all genes in the whole collection. The first option is usually preferred.
--- There are 3 options of statistical methods to test n1/N1 >> n0/N0:
1. Fisher's exact test: p value of each gene set is calculated based on hypergeometric distribution.
2. Chi-square test: p value of each gene set is estimated based on the asymptotic chi-squared distribution, using the R chisq.test function.
3. Proportion test: convert the difference of two proportions to a Z score and calculate the corresponding p value.
For each gene set, GSA of gene-level statistics tests V1 != V0; where V1 and V0 are vectors of the statistics of genes within the gene set and all the other genes.
--- The gene sets to be tested could include all rows in the table from Step 1, or the ones highlighted by the user.
--- Different GSA methods are applicable to different types of gene-level statistics. GSA Genie will make a best guess at the type of the gene-level statistic:
1. P-like statistics have values from 0 to 1. GSA Genie will replace 0 with the minimal non-zero p value of any gene for the analysis.
2. T-like statistics have values from -Inf to +Inf. Theoretically, they should have a bell-shaped distribution around 0.
3. F-like statistics have all positive values, usually with a skewed bell-shaped distribution.
4. If neither GSA Genie nor the user can decide whether the gene-level statistic is one of the above, only non-parametric methods, such as the Wilcoxon rank sum test, or methods based on gene sampling can be used.
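The sketch below shows one way such a guess could be made from the value range alone, together with the zero replacement described for P-like statistics. The rules in `guess_stat_type` are a simplified assumption for illustration. The tool's actual heuristic is not documented here, and this version does not examine the shape of the distribution.

```python
def guess_stat_type(values):
    """Guess the statistic type from its range only (simplified heuristic)."""
    lo, hi = min(values), max(values)
    if lo >= 0 and hi <= 1:
        return "P-like"
    if lo < 0:
        return "T-like"
    return "F-like"  # non-negative values that exceed 1


def replace_zero_p(p_values):
    """Replace p values of 0 with the smallest non-zero p value of any gene."""
    nonzero = [p for p in p_values if p > 0]
    if not nonzero:
        raise ValueError("all p values are zero")
    floor = min(nonzero)
    return [p if p > 0 else floor for p in p_values]


print(guess_stat_type([0.1, 0.9]))
print(guess_stat_type([-2.1, 0.3, 1.8]))
print(guess_stat_type([0.4, 3.2, 12.5]))
print(replace_zero_p([0, 0.5, 0.01]))
```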
--- The piano package allows users to add direction to non-directional statistics like p values, if they want to distinguish positive and negative mean differences in a 2-group comparison. The direction information is defined by an extra column of the uploaded table.
--- Users also have the option to convert the gene-level statistics by fitting them to a normal distribution, log-transforming them, or ranking them. If a P-like or F-like statistic is given a direction and then logged or fitted to normal, it will be treated as a T-like statistic. Ranking genes has the same effect on all types of statistics.
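Two of these conversions can be illustrated as follows. This sketch assumes the direction column holds signed numbers (positive or negative), uses -log10 as the log transform, and assigns average ranks to ties. These are illustrative choices, not a description of what GSA Genie or piano does internally. P values must be greater than 0 before taking the log.

```python
import math


def directed_log_p(p_values, directions):
    """Turn p values plus a direction into signed -log10(p), a T-like statistic."""
    signed = []
    for p, d in zip(p_values, directions):
        sign = 1.0 if d >= 0 else -1.0
        signed.append(sign * -math.log10(p))
    return signed


def rank_values(values):
    """Rank values from smallest (rank 1) to largest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


print(directed_log_p([0.01, 0.1], [1, -1]))
print(rank_values([0.3, 0.1, 0.3, 0.9]))
```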
--- Most of the GSA methods available are implemented by the R piano package. GSA Genie added two more methods: Student's t and Kolmogorov-Smirnov tests. Methods such as GSEA and Reporter features take too much time to be run online on a large number of gene sets, so users need to select a small number of gene sets to use these methods. Otherwise, users can download the whole collection of gene sets and test them on their own system.
1. GSEA is a popular but slow GSA method, applicable only to T-like statistics.
2. PAGE is a much faster alternative to GSEA.
3. The Wilcoxon rank sum test is a non-parametric method applicable to any type of statistics.
4. Methods applicable only to P-like statistics: Fisher's combined p performs meta-analysis on the p values; Stouffer's converts p values to z scores first and then runs meta-analysis; Reporter features is the same as Stouffer's with an adjustment for the background distribution, and is slower; and Tail strength weights p values according to their ranking.
5. Mean, Median, and Sum are applicable to all types of statistics and compare summary statistics via re-sampling genes.
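To make the two p value meta-analysis methods in item 4 concrete, the sketch below gives their textbook forms for the p values of the genes in one gene set. It is not piano's code, and piano may differ in details such as how significance is assessed. All p values must lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def fisher_combined(p_values):
    """Fisher's method: -2 * sum(log p) follows chi-square with 2k df."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    # Closed-form chi-square survival function for an even number of df (2k):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def stouffer(p_values):
    """Stouffer's method: convert p values to z scores, then combine them."""
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    z = sum(z_scores) / math.sqrt(len(z_scores))
    return 1.0 - normal.cdf(z)


gene_p = [0.01, 0.04, 0.20, 0.50]  # made-up p values of genes in one set
print(fisher_combined(gene_p))
print(stouffer(gene_p))
```

With a single p value, both functions return that p value unchanged, which is a quick sanity check.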