In IPA (Ingenuity Pathway Analysis), the P-value is obtained by a statistical test. It measures how surprising the observed data would be if only chance were at work: the smaller the P-value, the harder the data are to explain by random conditions alone.
The general flow
In IPA pathway analysis, the P-value is calculated through the following steps.
- Preprocessing the input data: For pathway analysis, raw data such as gene expression data or protein expression data is entered.
- Selecting a pathway: You select the biological pathway you want to analyze. For example, signaling pathways or metabolic pathways related to a specific disease may be chosen.
- Scoring the pathway: The scores of the genes and proteins included in the selected pathway are calculated. For this, differential expression analysis of gene expression data or analysis of changes in protein expression, for example, may be used.
- Permutation test: Based on the scores of the genes and proteins within the pathway, random datasets are generated. These random datasets keep the characteristics of the input data (for example, the number of genes and the overall set of score values) but assign the scores to genes and proteins at random.
- Calculating the P-value: The scores from the random datasets generated by the permutation test form a random (null) distribution. The position of the observed score within this distribution is then evaluated, and the P-value is calculated as the probability, under the random distribution, of obtaining a score at least as extreme as the observed one.
The smaller the P-value, the less likely it is that a score as extreme as the observed one would arise under random conditions. Generally, a P-value at or below a threshold chosen in advance, commonly 0.05 or the stricter 0.01, is considered statistically significant. In such cases the observed score is hard to explain by chance alone, which is taken as evidence of a biologically meaningful association.
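The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not IPA's actual implementation: the function name `pathway_permutation_p_value`, the dictionary input format, the gene names `G1` to `G20`, and the choice of the mean member score as the pathway score are all assumptions made for the example.

```python
import random


def pathway_permutation_p_value(gene_scores, pathway_genes,
                                n_permutations=10000, seed=0):
    """Empirical P-value of a pathway score against random gene sets.

    gene_scores   -- dict mapping gene name to score (hypothetical format)
    pathway_genes -- genes in the pathway; genes without a score are ignored
    """
    rng = random.Random(seed)
    members = [g for g in pathway_genes if g in gene_scores]
    if not members:
        raise ValueError("no pathway gene has a score")
    all_scores = list(gene_scores.values())

    # Assumption: the pathway score is the mean score of its member genes.
    observed = sum(gene_scores[g] for g in members) / len(members)

    # Each permutation draws a random gene set of the same size, so the set
    # size and the overall score values are preserved.
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(all_scores, len(members))
        if sum(sample) / len(sample) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Made-up data: 20 genes with scores 1..20, and a pathway that happens to
# contain the three highest-scoring genes.
scores = {f"G{i}": float(i) for i in range(1, 21)}
observed_score, p_value = pathway_permutation_p_value(
    scores, ["G18", "G19", "G20"])
print(observed_score, p_value)
```

Because the example pathway holds the three top-scoring genes, very few random gene sets reach the same mean score, so the resulting P-value is small.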
How is pathway scoring done?
In pathway scoring, the importance and contribution of genes are quantified and evaluated.
As a concrete example, suppose there are three genes (A, B, C) related to a certain pathway, and the expression level of each gene is given as follows.
- Expression level of gene A: 10
- Expression level of gene B: 5
- Expression level of gene C: 8
In this case, suppose each gene's score is derived from its expression level and expressed on, for example, a 10-point scale. Gene A has the highest expression level, so it is given 10 points. Gene B has the lowest expression level of the three, so it is given 5 points. Gene C has a fairly high expression level, so it is given 8 points.
These scores are then normalized. For example, suppose each score is divided by the highest score (10) so that all values fall between 0 and 1. In this case, gene A becomes 1.0, gene B becomes 0.5, and gene C becomes 0.8.
In this way, scores can be assigned to the genes within the pathway. This makes it possible to evaluate the importance of the genes and their role within the pathway.
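The normalization in this example can be reproduced with a small Python sketch. It assumes the scores are divided by the largest score, which is the reading that matches the numbers above; the function name `normalize_by_max` is made up for illustration. Min-max scaling would be a different choice and would map gene B to 0.0 instead.

```python
# Scores from the example above (10-point scale).
expression = {"A": 10, "B": 5, "C": 8}


def normalize_by_max(scores):
    """Divide every score by the largest score so values fall in [0, 1]."""
    top = max(scores.values())
    if top == 0:
        raise ValueError("cannot normalize when the largest score is 0")
    return {gene: value / top for gene, value in scores.items()}


normalized = normalize_by_max(expression)
print(normalized)  # {'A': 1.0, 'B': 0.5, 'C': 0.8}
```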
What is a permutation test?
In a permutation test, the data is randomly rearranged many times and the statistic of interest is recalculated each time. These rearrangements estimate what results would look like if there were no real association in the data, that is, if the data were in a random state.
When the expression of gene A and gene B is known from patient data, in order to evaluate whether these are involved in Pathway X, a permutation test can be performed using the following steps.
- Preprocessing the data: The expression data of gene A and gene B is extracted from the patient data.
- Scoring the genes: Using the expression data of gene A and gene B, scores are assigned to each gene. The method of calculating the scores may be set based on the expression level and importance of the gene.
- Preparing the permutation test: The scores of gene A and gene B are gathered into a single set of values that can be shuffled.
- Running the permutation: The score data of gene A and gene B is rearranged at random, and the statistic is recalculated on the rearranged data. This simulates the result that would be expected if any association between gene A and gene B were due to chance alone.
- Repeating the permutation: The permutation is repeated multiple times to generate random datasets. Usually it is repeated several thousand times or more.
- Evaluating the results: The random datasets obtained from the permutation test are compared with the original data. Specifically, the position of the scores of gene A and gene B within the random datasets is evaluated. This makes it possible to statistically evaluate whether the scores of gene A and gene B are involved in Pathway X.
Through the permutation test, it is possible to evaluate whether the scores of gene A and gene B have a statistically significant association with Pathway X. This makes it possible to statistically verify whether a specific gene is involved in a specific pathway.
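The shuffling and repetition steps above might look like the following Python sketch. Several things here are assumptions for the sake of a runnable example: each gene is taken to have one score per patient, the statistic is the absolute difference between the two mean scores, and the function name `permuted_differences` and the score values are invented.

```python
import random


def permuted_differences(scores_a, scores_b, n_permutations=5000, seed=0):
    """Return the observed statistic and the statistics from shuffled data.

    Statistic (an assumption for this sketch): absolute difference between
    the mean score of gene A and the mean score of gene B.
    """
    rng = random.Random(seed)

    def statistic(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(scores_a, scores_b)

    # Pool the scores, then repeatedly shuffle and split them back into
    # two groups of the original sizes.
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    null_stats = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        null_stats.append(statistic(pooled[:n_a], pooled[n_a:]))
    return observed, null_stats


# Made-up per-patient scores for the two genes.
gene_a = [0.9, 0.8, 1.0, 0.7, 0.95]
gene_b = [0.4, 0.5, 0.3, 0.6, 0.45]
observed_diff, null_diffs = permuted_differences(gene_a, gene_b)
print(observed_diff, len(null_diffs))
```

The list `null_diffs` is the random distribution against which the observed value is compared in the evaluation step.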
How is the P-value derived?
- After running the permutation test, the position of the original data within the random datasets is evaluated.
- For this evaluation, a statistic of the original data (for example, the absolute value of the difference between the scores of gene A and gene B, or the correlation coefficient) is calculated.
- Among the random datasets obtained from the permutation test, the proportion in which a statistic greater than or equal to that of the original data was obtained is calculated.
- This proportion becomes the P-value. The P-value estimates the probability that a statistic at least as extreme as that of the original data would be obtained in a random state.
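Written as a formula, the steps above amount to the following. The symbols are notation introduced here for illustration: $T_{\mathrm{obs}}$ is the statistic of the original data, $T_b$ is the statistic of the $b$-th random dataset, and $B$ is the number of permutations.

```latex
p \;=\; \frac{\#\{\, b \in \{1, \dots, B\} : T_b \ge T_{\mathrm{obs}} \,\}}{B}
```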
For example, let us consider the case of evaluating the absolute value of the difference between the scores of gene A and gene B.
- Using the permutation test, the score data of gene A and gene B is rearranged at random.
- The absolute value of the difference between the scores of gene A and gene B in the original data is calculated.
- Among the random datasets obtained from the permutation test, the number of times an absolute difference greater than or equal to that of the original data was obtained is counted.
- That count is divided by the number of permutation repetitions to calculate a proportion. This becomes the P-value.
The P-value estimates the probability that a result at least as extreme as the original data would be obtained in a random state. The smaller the P-value, the less likely such a result is under randomness. In statistical hypothesis testing, the P-value is compared with a significance level set in advance (usually 0.05 or 0.01); if it is at or below that level, the result is said to be statistically significant.
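The counting step can be checked by hand with a tiny Python sketch. The ten permuted values and the observed value below are made up purely to make the arithmetic easy to follow, and `empirical_p_value` is a hypothetical helper name.

```python
def empirical_p_value(observed, permuted_stats):
    """Proportion of permuted statistics that are >= the observed one."""
    if not permuted_stats:
        raise ValueError("need at least one permuted statistic")
    count = sum(1 for s in permuted_stats if s >= observed)
    return count / len(permuted_stats)


# Made-up absolute differences from ten permutations.
permuted = [0.5, 1.2, 3.0, 2.1, 0.1, 4.0, 1.9, 2.5, 0.7, 3.3]
observed = 3.0

p = empirical_p_value(observed, permuted)
print(p)  # 0.3  (three of the ten values are >= 3.0)

alpha = 0.05  # significance level chosen in advance
print(p <= alpha)  # False
```

With only ten permutations the smallest nonzero P-value is 0.1, which is one reason the permutation is usually repeated several thousand times or more.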
The above is one example of how the P-value is calculated. With it, the results of the permutation test can be evaluated statistically, and one can estimate how likely a result at least as extreme as the original data would be in a random state.
Summarized in a diagram, it would look something like this.
