GO分析-GOseq的使用教程
<h1>GOseq的介绍</h1><p>GOseq是一个R包,用于寻找GO terms,即基因富集分析。<span class="fontstyle0">此方法基于 <span class="fontstyle2">Wallenius non-central hyper-geometric distribution<span class="fontstyle0">。相对于普</span></span></span><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0">通的超几何分布<span class="fontstyle2">(Hyper-geometric distribution)<span class="fontstyle0">,此分布的特点是从某个类别中抽取个体的概率与从某个类别之外抽取一个个体的概率是不同的,这种概率的不同是通过</span></span></span></span></span><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0">对基因长度的偏好性进行估计得到的,从而能更为准确地计算出 <span class="fontstyle2">GO term </span></span></span></span></span></span><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0">被差异基因富集的概率。</span></span></span></span></span></span></span></p>
<p><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"><span class="fontstyle2"><span class="fontstyle0"> </span></span></span></span></span></span></span></p>
<h2>1.GOseq的安装</h2>
<div class="cnblogs_code">
<pre>>BiocManager::install(<span style="color: rgba(128, 0, 0, 1)">"go</span><span style="color: rgba(128, 0, 0, 1)">seq</span><span style="color: rgba(128, 0, 0, 1)">"</span>)</pre>
</div>
<h2>2.参考数据集</h2>
<p>这里我们采用GOseq包里的内置数据集genes来做GO分析</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> <span style="color: rgba(0, 0, 0, 1)">library(goseq)
</span><span style="color: rgba(0, 128, 128, 1)">2</span> <span style="color: rgba(0, 0, 0, 1)">data(genes)
</span><span style="color: rgba(0, 128, 128, 1)">3</span> <span style="color: rgba(0, 0, 0, 1)">head(genes)
</span><span style="color: rgba(0, 128, 128, 1)">4</span> str(genes)</pre>
</div>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230192214573-795028514.png"></p>
<p> </p>
<p> 这里genes数据集是EMSEMBL gene的向量集合,其中1代表差异表达</p>
<h2>3.通过getgo函数获得GO terms</h2>
<p>getgo的用法:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> getgo(genes, genome, id,fetch.cats=c(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:CC</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:BP</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:MF</span><span style="color: rgba(128, 0, 0, 1)">"</span>))</pre>
</div>
<p>genes:genes是输入的gene向量或列表</p>
<p>genome:参考基因组,比如hg38,hg19</p>
<p>id:输入基因的类型,比如ensGene</p>
<p>fetch,cats:fetch.cats是"GO:CC", "GO:BP", "GO:MF" & "KEGG"的一系列组合</p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong>这里用supportedOrganisms()函数来查看支持的genome和id</strong></span></p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong>结果:结果是一个列表,包含每一个gene对应的所有的GO ID,这个值是goseq函数中gene2cat参数的输入值</strong></span></p>
<p>举例:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> genes <- c(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124208</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000182463</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124201</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124205</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124207</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">2</span> getgo(genes,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hg19</span><span style="color: rgba(128, 0, 0, 1)">'</span>,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ensGene</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230200109405-829352317.png"></p>
<p> </p>
<p><strong><span style="color: rgba(255, 0, 0, 1)"> 这里显示了每一个gene参与到的所有GO ID</span></strong></p>
<h2><span style="color: rgba(0, 0, 0, 1)"><strong>3.通过getlength函数检索gene的长度</strong></span></h2>
<p><span style="color: rgba(0, 0, 0, 1)">getlength用法:</span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> getlength(genes, genome, id)</pre>
</div>
<p> </p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong>结果:结果是一个向量,包含所有基因的长度,如果某个基因的长度无法检索到,用NA代替。这个向量是nullp函数中bias.data的输入值。</strong></span></p>
<p>举例:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> genes <- c(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124208</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000182463</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124201</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124205</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">ENSG00000124207</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">2</span> getlength(genes,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hg19</span><span style="color: rgba(128, 0, 0, 1)">'</span>,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ensGene</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230202737295-1167879461.png"></p>
<p> </p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong> 这里基因长度出现了3036.5,是因为这里基因长度取得是转录本长度的中位数。</strong></span></p>
<h2><span style="color: rgba(0, 0, 0, 1)"><strong>4.使用nullp函数(Probability Weighting Function)计算概率加权函数</strong></span></h2>
<p><span style="color: rgba(0, 0, 0, 1)">nullp函数介绍:</span></p>
<p><span style="color: rgba(0, 0, 0, 1)">Calculates a Probability Weighting Function for a set of genes based on a given set of biased data (usually gene length) and each genes status as differentially expressed or not.</span></p>
<p><span style="color: rgba(0, 0, 0, 1)">nullp函数用法:</span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> nullp(DEgenes, genome, id, bias.data=NULL,plot.fit=TRUE)</pre>
</div>
<p> </p>
<p><span style="color: rgba(0, 0, 0, 1)">DEgenes:DEgenes的格式是一个二元向量,其中1代表差异表达,0代表非差异表达,还有包括gene id,格式与内置数据集genes一样</span></p>
<p><span style="color: rgba(0, 0, 0, 1)">bias.data:bias.data是一个数值向量,通常是基因转录本长度的中位数,单位是bp.如果设置bias.data=NULL,nullp函数将通过getlength函数来获取gene的长度。所以这里默认设置为bias.data=NULL</span></p>
<p><span style="color: rgba(0, 0, 0, 1)">plot.fit:plot.fit这里将pwf作图,默认设置为plot,fit=TRUE</span></p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong>一般nullp函数后面的参数都选择默认,只用选择设置DEgenes, genome和 id即可</strong></span></p>
<p><span style="color: rgba(255, 0, 0, 1)"><strong>结果:结果是一个数据框,行名为gene id,列名为"DEgenes", "bias.data" 和 "pwf",这个数据框对象是goseq函数的输入,用来计算富集的GO terms,也可以作为plotPWF的输入,用来进一步作图。</strong></span></p>
<p><span style="color: rgba(0, 0, 128, 1)"><span style="color: rgba(0, 0, 0, 1)">举例:</span><strong><br></strong></span></p>
<p> </p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> <span style="color: rgba(0, 0, 0, 1)">data(genes)
</span><span style="color: rgba(0, 128, 128, 1)">2</span> pwf <- nullp(genes, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hg19</span><span style="color: rgba(128, 0, 0, 1)">'</span>, <span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ensGene</span><span style="color: rgba(128, 0, 0, 1)">'</span>)</pre>
</div>
<p> </p>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230212245074-2086596827.png"></p>
<p> </p>
<p> <img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230212420686-509957882.png"></p>
<p> </p>
<h2> 5.使用goseq函数进行GO富集分析</h2>
<p>goseq函数介绍:</p>
<p>Does selection-unbiased testing for category enrichment amongst differentially expressed (DE) genes for RNA-seq data. By default, tests gene ontology (GO) categories, but any categories may be tested.</p>
<p>goseq函数用法:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> goseq(pwf, genome, id, gene2cat = NULL,test.cats=c(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:CC</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:BP</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">GO:MF</span><span style="color: rgba(128, 0, 0, 1)">"</span>),method = <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">Wallenius</span><span style="color: rgba(128, 0, 0, 1)">"</span>, repcnt = <span style="color: rgba(128, 0, 128, 1)">2000</span>, use_genes_without_cat=FALSE)</pre>
</div>
<p>pwf:这里的pwf是由nullp函数得到的结果,为一个数据框</p>
<p>gene2cat:这里的gene2cat是由getgo函数得到的结果,如果设置gene2cat=NULL,goseq函数将会自动地用getgo函数来获得GO ID,默认设置是gene2cat=NULL</p>
<p>method:这里method有三种选择,"Wallenius", "Sampling" 和 "Hypergeometric".这里"Sampling" 和 "Hypergeometric"方法几乎从没被使用过</p>
<p><strong><span style="color: rgba(255, 0, 0, 1)">一般goseq函数后面的参数都可以选择默认,只用选择pwf,genome和id这三个参数就可以</span></strong></p>
<p><span style="color: rgba(0, 0, 0, 1)">举例:</span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> <span style="color: rgba(0, 0, 0, 1)">data(genes)
</span><span style="color: rgba(0, 128, 128, 1)">2</span> pwf <- nullp(genes,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hg19</span><span style="color: rgba(128, 0, 0, 1)">'</span>,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ensGene</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">3</span> pvals <- goseq(pwf,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">hg19</span><span style="color: rgba(128, 0, 0, 1)">'</span>,<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">ensGene</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 128, 128, 1)">4</span> head(pvals)</pre>
</div>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230215609080-219718073.png"></p>
<p> </p>
<p><strong><span style="color: rgba(255, 0, 0, 1)"> 这里的选择over_represented_pvalues<0.05就是具有统计学意义的GO ID了</span></strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 128, 1)">1</span> enriched.GO<-pvals</pre>
</div>
<p><img src="https://img2018.cnblogs.com/i-beta/1887694/201912/1887694-20191230220517540-1833856009.png"></p><br><br>
来源:https://www.cnblogs.com/yanjiamin/p/12121998.html
頁:
[1]