暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

变异注释-软件SnpEff

罗大黑学生信 2021-07-13
3217

目录

SnpEff 简介

SnpEff 下载安装

SnpEff 下载数据库

SnpEff 注释


SnpEff 简介

        SnpEff是一款注释变异位点(SNP+InDel+MNP)和预测变异对基因的影响(例如氨基酸变化)的工具。


SnpEff 下载安装  

        SnpEff 是免安装的软件,下载解压即可使用;

# Download latest version
> wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
# Unzip file
> unzip snpEff_latest_core.zip


SnpEff 下载数据库 

        一般我们可以直接用snpeff 的命令进行下载;步骤包括:

        1)查询可用的数据库;

        2)下载数据库。

# 查询可用数据库
> java -jar snpEff.jar databases > snpEff.databases.list.txt


#下载数据库, 以human为例
> java -jar snpEff.jar download hg19


# 下载成功之后,会在软件安装目录生成 data文件夹,并且生成一个以数据库名字命名的文件夹
> ll hg19/
total 447784
interactions.bin
nextProt.bin
pwms.bin
sequence.10.bin
sequence.11.bin
sequence.12.bin......


SnpEff 注释

Basic example;

输入文件:input : vcf;

#  input:vcf 8列, 如果构建虚拟的vcf文件;REF ALT不能一样
$ head examples/test.chr22.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO
22 17071756 . T C . . .
22 17072035 . C T . . .
22 17072258 . C A . . .
22 17072674 . G A . . .
22 17072747 . T C . . .
22 17072781 . C T . . .


databases  注释执行命令如下:

# 注释命令 及 output 在输入文件的INFO列新增了一个字段信息,字段的名字叫做ANN;关于ANN中各个部分的详细信息可以参考VCF头部的注释部分。
$ java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf


## 如果指定 配置文件,用 '-c' command; 如果条件允许, 可以用 '-t' command 多线程
$ java -Xmx4g path/to/snpEff/snpEff.jar -c -t 4 path/to/snpEff/snpEff.config GRCh37.75 path/to/snps.vcf
# java -Xmx4g -jar path/to/snpEff/snpEff.jar hg19 test.chr22.vcf > test.anno.hg19_multianno.txt


# Here is how the output looks like
$ head examples/test.chr22.ann.vcf
##SnpEffVersion="4.1 (build 2015-01-07), by Pablo Cingolani"
##SnpEffCmd="SnpEff GRCh37.75 examples/test.chr22.vcf "
##INFO=
##INFO=
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
22 17071756 . T C . . ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397A>G|||||4223|
22 17072035 . C T . . ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|
22 17072258 . C A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>T|||||3721|
22 17072674 . G A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397C>T|||||3305|

对注释的结果处理

处理前:
##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff hg19 /data/IndividualArchive/luojy/pipeline/Splitfusion/ngs/example/P4T1-L34/P4T1-L34_for_anno.txt "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
1 4165471 . T A . . ANN=A|intergenic_region|MODIFIER|LINC01346-LOC284661|LINC01346-LOC284661|intergenic_region|LINC01346-LOC284661|||n.4165471T>A||||||
1 9658640 . T A . . ANN=A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001130924.2|protein_coding|4/11|c.563T>A|p.Ser188Asn|620/3852|563/2001|188/666||WARNING_REF_DOES_NOT_MATCH_GENOME,A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001010866.3|protein_coding|4/6|c.563T>A|p.Ser188Asn|620/3907|563/1179|188/392||WARNING_REF_DOES_NOT_MATCH_GENOME
1 25792274 . T A . . ANN=A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_018202.5|protein_coding|6/10|c.1154+6891T>A||||||,A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_001282564.1|protein_coding|4/8|c.473+11401T>A||||||
1 38010402 . T A . . ANN=A|intron_variant|MODIFIER|SNIP1|SNIP1|transcript|NM_024700.3|protein_coding|2/3|c.328-4046A>T||||||
1 48305059 . T A . . ANN=A|intron_variant|MODIFIER|TRABD2B|TRABD2B|transcript|NM_001194986.1|protein_coding|2/6|c.667-37768A>T||||||


处理后:position gene strand func exonfunc refNM exon cdna
1_4165471 LINC01346-LOC284661 NA NA intergenic_region LINC01346-LOC284661 intron NA
1_9658640 TMEM201 NA protein_coding missense_variant NM_001130924.2 exon4 563
1_25792274 TMEM57 NA protein_coding intron_variant NM_018202.5 intron6 NA
1_38010402 SNIP1 NA protein_coding intron_variant NM_024700.3 intron2 NA
1_48305059 TRABD2B NA protein_coding intron_variant NM_001194986.1 intron2 NA
1_52940791 ZCCHC11 NA protein_coding stop_gained NM_001009881.2 exon13 2440
1_69212036 DEPDC1-AS1-LRRC7 NA NA intergenic_region DEPDC1-AS1-LRRC7 intron NA
1_115256558 NRAS NA protein_coding synonymous_variant NM_002524.4 exon3 153
1_115258735 NRAS NA protein_coding missense_variant NM_002524.4 exon2 47


部分代码:

def anno_process(input_file, output_file):
list_anno = []
with open(input_file, 'r') as r_handle:
reader = csv.DictReader(filter(lambda row: row[0] != '#', r_handle), delimiter='\t',
fieldnames=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'])
for row in reader:
spams = re.split(r'\|', row['INFO'])
dict_for_anno = {
'position': row['CHROM'] + '_' + row['POS'],
'gene': ('NA' if spams[3] == '' else spams[3]),
'strand': 'NA',
'func': ('NA' if spams[7] == '' else spams[7]),
'exonfunc': ('NA' if spams[1] == '' else spams[1]),
'refNM': ('NA' if spams[6] == '' else spams[6]),
'exon': ('intron' if spams[10] == '' else 'exon') + (
'' if spams[8] == '' else re.split(r'/', spams[8])[0]),
'cdna': ('NA' if spams[12] == '' else re.split(r'/', spams[12])[0]) ## 注意有争议的部分
}
list_anno.append(dict_for_anno)


with open(output_file, 'w') as w_handle:
writer = csv.DictWriter(w_handle, quoting=csv.QUOTE_MINIMAL, delimiter='\t',
fieldnames=['position', 'gene', 'strand', 'func', 'exonfunc', 'refNM', 'exon', 'cdna'])
writer.writeheader()
for row in natsort.natsorted(list_anno, key=itemgetter('position')):
writer.writerow(row)




参考

snpEff使用说明https://www.jianshu.com/p/e03095642ef0


文章转载自罗大黑学生信,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论