目录
SnpEff 简介
SnpEff 下载安装
SnpEff 下载数据库
SnpEff 注释
SnpEff 简介
SnpEff是一款注释变异位点(SNP+InDel+MNP)和预测变异对基因的影响(例如氨基酸变化)的工具。
SnpEff 下载安装
SnpEff 是免安装的软件,下载解压即可使用;
# Download latest version> wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip# Unzip file> unzip snpEff_latest_core.zip
SnpEff 下载数据库
一般我们可以直接用snpeff 的命令进行下载;步骤包括:
1)查询可用的数据库;
2)下载数据库。
# 查询可用数据库> java -jar snpEff.jar databases > snpEff.databases.list.txt#下载数据库, 以human为例> java -jar snpEff.jar download hg19# 下载成功之后,会在软件安装目录生成 data文件夹,并且生成一个以数据库名字命名的文件夹> ll hg19/total 447784interactions.binnextProt.binpwms.binsequence.10.binsequence.11.binsequence.12.bin......
SnpEff 注释
Basic example;
输入文件:input : vcf;
# input:vcf 8列, 如果构建虚拟的vcf文件;REF ALT不能一样$ head examples/test.chr22.vcf#CHROM POS ID REF ALT QUAL FILTER INFO22 17071756 . T C . . .22 17072035 . C T . . .22 17072258 . C A . . .22 17072674 . G A . . .22 17072747 . T C . . .22 17072781 . C T . . .
databases 注释执行命令如下:
# 注释命令 及 output 在输入文件的INFO列新增了一个字段信息,字段的名字叫做ANN;关于ANN中各个部分的详细信息可以参考VCF头部的注释部分。$ java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf## 如果指定 配置文件,用 '-c' command; 如果条件允许, 可以用 '-t' command 多线程$ java -Xmx4g path/to/snpEff/snpEff.jar -c -t 4 path/to/snpEff/snpEff.config GRCh37.75 path/to/snps.vcf# java -Xmx4g -jar path/to/snpEff/snpEff.jar hg19 test.chr22.vcf > test.anno.hg19_multianno.txt# Here is how the output looks like$ head examples/test.chr22.ann.vcf##SnpEffVersion="4.1 (build 2015-01-07), by Pablo Cingolani"##SnpEffCmd="SnpEff GRCh37.75 examples/test.chr22.vcf "##INFO=##INFO=##INFO=#CHROM POS ID REF ALT QUAL FILTER INFO22 17071756 . T C . . ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397A>G|||||4223|22 17072035 . C T . . ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|22 17072258 . C A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>T|||||3721|22 17072674 . G A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397C>T|||||3305|
对注释的结果处理
处理前:##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"##SnpEffCmd="SnpEff hg19 /data/IndividualArchive/luojy/pipeline/Splitfusion/ngs/example/P4T1-L34/P4T1-L34_for_anno.txt "##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">1 4165471 . T A . . ANN=A|intergenic_region|MODIFIER|LINC01346-LOC284661|LINC01346-LOC284661|intergenic_region|LINC01346-LOC284661|||n.4165471T>A||||||1 9658640 . T A . . ANN=A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001130924.2|protein_coding|4/11|c.563T>A|p.Ser188Asn|620/3852|563/2001|188/666||WARNING_REF_DOES_NOT_MATCH_GENOME,A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001010866.3|protein_coding|4/6|c.563T>A|p.Ser188Asn|620/3907|563/1179|188/392||WARNING_REF_DOES_NOT_MATCH_GENOME1 25792274 . T A . . ANN=A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_018202.5|protein_coding|6/10|c.1154+6891T>A||||||,A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_001282564.1|protein_coding|4/8|c.473+11401T>A||||||1 38010402 . T A . . ANN=A|intron_variant|MODIFIER|SNIP1|SNIP1|transcript|NM_024700.3|protein_coding|2/3|c.328-4046A>T||||||1 48305059 . T A . . ANN=A|intron_variant|MODIFIER|TRABD2B|TRABD2B|transcript|NM_001194986.1|protein_coding|2/6|c.667-37768A>T||||||处理后:position gene strand func exonfunc refNM exon cdna1_4165471 LINC01346-LOC284661 NA NA intergenic_region LINC01346-LOC284661 intron NA1_9658640 TMEM201 NA protein_coding missense_variant NM_001130924.2 exon4 5631_25792274 TMEM57 NA protein_coding intron_variant NM_018202.5 intron6 NA1_38010402 SNIP1 NA protein_coding intron_variant NM_024700.3 intron2 NA1_48305059 TRABD2B NA protein_coding intron_variant NM_001194986.1 intron2 NA1_52940791 ZCCHC11 NA protein_coding stop_gained NM_001009881.2 exon13 24401_69212036 DEPDC1-AS1-LRRC7 NA NA intergenic_region DEPDC1-AS1-LRRC7 intron NA1_115256558 NRAS NA protein_coding synonymous_variant NM_002524.4 exon3 1531_115258735 NRAS NA protein_coding missense_variant NM_002524.4 exon2 47
部分代码:
def anno_process(input_file, output_file):list_anno = []with open(input_file, 'r') as r_handle:reader = csv.DictReader(filter(lambda row: row[0] != '#', r_handle), delimiter='\t',fieldnames=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'])for row in reader:spams = re.split(r'\|', row['INFO'])dict_for_anno = {'position': row['CHROM'] + '_' + row['POS'],'gene': ('NA' if spams[3] == '' else spams[3]),'strand': 'NA','func': ('NA' if spams[7] == '' else spams[7]),'exonfunc': ('NA' if spams[1] == '' else spams[1]),'refNM': ('NA' if spams[6] == '' else spams[6]),'exon': ('intron' if spams[10] == '' else 'exon') + ('' if spams[8] == '' else re.split(r'/', spams[8])[0]),'cdna': ('NA' if spams[12] == '' else re.split(r'/', spams[12])[0]) ## 注意有争议的部分}list_anno.append(dict_for_anno)with open(output_file, 'w') as w_handle:writer = csv.DictWriter(w_handle, quoting=csv.QUOTE_MINIMAL, delimiter='\t',fieldnames=['position', 'gene', 'strand', 'func', 'exonfunc', 'refNM', 'exon', 'cdna'])writer.writeheader()for row in natsort.natsorted(list_anno, key=itemgetter('position')):writer.writerow(row)
参考
snpEff使用说明https://www.jianshu.com/p/e03095642ef0
文章转载自罗大黑学生信,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




