Pages

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

I'm working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I've sequenced the region in a small sample, I'm picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence.

This is easy to get for SNPs in dbSNP, but what about novel SNPs? Specifically, given a list of genomic positions where about half of them are potentially novel SNPs in the population I'm studying, how can I quickly and automatically fetch flanking sequence from the reference sequence? I asked how to do this on previously mentioned Biostar, and got several answers within minutes.

Fortunately this can easily be done using R and Bioconductor. For several reasons BioMart wouldn't do what I needed it to do, and I don't know the UCSC/Ensembl schemas well enough to use a query or the Perl API.

This R code below (modified from Adrian's answer on BioStar) does the trick:

# Load the BSgenome package and download the hg19 reference sequence
# For documentation see http://www.bioconductor.org/packages/release/bioc/html/BSgenome.html
source("http://www.bioconductor.org/biocLite.R")
biocLite("BSgenome")
biocLite("BSgenome.Hsapiens.UCSC.hg19") #installs the human genome (~850 MB download).
library('BSgenome.Hsapiens.UCSC.hg19')
# Function to pull flanking sequence. Defaults to +/- 10 bp
getflank <- function(position, alleles="[N/N]", chr="chr12", offset=10) {
leftflank <- getSeq(Hsapiens,chr,position-offset,position-1)
rightflank <- getSeq(Hsapiens,chr,position+1,position+offset)
paste(leftflank,alleles,rightflank,sep="")
}
# Try this as an example. should return "TGGCAACTCC[C/A]TTCCATTTGC",
# which is exactly what you'd see if you searched db SNP:
# http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp&cmd=search&term=rs1520218
getflank(102741896, alleles="[C/A]")
view raw getflank.r hosted with ❤ by GitHub