I'm starting from a binary pedigree fileset (PLINK's bed/bim/fam format), and the first step in the 1000 Genomes imputation cookbook is to store your data in Merlin format, one file per chromosome. Surprisingly, PLINK has no option to split a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: (1) the base filename of the binary pedfile (if your files are data.bed, data.bim, and data.fam, the base filename is "data", without the quotes); and (2) a base filename for the per-chromosome output files. It writes one binary fileset per chromosome for chromosomes 1 through 22, appending the chromosome number to the output base. You'll need PLINK installed for this to work, and I've only tested it on a Unix machine. You can copy the source code below:
#!/usr/bin/perl
# 2010-11-11
# (c) Stephen Turner
# http://GettingGeneticsDone.blogspot.com/
# http://www.stephenturner.us/
# This script takes as input the base filename of binary pedfiles (*.bed,
# *.bim, *.fam) and a base output filename and splits up a dataset by
# chromosome. Useful for imputing to 1000 genomes.
chomp(my $pwd = `pwd`);
my $help = "\nUsage: $0 <BEDfile base> <output base>\n\n";
die $help if @ARGV != 2;
my $infile_base  = $ARGV[0]; # base filename of inputs
my $outfile_base = $ARGV[1]; # base filename of outputs
my $plink_exec   = "plink --nonfounders --allow-no-sex --noweb";
my $chr = 22; # last chromosome to write out
for (1..$chr) {
    print "Processing chromosome $_\n";
    # Extract one chromosome into its own binary fileset, e.g. <output base>1.bed/.bim/.fam
    `$plink_exec --bfile $infile_base --chr $_ --make-bed --out ${outfile_base}$_`;
}
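If you'd like to check what the script will do before touching your data, here is a rough shell sketch that only prints the PLINK command for each chromosome instead of running it. The function name print_split_commands is my own hypothetical helper, not part of PLINK or the script above, and the example names (data, data_chr) are placeholders; the PLINK flags are the same ones the Perl script uses.

```shell
#!/bin/sh
# Dry run: print the PLINK command the Perl script would issue for each autosome.
# print_split_commands is a hypothetical helper name, not a PLINK feature.
print_split_commands() {
    in_base=$1    # base name of the input fileset (data.bed, data.bim, data.fam -> "data")
    out_base=$2   # prefix for the per-chromosome output filesets
    chr=1
    while [ "$chr" -le 22 ]; do
        echo "plink --nonfounders --allow-no-sex --noweb --bfile $in_base --chr $chr --make-bed --out ${out_base}${chr}"
        chr=$((chr + 1))
    done
}

# Example: preview the commands that would write data_chr1 ... data_chr22
print_split_commands data data_chr
```

Once the printed commands look right, you could pipe them to sh (print_split_commands data data_chr | sh) to run them, assuming plink is on your PATH.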