HAPGEN2 is a program that simulates case-control datasets at SNP markers. It is described in the paper
Z. Su, J. Marchini and P. Donnelly (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics.
It can simulate multiple disease SNPs on a single chromosome, on the assumption that the disease SNPs act independently and are in Hardy-Weinberg equilibrium. We also supply an R package that can simulate interaction between the disease SNPs. We hope to add further facilities to simulate quantitative traits and admixture soon.
The underlying simulation approach can handle markers in linkage disequilibrium (LD) and simulate datasets over large regions such as whole chromosomes. It simulates haplotypes by conditioning on a reference set of population haplotypes and an estimate of the fine-scale recombination rate across the region, so that the simulated data has the same LD patterns as the reference data.
The disease model is specified through a set of disease-causing SNPs together with their relative risks. The program is designed to work with publicly available files that contain the haplotypes estimated as part of the HapMap or 1000 Genomes projects and the estimated fine-scale recombination map derived from that data. HAPGEN2 is computationally tractable: on a modern desktop it can simulate data for several thousand cases and controls on a whole chromosome at HapMap marker density within minutes.
HAPGEN2 outputs data in the FILE FORMAT used by IMPUTE2, SNPTEST and GTOOL.
HAPGEN2 is available free of charge for academic use only. Please see the LICENCE page.
Pre-compiled versions of the program and example files can be downloaded from this Dropbox link
In addition to the basic HAPGEN2 binary, we provide an R package (called SimulatePhenotypes) for simulation under more complex disease models. This is also available via the Dropbox link.
In order to install an R package you need to uncompress it before installing it, for example:
tar -xzvf SimulatePhenotypes_1.0.tar.gz
R CMD INSTALL SimulatePhenotypes
To load the package, type “library(SimulatePhenotypes)” in R before running any of the functions.
HAPGEN2 is a command line program. To illustrate its use we have made an example dataset hapgen2.example.tgz that is found via the Dropbox link above. To unpack the files use a command like
tar -zxvf hapgen2.example.tgz
This will create a folder called example, which contains a set of example input files required by HAPGEN2.
If example is placed in the same directory as the HAPGEN2 binary then you can run HAPGEN2 by
./hapgen2 -m ./example/ex.map \
          -l ./example/ex.leg \
          -h ./example/ex.haps \
          -o ./example/ex.out \
          -dl 1085679 1 1.5 2.25 2190692 0 2 4 \
          -n 100 100 \
          -t ./example/ex.tags
This will simulate data for 100 control and 100 case individuals at the SNPs specified in the file example/ex.leg, with similar patterns of LD to the haplotypes in example/ex.haps. Two disease SNPs are simulated, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles set to 1 and 0 respectively. The results of the simulation are written to ./example/ex.out.haps, ./example/ex.out.sample, ./example/ex.out.gen, ./example/ex.out.tags and ./example/ex.out.summary. See below for a description of the options, input file formats and output file formats.
We recommend using the HapMap or 1000 Genomes data as input for HAPGEN2. Please see below for instructions on downloading and using them.
NOTE: HAPGEN2 seeds its random number generator using the time of day to the nearest second. Be aware of this when running multiple simulations, because runs that start within the same second will produce identical results.
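One way to avoid identical seeds is to stagger the start of each run. The following is a minimal sketch only, assuming the example files described above are in ./example; the names HAPGEN2_BIN and run_replicates are hypothetical and are not part of HAPGEN2.

```shell
#!/usr/bin/env bash
# Sketch: start replicate runs at least two seconds apart so that the
# time-based seed differs between them.
# HAPGEN2_BIN and run_replicates are hypothetical names used for illustration.
HAPGEN2_BIN="${HAPGEN2_BIN:-./hapgen2}"

run_replicates() {
    local n_reps="$1"
    local rep
    for rep in $(seq 1 "$n_reps"); do
        # Each replicate gets its own output prefix and runs in the background.
        "$HAPGEN2_BIN" \
            -m ./example/ex.map \
            -l ./example/ex.leg \
            -h ./example/ex.haps \
            -o "./example/rep${rep}.out" \
            -dl 1085679 1 1.5 2.25 2190692 0 2 4 \
            -n 100 100 &
        # The seed has one-second resolution, so pause before the next launch.
        sleep 2
    done
    # Block until every background run has finished.
    wait
}

# Usage (not executed here): run_replicates 5
```

The two-second pause is a conservative choice; any delay that guarantees each run starts in a different second should serve the same purpose.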
|-h <file>||Required||File of known haplotypes, with one row per SNP and one column per haplotype. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 — no other values are allowed. See the following section for links to the relevant HapMap and 1000 Genomes files.|
|-l <file>||Required||A legend file for the SNP markers. This file should have 4 columns with one line for each SNP. The columns should contain an ID for each SNP i.e. rs id of the marker, the base pair position of each SNP, base represented by 0 and base represented by 1. The first line of the legend file are column labels (these are not used by the program but the file is required to contain a header line). See the example file ex.leg. See the following section for links to the relevant HapMap and 1000 Genomes files.|
|-m <file>||Required||A file containing the fine-scale recombination rate across the region. This file should have 3 columns with one line for each SNP. The columns should contain physical location, rate in cM/Mb to the right of the marker and the cumulative rate in cM to the left of the marker. A header line containing the column labels is required. See the example file ex.map. See the following section for links to the relevant HapMap and 1000 Genomes files.|
|-dl <int> <a> <rr1> <rr2> …||Required||Sets the location, risk allele and relative risks for each disease SNP. For each disease SNP, four numbers are required in the following order: the physical position of the SNP, the risk allele (0 or 1), the heterozygote relative risk and the homozygote relative risk.
For example, -dl 1085679 1 1.5 2.25 2190692 0 2 4 specifies two disease SNPs, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles set to 1 and 0 respectively. There is no limit on the number of disease SNPs. We simulate under a disease model where the disease SNPs are independent, and the haplotypes defined by the disease SNPs are in HWE.
|-n <int> <int>||Recommended||Sets the number of control and the number of case individuals to simulate. For example -n 100 200 simulates 100 control and 200 case individuals. The default is to generate 1 control and 1 case individual.|
|-int <int> <int>||Optional||Specify the lower and upper boundaries of the region in which you wish to carry out simulation. The default is set to 0 and 500000000.|
|-o <file>||Required||Output file prefix. For example -o ex.out[.gz] creates the following files for the case data:
A similar set of files will be produced for the control data, with the same file names except that cases are replaced by controls.
|-output_snp_summary||Optional||Output the p-values and effect size estimates (under a log-additive model test) for each disease SNP, and under a joint model for all of the disease SNPs, in the simulated genotype data. Note that in version 2.1.x this step was always run, with no option to switch it off. Because it is very time consuming, it has been made optional from version 2.2.0 onwards.|
|-no_haps_output||Optional||No haplotype data files, *.haps[.gz], will be output for the case and control data.|
|-no_gens_output||Optional||No genotype data files, *.gen[.gz], will be output for the case and control data. However, if you have provided an input to the -t flag then the *.tags.gen[.gz] files will still be output.|
|-t <file>||Optional||SNP subset file. This option allows the user to output data at only a subset of the SNP markers in the simulated dataset i.e. at a set of tag SNPs. The file should contain the physical location of markers that will be in the output on one line per SNP. The physical locations must match those in the legend file. If this option is selected then a .tags.gen output file will be produced that contains the positions of the SNPs in the output file.|
|-Ne <int>||Optional||Sets the effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations.|
|-theta <real>||Optional||Sets the mutation rate in the model. For example, -theta 10 sets the scaled mutation rate to 10. By default, the mutation rate is set so that the expected number of mutations at a given SNP is equal to 1.|
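The file formats described in the table above allow some simple checks before running HAPGEN2. The following is a rough sketch, not part of the HAPGEN2 distribution: check_hapgen2_inputs and make_tag_file are hypothetical helper names, and the sketch assumes that the legend and haplotype files are whitespace-delimited. The first function checks that the haplotype file has one row per SNP in the legend file and contains only 0 and 1. The second writes a file for the -t option using the positions of every k-th SNP in the legend file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of HAPGEN2), based on the file formats
# described in the options table. Files are assumed whitespace-delimited.

# Check that the haplotype file has one row per SNP in the legend file
# (the legend has one extra header line) and that all alleles are 0 or 1.
check_hapgen2_inputs() {
    local leg="$1" haps="$2"
    local n_leg n_haps
    n_leg=$(awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$leg")
    n_haps=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$haps")
    if [ "$n_leg" -ne "$n_haps" ]; then
        echo "SNP count mismatch: legend=$n_leg haps=$n_haps" >&2
        return 1
    fi
    # awk exits with status 1 as soon as a field other than 0 or 1 is seen.
    if ! awk '{ for (i = 1; i <= NF; i++) if ($i != "0" && $i != "1") exit 1 }' "$haps"; then
        echo "haplotype file contains alleles other than 0 or 1" >&2
        return 1
    fi
}

# Write a tag file for -t: the position (column 2 of the legend) of every
# k-th SNP, one per line, starting with the first SNP after the header.
make_tag_file() {
    local leg="$1" step="$2" out="$3"
    awk -v step="$step" 'NR > 1 && (NR - 2) % step == 0 { print $2 }' "$leg" > "$out"
}

# Usage (not executed here; my.tags is an illustrative file name):
# check_hapgen2_inputs ./example/ex.leg ./example/ex.haps
# make_tag_file ./example/ex.leg 10 ./example/my.tags
```

Taking the positions directly from the legend file ensures that they match those in the legend, as the -t option requires.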
The basic HAPGEN2 executable can only simulate multiple independent disease SNPs. However, the function simulateDiscretePhenotypes in the R package SimulatePhenotypes can simulate phenotype data for a set of genotype data under a multiple-SNP interaction disease model. Therefore, one can first run HAPGEN2 under the null (by setting the effect sizes to 1.0 for all SNPs passed to the -dl flag or, if running version 2.0.2 or above, by simply omitting the -dl flag), load the simulated genotype data into R and pass it to the function to simulate the phenotype data. Since the simulation process is stochastic, the number of individuals simulated with case and control phenotypes cannot be controlled. See the help documentation in R for more details on running simulateDiscretePhenotypes.
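As an illustration of a run under the null, the sketch below builds a -dl argument list in which every relative risk is 1. The helper null_dl_args is a hypothetical name introduced here. The risk allele is set to 1 for every SNP, which is an arbitrary choice because it should have no effect when both relative risks are 1.

```shell
#!/usr/bin/env bash
# null_dl_args is a hypothetical helper (not part of HAPGEN2).
# For each position given, it prints "<position> 1 1 1", that is,
# risk allele 1 with heterozygote and homozygote relative risks of 1.
null_dl_args() {
    local pos
    for pos in "$@"; do
        printf '%s 1 1 1 ' "$pos"
    done
}

# Usage (not executed here). The unquoted $(...) is deliberately left
# unquoted so that the shell splits it into separate arguments.
# null.out is an illustrative output prefix.
# ./hapgen2 -m ./example/ex.map -l ./example/ex.leg -h ./example/ex.haps \
#     -o ./example/null.out -n 100 100 \
#     -dl $(null_dl_args 1085679 2190692)
```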
In addition, SimulatePhenotypes has the following functions:
that allow easy simulation of the two-SNP interaction disease model specified in Marchini et al. See the help documentation of those functions for more details.
Using HAPGEN2 with the HapMap2, HapMap3 and 1000 Genomes Project Data
A main use of HAPGEN2 will be to simulate genotypes based on the haplotypes from the HapMap2, HapMap3 and 1000 Genomes Project data. In particular, the HapMap3 data allows HAPGEN2 to simulate data for a number of populations, and the 1000 Genomes data allows the simulation of high-density SNP data. To facilitate the use of such data, we have designed HAPGEN2 to use the same input format (haplotype and legend files) as IMPUTE, so it can use the haplotype data that is available from the IMPUTE webpage.
If you have a question, please send an email to our mailing list.
You will need to subscribe to the mailing list to do this.