HAPGEN

HAPGEN2 is a program that simulates case-control datasets at SNP markers. It is described in the paper:

Z. Su, J. Marchini and P. Donnelly (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics.

It can simulate multiple disease SNPs on a single chromosome, on the assumption that the disease SNPs act independently and are in Hardy-Weinberg equilibrium. We also supply an R package that can simulate interaction between the disease SNPs. We hope to add further facilities to simulate quantitative traits and admixture soon.

The underlying simulation approach can handle markers in linkage disequilibrium (LD) and simulate datasets over large regions such as whole chromosomes. It simulates haplotypes by conditioning on a reference set of population haplotypes and an estimate of the fine-scale recombination rate across the region, so that the simulated data has the same LD patterns as the reference data.
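The idea of conditioning on reference haplotypes and a recombination map can be illustrated with a toy copying model. The Python sketch below is only an illustration under simplifying assumptions and is not HAPGEN2's actual algorithm: the function name, the switch probability 1 - exp(-rho/k) and the per-SNP mutation probability are all choices made for this example. A new haplotype is built as a mosaic of the reference haplotypes, switching template more often where the genetic distance between neighbouring SNPs is larger, so that SNPs that are close together tend to be copied from the same reference haplotype and keep its LD pattern.

```python
import math
import random


def simulate_haplotype(ref_haps, gen_dist_cm, ne=11418, mut_prob=0.0, rng=None):
    """Toy mosaic-copying simulation of one new haplotype (illustration only).

    ref_haps    : list of reference haplotypes, each a list of 0/1 alleles
    gen_dist_cm : genetic distance in cM between SNP j-1 and SNP j
                  (entry 0 is unused)
    ne          : effective population size (assumed scaling factor)
    mut_prob    : per-SNP probability of flipping the copied allele (assumed)
    """
    rng = rng or random.Random()
    k = len(ref_haps)
    n_snps = len(ref_haps[0])
    current = rng.randrange(k)  # index of the haplotype currently being copied
    new_hap = []
    for j in range(n_snps):
        if j > 0:
            # Larger genetic distance -> higher chance of switching template.
            rho = 4.0 * ne * gen_dist_cm[j] / 100.0
            if rng.random() < 1.0 - math.exp(-rho / k):
                current = rng.randrange(k)
        allele = ref_haps[current][j]
        if rng.random() < mut_prob:
            allele = 1 - allele  # imperfect copying
        new_hap.append(allele)
    return new_hap
```

With zero genetic distance and zero mutation probability the function returns an exact copy of one reference haplotype; increasing either value makes the output a finer mosaic.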

The disease model is specified through a set of disease-causing SNPs together with their relative risks. The program is designed to work with publicly available files that contain the haplotypes estimated as part of the HapMap or 1000 Genomes project and the estimated fine-scale recombination map derived from that data. HAPGEN2 is computationally tractable. On a modern desktop it can simulate data for several thousand cases and controls on a whole chromosome at HapMap marker density within minutes.

HAPGEN2 outputs data in the file format used by IMPUTE2, SNPTEST and GTOOL.

Download

HAPGEN2 is free for academic use only. Please see the LICENCE page.
Pre-compiled versions of the program and example files can be downloaded from this Dropbox link:

https://www.dropbox.com/sh/jlxeyecv2e6rg95/AABXQnqNnsj2j2YzqcpJANmBa?dl=0

In addition to the basic HAPGEN2 binary, we provide an R package (called SimulatePhenotypes) for simulation under more complex disease models. This is also available via the Dropbox link.

To install the R package, uncompress it first and then install it, for example:

tar -xzvf SimulatePhenotypes_1.0.tar.gz
R CMD INSTALL SimulatePhenotypes

To load the package, type library(SimulatePhenotypes) in R before running any of the functions.

Running HAPGEN2

Quick example

HAPGEN2 is a command-line program. To illustrate its use we have made an example dataset, hapgen2.example.tgz, which is available via the Dropbox link above. To unpack the files, use a command like:

tar -zxvf hapgen2.example.tgz

This will create a folder called example, which contains a set of example input files required by HAPGEN2.

If example is placed in the same directory as the HAPGEN2 binary, you can run HAPGEN2 with:

./hapgen2 -m ./example/ex.map \
-l ./example/ex.leg \
-h ./example/ex.haps \
-o ./example/ex.out \
-dl 1085679 1 1.5 2.25 2190692 0 2 4 \
-n 100 100 \
-t ./example/ex.tags

This will simulate data for 100 control and 100 case individuals at the SNPs specified in the file example/ex.leg, with LD patterns similar to those of the haplotypes in example/ex.haps. Two disease SNPs are simulated, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles 1 and 0 respectively. The results of the simulation are written to files that share the prefix ./example/ex.out: haplotype (.haps), genotype (.gen), sample (.sample) and tag SNP (.tags.gen) files for the cases and for the controls, plus a summary file. See below for a description of the options, input file formats and output file formats.
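When many simulations are scripted, it can help to assemble the command line programmatically. The following Python sketch builds the argument list for the example above. The helper name build_hapgen2_command and its parameters are hypothetical; only flags that appear in the example command are used, and the order of the two -n values (controls first, then cases) follows the description of the -n flag in the Options section.

```python
def build_hapgen2_command(binary, map_file, legend_file, haps_file, out_prefix,
                          disease_snps, n_controls, n_cases, tags_file=None):
    """Return an argv list for HAPGEN2 (hypothetical helper, not part of HAPGEN2).

    disease_snps is a list of (position, risk_allele, het_risk, hom_risk) tuples.
    """
    cmd = [binary, "-m", map_file, "-l", legend_file, "-h", haps_file,
           "-o", out_prefix]
    if disease_snps:
        cmd.append("-dl")
        for position, risk_allele, het_rr, hom_rr in disease_snps:
            if risk_allele not in (0, 1):
                raise ValueError("risk allele must be 0 or 1")
            # Four values per disease SNP, in the order expected by -dl.
            cmd += [str(position), str(risk_allele), f"{het_rr:g}", f"{hom_rr:g}"]
    cmd += ["-n", str(n_controls), str(n_cases)]
    if tags_file:
        cmd += ["-t", tags_file]
    return cmd


example_cmd = build_hapgen2_command(
    "./hapgen2", "./example/ex.map", "./example/ex.leg", "./example/ex.haps",
    "./example/ex.out",
    [(1085679, 1, 1.5, 2.25), (2190692, 0, 2, 4)],
    n_controls=100, n_cases=100, tags_file="./example/ex.tags")
```

The resulting list can be passed to subprocess.run(example_cmd, check=True). In this example both homozygote risks are the square of the heterozygote risks (2.25 = 1.5 x 1.5 and 4 = 2 x 2), which corresponds to a multiplicative effect per copy of the risk allele.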

We recommend using the HapMap or 1000 Genomes data as input for HAPGEN2. Please see below for instructions on downloading and using them.

NOTE: HAPGEN2 sets the seed of its random number generator from the time of day, to the nearest second. Be aware of this when running multiple simulations, because runs that are started very close together in time will produce identical results.
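One way to work around the time-based seed when launching several runs from a script is to leave a gap between successive starts. The Python sketch below is a minimal illustration; the function run_staggered and its parameters are hypothetical, and the 2 second default gap is an assumption chosen so that two consecutive runs do not start within the same second.

```python
import subprocess
import time


def run_staggered(commands, min_gap_seconds=2.0, launch=subprocess.Popen,
                  sleep=time.sleep, clock=time.monotonic):
    """Start each command at least `min_gap_seconds` after the previous one.

    `launch`, `sleep` and `clock` are injectable so the scheduling logic can
    be exercised without starting real processes.
    """
    processes = []
    last_start = None
    for cmd in commands:
        if last_start is not None:
            # Wait out whatever remains of the gap since the previous launch.
            wait = min_gap_seconds - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        processes.append(launch(cmd))
    return processes
```

The runs still execute in parallel; only their start times are spread out. Each run should also be given its own output prefix so that the results do not overwrite each other.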

Options

Flags Required/Optional Description
-h <file> Required File of known haplotypes, with one row per SNP and one column per haplotype. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 — no other values are allowed. See the following section for links to the relevant HapMap and 1000 Genomes files.
-l <file> Required A legend file for the SNP markers. This file should have 4 columns with one line for each SNP. The columns should contain an ID for each SNP (i.e. the rs id of the marker), the base pair position of the SNP, the base represented by 0 and the base represented by 1. The first line of the legend file holds the column labels (these are not used by the program, but the file is required to contain a header line). See the example file ex.leg. See the following section for links to the relevant HapMap and 1000 Genomes files.
-m <file> Required A file containing the fine-scale recombination rate across the region. This file should have 3 columns with one line for each SNP. The columns should contain the physical location, the rate in cM/Mb to the right of the marker and the cumulative rate in cM to the left of the marker. A header line containing the column labels is required. See the example file ex.map. See the following section for links to the relevant HapMap and 1000 Genomes files.
-dl <int> <a> <rr1> <rr2> … Required (optional from version 2.0.2) Sets the location, risk allele and relative risks for each disease SNP. For each disease SNP, four numbers are required in the following order:

  1. physical location of SNP, which must be in the legend file supplied to the -l flag
  2. risk allele (0 or 1), the corresponding base can be found in the legend file
  3. heterozygote disease risk
  4. homozygote disease risk

For example, -dl 1085679 1 1.5 2.25 2190692 0 2 4 specifies two disease SNPs, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles 1 and 0 respectively. There is no limit on the number of disease SNPs. We simulate under a disease model in which the disease SNPs are independent and the haplotypes defined by the disease SNPs are in Hardy-Weinberg equilibrium (HWE).
This flag is optional in version 2.0.2 and above; if it is not supplied, all haplotypes are simulated under the null.

-n <int> <int> Recommended Sets the number of control and the number of case individuals to simulate. For example -n 100 200 simulates 100 control and 200 case individuals. The default is to generate 1 control and 1 case individual.
-int <int> <int> Optional Specify the lower and upper boundaries of the region in which you wish to carry out simulation. The default is set to 0 and 500000000.
-o <file> Required Output file prefix. For example -o ex.out[.gz] creates the following files for the case data:

  • ex.out.cases.haps[.gz] – A file containing the simulated haplotype data in the same format as the haplotype file supplied to the -h flag.
  • ex.out.legend (from version 2.1.2 onwards) – A legend file with information about the SNPs in the .haps files.
  • ex.out.cases.gen[.gz] – A file containing the simulated genotype data in the file format compatible with SNPTEST, SNPTEST2, IMPUTE, IMPUTE2 and GTOOL.
  • ex.out.cases.sample – A sample file in the file format compatible with SNPTEST2 for the simulated genotype data.
  • ex.out.cases.tags.gen[.gz] – The genotype data limited to the subset of SNPs specified by the file supplied to the -t flag (if applicable).

A similar set of files will be produced for the control data, with the same file names except that cases is replaced by controls.
A summary file, ex.out.summary, will also be produced, which summarises the simulation parameters, input files and output files.

Note:

  • If the output file prefix has a .gz extension then the *.haps.gz, *.gen.gz and *.tags.gen.gz files will be gzipped.
  • It is possible to suppress some of the output files using the flags -no_gens_output and -no_haps_output (see below).
-output_snp_summary Optional Output the p-values and effect size estimates (under a log-additive model test) for each disease SNP, and under a joint model for all of the disease SNPs, in the simulated genotype data. Note that in version 2.1.x this option was always on (with no way to switch it off), but the step proved very time-consuming and has therefore been optional from version 2.2.0 onwards.
-no_haps_output Optional No haplotype data files, *.haps[.gz], will be written for the case and control data.
-no_gens_output Optional No genotype data files, *.gen[.gz], will be written for the case and control data. However, if you have supplied a file to the -t flag then the *.tags.gen[.gz] files will still be written.
-t <file> Optional SNP subset file. This option allows the user to output data at only a subset of the SNP markers in the simulated dataset, i.e. at a set of tag SNPs. The file should contain the physical locations of the markers to output, with one line per SNP. The physical locations must match those in the legend file. If this option is selected then a .tags.gen output file will be produced that contains the genotype data at the SNPs listed in the file.
-Ne <int> Optional Sets the effective population size, which scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we strongly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations.
-theta <real> Optional Sets the mutation rate in the model. For example, -theta 10 sets the scaled mutation rate to 10. By default, the mutation rate is set so that the expected number of mutations at a given SNP is equal to 1.
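Because the -h, -l and -m files must agree with each other, a quick consistency check before a long run can save time. The Python sketch below checks only the properties stated in the option descriptions above (a header line plus 4 columns in the legend file, one 0/1-coded haplotype row per legend SNP, and a header line plus 3 numeric columns in the map file). The function check_inputs is a hypothetical helper and is not part of HAPGEN2.

```python
def check_inputs(legend_lines, haps_lines, map_lines):
    """Check basic consistency of legend, haplotype and map file contents.

    Each argument is an iterable of text lines (for example an open file).
    Returns the number of SNPs; raises ValueError on the first problem found.
    """
    legend = [ln.split() for ln in legend_lines if ln.strip()]
    haps = [ln.split() for ln in haps_lines if ln.strip()]
    rec_map = [ln.split() for ln in map_lines if ln.strip()]

    legend_rows = legend[1:]  # the first line is the required header
    for row in legend_rows:
        if len(row) != 4:
            raise ValueError(f"legend row needs 4 columns: {row}")
        int(row[1])  # base pair position; raises ValueError if not an integer

    if len(haps) != len(legend_rows):
        raise ValueError("haplotype file must have one row per legend SNP")
    n_haps = len(haps[0]) if haps else 0
    for row in haps:
        if len(row) != n_haps or any(a not in ("0", "1") for a in row):
            raise ValueError("haplotype rows must be equal-length and coded 0/1")

    for row in rec_map[1:]:  # skip the required header line
        if len(row) != 3:
            raise ValueError(f"map row needs 3 columns: {row}")
        float(row[1])  # rate in cM/Mb
        float(row[2])  # cumulative rate in cM
    return len(legend_rows)
```

For real files, pass open file objects, for example check_inputs(open("ex.leg"), open("ex.haps"), open("ex.map")).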

Simulating interaction

The basic HAPGEN2 executable can only simulate multiple independent disease SNPs. However, the function simulateDiscretePhenotypes in the R package SimulatePhenotypes can simulate phenotype data for a set of genotype data under a multiple-SNP interaction disease model. Therefore, one can first run HAPGEN2 under the null (by setting the effect sizes to 1.0 for all SNPs passed to the -dl flag or, if running version 2.0.2 or above, by simply omitting the -dl flag), load the simulated genotype data into R and pass it into the function to simulate the phenotype data. Since the simulation process is stochastic, the number of individuals simulated with case and control phenotypes cannot be controlled. See the help documentation in R for more details on running simulateDiscretePhenotypes.
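To see why the case and control counts cannot be fixed in advance, consider the following Python sketch of phenotype simulation on top of null genotypes. It uses a generic logistic model with an interaction term as an assumption for illustration; it is not the model implemented in simulateDiscretePhenotypes, and the function and parameter names are hypothetical. Each individual's phenotype is drawn independently from a probability that depends on the genotypes at two SNPs, so the total number of cases is itself random.

```python
import math
import random


def simulate_phenotypes(genotypes, baseline_log_odds, log_or_a, log_or_b,
                        log_or_interaction, rng=None):
    """Draw a 0/1 phenotype for each (count_a, count_b) genotype pair.

    count_a and count_b are risk-allele counts (0, 1 or 2) at two SNPs.
    The logistic model with a product interaction term is an assumed example.
    """
    rng = rng or random.Random()
    phenotypes = []
    for count_a, count_b in genotypes:
        logit = (baseline_log_odds
                 + log_or_a * count_a
                 + log_or_b * count_b
                 + log_or_interaction * count_a * count_b)
        p_case = 1.0 / (1.0 + math.exp(-logit))
        # Independent Bernoulli draw per individual.
        phenotypes.append(1 if rng.random() < p_case else 0)
    return phenotypes
```

Calling sum(simulate_phenotypes(...)) on the same genotypes with different random states will generally give different numbers of cases.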

In addition, SimulatePhenotypes has the following functions:

  • twoSnpInteractionModel1
  • twoSnpInteractionModel2
  • twoSnpInteractionModel3

that allow easy simulation of the two-SNP interaction disease models specified in Marchini et al. [4]. See the help documentation of those functions for more details.

Using HAPGEN2 with the HapMap2, HapMap3 and 1000 Genomes Project Data

A main use of HAPGEN2 will be to simulate genotypes based on the haplotypes from HapMap2, HapMap3 and the 1000 Genomes Project data. In particular, the HapMap3 data allows HAPGEN2 to simulate data for a number of populations and the 1000 Genomes data allows the simulation of high-density SNP data. To facilitate the use of such data we have designed HAPGEN2 to use the same input format (haplotype and legend files) as required by IMPUTE, so it can use the haplotype data that is available from the IMPUTE webpage.

Contact Information

If you have a question, please send an email to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to do this.