Haplotype estimation for biobank-scale datasets
SHAPEIT3 introduces several important extensions to the vanilla SHAPEIT algorithm to enable this scalability:
- a fast clustering routine to identify conditioning haplotypes in sub-quadratic time
- early stopping of the HMM when perfect haplotype matches are found
- a redesigned MCMC routine for better performance
These features only become relevant at very large sample sizes. If your sample size is <20,000, we recommend you use SHAPEIT2.
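The recommendation above can be turned into a quick check before launching a run. The sketch below is not part of SHAPEIT3: `suggest_shapeit` is a hypothetical helper name, and the commented-out line assumes the input is a PLINK binary fileset, where the .fam file holds one line per sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SHAPEIT3): suggest which version to run
# based on the number of samples. The 20,000 cut-off is the one recommended above.
suggest_shapeit() {
    local n_samples=$1
    if [ "$n_samples" -lt 20000 ]; then
        echo "SHAPEIT2"
    else
        echo "SHAPEIT3"
    fi
}

# Assumption: for a PLINK fileset, the sample count is the number of lines in the .fam file.
# n_samples=$(wc -l < example/gwas.fam)

suggest_shapeit 5000     # prints SHAPEIT2
suggest_shapeit 150000   # prints SHAPEIT3
```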
If you use SHAPEIT3 in your research, please cite the following publication:
SHAPEIT3 commands are very similar to those of SHAPEIT2, with a few additional arguments to enable its fast scaling. Simply enabling the --fast flag should be sufficient for most users:
    shapeit3 \
        -B example/gwas \
        -M genetic_map.txt \
        -O out \
        --threads 2 \
        --cluster-size 500
A full list of arguments is available here. In addition to the standard SHAPEIT2 parameters, SHAPEIT3 has the following arguments:
- --fast: enable fast mode. This enables both the fast conditioning haplotype search and the early stopping HMM. We recommend users enable this when sample sizes are >15,000.
- --cluster-size 4000: the size of the clusters used in the haplotype search. Accuracy and computation time will increase with this value. We have found 4000 provides a good tradeoff.
- --early-stopping: do not perform HMM iterations in a window if a perfect match with a conditioning haplotype is found.
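To show how these arguments fit together, here is a small sketch that assembles a command line with --fast and an explicit --cluster-size. It assumes the two flags can be combined in one call, which the descriptions above suggest but do not state outright. The input and output names are the ones from the example command, and `build_shapeit3_cmd` is a hypothetical helper that only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: assemble a SHAPEIT3 command line from the arguments described above.
# build_shapeit3_cmd is a hypothetical helper; it prints the command, it does not run it.
build_shapeit3_cmd() {
    local cluster_size=${1:-4000}   # 4000 is the value suggested above as a good tradeoff
    local cmd=(shapeit3
        -B example/gwas
        -M genetic_map.txt
        -O out
        --threads 2
        --fast
        --cluster-size "$cluster_size")
    echo "${cmd[*]}"
}

build_shapeit3_cmd        # uses the cluster size of 4000
build_shapeit3_cmd 500    # smaller clusters: faster, at some cost in accuracy
```

Printing the command first makes it easy to check the arguments before committing to a long run on a large cohort.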
Software registration and licence:
SHAPEIT3 is freely available for academic use only. For the rules on non-academic use, see the LICENCE file (also included with each software download).
Software and licence can be downloaded here.
21 July (v1.0) : First release