AlphaImpute v. 1.9 manual The complete, un-abridged, kill-a-tree edition Stefan McKinnon Edwards Roberto Antolin David Wilson John M. Hickey


Foreword

AlphaImpute is a software package for imputing and phasing genotype data in diploid populations with pedigree information. To impute an individual’s genotype, the program infers possible haplotypes of the parental gametes that were inherited by the individual. This requires phase information; when it is not present, the program infers it. AlphaImpute operates on overlapping chromosome segments, here referred to as cores, to account for recombination events.

Please report bugs or suggestions on how the program, user interface, or manual could be improved or made more user friendly to the authors.

Historical background

AlphaImpute (Hickey et al. 2012) was developed as the next step after AlphaPhase (Hickey et al. 2011). While AlphaPhase inferred the underlying haplotypes that could be inherited, AlphaImpute uses this information to impute missing genotypes based on Mendelian inheritance.

Prior to version 1.9, AlphaImpute relied on several algorithms, notably ‘GeneProb’ (Kerr & Kinghorn 1996) and ‘AlphaPhase’ (Hickey et al. 2011). These were implemented as separate executables that AlphaImpute called, but are now fully embedded within AlphaImpute for improved performance and efficiency.

The GeneProb algorithm performed segregation analysis by calculating the probability of an ungenotyped individual belonging to each genotype class. This was performed for each SNP position independently by traversing up and down the pedigree. The use of the GeneProb algorithm is now deprecated in favor of a more efficient algorithm.

The AlphaPhase algorithm (Hickey et al. 2011) is responsible for inferring the paternal and maternal gametes that each individual inherited; a process called long-range phasing (Kong et al. 2008). This is performed on consecutive SNPs throughout multiple, overlapping chromosomal regions. AlphaPhase furthermore performs haplotype library imputation to resolve phases where family structure is lacking. AlphaPhase is now fully embedded within AlphaImpute, but available as a separate binary from the website.


AlphaImpute builds upon the haplotype library constructed by the AlphaPhase algorithm to impute missing genotypes. This allows for imputation of completely ungenotyped animals and low-density genotyped animals by iteratively matching the inferred haplotypes to the imputed animals’ alleles and updating the haplotype library with new haplotypes.

The newest development introduces a hidden Markov model (HMM) (Antolín et al. 2017) that performs well in unrelated animals or when the pedigree is unreliable. The HMM is based on MaCH (Li et al. 2010), and can be used either alone or in conjunction with AlphaImpute’s standard long-range phasing and heuristic imputation.

Availability

AlphaImpute is available as an executable file from the AlphaGenes website: http://www.alphagenes.roslin.ed.ac.uk. Binaries are compiled for Linux (64 bit), Mac OS X (64 bit), and Windows (64 bit).

System requirements

The amount of RAM the program requires depends on the dataset. It tends to scale linearly with the number of animals in the supplied pedigree.

OS X

The program will function on any 64-bit Intel-based Mac with OS X 10.8 or later (older versions might work).

Linux

The program was tested on Linux kernel 3.14 and should in theory support any system with SSE3 instructions.

Windows

The program will function on any PC with a 64-bit processor with SSE3 instructions. It requires Windows 7 or later.


Conditions of use

AlphaImpute is available to the scientific community free of charge, but conditional on crediting its use in any publication with the citations listed below. Commercial users should contact John Hickey.

Citations:

• Antolin, R., Nettelblad, C., Gorjanc, G., Money, D. & Hickey, J.M. (2017) A hybrid method for the imputation of genomic data in livestock populations. Genetics Selection Evolution 49. doi:10.1186/s12711-017-0300-y

• Hickey, J.M., Kinghorn, B.P., Tier, B., van der Werf, J.H. & Cleveland, M.A. (2012) A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genetics Selection Evolution 44, 11. doi:10.1186/1297-9686-44-9

See the appendix for BibTeX format, together with a full list of relevant citations for AlphaImpute.

Disclaimer

While every effort has been made to ensure that AlphaImpute does what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of AlphaImpute is entirely at your own risk!

Advertisement

You are welcome to check out our Gibbs sampler “AlphaBayes”, specifically designed for GWAS and genomic selection: http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/alphabayes/

For simulating breeding programs, complete with genomic selection, SNP panels, recombination, and more, see “AlphaSim”: http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/alphasim/

Contents

1 AlphaImpute introduction 7
2 Quick reference 13
3 Full settings reference 19
4 Frequently asked questions 35
5 Example 1: AlphaImpute tutorial 37
Appendix I: Bibtex for AlphaImpute 45
Bibliography 46
Index 49

1 AlphaImpute introduction

This section briefly describes what AlphaImpute does, how it works, and how to use it. It is intended as a quick introduction to getting started with imputing genotypes with AlphaImpute.

A full reference with examples is available in the full manual.

The theoretical framework for AlphaImpute and its implemented algorithms can be found in Hickey et al. (2012; 2011) and Antolin et al. (2017).

Contents

What AlphaImpute does: Imputing genotypes 7
How AlphaImpute works 8
How to use AlphaImpute 9
Running AlphaImpute 10
Imputed genotypes are found in the Results directory 10
Spec file example 11

What AlphaImpute does: Imputing genotypes

AlphaImpute’s primary goal is to impute genotypes. This is needed when a range of animals or humans have been genotyped using different single nucleotide polymorphism (SNP) genotyping panels; expensive SNP panels can genotype many SNPs (‘high-density’) while cheaper SNP panels genotype fewer SNPs (‘low-density’). Imputing genotypes fills in the missing SNPs on the low-density SNP panels, based on the genotypes on the high-density SNP panels.

AlphaImpute employs several algorithms to achieve this. A notable by-product is the phased genotypes, or phases. The phases are the sequences of alleles that were inherited together in the parental gametes that fused to produce the animal. In comparison, the genotypes are the sequences of homozygous/heterozygous SNPs.


How AlphaImpute works

AlphaImpute both phases and imputes alleles, which are then combined to impute missing genotypes. This occurs in two steps:

1. Long-range phasing and haplotype library construction of animals genotyped with high-density SNP panels.
2. Imputation of missing genotypes by matching haplotypes.

To do so, AlphaImpute first identifies all individuals that were genotyped at a large proportion of the SNP positions. These comprise the high-density genotyped animals. These individuals are phased to derive the haplotypes that they carry and that could be inherited or passed on to offspring. The phasing process is referred to as long-range phasing, and results in the construction of the haplotype library. This is performed on cores, continuous subsets of the chromosome. By having multiple lengths of cores, the subsets overlap, allowing both close and distant inheritances to be identified.

With the haplotype library, missing alleles are imputed by matching the known alleles surrounding the missing alleles with the corresponding alleles of the haplotypes in the haplotype library. By using basic rules of Mendelian inheritance and segregation analysis, plausible haplotypes are found and used to impute the missing alleles.

For a given allele position, multiple cores have been used to identify haplotypes. Only haplotypes that could have been inherited from the parents are used, and where all haplotypes are in agreement, the missing allele is imputed. If this results in a new haplotype, the haplotype library is updated.

The imputation is iterated several times, so that alleles that could not be imputed might be imputed with a haplotype that was found in a subsequently imputed individual.
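
The matching idea described above can be illustrated with a deliberately simplified sketch. It is not AlphaImpute's implementation: it works on a single core, ignores the pedigree and the restriction to parental haplotypes, and does not update the library. The function name `impute_from_library` and the toy data are hypothetical.

```python
from typing import List

MISSING = 9  # same missing code as used in the called output files


def impute_from_library(phase: List[int], library: List[List[int]]) -> List[int]:
    """Fill missing alleles where all compatible library haplotypes agree."""
    # A library haplotype is compatible if it matches every known allele.
    candidates = [
        hap for hap in library
        if all(a == MISSING or a == h for a, h in zip(phase, hap))
    ]
    result = list(phase)
    for pos, allele in enumerate(phase):
        if allele != MISSING or not candidates:
            continue
        proposed = {hap[pos] for hap in candidates}
        if len(proposed) == 1:  # unanimous: impute; otherwise leave missing
            result[pos] = proposed.pop()
    return result


# Toy haplotype library for one core of five SNPs (assumed data).
library = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
]
observed = [0, MISSING, 1, MISSING, MISSING]
imputed = impute_from_library(observed, library)
print(imputed)
```

In this toy case the first two library haplotypes are compatible with the known alleles. They agree at the second and fifth positions, which are imputed, and disagree at the fourth, which stays missing.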

AlphaImpute finally outputs imputed phases and genotypes for all individuals. For both outputs, alleles that are in agreement across all cores are output as integer values, or as missing where no agreement was found. In addition, the average allele across all cores is output as allele or genotype dosages: numeric values between 0.0 and 1.0 for phases and between 0.0 and 2.0 for genotypes.

Summary

AlphaImpute works by:

1. Dividing the animals into two groups, either high-density genotyped or low-density genotyped.
2. Phasing the genotypes of the high-density genotyped animals.
3. Haplotype library construction, and
4. Iteratively using haplotypes to impute missing genotypes.

The first of the two steps described above corresponds to the first three items, the second step to the last item.
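
As an illustration of item 1, the following sketch divides animals into the two groups by the percentage of non-missing SNPs. It is a minimal sketch under stated assumptions: the default of 90.0 mirrors the HDAnimalsThreshold value in the spec file example, while the function name `split_by_density` and the use of "greater than or equal" at the threshold are assumptions, not documented behaviour.

```python
from typing import Dict, List, Tuple

MISSING_CODES = set(range(3, 10))  # genotype codes 3..9 denote missing values


def split_by_density(
    genotypes: Dict[str, List[int]], hd_threshold_pct: float = 90.0
) -> Tuple[List[str], List[str]]:
    """Split animals into high- and low-density groups by call rate."""
    high, low = [], []
    for animal_id, snps in genotypes.items():
        called = sum(1 for g in snps if g not in MISSING_CODES)
        pct = 100.0 * called / len(snps)
        # Assumption: an animal exactly at the threshold counts as high-density.
        (high if pct >= hd_threshold_pct else low).append(animal_id)
    return high, low


# Toy data: one fully genotyped animal and one with 3 of 10 SNPs genotyped.
toy = {
    "1067": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "1068": [0, 9, 9, 9, 1, 9, 9, 9, 2, 9],
}
high_density, low_density = split_by_density(toy)
print(high_density, low_density)
```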


How to use AlphaImpute

AlphaImpute is available as an executable file from the AlphaGenes website: http://www.alphagenes.roslin.ed.ac.uk. Its standard mode of operation is described here. Alternative uses, including re-using previous phasing information, are described in the full manual.

AlphaImpute requires at least 3 input files: a file specifying settings (‘spec file’), a genotype file, and a pedigree file [1]. For the genotype and pedigree files, AlphaImpute accepts either simple text formats or PLINK 1.9 formatted files (Purcell et al. 2007; Chang et al. 2015).

The spec file is text based and discussed in the following section.

AlphaImpute accepts integer or alphanumeric IDs for identifying animals [2].

The pedigree file must consist of 3 columns, space or comma separated: ID of the individual, ID of the sire, and ID of the dam. The pedigree does not have to be sorted, as this is done during preprocessing. Missing parents should be coded as 0.

The genotype file is formatted with one row per individual, and is space or comma separated. Multiple spaces or tabs are ignored. The first column of the genotype file contains the ID of the genotyped individual, followed by one column for each SNP. SNPs are encoded 0, 1, or 2 for homozygous aa, heterozygous aA or Aa, and homozygous AA, respectively. Missing genotypes are coded with any value between 3 and 9. For PLINK 1.9 formatted files, we refer to the full manual.

Neither the pedigree nor the genotype file may contain headers. The pedigree file may record additional individuals that are not genotyped.

Settings in the spec file

The settings for AlphaImpute are specified in AlphaImputeSpec.txt [3]. An example of the spec file is given in the section ‘Spec file example’. Most options have default values and do not require specifying [4]. Only a few lines are required to specify the input files, number of SNPs, settings for phasing, parallelization, and the threshold for distinguishing high-density vs. low-density individuals, although good default values exist for these. Examples of the file are available in the zip file with the executable.

[1] Pedigree-free phasing and imputation is possible with AlphaImpute; see setting PedigreeFreePhasing and the Hidden Markov model.

[2] The family ID in PLINK 1.9 formatted files is ignored by AlphaImpute.

[3] The AlphaImpute executable accepts a single command line argument, the filename of the spec file. Without a command line argument, AlphaImpute expects the spec file to be named AlphaImputeSpec.txt and located in the current directory.

[4] AlphaImpute will create the file AlphaImputeSpecFileUsed.txt with the values it has used.


The spec file is a text file comprising two comma separated columns. Lines starting with = are ignored. The name of the setting is left of the comma and is case insensitive. The value to the right of the comma is case insensitive, with the exception of filenames.

For clarity, the settings in the spec file are split into 8 boxes, corresponding to different topics.

Running AlphaImpute

AlphaImpute is run from the command line with the simple command:

./AlphaImpute (Linux, Mac OSX)

or

AlphaImpute.exe (Windows).

Consult the specifics of your operating system for how to call executables from other locations.

Warning: AlphaImpute will delete the following directories before attempting to create them! It is your own responsibility to ensure these folders are backed up safely: GeneProb, InputFiles, IterateGeneProb, Miscellaneous, Phasing, Results

The listed directories are created by AlphaImpute to contain recoded data, intermediate data, and various output such as quality of phasing and imputation that can be used to gain insight into the data and the imputation.
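
Because of the warning above, it can be worth checking the working directory before launching a run. The sketch below lists which of the named directories already exist and builds the command line for the platform. The helper names `existing_output_dirs` and `build_command` are hypothetical; the only command line argument used is the spec filename, as described for the executable in this manual.

```python
import sys
from pathlib import Path
from typing import List

# Directories that AlphaImpute deletes and recreates, as listed in the warning.
OVERWRITTEN_DIRS = ["GeneProb", "InputFiles", "IterateGeneProb",
                    "Miscellaneous", "Phasing", "Results"]


def existing_output_dirs(workdir: Path) -> List[str]:
    """Return the listed directories that already exist in workdir."""
    return [d for d in OVERWRITTEN_DIRS if (workdir / d).is_dir()]


def build_command(spec_file: str = "", platform: str = sys.platform) -> List[str]:
    """Build the command; the spec filename is the single optional argument."""
    exe = "AlphaImpute.exe" if platform.startswith("win") else "./AlphaImpute"
    return [exe, spec_file] if spec_file else [exe]


at_risk = existing_output_dirs(Path.cwd())
if at_risk:
    print("Back up before running:", ", ".join(at_risk))
print("Command:", build_command("AlphaImputeSpec.txt"))
# To launch the run (assumes the executable is in the current directory):
#   import subprocess
#   subprocess.run(build_command("AlphaImputeSpec.txt"), check=True)
```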

Imputed genotypes are found in the Results directory

The directory ‘Results’ contains the main final imputed genotypes and phases. The main files of interest are:

• ImputeGenotypes.txt
• ImputeGenotypeProbabilities.txt
• ImputePhase.txt
• ImputePhaseProbabilities.txt

Filenames without Probabilities are ‘called’ genotypes and alleles. When an allele or genotype for an animal is imputed in unison across all cores, it is called; otherwise it is output as missing with integer 9.

Filenames with Probabilities [5] are genotype dosages and allele dosages. These contain the average of the imputed genotype or allele across all cores.

[5] The name ‘probabilities’ has been retained for legacy.


The differences between the files are shown in the example below, and a summary is given in Table 1.1.

Example: Phase vs. genotypes, and dosages vs. called.

ImputePhaseProbabilities.txt            ImputePhase.txt
1067 0.00 0.00 1.00 0.00 0.00 0.00      1067 0 0 1 0 0 0
1067 0.00 0.00 1.00 0.50 0.50 0.50      1067 0 0 1 9 9 9
1068 0.00 0.00 1.00 0.00 0.00 0.00      1068 0 0 1 0 0 0
1068 0.00 0.00 1.00 0.00 0.00 0.00      1068 0 0 1 0 0 0
1069 0.00 0.00 1.00 0.00 0.00 0.00      1069 0 0 1 0 0 0
1069 0.00 0.00 1.00 0.00 0.00 0.00      1069 0 0 1 0 0 0

ImputeGenotypeProbabilities.txt         ImputeGenotypes.txt
1067 0.00 0.00 2.00 0.50 0.50 0.50      1067 0 0 2 9 9 9
1068 0.00 0.00 2.00 0.00 0.00 0.00      1068 0 0 2 0 0 0
1069 0.00 0.00 2.00 0.00 0.00 0.00      1069 0 0 2 0 0 0

Table 1.1: Summary of primary output files. The number of rows is a function of the number of individuals (n).

         | No. rows | Allele / genotype dosages (Impute*Probabilities.txt) | Called alleles / genotypes (Impute*.txt)
Phase    | 2n       | 0.0 – 1.0                                            | 0, 1, 9
Genotype | n        | 0.0 – 2.0                                            | 0, 1, 2, 9
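
The relation between dosages and called values can be sketched as follows, assuming that each core contributes one imputed value for a position and that the dosage is the unweighted mean across cores. The function name `summarise_cores` is hypothetical, and the per-core values are toy data.

```python
from typing import List, Tuple

MISSING = 9  # integer written for positions that cannot be called


def summarise_cores(values_per_core: List[int]) -> Tuple[float, int]:
    """Return (dosage, called value) for one position across all cores."""
    dosage = sum(values_per_core) / len(values_per_core)
    # Called only when every core imputed the same value.
    unanimous = len(set(values_per_core)) == 1
    called = values_per_core[0] if unanimous else MISSING
    return dosage, called


agreeing = summarise_cores([1, 1, 1, 1])     # all cores impute allele 1
conflicting = summarise_cores([0, 1, 0, 1])  # cores disagree
print(agreeing, conflicting)
```

The conflicting case corresponds to the pattern shown for animal 1067 in the example: a dosage of 0.50 in the Probabilities file and 9 in the called file.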

Spec file example

= BOX 1: Input Files ==========================================================
PedigreeFile ,Pedigree.txt
GenotypeFile ,Genotypes.txt
TrueGenotypeFile ,TrueGenotypes.txt
SexChrom ,No
PlinkInputfile ,
= BOX 2: SNPs ==================================================================
NumberSnp ,1500
MultipleHDPanels ,0
NumberSnpxChip ,0,0
= BOX 3: Filtering =============================================================
InternalEdit ,No
EditingParameters ,0.0,0.0,0.0,AllSnpOut
= BOX 4: Phasing ===============================================================
HDAnimalsThreshold ,90.0
NumberPhasingRuns ,4
CoreAndTailLengths ,300,350,400,450
CoreLengths ,250,300,350,400
PedigreeFreePhasing ,No
GenotypeError ,0.0
LargeDatasets ,No
UserDefinedAlphaPhaseAnimalsFile ,None
PrePhasedFile ,None
= BOX 5: Imputation =========================================================
InternalIterations ,5
ConservativeHaplotypeLibraryUse ,No
ModelRecomb ,Yes
= BOX 6: Hidden Markov Model ================================================
HMMOption ,No
TemplateHaplotypes ,200
BurnInRounds ,5
Rounds ,20
Seed ,-123456789
ThresholdForMissingAlleles ,50.0
ThresholdImputed ,90.0
= BOX 7: Running options ====================================================
ParallelProcessors ,1
PreprocessDataOnly ,No
RestartOption ,0
= BOX 8: Output =============================================================
WellPhasedThreshold ,99.0
ResultFolderPath ,Results
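
For scripting around AlphaImpute, a spec file in this layout can be read with a few lines of Python. The sketch below follows the rules stated in this manual (lines starting with = are ignored, the setting name is left of the first comma and is case insensitive). It is an illustrative reader, not the parser used by AlphaImpute, and the name `parse_spec` is hypothetical.

```python
from typing import Dict, List


def parse_spec(text: str) -> Dict[str, List[str]]:
    """Parse spec file text into {lower-cased setting name: list of values}."""
    settings: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("="):
            continue  # blank lines and box separators carry no settings
        name, _, value = line.partition(",")
        # Values are kept as strings; filenames must keep their case.
        settings[name.strip().lower()] = [v.strip() for v in value.split(",")]
    return settings


example = """\
= BOX 2: SNPs ===
NumberSnp ,1500
= BOX 4: Phasing ===
NumberPhasingRuns ,4
CoreLengths ,250,300,350,400
PlinkInputfile ,
"""
spec = parse_spec(example)
print(spec["numbersnp"], spec["corelengths"])
```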


2 Quick reference

The spec file is a text file comprising two comma separated columns. Lines starting with = are ignored.

Placeholders are used in this manual to describe the accepted values of settings and are summarized in Table 2.1. These appear enclosed either in angle brackets (<, >) or parentheses. Any other text must be provided as-is. Arguments given in angle brackets denote the type of value, e.g. integers (<integer>) or filenames (<fn>). Strings given in parentheses are options and must be given as-is (e.g. (Yes/No)).

Table 2.1: Placeholder syntax used in manual.

Placeholder | Expected value
<fn> | A filename, absolute or relative to current working directory. Do not use quotation marks. NB! Filenames and paths are case sensitive in OS X and Linux!
<integer> | An integer value.
<real> | A real value (e.g. 0.8).
<RealPct> | A percentage, given without ‘%’ (e.g. 95.0 for 95%).
(Yes/No) | Literal strings, either Yes or No are accepted (case insensitive).
(a,b,c) | One of three literal strings expected.

See Table 2.2 for an overview of the available settings.


Table 2.2: All settings understood by AlphaImpute. Full explanation of these can be found in the full reference manual.

Setting | Description | Accepted values

Box 1: Input data
PedigreeFile, GenotypeFile, TrueGenotypeFile | Filename of input file, relative or absolute. No quotation marks. | Filename, max 300 characters.
SexChrom | Impute sex chromosomes. | No (default), or Yes,<fn>,(Male/Female)

Box 2: SNPs
NumberSnp | Number of SNP positions in input files. | <integer>
MultipleHDPanels | Multiple high-density SNP panels were used. | (Yes/No)
NumberSnpxChip | Required when MultipleHDPanels ,Yes; number of SNPs per high-density SNP panel. | <integer>,<integer>,...

Box 3: Filtering
InternalEdit | Removes SNPs that are missing in too many individuals, and re-evaluates individuals in high-density group. Note: Incompatible with MultipleHDPanels ,Yes. | (Yes/No)
EditingParameters | Required when InternalEdit ,Yes. | <RealPct>,<RealPct>,<RealPct>,(AllSnpOut/EditedSnpOut)

Box 4: Phasing
HDAnimalsThreshold | Threshold of non-missing SNPs for including an animal in the high-density group. | <RealPct>
NumberPhasingRuns | Number of core sizes given in CoreAndTailLengths and CoreLengths. | <integer>
NumberPhasingRuns | Re-use previous phasing. | phasedone,<fn>,<integer>
CoreAndTailLengths | Size of cores including tails. | <integer>,<integer>,...
CoreLengths | Size of core excluding tails. | <integer>,<integer>,...
PedigreeFreePhasing | Use pedigree information in long-range phasing. | (Yes/No)
GenotypeError | Threshold for allowed disagreement between cores of two surrogate parents during surrogate parent identification. | <RealPct>
LargeDatasets | Splits phasing of large datasets into multiple subsets. 1st integer controls number of subsets. 2nd integer controls maximum number of subsets each individual may appear in. 3rd argument controls how individuals are distributed among subsets. | Yes,<integer>,<integer>,(RandomOrder/Off/InputOrder)
AlphaPhaseOutput | Output control setting for AlphaPhase settings. | (No/Yes/Binary/Verbose)
UserDefinedAlphaPhaseAnimalsFile | Specify which animals to use in step 1. | <fn>
PrephasedFile | Provides pre-phased data to haplotype library construction. | <fn>
UseFerdosi | Use Ferdosi algorithm to phase individuals. | (Yes/No)

Box 5: Imputation
InternalIterations | Number of iterations of imputations. | <integer>
ConservativeHaplotypeLibraryUse | When Yes, haplotype library is not updated with new haplotypes during imputation. | (Yes/No)
ModelRecomb | Models recombination after last imputation. | (Yes/No)

Box 6: Hidden Markov model
HMMOption | Enable or disable use of the HMM. | (Yes/No)
HMMParameters | Single line for setting the following settings. | <TemplateHaplotypes>,<BurnInRounds>,<Rounds>,<ParallelProcessors>,<Seed>
TemplateHaplotypes | Number of possible gametes whose recombinations led to a given SNP genotype. | <integer>
BurnInRounds | Number of rounds to discard before sampling HMM. | <integer>
Rounds | Total number of rounds for Monte-Carlo sampling of HMM. | <integer>
Seed | Initial seed for random number generator. Must be negative. | <integer>
PhasedAnimalsThreshold | Threshold for proportion of phased animals for sampling haplotypes among both imputed and high-density genotyped animals. | <RealPct>
ThresholdImputed | When the proportion of an animal’s alleles that are phased is above this threshold, the haploid model is used. | <RealPct>
HaplotypesList | ???? |

Box 7: Run time options
PreprocessDataOnly | When Yes, AlphaImpute stops before phasing. | (Yes/No)
ParallelProcessors | Upper limit of how many parallel processes are used for phasing and HMM. | <integer>
RestartOption | Causes AlphaImpute to run all steps (0), only phasing (1), or only imputation (2). | (0,1,2)
Cluster | Advanced option. Not discussed here. |

Box 8: Output
ResultFolderPath | Specify a different subdirectory for final output files. | <path>
WellPhasedThreshold | Phases of individuals with phasing quality above this threshold are written to Results/WellPhasedIndividuals.txt | <Real>
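
Several phasing settings in Table 2.2 depend on each other: NumberPhasingRuns is the number of core sizes given in CoreAndTailLengths and CoreLengths. A small consistency check can catch mismatches before a run. This is a sketch with a hypothetical function name; the pairwise check that a core including tails is not shorter than the core itself follows from the definitions but is an assumption about how the two lists are paired.

```python
from typing import List


def check_phasing_runs(
    n_runs: int, core_and_tail: List[int], core: List[int]
) -> List[str]:
    """Return a list of problems found; an empty list means consistent."""
    problems = []
    if len(core_and_tail) != n_runs:
        problems.append(
            f"CoreAndTailLengths has {len(core_and_tail)} values, expected {n_runs}")
    if len(core) != n_runs:
        problems.append(f"CoreLengths has {len(core)} values, expected {n_runs}")
    # Assumption: the i-th entries of both lists describe the same phasing run.
    for i, (with_tail, without_tail) in enumerate(zip(core_and_tail, core), start=1):
        if with_tail < without_tail:
            problems.append(
                f"run {i}: core plus tail ({with_tail}) is shorter than core ({without_tail})")
    return problems


# Values taken from the spec file example in this manual.
ok = check_phasing_runs(4, [300, 350, 400, 450], [250, 300, 350, 400])
bad = check_phasing_runs(4, [300, 350, 400], [250, 300, 350, 400])
print(ok, bad)
```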


3 Full settings reference

This section describes the accepted file formats, settings and their acceptable values for the spec file, and the output files written by AlphaImpute.

Contents

Specification file, ‘spec file’ 19
    Box 1: Input data 20
    Box 2: SNPs 22
    Box 3: Filtering 23
    Box 4: Phasing 24
    Box 5: Imputation 29
    Box 6: Hidden Markov model 30
    Box 7: Runtime options 33
    Box 8: Output 34

Specification file, ‘spec file’

The settings for AlphaImpute are set with a single text file, referred to as a spec file. By default, AlphaImpute expects the spec file to be named AlphaImputeSpec.txt, but other filenames can be provided as the first command line argument when calling AlphaImpute.

The spec file is a text file comprising two comma separated columns. Lines starting with = are ignored. The name of the setting is left of the comma and is case insensitive. The value to the right of the comma is case insensitive, with the exception of filenames (dependent on the operating system).

Placeholders are used in this manual to describe the accepted values of settings and are summarized in Table 3.1. These appear enclosed either in angle brackets (<, >) or parentheses. Any other text must be provided as-is. Arguments given in angle brackets denote the type of value, e.g. integers (<integer>) or filenames (<fn>). Strings given in parentheses are options and must be given as-is (e.g. (Yes/No)).


Table 3.1: Placeholder syntax used in manual.

Placeholder | Expected value
<fn> | A filename, absolute or relative to current working directory. Do not use quotation marks. NB! Filenames and paths are case sensitive in OS X and Linux!
<integer> | An integer value.
<real> | A real value (e.g. 0.8).
<RealPct> | A percentage, given without ‘%’ (e.g. 95.0 for 95%).
(Yes/No) | Literal strings, either Yes or No are accepted (case insensitive).
(a,b,c) | One of three literal strings expected.

Box 1: Input data

Settings: PedigreeFile, GenotypeFile, TrueGenotypeFile, SexChrom, PlinkInputfile

AlphaImpute requires a single ID field to identify individuals. Both numeric and alphanumeric formatted IDs are accepted.

PedigreeFile ,<fn>

Default value: Pedigree.txt

The pedigree file requires three columns in the following order: the individual’s ID, the ID of the individual’s sire, and the ID of the individual’s dam. Missing IDs must be coded as 0.

Columns must be space or comma separated, and no header line may be included. Do not use quotes for filename.

The pedigree may include more individuals than provided in the input genotype file. These extra individuals will by default be imputed as well and written to output files, unless OutputOnlyGenotypedAnimals (p. 34) is explicitly set to No. The pedigree is not required to include all individuals provided in the input genotype file.

There is no need to sort the pedigree prior to using it in AlphaImpute. This is done internally, and the reordering is printed to file Miscellaneous/InternalDataRecoding.txt.
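
A pedigree file in this format can be checked before it is handed to AlphaImpute. The sketch below reads space or comma separated records, insists on exactly three columns, and lists individuals with both parents coded as 0. The function names `read_pedigree` and `founders` and the toy pedigree are hypothetical; the sketch does not reproduce AlphaImpute's internal sorting or recoding.

```python
import re
from typing import List, Tuple


def read_pedigree(text: str) -> List[Tuple[str, str, str]]:
    """Parse pedigree text into (individual, sire, dam) records."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Columns may be separated by spaces, tabs, or commas.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        records.append((fields[0], fields[1], fields[2]))
    return records


def founders(records: List[Tuple[str, str, str]]) -> List[str]:
    """Individuals whose sire and dam are both missing (coded as 0)."""
    return [ind for ind, sire, dam in records if sire == "0" and dam == "0"]


pedigree_text = "1067 0 0\n1068,0,0\n1069 1067 1068\n"
records = read_pedigree(pedigree_text)
print(records, founders(records))
```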


GenotypeFile ,<fn>

Default: Genotypes.txt

The genotype file must contain one line for each individual. The first column corresponds to the individual’s ID. The following columns correspond to each SNP position for the individual. The number of SNP columns must correspond to the number given in NumberSnp (p. 23), if the setting is specified.

SNPs must be encoded as integer values between 0 and 9. 0, 1, and 2 correspond to homozygous aa, heterozygous aA or Aa, and homozygous AA, respectively. Major and minor allele, A and a, are defined without reference to a reference genome, but must be consistent throughout the input data file. Missing values are coded with values 3 – 9.

Columns must be space or comma separated, and no header line may be included. Do not use quotes for filename.
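
The encoding rules above can be expressed as a small line parser, which may be useful for validating a genotype file before a run. This is a sketch under the rules stated in this section only; `parse_genotype_line` and `decode` are hypothetical helper names, and the optional `number_snp` argument stands in for the NumberSnp setting.

```python
import re
from typing import List, Optional, Tuple

GENOTYPE_LABELS = {0: "aa", 1: "aA", 2: "AA"}  # codes 3..9 mean missing


def parse_genotype_line(
    line: str, number_snp: Optional[int] = None
) -> Tuple[str, List[int]]:
    """Split one genotype line into (ID, list of SNP codes) and validate it."""
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    animal_id, codes = fields[0], [int(f) for f in fields[1:]]
    if any(code < 0 or code > 9 for code in codes):
        raise ValueError(f"{animal_id}: SNP codes must be integers between 0 and 9")
    if number_snp is not None and len(codes) != number_snp:
        raise ValueError(f"{animal_id}: expected {number_snp} SNPs, got {len(codes)}")
    return animal_id, codes


def decode(codes: List[int]) -> List[str]:
    """Translate codes to genotype labels, using 'missing' for 3..9."""
    return [GENOTYPE_LABELS.get(code, "missing") for code in codes]


animal, snps = parse_genotype_line("1067 0 1 2 9 5 0", number_snp=6)
print(animal, decode(snps))
```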

Special cases:

If the input data is a sex chromosome, see setting SexChrom (p. 22).

AlphaImpute by default assumes two SNP density panels: one high-density and one low-density. If multiple, not fully overlapping high-density panels are used, see setting MultipleHdPanels (p. 23). See PlinkInputfile (p. 22) for using PLINK 1.9 formatted files for pedigree and genotype input.

TrueGenotypeFile ,<fn>

Default: None

Filename of genotype file with true genotypes of all or a subset of individuals. When present, AlphaImpute calculates imputation accuracies and writes them to standard output. Does not work when using PLINK 1.9 formatted input files.

SexChrom ,No

Default variant. Input data is not assumed to be a sex chromosome.


SexChrom ,Yes,<fn>,(Male/Female)

Input data is a sex chromosome.

Requires an additional file (<fn>) listing which individuals are male or female as a two-column text file with ID in the first column and sex (1 = Male, 2 = Female) in the second. When the last argument is Male, males are assumed heterogametic, and females are assumed homogametic. When the last argument is Female, males are assumed homogametic, and females are assumed heterogametic.

PlinkInputfile ,binary,<fn>

Use PLINK 1.9 binary formatted input files (Purcell et al. 2007, Chang et al. (2015)).When used, AlphaImpute will iterate over each chromosome.

AlphaImpute reads genotype data from <fn>.bim and <fn>.bed, and pedigree from<fn>.ped.

PlinkInputfile ,text,<fn>

Use PLINK 1.9 text formatted input files (Purcell et al. 2007; Chang et al. 2015). When used, AlphaImpute will iterate over each chromosome.

AlphaImpute reads genotype data from <fn>.map and pedigree from <fn>.ped.

Do not intermix ordinary input formats (PedigreeFile and GenotypeFile) with PlinkInputfile. If both sets of settings are found in the spec file, the last one listed in the spec file takes precedence.

NB! AlphaImpute does not output the PLINK 1.9 file format.

NB! The family ID used in the PLINK 1.9 file format is ignored by AlphaImpute.

See https://www.cog-genomics.org/plink/1.9/input for more information on the PLINK 1.9 formats.

Box 2: SNPs

Settings: NumberSnp, MultipleHdPanels, NumberSnpxChip

NumberSnp ,<integer>

Default value: estimated from GenotypeFile (p. 21).

Number of SNPs in input files. If not given, number is automatically detected.

Special case: If several different high-density panels were used, refer to setting MultipleHdPanels (p. 23).

MultipleHdPanels ,<integer>

Default value: 0

Sets the number of different high-density SNP panels used for genotyping high-density genotyped animals. Requires setting NumberSnpxChip (p. 23) when larger than 1.

NumberSnpxChip ,<int>,<int>,...

Specifies the number of SNPs on the 1st, 2nd, etc. high-density SNP panel. Required when MultipleHdPanels (p. 23) is larger than 1.

Box 3: Filtering

Settings: InternalEdit, EditingParameters

Removes SNPs that are missing in too many individuals, and removes individuals with too many missing SNPs from the high-density group.

Note: Incompatible with MultipleHdPanels ,Yes (p. 23) .

InternalEdit ,(Yes/No)

Default value: No

When Yes, enables filtering of SNPs and individuals in the high-density group, and setting EditingParameters (p. 24) is required.

EditingParameters,<RealPct>,<RealPct>,<RealPct>,(AllSnpOut/EditedSnpOut)

Default value: 90.0,0.0,0.0,AllSnpOut

First argument is same as HDAnimalsThreshold (p. 24) that sets the threshold of non-missing SNPs for including an individual in the high-density group.

Second argument sets the threshold of missing SNPs among individuals in the high-density group. SNPs missing above this threshold are excluded from further analysis.

Third argument sets the threshold of non-missing SNPs for including an individual, as for the first argument, but applied after removing missing SNPs.

Fourth argument causes AlphaImpute to output either all inputted SNPs (AllSnpOut) or only those that remained after editing SNPs (EditedSnpOut).

Box 4: Phasing

Settings: HDAnimalsThreshold, NumberPhasingRuns, CoreAndTailLengths, CoreLengths, PedigreeFreePhasing, GenotypeError, LargeDatasets, PhasingOnly, UserDefinedAlphaPhaseAnimalsFile, PrePhasedFile

These settings affect which animals are regarded as high-density genotyped, and how they are phased. These settings are used during long-range phasing and haplotype library construction (step 1).

HDAnimalsThreshold ,<RealPct>

Default value: 90.0 (i.e. 90%)

Animals whose proportion of genotyped SNPs is above this threshold are regarded as high-density genotyped.

Note: This setting is also set by the first argument of EditingParameters (p. 24). HDAnimalsThreshold may be overridden by later occurrences of EditingParameters in the spec file.
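As an illustration of how the threshold partitions animals, the following sketch classifies a single animal from its genotype codes. It is not AlphaImpute's code: the function name `is_high_density` is hypothetical, and treating "above" as a strict comparison is an assumption.

```python
def is_high_density(genotypes, threshold_pct=90.0):
    """Return True if the share of genotyped SNPs exceeds the threshold.

    Hypothetical helper. Codes 0-2 count as genotyped, 3-9 as missing.
    The strict '>' is an assumption; the manual only says 'above'.
    """
    if not genotypes:
        return False
    genotyped = sum(1 for g in genotypes if g in (0, 1, 2))
    return 100.0 * genotyped / len(genotypes) > threshold_pct
```

With the default of 90.0, an animal with 19 of 20 SNPs genotyped (95%) would be placed in the high-density group under these assumptions.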

NumberPhasingRuns ,<integer>

Default value: 4.

Standard setup for AlphaImpute to perform phasing. Accepts integers between 2 and 40. This must correspond to the number of cores given in CoreAndTailLengths and CoreLengths (p. 25).

Figure 3.1: Core- and tail length (A) and position of adjacent cores of different sizes (B).

NumberPhasingRuns ,PhaseDone,<fn>,<integer>

AlphaImpute re-uses phasing from a previous session and will not perform additional phasing.

<fn> points to path (which output)?

<integer> is a whole integer between 2 and 40. This must correspond to the number of cores given in CoreAndTailLengths and CoreLengths (p. 25). NumberPhasingRuns must not exceed that used during phasing. CoreAndTailLengths and CoreLengths must be the same as used during phasing.

CoreAndTailLengths and CoreLengths

CoreAndTailLengths ,<integer>,<integer>,...
CoreLengths ,<integer>,<integer>,...

Default values: Estimated from number of SNPs.

Specifies the sizes of cores used for phasing, haplotype library construction, and imputation. Each positional argument of the two settings comprises a pair, where the CoreLengths value cannot be greater than the corresponding CoreAndTailLengths value. The sizes used cannot exceed the number of SNPs in the input data.

Accepts between 2 and 40 integers that must correspond to NumberPhasingRuns (p. 24).

The positioning of cores and their tails is displayed in Figure 3.1. The combined core and tails are used exclusively for inferring surrogate parents who passed on the haplotype. The cores are used for phasing, haplotype library construction, and imputation. An in-depth description of the cores can be found in Hickey et al. (2011).
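The constraints on these settings can be checked before starting a long run. The sketch below collects the rules stated in this section into one function; the name `check_core_settings` and the wording of the messages are hypothetical, and AlphaImpute itself may report such problems differently.

```python
def check_core_settings(number_phasing_runs, core_and_tail_lengths,
                        core_lengths, number_snp):
    """Return a list of problems found; an empty list means consistent.

    Hypothetical pre-flight check based on the rules in this manual.
    """
    problems = []
    if not 2 <= number_phasing_runs <= 40:
        problems.append("NumberPhasingRuns must be between 2 and 40")
    if len(core_and_tail_lengths) != number_phasing_runs:
        problems.append("CoreAndTailLengths count differs from NumberPhasingRuns")
    if len(core_lengths) != number_phasing_runs:
        problems.append("CoreLengths count differs from NumberPhasingRuns")
    pairs = zip(core_and_tail_lengths, core_lengths)
    for position, (total, core) in enumerate(pairs, start=1):
        if core > total:
            problems.append(f"pair {position}: core {core} exceeds core+tail {total}")
        if total > number_snp:
            problems.append(f"pair {position}: core+tail {total} exceeds NumberSnp")
    return problems
```

Called with the values of the small-chromosome example (4 runs, 1500 SNPs), the function returns an empty list.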

Example: Small chromosome, few cores

= BOX 3: SNPs ===============================================================
NumberSnp ,1500
= BOX 5: Phasing ============================================================
NumberPhasingRuns ,4
CoreAndTailLengths ,300,350,400,450
CoreLengths ,250,300,350,400

There is no requirement to have only few cores on a small chromosome.

Example: Large chromosome, many cores

= BOX 3: SNPs ===============================================================
NumberSnp ,15000
= BOX 5: Phasing ============================================================
NumberPhasingRuns ,10
CoreAndTailLengths ,200,300,400,500,600,250,325,410,290,1700
CoreLengths ,100,200,300,400,500,150,225,310,190,1000

There is no requirement that a large chromosome should be phased and imputed with a large number of cores.

Example: Re-using previous phasing

The user has run a phasing using the settings shown in Example 2 in another directory. Now the user imputes an updated dataset, but reuses the phasing from Example 2.

= BOX 3: SNPs ==============================================================
NumberSnp ,15000
= BOX 5: Phasing ============================================================
NumberPhasingRuns ,PhaseDone,../Example 2/Phasing,10
CoreAndTailLengths ,200,300,400,500,600,250,325,410,290,1700
CoreLengths ,100,200,300,400,500,150,225,310,190,1000

PedigreeFreePhasing ,(Yes/No)

Default value: No

When Yes, long-range phasing is performed without using pedigree information. In some cases this may be quicker and more accurate if pedigree information is unreliable.

GenotypeError ,<RealPct>

Default value: 0.0

Threshold for allowed disagreement between cores of two surrogate parents during surrogate parent identification. Use values between 0.0 and 100.0.

A value of 1.00 (i.e. 1%) means that across a CoreAndTailLengths of 300 SNPs, 3 of these SNPs are allowed to be missing or in disagreement between two otherwise compatible surrogate parents. Thus these two individuals are allowed to be surrogate parents of each other in spite of the fact that 1% of their genotypes are missing or are in conflict (i.e. opposing homozygotes). Small values are better (e.g. <1.0%).
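The arithmetic in the example above can be reproduced for other core sizes. This is a sketch only: the function name `allowed_mismatches` is hypothetical, and rounding down to a whole number of SNPs is an assumption, as the manual does not state how fractional counts are handled.

```python
import math


def allowed_mismatches(core_and_tail_length, genotype_error_pct):
    """Number of SNPs that may be missing or conflicting across one core+tail.

    Hypothetical helper; rounding down is an assumption of this sketch.
    """
    return math.floor(core_and_tail_length * genotype_error_pct / 100.0)
```

`allowed_mismatches(300, 1.0)` gives the 3 SNPs of the worked example.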

LargeDatasets ,No

Default variant.

Phasing is performed on all individuals assigned to the high-density group, cf. HDAnimalsThreshold (p. 24) and EditingParameters (p. 24).

LargeDatasets ,Yes,<int>,<int>,(RandomOrder/Off/InputOrder)

Splits phasing of large datasets into multiple subsets. First argument controls the number of subsets. Second argument controls the maximum number of subsets each animal may appear in. Third argument controls how individuals are distributed among subsets.

When third argument is RandomOrder, the subsets are populated by sampling animals.

When third argument is Off, each animal is phased independently, effectively the same as setting the number of subsets to the number of animals.

When third argument is InputOrder, animals are divided into subsets based on the order they are read in and they will only be used in a single subset.

UserDefinedAlphaPhaseAnimalsFile ,<fn>

Default value: None

Specifies which animals are regarded as high-density genotyped; overrides settings HDAnimalsThreshold (p. 24) and EditingParameters (p. 24).

<fn> points to file with single column of IDs. Specify None to ignore.

AlphaPhaseOutput ,(No/Yes/Binary/Verbose)

Default value: No

When No, no files are written out.

When Yes, haplotype libraries are written to the Phasing directory in user readable form. This enables re-use of phasing and haplotype library construction.

When Binary, haplotype libraries are written to the Phasing directory in binary form. This enables re-use of phasing and haplotype library construction.

When Verbose, haplotype libraries are written to the Phasing directory in user readable form, together with debugging statements. This enables re-use of phasing and haplotype library construction and allows for debugging phasing errors.

PrephasedFile ,<fn>

Default value: None

Provides pre-phased data to haplotype library construction. Specify None to ignore. Phasing is still performed, but phases of animals given in this file are overwritten.

The expected file format is the same as the general AlphaImpute genotype file format (p. 21), but with two lines per animal. The first line corresponds to the sequence of alleles on the paternal gamete, the second line to the sequence of alleles on the maternal gamete. The first column is the animal’s ID, followed by a column for each SNP position.

Phased alleles are encoded as integer values 0 or 1, corresponding to minor allele a or major allele A, respectively. Missing values are encoded as values between 3 and 9.
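To make the relation between the two-line phase format and the genotype format concrete, the sketch below combines a paternal and a maternal line into genotype codes. The function name `phase_pair_to_genotype` is hypothetical, and writing 9 for any position with a missing allele is this example's choice.

```python
def phase_pair_to_genotype(paternal, maternal):
    """Combine two phased allele sequences (0/1, 3-9 missing) into genotypes.

    Hypothetical helper: the genotype is the count of allele A (0, 1 or 2),
    or 9 when either allele is missing.
    """
    if len(paternal) != len(maternal):
        raise ValueError("paternal and maternal lines differ in length")
    genotype = []
    for p, m in zip(paternal, maternal):
        if p in (0, 1) and m in (0, 1):
            genotype.append(p + m)
        else:
            genotype.append(9)
    return genotype
```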

UseFerdosi ,(Yes/No)

Default value: No

Uses the algorithm of Ferdosi et al. (2014) to derive the phase of a sire that has many progeny genotyped at high-density. Using the positions and linkage between positions that are opposing homozygotes in the offspring, the algorithm defines the positions at which the sire is heterozygous and phases them. As a by-product of this process, the possible locations of recombination points in the offspring and the phase of the offspring are derived.

Requires at least 5-10 high-density genotyped progeny per sire to be effective.

Box 5: Imputation

Settings: InternalIterations, ConservativeHaplotypeLibraryUse, ModelRecomb

Imputation is performed after phasing in step 1 using a heuristic method described in Hickey et al. (2012). The phasing step and imputation step can be run as two separate processes, but not in parallel, as the imputation step requires a completed phasing. This is controlled with the setting RestartOption (p. 33). If the phasing is computationally expensive, the phasing information can be reused when the low-density genotyped animals have been updated with new animals.

An alternative to the default heuristic imputation method is the probabilistic method based on a hidden Markov model (Li et al. 2010; Antolín et al. 2017). It has an advantage over the heuristic method when pedigree information is inconsistent, but is more computationally expensive. The heuristic method and probabilistic method can be used either exclusively or in combination. The settings for the probabilistic method are described in Box 6 (p. 31).

InternalIterations ,<integer>

Default value: 5

The imputation is iterated several times, so that alleles that could not be imputed might be imputed with a haplotype that was found in a subsequently imputed individual. Using numbers as low as 1 or 2 is possible and AlphaImpute will finish quicker, but at the expense of lower imputation accuracies and yields. Higher numbers will prolong AlphaImpute’s running time with marginal increases in imputation accuracies and yields in return.

ConservativeHaplotypeLibraryUse ,(Yes/No)

Default value: No

When No, new haplotypes inferred during imputation are added to the haplotype library and can be used to impute other genotypes.

When Yes, new haplotypes inferred during imputation are not used to impute other genotypes.

ModelRecomb ,(Yes/No)

Default value: Yes

When Yes, AlphaImpute performs an additional correction after the last imputation iteration and writes imputed genotypes to Results/ModelRecomb.txt. During this correction, recombination events are modelled, and alleles that are affected by recombination events are masked as missing. This correction can increase imputation accuracy, but at the expense of yield.

Output files: Results/ModelRecomb.txt, Results/RecombinationInformationNarrowR.txt, Results/RecombinationInformationNarrow.txt, Results/RecombinationInformationR.txt, Results/RecombinationInformation.txt

Box 6: Hidden Markov model

Settings: HMMOption, HMMParameters, TemplateHaplotypes, BurnInRounds, Rounds, Seed, PhasedAnimalsThreshold, WellImputedThreshold, HaplotypesList

An alternative to the default heuristic imputation method is the probabilistic method based on a hidden Markov model (Li et al. 2010; Antolín et al. 2017). It has an advantage over the heuristic method when pedigree information is inconsistent or unreliable, but is more computationally expensive. The heuristic method and HMM method can be used either exclusively or in combination, see HMMOption (p. 31). Full details of the algorithm and implementation are given in Antolín et al. (2017).

The HMM does not use the pedigree. If imputation is performed in conjunction with long-range phasing (step 1) and the heuristic imputation algorithm, a pedigree might be required for those computations.

The HMM works by populating a pool of template haplotypes based on the observed genotypes or phases. These template haplotypes correspond to those in the entire population that could lead to the observed genotypes. For each animal, the most likely pair of haplotypes is found, taking into account genotype errors and recombination events. The genotype errors and recombination events are model hyperparameters that are estimated within the model. If an animal is phased, a haploid model is used to find each parental haplotype independently. If not, a diploid model is used to find the pair of parental haplotypes (see WellImputedThreshold).

To reduce computational time, only a subset of all possible haplotypes is sampled for the template haplotypes (see TemplateHaplotypes). If the entire population is well-phased (see PhasedAnimalsThreshold), all animals can be used to sample haplotypes, with the added benefit of using the faster haploid model. If not, only high-density genotyped animals are used to sample haplotypes, using only their observed genotypes.

AlphaImpute utilizes a Monte-Carlo procedure that successively samples haplotypes, estimates the hyperparameters, and finds the most likely pairs of haplotypes. The final estimate is found by summarizing over a number of rounds (Rounds), after discarding a number of initial ‘burn-in rounds’ (BurnInRounds).

When using HMM for imputation, AlphaImpute also outputs ImputePhaseHMM.txt and ImputeGenotypesHMM.txt with the HMM imputed phases and genotypes, and updates ImputePhaseProbabilities.txt and ImputeGenotypeProbabilities.txt with HMM derived allele dosages and genotype dosages.

HMMOption ,(No/Only/Also/Prephase/NGS)

Default value: No

When Only, AlphaImpute does not perform long-range phasing and haplotype library construction (step 1), and only performs imputation with the HMM. This option is useful when phasing information is not available or when imputation is required in unrelated populations (Marchini & Howie 2010).

When Also (previously Yes), AlphaImpute performs as standard, and performs the HMM imputation as an additional step after the last heuristic imputation.

When Prephase ? See source code; this string is compared to mixed-case after*turning to lower-case!*

When NGS ?

HMMParameters ,<int>,<int>,<int>,<int>,<int>

Short-hand for setting all HMM parameters, in the order TemplateHaplotypes, BurnInRounds, Rounds, ParallelProcessors, Seed. See the respective settings for an explanation of these.

TemplateHaplotypes ,<int>

Default value: 0

Sets the number of template haplotypes that the HMM samples. Larger numbers can improve imputation accuracy, but at a cost of computation. Computational time is quadratic in the number of template haplotypes, i.e. O(n^2). Can be combined with an increased number of processors (ParallelProcessors) to counter the increased computational time.

BurnInRounds ,<int>

Default value: 0

See setting Rounds (p. 32).

Rounds ,<int>

Default value: 0

Sets the total number of rounds that the HMM is computed. For each round, the template haplotypes are re-sampled from eligible animals and the model parameters updated, to produce new estimates of imputed phases. The final result is summarized across all rounds, after discarding the estimates from the initial rounds, cf. BurnInRounds (p. 32).

A value exceeding 50 rarely improves imputation accuracy.
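The interplay of Rounds and BurnInRounds can be pictured as follows. This is a schematic illustration only: the function name `summarize_rounds` is hypothetical, and using a plain average as the summary is an assumption, since the manual does not specify how AlphaImpute summarizes across rounds.

```python
def summarize_rounds(per_round_estimates, burn_in_rounds):
    """Average per-SNP estimates over the rounds kept after burn-in.

    Hypothetical sketch: per_round_estimates holds one list of per-SNP
    values for each round; the first burn_in_rounds lists are discarded.
    """
    kept = per_round_estimates[burn_in_rounds:]
    if not kept:
        raise ValueError("BurnInRounds must be smaller than Rounds")
    n_snps = len(kept[0])
    return [sum(r[i] for r in kept) / len(kept) for i in range(n_snps)]
```

With four rounds and a burn-in of two, only the last two rounds contribute to the summary.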

Seed ,<int>

Default value: 0

Value to seed random number generation. Must be negative.

PhasedAnimalsThreshold ,<RealPct>

Default value: 0.0

The threshold is compared to the overall proportion of phased alleles after long-range phasing (step 1) and heuristic imputation; if this threshold is met, both high-density and imputed animals can be sampled for the template haplotypes. If the threshold is not met, only high-density genotyped animals are sampled for the template haplotypes.

ThresholdImputed ,<RealPct>

Default value: 0.0

The threshold is compared to the proportion of imputed alleles of each animal; if the threshold is met, the animal is imputed with the faster, haploid HMM. If the threshold is not met, the animal is imputed with the diploid HMM.

The haploid HMM estimates each parental haplotype independently when the animal is well-phased. If the animal is not well-phased, a diploid HMM is used where the parental haplotypes are estimated in tandem. Both models utilize the same pool of template haplotypes.

HaplotypesList ,<fn>

Default value: None

Box 7: Runtime options

Settings: ParallelProcessors, PreprocessDataOnly, RestartOption

ParallelProcessors ,<integer>

Default value: 8

Allows for parallelization during phasing and imputation. More processors reduce computational time. Should be set when run on a grid (e.g. SGE GridEngine). However, ParallelProcessors should not be larger than the number of processors available, because this might lead to inefficient performance due to increased context switches between threads.

PreprocessDataOnly ,(Yes/No)

Default value: No

When Yes, AlphaImpute stops after preprocessing input data. Use this option to check the data and examine e.g. the grouping of high-density and low-density animals.

RestartOption ,(0/1/2)

Default value: 0

Sets AlphaImpute to run all, or first or second step only.

When 0 (default), AlphaImpute will run through preprocessing, phasing, and imputation.

When 1, AlphaImpute only runs through preprocessing and phasing.

When 2, AlphaImpute only runs through imputation. This requires phasing to have been run, and the option allows the user to impute an updated low-density set without re-running phasing.

To only run preprocessing, set PreprocessDataOnly ,Yes.

Box 8: Output

Settings: WellPhasedThreshold, ResultFolderPath, OutputOnlyGenotypedAnimals

These settings modify the extent of files written to disk.

WellPhasedThreshold ,<RealPct>

Default value: 99.0

Individuals with an imputation quality above this threshold have their imputed phases written to Results/WellPhasedIndividuals.txt. The imputation quality is defined as the proportion of non-missing alleles on both phases.
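The quality measure defined above can be computed from the two phase rows of an individual. The sketch below is an illustration under stated assumptions: the function name `phase_quality_pct` is hypothetical, and expressing the proportion as a percentage (to match the default of 99.0) is an assumption.

```python
def phase_quality_pct(paternal, maternal):
    """Percentage of non-missing alleles (0 or 1) across both phases.

    Hypothetical helper; any other code (e.g. 9) counts as missing.
    """
    alleles = list(paternal) + list(maternal)
    if not alleles:
        return 0.0
    known = sum(1 for a in alleles if a in (0, 1))
    return 100.0 * known / len(alleles)
```

An individual with one missing allele out of eight scores 87.5 and would fall below the default threshold.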

ResultFolderPath ,<path>

Default: Results

Specify a different directory instead of ‘Results’. See section Results p. 7.

OutputOnlyGenotypedAnimals ,(Yes/No)

Default: No

When Yes, additional animals found in the pedigree file (p. 20) are not written to output files.

4 Frequently asked questions

Contents

Input
  My genotypes are in multiple files
Error messages
  OMP: Error #34

Input

My genotypes are in multiple files

The R-package Siccuracy (https://github.com/stefanedwards/Siccuracy) has been developed for working with the genotype files for AlphaImpute.

Error messages

OMP: Error #34

I am receiving the error

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
forrtl: error (76): Abort trap signal

Answer:

Use the setting ParallelProcessors ,1. Use values larger than 1 if you have more cores to use.

5 Example 1: AlphaImpute tutorial

This chapter will guide you through using AlphaImpute and reading output.

To reiterate from the introduction, AlphaImpute both phases and imputes alleles, which are then combined to impute missing genotypes. This occurs in two steps:

1. Long-range phasing and haplotype library construction of animals genotyped with high-density SNP panels.
2. Imputation of missing genotypes by matching haplotypes.

Files for this example are available in the zip file containing the file AlphaImpute, in the folder examples/01.

Contents

Before you begin
Run AlphaImpute with the simple command
AlphaImpute’s output to stdout
  Output for step 1
  Output for step 2
Imputed genotypes are found in the Results directory
Summary

Before you begin

Unzip the zip file containing the AlphaImpute executable. There should be a sub-directory named examples. For this example, we will be using examples/01.

Note for Windows users: Any forward slash (/) used in filename and directory paths should be replaced with Windows’ filepath separator, backslash (\).

Copy the AlphaImpute executable into the examples/01 directory. This step is optional; if you skip it, prepend any call to AlphaImpute with ../../.

The spec file is named AlphaImputeSpec.txt; AlphaImpute will expect this filename if not given any other. The file’s contents should be:

= BOX 1: Input Files ==========================================================
PedigreeFile ,Pedigree.txt
GenotypeFile ,Genotypes.txt
= BOX 4: Phasing ================================================================
NumberPhasingRuns ,4
CoreAndTailLengths ,300,350,400,450
CoreLengths ,250,300,350,400
= BOX 5: Imputation =========================================================
InternalIterations ,1

We have specified a single round in InternalIterations to speed up the process for the example. For proper research, use more rounds.
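When preparing many runs, it can help to read a spec file back into a script, for instance to confirm that the intended values are set. The following is a minimal sketch, not AlphaImpute's own parser: the function name `parse_spec` is hypothetical, and it assumes that lines starting with `=` are box headers to be skipped, that setting names are case-insensitive, and that a later occurrence of a setting replaces an earlier one.

```python
def parse_spec(text):
    """Parse spec file text into {setting_name_lowercase: [arguments]}.

    Hypothetical reader built on the layout shown in this manual:
    'Name ,arg1,arg2,...' with '= BOX ...' header lines in between.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue  # blank line or box header (assumed to carry no setting)
        name, _, rest = line.partition(",")
        values = [v.strip() for v in rest.split(",")] if rest else []
        settings[name.strip().lower()] = values
    return settings
```

Applied to the spec file above, `parse_spec(text)["corelengths"]` would hold the four core lengths as strings.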

Run AlphaImpute with the simple command

./AlphaImpute (Linux, Mac OSX)

or

AlphaImpute.exe (Windows).

AlphaImpute’s output to stdout

You should expect to see the following output.

************************ ** AlphaImpute ** ************************

Software For Phasing and Imputing Genotypes

Commit: v1.9.8
Compiled: Nov 16 2017, 10:46:24

is not valid for the AlphaImpute Spec File.

NOTE: Number of Genotyped animals: 1100
0 errors in the pedigree due to Mendelian inconsistencies
0 snps changed across individuals

1000 indiviudals passed to AlphaPhase
1500 snp remain after editing

Data editing completed

Running AlphaPhase
Finished Running AlphaPhase

Phasing Completed

Imputation of base animals completed

Performing imputation loop 1

Parent of origin assigmnent of high density haplotypes completed

Imputation from high-density parents completed at: 151738.936

Haplotype library imputation completed at: 151739.888

Internal imputation from parents haplotype completed at:151741.444

Internal imputation from own haplotype completed at:151747.172

Internal haplotype library imputation completed at:151747.608

Imputation by detection of recombination events completed

Genotype Yield 0.9978533

************************ ** AlphaImpute ** ************************

Software For Phasing and Imputing Genotypes

No Liability

Analysis Finished

Time Elapsed Hours 0 Minutes 5 Seconds 57.75

Output for step 1

For step 1 we see the following output (numbers to the right are inserted for this example):

NOTE: Number of Genotyped animals: 1100 (1)
0 errors in the pedigree due to Mendelian inconsistencies (2)
0 snps changed across individuals

1000 indiviudals passed to AlphaPhase (3)
1500 snp remain after editing (4)

Data editing completed

Running AlphaPhase (5)
Finished Running AlphaPhase

Phasing Completed

It provides us with some useful information:

1. Number of rows read by AlphaImpute in the genotype file.
2. AlphaImpute will check that parents and offspring have ‘compatible’ genotypes. I.e. if both parents are homozygous for the same allele at a SNP position, the offspring should be too. If any inconsistencies were found, the SNP is masked as missing, and reported in Miscellaneous/snpMistakes.txt and Miscellaneous/PedigreeMistakes.txt.
3. Number of animals that are genotyped above HDAnimalsThreshold and thus usable for long-range phasing and haplotype library construction. Adjust the setting HDAnimalsThreshold if the reported number is too low.
4. Result of using InternalEdit and EditingParameters.
5. Signals that AlphaImpute has completed reading the data and is ready to perform long-range phasing and haplotype library construction (see footnote 2). AlphaImpute can be stopped at this point by specifying PreprocessDataOnly ,Yes in order to fine-tune the settings of boxes 1-4.

Output for step 2

Imputation of base animals completed

Performing imputation loop 1

Parent of origin assigmnent of high density haplotypes completed

Imputation from high-density parents completed at: 151738.936

Haplotype library imputation completed at: 151739.888

Internal imputation from parents haplotype completed at:151741.444

Internal imputation from own haplotype completed at:151747.172

Internal haplotype library imputation completed at:151747.608

Imputation.

This will loop a number of times as specified by InternalIterations. The odd-looking numbers given at the end of each line are timestamps in the form HHMMSS.ms.
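These timestamps can be converted to ordinary clock times, for example to measure how long each imputation stage took. The sketch below assumes the HHMMSS.ms layout described above; the function names are hypothetical, and padding the leading part to six digits is an assumption for times earlier than 10:00.

```python
from datetime import time


def parse_stamp(stamp):
    """Convert an 'HHMMSS.ms' string such as '151738.936' to datetime.time."""
    whole, _, frac = stamp.partition(".")
    whole = whole.zfill(6)  # assumption: leading zeros may have been dropped
    micro = int((frac + "000000")[:6]) if frac else 0
    return time(int(whole[0:2]), int(whole[2:4]), int(whole[4:6]), micro)


def seconds_of_day(stamp):
    """Seconds since midnight for an 'HHMMSS.ms' string."""
    t = parse_stamp(stamp)
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6
```

The difference `seconds_of_day("151739.888") - seconds_of_day("151738.936")` is then the time, in seconds, between the first two stamped lines of the output above.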

Imputation by detection of recombination events completed

Footnote 2: This was formerly performed by an external executable.

When using ModelRecomb ,Yes (default), recombination events are modelled after the last imputation loop, and this message is printed. Alleles that are affected by recombination events are masked as missing. This correction can increase imputation accuracy, but at the expense of yield.

The updated imputation is written to Results/ModelRecomb.txt with recombination information written to Results/Recombination*.txt.

Genotype Yield 0.9978533

This is the proportion of non-missing genotypes after imputation. It can be used as a benchmark, but it is calculated across all animals, including high-density genotyped animals.
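The same quantity can be recomputed from a set of called genotypes, for instance restricted to low-density animals only. This is a sketch of the definition given above, not AlphaImpute's code; the function name `genotype_yield` is hypothetical and codes other than 0, 1 and 2 are treated as missing.

```python
def genotype_yield(genotype_rows):
    """Proportion of non-missing genotypes (codes 0-2) across all rows.

    Hypothetical helper; each row is a list of integer genotype codes.
    """
    total = 0
    known = 0
    for row in genotype_rows:
        for g in row:
            total += 1
            if g in (0, 1, 2):
                known += 1
    return known / total if total else 0.0
```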

After the imputation, a boilerplate similar to the one we saw in the beginning is output. After that,

Time Elapsed Hours 0 Minutes 5 Seconds 57.75

is written. This line is also written to Miscellaneous/Timer.txt. If AlphaImpute ended successfully, this file will be present and updated.

Imputed genotypes are found in the Results directory

The directory ‘Results’ contains the main final imputed genotypes and phases. The main files of interest are:

• ImputeGenotypes.txt
• ImputeGenotypeProbabilities.txt
• ImputePhase.txt
• ImputePhaseProbabilities.txt


Filenames without Probabilities are ‘called’ genotypes and alleles. When an allele or genotype for an animal is imputed in unison across all cores, it is called; otherwise it is output as missing with the integer 9.

Note: Although some files are named ‘Probabilities’³, they contain allele dosages. These are averages of the imputed allele across all cores. They cannot be regarded as probabilities, as they do not necessarily sum to 1.

The differences between the files are shown in the example below, and a summary is given in Table 1.1.

Example: Phase vs. genotypes, and dosages vs. called.

ImputePhaseProbabilities.txt            ImputePhase.txt
1067 0.00 0.00 1.00 0.00 0.00 0.00      1067 0 0 1 0 0 0
1067 0.00 0.00 1.00 0.50 0.50 0.50      1067 0 0 1 9 9 9
1068 0.00 0.00 1.00 0.00 0.00 0.00      1068 0 0 1 0 0 0
1068 0.00 0.00 1.00 0.00 0.00 0.00      1068 0 0 1 0 0 0
1069 0.00 0.00 1.00 0.00 0.00 0.00      1069 0 0 1 0 0 0
1069 0.00 0.00 1.00 0.00 0.00 0.00      1069 0 0 1 0 0 0

ImputeGenotypeProbabilities.txt         ImputeGenotypes.txt
1067 0.00 0.00 2.00 0.50 0.50 0.50      1067 0 0 2 9 9 9
1068 0.00 0.00 2.00 0.00 0.00 0.00      1068 0 0 2 0 0 0
1069 0.00 0.00 2.00 0.00 0.00 0.00      1069 0 0 2 0 0 0

The four files highlighted here are ImputePhase.txt, ImputePhaseProbabilities.txt, ImputeGenotypeProbabilities.txt, and ImputeGenotypes.txt. They share the genotype input format, i.e. the first column is the ID of the individual, followed by one column per SNP position. ImputePhase*.txt files have two rows per individual: the first row holds the imputed sequence of alleles inherited on the paternal gamete, and the second row the alleles inherited on the maternal gamete.
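The two-rows-per-individual layout can be read as follows. This is a sketch only: the function names are hypothetical, and the rule that a called genotype equals the sum of the paternal and maternal alleles (with 9 whenever either allele is missing) is inferred from the example above rather than taken from the AlphaImpute source.

```python
def read_phase(lines):
    """Map each ID to a (paternal, maternal) pair of allele lists."""
    rows = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        animal = fields[0]
        alleles = [int(value) for value in fields[1:]]
        rows.setdefault(animal, []).append(alleles)
    for animal, gametes in rows.items():
        if len(gametes) != 2:
            raise ValueError(f"expected two rows for {animal}, got {len(gametes)}")
    # First row is the paternal gamete, second row the maternal gamete.
    return {animal: (gametes[0], gametes[1]) for animal, gametes in rows.items()}


def to_genotypes(paternal, maternal, missing=9):
    """Sum the two alleles per SNP; missing if either allele is missing."""
    return [missing if missing in (p, m) else p + m
            for p, m in zip(paternal, maternal)]


phase_rows = [
    "1067 0 0 1 0 0 0",
    "1067 0 0 1 9 9 9",
    "1068 0 0 1 0 0 0",
    "1068 0 0 1 0 0 0",
]
phase = read_phase(phase_rows)
print(to_genotypes(*phase["1067"]))  # [0, 0, 2, 9, 9, 9]
print(to_genotypes(*phase["1068"]))  # [0, 0, 2, 0, 0, 0]
```

The printed rows agree with the ImputeGenotypes.txt rows for animals 1067 and 1068 in the example.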

Files ImputePhaseProbabilities.txt and ImputeGenotypeProbabilities.txt contain allele and genotype dosages. These are the average allele for each SNP position across all cores. They are never missing, as the worst imputed value approximates the allele frequency among the known genotypes.

Files ImputePhase.txt and ImputeGenotypes.txt contain ‘called’ alleles and genotypes, where the imputed allele was in agreement across all cores. If no agreement was found, 9 is used as a missing value.
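The relation between a dosage and a called allele can be illustrated for a single SNP on a single gamete. The per-core values below are made up, and `summarise_cores` is a hypothetical helper; how AlphaImpute treats cores that yield no allele at all is not covered by this sketch.

```python
def summarise_cores(alleles_per_core, missing=9):
    """Return (dosage, called allele) for one SNP on one gamete.

    `alleles_per_core` holds the allele (0 or 1) imputed by each core
    that overlaps the SNP.
    """
    dosage = sum(alleles_per_core) / len(alleles_per_core)
    # The allele is called only if every core imputed the same value.
    agreed = len(set(alleles_per_core)) == 1
    called = alleles_per_core[0] if agreed else missing
    return dosage, called


print(summarise_cores([1, 1, 1, 1]))  # (1.0, 1): all cores agree
print(summarise_cores([0, 1, 0, 1]))  # (0.5, 9): cores disagree
```

The second case mirrors the 0.50 dosages paired with 9 in the example files above.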

³ The name ‘probabilities’ has been retained for legacy reasons.


Summary

With just three files and the AlphaImpute executable, we have demonstrated how to impute missing genotypes and how to read the output data.

Other settings were not demonstrated here, but they allow greater control over, e.g., which individuals are considered high-density genotyped, imputing between multiple high-density and low-density genotype panels, or reusing previous phasing results when these were computationally expensive to obtain.


Appendix I: Bibtex for AlphaImpute

Bibtex format for required citations:

@article{antolin_hybrid_2017,
  title = {A hybrid method for the imputation of genomic data in livestock populations},
  volume = {49},
  issn = {1297-9686},
  url = {http://gsejournal.biomedcentral.com/articles/10.1186/s12711-017-0300-y},
  doi = {10.1186/s12711-017-0300-y},
  language = {en},
  number = {1},
  urldate = {2017-03-13},
  journal = {Genetics Selection Evolution},
  author = {Antol\'{\i}n, Roberto and Nettelblad, Carl and Gorjanc, Gregor and Money, Daniel and Hickey, John M.},
  month = dec,
  year = {2017}
}

@article{hickey_imputation_2012,
  title = {A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation},
  volume = {44},
  issn = {1297-9686},
  url = {http://dx.doi.org/10.1186/1297-9686-44-9},
  doi = {10.1186/1297-9686-44-9},
  number = {9},
  urldate = {2016-06-11},
  journal = {Genetics Selection Evolution},
  author = {Hickey, John M. and Kinghorn, Brian P. and Tier, Bruce and van der Werf, Julius HJ and Cleveland, Matthew A.},
  pages = {11},
  year = {2012}
}


Bibliography

Antolín, R., Nettelblad, C., Gorjanc, G., Money, D. & Hickey, J. M. (2017), ‘A hybrid method for the imputation of genomic data in livestock populations’, Genetics Selection Evolution 49(1). bibtex: antolin hybrid 2017.
URL: http://gsejournal.biomedcentral.com/articles/10.1186/s12711-017-0300-y

Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M. & Lee, J. J. (2015), ‘Second-generation PLINK: rising to the challenge of larger and richer datasets’, GigaScience 4(1).
URL: https://academic.oup.com/gigascience/article-lookup/doi/10.1186/s13742-015-0047-8

Ferdosi, M. H., Kinghorn, B. P., van der Werf, J. H. J. & Gondro, C. (2014), ‘Detection of recombination events, haplotype reconstruction and imputation of sires using half-sib SNP genotypes’, Genetics Selection Evolution 46(1), 11.
URL: http://www.gsejournal.org/content/46/1/11

Ferdosi, M. H., Kinghorn, B. P., van der Werf, J. H., Lee, S. & Gondro, C. (2014), ‘hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups’, BMC Bioinformatics 15(1), 172.
URL: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-172

Hickey, J. M., Kinghorn, B. P., Tier, B., van der Werf, J. H. & Cleveland, M. A. (2012), ‘A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation’, Genetics Selection Evolution 44(9), 11. bibtex: hickey phasing 2012.
URL: http://dx.doi.org/10.1186/1297-9686-44-9

Hickey, J. M., Kinghorn, B. P., Tier, B., Wilson, J. F., Dunstan, N. & Werf, J. H. v. d. (2011), ‘A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes’, Genetics Selection Evolution 43(1), 12. bibtex: hickey combined 2011.
URL: http://www.gsejournal.org/content/43/1/12/abstract

Kerr, R. J. & Kinghorn, B. P. (1996), ‘An efficient algorithm for segregation analysis in large populations’, Journal of Animal Breeding and Genetics 113(1-6), 457–469. bibtex: kerr efficient 1996.
URL: http://onlinelibrary.wiley.com/doi/10.1111/j.1439-0388.1996.tb00636.x/abstract


Kong, A., Masson, G., Frigge, M. L., Gylfason, A., Zusmanovich, P., Thorleifsson, G., Olason, P. I., Ingason, A., Steinberg, S., Rafnar, T., Sulem, P., Mouy, M., Jonsson, F., Thorsteinsdottir, U., Gudbjartsson, D. F., Stefansson, H. & Stefansson, K. (2008), ‘Detection of sharing by descent, long-range phasing and haplotype imputation’, Nature Genetics 40(9), 1068–1075. bibtex: kong detection 2008.

Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. (2010), ‘MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes’, Genetic Epidemiology 34(8), 816–834. bibtex: li mach: 2010.

Marchini, J. & Howie, B. (2010), ‘Genotype imputation for genome-wide association studies’, Nature Reviews Genetics 11(7), 499–511.
URL: http://www.nature.com/doifinder/10.1038/nrg2796

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., Maller, J., Sklar, P., de Bakker, P. I. W., Daly, M. J. & Sham, P. C. (2007), ‘PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses’, The American Journal of Human Genetics 81(3), 559–575. bibtex: purcell plink: 2007.
URL: http://www.sciencedirect.com/science/article/pii/S0002929707613524


Index

allele dosages, 8, see genotype dosages
AlphaPhase, 3
AlphaPhaseOutput, 28

BurnInRounds, 32

ConservativeHaplotypeLibraryUse, 29
CoreAndTailLengths, 25
CoreLength, 25
cores, 8, 25

EditingParameters, 24

Ferdosi, 28
Files
    Genotype file, 21
    ImputeGenotypeProbabilities.txt, 10, 31, 42
    ImputeGenotypes.txt, 10, 42
    ImputeGenotypesHMM.txt, 31
    ImputePhase.txt, 10, 42
    ImputePhaseHMM.txt, 31
    ImputePhaseProbabilities.txt, 10, 31, 42
    InternalDataRecoding.txt, 20
    ModelRecomb.txt, 30
    Pedigree, 20
    PedigreeMistakes.txt, 40
    PLINK 1.9, binary, 22
    PLINK 1.9, text, 22
    RecombinationInformation.txt, 30
    RecombinationInformationNarrow.txt, 30
    RecombinationInformationNarrowR.txt, 30
    RecombinationInformationR.txt, 30
    Results, 10, 42
    snpMistakes.txt, 40
    Timer.txt, 42
    True genotype file, 21
    WellPhasedIndividuals.txt, 34
Filtering, box 3, 23

genotype dosages, 8, 10, 43
GenotypeError, 27
GenotypeFile, 21
Genotypes, 9

haplotype library construction, 8, 37
HaplotypesList, 33
HDAnimalsThreshold, 24
Hidden Markov model, box 6, 30
high-density, 7, 8
HMM, see Hidden Markov model, 30
HMMOption, 31
HMMParameters, 31

Imputation, box 5, 29
imputing genotypes, 7
Imputing sex chromosomes, 22
input data, 20
InternalEdit, 23
InternalIterations, 29

LargeDatasets, 27, 27
long-range phasing, 3, 8, 37
low-density, 7

ModelRecomb, 30
Multiple high-density panels, see MultipleHdPanels and NumberSnpxChip


MultipleHdPanels, 23

NumberPhasingRuns, 25
NumberPhasingRuns, 24
NumberSnp, 23
NumberSnpxChip, 23

OutputOnlyGenotypedAnimals, 34

ParallelProcessors, 33
Pedigree, 9
PedigreeFile, 20
PedigreeFreePhasing, 26
phased genotypes, see phases
PhasedAnimalsThreshold, 32
phases, 7
Phasing, box 4, 24
PlinkInputfile
    binary format, 22
    text format, 22
PrephasedFile, 28
PreprocessDataOnly, 33

RestartOption, 33
ResultFolderPath, 34
Rounds, 31, 32

Seed, 32
settings, see spec file
SexChrom, 21, 22
spec file, 19
Step 1, 8, 37
Step 2, 8, 37

TemplateHaplotypes, 31
ThresholdImputed, 32
TrueGenotypeFile, 21

UseFerdosi, 28
UserDefinedAlphaPhaseAnimalsFile, 27

WellPhasedThreshold, 34
