Step 2: Single Variant Association Tests

Overview

Step 2 performs single-variant association tests using the null model fitted in Step 1. This step supports multiple file formats and allows for conditional analysis and region-specific testing.

Supported File Formats

SAIGE-QTL supports multiple formats for dosages/genotypes:

PLINK (BED/BIM/FAM)
VCF/BCF (with .csi index)
BGEN (with .bgi index)
SAV (with .s1r index)

Key Features

Conditional analysis using summary statistics (with --condition flag)
Subset testing by variant IDs or chromosome ranges
Chunked processing with --markers_per_chunk (default: 10,000, minimum: 1,000)
Missing genotype imputation using best guess method

Getting Started

Check Help Information and the job scripts should use the corresponding command to call the wrapper function step2_tests_qtl.R

Note: Pixi script is showing here for examples

Pixi Installation

# Navigate to SAIGEQTL directory first
cd SAIGEQTL/extdata/
pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R --help

Basic Association Testing

Prepare Region File

Define the cis-region for testing:

regionFile=./input/gene_1_cis_region.txt
echo -e "2\t300001\t610001" > ${regionFile}

Set Output Prefixes

step1prefix=./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1
step2prefix=./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis

Run Association Tests

NOTE: the corresponding command to call the wrapper function step2_tests_qtl.R needs to be modified if different ways used for installing the pcakge as shown above

pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R \
    --bedFile=./input/n.indep_100_n.cell_1.bed \
    --bimFile=./input/n.indep_100_n.cell_1.bim \
    --famFile=./input/n.indep_100_n.cell_1.fam \
    --SAIGEOutputFile=${step2prefix} \
    --chrom=2 \
    --minMAF=0 \
    --minMAC=20 \
    --LOCO=FALSE \
    --GMMATmodelFile=${step1prefix}.rda \
    --SPAcutoff=2 \
    --varianceRatioFile=${step1prefix}.varianceRatio.txt \
    --rangestoIncludeFile=${regionFile} \
    --markers_per_chunk=10000

Alternative Input Formats

BGEN Files

# Add these parameters instead of --bedFile/--bimFile/--famFile
--bgenFile=./input/genotype_100markers.bgen \
--bgenFileIndex=./input/genotype_100markers.bgen.bgi \
--AlleleOrder=ref-first \
--sampleFile=./input/samplelist.txt

VCF/BCF/SAV Files

# Add these parameters instead of PLINK files
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.csi \
--vcfField=DS \  # Use DS for dosages, GT for genotypes
--chrom=2

Conditional Analysis

Perform conditional analysis by specifying conditioning markers:

pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R \
    --bedFile=./input/n.indep_100_n.cell_1.bed \
    --bimFile=./input/n.indep_100_n.cell_1.bim \
    --famFile=./input/n.indep_100_n.cell_1.fam \
    --SAIGEOutputFile=${step2prefix}_cond \
    --chrom=2 \
    --minMAF=0 \
    --minMAC=20 \
    --LOCO=FALSE \
    --GMMATmodelFile=${step1prefix}.rda \
    --SPAcutoff=2 \
    --varianceRatioFile=${step1prefix}.varianceRatio.txt \
    --rangestoIncludeFile=${regionFile} \
    --markers_per_chunk=10000 \
    --condition=2:300096:2:1,2:609979:1:2

Note: Conditioning markers must be specified as chr:pos:ref:alt and in the same order as stored in the dosage file.

Input Files

1. Dosage/Genotype Files (Required)

PLINK Format

--bedFile=./input/n.indep_100_n.cell_1.bed
--bimFile=./input/n.indep_100_n.cell_1.bim
--famFile=./input/n.indep_100_n.cell_1.fam

BGEN Format

# Files needed:
./input/genotype_100markers.bgen
./input/genotype_100markers.bgen.bgi

VCF Format (Genotypes)

View and Index VCF Files

cd ./input
# Create index file using tabix
tabix --csi -p vcf genotype_10markers.vcf.gz
# View file content
zcat genotype_10markers.vcf.gz | less -S
# Index file: genotype_10markers.vcf.gz.csi

VCF Format (Dosages)

Index and View Dosage VCF

# Create index file using tabix
tabix --csi -p vcf dosage_10markers.vcf.gz
# View file content
zcat dosage_10markers.vcf.gz | less -S
# Index file: dosage_10markers.vcf.gz.csi


#### SAV Format
```bash
# Files needed:
dosage_10markers.sav
dosage_10markers.sav.s1r

2. Sample File (Optional - for BGEN without sample IDs)

View Sample File

# Simple format (no header)
less -S ./input/samplelist.txt

# BGEN .sample format
less -S ./input/genotype_100markers_2chr.sample

3. Step 1 Output Files (Required)

Model file: ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1.rda
Variance ratio file: ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1.varianceRatio.txt

Key Parameters Explained

Parameter	Description	Example Value
`--SAIGEOutputFile`	Output file prefix	`./output/results_cis`
`--chrom`	Chromosome to analyze	`2`
`--minMAF`	Minimum minor allele frequency	`0`
`--minMAC`	Minimum minor allele count	`20`
`--LOCO`	Leave-one-chromosome-out	`FALSE`
`--SPAcutoff`	SPA threshold (χ² statistic)	`2`
`--markers_per_chunk`	Markers per output chunk	`10000`
`--condition`	Conditioning markers	`chr:pos:ref:alt,chr:pos:ref:alt`
`--rangestoIncludeFile`	Region file for testing	`./input/region.txt`
`--vcfField`	VCF field to use	`DS` (dosages) or `GT` (genotypes)

Output Files

1. Association Results File

View Results

less -S ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis

2. Index File

Tracks analysis progress for resuming interrupted jobs:

less -S ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis.index

Output Format Explanation

Standard Columns

View Header

head -n 1 output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis

Column Descriptions

Column	Description
`CHR`	Chromosome
`POS`	Genome position
`SNPID`	Variant ID
`Allele1`	Reference allele
`Allele2`	Alternative allele (effect allele)
`AC_Allele2`	Allele count of Allele2
`AF_Allele2`	Allele frequency of Allele2
`MissingRate`	Missing rate (or `imputationInfo` if imputed)
`BETA`	Effect size of Allele2
`SE`	Standard error of BETA
`Tstat`	Score statistic of Allele2
`var`	Estimated variance of score statistic
`p.value`	P-value (with SPA if applicable)
`p.value.NA`	P-value without SPA (binary traits only)
`Is.SPA`	Whether SPA converged

Conditional Analysis Columns

When using --condition=, additional columns are output:

BETA_c, SE_c, Tstat_c, var_c, p.value_c, p.value.NA_c

Note: Association results are reported with regard to Allele2 (the alternative allele).

Best Practices

Performance Optimization

Use appropriate --markers_per_chunk values (≥1000, typically 10,000)
Consider chromosomal parallelization for large datasets
Use --is_overwrite_output=FALSE to resume interrupted analyses

Quality Control

Set reasonable --minMAC thresholds (typically ≥20)
Monitor SPA convergence in results
Check for appropriate effect size distributions

File Management

Ensure proper indexing for VCF/BCF files (.csi)
Verify sample ID consistency across files
Use absolute paths for containerized environments

Region-Based Testing

Define appropriate cis-regions (typically ±500kb from gene)
Use single-line range files for VCF/SAV input
Consider multiple ranges for trans-QTL analyses

Troubleshooting

Common Issues

Index File Errors
- Ensure .csi files exist and are properly named
- Recreate index files if corrupted
- Check file permissions and accessibility
Memory Issues
- Reduce --markers_per_chunk for large datasets
- Monitor memory usage during processing
- Use appropriate compute resources
Sample Mismatch
- Verify sample IDs match between genotype and phenotype files
- Check sample file format for BGEN inputs
- Ensure proper sample ordering
Conditional Analysis
- Verify conditioning marker format (chr:pos:ref:alt)
- Ensure markers exist in the dosage file
- Check marker ordering consistency
File Path Issues
- Use absolute paths in containerized environments
- Verify volume mounting in Docker/Singularity
- Check file accessibility and permissions

Performance Tips

Parallel Processing: Run multiple chromosomes simultaneously
Resource Allocation: Monitor CPU and memory usage
Storage: Use fast I/O storage for large datasets
Indexing: Ensure all index files are current and accessible

Next Steps

After successful completion of Step 2, you can:

Proceed to Step 3 for gene-level p-value calculations
Perform additional conditional analyses
Analyze results for significant associations
Conduct meta-analyses across multiple datasets