Step 2: Single Variant Association Tests
Overview
Step 2 performs single-variant association tests using the null model fitted in Step 1. This step supports multiple file formats and allows for conditional analysis and region-specific testing.
Supported File Formats
SAIGE-QTL supports multiple formats for dosages/genotypes:
- PLINK (BED/BIM/FAM)
- VCF/BCF (with .csi index)
- BGEN (with .bgi index)
- SAV (with .s1r index)
Key Features
- Conditional analysis using summary statistics (with
--conditionflag) - Subset testing by variant IDs or chromosome ranges
- Chunked processing with
--markers_per_chunk(default: 10,000, minimum: 1,000) - Missing genotype imputation using best guess method
Getting Started
Check Help Information and the job scripts should use the corresponding command to call the wrapper function step2_tests_qtl.R
Note: Pixi script is showing here for examples
Pixi Installation
# Navigate to SAIGEQTL directory first
cd SAIGEQTL/extdata/
pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R --help
Basic Association Testing
Prepare Region File
Define the cis-region for testing:
regionFile=./input/gene_1_cis_region.txt
echo -e "2\t30\t61" > ${regionFile}
Set Output Prefixes
step1prefix=./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1
step2prefix=./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis
Run Association Tests (install through pixi)
NOTE: the corresponding command prefixes to call the wrapper function step2_tests_qtl.R needs to be modified if SAIGEQTL installed in other ways (e.g. through Docker, etc). Here we just provide the pixi way of running for simplicity as the flags used in step2 running remains the same regardless of how SAIGEQTL is installed.
pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R \
--bedFile=./input/n.indep_100_n.cell_1_full.bed \
--bimFile=./input/n.indep_100_n.cell_1_full.bim \
--famFile=./input/n.indep_100_n.cell_1_full.fam \
--SAIGEOutputFile=${step2prefix} \
--chrom=2 \
--minMAF=0 \
--minMAC=20 \
--LOCO=FALSE \
--GMMATmodelFile=${step1prefix}.rda \
--SPAcutoff=2 \
--varianceRatioFile=${step1prefix}.varianceRatio.txt \
--rangestoIncludeFile=${regionFile} \
--markers_per_chunk=10000
Alternative Input Formats
BGEN Files
# Add these parameters instead of --bedFile/--bimFile/--famFile
--bgenFile=./input/genotype_100markers.bgen \
--bgenFileIndex=./input/genotype_100markers.bgen.bgi \
--AlleleOrder=ref-first \
--sampleFile=./input/samplelist.txt
VCF/BCF/SAV Files
# Add these parameters instead of PLINK files
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.csi \
--vcfField=DS \ # Use DS for dosages, GT for genotypes
--chrom=2
Conditional Analysis
Perform conditional analysis by specifying conditioning markers:
regionFile=./input/gene_1_cis_region_larger.txt
echo -e "2\t300001\t610000" > ${regionFile}
pixi run --manifest-path=../pixi.toml Rscript step2_tests_qtl.R \
--bedFile=./input/n.indep_100_n.cell_1_full.bed \
--bimFile=./input/n.indep_100_n.cell_1_full.bim \
--famFile=./input/n.indep_100_n.cell_1_full.fam \
--SAIGEOutputFile=${step2prefix}_cond \
--chrom=2 \
--minMAF=0 \
--minMAC=20 \
--LOCO=FALSE \
--GMMATmodelFile=${step1prefix}.rda \
--SPAcutoff=2 \
--varianceRatioFile=${step1prefix}.varianceRatio.txt \
--rangestoIncludeFile=${regionFile} \
--markers_per_chunk=10000 \
--condition=2:300096:C:A
Note: Conditioning markers must be specified as chr:pos:ref:alt and in the same order as stored in the dosage file.
Input Files
1. Dosage/Genotype Files (Required)
PLINK Format
--bedFile=./input/n.indep_100_n.cell_1_full.bed
--bimFile=./input/n.indep_100_n.cell_1_full.bim
--famFile=./input/n.indep_100_n.cell_1_full.fam
BGEN Format
# Files needed:
./input/genotype_100markers.bgen
./input/genotype_100markers.bgen.bgi
VCF Format (Genotypes)
View and Index VCF Files
cd ./input
# Create index file using tabix
tabix --csi -p vcf genotype_10markers.vcf.gz
# View file content
zcat genotype_10markers.vcf.gz | less -S
# Index file: genotype_10markers.vcf.gz.csi
VCF Format (Dosages)
Index and View Dosage VCF
# Create index file using tabix
tabix --csi -p vcf dosage_10markers.vcf.gz
# View file content
zcat dosage_10markers.vcf.gz | less -S
# Index file: dosage_10markers.vcf.gz.csi
#### SAV Format
```bash
# Files needed:
dosage_10markers.sav
dosage_10markers.sav.s1r
2. Sample File (Optional - for BGEN without sample IDs)
View Sample File
# Simple format (no header)
less -S ./input/samplelist.txt
# BGEN .sample format
less -S ./input/genotype_100markers_2chr.sample
3. Step 1 Output Files (Required)
- Model file:
./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1.rda - Variance ratio file:
./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1.varianceRatio.txt
Key Parameters Explained
| Parameter | Description | Example Value |
|---|---|---|
--SAIGEOutputFile | Output file prefix | ./output/results_cis |
--chrom | Chromosome to analyze | 2 |
--minMAF | Minimum minor allele frequency | 0 |
--minMAC | Minimum minor allele count | 20 |
--LOCO | Leave-one-chromosome-out | FALSE |
--SPAcutoff | SPA threshold (χ² statistic) | 2 |
--markers_per_chunk | Markers per output chunk | 10000 |
--condition | Conditioning markers | chr:pos:ref:alt,chr:pos:ref:alt |
--rangestoIncludeFile | Region file for testing | ./input/region.txt |
--vcfField | VCF field to use | DS (dosages) or GT (genotypes) |
Output Files
1. Association Results File
View Results
less -S ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis
2. Index File
Tracks analysis progress for resuming interrupted jobs:
less -S ./output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis.index
Output Format Explanation
Standard Columns
View Header
head -n 1 output/nindep_100_ncell_100_lambda_2_tauIntraSample_0.5_gene_1_cis
Column Descriptions
| Column | Description |
|---|---|
CHR | Chromosome |
POS | Genome position |
SNPID | Variant ID |
Allele1 | Reference allele |
Allele2 | Alternative allele (effect allele) |
AC_Allele2 | Allele count of Allele2 |
AF_Allele2 | Allele frequency of Allele2 |
MissingRate | Missing rate (or imputationInfo if imputed) |
BETA | Effect size of Allele2 |
SE | Standard error of BETA |
Tstat | Score statistic of Allele2 |
var | Estimated variance of score statistic |
p.value | P-value (with SPA if applicable) |
p.value.NA | P-value without SPA (binary traits only) |
Is.SPA | Whether SPA converged |
Conditional Analysis Columns
When using --condition=, additional columns are output:
BETA_c,SE_c,Tstat_c,var_c,p.value_c,p.value.NA_c
Note: Association results are reported with regard to Allele2 (the alternative allele).
Best Practices
Performance Optimization
- Use appropriate
--markers_per_chunkvalues (≥1000, typically 10,000) - Consider chromosomal parallelization for large datasets
- Use
--is_overwrite_output=FALSEto resume interrupted analyses
Quality Control
- Set reasonable
--minMACthresholds (typically ≥20) - Monitor SPA convergence in results
- Check for appropriate effect size distributions
File Management
- Ensure proper indexing for VCF/BCF files (.csi)
- Verify sample ID consistency across files
- Use absolute paths for containerized environments
Region-Based Testing
- Define appropriate cis-regions (typically ±500kb from gene)
- Use single-line range files for VCF/SAV input
- Consider multiple ranges for trans-QTL analyses
Troubleshooting
Common Issues
- Index File Errors
- Ensure .csi files exist and are properly named
- Recreate index files if corrupted
- Check file permissions and accessibility
- Memory Issues
- Reduce
--markers_per_chunkfor large datasets - Monitor memory usage during processing
- Use appropriate compute resources
- Reduce
- Sample Mismatch
- Verify sample IDs match between genotype and phenotype files
- Check sample file format for BGEN inputs
- Ensure proper sample ordering
- Conditional Analysis
- Verify conditioning marker format (chr:pos:ref:alt)
- Ensure markers exist in the dosage file
- Check marker ordering consistency
- File Path Issues
- Use absolute paths in containerized environments
- Verify volume mounting in Docker/Singularity
- Check file accessibility and permissions
Performance Tips
- Parallel Processing: Run multiple chromosomes simultaneously
- Resource Allocation: Monitor CPU and memory usage
- Storage: Use fast I/O storage for large datasets
- Indexing: Ensure all index files are current and accessible
Next Steps
After successful completion of Step 2, you can:
- Proceed to Step 3 for gene-level p-value calculations
- Perform additional conditional analyses
- Analyze results for significant associations
- Conduct meta-analyses across multiple datasets