Genome · Manual

Help and documentation.

Quick start, workflows, analyses, tools, references and troubleshooting. All local, all traceable.

Quick start

Genome analyses FASTQ/BAM/CRAM data from Whole Genome Sequencing (WGS) locally and produces separate domain reports for Medical Genomics and Pharmacogenetics: HLA*LA/T1K, KILDA/LPA, T1K-KIR, Aldy4, ExpansionHunter, microarray panel, PGS Catalog files and SNP search.

What is Genome?

Genome is a macOS app for bioinformatic analysis of your own WGS data. It runs FASTQ→BAM alignment, reads BAM/CRAM/SAM and produces local analysis building blocks: HLA*LA and T1K for HLA typing, T1K for KIR genotyping, KILDA for LPA/KIV-2, Aldy4 for pharmacogenetics, ExpansionHunter for repeat expansions, microarray panel exports and PGS Catalog-based score determination. PDF reports separate raw data, technical evidence and cautious interpretation. No cloud, no data sharing.

System requirements

macOS 26 or newer · Apple Silicon (M1+) required · 16 GB RAM recommended (8 GB minimum) · internal SSD recommended · ~1–2 GB per reference genome · internet connection only required for tool installation and reference-genome download.

Guide values for 30× WGS on Apple Silicon: microarray extraction ~20–40 min · Y VCF ~5–10 min · MT VCF ~2–5 min · FASTQ→BAM ~24–48 hours (M2/M3). An M4 with internal SSD tends towards 24 hours, older M1 Macs towards 48 hours. Main factors: coverage, file size, SSD speed and available CPU cores.

First steps
  1. Directories → choose a Reference Library (e.g. /Volumes/SSD/Reference). Reference genomes, panels and external tool resources live here.
  2. References → prepare a reference genome. For Genome analyses use GRCh38/hs38d1 with .fai; HLA*LA additionally needs the matching PRG_MHC_GRCh38_withIMGT graph.
  3. Tools → install/check the required tools. Relevant for the current special analyses: KILDA, HLA*LA, T1K, Aldy4, ExpansionHunter, microarray panel and PGS Catalog files.
  4. FASTQ raw data? → Conversion → choose R1/R2 → Fastp optional → start alignment. Then continue with the resulting BAM.
  5. Directories → select a BAM/CRAM/SAM file. Genome reads build, coverage, sex and index status automatically.
  6. Choose a domain tab: Medical Genomics, Pharmacogenetics, HLA/KIR, LPA/KILDA, repeat expansions, SNP search, PGS or microarray export. PDF reports are produced only from inputs that were actually used.

Typical workflow

Load WGS BAM → choose the desired domain → check prerequisites → start analysis → export PDF report. HLA uses HLA*LA G-group output and optionally T1K as a second evidence layer. T1K-KIR stays separate from HLA. KILDA reports KIV-2/LPA context including quantile=NA as a raw value, Aldy4 belongs exclusively in pharmacogenetics, ExpansionHunter separates repeat and small-variant output. PGS scores are documented from PGS Catalog files, without diagnostic over-promising.

The status bar at the bottom of the window shows the progress of running operations. A red error bar appears when there is a problem; it gives the cause and can be closed with ✕.

Workflow

The Workflow tab bundles presets, reference genome, tools and analysis settings for typical local Genome runs.

What are presets?

Presets are pre-configured workflow profiles for common analysis scenarios. Five built-in presets are available: WGS, Exome, Mitochondrial, Y-chromosome and Consumer genetics. Each preset defines the matching reference genome, the required tools and haplogroup settings.

Activate a preset

Click a preset to view its details, then click 'Activate'. The app shows missing prerequisites (e.g. tools not installed or a missing reference genome) before the preset is applied.

Custom presets

Create your own presets from the current configuration. Give each preset a name, an icon and a description. A preset stores all current settings and can be re-activated at any time.

💡

Presets only change settings, they do not start a pipeline. After activating a preset you can review the configuration and start the workflow manually.

📁 Directories

Output directory

All generated files (BAMs, VCFs, microarray text files, reports) end up here. Default is the directory of the loaded BAM file. Recommended: choose a dedicated output directory on an SSD. The directory is restored automatically on launch.

Temporary directory

Intermediate files during running processes: unpacked reference genomes, alignment intermediates, sort files. Default: ~/Library/Caches/Genome. These files are deleted automatically after successful processing. On abort, leftovers may remain and can be deleted manually.

Reference Library Important

Central directory for all reference data: reference genomes (.fa / .fa.gz + .fai index), microarray panels (.tab.gz / .vcf.gz) and Haplogrep 3 (haplogrep3/). Recommended: external SSD with at least 5 GB free, since each reference genome takes ~1 GB. On first launch the directory is checked; missing resources are shown by colored indicators in References and Tools.

Load BAM/CRAM file

Loads a BAM, CRAM or SAM file as input. When loading, the app automatically reads: reference-genome name from the BAM header, genome build (hg38/hg19/hs37d5), average read depth (coverage), biological sex (Y/X chromosome reads), file content (WGS/WES/panel) and index status (.bai / .crai). CRAM requires a matching reference genome in the Reference Library.
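
The build detection described above can be illustrated with a minimal sketch. It assumes the header text comes from something like 'samtools view -H' and that the build is guessed from the @SQ lines only; the function name detect_build and the decision rules are hypothetical and not taken from Genome itself. The chr1 lengths are the published values for GRCh38 and GRCh37.

```python
# Known chr1 lengths (bp) for the two common human assemblies.
CHR1_LENGTHS = {248956422: "GRCh38", 249250621: "GRCh37"}


def detect_build(header_text):
    """Guess the genome build from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])

    # chr1 may be named with or without the 'chr' prefix.
    chr1_len = contigs.get("chr1", contigs.get("1"))
    assembly = CHR1_LENGTHS.get(chr1_len, "unknown")
    if assembly == "GRCh38":
        return "hg38"
    if assembly == "GRCh37":
        # The decoy contig 'hs37d5' distinguishes the 1000 Genomes variant.
        return "hs37d5" if "hs37d5" in contigs else "hg19"
    return "unknown"
```

A header whose chr1 length matches neither assembly returns "unknown" rather than a guess.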

Without an index file (.bai for BAM, .crai for CRAM) many extraction and analysis functions are unavailable. Create an index with 'samtools index file.bam'.

💡

CRAM files are 40–50% smaller than BAM, but need the reference genome when unpacking. Place the matching genome in the Reference Library before loading CRAM files.

Conversion

Converts FASTQ raw data into aligned BAM files (alignment) or turns BAMs back into FASTQ. Also includes quality control for raw data.

FASTQ → BAM (alignment)

FASTQ → BAM bwa samtools

Aligns paired-end FASTQ files (R1 + R2) against a reference genome. Pipeline: bwa mem → samtools fixmate → samtools sort → samtools markdup. Result: an indexed, sorted, deduplicated BAM file.

Prerequisites FASTQ→BAM

Required: samtools, bwa (or bwa-mem2 for higher speed). The reference genome must be present in the Reference Library and indexed with 'bwa index' (.bwt / .amb / .ann / .pac / .sa files). The bwa index is created automatically on first alignment if missing, taking ~30–60 minutes for a 3 GB genome.

Alignment parameters

Threads: set automatically to the number of logical CPU cores. The read group is generated from the file name (RGID, RGSM, RGPL=ILLUMINA, RGLB=lib1). Markdup removes PCR duplicates. Sorting is coordinate-based (needed for indexing).

Split-read mode Recommended: Supplementary

Controls how chimeric/split reads (reads that align at several places in the genome) are marked in the BAM.

• Supplementary (default, recommended): shorter split hits are marked as supplementary alignments. The modern standard, compatible with all current tools (samtools, GATK 4+, bcftools).

• Secondary (-M): shorter split hits are marked as secondary alignments (the -M option of bwa mem). Needed for older tools such as Picard <2.0. Produces slightly larger BAM files.

The setting is in the Conversion tab below the reference-genome selection.
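
As a rough illustration of the pipeline, read group and split-read setting described in this section, the following sketch only builds the command lines and does not run them. The options shown (bwa mem -t/-R/-M, samtools fixmate -m, samtools sort -@, samtools markdup -r) exist in those tools, but which options Genome actually passes is not documented here, so treat the exact argument lists as an assumption. The helper names and the rule for deriving the sample name from the file name are hypothetical.

```python
import os
from pathlib import Path


def read_group(fastq_r1):
    """Build an @RG line from the file name (sample = name up to the first dot)."""
    sample = Path(fastq_r1).name.split(".")[0]
    # bwa mem expects literal backslash-t sequences inside the -R argument.
    return f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA\\tLB:lib1"


def alignment_commands(ref, r1, r2, out_bam, threads=None, secondary=False):
    """Return the four pipeline stages as argument lists, meant to be piped in order."""
    threads = str(threads or os.cpu_count() or 1)
    bwa = ["bwa", "mem", "-t", threads, "-R", read_group(r1)]
    if secondary:
        bwa.append("-M")  # mark shorter split hits as secondary
    bwa += [ref, r1, r2]
    return [
        bwa,
        ["samtools", "fixmate", "-m", "-", "-"],     # add mate score tags for markdup
        ["samtools", "sort", "-@", threads, "-"],    # coordinate sort to stdout
        ["samtools", "markdup", "-r", "-", out_bam], # -r removes duplicates
    ]
```

Keeping the stages as separate lists makes it easy to log each command and to swap one stage (for example the duplicate step) without touching the others.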

Resume after abort

If a running FASTQ→BAM alignment is aborted or fails, the app automatically detects existing intermediate files on the next launch. When you click 'Start alignment' a dialog appears with three options:

• Resume: continues the pipeline from the last successful step (e.g. from sort-merge, index or flagstat).

• Restart: deletes all intermediate files and starts completely over.

• Cancel: no action.

The pipeline can be resumed from any step: sort chunks → merge → markdup → index → flagstat/load.
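
The resume logic can be pictured as finding the first step in the chain that has not completed yet. This is a minimal sketch under that assumption; the step identifiers and the function resume_point are hypothetical, and how Genome actually detects completed steps from intermediate files is not specified here.

```python
# Step order as listed above; the identifiers are illustrative.
STEPS = ["sort_chunks", "merge", "markdup", "index", "flagstat"]


def resume_point(completed):
    """Return the first step that is not in `completed`, or None if all are done."""
    for step in STEPS:
        if step not in completed:
            return step
    return None
```

With completed = {"sort_chunks", "merge"} the sketch resumes at "markdup"; a gap in the completed set sends the pipeline back to the earliest missing step.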

On I/O timeouts (e.g. on external SSDs) the pipeline does not abort automatically. Instead a dialog appears: 'Retry' re-attempts the failed step, 'Cancel' stops the pipeline. This lets you, for example, reconnect an external SSD and continue.

BAM → FASTQ (back-conversion)

BAM → FASTQ samtools

Converts a BAM file back into two FASTQ files (R1, R2) via samtools collate + fastq. Useful when the original files are missing or a re-alignment against a different reference genome is needed. Unmapped reads are optionally included.

Quality control

Fastp fastp

Fastp analyses FASTQ files for quality, adapter contamination and GC content. Produces an interactive HTML report and optionally cleaned FASTQ files (adapter trimming, low-quality read filtering). Recommended before every alignment. Speed: ~500 MB/s on Apple Silicon.

FastQC FastQC Java

FastQC produces a detailed HTML quality report per FASTQ file. Includes: per-base sequence quality, per-sequence quality scores, sequence duplication levels, overrepresented sequences, adapter content and k-mer content. Requires a Java runtime. Result: an HTML file in the output directory.

💡

For best results: run Fastp before alignment. Use FastQC for a more detailed visual analysis of the raw data. Both tools complement each other and can be run one after the other.

Extraction

Extracts specific datasets from the loaded BAM file. All output lands in the output directory. Requires an indexed BAM/CRAM file and a matching reference genome.

Microarray extraction

Reference panel

The reference panel contains the SNP positions of commercial DNA-chip platforms (.tab.gz or .vcf.gz). Panels are stored in the Reference Library directory and detected automatically on launch. The indicator shows: green = panel present, orange = panel missing (all variants are output without rsID).

Choose output formats

Use the expandable format menu to enable individual platforms. Buttons: 'Recommended' selects the most common formats (23andMe v3/v5, AncestryDNA v2, CombinedKit). 'All' enables every available version. 'None' clears the selection. CombinedKit contains all called SNPs and is suitable for GEDmatch, GEDmatch Genesis and FTDNA.

Header line

The 'Header' switch (in the expanded output-formats menu, next to All/None) controls whether a platform-specific header line is prepended when creating the output files. Enabled by default. When Header is active, the timestamp in the header is always updated to the current date and time, in the platform-correct format (e.g. 23andMe: 'Thu Dec 29 11:59:59 2012', AncestryDNA: '03/21/2013 11:15:47 MDT', MyHeritage: '2019-05-04 14:21:19'). When Header is off, the files are created without a header line (records only). For FTDNA there is no timestamp in the template.

Headers adapt dynamically to the reference-genome build (37/38). When using hg38, build references in the headers are updated automatically (e.g. "build 37" → "build 38", "GRCh37.p13" → "GRCh38.p14").
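
The header behaviour can be sketched as follows. The strftime patterns are inferred from the three example timestamps above, the time-zone label for AncestryDNA is passed in as plain text, and the replacement pairs are the two examples given in this section; all names (TIMESTAMP_FORMATS, header_timestamp, adapt_header) are hypothetical and the real templates may differ.

```python
from datetime import datetime

# strftime patterns inferred from the example timestamps in this section.
TIMESTAMP_FORMATS = {
    "23andMe": "%a %b %d %H:%M:%S %Y",
    "AncestryDNA": "%m/%d/%Y %H:%M:%S",
    "MyHeritage": "%Y-%m-%d %H:%M:%S",
}

# Build references rewritten when the BAM is aligned to build 38.
BUILD_38_REPLACEMENTS = [("build 37", "build 38"), ("GRCh37.p13", "GRCh38.p14")]


def header_timestamp(platform, when, tz_label="MDT"):
    """Format `when` in the platform's style; AncestryDNA gets a zone label appended."""
    text = when.strftime(TIMESTAMP_FORMATS[platform])
    return f"{text} {tz_label}" if platform == "AncestryDNA" else text


def adapt_header(header, build):
    """Rewrite build references in a header template for build 38; leave 37 as is."""
    if build == 38:
        for old, new in BUILD_38_REPLACEMENTS:
            header = header.replace(old, new)
    return header
```

Usage: header_timestamp("MyHeritage", datetime.now()) yields the current time in the MyHeritage style, and adapt_header(template, 38) updates a build-37 template for an hg38 run.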

Extract microarray bcftools

Starts microarray extraction. Internal pipeline: bcftools mpileup (pileup of all reference positions) → bcftools call (variant calling) → panel-specific filtering → format conversion. With a panel: chip-specific SNP filtering + rsID annotation + CombinedKit + individual formats. Without a panel: raw variant VCF. Duration: 15–90 minutes depending on coverage and genome size.

Microarray output files

Per enabled format: one .txt file (tab-separated) in the output directory. File name: [BAM-name]_[format]_[date].txt. Format example 23andMe v5: columns rsid / chromosome / position / allele1allele2. CombinedKit: all called SNPs with rsID if a panel is present.

Mitochondrial DNA

MT FASTA samtools

Extracts the mitochondrial chromosome as a FASTA consensus sequence. Uses: samtools view (MT reads) → samtools mpileup → consensus computation. Suitable for yFull (female), Mitoverse, EMPOP. Output: [name]_MT.fasta.

MT BAM samtools

Extracts all MT reads as a separate BAM file. The chromosome name adapts automatically (chrM for hg38, MT for hs37d5). Suitable for manual analysis and further processing. Output: [name]_MT.bam + .bai index.

MT VCF bcftools

Calls variants on the MT chromosome with bcftools mpileup + call and creates a compressed VCF file. Contains all SNPs and indels of the MT genome. Suitable for Haplogrep (direct import), Mitoverse, PhyloTree-based analysis. Output: [name]_MT.vcf.gz.

Y chromosome

Y+MT BAM yFull

Extracts the Y chromosome and MT DNA together as a BAM. Optimal for yFull (male) where both chromosomes are needed. Build 38 (hg38/hs38) is preferred by yFull. Output: [name]_YMT.bam + .bai.

Y BAM

Extracts only the Y chromosome as a BAM. Suitable for yDNA Warehouse and yTree. Chromosome name: chrY (hg38) or Y (hs37d5). Output: [name]_Y.bam + .bai.

Y VCF bcftools

Calls variants on the Y chromosome and creates a compressed VCF. Suitable for manual analysis and uploading to yFull (as the VCF option). Contains all Y-SNPs and Y-STRs. Output: [name]_Y.vcf.gz.

Y-chromosome extraction only makes sense for male samples. The app detects biological sex automatically from the Y/X read ratio and shows a warning for female samples.
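
A Y/X read ratio of this kind can be derived from per-contig mapped read counts, for example the tab-separated output of 'samtools idxstats' (contig, length, mapped, unmapped). The sketch below assumes that input; the threshold of 0.05 is purely illustrative and not Genome's actual cut-off, and infer_sex is a hypothetical name.

```python
def infer_sex(idxstats_text, male_min_ratio=0.05):
    """Classify a sample from mapped read counts on X and Y.

    Returns (label, ratio); label is 'male', 'female' or 'unknown'.
    """
    mapped = {}
    for line in idxstats_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 3:
            # Accept both 'chrX' and 'X' style contig names.
            name = parts[0][3:] if parts[0].startswith("chr") else parts[0]
            mapped[name] = int(parts[2])

    x_reads, y_reads = mapped.get("X", 0), mapped.get("Y", 0)
    if x_reads == 0:
        return "unknown", None
    ratio = y_reads / x_reads
    return ("male" if ratio >= male_min_ratio else "female"), ratio
```

Female samples still show a small number of Y-mapped reads (for example from regions shared with X), which is why a ratio threshold is used rather than a simple presence check.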

Analysis

Direct analysis functions based on the loaded BAM file without external platforms. HLA*LA/T1K, T1K-KIR, KILDA/LPA, Aldy4, ExpansionHunter, PGS, haplogroups and VCF context stay visible as separate analysis building blocks.

Y haplogroup bcftools ISOGG

Computes the paternal Y haplogroup directly from the BAM file. The app calls Y-SNPs via bcftools, compares them with the ISOGG/PhyloTree database and returns the deepest matching clade. Display: haplogroup, confidence, supporting SNPs. Faster than uploading to an external service, and no internet connection is needed.

MT haplogroup (Haplogrep 3) Haplogrep 3 PhyloTree

Determines the maternal MT haplogroup with Haplogrep 3. Both men and women have mtDNA; the maternal line can be evaluated in all samples. Genome now uses only Haplogrep 3: modern codebase, configurable phylotrees (default: phylotree-rcrs@17.2) and, with --extend-report, additional columns on polymorphisms, hotspots and lineage notes. Input: automatically generated or selected MT VCF. Output: haplogroup + quality score + mutation list. Installation: Tools → Download Haplogrep 3.

HLA typing (HLA*LA) HLA*LA HLA-A/B/C DRB1

Determines HLA alleles for the classical MHC genes (HLA-A, -B, -C, -DRB1, -DQB1, -DPB1 and others) directly from the loaded BAM file. HLA*LA uses a population reference graph (PRG_MHC_GRCh38_withIMGT) for highly accurate typing even from standard WGS without a separate HLA enrichment step.

Prerequisites: HLA*LA installed (Tools → HLA*LA), PRG graph downloaded (References → HLA reference), a GRCh38-aligned BAM file with index (.bai). Runtime: 20–60 minutes.

Output: <SampleID>_HLA_typing.txt in the output directory. The report documents alleles, quality scores and technical evidence. Medical use belongs in specialist interpretation.

HLA analysis (T1K) T1K HLA Concordance

T1K complements HLA*LA as an independent second evidence layer. Genome does not use T1K-HLA as a replacement for HLA*LA, but for concordance checking: HLA*LA G-groups remain primary, and T1K results are normalized and compared without regard to allele order. Discrepancies are shown as technical evidence, not automatically as a clinical statement.
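
An order-independent comparison after normalization could look like the sketch below. It is deliberately simplified: alleles are reduced to their first two fields, which is only an approximation of G-group equivalence, and the functions normalize and concordant are hypothetical, not Genome's implementation.

```python
import re


def normalize(allele, fields=2):
    """Reduce an allele name to 'GENE*ff:ff', e.g. 'HLA-A*02:01:01G' -> 'A*02:01'."""
    name = allele.strip().upper()
    if name.startswith("HLA-"):
        name = name[4:]
    gene, _, rest = name.partition("*")
    # Keep the leading fields and drop trailing letters such as G, P or N.
    parts = [re.sub(r"\D+$", "", p) for p in rest.split(":")[:fields]]
    return f"{gene}*{':'.join(parts)}"


def concordant(hla_la_pair, t1k_pair):
    """True if both tools report the same two alleles, regardless of order."""
    return sorted(map(normalize, hla_la_pair)) == sorted(map(normalize, t1k_pair))
```

Sorting the normalized pairs makes the comparison independent of which allele a tool happens to list first.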

KIR genotyping (T1K) T1K KIR HLA context

T1K can additionally type KIR genes. Genome handles KIR separately from HLA: KIR gene status, KIR-HLA context and technical evidence appear as their own section in the medical genomics report. Missing or low evidence stays visible and is not hidden.

LPA analysis (KILDA) KILDA LPA KIV-2

KILDA analyses LPA/KIV-2 context from local inputs. Genome shows raw values, technical evidence and interpretation separately: values like quantile=NA are kept as raw values instead of being replaced by placeholders or clinical simplifications. Result rows stay compact and verifiable.

Pharmacogenetics (Aldy4) Aldy4 PGx Diplotypes

Aldy4 analyses complex pharmacogenetic genes and produces diplotype/allele information for PGx reports. Genome uses Aldy4 exclusively in the pharmacogenetics context; HLA, KIR or general medical genomics are not mixed in. The PDF report separates variant basis, diplotype, gene-drug context and cautious interpretation.

Repeat expansions (ExpansionHunter) ExpansionHunter Repeat CI

ExpansionHunter evaluates defined repeat loci. Genome separates repeat records from small-variant output and documents allele sizes, read evidence and confidence intervals. Locus-specific limits stay visible; a repeat finding does not replace diagnostics.

PGS score determination PGS Catalog Score SNP

PGS Catalog files can be used for score determination. Genome documents the score file, variant basis, considered markers and limits. PGS results are population- and file-dependent risk contexts, not a diagnosis.
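
For orientation, a polygenic score is commonly computed as the sum of effect weight times effect-allele dosage over all markers. The sketch below assumes that additive model and a simplified input (rsID, effect allele, weight per marker; genotypes as two-letter strings). The function polygenic_score is hypothetical, and strand handling and allele matching are left out.

```python
def polygenic_score(weights, genotypes):
    """Sum effect_weight * effect-allele dosage over the markers present.

    weights:   iterable of (rsid, effect_allele, effect_weight)
    genotypes: dict mapping rsid -> genotype string such as 'AG'
    Returns (score, markers_used, missing_rsids).
    """
    score, used, missing = 0.0, 0, []
    for rsid, effect_allele, weight in weights:
        genotype = genotypes.get(rsid)
        if genotype is None:
            missing.append(rsid)  # keep missing markers visible
            continue
        # Dosage = number of copies of the effect allele (0, 1 or 2).
        score += weight * genotype.count(effect_allele)
        used += 1
    return score, used, missing
```

Returning the missing markers alongside the score mirrors the idea that the marker basis and its limits are documented together with the result.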

Unmapped reads

Extracts reads that were not aligned against the reference genome. Possible causes: non-human DNA (bacteria, viruses), sequencing errors, very short reads, sequences in reference gaps. Output: a FASTQ file with unmapped reads. Useful for metagenomic analysis (Kaiju, CosmosID).

VCF analysis

Annotate VCF bcftools

Sets variant IDs in the scheme CHROM:POS:REF:ALT via bcftools annotate. If no VCF is present in the output directory, variant calling from the loaded BAM is performed first automatically (bcftools mpileup | call). Output: _annotated.vcf.gz.
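
To show what the resulting IDs look like, here is a plain-Python sketch that applies the CHROM:POS:REF:ALT scheme to a single VCF data line. It only illustrates the scheme; the app itself is described as using bcftools annotate, and set_variant_id is a hypothetical helper.

```python
def set_variant_id(vcf_line):
    """Fill the ID column (3rd field) with CHROM:POS:REF:ALT; pass header lines through."""
    if vcf_line.startswith("#"):
        return vcf_line
    cols = vcf_line.rstrip("\n").split("\t")
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    cols[2] = f"{cols[0]}:{cols[1]}:{cols[3]}:{cols[4]}"
    return "\t".join(cols)
```

For a multi-allelic record the ALT field is copied as written, so the ID then contains the comma-separated alternative alleles.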

Filter VCF bcftools

Filters VCF variants by quality criteria: QUAL≥20 and read depth DP≥10 via bcftools view. If no VCF is present, variant calling is run first. Output: _filtered.vcf.gz.
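
The two thresholds can be expressed as a simple per-record check. This sketch assumes the depth is taken from the DP key of the INFO column and treats records with a missing QUAL or DP as failing; both choices are assumptions, and passes_filter is a hypothetical name.

```python
def passes_filter(vcf_line, min_qual=20.0, min_dp=10):
    """True if a VCF data line has QUAL >= min_qual and INFO/DP >= min_dp."""
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 8 or cols[5] == ".":
        return False  # no usable QUAL value
    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    try:
        depth = int(info["DP"])
        qual = float(cols[5])
    except (KeyError, ValueError):
        return False
    return qual >= min_qual and depth >= min_dp
```

Both conditions must hold, so a high-quality call at low depth is dropped just like a deep call with low quality.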

Variant QC (VarQC) Ts/Tv bcftools

Computes quality metrics of a VCF via bcftools stats: Ts/Tv ratio (target WGS: 2.0–2.1), SNP/indel ratio, heterozygosity rate, variants per chromosome. If no VCF is present, variant calling is run first. Output: _stats.txt.
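
The Ts/Tv ratio itself is easy to reproduce: transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes, everything else is a transversion. The sketch below computes the ratio from a list of (REF, ALT) pairs for biallelic SNPs; the function names are illustrative, and the app itself relies on bcftools stats.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A<->G and C<->T are transitions; all other SNP changes are transversions."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Compute Ts/Tv from an iterable of (ref, alt) single-base pairs."""
    snps = [(r.upper(), a.upper()) for r, a in snps if r.upper() != a.upper()]
    ts = sum(1 for r, a in snps if is_transition(r, a))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")
```

A ratio well below the WGS target range given above usually points to an excess of false-positive calls, since random errors produce transversions twice as often as transitions.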

🔍 SNP search

Fast rsID lookup in a genotype file. Load a TXT file (CombinedKit.txt, 23andMe, etc.), enter rsIDs and instantly get the genotypes, chromosomes and positions. Includes template management and a filter for cleaning up pasted lists.

Load SNP file

The SNP file (TXT format) is selected in the 'Directories' tab, just like the BAM/CRAM file. Supported formats: CombinedKit.txt (all platforms), 23andMe TXT, AncestryDNA TXT, or other tab-separated formats with rsid/position/genotype columns. The app shows the number of SNPs in the file after loading.

Enter rsIDs rs format

Enter one or more rsIDs into the text field, one per line. Format: rs123456 or simply 123456. You can also copy and paste from other programs; the filter cleans up extra characters automatically. The counter shows how many rsIDs you have entered.

Filter button Regex

The filter button appears automatically when you enter text with special characters or whitespace. One click extracts all rsID patterns (rs + digits) from the text and removes everything else, perfect for cleaning up copied lists with extra spaces or commas.
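
The clean-up step amounts to a regular-expression extraction. The following sketch assumes the pattern "rs followed by digits", matched case-insensitively; removing duplicates and lower-casing are additional assumptions, and both function names are hypothetical.

```python
import re

RSID_PATTERN = re.compile(r"rs\d+", re.IGNORECASE)


def extract_rsids(text):
    """Return all rsID patterns in `text`, lower-cased, in first-seen order."""
    result = []
    for match in RSID_PATTERN.findall(text):
        rsid = match.lower()
        if rsid not in result:  # drop repeated entries
            result.append(rsid)
    return result


def normalize_entry(entry):
    """Accept 'rs123456' or bare digits '123456' and return the rs-prefixed form."""
    entry = entry.strip().lower()
    return f"rs{entry}" if entry.isdigit() else entry
```

Pasting "rs123, rs456 ;rs123" through extract_rsids leaves two clean entries, one per rsID.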

Run search O(1) lookup

Click 'Search' to start. The app searches the loaded SNP file for an exact match with each rsID. Fast indexing: the SNP file is converted once into a lookup table, so even large files (100,000+ SNPs) are searchable in milliseconds.
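
The one-time conversion into a lookup table can be pictured as building a dictionary keyed by rsID, after which each query is a constant-time hash lookup. This is a minimal sketch assuming tab-separated lines with rsid, chromosome and position followed by one genotype column or two allele columns; build_index and search are hypothetical names.

```python
def build_index(lines):
    """Map rsID -> (chromosome, position, genotype) from tab-separated lines."""
    index = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 4 or parts[0].lower() == "rsid":
            continue  # skip malformed rows and the column-name row
        # Two separate allele columns are joined into one genotype string.
        index[parts[0]] = (parts[1], parts[2], "".join(parts[3:]))
    return index


def search(index, rsids):
    """Split the queried rsIDs into found records and a list of misses."""
    found = {r: index[r] for r in rsids if r in index}
    missing = [r for r in rsids if r not in index]
    return found, missing
```

Building the dictionary costs one pass over the file; every later search only touches the entries that are asked for.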

Show results TSV export

Found rsIDs are shown in a table with columns: rsID | chromosome | position | genotype. rsIDs not found are listed in a separate area. The copy button lets you copy all results as tab-separated values (TSV) to the clipboard, perfect for pasting into Excel or other programs.

Save templates Persistence SF Symbols

Save frequently searched rsID lists as templates with a name (e.g. 'My Ancestry SNPs', 'Health panel'). Each template can have an individual SF Symbol icon and a note. Templates appear in the dropdown picker at the top. Built-in templates are predefined and cannot be deleted.

Manage templates Edit Delete Import/Export

Each template has options to load, edit and delete. 'Load' fills the search field with the saved rsIDs. 'Edit' opens a form to change name, icon and rsID list. Notes are saved automatically (debounced). Import/Export allows sharing templates as a TXT file.

💡

Tip: save frequently used rsID lists as templates. This saves time on repeated searches across different files. The filter is especially useful when copying lists from websites or PDFs that contain extra spaces. With Import/Export you can share templates with others.

Reports

Genome produces PDF-first reports from the inputs that are actually present. Medical genomics and pharmacogenetics stay separate; raw values, technical evidence, tool versions, references and cautious interpretation are not mixed.

Medical genomics PDF HLA KIR LPA Repeat

The medical genomics report bundles HLA*LA, T1K-HLA concordance, T1K-KIR, KILDA/LPA, ExpansionHunter, SNP/ClinVar/PGS context, mtDNA/Haplogrep and optional EBV/microbiome evidence. Sections appear only when matching inputs and artifacts are present.

Pharmacogenetics Aldy4 PharmCAT CPIC DPWG

The pharmacogenetics report stays PGx-only: Aldy4 diplotypes, PharmCAT/SNP rules, CPIC/DPWG/PharmGKB context and cautious drug notes. HLA, KIR, LPA and repeat raw appendices are not mixed into the PGx report.

Evidence and provenance Provenance References Tool versions

Every report documents the data basis, files used, references, tool versions, warnings and limits. Technical raw values stay visible so a result remains traceable and verifiable.

Genome phrases things cautiously: reports are structured analysis and context, not a diagnosis. Clinical decisions belong to qualified physicians; unclear or missing evidence is shown as such.

🔧 Tools

Bioinformatic command-line tools and reference files that Genome uses. Installation and detection run via the Tools tab; large reference data such as the HLA*LA graph, PGS Catalog files and microarray panels live in the Reference Library.

Homebrew Base

Package manager for macOS. Installed automatically under /opt/homebrew (Apple Silicon) or /usr/local (Intel) if not present. Homebrew manages all the other bioinformatic tools. After installation, 'brew update && brew upgrade' can be run manually.

samtools samtools

Standard tool for BAM/SAM processing. Used for: sorting and indexing BAM, extracting reads (view), computing pileup, BAM→FASTQ conversion, coverage analysis. Version 1.18+ recommended. Check with 'samtools --version'.

bcftools bcftools

Variant calling and VCF processing. Used for: mpileup (pileup creation), call (variant calling), view (filter/convert VCF), annotate (annotation), stats (quality statistics). Often installed together with htslib.

bwa / bwa-mem2 bwa bwa-mem2

Burrows-Wheeler Aligner for short-read alignment (Illumina). bwa mem: standard algorithm for reads >70 bp. bwa-mem2: ~3× faster variant with identical output. On Apple Silicon bwa-mem2 is preferred automatically. For alignment: bwa index is required (once per reference genome).

fastp fastp

Fast FASTQ quality-control and preprocessing tool. Features: adapter detection and trimming, quality trimming, length filtering, duplicate removal, GC analysis, interactive HTML report. Speed: ~500 MB/s on M-series processors.

FastQC FastQC Java

Java-based FASTQ analysis tool with a detailed HTML report. Good for a first quality check before alignment. Slower than Fastp. Requires a Java Runtime Environment (JRE), installed via Homebrew (java@21 or newer).

sambamba sambamba

Multithreaded BAM processing. Used as an alternative to samtools for markdup (duplicate marking) in the FASTQ→BAM pipeline. Up to 4× faster than samtools markdup on multi-core systems. Optional; samtools markdup is used as a fallback.

Haplogrep 3 Haplogrep 3 Phylotree 17.2 Java 11+

MT haplogroup classification with Haplogrep 3 (genepi/haplogrep3, version 3.2.2). Supports several phylotrees (phylotree-rcrs@17.2, phylotree-fu-rcrs@1.2, etc.) and, with --extend-report, provides additional columns on polymorphisms and hotspots. Installed as a complete directory (haplogrep3.jar + data/) under haplogrep3/ in the Reference Library (~50 MB). Requires Java 11 or newer. CLI: java -jar haplogrep3.jar classify --in X --tree phylotree-rcrs@17.2 --out Y.

Genome uses only Haplogrep 3 for MT analysis.

HLA*LA HLA*LA Graph genome IMGT

HLA typing tool from Dilthey Lab (github.com/DiltheyLab/HLA-LA). Determines HLA alleles for the classical genes HLA-A, -B, -C, -DRB1, -DQB1, -DPB1 and more directly from the WGS BAM. Method: graph-genome approach with the PRG_MHC_GRCh38_withIMGT reference graph.

Installation: Homebrew dependencies (boost@1.85, bamtools), then a source build via make (~30 minutes, ~500 MB). A Boost patch is applied automatically. The binary ends up under the configured tool directory in HLA-LA/bin/HLA-LA.

Also required: the PRG_MHC_GRCh38_withIMGT reference graph (~2.3 GB) downloaded under References → HLA reference.

Output: file <SampleID>_HLA_typing.txt with HLA alleles in standard IMGT format (e.g. A*01:01, B*07:02). Runtime: 20–60 minutes.

T1K T1K HLA KIR

T1K is used for two separate Genome analyses: HLA analysis as a second evidence layer to HLA*LA, and KIR genotyping. HLA results are compared with HLA*LA; KIR stays its own result block. Genome shows technical evidence and concordance without mixing HLA and KIR into a diagnosis.

KILDA KILDA LPA KIV-2

KILDA provides the LPA analysis for KIV-2/LPA context. Genome takes raw values and tool limits visibly into the report. Values like quantile=NA are not smoothed over but documented as technical information.

Aldy4 Aldy4 PGx Diplotypes

Aldy4 is used for pharmacogenetics, especially for complex PGx genes and diplotypes. Genome uses Aldy4 output only in the PGx report and keeps it separate from HLA, KIR, LPA and repeat analyses.

ExpansionHunter ExpansionHunter Repeat expansions

ExpansionHunter analyses repeat expansions at defined loci. Genome takes repeat records, allele sizes, read evidence and confidence intervals and keeps small-variant output separate.

PharmCAT PharmCAT CPIC Java

Pharmacogenomics analysis tool from PharmGKB/Stanford. Analyses pharmacogenomically relevant variants and provides CPIC-oriented guidance. Input: normalized VCF. Output: HTML and JSON report. Stored as pharmcat.jar (~30 MB) in the Reference Library.

GATK (optional) GATK HaplotypeCaller Java

Genome Analysis Toolkit from the Broad Institute, widely regarded as a reference standard for germline variant calling. Its HaplotypeCaller algorithm (local de-novo assembly) usually yields 10–15% more variants than bcftools, especially indels and variants in complex regions. Downside: 6–12 hours of compute time for 30× WGS. GATK is optional; bcftools is used by default and is significantly faster. Stored as gatk.jar (~670 MB) in the Reference Library.

Install all

Installs the Genome-managed tools at once: base tools (samtools, bcftools, bwa/bwa-mem2, fastp, FastQC, sambamba) plus specialist tools such as KILDA, Aldy4, ExpansionHunter, T1K and HLA*LA resources, as far as they are managed in the current release. Requires an internet connection. Homebrew is installed first if needed. Progress and errors appear in the log.

Detect

Checks which tools are already installed and updates the status indicator. Useful after a manual installation via Terminal. Runs 'which <tool>' and '<tool> --version'.
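
In Python terms, the same check corresponds roughly to a PATH lookup followed by a version call. The sketch below uses shutil.which and subprocess.run from the standard library; detect_tool and its return shape are hypothetical, and not every tool necessarily accepts the same version option.

```python
import shutil
import subprocess


def detect_tool(name, version_flag="--version"):
    """Return {'path', 'version'} for a tool on PATH, or None if it is not found."""
    path = shutil.which(name)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, version_flag], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return {"path": path, "version": None}
    # Some tools print their version on stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else None
    return {"path": path, "version": first_line}
```

Usage: detect_tool("samtools") returns the resolved path and the first line of the version output when samtools is installed, and None otherwise.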

💡

Tools can also be installed manually in the Terminal. Afterwards click 'Detect' in the Tools tab so Genome updates version, path and status. For specialist tools, the Genome-managed runtime/tool path is what counts; plain PATH finds do not automatically replace missing resources such as the HLA*LA graph, PGS Catalog files or microarray panels.

When uninstalling tools the app checks the exit code (brew/pip) or the file-deletion success. Failed uninstalls are reported via the error bar and the tool stays marked as installed.

📦 References

Reference genomes and microarray panels are managed in the Reference Library. Download and management happen directly in the app.

Microarray panels

Panel overview

Panels are build-specific: hg38 panels for hg38/hs38 BAMs, hg19 panels for hg19/hs37d5 BAMs. They contain SNP coordinates of common chip platforms and are used for SNP capture. Depending on the panel, compact exports of roughly 2 million or 25 million SNPs are produced. File formats: .tab.gz (tab-separated, faster) or .vcf.gz (VCF format). Stored in the Reference Library directory.

Reference genomes

hs38 (GRCh38 no-alt) Recommended

GRCh38 without alternative contigs from NCBI (~832 MB compressed, ~3 GB unpacked). Standard in the 1000 Genomes Project and WGS Extract. Recommended for alignment and extraction, fewer mapping artifacts than hg38 with alt contigs. Stored locally as hs38.fa.gz → unpacked automatically after download to hs38.fa.

hs38d1 (GRCh38 + decoys) Recommended for WGS

GRCh38 with decoy contigs from NCBI (~871 MB compressed, ~3.1 GB unpacked). Contains all chromosomes plus artificial decoy sequences (hs38d1) that catch reads which do not correspond to a real chromosome (e.g. viral, bacterial or repetitive sequences). Advantages over hs38: cleaner alignments, fewer false-positive variants, slightly smaller BAMs. Recommended for WGS alignment when best possible quality is desired. Also used by WGS Extract.

GRCh38 / hg38

Current human reference genome from UCSC (~983 MB compressed). Contains main assembly + alternative sequences. Chromosome names with 'chr' prefix (chr1, chrX, chrY, chrM). For BAMs already aligned against hg38.

GRCh37 / hg19

Older human reference genome (~938 MB). Chromosome names without prefix (1, X, Y, MT). Many older WGS datasets use this build. Microarray extraction with an hg19 panel recommended.

hs37d5 (1000 Genomes)

hg19-based genome with decoy contigs (~906 MB). Common with commercial WGS providers (Dante Labs, Nebula Genomics). Contains the 'hs37d5' contig for reads that do not match a real chromosome. Optimized for microarray extraction of commercial WGS files.

HLA reference

PRG_MHC_GRCh38_withIMGT GRCh38 IMGT/HLA ~2.3 GB

Population reference graph for HLA*LA. Contains pre-built graph structures for the MHC region based on GRCh38 + the IMGT/HLA allele database. Needed for HLA typing in the Analysis tab.

Size: ~2.3 GB. Stored under <tool-directory>/HLA-LA_PRG/. Download from Zenodo. Without this graph HLA typing does not run.

PGS Catalog files PGS Catalog Score SNP

PGS Catalog files describe score definitions and marker lists for PGS score determination. Genome stores them in the Reference Library and documents in the PDF report the score file, the variant basis used, missing markers and limits. PGS scores are population- and file-dependent risk contexts, not a diagnosis.

After downloading, reference genomes are indexed automatically with samtools faidx (.fai). This step takes 2–5 minutes and only has to be done once per genome. Aborting during the download or indexing can lead to corrupt files; in that case delete the file and download again.

When deleting references or panels the app checks the deletion success. If a file cannot be removed (e.g. missing permissions), an error bar appears and the item stays marked as installed.

Log

The log shows all executed commands, progress and errors in real time.

Real-time output

Every executed shell command is shown with its full output text. Color coding: normal text = stdout, red entries = stderr/errors. Progress-bar output is updated as a running line.

Copy

Copies the entire visible log content to the clipboard. Useful for error reports or debugging. The content includes all timestamps and commands of the current session.

Clear display

Clears the log display in the app (the visible area). The physical log file under ~/Library/Application Support/Genome/logs/ remains fully intact.

Log files

Each app session is saved automatically as a log file: ~/Library/Application Support/Genome/logs/genome_YYYY-MM-DD_HHmmss.log. The last 20 sessions are kept, older ones deleted automatically. Reachable in Finder via: Go → Library → Application Support → Genome → logs.
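
The naming scheme and the retention of the last 20 sessions can be sketched as follows. Because the timestamp in the file name sorts lexicographically in chronological order, a plain sort is enough to find the oldest files. The functions log_name and prune_logs are hypothetical and only illustrate the rule described above.

```python
from datetime import datetime
from pathlib import Path


def log_name(when):
    """Build a session log file name in the genome_YYYY-MM-DD_HHmmss.log pattern."""
    return when.strftime("genome_%Y-%m-%d_%H%M%S.log")


def prune_logs(log_dir, keep=20):
    """Delete all but the newest `keep` session logs; return the removed names."""
    logs = sorted(Path(log_dir).glob("genome_*.log"))  # oldest first
    stale = logs[:-keep] if keep > 0 else logs
    for path in stale:
        path.unlink()
    return [p.name for p in stale]
```

With fewer than 20 logs present nothing is removed.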

Debug logging

Can be enabled in Settings → Debug logging. Additionally shows internal states, parsing results and decision logic. Recommended only for error analysis, since heavy output slows down the display.

Run history

History of analyses

The run history logs all completed analyses with type, date, duration, success/failure and full log. The last 100 runs are stored in ~/Library/Application Support/Genome/run_history.json.

Run types

Recorded types: Alignment, Extraction, Microarray, Haplogroup, LPA, Other.

Settings

General app settings and advanced options in the developer menu.

Appearance & language

Color scheme: System (follows macOS), Light or Dark. Language: System (follows macOS), German or English. Both settings are applied and saved immediately.

Alert sound

When enabled (default: on), the app plays an alert sound when a running process unexpectedly slows down. Helps to spot problems such as I/O timeouts or SSD sleep without constantly watching the screen.

During processing, macOS sleep is prevented automatically (idle sleep, disk sleep and system sleep). Pipelines run without interruption, even with the lid closed or an expired idle timer. No configuration needed.

Developer menu

Enable developer menu

The developer menu can be enabled in Settings. It shows advanced options: pipeline tool selection, test-data generator, dock-icon settings and debug logging. The accent color switches to blue as a visual cue.

Pipeline tool selection bwa GATK sambamba

Selection of pipeline components: aligner (bwa / minimap2), sorter (samtools / sambamba), markdup (samtools / sambamba / picard), variant caller (bcftools / GATK). GATK takes 3–6× longer but finds 10–15% more variants. Sambamba is up to 40% faster than samtools on multi-core systems.

Test-data generator

Generates a synthetic mini dataset (100 kb reference + 5,000 read pairs) for a quick function check. The full pipeline then takes seconds instead of hours. Useful for testing all functions without real WGS data. Data is stored under ~/GenomeTest/.

Dock icon

The app icon in the Dock and the app overview can be set to Light, Dark or Auto independently of the system appearance.

🩺 Troubleshooting

Common problems and their solutions. For persistent issues, use the full log (copy button) for the analysis.

Error: no index file

Problem: 'No index file found', extraction does not start. Solution: run samtools index <file.bam> in the Terminal. For CRAM: samtools index <file.cram>. The index (.bai/.crai) must be in the same directory as the BAM/CRAM file.
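
Whether an index sits next to the alignment file can be checked before starting an extraction. The following Python sketch assumes the common naming conventions (file.bam.bai or file.bai, file.cram.crai or file.crai); it does not cover .csi indexes, and it is not necessarily the check the app itself performs.

```python
from pathlib import Path

def index_candidates(alignment_path):
    """List index file paths commonly expected next to a BAM or CRAM file."""
    path = Path(alignment_path)
    suffix = path.suffix.lower()
    if suffix == ".bam":
        # samtools index writes file.bam.bai; file.bai is a common alternative.
        return [Path(str(path) + ".bai"), path.with_suffix(".bai")]
    if suffix == ".cram":
        return [Path(str(path) + ".crai"), path.with_suffix(".crai")]
    raise ValueError(f"not a BAM or CRAM file: {path.name}")

def has_index(alignment_path):
    """True if any of the expected index files exists in the same directory."""
    return any(candidate.exists() for candidate in index_candidates(alignment_path))
```

If has_index() returns False, running samtools index on the file as described above creates the missing index.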

Error: reference genome missing

Problem: 'Reference genome not found' or CRAM cannot be opened. Solution: References → download the matching reference genome. Make sure the Reference Library points to the correct directory (Directories → Reference Library). CRAM needs exactly the same genome it was aligned against.

Error: tool not found

Problem: 'samtools not found' / 'bcftools not found' / 'bwa not found'. Solution: Tools → Install all. If Homebrew is installed but the tool isn't: run 'brew install samtools bcftools bwa' in the Terminal, then click Tools → Detect. PATH issue: /opt/homebrew/bin must be in the PATH.
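
The PATH check described above can be reproduced outside the app. This is a small Python sketch using only the standard library; the tool list and the Homebrew directory come from the text above, and the function names are hypothetical.

```python
import os
import shutil

REQUIRED_TOOLS = ("samtools", "bcftools", "bwa")
HOMEBREW_BIN = "/opt/homebrew/bin"

def missing_tools(tools=REQUIRED_TOOLS, search_path=None):
    """Return the tools that cannot be found on the given PATH.

    With search_path=None, shutil.which falls back to the PATH
    of the current environment.
    """
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]

def homebrew_on_path(search_path=None):
    """True if the Apple Silicon Homebrew bin directory is a PATH entry."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    return HOMEBREW_BIN in search_path.split(os.pathsep)
```

An empty result from missing_tools() together with homebrew_on_path() returning True suggests that the problem lies elsewhere, for example in the tool directory configured in the app.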

Error: BWA index missing

Problem: 'bwa index not found' during FASTQ→BAM alignment. Solution: the bwa index is created automatically when missing, taking 30–60 minutes for a 3 GB genome. Alternatively manually: 'bwa index /path/to/reference.fa'. Index files (.bwt, .amb, .ann, .pac, .sa) must be in the same directory as the reference genome.

Warning: low coverage

Problem: read depth below 10×, extraction limited. Cause: too few reads, poor sequencing quality, or WES (not WGS). Microarray extraction is possible from ~5×, but many SNPs are output as 'no call'. Y/MT analysis is reliable from ~15×. Check coverage with 'samtools coverage <file.bam>'.

CRAM: wrong reference genome

Problem: CRAM file does not open or returns empty output. Cause: the reference genome in the Reference Library does not exactly match the original alignment genome. Solution: ask the provider for the exact MD5 checksum of the alignment genome. For Dante Labs: hs37d5. For Nebula: hg38.
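
CRAM headers usually record a per-contig checksum in the M5 tag of each @SQ line (visible with samtools view -H), so a candidate reference can be compared contig by contig. The sketch below computes such checksums from FASTA text in Python, following the SAM convention of removing whitespace and upper-casing before hashing. It reads the whole text into memory, so it is meant for illustration on small inputs rather than for a full genome; the function names are hypothetical.

```python
import hashlib

def sequence_md5(sequence):
    """MD5 of a sequence in M5 style: whitespace removed, upper-cased."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

def fasta_md5s(fasta_text):
    """Map each contig name in FASTA text to its M5-style checksum."""
    sequences = {}
    name = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The contig name is the first word after '>'.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {contig: sequence_md5("".join(lines)) for contig, lines in sequences.items()}
```

A contig whose computed checksum differs from the M5 value in the CRAM header indicates that the reference in the Reference Library is not the one used for the original alignment.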

Haplogrep 3 does not start

Problem: MT haplogroup cannot be computed. Solution: Tools → install Haplogrep 3. Make sure the Reference Library is set. Java must be installed. Check manually: haplogrep3 --help.

HLA*LA: error during installation or typing

Installation failed: (1) check whether Xcode Command Line Tools are installed: 'xcode-select --install'. (2) check make errors in the log, often Boost include paths are missing. (3) try again: trash button → reinstall.

HLA*LA not found after installation: click 'Detect'. The binary is under <tool-directory>/HLA-LA/bin/HLA-LA.

Typing failed / PRG missing: References → HLA reference → download PRG_MHC_GRCh38_withIMGT (~2.3 GB).

Typing failed / BAM error: the BAM file must be GRCh38-aligned and indexed (.bai). Y-only or MT-only BAMs are not supported.

No output files

Problem: extraction runs through but no files in the output directory. Possible causes: (1) output directory set incorrectly, check Directories. (2) no write permissions in the output directory. (3) the BAM contains no reads for the chosen region (e.g. no Y chromosome in a female sample). Check the log for error messages.

I/O timeout on an external SSD

Problem: the pipeline aborts with 'Operation timed out' or a 'bgzf_read' error, especially on external USB SSDs. Cause: the SSD goes to sleep or the USB connection is briefly interrupted. Solution: on I/O timeouts a retry dialog appears. 'Retry' re-attempts the step. Prevent SSD sleep: System Settings → Energy → turn off 'Put hard disks to sleep'. For long pipelines: use an internal SSD or connect the external SSD directly (no hub).

Process very slow

Normal times: FASTQ→BAM 30× WGS ~24–48 hours on Apple Silicon, microarray extraction 30× WGS ~20–60 minutes, reference-genome download ~5–30 minutes. Speed-up: install bwa-mem2 instead of bwa (up to about 3× faster), sambamba for markdup, SSD for the Reference Library and temp directory. Check for processor throttling under heat: 'sudo powermetrics --samplers cpu_power -n 1' in the Terminal.

💡

For unclear errors: Log → Copy → paste the full log text into a text editor. The exact failing command and the error message are always directly below the executed command.

📖 Terms & concepts

Explanation of the most important bioinformatic terms.

BAM / CRAM / SAM

Standard formats for aligned sequencing data. SAM (Sequence Alignment/Map): text-based, human-readable. BAM: binary, compressed SAM (~25% of the size). CRAM: even more strongly compressed (needs the reference genome to unpack, ~60% smaller than BAM). BAM and CRAM need an index file (.bai / .crai) for fast access to specific genome regions; plain-text SAM is not indexed.

FASTQ

Raw format for sequencing reads with quality values. Each read consists of 4 lines: name, sequence, '+', quality values (Phred score, encoded as ASCII). Paired-end: R1 (forward read) + R2 (reverse read) in two files. Typical sizes: 30× WGS ~100–150 GB per file.
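
The 4-line record structure can be illustrated with a small reader. This Python sketch assumes plain, unwrapped FASTQ (exactly four lines per read) and loads all lines into memory, so it suits small test files rather than a full 30× dataset; the function name is hypothetical.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at non-empty line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Drop the leading '@' from the read name.
        yield header[1:], sequence, quality
```

Each yielded tuple holds the read name, the bases and the ASCII-encoded quality string of equal length.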

VCF

Variant Call Format, lists all deviations found from the reference genome. Contains: CHROM, POS, ID (rsID), REF (reference allele), ALT (alternative allele), QUAL (quality score), FILTER, INFO, FORMAT, sample genotype. Compressed as .vcf.gz with a Tabix index (.tbi) for fast access.
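
The column layout listed above maps directly onto a tab-separated line. The following Python sketch splits one data line into its fixed columns and, when present, pairs the FORMAT keys with the values of the first sample. It is an illustration only; the function name is hypothetical, and for real files a dedicated VCF library or bcftools is the safer choice.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

def parse_vcf_line(line):
    """Split one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\r\n").split("\t")
    if line.startswith("#") or len(fields) < len(VCF_COLUMNS):
        raise ValueError("not a VCF data line")
    record = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    # Several alternative alleles are separated by commas.
    record["ALT"] = record["ALT"].split(",")
    # FORMAT (column 9) and the first sample (column 10) are optional.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["SAMPLE"] = dict(zip(keys, fields[9].split(":")))
    return record
```

For a made-up line such as "chr1, 12345, rs_example, A, G, 50, PASS, DP=30, GT:DP, 0/1:30" (tab-separated), record["SAMPLE"]["GT"] is "0/1", a heterozygous call.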

SNP / rsID / indel

SNP (Single Nucleotide Polymorphism): a single base variation (e.g. A→G). rsID: a unique identifier from the NCBI dbSNP database (e.g. rs1805007 = MC1R red-hair variant). Indel: insertion or deletion of one or more bases. Microarray chips mainly measure known SNPs.

Haplogroup

A group of genetically related individuals with a common ancestor. Y haplogroups (paternal line): A to T (PhyloTree Y). MT haplogroups (maternal line): A to Z + subgroups (PhyloTree MT). Nomenclature: R1b1a1a2a1a1 = R1b-L11 = Western European branch. Deeper labels = more precise ancestry.

Coverage / read depth

Average number of reads covering a position. WGS standard values: 30× (standard, good for all applications), 15× (sufficient for microarray extraction), <10× (low, many no-calls). Computable with 'samtools coverage' or 'samtools depth'. Formula: coverage = (number of reads × read length) / genome size.
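
The coverage formula can be checked with a quick calculation. The Python sketch below uses a genome size of 3.2 billion bases; the read length of 150 bp in the usage note is an assumed example value and should be replaced with the values of the actual run.

```python
import math

def mean_coverage(read_count, read_length, genome_size):
    """coverage = (number of reads x read length) / genome size"""
    return read_count * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)
```

With these assumptions, mean_coverage(640_000_000, 150, 3_200_000_000) gives 30.0, so roughly 640 million reads of 150 bp correspond to 30× coverage. The result is a theoretical mean; unmapped reads and duplicates lower the usable depth.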

Genome build

Version of the reference genome: GRCh38/hg38 (current since 2013), GRCh37/hg19 (2009), hs37d5 (hg19+decoys). Chromosome coordinates differ between builds, so an hg19 BAM cannot be used directly with an hg38 panel. The build is read automatically from the BAM header.
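
One common way to tell builds apart from a header is the length of chromosome 1 in the @SQ lines: 248,956,422 bases in GRCh38 and 249,250,621 in GRCh37. The Python sketch below applies this heuristic to header text as printed by samtools view -H. It is an illustration and not necessarily how the app detects the build; the function names are hypothetical.

```python
# Chromosome 1 lengths of the two common human builds.
CHR1_LENGTHS = {
    248956422: "GRCh38/hg38",
    249250621: "GRCh37/hg19",
}

def parse_sq_lines(header_text):
    """Return {contig name: length} from the @SQ lines of a SAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields after the record type have the form TAG:VALUE.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs

def guess_build(header_text):
    """Guess the genome build from the length of chromosome 1."""
    contigs = parse_sq_lines(header_text)
    for name in ("chr1", "1"):
        if name in contigs:
            return CHR1_LENGTHS.get(contigs[name], "unknown")
    return "unknown"
```

The heuristic cannot distinguish variants of the same build (for example hg19 and hs37d5); for that, the contig names and the presence of decoy contigs have to be inspected as well.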

Ts/Tv ratio

Ratio of transitions (purine→purine: A↔G, or pyrimidine→pyrimidine: C↔T) to transversions (purine↔pyrimidine: A/G↔C/T). Expected WGS: 2.0–2.1. WES: 2.5–3.0 (the exome contains more CpG sites). Deviations point to sequencing problems or alignment errors.
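
The definition above translates into a short calculation over REF/ALT pairs. This Python sketch counts only single-base substitutions and skips indels; the function names are hypothetical, and bases other than A, C, G and T (such as N) would be counted as transversions, so they should be filtered out beforehand.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T, i.e. a change within the same base class."""
    ref, alt = ref.upper(), alt.upper()
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)

def ts_tv_ratio(snvs):
    """Compute Ts/Tv from an iterable of (ref, alt) pairs."""
    transitions = transversions = 0
    for ref, alt in snvs:
        if len(ref) != 1 or len(alt) != 1 or ref.upper() == alt.upper():
            continue  # skip indels and non-variant entries
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions, ratio undefined")
    return transitions / transversions
```

For example, the pairs A→G, C→T and A→C contain two transitions and one transversion, giving a ratio of 2.0.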

Phred score / quality value

Logarithmic error-probability value per base: Q20 = 1% error, Q30 = 0.1% error, Q40 = 0.01% error. Illumina standard: ≥Q30 for ≥80% of all bases. Encoded as ASCII in the FASTQ format (offset 33). Fastp/FastQC show the distribution of quality values.
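
The relationship between Phred score and error probability is P = 10^(−Q/10), and a FASTQ quality character is decoded by subtracting the offset 33 from its ASCII code. A minimal Python sketch of both steps, with hypothetical function names:

```python
def error_probability(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_qualities(quality_string, offset=33):
    """Convert an ASCII quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]

def fraction_at_least(quality_string, threshold=30):
    """Fraction of bases with a Phred score at or above the threshold."""
    scores = decode_qualities(quality_string)
    if not scores:
        raise ValueError("empty quality string")
    return sum(score >= threshold for score in scores) / len(scores)
```

The character 'I' (ASCII 73) decodes to Q40, '?' (ASCII 63) to Q30 and '5' (ASCII 53) to Q20, so fraction_at_least("II5?") is 0.75.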

PCR duplicates

Reads with identical start and end position, created by PCR amplification before sequencing. They distort variant calling and coverage statistics. They are identified and marked (not deleted) by samtools markdup or sambamba. A duplicate rate >30% points to library problems.

Decoy contigs hs38d1 hs37d5

Artificial DNA sequences added to the reference genome to catch 'orphan' reads. Sequencing data contains reads from viruses, bacteria, repetitive elements or contamination. Without decoys these reads are mapped incorrectly onto real chromosomes and create false-positive variants. With decoys (e.g. hs38d1) they are correctly aligned to the decoy sequence and do not disturb the analysis. Result: cleaner BAMs, less noise, slightly less multi-mapping.

Supplementary vs. secondary alignments

When a read aligns at several places in the genome (split read/chimeric alignment), there are two kinds of marking:

• Supplementary (FLAG 2048): the shorter alignment fragment is supplementary to the primary one. Modern standard, supported by all current tools.

• Secondary (FLAG 256): the shorter fragment is marked as a secondary alignment (bwa mem option -M). Needed for older tools. Secondary reads contain more data per entry and produce slightly larger BAM files.

Configurable in the Genome app under Conversion → split reads.
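
The two markings are single bits in the SAM FLAG field and can be tested with a bitwise AND. A small Python sketch follows; the function name is hypothetical, and other flag bits such as "unmapped" are deliberately ignored here.

```python
SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment

def alignment_kind(flag):
    """Classify a SAM FLAG value as primary, secondary or supplementary."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"
```

Because FLAG values are sums of bits, 2064 (2048 + 16, a supplementary alignment on the reverse strand) is classified as supplementary, while 99 is a primary alignment.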

CNV

Copy Number Variation, a deviation from the normal diploid copy number (2) of a genome region. Deletions (0–1 copies) and duplications (3+ copies) sometimes affect whole genes. Examples: LPA/KIV-2 CNV for lipoprotein(a) context and CYP2D6 CNV for pharmacogenetics.

Strand convention (plus/minus)

DNA is double-stranded, each base has a complement (A↔T, C↔G). Genotyping platforms can use the plus strand (forward) or the minus strand (reverse) as reference. This means the same SNP can be reported as 'A' (plus strand) or 'T' (minus strand). When comparing data from different sources (e.g. 23andMe vs. WGS extraction), strand conventions must be taken into account. A/T and C/G SNPs are especially ambiguous because plus and minus strands cannot be distinguished.
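
Strand flipping and the ambiguity of A/T and C/G SNPs can be shown in a few lines. In this Python sketch a genotype is flipped by complementing each allele; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_strand(alleles):
    """Return the alleles as they would read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in alleles.upper())

def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose two alleles are each other's complement."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()
```

flip_strand("AG") returns "TC", which is clearly a different pair of letters and so reveals the flip. For an A/T SNP, flipping "AT" yields "TA", the same two letters, which is why the strand cannot be inferred from the alleles alone.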

Liftover

Conversion of genomic coordinates between different reference-genome versions (e.g. hg19 → hg38). Needed when data from different builds is to be compared. The same variant has different position values in hg19 and hg38 because the reference sequence changed between versions (gaps closed, contigs moved). Tools: UCSC LiftOver, CrossMap, Picard LiftoverVcf.

? FAQ

Frequently asked questions.

WGS vs. WES, what is the difference?

WGS (Whole Genome Sequencing): the entire genome (~3.2 billion bases). All regions covered. WES (Whole Exome Sequencing): only coding regions (~1% of the genome). For microarray extraction WGS is preferred; WES yields hardly any non-coding SNPs. Genome detects automatically whether WGS or WES is loaded.

Which reference genome should I use?

hs38d1 (GRCh38 + decoys): best quality for your own WGS alignment, decoy contigs catch noise. hs38 (GRCh38 no-alt): a good alternative without decoys, standard in WGS Extract. hg38: if the BAM is already aligned against hg38. hs37d5: if the BAM is from Dante Labs, Genome Quebec or a similar provider. hg19: if the BAM is from older sequencing labs. The app detects the BAM file's build automatically.

Can I upload the extraction to 23andMe?

No, 23andMe does not accept external files. The extracted files in 23andMe format are suitable for other platforms that read this format: GEDmatch, MyHeritage DNA, FamilyTreeDNA (as a raw file), DNA.Land, GEDmatch Genesis, Promethease, SelfDecode.

Which format for GEDmatch?

CombinedKit (contains all called SNPs) is best suited for GEDmatch, maximum coverage. Alternatively: 23andMe v3 or v5 (fewer SNPs but more widely supported). For GEDmatch Genesis: CombinedKit or AncestryDNA v2 recommended.

Is data uploaded to the cloud? Privacy

No. Genome works entirely locally and sends no sequencing data to external servers. The only network activity is downloads: reference genomes (UCSC/NCBI), the HLA*LA graph (Zenodo), tool installation via Homebrew, and the Haplogrep download from GitHub, all explicitly triggered by the user.

How long does the extraction take?

Guide values for 30× WGS on Apple Silicon: microarray extraction ~20–40 min, Y VCF ~5–10 min, MT VCF ~2–5 min, FASTQ→BAM ~24–48 hours (M4 closer to 24 h, M1 more towards 48 h). On older Intel Macs: significantly longer, not recommended. Main factors: coverage, file size, SSD speed, available CPU cores.

BAM has no chromosome names in the header

Some BAM files have an incomplete header. The app then cannot detect build and chromosome names automatically. Solution: check the BAM header with samtools view -H <file.bam>. If @SQ lines are missing, write a corrected header and apply it with 'samtools reheader <header.sam> <file.bam> > <fixed.bam>'. Note that 'samtools addreplacerg' only adds or replaces read-group (@RG) lines and does not restore @SQ lines.

Why do genotypes differ between platforms? Important

When comparing genotyping data from different sources (e.g. 23andMe vs. WGS extraction), systematic differences occur that are not real biological deviations:

1. Genomic positions: different reference-genome versions (hg19 vs. hg38) use different coordinate systems. The same variant therefore has different position values. A liftover tool can convert between them.

2. Allele order: heterozygous genotypes can be written in any order (AG or GA). This is purely cosmetic, biologically identical.

3. Strand convention: depending on the platform, the plus or minus strand is used as reference. This makes complementary bases appear (A↔T, C↔G) even though the same genotype is meant.

4. Real calling differences: different technologies (SNP array vs. WGS) and their algorithms can lead to differing calls for individual variants.

Bottom line: over 99.7% of genotypes match in content. The visible differences are almost exclusively due to reference genome, strand convention and allele notation.
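
Points 2 and 3 can be made concrete with a comparison that ignores allele order and tolerates a strand flip. The Python sketch below is an illustration with hypothetical function names. It assumes both genotypes already refer to the same variant (point 1 resolved), and it cannot resolve A/T and C/G SNPs, where a flipped homozygous call is indistinguishable from a real difference.

```python
FLIP = str.maketrans("ACGT", "TGCA")

def normalize(genotype):
    """Upper-case and sort the alleles so that AG and GA compare equal."""
    return "".join(sorted(genotype.upper()))

def compare_genotypes(a, b):
    """Compare two genotype strings, allowing for order and strand differences."""
    if normalize(a) == normalize(b):
        return "match"
    if normalize(a.upper().translate(FLIP)) == normalize(b):
        return "match after strand flip"
    return "mismatch"

def concordance(pairs):
    """Fraction of genotype pairs that match directly or after a strand flip."""
    pairs = list(pairs)
    matches = sum(compare_genotypes(a, b) != "mismatch" for a, b in pairs)
    return matches / len(pairs)
```

With this, "AG" vs. "GA" is a match, "AG" vs. "CT" is a match after strand flip, and "AA" vs. "AG" remains a mismatch that points to a real calling difference.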

What does YFull need?

YFull accepts: Y+MT BAM (preferred, build hg38/hs38 recommended) or Y VCF. A male sample is required. The BAM file must be indexed. Genome creates the needed files under Extraction → Y+MT BAM and Extraction → Y VCF. Upload directly at yfull.com.

Further resources

This help is also available directly in the app. For feedback or questions about usage, write to info@pjlabs.dev.