파트 1: 방법론 개요¶

AI 지원 번역 - 자세히 알아보기 및 개선 제안

변이 호출(Variant calling)은 참조 게놈에 대비하여 게놈 서열의 변이를 식별하는 것을 목표로 하는 게놈 분석 방법입니다. 여기서는 전체 게놈 시퀀싱 데이터에서 짧은 배선 변이(short germline variants), 즉 SNP와 indel을 호출하기 위해 설계된 도구와 방법을 사용합니다.

GATK 파이프라인

완전한 변이 호출 파이프라인은 일반적으로 참조 게놈에 대한 매핑(때로는 게놈 정렬이라고도 함)과 변이 필터링 및 우선순위 지정을 포함하여 많은 단계를 포함합니다. 이 과정에서는 간결성을 위해 변이 호출 부분에만 집중하겠습니다.

방법론¶

전체 게놈 시퀀싱 샘플에 변이 호출을 적용하여 배선 SNP와 indel을 식별하는 두 가지 방법을 보여드리겠습니다. 먼저 각 샘플에서 독립적으로 변이를 호출하는 간단한 샘플별 접근법부터 시작하겠습니다. 그런 다음 여러 샘플을 함께 분석하여 더 정확하고 유익한 결과를 생성하는 더 정교한 공동 호출 접근법을 보여드리겠습니다.

두 접근법 모두에 대한 워크플로우 코드를 작성하기 전에, 먼저 일부 테스트 데이터에서 명령을 수동으로 실행해 보겠습니다.

데이터셋¶

다음 데이터와 관련 리소스를 제공합니다:

인간 염색체 20(hg19/b37에서)의 작은 영역과 그 보조 파일(인덱스 및 서열 사전)로 구성된 참조 게놈.
가족 트리오(어머니, 아버지, 아들)에 해당하는 세 개의 전체 게놈 시퀀싱 샘플. 파일 크기를 작게 유지하기 위해 염색체 20의 작은 데이터 조각으로 부분집합화되었습니다. 이것은 이미 참조 게놈에 매핑된 Illumina 단일 리드 시퀀싱 데이터로, BAM 형식(SAM(Sequence Alignment Map)의 압축 버전인 Binary Alignment Map)으로 제공됩니다.
샘플에 변이 호출에 적합한 데이터가 있는 게놈 상의 좌표인 게놈 간격 목록, BED 형식으로 제공됩니다.

소프트웨어¶

관련된 두 가지 주요 도구는 서열 정렬 파일을 조작하는 데 널리 사용되는 툴킷인 Samtools와, Broad Institute에서 개발한 변이 발견을 위한 도구 세트인 GATK(Genome Analysis Toolkit)입니다.

이러한 도구는 GitHub Codespaces 환경에 설치되어 있지 않으므로 Seqera Containers 서비스를 통해 검색한 컨테이너를 통해 사용하겠습니다(Hello Containers 참조).

팁

nf4-science/genomics 디렉토리에 있는지 확인하세요. pwd를 입력할 때 표시되는 경로의 마지막 부분이 genomics이어야 합니다.

1. 샘플별 변이 호출¶

샘플별 변이 호출은 각 샘플을 독립적으로 처리합니다. 변이 호출기는 한 번에 한 샘플의 시퀀싱 데이터를 검사하고 샘플이 참조와 다른 위치를 식별합니다.

이 섹션에서는 샘플별 변이 호출 접근법을 구성하는 두 가지 명령을 테스트합니다: Samtools를 사용한 BAM 파일 인덱싱과 GATK HaplotypeCaller를 사용한 변이 호출. 이것들이 이 과정의 파트 2에서 Nextflow 워크플로우로 적용할 명령입니다.

Samtools를 사용하여 BAM 입력 파일에 대한 인덱스 파일 생성
인덱싱된 BAM 파일에서 GATK HaplotypeCaller를 실행하여 VCF(Variant Call Format)로 샘플별 변이 호출 생성

먼저 하나의 샘플에서만 두 명령을 테스트하는 것으로 시작합니다.

1.1. Samtools로 BAM 입력 파일 인덱싱¶

인덱스 파일은 생물정보학 파일 형식의 일반적인 기능입니다. 인덱스 파일은 메인 파일의 구조에 대한 정보를 포함하여 GATK와 같은 도구가 전체 파일을 읽지 않고도 데이터의 부분집합에 접근할 수 있게 합니다. 이것은 이러한 파일이 얼마나 커질 수 있는지를 고려할 때 중요합니다.

BAM 파일은 종종 인덱스 없이 제공되므로, 많은 분석 워크플로우의 첫 번째 단계는 samtools index를 사용하여 인덱스를 생성하는 것입니다.

Samtools 컨테이너를 가져와서 대화식으로 실행하고 BAM 파일 중 하나에서 samtools index 명령을 실행할 것입니다.

1.1.1. Samtools 컨테이너 가져오기¶

docker pull 명령을 실행하여 Samtools 컨테이너 이미지를 다운로드하세요:

docker pull community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

명령 출력

1.20--b5dfbd93de237464: Pulling from library/samtools
6360b3717211: Pull complete
2ec3f7ad9b3c: Pull complete
7716ca300600: Pull complete
4f4fb700ef54: Pull complete
8c61d418774c: Pull complete
03dae77ff45c: Pull complete
aab7f787139d: Pull complete
4f4fb700ef54: Pull complete
837d55536720: Pull complete
897362c12ca7: Pull complete
3893cbe24e91: Pull complete
d1b61e94977b: Pull complete
c72ff66fb90f: Pull complete
0e0388f29b6d: Pull complete
Digest: sha256:bbfc45b4f228975bde86cba95e303dd94ecf2fdacea5bfb2e2f34b0d7b141e41
Status: Downloaded newer image for community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464
community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

이 이미지를 이전에 다운로드하지 않았다면 완료하는 데 1분 정도 걸릴 수 있습니다. 완료되면 컨테이너 이미지의 로컬 복사본이 생성됩니다.

1.1.2. Samtools 컨테이너를 대화식으로 실행¶

컨테이너를 대화식으로 실행하려면 -it 플래그와 함께 docker run을 사용하세요. -v ./data:/data 옵션은 로컬 data 디렉토리를 컨테이너에 마운트하여 도구가 입력 파일에 접근할 수 있도록 합니다.

docker run -it -v ./data:/data community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

명령 출력

(base) root@1409896f77b1:/tmp#

프롬프트가 (base) root@a1b2c3d4e5f6:/tmp#와 같이 변경되어 컨테이너 내부에 있음을 나타냅니다.

/data/bam 아래에서 서열 데이터 파일을 볼 수 있는지 확인하세요:

ls /data/bam

명령 출력

reads_father.bam  reads_mother.bam  reads_mother.bam.bai  reads_son.bam

이제 첫 번째 명령을 시도할 준비가 되었습니다.

1.1.3. 인덱싱 명령 실행¶

Samtools 문서는 BAM 파일을 인덱싱하기 위해 실행할 명령줄을 제공합니다. 입력 파일만 제공하면 됩니다. 도구는 입력 파일명에 .bai를 추가하여 출력 파일명을 자동으로 생성합니다.

하나의 데이터 파일에서 samtools index 명령을 실행하세요:

samtools index /data/bam/reads_mother.bam

명령은 터미널에 출력을 생성하지 않지만, 이제 원본 BAM 입력 파일과 같은 디렉토리에 reads_mother.bam.bai라는 파일이 표시될 것입니다.

디렉토리 내용

data/bam/
├── reads_father.bam
├── reads_mother.bam
├── reads_mother.bam.bai
└── reads_son.bam

이것으로 첫 번째 단계의 테스트가 완료되었습니다.

1.1.4. Samtools 컨테이너 종료¶

컨테이너를 종료하려면 exit를 입력하세요.

exit

프롬프트가 컨테이너를 시작하기 전의 상태로 돌아가야 합니다.

1.2. GATK HaplotypeCaller로 변이 호출¶

방금 인덱싱한 BAM 파일에서 gatk HaplotypeCaller 명령을 실행하려고 합니다.

1.2.1. GATK 컨테이너 가져오기¶

먼저 docker pull 명령을 실행하여 GATK 컨테이너 이미지를 다운로드하세요:

docker pull community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

명령 출력

일부 레이어는 앞서 가져온 Samtools 컨테이너 이미지와 공유되므로 Already exists로 표시됩니다.

4.5.0.0--730ee8817e436867: Pulling from library/gatk4
6360b3717211: Already exists
2ec3f7ad9b3c: Already exists
7716ca300600: Already exists
4f4fb700ef54: Already exists
8c61d418774c: Already exists
03dae77ff45c: Already exists
aab7f787139d: Already exists
4f4fb700ef54: Already exists
837d55536720: Already exists
897362c12ca7: Already exists
3893cbe24e91: Already exists
d1b61e94977b: Already exists
e5c558f54708: Pull complete
087cce32d294: Pull complete
Digest: sha256:e33413b9100f834fcc62fd5bc9edc1e881e820aafa606e09301eac2303d8724b
Status: Downloaded newer image for community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867
community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

두 컨테이너 이미지가 대부분의 레이어를 공유하기 때문에 첫 번째 pull보다 빨라야 합니다.

1.2.2. GATK 컨테이너를 대화식으로 실행¶

Samtools에서 했던 것과 마찬가지로 data 디렉토리를 마운트하여 GATK 컨테이너를 대화식으로 실행하세요.

docker run -it -v ./data:/data community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

프롬프트가 변경되어 GATK 컨테이너 내부에 있음을 나타냅니다.

1.2.3. 변이 호출 명령 실행¶

GATK 문서는 BAM 파일에서 변이 호출을 수행하기 위해 실행할 명령줄을 제공합니다.

BAM 입력 파일(-I)과 참조 게놈(-R), 출력 파일 이름(-O), 분석할 게놈 간격 목록(-L)을 제공해야 합니다.

그러나 인덱스 파일의 경로를 지정할 필요는 없습니다. 도구는 확립된 명명 규칙과 공동 배치 규칙에 따라 같은 디렉토리에서 자동으로 찾습니다. 참조 게놈의 보조 파일(인덱스 및 서열 사전 파일, *.fai 및 *.dict)에도 동일하게 적용됩니다.

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_mother.bam \
        -O /data/vcf/reads_mother.vcf \
        -L /data/ref/intervals.bed

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.vcf -L /data/ref/intervals.bed
00:27:50.687 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:27:50.854 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.858 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:27:50.858 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
00:27:50.858 INFO  HaplotypeCaller - Executing as root@a1fe8ff42d07 on Linux v6.10.14-linuxkit amd64
00:27:50.858 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:27:50.859 INFO  HaplotypeCaller - Start Date/Time: February 8, 2026 at 12:27:50 AM GMT
00:27:50.859 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.859 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.861 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
00:27:50.861 INFO  HaplotypeCaller - Picard Version: 3.1.1
00:27:50.861 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:27:50.863 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:27:50.864 INFO  HaplotypeCaller - Deflater: IntelDeflater
00:27:50.864 INFO  HaplotypeCaller - Inflater: IntelInflater
00:27:50.864 INFO  HaplotypeCaller - GCS max retries/reopens: 20
00:27:50.864 INFO  HaplotypeCaller - Requester pays: disabled
00:27:50.865 INFO  HaplotypeCaller - Initializing engine
00:27:50.991 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
00:27:51.016 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
00:27:51.029 INFO  HaplotypeCaller - Done initializing engine
00:27:51.040 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
00:27:51.042 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
00:27:51.042 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
00:27:51.046 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
00:27:51.063 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
00:27:51.085 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
00:27:51.086 INFO  IntelPairHmm - Available threads: 10
00:27:51.086 INFO  IntelPairHmm - Requested threads: 4
00:27:51.086 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
00:27:51.128 INFO  ProgressMeter - Starting traversal
00:27:51.136 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
00:27:51.882 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
00:27:52.969 INFO  HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
00:27:52.971 INFO  ProgressMeter - 20_10037292_10066351:13499              0.0                    35           1145.7
00:27:52.971 INFO  ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
00:27:52.976 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.003346916
00:27:52.976 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.045731709
00:27:52.977 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
00:27:52.981 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2026 at 12:27:52 AM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=203423744

로그 출력은 매우 자세하므로 위의 예제에서 가장 관련성 높은 줄을 강조 표시했습니다.

출력 파일 reads_mother.vcf와 그 인덱스 파일 reads_mother.vcf.idx는 컨테이너의 작업 디렉토리 내부에 생성됩니다.

디렉토리 내용

conda.yml  hsperfdata_root  reads_mother.vcf  reads_mother.vcf.idx

VCF 파일에는 잠시 후 살펴볼 변이 호출이 포함되어 있으며, 인덱스 파일은 BAM 인덱스 파일과 동일한 기능을 가지고 있어 도구가 전체 파일을 로드하지 않고도 데이터의 부분집합을 검색하고 가져올 수 있게 합니다.

VCF는 텍스트 형식이고 이것은 작은 테스트 파일이므로 cat reads_mother.vcf를 실행하여 내용을 열어보고 볼 수 있습니다. 파일의 시작 부분까지 스크롤하면 여러 줄의 메타데이터로 구성된 헤더와 그 뒤에 한 줄에 하나씩 나열된 변이 호출 목록을 찾을 수 있습니다.

파일 내용 (요약)

reads_mother.vcf
##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --output reads_mother.vcf --intervals /data/ref/intervals.bed --input /data/bam/reads_mother.bam --reference /data/ref/ref.fasta [요약됨]",Version="4.5.0.0",Date="February 11, 2026 at 4:23:43 PM GMT">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20_10037292_10066351,length=29059>
##source=HaplotypeCaller
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  reads_mother
20_10037292_10066351    3480    .       C       CT      503.03  .       AC=2;AF=1.00;AN=2;DP=23;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.95;SOR=1.179     GT:AD:DP:GQ:PL  1/1:0,18:18:54:517,54,0
20_10037292_10066351    3520    .       AT      A       609.03  .       AC=2;AF=1.00;AN=2;DP=18;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.83;SOR=0.693     GT:AD:DP:GQ:PL  1/1:0,18:18:54:623,54,0
20_10037292_10066351    3529    .       T       A       155.64  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.544;DP=21;ExcessHet=0.0000;FS=1.871;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=7.78;ReadPosRankSum=-1.158;SOR=1.034       GT:AD:DP:GQ:PL  0/1:12,8:20:99:163,0,328
20_10037292_10066351    4012    .       C       T       1398.06 .       AC=2;AF=1.00;AN=2;DP=44;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.51;SOR=0.739     GT:AD:DP:GQ:PL  1/1:0,43:43:99:1412,129,0
20_10037292_10066351    4409    .       A       ATATG   710.03  .       AC=2;AF=1.00;AN=2;DP=31;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.87;SOR=0.784     GT:AD:DP:GQ:PL  1/1:0,23:23:69:724,69,0
20_10037292_10066351    5027    .       C       T       784.06  .       AC=2;AF=1.00;AN=2;DP=27;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.16;SOR=0.693     GT:AD:DP:GQ:PL  1/1:0,26:26:77:798,77,0
20_10037292_10066351    5469    .       A       G       1297.06 .       AC=2;AF=1.00;AN=2;DP=42;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.88;SOR=1.005     GT:AD:DP:GQ:PL  1/1:0,42:42:99:1311,126,0
20_10037292_10066351    7557    .       A       G       935.06  .       AC=2;AF=1.00;AN=2;DP=36;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.50;SOR=0.693     GT:AD:DP:GQ:PL  1/1:0,34:34:99:949,100,0
20_10037292_10066351    7786    .       G       T       1043.06 .       AC=2;AF=1.00;AN=2;DP=35;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.68;SOR=0.941     GT:AD:DP:GQ:PL  1/1:0,34:34:99:1057,102,0
20_10037292_10066351    8350    .       G       C       1162.06 .       AC=2;AF=1.00;AN=2;DP=39;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.80;SOR=1.096     GT:AD:DP:GQ:PL  1/1:0,39:39:99:1176,115,0
20_10037292_10066351    8886    .       AAGAAAGAAAG     A       1268.03 .       AC=2;AF=1.00;AN=2;DP=34;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=1.071     GT:AD:DP:GQ:PL  1/1:0,29:29:88:1282,88,0
20_10037292_10066351    13536   .       T       C       437.64  .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.454;DP=45;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.95;ReadPosRankSum=-1.613;SOR=0.818        GT:AD:DP:GQ:PL  0/1:26,18:44:99:445,0,672
20_10037292_10066351    14156   .       T       C       183.64  .       AC=1;AF=0.500;AN=2;BaseQRankSum=0.703;DP=20;ExcessHet=0.0000;FS=1.871;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.18;ReadPosRankSum=-0.193;SOR=1.034        GT:AD:DP:GQ:PL  0/1:12,8:20:99:191,0,319

위의 예제 출력에서 마지막 헤더 줄을 강조 표시했습니다. 이 줄은 뒤따르는 표 형식 데이터의 열 이름을 제공합니다. 각 데이터 줄은 샘플의 시퀀싱 데이터에서 식별된 가능한 변이를 설명합니다. VCF 형식 해석에 대한 지침은 이 유용한 문서를 참조하세요.

1.2.4. 출력 파일 이동¶

컨테이너 내부에 남아 있는 모든 것은 향후 작업에서 접근할 수 없습니다. BAM 인덱스 파일은 마운트된 파일시스템의 /data/bam 디렉토리에 직접 생성되었지만 VCF 파일과 그 인덱스는 그렇지 않으므로 이 두 파일을 수동으로 이동해야 합니다.

mkdir /data/vcf
mv reads_mother.vcf* /data/vcf

디렉토리 내용

data/
├── bam
│   ├── reads_father.bam
│   ├── reads_mother.bam
│   ├── reads_mother.bam.bai
│   └── reads_son.bam
├── ref
│   ├── intervals.bed
│   ├── ref.dict
│   ├── ref.fasta
│   └── ref.fasta.fai
├── samplesheet.csv
└── vcf
    ├── reads_mother.vcf
    └── reads_mother.vcf.idx

완료되면 모든 파일이 이제 일반 파일시스템에서 접근할 수 있습니다.

1.2.5. GATK 컨테이너 종료¶

컨테이너를 종료하려면 exit를 입력하세요.

exit

프롬프트가 정상으로 돌아가야 합니다. 이것으로 샘플별 변이 호출 테스트가 완료되었습니다.

워크플로우로 작성하기!

바로 파트 2로 이동하여 이 분석을 Nextflow 워크플로우로 구현하기 시작하고 싶다면 그렇게 하셔도 됩니다. 파트 3으로 넘어가기 전에 두 번째 테스트 라운드를 완료하기 위해 돌아오기만 하면 됩니다.

2. 코호트에 대한 공동 호출¶

방금 사용한 변이 호출 접근법은 샘플별 변이 호출을 생성합니다. 이것은 각 샘플의 변이를 독립적으로 살펴보는 데는 좋지만, 제한적인 정보를 제공합니다. 여러 샘플에서 변이 호출이 어떻게 다른지 살펴보는 것이 더 흥미로운 경우가 많습니다. GATK는 이 목적을 위해 공동 변이 호출이라는 대안 방법을 제공합니다.

공동 변이 호출은 각 샘플에 대해 GVCF(Genomic VCF)라는 특수한 종류의 변이 출력을 생성한 다음, 모든 샘플의 GVCF 데이터를 결합하고 '공동 유전형 분석(joint genotyping)' 통계 분석을 실행하는 것을 포함합니다.

공동 분석

샘플의 GVCF의 특별한 점은 프로그램이 변이의 증거를 찾은 위치뿐만 아니라 게놈의 대상 영역에 있는 모든 위치에 대한 서열 데이터 통계를 요약하는 레코드를 포함한다는 것입니다. 이것은 공동 유전형 분석 계산에 중요합니다(추가 정보).

GVCF는 추가 매개변수(-ERC GVCF)와 함께 방금 테스트한 것과 동일한 도구인 GATK HaplotypeCaller에 의해 생성됩니다. GVCF를 결합하는 것은 GATK GenomicsDBImport로 수행되며, 이것은 샘플별 호출을 데이터 저장소(데이터베이스와 유사)로 결합합니다. 실제 '공동 유전형 분석' 분석은 GATK GenotypeGVCFs로 수행됩니다.

여기서는 GVCF를 생성하고 공동 유전형 분석을 실행하는 데 필요한 명령을 테스트합니다. 이것들이 이 과정의 파트 3에서 Nextflow 워크플로우로 적용할 명령입니다.

Samtools를 사용하여 각 BAM 입력 파일에 대한 인덱스 파일 생성
각 BAM 입력 파일에서 GATK HaplotypeCaller를 실행하여 샘플별 게놈 변이 호출의 GVCF 생성
모든 GVCF를 수집하여 GenomicsDB 데이터 저장소로 결합
결합된 GVCF 데이터 저장소에서 공동 유전형 분석을 실행하여 코호트 수준 VCF 생성

이제 세 개의 BAM 파일을 모두 인덱싱하는 것부터 시작하여 이러한 모든 명령을 테스트해야 합니다.

2.1. 세 샘플 모두에 대한 BAM 파일 인덱싱¶

위의 첫 번째 섹션에서는 하나의 BAM 파일만 인덱싱했습니다. 이제 GATK HaplotypeCaller가 처리할 수 있도록 세 샘플 모두를 인덱싱해야 합니다.

2.1.1. Samtools 컨테이너를 대화식으로 실행¶

이미 Samtools 컨테이너 이미지를 가져왔으므로 바로 실행할 수 있습니다:

docker run -it -v ./data:/data community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

프롬프트가 변경되어 이전과 같이 data 디렉토리가 마운트된 컨테이너 내부에 있음을 나타냅니다.

2.1.2. 세 샘플 모두에서 인덱싱 명령 실행¶

세 개의 BAM 파일 각각에서 인덱싱 명령을 실행하세요:

samtools index /data/bam/reads_mother.bam
samtools index /data/bam/reads_father.bam
samtools index /data/bam/reads_son.bam

디렉토리 내용

data/bam/
├── reads_father.bam
├── reads_father.bam.bai
├── reads_mother.bam
├── reads_mother.bam.bai
├── reads_son.bam
└── reads_son.bam.bai

이것은 해당 BAM 파일과 같은 디렉토리에 인덱스 파일을 생성해야 합니다.

2.1.3. Samtools 컨테이너 종료¶

컨테이너를 종료하려면 exit를 입력하세요.

exit

프롬프트가 정상으로 돌아가야 합니다.

2.2. 세 샘플 모두에 대한 GVCF 생성¶

공동 유전형 분석 단계를 실행하려면 세 샘플 모두에 대한 GVCF가 필요합니다.

2.2.1. GATK 컨테이너를 대화식으로 실행¶

이미 이전에 GATK 컨테이너 이미지를 가져왔으므로 바로 실행할 수 있습니다:

docker run -it -v ./data:/data community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

프롬프트가 변경되어 GATK 컨테이너 내부에 있음을 나타냅니다.

2.2.2. GVCF 옵션과 함께 변이 호출 명령 실행¶

게놈 VCF(GVCF)를 생성하기 위해 기본 명령에 -ERC GVCF 옵션을 추가하여 HaplotypeCaller의 GVCF 모드를 켭니다.

또한 출력 파일의 파일 확장자를 .vcf에서 .g.vcf로 변경합니다. 이것은 기술적으로 요구 사항은 아니지만 강력히 권장되는 규칙입니다.

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_mother.bam \
        -O reads_mother.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.g.vcf -L /data/ref/intervals.bed -ERC GVCF
16:51:00.620 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:51:00.749 INFO  HaplotypeCaller - ------------------------------------------------------------
16:51:00.751 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
16:51:00.751 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:51:00.751 INFO  HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
16:51:00.751 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
16:51:00.752 INFO  HaplotypeCaller - Start Date/Time: February 11, 2026 at 4:51:00 PM GMT
16:51:00.752 INFO  HaplotypeCaller - ------------------------------------------------------------
16:51:00.752 INFO  HaplotypeCaller - ------------------------------------------------------------
16:51:00.752 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
16:51:00.753 INFO  HaplotypeCaller - Picard Version: 3.1.1
16:51:00.753 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
16:51:00.753 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:51:00.753 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:51:00.753 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:51:00.754 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:51:00.754 INFO  HaplotypeCaller - Deflater: IntelDeflater
16:51:00.754 INFO  HaplotypeCaller - Inflater: IntelInflater
16:51:00.754 INFO  HaplotypeCaller - GCS max retries/reopens: 20
16:51:00.754 INFO  HaplotypeCaller - Requester pays: disabled
16:51:00.755 INFO  HaplotypeCaller - Initializing engine
16:51:00.893 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
16:51:00.905 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
16:51:00.910 INFO  HaplotypeCaller - Done initializing engine
16:51:00.912 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
16:51:00.917 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
16:51:00.919 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
16:51:00.919 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
16:51:00.923 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
16:51:00.923 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
16:51:00.933 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
16:51:00.945 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
16:51:00.945 INFO  IntelPairHmm - Available threads: 4
16:51:00.945 INFO  IntelPairHmm - Requested threads: 4
16:51:00.945 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
16:51:00.984 INFO  ProgressMeter - Starting traversal
16:51:00.985 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
16:51:01.452 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
16:51:02.358 INFO  HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
16:51:02.359 INFO  ProgressMeter - 20_10037292_10066351:13499              0.0                    35           1529.5
16:51:02.359 INFO  ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
16:51:02.361 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0022800000000000003
16:51:02.361 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.061637120000000004
16:51:02.361 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
16:51:02.362 INFO  HaplotypeCaller - Shutting down engine
[February 11, 2026 at 4:51:02 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=257949696

이것은 컨테이너의 현재 작업 디렉토리에 GVCF 출력 파일 reads_mother.g.vcf와 그 인덱스 파일 reads_mother.g.vcf.idx를 생성합니다.

디렉토리 내용

conda.yml  hsperfdata_root  reads_mother.g.vcf  reads_mother.g.vcf.idx

head -200 reads_mother.g.vcf를 실행하여 파일 내용의 처음 200줄을 보면, 섹션 1에서 생성한 동등한 VCF보다 훨씬 더 긴 것을 볼 수 있습니다. 대부분의 줄이 VCF에서 본 것과 상당히 다르게 보입니다.

파일 내용 (요약)

reads_mother.g.vcf
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output reads_mother.g.vcf --intervals /data/ref/intervals.bed --input /data/bam/reads_mother.bam --reference /data/ref/ref.fasta [요약됨]",Version="4.5.0.0",Date="February 11, 2026 at 4:51:00 PM GMT">
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive)
##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive)
##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive)
##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive)
##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive)
##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive)
##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive)
##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive)
##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive)
##GVCFBlock20-21=minGQ=20(inclusive),maxGQ=21(exclusive)
##GVCFBlock21-22=minGQ=21(inclusive),maxGQ=22(exclusive)
##GVCFBlock22-23=minGQ=22(inclusive),maxGQ=23(exclusive)
##GVCFBlock23-24=minGQ=23(inclusive),maxGQ=24(exclusive)
##GVCFBlock24-25=minGQ=24(inclusive),maxGQ=25(exclusive)
##GVCFBlock25-26=minGQ=25(inclusive),maxGQ=26(exclusive)
##GVCFBlock26-27=minGQ=26(inclusive),maxGQ=27(exclusive)
##GVCFBlock27-28=minGQ=27(inclusive),maxGQ=28(exclusive)
##GVCFBlock28-29=minGQ=28(inclusive),maxGQ=29(exclusive)
##GVCFBlock29-30=minGQ=29(inclusive),maxGQ=30(exclusive)
##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive)
##GVCFBlock30-31=minGQ=30(inclusive),maxGQ=31(exclusive)
##GVCFBlock31-32=minGQ=31(inclusive),maxGQ=32(exclusive)
##GVCFBlock32-33=minGQ=32(inclusive),maxGQ=33(exclusive)
##GVCFBlock33-34=minGQ=33(inclusive),maxGQ=34(exclusive)
##GVCFBlock34-35=minGQ=34(inclusive),maxGQ=35(exclusive)
##GVCFBlock35-36=minGQ=35(inclusive),maxGQ=36(exclusive)
##GVCFBlock36-37=minGQ=36(inclusive),maxGQ=37(exclusive)
##GVCFBlock37-38=minGQ=37(inclusive),maxGQ=38(exclusive)
##GVCFBlock38-39=minGQ=38(inclusive),maxGQ=39(exclusive)
##GVCFBlock39-40=minGQ=39(inclusive),maxGQ=40(exclusive)
##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive)
##GVCFBlock40-41=minGQ=40(inclusive),maxGQ=41(exclusive)
##GVCFBlock41-42=minGQ=41(inclusive),maxGQ=42(exclusive)
##GVCFBlock42-43=minGQ=42(inclusive),maxGQ=43(exclusive)
##GVCFBlock43-44=minGQ=43(inclusive),maxGQ=44(exclusive)
##GVCFBlock44-45=minGQ=44(inclusive),maxGQ=45(exclusive)
##GVCFBlock45-46=minGQ=45(inclusive),maxGQ=46(exclusive)
##GVCFBlock46-47=minGQ=46(inclusive),maxGQ=47(exclusive)
##GVCFBlock47-48=minGQ=47(inclusive),maxGQ=48(exclusive)
##GVCFBlock48-49=minGQ=48(inclusive),maxGQ=49(exclusive)
##GVCFBlock49-50=minGQ=49(inclusive),maxGQ=50(exclusive)
##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive)
##GVCFBlock50-51=minGQ=50(inclusive),maxGQ=51(exclusive)
##GVCFBlock51-52=minGQ=51(inclusive),maxGQ=52(exclusive)
##GVCFBlock52-53=minGQ=52(inclusive),maxGQ=53(exclusive)
##GVCFBlock53-54=minGQ=53(inclusive),maxGQ=54(exclusive)
##GVCFBlock54-55=minGQ=54(inclusive),maxGQ=55(exclusive)
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20_10037292_10066351,length=29059>
##source=HaplotypeCaller
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	reads_mother
20_10037292_10066351	3277	.	G	<NON_REF>	.	.	END=3278	GT:DP:GQ:MIN_DP:PL	0/0:38:99:37:0,102,1530
20_10037292_10066351	3279	.	A	<NON_REF>	.	.	END=3279	GT:DP:GQ:MIN_DP:PL	0/0:37:81:37:0,81,1084
20_10037292_10066351	3280	.	A	<NON_REF>	.	.	END=3281	GT:DP:GQ:MIN_DP:PL	0/0:37:99:37:0,99,1485
20_10037292_10066351	3282	.	T	<NON_REF>	.	.	END=3282	GT:DP:GQ:MIN_DP:PL	0/0:36:96:36:0,96,1440
20_10037292_10066351	3283	.	T	<NON_REF>	.	.	END=3283	GT:DP:GQ:MIN_DP:PL	0/0:35:87:35:0,87,1305
20_10037292_10066351	3284	.	T	<NON_REF>	.	.	END=3284	GT:DP:GQ:MIN_DP:PL	0/0:37:90:37:0,90,1350
20_10037292_10066351	3285	.	T	<NON_REF>	.	.	END=3293	GT:DP:GQ:MIN_DP:PL	0/0:35:81:31:0,81,1215
20_10037292_10066351	3294	.	A	<NON_REF>	.	.	END=3302	GT:DP:GQ:MIN_DP:PL	0/0:33:90:30:0,90,970
20_10037292_10066351	3303	.	G	<NON_REF>	.	.	END=3304	GT:DP:GQ:MIN_DP:PL	0/0:33:72:32:0,72,963
20_10037292_10066351	3305	.	C	<NON_REF>	.	.	END=3309	GT:DP:GQ:MIN_DP:PL	0/0:34:99:33:0,99,1053
20_10037292_10066351	3310	.	A	<NON_REF>	.	.	END=3319	GT:DP:GQ:MIN_DP:PL	0/0:35:90:35:0,90,1086
20_10037292_10066351	3320	.	A	<NON_REF>	.	.	END=3320	GT:DP:GQ:MIN_DP:PL	0/0:33:68:33:0,68,959
20_10037292_10066351	3321	.	A	<NON_REF>	.	.	END=3322	GT:DP:GQ:MIN_DP:PL	0/0:32:84:32:0,84,1260
20_10037292_10066351	3323	.	G	<NON_REF>	.	.	END=3323	GT:DP:GQ:MIN_DP:PL	0/0:32:79:32:0,79,953
20_10037292_10066351	3324	.	T	<NON_REF>	.	.	END=3325	GT:DP:GQ:MIN_DP:PL	0/0:32:81:32:0,81,1215
20_10037292_10066351	3326	.	G	<NON_REF>	.	.	END=3326	GT:DP:GQ:MIN_DP:PL	0/0:31:60:31:0,60,873
20_10037292_10066351	3327	.	C	<NON_REF>	.	.	END=3328	GT:DP:GQ:MIN_DP:PL	0/0:30:78:30:0,78,1170
20_10037292_10066351	3329	.	T	<NON_REF>	.	.	END=3329	GT:DP:GQ:MIN_DP:PL	0/0:31:81:31:0,81,1215
20_10037292_10066351	3330	.	G	<NON_REF>	.	.	END=3330	GT:DP:GQ:MIN_DP:PL	0/0:31:76:31:0,76,949
20_10037292_10066351	3331	.	T	<NON_REF>	.	.	END=3332	GT:DP:GQ:MIN_DP:PL	0/0:30:81:29:0,81,1215
20_10037292_10066351	3333	.	A	<NON_REF>	.	.	END=3335	GT:DP:GQ:MIN_DP:PL	0/0:30:72:30:0,72,892
20_10037292_10066351	3336	.	T	<NON_REF>	.	.	END=3337	GT:DP:GQ:MIN_DP:PL	0/0:30:84:30:0,84,1260
20_10037292_10066351	3338	.	C	<NON_REF>	.	.	END=3338	GT:DP:GQ:MIN_DP:PL	0/0:30:59:30:0,59,851
20_10037292_10066351	3339	.	C	<NON_REF>	.	.	END=3339	GT:DP:GQ:MIN_DP:PL	0/0:30:84:30:0,84,1260
20_10037292_10066351	3340	.	T	<NON_REF>	.	.	END=3340	GT:DP:GQ:MIN_DP:PL	0/0:30:77:30:0,77,888
20_10037292_10066351	3341	.	A	<NON_REF>	.	.	END=3343	GT:DP:GQ:MIN_DP:PL	0/0:30:84:28:0,84,910
20_10037292_10066351	3344	.	T	<NON_REF>	.	.	END=3344	GT:DP:GQ:MIN_DP:PL	0/0:29:73:29:0,73,832
20_10037292_10066351	3345	.	T	<NON_REF>	.	.	END=3348	GT:DP:GQ:MIN_DP:PL	0/0:29:87:29:0,87,891
20_10037292_10066351	3349	.	A	<NON_REF>	.	.	END=3349	GT:DP:GQ:MIN_DP:PL	0/0:29:72:29:0,72,904
20_10037292_10066351	3350	.	G	<NON_REF>	.	.	END=3350	GT:DP:GQ:MIN_DP:PL	0/0:29:87:29:0,87,910
20_10037292_10066351	3351	.	T	<NON_REF>	.	.	END=3352	GT:DP:GQ:MIN_DP:PL	0/0:30:90:30:0,90,975
20_10037292_10066351	3353	.	G	<NON_REF>	.	.	END=3354	GT:DP:GQ:MIN_DP:PL	0/0:31:72:30:0,72,846
20_10037292_10066351	3355	.	A	<NON_REF>	.	.	END=3355	GT:DP:GQ:MIN_DP:PL	0/0:31:93:31:0,93,978
20_10037292_10066351	3356	.	T	<NON_REF>	.	.	END=3357	GT:DP:GQ:MIN_DP:PL	0/0:31:67:31:0,67,916
20_10037292_10066351	3358	.	A	<NON_REF>	.	.	END=3363	GT:DP:GQ:MIN_DP:PL	0/0:31:90:31:0,90,1017
20_10037292_10066351	3364	.	G	<NON_REF>	.	.	END=3364	GT:DP:GQ:MIN_DP:PL	0/0:32:82:32:0,82,947
20_10037292_10066351	3365	.	A	<NON_REF>	.	.	END=3365	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3366	.	C	<NON_REF>	.	.	END=3366	GT:DP:GQ:MIN_DP:PL	0/0:32:79:32:0,79,963
20_10037292_10066351	3367	.	T	<NON_REF>	.	.	END=3369	GT:DP:GQ:MIN_DP:PL	0/0:32:90:31:0,90,1350
20_10037292_10066351	3370	.	C	<NON_REF>	.	.	END=3370	GT:DP:GQ:MIN_DP:PL	0/0:32:46:32:0,46,903
20_10037292_10066351	3371	.	A	<NON_REF>	.	.	END=3371	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3372	.	A	<NON_REF>	.	.	END=3372	GT:DP:GQ:MIN_DP:PL	0/0:32:80:32:0,80,905
20_10037292_10066351	3373	.	A	<NON_REF>	.	.	END=3374	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3375	.	G	<NON_REF>	.	.	END=3375	GT:DP:GQ:MIN_DP:PL	0/0:31:76:31:0,76,922
20_10037292_10066351	3376	.	C	<NON_REF>	.	.	END=3376	GT:DP:GQ:MIN_DP:PL	0/0:33:93:33:0,93,1395
20_10037292_10066351	3377	.	A	<NON_REF>	.	.	END=3381	GT:DP:GQ:MIN_DP:PL	0/0:32:84:31:0,84,1260
20_10037292_10066351	3382	.	A	<NON_REF>	.	.	END=3385	GT:DP:GQ:MIN_DP:PL	0/0:33:90:33:0,90,1350
20_10037292_10066351	3386	.	A	<NON_REF>	.	.	END=3387	GT:DP:GQ:MIN_DP:PL	0/0:34:84:33:0,84,964
20_10037292_10066351	3388	.	A	<NON_REF>	.	.	END=3397	GT:DP:GQ:MIN_DP:PL	0/0:32:90:31:0,90,1350
20_10037292_10066351	3398	.	A	<NON_REF>	.	.	END=3398	GT:DP:GQ:MIN_DP:PL	0/0:31:75:31:0,75,920
20_10037292_10066351	3399	.	T	<NON_REF>	.	.	END=3399	GT:DP:GQ:MIN_DP:PL	0/0:31:87:31:0,87,1305
20_10037292_10066351	3400	.	T	<NON_REF>	.	.	END=3400	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3401	.	T	<NON_REF>	.	.	END=3402	GT:DP:GQ:MIN_DP:PL	0/0:32:87:31:0,87,1305
20_10037292_10066351	3403	.	T	<NON_REF>	.	.	END=3403	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3404	.	C	<NON_REF>	.	.	END=3405	GT:DP:GQ:MIN_DP:PL	0/0:32:80:32:0,80,944
20_10037292_10066351	3406	.	G	<NON_REF>	.	.	END=3407	GT:DP:GQ:MIN_DP:PL	0/0:31:72:30:0,72,859
20_10037292_10066351	3408	.	G	<NON_REF>	.	.	END=3408	GT:DP:GQ:MIN_DP:PL	0/0:32:62:32:0,62,890
20_10037292_10066351	3409	.	T	<NON_REF>	.	.	END=3418	GT:DP:GQ:MIN_DP:PL	0/0:33:81:30:0,81,1215
20_10037292_10066351	3419	.	T	<NON_REF>	.	.	END=3419	GT:DP:GQ:MIN_DP:PL	0/0:29:74:29:0,74,827
20_10037292_10066351	3420	.	T	<NON_REF>	.	.	END=3421	GT:DP:GQ:MIN_DP:PL	0/0:30:84:29:0,84,1260
20_10037292_10066351	3422	.	T	<NON_REF>	.	.	END=3430	GT:DP:GQ:MIN_DP:PL	0/0:29:75:28:0,75,1125
20_10037292_10066351	3431	.	T	<NON_REF>	.	.	END=3439	GT:DP:GQ:MIN_DP:PL	0/0:28:81:28:0,81,1215
20_10037292_10066351	3440	.	A	<NON_REF>	.	.	END=3440	GT:DP:GQ:MIN_DP:PL	0/0:28:70:28:0,70,782
20_10037292_10066351	3441	.	T	<NON_REF>	.	.	END=3442	GT:DP:GQ:MIN_DP:PL	0/0:28:81:28:0,81,1215
20_10037292_10066351	3443	.	T	<NON_REF>	.	.	END=3443	GT:DP:GQ:MIN_DP:PL	0/0:28:78:28:0,78,1170
20_10037292_10066351	3444	.	T	<NON_REF>	.	.	END=3445	GT:DP:GQ:MIN_DP:PL	0/0:28:64:28:0,64,722
20_10037292_10066351	3446	.	G	<NON_REF>	.	.	END=3446	GT:DP:GQ:MIN_DP:PL	0/0:28:78:28:0,78,1170
20_10037292_10066351	3447	.	C	<NON_REF>	.	.	END=3447	GT:DP:GQ:MIN_DP:PL	0/0:29:53:29:0,53,694
20_10037292_10066351	3448	.	T	<NON_REF>	.	.	END=3449	GT:DP:GQ:MIN_DP:PL	0/0:31:76:30:0,76,827
20_10037292_10066351	3450	.	A	<NON_REF>	.	.	END=3450	GT:DP:GQ:MIN_DP:PL	0/0:31:87:31:0,87,1305
20_10037292_10066351	3451	.	C	<NON_REF>	.	.	END=3452	GT:DP:GQ:MIN_DP:PL	0/0:31:74:31:0,74,715
20_10037292_10066351	3453	.	T	<NON_REF>	.	.	END=3455	GT:DP:GQ:MIN_DP:PL	0/0:31:84:31:0,84,1260
20_10037292_10066351	3456	.	A	<NON_REF>	.	.	END=3456	GT:DP:GQ:MIN_DP:PL	0/0:31:23:31:0,23,766
20_10037292_10066351	3457	.	T	<NON_REF>	.	.	END=3460	GT:DP:GQ:MIN_DP:PL	0/0:31:90:31:0,90,1350
20_10037292_10066351	3461	.	C	<NON_REF>	.	.	END=3461	GT:DP:GQ:MIN_DP:PL	0/0:30:89:30:0,89,873
20_10037292_10066351	3462	.	T	<NON_REF>	.	.	END=3462	GT:DP:GQ:MIN_DP:PL	0/0:31:90:31:0,90,1350
20_10037292_10066351	3463	.	G	<NON_REF>	.	.	END=3463	GT:DP:GQ:MIN_DP:PL	0/0:31:44:31:0,44,739
20_10037292_10066351	3464	.	T	<NON_REF>	.	.	END=3468	GT:DP:GQ:MIN_DP:PL	0/0:32:90:32:0,90,1350
20_10037292_10066351	3469	.	C	<NON_REF>	.	.	END=3469	GT:DP:GQ:MIN_DP:PL	0/0:32:79:32:0,79,816
20_10037292_10066351	3470	.	T	<NON_REF>	.	.	END=3470	GT:DP:GQ:MIN_DP:PL	0/0:31:84:31:0,84,1260
20_10037292_10066351	3471	.	T	<NON_REF>	.	.	END=3478	GT:DP:GQ:MIN_DP:PL	0/0:32:75:32:0,75,1125
20_10037292_10066351	3479	.	T	<NON_REF>	.	.	END=3479	GT:DP:GQ:MIN_DP:PL	0/0:34:36:34:0,36,906
20_10037292_10066351	3480	.	C	CT,<NON_REF>	503.03	.	DP=23;ExcessHet=0.0000;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=82800,23	GT:AD:DP:GQ:PL:SB	1/1:0,18,0:18:54:517,54,0,517,54,517:0,0,7,11
20_10037292_10066351	3481	.	T	<NON_REF>	.	.	END=3481	GT:DP:GQ:MIN_DP:PL	0/0:21:51:21:0,51,765
20_10037292_10066351	3482	.	T	<NON_REF>	.	.	END=3482	GT:DP:GQ:MIN_DP:PL	0/0:21:54:21:0,54,810
20_10037292_10066351	3483	.	T	<NON_REF>	.	.	END=3487	GT:DP:GQ:MIN_DP:PL	0/0:20:51:19:0,51,765
20_10037292_10066351	3488	.	T	<NON_REF>	.	.	END=3488	GT:DP:GQ:MIN_DP:PL	0/0:19:42:19:0,42,571
20_10037292_10066351	3489	.	A	<NON_REF>	.	.	END=3489	GT:DP:GQ:MIN_DP:PL	0/0:17:51:17:0,51,521
20_10037292_10066351	3490	.	C	<NON_REF>	.	.	END=3490	GT:DP:GQ:MIN_DP:PL	0/0:17:35:17:0,35,431
20_10037292_10066351	3491	.	A	<NON_REF>	.	.	END=3495	GT:DP:GQ:MIN_DP:PL	0/0:17:48:17:0,48,720
20_10037292_10066351	3496	.	A	<NON_REF>	.	.	END=3498	GT:DP:GQ:MIN_DP:PL	0/0:17:51:17:0,51,473
20_10037292_10066351	3499	.	C	<NON_REF>	.	.	END=3499	GT:DP:GQ:MIN_DP:PL	0/0:16:48:16:0,48,428
20_10037292_10066351	3500	.	G	<NON_REF>	.	.	END=3500	GT:DP:GQ:MIN_DP:PL	0/0:16:31:16:0,31,379
20_10037292_10066351	3501	.	T	<NON_REF>	.	.	END=3501	GT:DP:GQ:MIN_DP:PL	0/0:17:48:17:0,48,720
20_10037292_10066351	3502	.	A	<NON_REF>	.	.	END=3503	GT:DP:GQ:MIN_DP:PL	0/0:19:54:18:0,54,550
20_10037292_10066351	3504	.	T	<NON_REF>	.	.	END=3504	GT:DP:GQ:MIN_DP:PL	0/0:19:48:19:0,48,720
20_10037292_10066351	3505	.	A	<NON_REF>	.	.	END=3506	GT:DP:GQ:MIN_DP:PL	0/0:20:51:20:0,51,765
20_10037292_10066351	3507	.	T	<NON_REF>	.	.	END=3519	GT:DP:GQ:MIN_DP:PL	0/0:19:54:18:0,54,501
20_10037292_10066351	3520	.	AT	A,<NON_REF>	609.03	.	DP=18;ExcessHet=0.0000;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=64800,18	GT:AD:DP:GQ:PL:SB	1/1:0,18,0:18:54:623,54,0,623,54,623:0,0,9,9
20_10037292_10066351	3522	.	T	<NON_REF>	.	.	END=3525	GT:DP:GQ:MIN_DP:PL	0/0:18:54:18:0,54,550
20_10037292_10066351	3526	.	T	<NON_REF>	.	.	END=3527	GT:DP:GQ:MIN_DP:PL	0/0:19:57:19:0,57,607
20_10037292_10066351	3528	.	T	<NON_REF>	.	.	END=3528	GT:DP:GQ:MIN_DP:PL	0/0:19:54:19:0,54,810
20_10037292_10066351	3529	.	T	A,<NON_REF>	155.64	.	BaseQRankSum=-0.544;DP=21;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=75600,21;ReadPosRankSum=-1.158	GT:AD:DP:GQ:PL:SB	0/1:12,8,0:20:99:163,0,328,199,352,551:5,7,5,3
20_10037292_10066351	3530	.	A	<NON_REF>	.	.	END=3530	GT:DP:GQ:MIN_DP:PL	0/0:33:64:33:0,64,941
20_10037292_10066351	3531	.	A	<NON_REF>	.	.	END=3533	GT:DP:GQ:MIN_DP:PL	0/0:33:81:33:0,81,1215
20_10037292_10066351	3534	.	A	<NON_REF>	.	.	END=3534	GT:DP:GQ:MIN_DP:PL	0/0:33:78:33:0,78,1170
20_10037292_10066351	3535	.	A	<NON_REF>	.	.	END=3536	GT:DP:GQ:MIN_DP:PL	0/0:33:68:33:0,68,891
20_10037292_10066351	3537	.	A	<NON_REF>	.	.	END=3546	GT:DP:GQ:MIN_DP:PL	0/0:29:72:26:0,72,1080

마지막 헤더 줄과 파일의 처음 세 개의 '실제' 변이 호출을 다시 강조 표시했습니다.

변이 호출 줄이 많은 비변이 줄 사이에 산재되어 있음을 알 수 있습니다. 이것들은 변이 호출기가 변이의 증거를 찾지 못한 비변이 영역을 나타냅니다. 위에서 간략히 언급했듯이, 이것이 변이 호출의 GVCF 모드의 특별한 점입니다. 변이 호출기는 변이가 없다는 것에 대한 신뢰 수준을 설명하는 일부 통계를 포착합니다. 이를 통해 두 가지 매우 다른 경우를 구별할 수 있습니다: (1) 샘플이 동형접합 참조임을 보여주는 양질의 데이터가 있는 경우, (2) 어느 쪽으로든 결정하기에 충분한 양질의 데이터를 사용할 수 없는 경우.

이와 같은 GVCF에는 일반적으로 이러한 비변이 줄이 많이 있으며, 그 사이에 더 적은 수의 변이 레코드가 산재되어 있습니다.

2.2.3. 나머지 두 샘플에서 프로세스 반복¶

이제 아래 명령을 하나씩 실행하여 나머지 두 샘플에 대한 GVCF를 생성하겠습니다.

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_father.bam \
        -O reads_father.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_father.bam -O reads_father.g.vcf -L /data/ref/intervals.bed -ERC GVCF
17:28:30.677 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:28:30.801 INFO  HaplotypeCaller - ------------------------------------------------------------
17:28:30.803 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:28:30.804 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:28:30.804 INFO  HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:28:30.804 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:28:30.804 INFO  HaplotypeCaller - Start Date/Time: February 11, 2026 at 5:28:30 PM GMT
17:28:30.804 INFO  HaplotypeCaller - ------------------------------------------------------------
17:28:30.804 INFO  HaplotypeCaller - ------------------------------------------------------------
17:28:30.805 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
17:28:30.805 INFO  HaplotypeCaller - Picard Version: 3.1.1
17:28:30.805 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
17:28:30.806 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:28:30.806 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:28:30.806 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:28:30.806 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:28:30.806 INFO  HaplotypeCaller - Deflater: IntelDeflater
17:28:30.807 INFO  HaplotypeCaller - Inflater: IntelInflater
17:28:30.807 INFO  HaplotypeCaller - GCS max retries/reopens: 20
17:28:30.807 INFO  HaplotypeCaller - Requester pays: disabled
17:28:30.807 INFO  HaplotypeCaller - Initializing engine
17:28:30.933 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:28:30.946 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
17:28:30.951 INFO  HaplotypeCaller - Done initializing engine
17:28:30.953 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
17:28:30.957 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:30.959 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
17:28:30.960 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
17:28:30.963 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
17:28:30.963 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
17:28:30.972 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
17:28:30.987 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
17:28:30.987 INFO  IntelPairHmm - Available threads: 4
17:28:30.987 INFO  IntelPairHmm - Requested threads: 4
17:28:30.987 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:28:31.034 INFO  ProgressMeter - Starting traversal
17:28:31.034 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
17:28:31.570 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
17:28:32.865 INFO  HaplotypeCaller - 9 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
9 total reads filtered out of 2064 reads processed
17:28:32.866 INFO  ProgressMeter - 20_10037292_10066351:13338              0.0                    38           1245.2
17:28:32.866 INFO  ProgressMeter - Traversal complete. Processed 38 total regions in 0.0 minutes.
17:28:32.868 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0035923200000000004
17:28:32.868 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.10765202500000001
17:28:32.868 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.03 sec
17:28:32.869 INFO  HaplotypeCaller - Shutting down engine
[February 11, 2026 at 5:28:32 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=299892736

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_son.bam \
        -O reads_son.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_son.bam -O reads_son.g.vcf -L /data/ref/intervals.bed -ERC GVCF
17:30:10.017 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:30:10.156 INFO  HaplotypeCaller - ------------------------------------------------------------
17:30:10.159 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:30:10.159 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:30:10.159 INFO  HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:30:10.159 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:30:10.159 INFO  HaplotypeCaller - Start Date/Time: February 11, 2026 at 5:30:09 PM GMT
17:30:10.159 INFO  HaplotypeCaller - ------------------------------------------------------------
17:30:10.160 INFO  HaplotypeCaller - ------------------------------------------------------------
17:30:10.160 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
17:30:10.160 INFO  HaplotypeCaller - Picard Version: 3.1.1
17:30:10.161 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
17:30:10.161 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:30:10.161 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:30:10.161 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:30:10.161 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:30:10.161 INFO  HaplotypeCaller - Deflater: IntelDeflater
17:30:10.162 INFO  HaplotypeCaller - Inflater: IntelInflater
17:30:10.162 INFO  HaplotypeCaller - GCS max retries/reopens: 20
17:30:10.162 INFO  HaplotypeCaller - Requester pays: disabled
17:30:10.162 INFO  HaplotypeCaller - Initializing engine
17:30:10.277 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:30:10.290 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
17:30:10.296 INFO  HaplotypeCaller - Done initializing engine
17:30:10.298 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
17:30:10.302 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:30:10.303 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
17:30:10.304 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
17:30:10.307 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
17:30:10.307 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
17:30:10.315 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
17:30:10.328 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
17:30:10.329 INFO  IntelPairHmm - Available threads: 4
17:30:10.329 INFO  IntelPairHmm - Requested threads: 4
17:30:10.329 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:30:10.368 INFO  ProgressMeter - Starting traversal
17:30:10.369 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
17:30:10.875 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
17:30:11.980 INFO  HaplotypeCaller - 14 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
14 total reads filtered out of 1981 reads processed
17:30:11.981 INFO  ProgressMeter - 20_10037292_10066351:13223              0.0                    35           1302.7
17:30:11.981 INFO  ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
17:30:11.983 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0034843710000000004
17:30:11.983 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.048108363
17:30:11.983 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
17:30:11.984 INFO  HaplotypeCaller - Shutting down engine
[February 11, 2026 at 5:30:11 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=226492416

완료되면 현재 디렉토리에 .g.vcf로 끝나는 세 개의 파일(샘플당 하나)과 .g.vcf.idx로 끝나는 각각의 인덱스 파일이 있어야 합니다.

디렉토리 내용

conda.yml        reads_father.g.vcf      reads_mother.g.vcf      reads_son.g.vcf
hsperfdata_root  reads_father.g.vcf.idx  reads_mother.g.vcf.idx  reads_son.g.vcf.idx

이 시점에서 각 입력 샘플에 대해 GVCF 모드로 변이를 호출했습니다. 이제 공동 호출로 넘어갈 시간입니다.

하지만 컨테이너를 종료하지 마세요! 다음 단계에서 동일한 컨테이너를 사용할 것입니다.

2.3. 공동 유전형 분석 실행¶

이제 모든 GVCF가 있으므로 샘플 코호트에 대한 변이 호출을 생성하는 공동 유전형 분석 접근법을 시도해 볼 수 있습니다. 이것은 모든 GVCF의 데이터를 데이터 저장소로 결합한 다음 공동 유전형 분석을 실행하여 최종 공동 호출된 변이의 VCF를 생성하는 2단계 방법입니다.

2.3.1. 모든 샘플별 GVCF 결합¶

이 첫 번째 단계는 GATK의 또 다른 도구인 GenomicsDBImport를 사용하여 모든 GVCF의 데이터를 GenomicsDB 데이터 저장소로 결합합니다. GenomicsDB 데이터 저장소는 변이 정보의 중간 저장소 역할을 하는 일종의 데이터베이스 형식입니다.

gatk GenomicsDBImport \
    -V reads_mother.g.vcf \
    -V reads_father.g.vcf \
    -V reads_son.g.vcf \
    -L /data/ref/intervals.bed \
    --genomicsdb-workspace-path family_trio_gdb

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenomicsDBImport -V reads_mother.g.vcf -V reads_father.g.vcf -V reads_son.g.vcf -L /data/ref/intervals.bed --genomicsdb-workspace-path family_trio_gdb
17:37:07.569 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:37:07.699 INFO  GenomicsDBImport - ------------------------------------------------------------
17:37:07.702 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:37:07.702 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
17:37:07.703 INFO  GenomicsDBImport - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:37:07.703 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:37:07.704 INFO  GenomicsDBImport - Start Date/Time: February 11, 2026 at 5:37:07 PM GMT
17:37:07.704 INFO  GenomicsDBImport - ------------------------------------------------------------
17:37:07.704 INFO  GenomicsDBImport - ------------------------------------------------------------
17:37:07.706 INFO  GenomicsDBImport - HTSJDK Version: 4.1.0
17:37:07.706 INFO  GenomicsDBImport - Picard Version: 3.1.1
17:37:07.707 INFO  GenomicsDBImport - Built for Spark Version: 3.5.0
17:37:07.709 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:37:07.709 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:37:07.709 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:37:07.710 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:37:07.710 INFO  GenomicsDBImport - Deflater: IntelDeflater
17:37:07.711 INFO  GenomicsDBImport - Inflater: IntelInflater
17:37:07.711 INFO  GenomicsDBImport - GCS max retries/reopens: 20
17:37:07.711 INFO  GenomicsDBImport - Requester pays: disabled
17:37:07.712 INFO  GenomicsDBImport - Initializing engine
17:37:07.883 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:37:07.886 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
17:37:07.889 INFO  GenomicsDBImport - Done initializing engine
17:37:08.560 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
17:37:08.561 INFO  GenomicsDBImport - Vid Map JSON file will be written to /tmp/family_trio_gdb/vidmap.json
17:37:08.561 INFO  GenomicsDBImport - Callset Map JSON file will be written to /tmp/family_trio_gdb/callset.json
17:37:08.561 INFO  GenomicsDBImport - Complete VCF Header will be written to /tmp/family_trio_gdb/vcfheader.vcf
17:37:08.561 INFO  GenomicsDBImport - Importing to workspace - /tmp/family_trio_gdb
17:37:08.878 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.359 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.487 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.591 INFO  GenomicsDBImport - Done importing batch 1/1
17:37:09.592 INFO  GenomicsDBImport - Import completed!
17:37:09.592 INFO  GenomicsDBImport - Shutting down engine
[February 11, 2026 at 5:37:09 PM GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=113246208
Tool returned:
true

이 단계의 출력은 사실상 여러 개의 다른 파일 형태로 결합된 변이 데이터를 담고 있는 추가 중첩 디렉토리 세트를 포함하는 디렉토리입니다. 둘러볼 수는 있지만 이 데이터 저장소 형식은 사람이 직접 읽도록 만들어지지 않았다는 것을 빠르게 알 수 있습니다.

팁

GATK에는 필요에 따라 데이터 저장소에서 변이 호출 데이터를 검사하고 추출할 수 있게 해주는 도구가 포함되어 있습니다.

2.3.2. 실제 공동 유전형 분석 실행¶

이 두 번째 단계는 GATK의 또 다른 도구인 GenotypeGVCFs를 사용하여 코호트의 모든 샘플에서 사용 가능한 데이터를 고려하여 변이 통계와 개별 유전형을 재계산합니다.

gatk GenotypeGVCFs \
    -R /data/ref/ref.fasta \
    -V gendb://family_trio_gdb \
    -O family_trio.vcf

명령 출력

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenotypeGVCFs -R /data/ref/ref.fasta -V gendb://family_trio_gdb -O family_trio.vcf
17:38:45.084 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:38:45.217 INFO  GenotypeGVCFs - ------------------------------------------------------------
17:38:45.220 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:38:45.220 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
17:38:45.220 INFO  GenotypeGVCFs - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:38:45.220 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:38:45.221 INFO  GenotypeGVCFs - Start Date/Time: February 11, 2026 at 5:38:45 PM GMT
17:38:45.221 INFO  GenotypeGVCFs - ------------------------------------------------------------
17:38:45.221 INFO  GenotypeGVCFs - ------------------------------------------------------------
17:38:45.221 INFO  GenotypeGVCFs - HTSJDK Version: 4.1.0
17:38:45.222 INFO  GenotypeGVCFs - Picard Version: 3.1.1
17:38:45.222 INFO  GenotypeGVCFs - Built for Spark Version: 3.5.0
17:38:45.222 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:38:45.222 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:38:45.222 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:38:45.222 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:38:45.223 INFO  GenotypeGVCFs - Deflater: IntelDeflater
17:38:45.223 INFO  GenotypeGVCFs - Inflater: IntelInflater
17:38:45.223 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
17:38:45.223 INFO  GenotypeGVCFs - Requester pays: disabled
17:38:45.223 INFO  GenotypeGVCFs - Initializing engine
17:38:45.544 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
17:38:45.561 INFO  NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.561 INFO  NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.561 INFO  NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.577 INFO  GenotypeGVCFs - Done initializing engine
17:38:45.615 INFO  ProgressMeter - Starting traversal
17:38:45.615 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
17:38:45.903 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.07757032800000006,Cpu time(s),0.07253379200000037
17:38:46.421 INFO  ProgressMeter - 20_10037292_10066351:13953              0.0                  3390         252357.3
17:38:46.422 INFO  ProgressMeter - Traversal complete. Processed 3390 total variants in 0.0 minutes.
17:38:46.423 INFO  GenotypeGVCFs - Shutting down engine
[February 11, 2026 at 5:38:46 PM GMT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=203423744

이것은 컨테이너의 현재 작업 디렉토리에 VCF 출력 파일 family_trio.vcf와 그 인덱스 family_trio.vcf.idx를 생성합니다. 또 다른 적당히 작은 파일이므로 cat family_trio.vcf를 실행하여 파일 내용을 보고 처음 몇 개의 변이 줄을 찾기 위해 아래로 스크롤할 수 있습니다.

파일 내용 (요약)

family_trio.vcf
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=GenomicsDBImport,CommandLine="GenomicsDBImport --genomicsdb-workspace-path family_trio_gdb --variant reads_mother.g.vcf --variant reads_father.g.vcf --variant reads_son.g.vcf --intervals /data/ref/intervals.bed [요약됨]",Version="4.5.0.0",Date="February 11, 2026 at 5:37:07 PM GMT">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs --output family_trio.vcf --variant gendb://family_trio_gdb --reference /data/ref/ref.fasta --include-non-variant-sites false [요약됨]",Version="4.5.0.0",Date="February 11, 2026 at 5:38:45 PM GMT">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output reads_mother.g.vcf --intervals /data/ref/intervals.bed --input /data/bam/reads_mother.bam --reference /data/ref/ref.fasta [요약됨]",Version="4.5.0.0",Date="February 11, 2026 at 4:51:00 PM GMT">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20_10037292_10066351,length=29059>
##source=GenomicsDBImport
##source=GenotypeGVCFs
##source=HaplotypeCaller
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  reads_father    reads_mother    reads_son
20_10037292_10066351    3480    .       C       CT      1625.89 .       AC=5;AF=0.833;AN=6;BaseQRankSum=0.220;DP=85;ExcessHet=0.0000;FS=2.476;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=21.68;ReadPosRankSum=-1.147e+00;SOR=0.487    GT:AD:DP:GQ:PL  0/1:15,16:31:99:367,0,375       1/1:0,18:18:54:517,54,0 1/1:0,26:26:78:756,78,0
20_10037292_10066351    3520    .       AT      A       1678.89 .       AC=5;AF=0.833;AN=6;BaseQRankSum=1.03;DP=80;ExcessHet=0.0000;FS=2.290;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=22.39;ReadPosRankSum=0.701;SOR=0.730  GT:AD:DP:GQ:PL  0/1:18,13:31:99:296,0,424       1/1:0,18:18:54:623,54,0 1/1:0,26:26:78:774,78,0
20_10037292_10066351    3529    .       T       A       154.29  .       AC=1;AF=0.167;AN=6;BaseQRankSum=-5.440e-01;DP=104;ExcessHet=0.0000;FS=1.871;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=7.71;ReadPosRankSum=-1.158e+00;SOR=1.034       GT:AD:DP:GQ:PL  0/0:44,0:44:99:0,112,1347       0/1:12,8:20:99:163,0,328        0/0:39,0:39:99:0,105,1194
20_10037292_10066351    4012    .       C       T       3950.73 .       AC=6;AF=1.00;AN=6;DP=127;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=31.86;SOR=0.725    GT:AD:DP:GQ:PL  1/1:0,46:46:99:1446,137,0   1/1:0,43:43:99:1412,129,0        1/1:0,35:35:99:1106,105,0
20_10037292_10066351    4409    .       A       ATATG   2478.69 .       AC=6;AF=1.00;AN=6;DP=96;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=33.95;SOR=0.963     GT:AD:DP:GQ:PL  1/1:0,28:28:90:969,90,0 1/1:0,21:21:69:724,69,0      1/1:0,24:24:72:799,72,0
20_10037292_10066351    4464    .       T       TA      620.25  .       AC=1;AF=0.167;AN=6;BaseQRankSum=0.108;DP=102;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=19.38;ReadPosRankSum=1.27;SOR=0.892 GT:AD:DP:GQ:PGT:PID:PL:PS       0|1:15,17:32:99:0|1:4464_T_TA:629,0,554:4464    0/0:30,0:30:78:.:.:0,78,1170 0/0:39,0:39:99:.:.:0,108,1286
20_10037292_10066351    4465    .       T       TA      620.25  .       AC=1;AF=0.167;AN=6;BaseQRankSum=-2.250e-01;DP=101;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=19.38;ReadPosRankSum=0.910;SOR=0.892   GT:AD:DP:GQ:PGT:PID:PL:PS       0|1:15,17:32:99:0|1:4464_T_TA:629,0,554:4464    0/0:30,0:30:78:.:.:0,78,1170 0/0:39,0:39:99:.:.:0,108,1286
20_10037292_10066351    5027    .       C       T       3339.73 .       AC=6;AF=1.00;AN=6;DP=108;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=31.51;SOR=0.731    GT:AD:DP:GQ:PL  1/1:0,36:36:99:1164,108,0   1/1:0,26:26:77:798,77,0  1/1:0,44:44:99:1391,132,0
20_10037292_10066351    5469    .       A       G       2725.93 .       AC=5;AF=0.833;AN=6;BaseQRankSum=-3.665e+00;DP=113;ExcessHet=0.0000;FS=6.914;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=24.34;ReadPosRankSum=1.50;SOR=0.320    GT:AD:DP:GQ:PL  0/1:18,23:41:99:553,0,486       1/1:0,42:42:99:1311,126,0       1/1:0,29:29:86:876,86,0
20_10037292_10066351    7557    .       A       G       2257.93 .       AC=5;AF=0.833;AN=6;BaseQRankSum=-1.362e+00;DP=111;ExcessHet=0.0000;FS=3.400;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=21.50;ReadPosRankSum=1.11;SOR=0.566    GT:AD:DP:GQ:PL  0/1:19,15:34:99:313,0,493       1/1:0,34:34:99:949,100,0        1/1:0,37:37:99:1010,108,0
20_10037292_10066351    7786    .       G       T       3503.73 .       AC=6;AF=1.00;AN=6;DP=114;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=31.28;SOR=0.970    GT:AD:DP:GQ:PL  1/1:0,34:34:99:1066,102,0   1/1:0,34:34:99:1057,102,0        1/1:0,44:44:99:1394,132,0
20_10037292_10066351    8350    .       G       C       2663.93 .       AC=5;AF=0.833;AN=6;BaseQRankSum=-1.608e+00;DP=106;ExcessHet=0.0000;FS=5.378;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=25.37;ReadPosRankSum=-1.870e-01;SOR=0.950      GT:AD:DP:GQ:PL  0/1:16,14:30:99:356,0,430       1/1:0,39:39:99:1176,115,0       1/1:0,36:36:99:1146,108,0
20_10037292_10066351    8886    .       AAGAAAGAAAG     A       3037.69 .       AC=6;AF=1.00;AN=6;DP=89;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=2.269     GT:AD:DP:GQ:PL  1/1:0,18:18:55:804,55,0      1/1:0,29:29:88:1282,88,0        1/1:0,22:22:67:965,67,0
20_10037292_10066351    9536    .       T       C       1089.95 .       AC=3;AF=0.500;AN=6;BaseQRankSum=-5.640e-01;DP=82;ExcessHet=0.0000;FS=12.258;MLEAC=3;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=20.57;ReadPosRankSum=0.860;SOR=0.373   GT:AD:DP:GQ:PL  1/1:0,32:32:95:950,95,0 0/0:29,0:29:81:0,81,1215        0/1:14,7:21:99:156,0,353
20_10037292_10066351    13375   .       C       T       724.29  .       AC=1;AF=0.167;AN=6;BaseQRankSum=0.171;DP=121;ExcessHet=0.0000;FS=7.398;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=12.71;ReadPosRankSum=0.415;SOR=1.688        GT:AD:DP:GQ:PL  0/1:28,29:57:99:733,0,679       0/0:29,0:29:81:0,81,1215        0/0:34,0:34:99:0,99,1485
20_10037292_10066351    13536   .       T       C       1025.16 .       AC=2;AF=0.333;AN=6;BaseQRankSum=1.63;DP=118;ExcessHet=0.9691;FS=1.719;MLEAC=2;MLEAF=0.333;MQ=60.00;MQRankSum=0.00;QD=11.65;ReadPosRankSum=-2.000e-01;SOR=0.904    GT:AD:DP:GQ:PL  0/1:21,23:44:99:591,0,526       0/1:26,18:44:99:445,0,672       0/0:29,0:29:84:0,84,1260
20_10037292_10066351    14156   .       T       C       438.16  .       AC=2;AF=0.333;AN=6;BaseQRankSum=3.20;DP=96;ExcessHet=0.9691;FS=2.381;MLEAC=2;MLEAF=0.333;MQ=60.00;MQRankSum=0.00;QD=7.82;ReadPosRankSum=1.13;SOR=0.592    GT:AD:DP:GQ:PL  0/1:25,11:36:99:258,0,676       0/1:12,8:20:99:191,0,319        0/0:38,0:38:99:0,99,1117
20_10037292_10066351    14403   .       G       A       144.29  .       AC=1;AF=0.167;AN=6;BaseQRankSum=2.63;DP=116;ExcessHet=0.0000;FS=1.435;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=3.52;ReadPosRankSum=0.252;SOR=0.802  GT:AD:DP:GQ:PL  0/1:32,9:41:99:153,0,821        0/0:37,0:37:99:0,109,1169       0/0:37,0:37:99:0,99,1113

마지막 헤더 줄을 다시 강조 표시했습니다. 이 줄은 변이 호출 데이터의 시작을 표시합니다.

이것은 앞서 생성한 VCF와 비슷해 보이지만, 이번에는 세 샘플 모두에 대한 유전형 수준 정보가 있습니다. 파일의 마지막 세 열은 샘플에 대한 유전형 블록이며, 강조 표시된 헤더 줄에 표시된 대로 ID 필드의 알파벳 순서로 나열되어 있습니다.

첫 번째 변이에 대해 우리 테스트 가족 트리오에 대해 호출된 유전형을 보면, 아버지는 이형접합 변이(0/1)이고, 어머니와 아들은 모두 동형접합 변이(1/1)입니다.

이것이 궁극적으로 우리가 데이터셋에서 추출하려는 정보입니다!

2.3.3. 출력 파일 이동¶

앞서 언급했듯이 컨테이너 내부에 남아 있는 모든 것은 향후 작업에서 접근할 수 없습니다. 컨테이너를 종료하기 전에 GVCF 파일, 최종 다중 샘플 VCF 및 모든 인덱스 파일을 컨테이너 외부의 파일시스템으로 수동으로 이동하겠습니다. 그렇게 하면 이 모든 작업을 자동화하기 위해 워크플로우를 구축할 때 비교할 수 있는 것이 생깁니다.

mv *.vcf* /data/vcf

디렉토리 내용" hl_lines="14-19 22-23

data
├── bam
│   ├── reads_father.bam
│   ├── reads_father.bam.bai
│   ├── reads_mother.bam
│   ├── reads_mother.bam.bai
│   ├── reads_son.bam
│   └── reads_son.bam.bai
├── ref
│   ├── intervals.bed
│   ├── ref.dict
│   ├── ref.fasta
│   └── ref.fasta.fai
├── samplesheet.csv
└── vcf
    ├── family_trio.vcf
    ├── family_trio.vcf.idx
    ├── reads_father.g.vcf
    ├── reads_father.g.vcf.idx
    ├── reads_mother.g.vcf
    ├── reads_mother.g.vcf.idx
    ├── reads_mother.vcf
    ├── reads_mother.vcf.idx
    ├── reads_son.g.vcf
    └── reads_son.g.vcf.idx

완료되면 모든 파일이 이제 일반 파일시스템에서 접근할 수 있습니다.

2.3.4. GATK 컨테이너 종료¶

컨테이너를 종료하려면 exit를 입력하세요.

exit

프롬프트가 정상으로 돌아가야 합니다. 이것으로 공동 변이 호출 명령의 수동 테스트가 완료되었습니다.

핵심 정리¶

Samtools 인덱싱과 GATK 변이 호출 명령을 각각의 컨테이너에서 테스트하는 방법을 알게 되었습니다. 여기에는 GVCF를 생성하고 여러 샘플에 대해 공동 유전형 분석을 실행하는 방법도 포함됩니다.

다음 단계¶

잠시 휴식을 취한 다음 파트 2로 이동하여 컨테이너를 사용하여 작업을 실행하는 워크플로우에 동일한 명령을 적용하는 방법을 학습하세요.