Fastq

By Jonghyuk Kim

Posted 2022-06-16 Updated 2022-06-17 2 min read

Definition

Fastq는 NGS 결과로 생성된 sequence data와 이에 상응하는 quality score 정보를 담고 있는 text 파일입니다.

Bioinformatics 분야의 standard data format이기 때문에 대부분 분석 tool에서 input으로 사용합니다.

Fastq Format

Post-Image Fastq format

Fastq는 한 개 read당 네 개 lines로 구성되어 있습니다.

Line1 (sequence identifier): ‘@’ 기호로 시작합니다. Sequencing run과 cluster 관련 정보를 담고 있습니다.

Value	Description
@	sequence identifier start character
NG501674	uniqe instrument id
510	run id
HTVWKAFX3	flowcell id
1	flocell lane
11101	tile number within the flowcell lane
11906	x-coordinate of the cluster within the tile
1035	y-coordinate of the cluster within the tile
1	read number [1, 2 (if paired-end or mate-pair reads only)]
N	N means this read is NOT filtered (=passed), Y otherwise
0	0 when none of the control bits are on, otherwise it is an even number
GAATCTGA	index sequence

Line2 (sequence): Read의 sequence 정보를 담고 있습니다.
Line3 (separator): ‘+’ 기호로 시작합니다. Sequence와 quality를 분리하는 구분자 입니다.
Line4 (base call quality score): Line2의 각각 sequence에 대한 quality 값 정보를 담고 있습니다. Phred+33 encoding 방식을 따르며 quality score 표기는 ASCII 문자로 합니다.

Open Fastq

Fastq 파일은 보통 몇 백 만개 단위의 reads 정보를 담고 있으며 파일 size 또한 Gb 단위입니다. 윈도우 OS 환경에서 이렇게 큰 파일을 열고 작업하는 것은 무리입니다. 그럼에도 불구하고 단순히 fastq 파일을 열어보고 싶다거나 작은 size의 fastq 파일을 다룰 때는 NotePad++, Sublime Text와 같은 text editor를 사용할 수 있습니다.

Fastq

Definition

Fastq Format

Open Fastq

Further Reading

Phred Score (Base Call Quality Score)

WES 개념과 유전학적 의의

WES 연구 디자인법