FASTQ files

FASTQ files are the foundation of next-generation sequencing analysis. Each file stores the sequenced reads together with a quality score for every base. Understanding how to interpret these scores, and when to apply trimming, is essential for accurate downstream analyses in genomics and bioinformatics.

Tamara Frontanilla, PhD

What is a FASTQ?

A FASTQ file stores DNA/RNA reads and their per-base quality. A sequencing experiment always produces at least one FASTQ file: a single-end experiment yields one file (R1), and a paired-end experiment yields two (R1 and R2).

Each read takes up 4 lines:

  1. Read ID
  2. Sequence (A/C/G/T/N)
  3. Separator (+)
  4. Quality: a string of ASCII characters, one per base, encoding the Phred score (base-calling confidence)

Note: FASTQ files are almost always delivered compressed as .fastq.gz.
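The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a replacement for a dedicated library: the names `FastqRecord`, `parse_fastq` and `read_fastq` are illustrative, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
import gzip
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # line 1, without the leading '@'
    sequence: str  # line 2
    quality: str   # line 4 (line 3, the '+' separator, is discarded)


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 lines from an open text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Open a plain or gzip-compressed FASTQ file and yield its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:  # "rt" = read as text
        yield from parse_fastq(handle)
```

Because `read_fastq` is a generator, records are produced one at a time, so even a very large .fastq.gz file does not need to fit in memory.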

EXAMPLE

@A00469:123:HGF2KDSXX:1:1101:10003:12345 1:N:0:ACGTAC
ACGTTCTGATGACCTTAGCA
+
IIHFGEFIIHDF>?=;:987
  • Line 1: Identifier (instrument, run, coordinates, etc.). The 1 before :N: indicates R1 (if it were 2, it would be R2).
  • Line 2: Sequence (20 bases in this example).
  • Line 3: Separator (+).
  • Line 4: Per-base quality, same length as the sequence (20 characters).
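The identifier line can be split into its fields programmatically. The sketch below assumes the common Illumina layout `instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, which matches the example above. The function name and dictionary keys are illustrative, and headers from other platforms or older pipelines will not fit this pattern.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style FASTQ identifier line into named fields."""
    if header.startswith("@"):
        header = header[1:]
    # The part before the space locates the read on the flow cell;
    # the part after it describes the read itself.
    name, _, comment = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read_number, is_filtered, control, index = comment.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_number": int(read_number),  # 1 = R1, 2 = R2
        "is_filtered": is_filtered == "Y",
        "control": control,
        "index": index,
    }


fields = parse_illumina_header(
    "@A00469:123:HGF2KDSXX:1:1101:10003:12345 1:N:0:ACGTAC"
)
print(f"R{fields['read_number']} read from lane {fields['lane']}, tile {fields['tile']}")
```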

How to interpret per-base quality?

The Phred quality score Q is related to the probability P that a base call is wrong:

$$Q = -10 \log_{10}(P) \quad\Rightarrow\quad P = 10^{-Q/10}$$

  • Q10: 1/10 error rate (90% confidence)
  • Q20: 1/100 (99%)
  • Q30: 1/1000 (99.9%)
  • Q40: 1/10,000 (99.99%)
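The conversion between Q and P is easy to reproduce in code. The two helper functions below are illustrative names, written directly from the Phred definition; the loop recomputes the error rates for the reference values Q10 to Q40.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P): inverse of phred_to_error_prob."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:g}, confidence = {100 * (1 - p):.2f}%")
```

Every 10-point increase in Q divides the error probability by 10, which is why a drop from Q40 to Q20 means a hundredfold increase in expected errors.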

Example of step-by-step interpretation

ACGTTCTGATGACCTTAGCA
IIHFGEFIIHDF>?=;:987
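| Pos | Base | Char | ASCII | Q (Phred) | P(error) approx. | Interpretation |
|-----|------|------|-------|-----------|------------------|----------------|
| 1 | A | I | 73 | 40 | 0.0001 (0.01%) | Extremely reliable |
| 2 | C | I | 73 | 40 | 0.0001 | Extremely reliable |
| 3 | G | H | 72 | 39 | 0.00013 | Very reliable |
| 4 | T | F | 70 | 37 | 0.0002 | Very reliable |
| 5 | T | G | 71 | 38 | 0.00016 | Very reliable |
| 6 | C | E | 69 | 36 | 0.00025 | Very reliable |
| 7 | T | F | 70 | 37 | 0.0002 | Very reliable |
| 8 | G | I | 73 | 40 | 0.0001 | Extremely reliable |
| 9 | A | I | 73 | 40 | 0.0001 | Extremely reliable |
| 10 | T | H | 72 | 39 | 0.00013 | Very reliable |
| 11 | G | D | 68 | 35 | 0.00032 | Reliable |
| 12 | A | F | 70 | 37 | 0.0002 | Very reliable |
| 13 | C | > | 62 | 29 | 0.0013 (0.13%) | Acceptable |
| 14 | C | ? | 63 | 30 | 0.001 (0.1%) | Acceptable |
| 15 | T | = | 61 | 28 | 0.0016 (0.16%) | Moderate |
| 16 | T | ; | 59 | 26 | 0.0025 (0.25%) | Moderate |
| 17 | A | : | 58 | 25 | 0.0032 (0.32%) | Moderate–low |
| 18 | G | 9 | 57 | 24 | 0.004 (0.4%) | Low–moderate |
| 19 | C | 8 | 56 | 23 | 0.005 (0.5%) | Low |
| 20 | A | 7 | 55 | 22 | 0.0063 (0.63%) | Low |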
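The table follows from one rule: each quality character is converted to its ASCII code, and 33 is subtracted to obtain Q (the Phred+33 encoding, which is what the ASCII and Q columns above imply). A minimal sketch that regenerates the table from the example read, with `decode_quality` as an illustrative helper name:

```python
from typing import List

PHRED_OFFSET = 33  # Phred+33 encoding: Q = ASCII code - 33


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


sequence = "ACGTTCTGATGACCTTAGCA"
quality = "IIHFGEFIIHDF>?=;:987"
scores = decode_quality(quality)

print("Pos Base Char ASCII  Q  P(error)")
for pos, (base, char, q) in enumerate(zip(sequence, quality, scores), start=1):
    p_error = 10 ** (-q / 10)
    print(f"{pos:>3} {base:>4} {char:>4} {ord(char):>5} {q:>2}  {p_error:.5f}")
```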

Global Interpretation

  • Beginning of the read (positions 1–10): very high quality scores (Q36–40), virtually error-free.
  • Middle region (11–14): quality starts to dip (Q29–37), still acceptable.
  • End of the read (15–20): quality drops considerably (Q22–28), with an expected per-base error rate of roughly 0.2–0.6%. A decline toward the end of the read is typical of Illumina sequencers.
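A regional summary like this one can be checked directly from the quality string. The sketch below assumes Phred+33 encoding and uses the same three position windows as the list above; `summarize` is an illustrative helper that reports the lowest and highest Q in a window, plus the mean error probability.

```python
quality = "IIHFGEFIIHDF>?=;:987"
scores = [ord(char) - 33 for char in quality]  # Phred+33 decoding

# 1-based, inclusive position windows taken from the text.
regions = {
    "beginning": (1, 10),
    "middle": (11, 14),
    "end": (15, 20),
}


def summarize(scores, first, last):
    """Return (min Q, max Q, mean error probability) for positions first..last."""
    window = scores[first - 1:last]
    probs = [10 ** (-q / 10) for q in window]
    return min(window), max(window), sum(probs) / len(probs)


for name, (first, last) in regions.items():
    q_min, q_max, mean_p = summarize(scores, first, last)
    print(f"{name:>9} ({first}-{last}): Q{q_min}-Q{q_max}, "
          f"mean P(error) = {100 * mean_p:.3f}%")
```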

Per-base quality scores: high-quality bases at the start, decreasing toward the end of the read.

This pattern is common: high quality at the beginning, drop at the end. That is why trimming of the last nucleotides is often performed.
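To illustrate the idea, here is a deliberately simple end-trimming sketch: it removes bases from the end of the read for as long as their quality is below a threshold. The function name `trim_low_quality_end` and the cutoff of Q25 are assumptions chosen for this example, and Phred+33 encoding is assumed. Dedicated trimming tools generally use more robust strategies than this.

```python
def trim_low_quality_end(sequence: str, quality: str,
                         min_q: int = 25, offset: int = 33):
    """Drop trailing bases whose Phred score is below min_q.

    Trimming stops at the first base (scanning from the end) that
    meets the threshold; bases before it are kept regardless of quality.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


seq, qual = trim_low_quality_end("ACGTTCTGATGACCTTAGCA", "IIHFGEFIIHDF>?=;:987")
print(seq)   # ACGTTCTGATGACCTTA
print(qual)  # IIHFGEFIIHDF>?=;:
```

With a Q25 cutoff, the last three bases of the example read (Q24, Q23 and Q22) are removed and the remaining 17 bases are kept.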

REVIEW

  • FASTQ = raw data (bases + per-base quality).
  • There is always at least one FASTQ file.
  • If the experiment was paired-end, both ends of each fragment were sequenced, resulting in two files (R1 and R2).
  • If it was single-end, there will be only one file (R1).