SAM files
SAM files are the cornerstone of sequence alignment analysis. Useful for inspecting mapping quality, errors, and alignment structure, they are typically converted to BAM for efficient storage and processing. Understanding their fields is essential for accurate variant calling, visualization, and downstream genomic workflows.
A SAM file is a plain-text format used to store alignments of DNA or RNA sequences against a reference genome.
It is generated by alignment programs such as BWA or minimap2, and it constitutes a central component of bioinformatics pipelines, as it preserves not only the sequencing reads but also how and where they align to the genome.
Structure of a SAM File
A SAM file is composed of two main sections:
- Header: every header line begins with the symbol @ and carries information about the format version, sort order, reference contigs, and alignment parameters.
- Alignment lines: each line corresponds to a read alignment and contains at least 11 mandatory fields.
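As a rough illustration of the two sections, the sketch below separates header lines from alignment lines by checking for the leading @. The function name split_sam and the example record are made up for this illustration; real files are normally read with a dedicated library rather than by hand.

```python
def split_sam(text):
    """Split SAM text into (header_lines, alignment_lines)."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)       # header lines start with '@'
        else:
            alignments.append(line)   # everything else is an alignment record
    return header, alignments


# Hypothetical two-line header plus one made-up read.
# Fields are tab-separated, as in real SAM files.
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)
header, alignments = split_sam(example)
print(len(header), "header line(s);", len(alignments), "alignment line(s)")
```

Splitting on the first character is enough here because read names are not allowed to contain @.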
1. HEADER
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
What does this mean?
@HD VN:1.6 SO:coordinate
- @HD: indicates a header line (the file header)
- VN:1.6: the version of the SAM format used here (version 1.6)
- SO:coordinate: specifies the Sort Order of the file.
- coordinate: reads are sorted by reference sequence (in @SQ order) and then by genomic position (ascending).
- Other possible values are unknown (the default), unsorted (no defined order), and queryname (sorted by read name).
Means: “This SAM file conforms to version 1.6 of the format, and reads are organized by genomic coordinate.”
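The TAG:VALUE layout of a header line can be unpacked in a few lines of Python. This is a minimal sketch, assuming tab-separated fields as in a real SAM file; parse_header_line and the "text" key used for @CO comment lines are names chosen for this example only.

```python
def parse_header_line(line):
    """Return (record_type, tags) for one tab-separated SAM header line."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]  # e.g. "@HD" or "@SQ"
    if record_type == "@CO":
        # Comment lines carry free text rather than TAG:VALUE pairs.
        # The "text" key is an arbitrary choice made for this sketch.
        return record_type, {"text": "\t".join(fields[1:])}
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, so values may contain colons.
        key, _, value = field.partition(":")
        tags[key] = value
    return record_type, tags


hd = parse_header_line("@HD\tVN:1.6\tSO:coordinate")
sq = parse_header_line("@SQ\tSN:chr1\tLN:248956422")
print(hd)
print(sq)
```

Values are kept as strings; a caller that needs the contig length as a number would convert LN with int().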
Is the header mandatory?
No. However, in practice, most bioinformatic tools require or strongly recommend including at least:
- @HD: version + sort order
- @SQ: the list of chromosomes/contigs used as the reference
The header informs downstream programs which reference genome was used and the length of each chromosome. Without this information, tools used later in the pipeline (e.g., IGV, samtools, GATK) may fail to interpret the file correctly.
Common Header Issues
When the same software performs both alignment and SAM generation, the header is usually consistent. Inconsistencies often arise in the following scenarios:
- Differences in reference genome nomenclature
- Some builds/versions use “chr1”, others simply “1”.
- Example:
- Header: @SQ SN:1 LN:248956422
- Alignment lines: chr1
- This typically occurs when the SAM was generated against a different FASTA than the one later used by another program (IGV, GATK, bcftools, etc.).
- Format conversion
- During conversions (SAM → BAM → CRAM) or when using tools such as samtools merge, Picard, or bamtools, the header may be rewritten.
- In these steps, chromosome names or their order can change, causing incompatibilities.
- Merging files from different alignments
- Combining BAM files from different alignment runs (e.g., hg19 vs. GRCh38) often results in inconsistent header naming conventions.
- This causes errors because the @SQ lines do not match.
- Manual edits or custom pipelines
- Occasionally, users edit the header manually (e.g., with samtools reheader) to correct an issue.
- If not done consistently, errors can appear, such as “phantom” chromosomes (present in the header but unused, or vice versa).
- Different software = different conventions
- A classic example:
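- bwa mem run against a UCSC-style reference writes @SQ SN:chr1.
- bowtie2 run against an Ensembl-style index may write SN:1.
- Aligners copy sequence names from the reference they were given, so mixing files produced against differently named references yields non-identical names.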
Programs such as samtools, bcftools, IGV, and GATK require that the names in @SQ SN: exactly match the names in the FASTA reference used.
If there is any discrepancy (even chr1 vs 1), errors of the following sort may occur:
Sequence dictionary and alignment file not compatible
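A quick way to catch this kind of mismatch before running a downstream tool is to compare the @SQ names against the names in the reference FASTA. The sketch below is one possible check, not how any of the tools above do it; all function names are hypothetical, and the "chr" prefix test is a simple heuristic that covers only the chr1 vs 1 case.

```python
def sq_names(header_lines):
    """Collect SN values from @SQ header lines, in file order."""
    names = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def fasta_names(fasta_lines):
    """Collect sequence names from FASTA '>' lines (text up to the first whitespace)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def strip_chr(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def compare_names(header, reference):
    """Return names found only in the header, only in the reference,
    and header names that differ from a reference name just by 'chr'."""
    header_set, reference_set = set(header), set(reference)
    only_header = sorted(header_set - reference_set)
    only_reference = sorted(reference_set - header_set)
    stripped_reference = {strip_chr(name) for name in only_reference}
    prefix_mismatch = [n for n in only_header if strip_chr(n) in stripped_reference]
    return only_header, only_reference, prefix_mismatch


# Hypothetical inputs reproducing the "1" vs "chr1" situation.
header_lines = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:1\tLN:248956422"]
fasta_lines = [">chr1 hypothetical description", "ACGT"]
only_header, only_reference, prefix_mismatch = compare_names(
    sq_names(header_lines), fasta_names(fasta_lines)
)
print("only in header:", only_header)
print("only in FASTA:", only_reference)
print("differ only by 'chr' prefix:", prefix_mismatch)
```

The same compare_names function can be fed the RNAME values used in the alignment lines instead of FASTA names, which would expose "phantom" chromosomes declared in the header but never used, or used but never declared.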
2. READ ALIGNMENT LINES
Each alignment line follows a fixed format with at least 11 mandatory fields:
- QNAME: read name (ID)
- FLAG: integer value encoding information about the read (e.g., whether it belongs to pair R1 or R2, whether it is aligned, whether it maps to the reverse complement, etc.)
- RNAME: chromosome or reference sequence where the read is aligned
- POS: starting position (1-based) of the alignment on the reference
- MAPQ: mapping quality (confidence that the alignment is correct)
- CIGAR: string describing how the sequence aligns (number of matches, insertions, deletions)
- RNEXT: reference name of the mate, i.e. the next read of the pair (for paired-end sequencing); = means the same reference as RNAME
- PNEXT: starting position of the mate
- TLEN: observed template length (insert size)
- SEQ: the nucleotide sequence of the read
- QUAL: quality scores for each base (same Phred scale as in FASTQ)
Beyond these 11 fields, optional fields may be included to provide additional information (e.g., number of mismatches, tags generated by the alignment software, etc.)
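The field list above maps naturally onto a small record type. The following is a minimal sketch, assuming a tab-separated line; SamRecord, parse_alignment, and the example read are invented for illustration, and the optional fields are kept as raw strings rather than decoded.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int            # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]  # raw TAG:TYPE:VALUE strings, left undecoded


def parse_alignment(line):
    """Parse one tab-separated SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        optional=fields[11:],
    )


# Made-up 4-base read with one optional tag (NM = edit distance).
record = parse_alignment(
    "read1\t99\tchr1\t10000\t60\t4M\t=\t10100\t104\tACGT\tIIII\tNM:i:0"
)
print(record.rname, record.pos, record.cigar, record.optional)
```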
Example:
M00176:35:000000000-A3JHG:1:1101:10003:12345 99 chr1 10000 60 76M = 10100 176 ACGTTCTGATGACCTTAGCA... IIIHFGEFIIHDF>?=;:987…
Explanation
- QNAME: M00176:35:... → read identifier.
- FLAG: 99 → 1 + 2 + 32 + 64: paired read, mapped in a proper pair, mate on the reverse strand, first read of the pair. Bit 16 is not set, so the read itself maps to the forward strand.
- RNAME: chr1 → aligned to chromosome 1.
- POS: 10000 → alignment starts at position 10,000.
- MAPQ: 60 → high confidence in the mapping.
- CIGAR: 76M → 76 bases aligned to the reference with no insertions or deletions (M covers both matches and mismatches).
- RNEXT: = → mate aligned on the same chromosome.
- PNEXT: 10100 → starting position of the mate.
- TLEN: 176 → observed fragment length.
- SEQ: observed DNA sequence.
- QUAL: quality scores for each base, in the same format as FASTQ.
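The FLAG and CIGAR values in this example can be checked with a short script. The sketch below decodes the FLAG bits and computes how many reference bases a CIGAR string covers; the bit labels are informal names chosen here, and the TLEN check assumes the mate is also 76 bases long and fully aligned, which the example line does not show.

```python
import re

# FLAG bits defined by the SAM format; the labels are informal names.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference


def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string ('*' gives 0)."""
    if cigar == "*":
        return 0
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REFERENCE_CONSUMING
    )


labels = decode_flag(99)
pos, cigar, pnext = 10000, "76M", 10100
end = pos + reference_span(cigar) - 1      # last reference base covered by the read

# ASSUMPTION: the mate is also 76 bases long and fully aligned (76M).
mate_span = 76
tlen = (pnext + mate_span - 1) - pos + 1   # leftmost to rightmost base of the pair

print(labels)
print("read covers", pos, "to", end)
print("template length under the stated assumption:", tlen)
```

Under that assumption the computed template length agrees with the TLEN of 176 in the example line.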
The SAM format is particularly useful for inspecting alignments, assessing sequencing quality, and identifying potential mapping errors. Being text-based, it is transparent but also large. For this reason, it is usually converted into BAM, its compressed binary equivalent, which is smaller and supported by most analysis and visualization tools, such as IGV, HipSTR, and GangSTR.