FastQC

Free

Quality control for high-throughput sequencing data (FASTQ): per-base quality, sequence content, GC content, and more

Overview

FastQC performs quality control (QC) on high-throughput sequencing data in FASTQ format. It provides a suite of metrics and visualizations to assess read quality, sequence composition, and potential issues before downstream analysis.

Why QC matters

- **Detect sequencing problems**: Identify poor-quality bases, adapter contamination, or biased composition
- **Validate data**: Ensure data quality is sufficient for alignment, variant calling, or assembly
- **Troubleshoot**: Diagnose failed runs, library prep issues, or instrument artifacts
- **Species-agnostic**: Works with any organism's DNA or RNA sequencing data

QC modules included

- Per-base sequence quality (Phred scores by position)
- Per-base sequence content (A/C/G/T/N percentage along the read)
- Per-sequence GC content (GC% distribution across reads)
- Sequence length distribution
- Overrepresented sequences (duplicates, adapters, contamination)
- Basic statistics (total reads, bases, mean/min/max length)

Input Format

**Required format: FASTQ (4 lines per read)**

- **Line 1**: Read identifier (must start with `@`)
- **Line 2**: Sequence (A, C, G, T, N; case-insensitive)
- **Line 3**: `+` (separator; some formats optionally repeat the identifier here)
- **Line 4**: Quality string (same length as line 2); **Phred encoding offset 33** (Sanger / Illumina 1.8+)

Quality encoding

Each character on line 4 corresponds to one base on line 2. Quality score = `ord(character) - 33` (Phred scale). Higher values = better quality (e.g. 30 = 99.9% base call accuracy).
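The decoding rule above can be sketched in a few lines of Python. The function name `phred_scores` is a hypothetical helper for illustration, not part of the tool.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    Hypothetical helper: applies Q = ord(character) - offset to each character.
    """
    return [ord(ch) - offset for ch in quality]


# 'I' is ASCII 73, '5' is ASCII 53, '#' is ASCII 35
print(phred_scores("II5#"))  # [40, 40, 20, 2]
```

The `I` characters in the example input below therefore decode to Q = 40 at every position.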

Example input

```
@read1
ATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```

Requirements

- Paste raw FASTQ text (e.g. from a `.fastq` or `.fq` file)
- All four lines must be present for each read
- Sequence and quality line lengths must match
- Reads can be single-end; for paired-end data, paste one mate or run each mate separately

Output Explanation

**Basic statistics**

- **Total reads**: Number of valid FASTQ records parsed
- **Total bases**: Sum of all sequence lengths
- **Mean / Min / Max length**: Read length statistics (bp)
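As a rough illustration of how these numbers relate to the reads, the sketch below computes them from a list of sequences. `basic_statistics` and its dictionary keys are hypothetical names chosen for this example.

```python
def basic_statistics(sequences: list[str]) -> dict[str, float]:
    """Summarize read count and read lengths (hypothetical helper)."""
    lengths = [len(seq) for seq in sequences]
    if not lengths:
        # No valid reads: report zeros rather than dividing by zero
        return {"total_reads": 0, "total_bases": 0,
                "mean_length": 0.0, "min_length": 0, "max_length": 0}
    return {
        "total_reads": len(lengths),
        "total_bases": sum(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


print(basic_statistics(["ACGT", "AC", "ACGTAC"]))
```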

**Per-base sequence quality**

- Line plot: mean Phred score (y) vs. position in read (x)
- Good data: mean quality stays high (e.g. >28) across the read
- Warning: quality drops at the 3' end (common in Illumina data); very low quality may indicate that trimming or filtering is needed
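The values behind such a plot can be sketched as a per-position average over all quality strings. This is a minimal sketch assuming Phred+33 input; `mean_quality_by_position` is a hypothetical name, and reads shorter than the longest read simply stop contributing at their last base.

```python
def mean_quality_by_position(quality_strings: list[str], offset: int = 33) -> list[float]:
    """Mean Phred score at each read position (hypothetical helper)."""
    totals: list[int] = []  # sum of scores seen at each position
    counts: list[int] = []  # number of reads covering each position
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# 'I' decodes to 40 and '5' to 20
print(mean_quality_by_position(["II", "5I5"]))  # [30.0, 40.0, 20.0]
```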

**Per-base sequence content**

- Percentage of A, C, G, T (and N) at each position across all reads
- Ideally: roughly flat lines (~25% each for A/T/G/C in a random library)
- Bias at the start or end often comes from adapters or primers; bias in the middle may indicate contamination or an amplification artifact
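One way to picture this module is as a table of base percentages per position. The sketch below is an illustration under stated assumptions: `base_content_by_position` is a hypothetical name, sequences are upper-cased first, and any character other than A, C, G, T, or N counts toward the position total without being reported.

```python
from collections import Counter


def base_content_by_position(sequences: list[str]) -> list[dict[str, float]]:
    """Percentage of A/C/G/T/N at each read position (hypothetical helper)."""
    counters: list[Counter] = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counters):
                counters.append(Counter())
            counters[i][base] += 1
    table = []
    for counter in counters:
        total = sum(counter.values())
        table.append({base: 100.0 * counter[base] / total for base in "ACGTN"})
    return table


for position, row in enumerate(base_content_by_position(["AC", "AG", "aT", "AN"]), start=1):
    print(position, row)
```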

**Per-sequence GC content**

- GC% for each read, shown as a distribution or per-read plot
- Compare to the expected GC content for your organism; a single sharp peak is a good sign, while a multimodal distribution may indicate contamination or mixed samples
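The per-read value is a simple ratio. In this sketch, `gc_percent` is a hypothetical name, and counting N bases in the denominator is an assumption; the tool may handle ambiguous bases differently.

```python
def gc_percent(sequence: str) -> float:
    """GC percentage of one read (hypothetical helper).

    Assumption: every character, including N, counts toward the read length.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Both example reads from the Input Format section are 50% GC
print(gc_percent("ATCGATCGATCGATCGATCGATCGATCGATCG"))  # 50.0
```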

**Sequence length distribution**

- Histogram: how many reads have each length (e.g. 36 bp, 75 bp, 150 bp)
- Uniform length suggests a single run type; mixed lengths suggest mixed libraries or prior trimming
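The histogram data amounts to a count of reads per length, which can be sketched with `collections.Counter`. `length_distribution` is a hypothetical name for this example.

```python
from collections import Counter


def length_distribution(sequences: list[str]) -> dict[int, int]:
    """Map each read length to the number of reads with that length (hypothetical helper)."""
    counts = Counter(len(seq) for seq in sequences)
    return dict(sorted(counts.items()))  # sorted by length for display


print(length_distribution(["ACGT", "ACGT", "AC"]))  # {2: 1, 4: 2}
```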

**Overrepresented sequences**

- Sequences that make up at least 0.1% of reads (exact match)
- Often adapters, primers, or contaminants; review these to decide whether adapter trimming or contamination checks are needed

Use Cases

**1. Pre-alignment QC**

- Check Illumina/other NGS data before alignment or variant calling
- Decide whether to trim adapters or low-quality bases
- Confirm read length and quality meet pipeline requirements

**2. Library and run QC**

- Validate library prep (e.g. PCR bias, adapter carryover)
- Compare multiple runs or lanes for consistency
- Troubleshoot failed or borderline sequencing runs

**3. Metagenomics and RNA-Seq**

- Assess quality of metagenomic or transcriptomic FASTQ files
- Identify overrepresented sequences (e.g. rRNA, adapter)
- Use before taxonomic or expression analysis

**4. Teaching and reporting**

- Generate QC metrics for methods sections or reports
- Teach FASTQ format and quality concepts
- Quick check of uploaded or shared FASTQ data

Tips & Best Practices

1. **Paste or upload**: Paste FASTQ content directly; for very large files, subset the data (e.g. the first 10,000 reads) to keep the response fast.
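Subsetting can be done before pasting. The sketch below takes the first N records from any iterable of lines, assuming exactly four lines per record and no blank lines; `first_reads` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable


def first_reads(lines: Iterable[str], n_reads: int = 10_000) -> list[str]:
    """Return the lines of the first n_reads records (hypothetical helper).

    Assumption: each record occupies exactly 4 lines.
    """
    return list(islice(lines, 4 * n_reads))


demo = ["@r1", "ACGT", "+", "IIII",
        "@r2", "GGCC", "+", "IIII",
        "@r3", "TTAA", "+", "IIII"]
subset = first_reads(demo, n_reads=2)
print(len(subset))  # 8
```

Because `islice` accepts any iterable, the same function works on an open file handle without reading the whole file into memory.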

2. **Encoding**: The tool assumes Phred+33 (Sanger/Illumina 1.8+). If your data is Phred+64 (older Illumina), convert it first or use another QC tool that supports that encoding.
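Converting a Phred+64 quality string to Phred+33 means shifting every character down by 31 (64 - 33). The sketch below is a hedged illustration: `phred64_to_phred33` is a hypothetical name, and it rejects characters below ASCII 64 because those cannot be valid Phred+64 scores, which often means the data is already Phred+33 or uses a different scale.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (hypothetical helper)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            # Not a valid Phred+64 character; refuse rather than guess
            raise ValueError(f"character {ch!r} is below the Phred+64 range")
        converted.append(chr(score + 33))
    return "".join(converted)


# 'h' is Q40 in Phred+64 and becomes 'I'; 'T' is Q20 and becomes '5'
print(phred64_to_phred33("hhT"))  # II5
```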

3. **Interpret with context**: Compare per-base quality and GC content to your organism and library type; "fail" in one module may be normal for your assay.

4. **Overrepresented sequences**: If adapters or primers appear, consider adapter trimming (e.g. cutadapt, Trimmomatic) before downstream steps.

5. **Paired-end**: For paired-end data, run each mate file separately or paste one mate; metrics are per-file.

6. **Species**: Works with any species; no reference genome required. Use your own reference only in downstream steps (alignment, etc.).

Technical Details

**Phred quality**

- Formula: `Q = ord(quality_character) - 33`
- Q = 10 → 90% base call accuracy; Q = 20 → 99%; Q = 30 → 99.9%
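The accuracy figures above follow from the standard Phred definition, in which the error probability is 10^(-Q/10). A minimal sketch, with `base_call_accuracy` as a hypothetical helper name:

```python
def base_call_accuracy(q: float) -> float:
    """Base call accuracy implied by a Phred score: 1 - 10^(-Q/10)."""
    error_probability = 10 ** (-q / 10)
    return 1.0 - error_probability


for q in (10, 20, 30):
    print(q, f"{base_call_accuracy(q):.1%}")
```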

**Overrepresented threshold**

- Sequences are reported if they represent ≥ 0.1% of total reads
- The top 10 by count are shown (each sequence truncated to 50 bp in the display)
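The threshold, top-10 cut, and 50 bp display truncation can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: `overrepresented` and its parameter names are hypothetical, and matching is exact and case-sensitive.

```python
from collections import Counter


def overrepresented(sequences: list[str], min_fraction: float = 0.001,
                    top_n: int = 10, display_len: int = 50) -> list[tuple[str, int, float]]:
    """Return (displayed sequence, count, percent of reads) for frequent sequences.

    Hypothetical helper: keeps sequences with count / total >= min_fraction,
    then the top_n by count, truncating each to display_len characters.
    """
    total = len(sequences)
    if total == 0:
        return []
    counts = Counter(sequences)  # exact-match counting
    hits = [(seq, n) for seq, n in counts.most_common() if n / total >= min_fraction]
    return [(seq[:display_len], n, 100.0 * n / total) for seq, n in hits[:top_n]]


reads = ["AAAA"] * 3 + ["CCCC"] * 2 + ["GGGG"]
for seq, count, percent in overrepresented(reads, min_fraction=0.3):
    print(seq, count, f"{percent:.1f}%")
```

With fewer than 1,000 reads, a single occurrence already reaches 0.1% of the total, so every distinct sequence passes the default threshold on small inputs.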

**Parsing**

- Reads with mismatched sequence/quality length or missing `@` are skipped
- Only complete 4-line blocks are analyzed
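The skipping rules can be sketched as a small parser. This is a minimal sketch, not the tool's implementation: `parse_fastq` is a hypothetical name, and ignoring blank lines and reading fixed groups of four lines are assumptions about how blocks are delimited.

```python
def parse_fastq(text: str) -> list[tuple[str, str, str]]:
    """Parse FASTQ text into (identifier, sequence, quality) tuples (hypothetical helper)."""
    # Assumption: blank lines are ignored before grouping into 4-line blocks
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    # Step through complete 4-line blocks only; a trailing partial block is dropped
    for start in range(0, len(lines) - 3, 4):
        header, sequence, _separator, quality = lines[start:start + 4]
        if not header.startswith("@"):
            continue  # missing '@' on the identifier line
        if len(sequence) != len(quality):
            continue  # sequence and quality lengths differ
        records.append((header[1:], sequence, quality))
    return records


sample = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIII\nr3\nAC\n+\nII\n@r4\nAC\n"
print(parse_fastq(sample))  # [('r1', 'ACGT', 'IIII')]
```

In this sample, `r2` is dropped for a length mismatch, `r3` for the missing `@`, and `r4` because its block is incomplete.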