Analyzing Gene Expression with STAR and Pysam
Introduction
In the realm of genomics, understanding gene expression patterns is crucial for unraveling the mysteries of cellular function. This blog post will delve into a comprehensive analysis pipeline utilizing STAR, a powerful tool for Spliced Transcripts Alignment to a Reference, combined with the Python library Pysam. The focus is on a specific dataset with the Sequence Read Archive (SRA) ID “SRR639771.”
Setting Up the Environment
Before diving into the code, let’s set the stage. The necessary genomic data is fetched from Ensembl, specifically the GTF (Gene Transfer Format) and FASTA files for Cryptococcus neoformans var. grubii H99. The genome is stored locally in the GENOME_DIR
directory.
id="SRR639771" # Sequence Read Archive ID (NCBI)
GENOME_DIR="/content/genome/"
FASTQ_DIR="/content/fastq/"
GTF_URL="ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz"
FASTA_URL="ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz"
FASTAGZ=GENOME_DIR+FASTA_URL.split("/")[-1]
GTFGZ=GENOME_DIR+GTF_URL.split("/")[-1]
FASTA=FASTAGZ.replace(".gz","")
GTF=GTFGZ.replace(".gz","")
STAR_PATH="/content/STAR/source/STAR" #where to install Spliced Transcripts Alignment to a Reference
STAR_OUT="/content/starout/"
Downloading and Preprocessing
The next step involves downloading the required files and preparing the data for analysis. The SRA toolkit is employed for fetching the raw sequencing data, which is then converted to FASTQ format using fastq-dump
.
!apt install sra-toolkit &>/dev/null
%%bash -s "$id" "$FASTQ_DIR"
prefetch ${1} &>/dev/null
fastq-dump --outdir ${2} --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip /content/${1}/${1}.sra &>/dev/null
%%bash -s "$id"
zcat /content/fastq/${1}_pass_1.fastq.gz | head
!pip install pysam &>/dev/null
%%bash
git clone https://github.com/alexdobin/STAR.git && make -C STAR/source &>/dev/null
Genome Indexing with STAR
STAR requires a pre-built genome index for efficient alignment. The GTF and FASTA files are used to generate this index.
%%bash -s "$GENOME_DIR" "$STAR_OUT" "$GTF_URL" "$FASTA_URL" "$GTFGZ" "$FASTAGZ" "$GTF" "$FASTA" "$STAR_PATH"
mkdir -p "${1}"
mkdir -p "${2}"
if wget --quiet --directory-prefix="${1}" "${3}"; then
echo "First file downloaded successfully."
else
echo "Error downloading the first file. Exiting."
exit 1
fi
if wget --quiet --directory-prefix="${1}" "${4}"; then
echo "Second file downloaded successfully."
else
echo "Error downloading the second file. Exiting."
exit 1
fi
gunzip "${5}"
gunzip "${6}"
${9} --runMode genomeGenerate --genomeDir "${1}" --genomeFastaFiles "${8}" --sjdbGTFfile "${7}" --outFileNamePrefix "${2}/genome_" --genomeSAindexNbases 11 --outSAMtype BAM SortedByCoordinat
Aligning Reads with STAR
With the genome indexed, the next step involves aligning the sequenced reads to the reference genome using STAR.
%%bash -s "$GENOME_DIR" "$id" "$FASTQ_DIR" "$STAR_PATH" "$STAR_OUT"
${4} \
--genomeDir "${1}" \
--readFilesIn "${3}${2}_pass_1.fastq.gz" \
--readFilesCommand zcat \
--outBAMsortingBinsN 200 \
--runThreadN 2 \
--limitBAMsortRAM 1795207491 \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix ${5} \
Quantifying Gene Expression with Pysam
Now, we transition to Python with pysam to quantify gene expression. The code snippet below demonstrates how to extract gene-level read counts and calculate Reads Per Kilobase of exon per Million mapped reads (RPKM).
import pysam
pysam.index(STAR_OUT+"Aligned.sortedByCoord.out.bam"))
bam_file = STAR_OUT+ "Aligned.sortedByCoord.out.bam"
bai_file = STAR_OUT+ "Aligned.sortedByCoord.out.bam.bai"
gtf_file = GTF
# Create a pysam AlignmentFile object with the index file
samfile1 = pysam.AlignmentFile(bam_file, "rb", index_filename=bai_file)
gene_read_counts = []
with open(gtf_file, "r") as gtf:
for line in gtf:
if line.startswith("#"):
continue # Skip comments in GTF file
#if 'exon_number "1"' in line and line[0:2]=="Mt":
fields = line.strip().split("\t")
feature = fields[2]
if feature == 'exon':
chromosome = fields[0]#.replace("chr","")
start = int(fields[3])
end = int(fields[4])
gene_name_full = fields[8]
nameparts=gene_name_full.split(";")
gene_id=nameparts[0].split(" ")[1].replace("\"","")
transcript_id=nameparts[1].split(" ")[2].replace("\"","")
exon_number=nameparts[2].split(" ")[2].replace("\"","")
#print(chromosome,feature,start,end,gene_name_full)
# Count reads that fall between start and end coordinates of the exon
try:
read_count = samfile1.count(chromosome, start, end)
gene_read_counts.append((feature,chromosome, gene_id,transcript_id,exon_number,end-start+1,read_count))
except:
w=1
Visualizing Results
The final section involves visualizing the results. We generate various plots, including bar charts depicting RPKM and raw read counts across different chromosomes.
import pandas as pd
df=pd.DataFrame(gene_read_counts,columns=['Feature','Chromosome','Gene ID','Transcript ID', 'Exon','Length','Counts'])
df.head()
#Reads Per Kilobase of exon per Million mapped reads
total_mapped_reads = samfile1.mapped
df['RPKM'] = (df.Counts / df.Length) * (1e6 / total_mapped_reads)
P=pd.pivot_table(data=df,values="RPKM",index=['Chromosome','Feature','Gene ID','Transcript ID'],aggfunc='mean')
P.sort_values(by='RPKM',ascending=False).head(20)
df.groupby(by = 'Gene ID').mean(numeric_only = True).sort_values(by='RPKM',ascending=False).head(20)
data=df.groupby(by = 'Chromosome').mean(numeric_only = True).sort_values(by='RPKM',ascending=False).reset_index()
data
data.plot.bar(y='RPKM',x='Chromosome');
data.plot.bar(y='Counts',x='Chromosome');
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(data['Chromosome'], data['RPKM'], 'b-')
ax2.plot(data['Chromosome'], data['Counts'], 'r-')
ax1.set_ylabel('Reads Per kb per Million Mapped Reads', color='blue')
ax2.set_ylabel('Counts', color='red')
ax1.set_xlabel('Chromosome')
plt.show()
Conclusion
In conclusion, this analysis pipeline seamlessly integrates command-line tools like STAR with Python libraries such as pysam, providing a comprehensive approach to gene expression analysis. Understanding and adapting this pipeline can help extract meaningful insights from genomic data.