Bioinformatics_RNA-seq
Created: 2021-08-10; Update: 2021-10-01
Chapter 1 Introduction to RNA sequencing (RNA_seq)
An over simplification of the Process.
Holds 4 main sections.
- Sample and sample analysis. Samples are taken, prepared and send to a company for analysis. (Illumina !?).
- Raw data (FASTQ) - Expression quantification.
- Differential expression analysis.
- Pathway and Enrichment analysis.
Each of these sections are subdivided in other steps.
Download or acquire the files
2021/07/02
Copy Trial data from:
MK/土橋 to folder HD-PCFSU3-A/Experiments Data/Genewiz :
The Report file is here.
The Summary file is here.
Raw Data
open "Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/03_GeneExpression"Report file
open "/Volumes/HD-PCFSU3-A/Experiments%20Data/Genewiz/60-499583816_60-501009934/60-499583816_60-501009934_GENEWIZ_RNASeq_Report.html"Summary report file
open "/Volumes/HD-PCFSU3-A/Experiments%20Data/Genewiz/60-499583816_60-501009934/60-499583816_60-501009934_GENEWIZ_QC_Report.html"To open folder in Finder with r code, is similar to terminal commands.:
open . # To open the current working folder within Finder
open ~ # To open the Home folder
open / # To open the Root directory
open Desktop/
open Music/
open /Library/
open /path/to/Directory/
# If folder have spaces, replace with %20% or other similar, or use "".
open "/path/to/Directory/"You can also launch (and update) applications from the Terminal without using Finder. For example, to open Safari, type: (But seem like it only works in terminal and can not do it with r/rstudio)
1.1 Downloading and setting up conda environments.
Benefit: You can install all your tools in “Conda” and export as yml and you can share that environment or create a new one base on it, and by doing so, you can have all those tools automatically installed.
1.2 Install Miniconda, or Anaconda.
Choose Anaconda if you:
- Are new to conda or Python
- Like the convenience of having Python and over 1500 scientific packages automatically installed at once
- Have the time and disk space (a few minutes and 3 GB), and/or
- Don’t want to install each of the packages you want to use individually.
Choose Miniconda if you:
- Do not mind installing each of the packages you want to use individually.
- Do not have time or disk space to install over 1500 packages at once, and/or
- Just want fast access to Python and the conda commands, and wish to sort out the other programs later.
Mind that Windows and Mac installation and procedures are different.
Does Windows need Linux to be intalled? Yes, a version of Linux kernell for window is available for download.
Install Anaconda or Miniconda.
brew install --cask anaconda
brew install --cask miniconda
# To check the version of conda installed.
conda --versionQuick video tutorial on how to use conda.
* (Master) Conda environment.
* Set up a data Science environment with Conda-Forge.
To create an environment.
The easy way: Open the anaconda app > Environment > New > Name… Ok
To access environment and open in terminal: anaconda app > Environment > Select environment > rna-seq_test ^ > »Open in Terminal.
1.2.1 Conda terminal commands.
Conda cheat sheet
In Terminal
# To open conda app
open /Applications/Anaconda-Navigator.app
# Or, if any of the container folders have a space character.
open "/Applications/Anaconda-Navigator.app"
# To open (activate) conda environment.
. /usr/local/anaconda3/bin/activate && conda activate /usr/local/anaconda3/envs/rna-seq_test
# To see all the environment created.
conda env list
# To create a new environment called "rna-seq_test" in conda in terminal:
conda create --rna-seq_test
# To activate environment.
conda activate rna-seq_test
# To see if we activated the environment and where (path) we are.
echo $path
# To exit the environment and return to base/root. Check with echo $path.
conda deactivate
# To see packages installed in env.:
conda config --show channels
# List all packages and versions installed in active environment
conda list
# To see the url location of packages (repository):
conda info
# To add packages to the environment. (only to this environment: --env).
conda config --env --add channels conda-forge
conda config --env --add channels bioconda
# If a package is repeated in the environment, Conda will always install the package of higher version. If you dont want that, and specifically need a version originally configured at creation of the environment.. type:
conda config --env --set channel_priority strict.
# To install fastqc
conda install fastqc
#To make and export a list of the required environment/
conda list --export > requirements.txt
#To make and export an yml file to share with others.
conda env export > rna-seq_test.yml
conda env export --file rna-seq_test.yml
# To quickly check the yml file.
less rna-seq_test.yml
#To see the content of a .yml file.
head rna-seq_test.yml
#To clear the terminal screen.
clear
# To run a Jupyter Notebook.
jupyter notebook
# To Keep Anaconda updated.
conda update conda
# Delete an environment and everything in it called rna-seq_test.
conda env remove --rna-seq_test
# To Uninstall Anaconda.
# [Uninstall page]()1.2.2 Anaconda/Miniconda install Summary of Steps.
- Install anaconda/miniconda for your OS from the website.
- Prevent the base environment from automatically activating.
conda config --set auto_activate_base false
- Create an empty environment.
conda create -n rna-seq_test.
- Activate the environment
conda activate rna-seq_test.
- Add conda-forge as first channel
conda config --env --add channels conda-forge.
- Ensure that conda-forge is used if the package is available.
conda config --env --set channel_priority strict
- Install packages
conda install ....
- Install packages not in conda-forge. Search in conda webpage search.
conda install -c ....
- Verify Jupyter Notebooks in correct environment …!?… In jupyter:
sys.executable
Video reference:
Bioinformatics - Downloading and Setting Up Conda Environments.
1.2.3 Conda activate error
When trying to initialize environment in terminal with the conda activate rna-seq_testcommand, the following message pops up.
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
I tried:
conda init powershell
conda init zsh
So far the only way to enter environment is though anaconda app.
open "/Applications/Anaconda-Navigator.app"Then Environment > rna-seq_test» > Open in terminal… [Terminal]
Another temporary solution is using the code:
# For the rna-seq_test environment.
. /usr/local/anaconda3/bin/activate && conda activate /usr/local/anaconda3/envs/rna-seq_test
# For anaconda3 base environment.
. /usr/local/anaconda3/bin/activate && conda activate /usr/local/anaconda3;1.3 Alignment Procedures and files.
This section was created based in this reference.
After installing “conda,” and to gain time, download the genome of the species to be study. In this case, download the mouse (mus muslculus) genome file. This file contains the list of all the mouse genes and their codes.
We need something to map our reads against once we get to the results.
Files for mouse (Mus musculus) are available for download at Johns Hopkins University Center for Computational Biology (CCB) at TopHat. A spliced read mapper for RNA-Seq.
There are different ways:
- Using tophat
- Using Star
- Custom/personal code.
1.3.1 Using Tophat (early version)
Not Working!!!
To download:
1. Look for the Mus musculus Build37.2 from the NCBI.
1. Right click the link and copy the link address.
1. go back to the conda environment and type wget and the paste the link.
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz
1.3.2 Using Tophat
Not Working!!!
Tophat can be installed using the same conda install
conda install \-c bioconda tophat
When this is finished installing, then we will need to get the mouse genome from the Johns Hopkins Univeristy Center for computational BIology. The version of the mouse genome that I am using here is the NCBI build37.2. Instead of downloading this from the website and having to move it to the cluser, I will just download it using wget into the folder that has the raw reads, trimmed reads, and the FastQC files.
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gzThis will take a long time to download because the file is a little less than 16GB zipped.
Installation of TopHat failed.!!!!!!!!!!!!!!!!!!!!!!!
Try STAR
1.3.3 Using STAR
Complicated!!!
STAR can be installed the same way as the previous programs with conda install (conda install -c bioconda star). In order to run STAR, we need to creaate indices just like with tophat, but STAR has this built in. I’m going to be using the same genome and GTF file as previously downloaded, but Dr. Ge uses a different zipped genome from the gencode database.
~/miniconda2/bin/STAR \
--runThreadN 80 \
--runMode genomeGenerate \
--genomeDir starIndex \
--genomeFastaFiles Index/genome.fa \ #same when we made the bowtie indices
--sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtfWith the index files made, we can start aligning with STAR. It’s important here than we only pick the paired end reads and not use all of the reads. Tophat is able to use all 4 reads but STAR doesn’t allow that, so we need to make sure that we feed in the large files from trimming.
~/miniconda2/bin/STAR --runThreadN 80 --genomeDir starIndex --readFilesIn 770_fp.fq 770_rp.fq --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix 2121770 --outSAMtype BAM SortedByCoordinate
[here]()Complicated.!!!!!!!!!!!!!!!!!!!!!!!
1.3.4 Custom/personal way to Download the Mus_musculus_NCBI_build37.2.tar.gz file.
Esaest so far.
- Go to the iGenomes Ready-To-Use Reference Sequences and Annotations from illumina..
- Copy the Mus musculus (Mouse) NCBI build37.2 link.
- Paste the link in a new tab/window on a web browser (chrome) and click enter.
- Popup window will ask where to save the file, choose folder and save.
- File is 23Gb so it will take about 3 to 4 hrs to download.
1.3.5 Get the Sample files.
Samples files come as “x_1.fastq.gz” or “x_2.fastq.gz,” where x is the name of the file, 1 is for forward and 2 for reverse pair reads. example: we will place them in a folder (00_Rawdata)
mPDL_RNA7D_Ko1_S1_L001_R1_001.fastq.gz
mPDL_RNA7D_Ko1_S1_L001_R2_001.fastq.gz
mPDL_RNA7D_Ko2_S1_L001_R1_001.fastq.gz
mPDL_RNA7D_Ko2_S1_L001_R2_001.fastq.gz
mPDL_RNA7D_Ko3_S1_L001_R1_001.fastq.gz
mPDL_RNA7D_Ko3_S1_L001_R2_001.fastq.gz… etc….
Download multiqc
conda install multiqc1.3.6 Starting to analize fastq.gz files
Initialize fastqc
fastqc Make an output directory named rawFastQC
mkdir rawFastQCNow we have to process all fastq files contained in the folder (00_Rawdata) and place the results in the directory we just made “rawFastQC.”
fastqc rawFastQC/*.fastq.gz -o rawFastQC/Once analysis is finished. Go to the results folder (rawFastQC)
cd rawFastQC“fastqc” created .zip files and a .html file report for each sample. To see them individually may be difficult and take time. For a more convenient way to visualize results, merge all sample reports (html file) in one file. Create the report by reading all the files contained in the results folder (rawFastQC) and combine them in one multiqc_report.html file using “multiqc.”
# go to the results folder where the html report files are.
cd rawFastQC
# Check if the files are there.
ls
# Create the multiqc report. Type (the dot is important!)
multiqc .
# check if file was created.
ls1.3.7 Trimming with Trimmomatic
conda install trimmomatic
# parallel my also be required for task piping.
conda install parallel24:40 Conda was used again to run Trimmomatic. This isn’t as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:
~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &This would be repeated for each of the pairs (12 in total). We are trimming paired-end reads with the TruSeq3-PE-2 adapters. We are chopping off the first and last 3 bases and if we end up with a sequence less than 36 bases, we get rid of it. We want to make sure that there are enough bases in a read to work with. These parameters can be tweaked for possibly better end results with less being discarded. When Trimmomatic is finished running, it will out put the total number of reads, the total number from both the forward and reverse reads that are kept, the number of only forward reads kept, the number of only reverse reads kept, and the number of discarded reads.
The trimmed reads can be analyzed again with FastQC to see how well the trimming worked to make the file better quality. After running FastQC on the trimmed files we see that the quality of those that were really bad quality were improved. There were a few different metrics throughout all of the files that bounced from a warning before the failing, or from passing before to a warning, and so forth, overall creating better quality read files.
Read Check Undrstantig Trimmomatic tutorial for this section !!!!!!
after trimmomatic, files will be created, .gz and .log files. Use multiqc to merge all logs in one report file. Then check the timmomatic survival Reads. If all samples are over 55M (million) reads then samples are valid for further analysis.
Check the order again. Trimmomatic > alligment -> w/ TopHap or STAR
1.3.8 Using STAR to map genes
Using the Mus_musculus files we will compare our samples and find where in the genome our reads are. > Build Genome Indexing and useful STAR Flags (quick).
See also:
novocraft aling. *Opens only on safari.
Analyzind RNA-Seq data using Python3 Snakemake. File Snake file.
Look for more information with rRNA and Microbial Contamination.
Bioinformatics - Contamination QC and FeatureCounts Quick Look at Counts and Setting Up R Project