Chapter 2 File Download (or SRA), QC, and Trimming

Bioinformatics - SRA Download, QC, and Trimming.

Create environment (avoid spaces or special characters).
Init environment
conda install sra-tools This will allow us to pull directly from the sra page. (where the sample ex. Files)

Go to page.
Select only the files needed.
Then in the Selected > click on Accession List >
Save (this saves a .txt file with the list of names of the selected files).
Drag onto the file structure, so its now in system
ls to see the files (SRR_Acc_List.txt)
Check ‘head SRR_Acc_List.txt’ command
Also you can use the vi SRR_Acc_List.txt command to see the list in the file. Ex.

①. SRR2121770\
②. SRR2121771\
③. SRR2121774\
④. SRR2121778\
⑤. Etc.

Now that we have the list of files, we have to individually download each one of them.

One way to do that is with the fastq-dump --gzi --split-file SRR2121770 & command.

①. This command will take the SRR2121770 file, and save it as a .gzi file (to save space)
②. it will split the file in forward (1) and Reverse (2),
③. the “&” sign means that it will run this command in the background.
This will take some time since the file is big. A way to see if the process is finished or not is with the command jobs, which gives a list of the jobs in process or nothing if there are not jobs running.
Another way to do it is with the pipe command, so we don’t have to do this process for each file. Use the command cat SRR_Acc_List.txt | parallel fastq-dump --gzi --split-file {}.

①. This will take the SRR_Acc_List.txt and catalog the list.     
②. Then it will take each of the element in the list "{}" slipt them, and save them as gzi.  
③. In the rawReads folder???.(create rawReads folder). you will have a list of .fastq.gz_1 and .fastq.gz_2 for each element of the list. `…rawReads$ ls`.  
④. Each time something is changed in the files, create a new folder and save there so the originals are not modified. Ex. Trimmed.

Once all files are downloaded, install multiqc. conda install multiqc. (if not installed)
Before running fastqc, create an output directory where to save the processed files. mkdir rawFastQC (make sure you are in the right directory before making this folder).
Then run fastqc: fastqc -t 64 rawReads/*fastq.gz -o rawFastQC

This means, process files with fastqc using 64 treads (-t 64) windows/linux virtual machine;
by taking all fastqc.gz files in the folder rawReads;
and save in the output directory (-o) called rawFastQC.
This process will take a long time depending on the number and size of the files.

Once finished cd rawFastQC and check the files ls
Then inside the folder, run multiqc in multiqc ..
Once this finished, inside the folder a new “multiqc_report.html” will be created.
Open the files and compare.

22:29
Once this quality control is finished we can pass to the trimming of the samples.

2.1 Trimmomatic

Trimmomatic web page.
Quick start Paired End:

# [Trimmomatic web page](http://www.usadellab.org/cms/?page=trimmomatic).
java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

Trimming and Filtering full parameters description and example:

# datacarpentry [Trimming and Filtering](https://datacarpentry.org/wrangling-genomics/03-trimming/index.html) 
trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq  \
              SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
              SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
              ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

Install trimmomatic, conda install trimmomatic (if not installed).

conda install trimmomatic
# parallel my also be required for task piping.
conda install parallel

If we are going to trim the files, we need to make a new folder, so exit rawFastQC folder. cd.. and mkdir trimmedReads, we are going to point trimmomatic the this folder.

# exit rawFastQC folder (remember there is a space between cd and .)
cd .. 
# Create new folder for trimmed files
mkdir trimmedReads

We are going to pull our true adapter sequencer to our working directory. Then it will be a lot easier to point trimmomatic to that vs having to type in the really long path. cp ~/miniconda3/envs/name_of_environment/share/trimmomatic-0.39-1/adapters/TrueSeq3-PE-2.fa . (the last point is important, it means that the command will the done in this folder). This command copy the file TruSeq3-PE-2.fa file to our current folder, why? so the process will be done directly in our folder, instead of the need to type the path to the folder (if not, in the code, instead of the point, type the path to the directory.)
cp ~/miniconda3/envs/rna-seq_test/share/trimmomatic-0.39-1/adapters/TrueSeq3-PE-2.fa . This gives error. Sometimes, due to updates in conda and trimmomatic, folder names might change or there is a space in a folders name. Check if this is the case by open "/usr/local/anaconda3/envs/rna-seq_test/share/". Seems like the version of timmomatic is “0.39-2” and not “0.39-1, and I installed anaconda3 and not miniconda3. Correct code and try again.
Full path is: /usr/local/anaconda3/envs/rna-seq_test/share/trimmomatic-0.39-2/adapters/TruSeq3-PE-2.fa, then the code would be:

`cp /usr/local/anaconda3/envs/rna-seq_test/share/trimmomatic-0.39-2/adapters/TruSeq3-PE-2.fa .`

Install Parallel, Conda install paralell
Create new directory for the trimmed files. mkdir trimmedReads.
Now run the trimmomatic.
For my samples it would be…

# Try 01 FAILED
trimmomatic PE -threads 5 rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko1_S1_L001trimming.log

# Try 02 FAILED
trimmomatic PE -threads 5 rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko1_S1_L001trimming.log

# Try 03 SUCCESS!!!???
trimmomatic PE rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko1_S1_L001trimming.log

# Try 04 (Datacarpentry)
trimmomatic PE -threads 4 rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001_R1_001 ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

# datacarpentry [Trimming and Filtering](https://datacarpentry.org/wrangling-genomics/03-trimming/index.html) 
$ trimmomatic PE -threads 4 
SRR_1056_1.fastq 
SRR_1056_2.fastq  \
SRR_1056_1.trimmed.fastq 
SRR_1056_1un.trimmed.fastq \
SRR_1056_2.trimmed.fastq 
SRR_1056_2un.trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

# Sample 1 Success!!!!???? error but TrimmomaticPE: Completed successfully message.  # error in TruSeq3-PE.fa (No such file or directory)... Change to TruSeq3-PE-2.fa (not tested yet)

trimmomatic PE rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001_R1_001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko1_S1_L001trimming.log

# Sample 2 Success!!!!???? error but TrimmomaticPE: Completed successfully message.  # error in TruSeq3-PE.fa (No such file or directory)... Change to TruSeq3-PE-2.fa (not tested yet)
trimmomatic PE rawReads/mPDL_RNA7D_Ko2_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko2_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko2_S1_L001_R2_001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko2_S1_L001trimming.log

java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

The PE means “Pair end Read”, because we have both forward and reverse ends (1 and 2); If we would have only one end read (1 or 2) then it would be only single read, for single reads we type SE (instead of the PE).

Timmomatic usually use up to 5 threads, thus we use “-threads 5”. (What if we don’t type the # of threads? would it use all threads? Why is this important?….)

Then type the path to the folder and files containing the files to be trimmed. rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001.fastq.gz (forward), and type also the reverse “rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001.fastq.gz”
The “-baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001” means that; the base file name of the output will all have the name ” trimmedReads/mPDL_RNA7D_Ko1_S1_L001”. It will place the file in the trimmedReads folder and all the file with start with the name mPDL_RNA7D_Ko1_S1_L001.

Then we use the True sequence adapter (TruSeq3-PE-2.fa) that we moved to the working folder by typing “ILLUMINACLIP:TruSeq3-PE-2.fa:” (you can use the tap for auto completion coz it is in the same folder)

Then we place the trimming parameters??… “2:30:10:2keepBothReads”. Add more info here…..
how to understand the trimmomatic flags? Understanding Trimmomatic

The “LEADING:3 TRALING:3”means that; if the bases have bad quality, then trimm the leading 3 and trailing 3 bases. more info, If the quality is of very good quality we can remove 10, but we dont want to remove/through away that much information from out reeds.

Then, since our reads are really short and we are looking at their quality, we want to set a mean length. “MINLEN:36”. So if any or our 51 base pair reads have more than 15 removed (51-15= 36), we are just not going to use or keep them because we lost ~33% of the information of that read, so is not going to be a good (quality?) read.

Then we want to keep the output. “2> mPDL_RNA7D_Ko1_S1_L001trimming.log`”, the “2” is to keep output, and then feed that into “>” the “trimmedReads folder, and make (from all the ourput) a nice report like for fastQC and colled it”mPDL_RNA7D_Ko1_S1_L001trimming.log”.

From all the code typed, there will be 5 outputs to be loged, …. Since this takes really long time, and option will be to do a cat with the reads and do 4 jobs at a time.

For my files..??

# Not working
cat Filename.txt | parallel -j 4 "trimmomatic PE -threads 5 rawReads/{}_R1_001.fastq.gz rawReads/{}_R2_001.fastq.gz -baseout trimmedReads/{} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 2> trimmedReads/{}trimming.log"

In the tutorial for auto loading.

cat Filename_list.txt | parallel -j 4 "trimmomatic PE -threads 5 rawReads/{}_1.fastq.gz rawReads/{}_2.fastq.gz -baseout trimmedReads/{} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 2> trimmedReads/{}trimming.log"

Control + Z to stop running and add bj 1 (in root?) to run in the background. type jobs (in terminal) to see the jobs running in the background.

In the github tutorial: Conda was used again to run Trimmomatic. This isn’t as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:

~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &

Finally for this trial I used:

# Sample 1 Success!!!!???? error but TrimmomaticPE: Completed successfully message.  # error in TruSeq3-PE.fa (No such file or directory)... Change to TruSeq3-PE-2.fa (not tested yet)

trimmomatic PE rawReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko1_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko1_S1_L001_R1_001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko1_S1_L001trimming.log

# Sample 2 Success!!!!???? error but TrimmomaticPE: Completed successfully message.  # error in TruSeq3-PE.fa (No such file or directory)... Change to TruSeq3-PE-2.fa (not tested yet)
trimmomatic PE rawReads/mPDL_RNA7D_Ko2_S1_L001_R1_001_1.fastq.gz rawReads/mPDL_RNA7D_Ko2_S1_L001_R2_001_2.fastq.gz -baseout trimmedReads/mPDL_RNA7D_Ko2_S1_L001_R2_001 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/mPDL_RNA7D_Ko2_S1_L001trimming.log

Pipe line code for automatic file loading to trimmomatic still not successfully tested.
Come back to this later.

# Pipe and loop for all files.
cat filenames.txt | parallel -j 4 "trimmomatic PE rawReads/{}_1.fastq.gz rawReads/{}_2.fastq.gz -baseout trimmedReads/{} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRAILING:3 MINLEN:36 2> trimmedReads/{}trimming.log" #Not successful.

Finally re-run the fastqc and multiqc of the trimmed data.

# In TrimmedRead folder
ls
 multiqc .

This will take all the .log files created in by the TruSeq3-PE-2.fa trimmed data.

Note: Seems like multiqc can take all html and log files and combine the information with in them in a single html report file.

# fastQC of trimmed files again?
mkdir trimmedFastQC

# rm -d trimmedFastQC # Make sure folder is in main folder, if not delete

# Run fastqc... Trimmed files have no file extension. Why???
fastqc trimmedReads/* -o trimmedFastQC

# .zip and qc html files will be created. Run multiqc
cd trimmedFastQC

multiqc .

2.2 Indexing with STAR

Bioinformatics - Building Genome Index and Aligning with STAR

STAR is a tool that allow us to build an index of the mouse genome; then, with those index files we can then map the reads against the genome and find the location of each of those reads where they are mapped in the genome. Once we have that, the we can do a the “read count”

For more information see

[STAR: ultrafast universal RNA-seq aligner] (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/) STAR Manual download link. PDF file in local HD
Link to manual page.

Download the Mouse genome form the NCBI page.
See Custom/personal way to Download the Mus_musculus_NCBI_build37.2.tar.gz file section. (open section in a new tab)

See 1.3.4

Once file is downloaded, unzip. (It will take a long time)
Make sure to be in the same folder/directory as the tar.gz file.
extracted files will be saved in a Mus_musculus folder/directory automatically created.

# Check the correct environment.
. /usr/local/anaconda3/bin/activate && conda activate /usr/local/anaconda3/envs/rna-seq_test

# Check the correct folder/directory.   
cd "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/Mus_musculus_NCBI_build37.2.tar.gz"

# cd "folder path"
#Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/Mus_musculus_NCBI_build37.2.tar.gz: Not a directory
cd "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/"

# Check contents
ls  

# Unzip file
 tar xvfz Mus_musculus_NCBI_build37.2.tar.gz

Tar command info for linux here and for mac here

Confirm that STAR is installed if not Install STAR.

# Make sure of enviroment.  
STAR --help

#If not found install.  
conda install STAR 

# Make sure is from "bioconda".  Preceed([y]/n)? message.
y

# Check again. Long list of flags.  
STAR --help

# Create a  directory for starIndex.  
mkdir starIndex

ls

Run STAR.
Make sure that the unzip folder of Mus_musculus_NCBI_build37.2.tar.gz (Mus_musculus) is in the same directory as trimmed/environment. Is this necessary???!

FYI:
Mus_musculus_NCBI_build37.2.tar.gz zip file is 23.58 GB.
Mus_musculus unziped folder is 98.77 GB. Initially downloaded into: "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/Mus_musculus_NCBI_build37.2.tar.gz"
Trial analysis forlder in: "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/Trial02"

Mus_musculus unziped folder moved to: /Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/Trial02/Mus_musculus

Path to genome.fa file.
/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/Mus_musculus/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa

Path to genes.gtf file.
/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/60-499583816_60-501009934/Result/Mus Musculus/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf
However better to use the folder Shortcut (known as aliases on Mac) created when unziped. This eliminates the Archives/archive-2015-07-17-14-32-40section of the path. So… use /Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf

# Make sure that the unzip folder of Mus_musculus_NCBI_build37.2.tar.gz (Mus_musculus) is in the same directory as environment/analysis folder in this case:  

cd "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/Trial02/"

# Run STAR
STAR --runThreadN 64 --runMode genomeGenerate --genomeDir starIndex --genomeFastaFiles Mus_musculus/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa --sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf

Sep 13 12:47:47 ..... started STAR run
Sep 13 12:47:48 ... starting to generate Genome files
Sep 13 12:49:24 ..... processing annotations GTF
Sep 13 12:49:40 ... starting to sort Suffix Array. This may take a long time...
Sep 13 12:49:51 ... sorting Suffix Array chunks and saving them to disk...
Sep 13 13:23:01 ... loading chunks from disk, packing SA...
Sep 13 13:48:03 ... finished generating suffix array
Sep 13 13:48:03 ... generating Suffix Array index
Sep 13 13:53:59 ... completed Suffix Array index
Sep 13 13:54:00 ..... inserting junctions into the genome indices

# Try several times but it could not finished correctly, "System run out of application memory. 

# It is important that the hole process is fully completed.  It takes quite a long time and uses a lot of memory resource.

Your System Has Run Out of Application Memory on Mac.

Solutions: 1. Close unnecessary webpages/programs.
2. Free up more space on the system HDD (at least 20% of free memory?).
3. Google chrome application memory leak?. Run task manager and check memory usage.
4. Check memory pressure (memory pressure command) and double ckeck your free space. Also try running sudo purge from the terminal. 5. Re-start computer to reset uptime. 6. Update software. 7. Reset Mac’s NVRAM and PRAM. 8. Close chrome and/or Safari.#


STAR version: 2.7.9a   compiled:  :/Users/cshl/data/STAR/STAR/source
Sep 13 16:19:40 ..... started STAR run
Sep 13 16:19:41 ... starting to generate Genome files
Sep 13 16:21:14 ..... processing annotations GTF
Sep 13 16:21:28 ... starting to sort Suffix Array. This may take a long time...
Sep 13 16:21:37 ... sorting Suffix Array chunks and saving them to disk...
Sep 13 16:53:49 ... loading chunks from disk, packing SA...
Sep 13 17:18:12 ... finished generating suffix array
Sep 13 17:18:12 ... generating Suffix Array index
Sep 13 17:23:21 ... completed Suffix Array index
Sep 13 17:23:24 ..... inserting junctions into the genome indices
Sep 13 17:25:56 ... writing Genome to disk ...
Sep 13 17:27:06 ... writing Suffix Array to disk ...
Sep 13 17:36:19 ... writing SAindex to disk
Sep 13 17:37:01 ..... finished successfully
(rna-seq_test) MR-MBP20:Trial02 marcelorosales$ 

# Check if the files were created in the folder.

cd starIndex

ls

These process will take about 2 hrs or longer depending con computer’s processing power.

The optins/flags coded were:
–runThreadN 64: Number of threads use for processing.
–runMode genomeGenerate: To start star generates the index files therefore genomeGenerate. –genomeDir starIndex:This is the directory for the star index output (is not the genome FASTA files). –genomeFastaFiles Mus_musculus/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa: the whole genome of the mouse is in this file. –sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf: This lets star know that the next file we are going to give it is a GTF file which is the anotation file of all the transcript that we are going to try to identify with our reads. It may not be really necessary, but it is recommended to usefor more accuracy, specially in de novo transcrip discovery, you would use this basically to take all the reads and map them.

STAR will create the index files and save them in the “starIndex” folder. Whit this files compleated, we will now map the reads of our samples

Align reads
Create a Folder to save the sample reads.

# create a new directory to target as output directory.

mkdir starAligned
ls

For aligning, there are some issues with the flags command Z cat which tells star that the files we’re inputting are zipped.
The way around it is to (maybe not the best way) gunzip flag (command/app). The way to use this is to do/perform the command beforehand, and then input on the files that we want, which it would be something like treammedReads/[] in a cat file. (file containing a catalog or list of the files downloaded or to be fed to the application).
To create a cat file see Cat command in Linux or see video

cat > filenames.txt

# Type or copy paste.

mPDL_RNA7D_Ko1_S1_L001_R1_001_1P
mPDL_RNA7D_Ko1_S1_L001_R1_001_2P
mPDL_RNA7D_Ko2_S1_L001_R2_001_1P
mPDL_RNA7D_Ko2_S1_L001_R2_001_2P

# press control +D to save 

# Also, might need to instal pigz

conda install -c conda-forge pigz

mPDL_RNA7D_Ko1_S1_L001_R1_001_1P mPDL_RNA7D_Ko1_S1_L001_R1_001_1U mPDL_RNA7D_Ko1_S1_L001_R1_001_2P mPDL_RNA7D_Ko1_S1_L001_R1_001_2U mPDL_RNA7D_Ko2_S1_L001_R2_001_1P mPDL_RNA7D_Ko2_S1_L001_R2_001_1U mPDL_RNA7D_Ko2_S1_L001_R2_001_2P mPDL_RNA7D_Ko2_S1_L001_R2_001_2U

mPDL_RNA7D_Ko1_S1_L001_R1_001_1P mPDL_RNA7D_Ko1_S1_L001_R1_001_2P mPDL_RNA7D_Ko2_S1_L001_R2_001_1P mPDL_RNA7D_Ko2_S1_L001_R2_001_2P

changed to match filenames.txt

mPDL_RNA7D_Ko1_1P mPDL_RNA7D_Ko1_2P mPDL_RNA7D_Ko2_1P mPDL_RNA7D_Ko2_2P

# Original
cat filenames2.txt | parallel -j 2 "gunzip trimmedReads/{}_*P.gz && STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P"

# ERROR
# zsh:1: no matches found: trimmedReads/mPDL_RNA7D_Ko1_*P.gz
# zsh:1: no matches found: trimmedReads/mPDL_RNA7D_Ko2_*P.gz

# zsh:1: no matches found: trimmedReads/mPDL_RNA7D_Ko1_*P.gz
# zsh:1: no matches found: trimmedReads/mPDL_RNA7D_Ko2_*P.gz

# With no extensions. 

cat filenames2.txt | parallel -j 2 "gunzip trimmedReads/{}_*P && STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P"

# ERROR
#gunzip: trimmedReads/mPDL_RNA7D_Ko1_1P: unknown suffix -- ignored
#gunzip: trimmedReads/mPDL_RNA7D_Ko1_2P: unknown suffix -- ignored
#gunzip: trimmedReads/mPDL_RNA7D_Ko2_1P: unknown suffix -- ignored
#gunzip: trimmedReads/mPDL_RNA7D_Ko2_2P: unknown suffix -- ignored

cat filenames2.txt | parallel -j 2 "STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P"


# 
cat filenames2.txt | parallel -j 2 "gunzip trimmedReads/{}_*P.gz && STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P"

# No gunzip 
cat filenames2.txt | parallel -j 2 "STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P"

##  NOT WORKING... OPTIONS:

# 1. FIRST convert all .*P files to gz () and do it in the background (use & at the end of line).
pigz trimmedReads/*P &
# Check
bg 1

# to see if it is working 
jobs
# [1]    running    pigz trimmedReads/*P

# to see Process type
top # see commad "pigz" "STAR" if running. also top 

# To stop control + Z
# To exit control + C

# Now all P files are gz. run code.

cat filenames2.txt | parallel -j 2 "gunzip trimmedReads/{}_*P.gz && STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --outSAMtype BAM SortedByCoordinate && pigz trimmedReads/{}_*P" &


# Found better code.

cat filenames2.txt | parallel -j 2 "parallel gunzip ::: trimmedReads/{}_*P.gz && STAR --runThreadN 64 --genomeLoad LoadAndKeep --genomeDir starIndex/ --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --limitBAMsortRAM 5000000000 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx && pigz trimmedReads/{}_*P" &

# I there is no need to unzip

cat filenames2.txt | parallel -j 2 "STAR --runThreadN 64 --genomeLoad LoadAndKeep --genomeDir starIndex/ --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --limitBAMsortRAM 5000000000 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx && pigz trimmedReads/{}_*P"

# ERROR

# Shared memory error: 4, errno: Invalid argument(22)
# EXITING because of FATAL ERROR: problems with shared memory: error from shmget() or shm_open().
# SOLUTION: check shared memory settings as explained in STAR manual, OR run STAR with --genomeLoad NoSharedMemory to avoid using shared memory

cat filenames2.txt | parallel -j 2 "STAR --runThreadN 64 --genomeLoad NoSharedMemory --genomeDir starIndex/ --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --limitBAMsortRAM 5000000000 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx"

# ERROR
# BAMoutput.cpp:27:BAMoutput: exiting because of *OUTPUT FILE* error: could not create output file starAligned/mPDL_RNA7D_Ko2_STARtmp//BAMsort/49/2
# SOLUTION: check that the path exists and you have write permission for this file. Also check ulimit -n and increase it to allow more open files.

ulimit -n
#2560

ulimit -n 100000

fg1 #? What us tgus for?

cat filenames2.txt | parallel -j 2 "STAR --runThreadN 64 --genomeLoad NoSharedMemory --genomeDir starIndex/ --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --limitBAMsortRAM 5000000000 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx"

cat filenames2.txt | parallel -j 1 "STAR --runThreadN 64 --genomeLoad NoSharedMemory --genomeDir starIndex/ --readFilesIn trimmedReads/{}_1P trimmedReads/{}_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/{} --limitBAMsortRAM 5000000000 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx"






STAR --runThreadN 32 --genomeDir starIndex --readFilesIn trimmedReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_1P trimmedReads/mPDL_RNA7D_Ko1_S1_L001_R1_001_2P --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix starAligned/mPDL_RNA7D_Ko1_S1_L001_R1_001 --outSAMtype BAM SortedByCoordinate

SAM is human readable, BAM is compress binary version (for CPU to read),

2.3 Microbial contamination

For this use/install Kraken2

2.4 Feature counts.

video



# Install subreads.

conda install subread

# count

featureCounts -h

cd "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/Trial02/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/"

head genes.gtf

# GTF file has gene_id (sametimes is doesnt work!), and transcipt_id. This two are going to be used to label genes.  in the fearuresCounts -h you can see the code to input the *_id.

# make a count Directoru

cd "/Volumes/HD-PCFSU3-A/Experiments Data/Genewiz/Trial02"

mkdir readCounts

# Read counts
# as .txt
featureCounts -T 8 -a Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g `transcript_id` -o readCounts/readCounts.txt starAlign/*.bam

# as .csv?
featureCounts -T 8 -a Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g `transcript_id` -o readCounts/readCounts.csv starAlign/*.bam

# make the report
cd readCounts

multiqc .

2.5 Setting R environment.

Intall packages

Packages

readr (install.packages(‘readr’))
limma (BiocManager::install(‘limma’))
DESeq2 (BiocManager::install(‘DESeq2’))
dplyr (install.packages(“dplyr”))
ggplot2 (install.packages(“ggplot2”))
gplots (install.packages(“gplots”))
Annotations (BiocManager::install(‘AnnotationDbi’))
org.Hs.eg.db (BiocManager::install(‘org.Hs.eg.db’))
This is for Human
org.Mm.eg.db (BiocManager::install(‘org.Mm.eg.db’))
This is for Mouse
ggrepel (install.packages(“ggrepel”))
ReportingTools (BiocManager::install(‘ReportingTools’))
GO.db (BiocManager::install(‘GO.db’))
GOstats (BiocManager::install(‘GOstats’))
pathview (BiocManager::install(‘pathview’))
gage (BiocManager::install(‘gage’))
gageData (BiocManager::install(‘gageData’))
select (BiocManager::install(‘Select’))

install.packages(c("dplyr", "ggplots", "ggplot2", "greppel" ))

BiocManager::install(c("lima", "DESeq2", "AnotationDbi", "org.Mn.eg.db", "ReportingTools", "GO.db", "GOstats", "pathview", "gage", "gageDATA", "select"))

# Libraries

library(limma)
library(DESeq2)
library(dplyr)
library(readr)
countData = read_csv("readCounts.csv", skip = 1)

##End