12. Short read quality and trimming

You should now be logged into your Atmosphere computer! You should see something like this

sateeshp@vm142-35:~$ 

Note: If you haven’t added conda to your PATH yet, please do so by running the following:

echo export PATH=$PATH:/opt/miniconda3/bin >> ~/.bashrc

Then, run the following command (or start a new terminal session) in order to activate the conda environment:

source ~/.bashrc

12.1. Data source

We will be using mRNAseq reads from a small subset of data from Nematostella vectensis (Tulin et al., 2013).

Original RNAseq workflow protocol here, more updated protocol here.

Load your data into work/data.

First, create the subdirectories

cd ~/
mkdir -p work work/data
cd work/data

Download subset of data:

curl -L https://osf.io/365fg/download -o nema_subset_0Hour.zip
curl -L https://osf.io/9tf2g/download -o nema_subset_6Hour.zip
unzip nema_subset_0Hour.zip
unzip nema_subset_6Hour.zip

Define your $PROJECT variable to be the location of your work directory; in this case, it will be /mnt/work:

export PROJECT=~/work

Now load your data in!

12.2. Check that your data is where it should be

Check:

ls $PROJECT/data

If you see all the files you think you should, good! Otherwise, debug.

These are FASTQ files – let’s take a look at them:

less 0Hour_ATCACG_L002_R1_001.fastq

(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)

Question:

  • where does the filename come from?
  • why are there 1 and 2 in the file names?

Links:

12.3. Quality trimming and light quality filtering

Make sure you’ve got the PROJECT location defined, and your data is there:

set -u
printf "\nMy raw data is in $PROJECT/data/, and consists of $(ls -1 ${PROJECT}/data/*.fastq | wc -l) files\n\n"
set +u

Important: If you get an error above or the count of files is wrong…STOP!! Revisit the installation instructions!

12.3.2. FastQC

We’re going to use FastQC summarize the data. We already installed ‘fastqc’ on our computer for you.

Now, run FastQC on two files:

fastqc *.fastq

After this finishes running (has to run on each file so might take a while), type ‘ls’:

ls -d *fastqc.zip*

to list the files, and you should see a number of files with the extensions .fastqc.zip.

Inside each of the fatqc directories you will find reports from the fastqc. You can download these files using your RStudio Server console, if you like. To install and run an RStudio Server, go here.

Questions:

  • What should you pay attention to in the FastQC report?
  • Which is “better”, file 1 or file 2? And why?

Links:

There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file.

12.3.3. MultiQC

If you would like to aggregate all of your fastqc reports across many samples, MultiQC will do this into a single report for easy comparison.

Run MultiQC:

multiqc .

The terminal output should look like this:

[INFO   ]         multiqc : This is MultiQC v1.6
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
[INFO   ]          fastqc : Found 20 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

You can also view this output here

12.3.3.1. View your files on your own computer

As an alternative to viewing the files on the Rstudio server, we can secure copy (scp) these files to our own laptops, and view them from there.

mkdir ~/Desktop/nema_fastqc  # make a directory for these files
scp username@ip.address:~/work/quality/*.html ~/Desktop/nema_fastqc

where the first argument after scp is your login and path for the files we want to copy (from the atmosphere instance), and the second argument is the path to place the files on our own computer.

If you are unable to use scp though a terminal output, you can see the fastqc html output here

12.3.4. Adapter trim each pair of files

Setup trim directory:

cd ..
mkdir trim
cd trim
ln -s ../data/*.fastq .
cat /opt/miniconda3/share/trimmomatic*/adapters/* > combined.fa

See excellent paper on trimming from MacManes 2014.

Run:

for filename in *_R1_*.fastq
do
# first, make the base by removing fastq
  base=$(basename $filename .fastq)
  echo $base
  
# now, construct the R2 filename by replacing R1 with R2
  baseR2=${base/_R1_/_R2_}
  echo $baseR2
        
# finally, run Trimmomatic
  trimmomatic PE ${base}.fastq ${baseR2}.fastq \
    ${base}.qc.fq.gz s1_se.gz \
    ${baseR2}.qc.fq.gz s2_se.gz \
    ILLUMINACLIP:combined.fa:2:40:15 \
    LEADING:2 TRAILING:2 \
    SLIDINGWINDOW:4:2 \
    MINLEN:25

# save the orphans
  zcat s1_se.gz s2_se.gz >> orphans.qc.fq.gz
  rm -f s1_se.gz s2_se.gz
done

Now, run fastqc again on trimmed files:

fastqc *.qc.fq.gz
multiqc .

The paired sequences output by this set of commands will be in the files ending in .qc.fq.gz, with any orphaned sequences all together in orphans.qc.fq.gz.

Make these trimmed reads read-only and keep them, as we will reuse them later.

chmod a-w ${PROJECT}/trim/*.qc.fq.gz

Questions:

  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?)
  • Do you think parameters are different for RNAseq and genomic data sets?
  • What’s with these annoyingly long and complicated filenames?
  • why are we running R1 and R2 together?

For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.

Links:

Questions:

  • is the quality trimmed data “better” than before?
  • Does it matter that you still have adapters!?