RNA-Seq

Course Overview

RNA-Seq allows us to study the gene expression patterns in organisms. Some of the most common applications of RNA-Seq include measuring gene expression levels and gene annotation. This course teaches you how to process RNA-Seq data from receiving raw sequence files to getting expression values for genes.

RNA-Seq allows us to find differences in gene expression, which can help identify the genes that respond to stress, whether abiotic or biotic. It can be used to find out which other genes in a developmental pathway are affected by a mutation in a given gene, or how the whole transcriptome responds to environmental conditions. In this course we learn to analyse a whole-transcriptome dataset from wheat by mapping it to the reference and looking at gene counts directly. The dataset we use is publicly available and relates to a peer-reviewed publication in Nature Genetics.

Goals of this course

Be able to run an RNA-Seq pipeline to analyse differential gene expression starting from raw fastq reads to a list of candidate genes.

  • Understand what RNA-Seq is and where it can be used
  • Understand the biases and limitations within the experimental design and RNA-Seq analysis
  • Have an overview of programmes that you can use to process the RNA-Seq data files
  • Understand both mapping-based and k-mer-based approaches to RNA-Seq analysis and when to use each
  • Have an introduction to two GUI programmes for easy visualisation and quality control of your RNA-Seq files

In part one, I will introduce RNA-Seq. First, we talk about what RNA-Seq is, where it can be used and why we would use it in different research projects.

Here I will introduce the course and take you through a few experimental questions we were able to answer with an RNA-Seq dataset. I will also explain the importance of choosing the right experimental design to answer the example research questions.

In part two, I will first put the RNA-Seq experiment in context. This is a broad overview from collecting plant material all the way to sequence analysis. I show you how the reads are mapped to the genome and why it is important to understand that the mapped reads derive from exons, without the introns. We will briefly cover the issues that might arise when mapping RNA-Seq data to a reference genome. We conclude part two with a discussion of read terminology and the characteristics of the two main sequencing technologies for RNA-Seq, Illumina and PacBio, and why we might use each of them.

After introducing the concept of RNA-Seq and where you can use it, I will go further into experimental design and the potential biases that you need to be aware of. Here, I will answer questions at the collection stage such as: “What is important when collecting the plants?”, “How many biological samples do I need?”, “How many reads do I need to sequence (depth) to answer my research question?”, and “What biases can I expect while analysing an RNA-Seq dataset?”. Finally, I will give an exercise in which you practise setting up your own RNA-Seq experiment, taking into account all the possible biases one would like to avoid.

Now that we have looked at experimental design, I will go further into sequencing data. Once the email from the sequencing provider arrives saying that the sequences are ready to download, I will explain what you actually get from the provider and what you need to watch out for when downloading the sequences, emphasising the importance of calculating the md5 sums. Once the files are downloaded, we will go into the four-line pattern and structure of a FASTQ file. This will prepare you for the hands-on exercises in the next parts of this course.
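To make the md5 check and the four-line FASTQ pattern concrete, here is a minimal Python sketch. It is not part of the course materials: the function names md5_of_file and read_fastq are my own, and the two reads are made up for illustration. It assumes a plain, uncompressed FASTQ file with exactly four lines per record.

```python
import hashlib
import io


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) ends the records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 starts with "@", line 3 starts with "+".
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a valid FASTQ record: " + header)
        # Line 4 holds one quality character per base on line 2.
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# Made-up two-record example, not taken from the course dataset.
EXAMPLE_FASTQ = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGCA\n+\nIIIII\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for name, sequence, quality in records:
    print(name, len(sequence))
```

In practice you would compare the digest returned by md5_of_file with the value listed by the sequencing provider; if the two differ, the download is incomplete or corrupted and should be repeated.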

In this session, we will start working in the VirtualBox. How to download, update and open the VirtualBox is explained in the Intro course of the bioinformatics courses found here. I will recap a few commands that we discussed in the introduction to the Unix command line. With these commands we will have a look at the different file types and navigate the file structure.

In part 6, we start with an exercise where we map the sequencing reads to a genome. I will explain how to use the software tool HISAT2 and how to check your output Sequence Alignment Map (SAM) file.
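As a rough illustration of what checking a SAM file can involve, the Python sketch below counts header lines and mapped and unmapped alignment records. It relies only on the SAM layout itself: header lines start with "@", fields are tab-separated, the second field is the FLAG, and bit 0x4 of the FLAG marks an unmapped read. The function name summarise_sam and the three example records are made up for illustration and are not output from the course exercise.

```python
def summarise_sam(lines):
    """Count header lines and mapped/unmapped alignment records in SAM text."""
    counts = {"header": 0, "mapped": 0, "unmapped": 0}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            counts["header"] += 1
            continue
        fields = line.split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        if flag & 0x4:  # bit 0x4 set means the read is unmapped
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts


# Made-up SAM content: two header lines and three alignment records.
# The "100N" in the CIGAR of read3 marks a skipped region (an intron),
# as expected when RNA-Seq reads span an exon junction.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1A\tLN:1000",
    "read1\t0\tchr1A\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t16\tchr1A\t201\t60\t4M100N4M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]

summary = summarise_sam(EXAMPLE_SAM)
print(summary)
```

Note that this counts alignment records rather than reads, so a read with several reported alignments would be counted more than once.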

Now that we have generated the SAM file, we want to analyse the transcriptome that is mapped to the genome. We first need to sort and index the alignment file using a tool called Samtools. Subsequently, we will be able to measure transcript abundance using a tool called StringTie.

In this session, I will show you a different method for analysing transcript abundance, one that is faster and more memory efficient. This method is based on k-mer quantification. Here, I introduce k-mers: what they are and how you can generate them from your RNA-Seq files. In this example I use the software tool Kallisto to generate the gene counts using a k-mer approach. Finally, I show you the output formats from Kallisto and how you can interpret them.
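To show what a k-mer is, the short Python sketch below lists every substring of length k in a read and counts how often each occurs. This only illustrates the idea and is not how Kallisto works internally; the function name kmers and the example read are made up, and k is kept small here so the result is easy to read.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ACGTACGT"  # made-up read
read_kmers = kmers(read, 4)
kmer_counts = Counter(read_kmers)

print(read_kmers)
print(kmer_counts.most_common(1))
```

A read of length L contains L - k + 1 overlapping k-mers, so this eight-base read gives five 4-mers, with ACGT appearing twice.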

To conclude the introduction to RNA-Seq, I show you a few GUI tools that you can use to analyse your dataset in a graphical interface. I will discuss two tools in particular: Degust and the Integrative Genomics Viewer (IGV).

Helpful notes and terminology

During this course, I have given an introduction to RNA-Seq. We have covered a broad range of questions that you can answer with RNA-Seq and discussed the RNA-Seq experiment from experimental design to the final candidate list. We have used a real dataset and analysed it using a few of the key RNA-Seq analysis tools on the Linux command line. I am aware that during this course you may have heard a few new terms or terms that you are not too familiar with. In the glossary below I have added definitions of the most important terminology used in this course.

Glossary
  • Gene annotation: The process of finding and designating locations of individual genes and other features on raw DNA sequences, called assemblies.
  • Gene expression: The process by which the information encoded in a gene is turned into a function.
  • Reference genome: A template genome incorporating the most up-to-date information we have on the studied organism.
  • Intron: The non-coding sections of an RNA transcript, or the DNA encoding it, that are spliced out before the RNA molecule is translated into a protein.
  • Exon: The region of the genome that ends up within an mRNA molecule.
  • Exon junction: The point in a spliced RNA transcript where two exons have been joined together during RNA splicing. The exon junction complex is a protein complex which forms on the RNA strand at these junctions.
  • SNPs: Single Nucleotide Polymorphisms.
  • Fragment: Piece of DNA that is interrogated by sequencing.
  • Read: A sequence of nucleotides obtained by a sequencer.
  • SAM file: Sequence Alignment/Map file
  • FPKM: Fragments per kilobase of transcript per million mapped fragments
  • TPM: Transcripts per million
  • TSV: Tab-separated values
  • CSV: Comma-separated values
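
The FPKM and TPM entries in the glossary can be made concrete with a small worked example. The Python sketch below applies the textbook definitions to three made-up genes; the gene names, counts and lengths are invented for illustration. As a simplifying assumption it uses the full gene length, whereas quantification tools generally work with an effective length, so their reported values will differ.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths[gene] * total)
        for gene in counts
    }


def tpm(counts, lengths):
    """Transcripts per million: length-normalised rates scaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Made-up fragment counts and gene lengths (in bases).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

for gene in counts:
    print(gene, round(fpkm_values[gene], 1), round(tpm_values[gene], 1))
```

TPM values always sum to one million within a sample, which makes them easier to compare between samples than FPKM values, whose sum varies from sample to sample.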