Sequence Analysis

Course Overview

In the Sequence Analysis course, I will cover how to align and map sequences to a reference sequence and how to interactively view the mapped and aligned sequences. I will introduce several important file formats commonly used in sequence analysis. Finally, I will explain how you can generate variant files to see Single Nucleotide Variants (SNVs). This course lays the foundation for checking sequencing results and for understanding the different file formats.

Before starting this course, I suggest you first follow 1) Intro: VirtualBox setup and loading packages and 2) Command Line Shell.

Goals of this course

  • Understand what an alignment is
  • Understand the differences between an alignment and a mapping
  • Understand the structure of commonly used file formats, such as GFF3, BED, BAM, SAM and VCF
  • Get more familiar with using sequence visualisation programmes such as the Integrative Genomics Viewer (IGV)
  • Understand the general workflow from receiving sequence files to analysing single nucleotide variants between samples.

In the Sequence Analysis course we look into alignments, variant calling and how to handle sequence data in general.

In the first part of the Sequence Analysis course, I will introduce the course outline and discuss a dataset published in an open-access, peer-reviewed publication. Please read the paper about Kariega following this link before starting the course. In this session, I will also introduce the Integrative Genomics Viewer (IGV) and explain how to read the tab-separated columns of a GFF3 file.
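To give a first impression of the tab-separated columns in a GFF3 file, here is a minimal Python sketch that splits one feature line into the nine columns named in the GFF3 specification. Python is used purely for illustration. The function name `parse_gff3_line` and the example line are made up for this sketch and are not taken from the course dataset.

```python
# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # start and end are 1-based and both inclusive
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # column 9 holds semicolon-separated key=value pairs
    record["attributes"] = dict(
        pair.split("=", 1)
        for pair in record["attributes"].split(";") if pair
    )
    return record


# Hypothetical example line, invented for illustration only.
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo"
gene = parse_gff3_line(example)
feature_length = gene["end"] - gene["start"] + 1
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["ID"])
```

Because both coordinates are inclusive, the feature length is `end - start + 1`. Comment and directive lines in a real GFF3 file start with `#` and would need to be skipped before calling this function.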

In part 2, I will introduce alignments: what they are and how the quality of an alignment is calculated. We will finish with an exercise in which you manually align a sequence of letters.
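One common way to express the quality of an alignment is a score that rewards matches and penalises mismatches and gaps. The sketch below assumes a very simple scheme (match +1, mismatch -1, gap -2). These values and the function name `score_alignment` are my own choices for illustration; real tools use more refined scoring schemes.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned sequences of equal length.

    A '-' in either sequence marks a gap. The default scores are
    illustrative assumptions, not values used by a specific tool.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # one of the sequences has a gap here
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


# Five matches, one mismatch and one gap: 5*1 + (-1) + (-2) = 2
demo_score = score_alignment("GATTACA", "GA-TGCA")
print(demo_score)
```

When you do the manual alignment exercise, you can use a function like this to compare two candidate alignments of the same sequences: under a given scoring scheme, the one with the higher score is the better alignment.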

In part 3, we continue with performing and handling alignments. First, I will discuss the difference between local and global alignments. I will then discuss the Basic Local Alignment Search Tool (BLAST) and the different parts of the BLAST toolkit. During this part we will work together to run a BLAST search using both the web tool and the command line. For the command-line version, I will explain how you can make your own BLAST database.
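The difference between global and local alignment can be sketched with the two classic dynamic-programming approaches: Needleman-Wunsch (global) and Smith-Waterman (local). Note that this is not how BLAST works internally; BLAST uses a faster heuristic. The scoring values and the function name `best_alignment_score` are assumptions made for this illustration.

```python
def best_alignment_score(seq_a, seq_b, local=False,
                         match=1, mismatch=-1, gap=-2):
    """Return the best alignment score of two sequences.

    local=False: global alignment (Needleman-Wunsch), both sequences
                 must be aligned from end to end.
    local=True:  local alignment (Smith-Waterman), only the best
                 matching stretch counts.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                # A local alignment may restart anywhere, so never below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


short, long_seq = "ACGT", "GGGGACGTGGGG"
global_score = best_alignment_score(short, long_seq, local=False)
local_score = best_alignment_score(short, long_seq, local=True)
print(global_score, local_score)
```

With a short sequence that occurs inside a longer one, the local score only reflects the matching stretch, while the global score is pulled down by the gaps needed to cover the whole longer sequence.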

In part 4, I will go over the exercise that I presented in part 3.

The exercise given in part 3, and explained in part 4, produced a Browser Extensible Data (BED) file as output. Here I will discuss what a BED file is, what you can do with it and how it is formatted. I will also cover which programmes you can use to inspect a BED file and how you can modify it to make it easier to view in IGV. With this information, we will work on an exercise to put the new workflow into practice.
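As a small illustration of how a BED file is laid out, the Python sketch below reads the three required columns (chrom, start, end) and, when present, the optional name, score and strand columns. The function names and the example line are hypothetical. The key point is that BED start positions are 0-based and the end position is not included in the feature.

```python
def parse_bed_line(line):
    """Parse one BED line into a dict (first six columns only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    record = {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # 0-based, exclusive
    }
    # Columns 4 to 6 are optional.
    for key, value in zip(["name", "score", "strand"], fields[3:6]):
        record[key] = value
    return record


def bed_to_one_based(record):
    """Convert BED coordinates to 1-based inclusive (start, end)."""
    return record["start"] + 1, record["end"]


# Hypothetical example line, invented for illustration only.
hit = parse_bed_line("chr1\t999\t2000\thit_1\t0\t+")
hit_length = hit["end"] - hit["start"]
print(hit["name"], hit_length, bed_to_one_based(hit))
```

The same region that spans positions 1000 to 2000 in a 1-based format such as GFF3 is written as 999 to 2000 in BED, which is an easy source of off-by-one errors when converting between formats.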

After going in depth into what alignments are and how to check them, we move on to a different type of alignment called mapping. Here we will discuss how to map a sequence to another sequence (often a “reference” sequence) to produce a Sequence Alignment/Map (SAM) file or its binary equivalent, a Binary Alignment/Map (BAM) file. Together, we will go through an exercise in which we map some sequences to a part of the Kariega genome, make a SAM and a BAM file, and look at the mapping in IGV. We will finish this part of the course by looking at the SAM and BAM formats in more depth.
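To make the SAM format a little more concrete, here is a minimal Python sketch that splits one alignment line into its eleven mandatory columns and decodes three of the bits in the FLAG column. The function names and the example read are made up for this sketch. A BAM file holds the same information in compressed binary form, so it cannot be read as text like this.

```python
# The eleven mandatory tab-separated columns of a SAM alignment line.
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, found {len(fields)}")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["tags"] = fields[11:]
    return record


def describe_flag(flag):
    """Decode a few commonly inspected bits of the SAM FLAG value."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
    }


# Hypothetical example read, invented for illustration only.
example = ("read_1\t16\tchr1\t1000\t60\t8M\t*\t0\t0\t"
           "ACGTACGT\tIIIIIIII\tRG:Z:sample1")
read = parse_sam_line(example)
read_flags = describe_flag(read["flag"])
print(read["rname"], read["pos"], read["cigar"], read_flags)
```

In this made-up example the FLAG value 16 means the read mapped to the reverse strand, and the optional `RG:Z:` tag carries the read group of the read.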

In part 7, I introduce variant calling, Single Nucleotide Polymorphisms (SNPs) and other types of variation. After this introduction, I will explain an exercise in which we visualise the variants in BAM files using IGV.

In the final part of the Sequence Analysis course, I show you a method to generate Variant Call Format (VCF) files, using the programme FreeBayes. Subsequently, I show you the layout of a VCF file and how to read it. Then we include the read-group information in the VCF file using hisat2. Finally, we conclude with an exercise that uses most of the programmes and techniques discussed in the course.
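As a preview of the VCF layout, the Python sketch below splits one data line into the eight fixed columns and checks whether the record describes a single nucleotide variant. The function names and both example lines are invented for illustration; they are not output from the programmes used in the course.

```python
# The eight fixed tab-separated columns of a VCF data line.
VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]


def parse_vcf_line(line):
    """Parse one VCF data line (not a header line starting with '#')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, found {len(fields)}")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["pos"] = int(record["pos"])        # 1-based position
    record["alt"] = record["alt"].split(",")  # several ALT alleles possible
    # Column 9 (FORMAT) and the per-sample columns are optional.
    record["samples"] = fields[9:]
    return record


def is_snv(record):
    """True if REF and every ALT allele are exactly one nucleotide."""
    bases = {"A", "C", "G", "T", "N"}
    return record["ref"] in bases and all(a in bases for a in record["alt"])


# Hypothetical example lines, invented for illustration only.
snv = parse_vcf_line("chr1\t1000\t.\tA\tG\t50\t.\tDP=20")
deletion = parse_vcf_line("chr1\t1010\t.\tAT\tA\t50\t.\tDP=18")
print(snv["pos"], snv["ref"], snv["alt"], is_snv(snv), is_snv(deletion))
```

The first made-up record replaces a single A with a G and therefore counts as an SNV; the second removes a T and is a deletion, so `is_snv` rejects it.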