cs4255

Course material for the Algorithms for sequence-based bioinformatics course (CS4255) at TU Delft

View the Project on GitHub AbeelLab/cs4255

CS4255 2019 Project Instructions for FastTree and Pindel

Problem statement

Script

You will be given 2 input files: an alignment file of 24 bacterial strains on which you will use your FastTree implementation to generate a phylogenetic tree, and a collection of paired-end reads mapped to a reference genome, and from these mappings you are asked to find the locations of breakpoints using your Pindel code.

Inputs:

  1. A muliple sequence alignment file for FastTree
  2. Sam file containing paired-end reads mapped to a reference genome for Pindel
    You can find the sequence files for the paired reads pindel_reads1.fq and pindel-reads2.fq, and the reference genome pindel-ref.fasta. The read length is 100 and the insertion size is 400.

Output: A phylogenetic tree in Newick format and the coordinates of breakpoints, including the number of reads that support each breakpoint.

The correct output tree for FastTree is:

fasttree-output

You can also find the Newick file here

You can check your Pindel results with the correct .vcf file that describes all breakpoints in the reference genome.


EDIT: Pindel inputs have been fixed, do let us know if you notice any mistakes!!

Testing your code

Test data

1. FastTree

You can find a small alignment input (fasttree-reallysmall.aln) to test your code, it contains the following 4 sequences (fasttree-reallysmall.fasta), aligned to each other:

ERR234681 GGACAGACCGGACACGGGT
ERR2307717  AGAACGACTGAACACGGGT
SRR6896212  ACAACGACTGACCATAAGG
ERR181778 GGACAGGGGACGCGGGT

The correct output tree for these sequences is:

(SRR6896212:0.41935,ERR2307717:0.00055,(ERR181778:0.49563,ERR234681:0.00055)0.863:0.33176);

test-tree

2. Pindel

There is a small reference genome of length 1000bp pindel-ref_reallysmall.fasta, and your are given 285 paired-end reads in pindel_reads1_reallysmall.fq and pindel-reads2_reallysmall.fq. You can also find the alignment file here. The insertion length of reads is 200.

Reads were obtained from a sequence that contains a deletion of length 100bp at position 150 (0-based index) and an insertion of 10bp at position 600 (0-based index).


Some useful notes

1. FastTree

Here is a quick recap of the method: the notes below are from the reference paper and the project website. It is strongly advised that you also read through the project website AND the supplementary text; method is described in a little more detail there.

1.1 Heuristic Neighbor-Joining

FastTree uses several heuristics to improve the neighbor joining algorithm:

1.2 Nearest Neighbor Interchanges (NNI)

Once we obtain our initial rough topology, we need to use NNIs to improve it:

1.3 Local Bootstrap

To estimate the support for each split, FastTree resamples the alignment’s columns with Knuth’s 2002 random number generator.

1.4 Branch Lengths

Once the topology is complete, FastTree computes branch lengths:

for internal branches and

for the branch leading to leaf A, where d are log-corrected profile distances.

1.5 Unique Sequences

Large alignments often contain many sequences that are exactly identical to each other. Before inferring a tree, FastTree uses hashing to quickly identify redundant sequences. It constructs a tree for the unique subset of sequences and then creates multifurcating nodes, without support values, as parents of the redundant sequences.

2. Pindel

These notes are from the reference article, please refer to the text for further details.

2.1 Detecting large deletions

Some of the reads will only one end mapped to the reference genome. One of the possibilities is that the unmapped read mate spans the break point of a deletion event in the test sample compared to the reference genome as shown in figure below.

2.2 Detecting medium sized insertions

The procedure is very similar to that used for deletions. The only difference is that the search range for the unique occurrence of minimum and maximum unique substrings from the 5’ end of the unmapped read is read length - one. In this case, we certainly cannot reconstruct the whole read and the extra bases are an inserted fragment compared to the reference genome as shown in figure below.

GOOD LUCK!!