cs4255

Course material for the Algorithms for sequence-based bioinformatics course (CS4255) at TU Delft

View the Project on GitHub AbeelLab/cs4255

CS4255 2019 Project Instructions for Minimap and Miniasm

Problem statement

You are given long read sequences of E. coli in FASTQ format as well as short reads; in this project you are asked to implement an assembly pipeline comprising as described below.

Script

Implement the minimap algorithm for read alignment, and miniasm algorithm for assembling the genome graph.

Input: Long reads in FASTQ format
Output: Genome assembly in FASTA format


Testing your code

Test data

1. You can test your codes on the small test dataset with 3 reads

Use k = 10 and w = 3

Correct output (using the same notation as in Algorithm 4 from the paper:

{0: [((0, 0, 0, 0), (0, 0, 0, 20))], 2: [((2, 1, 12, 6), (2, 1, 28, 14))]}
{1: [((1, 0, 0, 1), (1, 0, 0, 19))]}
{0: [((0, 1, 12, 6), (0, 1, 28, 14))], 2: [((2, 0, 0, 0), (2, 0, 0, 20))]}

I.e., (0, 0, 0, 4) = (t, r, c, i’) which translates to (read #, strand, distance, position), see Algorithm 4

2. A bigger sample test is also on the github page here

Intermediate tests

1. Hash function

Your script should
(1) convert the string input TTA to the number 60
(2) convert the number 60 to 9473421487448830983

2. Finding minimizers

For sequence input "GATTACAACT", w = 4 and k = 3 your script should find the minimizers:

(1396078460937419741, 6, 1)
(2859083004982788208, 3, 1)

Some useful notes

Your script should consist of 2 main functions: first one for aligning with minimap and the second one for assembling with miniasm. Here is a quick recap of both methods from the corresponding paper.

1. Minimap

1.1. Computing minimizers

Recall: a (w, k) minimizer of a string s is the smallest kmer in a surrounding window of w consecutive kmers.

Algorithm 1 Algorithm 2

1.2. Indexing

Using the output from Algorithm 1 you should make a hash table for minimizers of all target sequences: Key is the minimizer hash and the value is a set of target sequence index, the position of the minimizer and the strand.

Algoirthm 3

Hint: When you are implementing, you should append minimizers to an array (includes minimizer sequence and its position), and then sort this array once you have collected all the minimizers.

1.3. Mapping

Collect all the minimizer hits between the query and all the target sequences. Then, perform single-linkage clustering to group approximately colinear hits. Recall: Some hits in a cluster may not be colinear because two minimizer hits within distance ϵ are always ϵ-away. To fix this issue, you should find the maximal colinear subset of hits by solving a longest increasing sequencing problem (see line 3 in Algorithm 4)

Algorithm 4

HINT: In implementation, set thresholds on the size of the subset (size is 4 by default) and the number of matching bases in the subset to filter poor mappings (100 for read overlapping).

AN EVEN BETTER HINT: To fin the “maximal colinear subset”, instead of implementing the algorithm from scratch you can use this answer on stackoverflow

2. Miniasm

Recall: assembly graphs are directed acyclic graphs whose vertices are substrings, and the edges between two vertices represent their overlap relationship.

2.1. Trimming reads First things first, you should trim your reads by examining the read-to-read mappings.

For each read, you will compute per-base coverage based on good mappings against other reads (longer than 2000 bp with at least 100 bp non-redundant bases on matching minimizers). Then you should identify the longest region having coverage three or more, and trim bases outside this region.

2.2. Generating assembly graph
HINT: There are 2 ways we can have two strings v and w mapped to each other based on their sequence similarity: (i) v is mapped to a substring of w if w contains v (ii) v overlaps w if a suffix of v and a prefix of w are mapped to each other (written as v → w)

Fig 1

Algorithm 5

2.3. Graph cleaning
Once you complete steps 2.1 and 2.2 you should have an assembly graph. Now, you will be cleaning your graph by (i) removing transitive edges (ii) trimming tipping unitigs composed of few reads (4 by default in the paper) and (iii) popping small bubbles.

Recall: In an assembly graph, an edge vw is transitive if there exsits vu and uw. A vertex v is a tip if degv+ = 0 and degv- > 0. A bubble is a directed acyclic subgraph with a single source v and a single sink w having at least two paths between v and w, and without connecting the rest of the graph. The bubble is tight if degv+ > 1 and degv- > 1

Algorithm 6

2.4. Generating unitig sequences
Intuitively, a unitig is a maximal path on which adjacent vertices can be unambiguously merged without affecting the connectivity of the original assembly graph.

Now, you have sanity option in the middle: (1) Implement both minimap and miniasm from scratch and link them up directly
(2) Implement minimap and miniasm from scratch but, just like minimap, output a PAF file as intermediate, which could be used by miniasm (yours or the papers version). This allows testing of minimap and miniasm independently and interoperable with real-world tools.

HINT: If you choose to take the second route, you have to modify your first function to generate a PAF file as output because the miniasm tool only accepts PAF as input for assembly. See author’s webpage for more details on PAF

GOOD LUCK!!