Genome Assembly
Modern techniques for sequencing genomes produce billions of overlapping short reads that must be assembled into a contiguous reference sequence by software. A Nature Biotechnology primer describes a popular algorithm known as De Bruijn Graph Assembly, which reduces genome assembly to the problem of finding an Eulerian path in a directed graph (which Hierholzer’s algorithm can solve in linear time). The key idea is to represent each short read as two nodes that correspond to the left and right tails of the read, and draw a directed edge between the two nodes. Each edge in the resulting graph corresponds to a short read, and an Eulerian path in the graph corresponds to a sequence of short reads that, after squashing overlapping segments, corresponds to the desired reference sequence.