Presentation
Build / Load De Bruijn graphs
Iterating nodes
- Iterate the nodes of a De Bruijn graph
- Iterate the nodes of a De Bruijn graph in a multithread way
Neighborhoods
Branching nodes
Simple path
- Iterating simple path from a node
- Iterating simple path from a node (continued)
Miscellanous
- Playing with node strands
- Playing with fake nodes
To go further...

Presentation

This page presents some code snippets related to the use of De Bruijn graph API.

Some of the snippets presented below can be used online.

Additional snippets are available in directory: gatb-core/gatb-core/examples/debruijn.

Build / Load De Bruijn graphs

Building a De Bruijn graph from command line options

This snippet shows how to create a Graph object thanks to command line options with at least a mandatory FASTA file URI.

The first thing to do is to get a parser that analyzes the command line options from (argc,argv). Such a parser can be retrieved with a static method from Graph class.

Then, the parsed options can be provided to the Graph::create method and then we get a Graph object on which we can do anything we want.

The only mandatory option is '-in fastafile'. All other options have default values if not set through command line.

In this snippet, we dump information about the Graph object building with Graph::getInfo method.

Remarks: This snippet essentially does the same job as the dbgh5 tool provided by the gatb-core project: it takes a set of reads (as a FASTA file) and generates the corresponding De Bruijn graph as a HDF5 file.

Code is from example debruijn1.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*  Graph creation from command line options                                    */
/*                                                                              */
/*  This snippet uses the OptionsParser facility of GATB-Core library.          */
/*                                                                              */
/* Cmd-line: debruijn1 -in <fasta/q file>                                       */
/*                                                                              */
/* Sample: debruijn1 -in gatb-core/gatb-core/test/db/reads1.fa                  */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We get a command line parser for graphs available options.
    IOptionsParser* parser = Graph::getOptionsParser();
    LOCAL (parser);
    // We use a try/catch block in case we have some command line parsing issue.
    try
    {
        // We parse the user options.
        parser->parse (argc, argv);
        // We create the graph with the provided options.
        Graph graph = Graph::create (parser->getProperties());
        // We dump some information about the graph.
        std::cout << graph.getInfo() << std::endl;
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Building a De Bruijn graph from a command-line-like string

Like the previous snippet, we create a Graph object with command line options, but here the options are directly provided as a string.

Code is from example debruijn2.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*  Graph creation from command line options                                    */
/*                                                                              */
/*  This snippet DO NOT use the OptionsParser facility of GATB-Core library     */
/*  but a simple string (expected to be a Fasta/q file) passed in on the        */
/*  command-line.                                                               */
/*                                                                              */
/* Cmd-line: debruijn1 <fasta/q file>                                           */
/*                                                                              */
/* Sample: debruijn1 gatb-core/gatb-core/test/db/reads1.fa                      */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be a bank URI).
    if (argc < 2)
    {
        std::cerr << "You must provide a bank uri." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We create the graph with the provided options.
        Graph graph = Graph::create ("-in %s", argv[1]);
        // We dump some information about the graph.
        std::cout << graph.getInfo() << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Building a De Bruijn graph from a bank object

Here, we create a Graph object by providing a bank object, more precisely a IBank object.

It is therefore possible to provide a Bank instance (ie a FASTA bank), or another kind of bank that implements the IBank interface.

Note in the example that we can provide additional options after the bank object.

Code is from example debruijn3.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*  Graph creation from a basic command-line.                                   */
/*  Uses abundance-min=5.                                                       */
/*                                                                              */
/* Cmd-line: debruijn1 <fasta/q file>                                           */
/*                                                                              */
/* Sample: debruijn1 gatb-core/gatb-core/test/db/reads1.fa                      */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be a FASTA file URI).
    if (argc < 2)
    {
        std::cerr << "You must provide a FASTA file uri." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We create the graph with the bank and other options
        Graph graph = Graph::create (Bank::open(argv[1]), "-abundance-min %d", 5);
        // We dump some information about the graph.
        std::cout << graph.getInfo() << std::endl;
        // Note: Graph::create will take care about 'bank' object and will delete it if nobody else needs it.
        // In other words: there is no need here to call 'delete' on 'bank' here.
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Building a De Bruijn graph from a fake bank object

Like the previous snippet, we create a Graph object by providing a bank object, but here this is a 'fake' bank built "on the fly".

Such banks are often useful for testing purposes.

In such a case, the output file for the graph will be named "noname", unless a specific name is set through the command line option "-out mygraph".

Code is from example debruijn4.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/* Graph creation from a fake bank and no command line options                  */
/*                                                                              */
/* You can dump the result of this test with HDF5 tools (provided by GATB):     */
/*      h5dump mygraph.h5                                                       */
/*                                                                              */
/* Cmd-line: debruijn4 (takes no argument)                                      */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We get a handle on a fake bank made of 3 sequences.
    IBank* bank = new BankStrings (
        "ATCGTACGACGCTAGCTAGCA",
        "ACTACGTATCGGTATATATTTTCGATCGATCAG",
        "TGACGGTAGCATCGATCAGGATCGA",
        NULL
    );
    try
    {
        // We create the graph with the bank and other options
        Graph graph = Graph::create (bank, "-kmer-size 5  -abundance-min 1  -out mygraph");
        // We dump some information about the graph.
        std::cout << graph.getInfo() << std::endl;
        // Note: Graph::create will take care about 'bank' object and will delete it if nobody else needs it.
        // In other words: there is no need here to call 'delete' on 'bank' here.
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Load a De Bruijn graph from a graph file

Once we have built a graph, it is saved as a file (likely a HDF5 file).

This snippet shows how to load such a file to get a Graph object.

Code is from example debruijn5.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/* Graph loading from a HDF5 file.                                              */
/*                                                                              */
/* Cmd-line: debruijn5 <h5 file>                                                */
/*                                                                              */
/* Sample: debruijn5 gatb-core/gatb-core/test/db/celegans_reads.h5              */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be in HDF5 format).
    if (argc < 2)
    {
        std::cerr << "You must provide a HDF5 file." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We create the graph with the bank and other options
        Graph graph = Graph::load (argv[1]);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Iterating nodes

Iterate the nodes of a De Bruijn graph

This snippet shows how to iterate all the nodes of a graph (the graph is loaded from a graph file).

The idea is to get an iterator from the graph and use it to get each node of the graph.

Here, the nodes are retrieved regardless of any edges between them.

Code is from example debruijn6.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/* Graph loading from a HDF5 file, the iterate over the nodes.                  */
/*                                                                              */
/* Cmd-line: debruijn6 <h5 file>                                                */
/*                                                                              */
/* Sample: debruijn6 gatb-core/gatb-core/test/db/celegans_reads.h5              */
/*                                                                              */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be in HDF5 format).
    if (argc < 2)
    {
        std::cerr << "You must provide a HDF5 file." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We load the graph from the given graph file
        Graph graph = Graph::load (argv[1]);
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator ();
        // We loop each node. Note the structure of the for loop.
        for (it.first(); !it.isDone(); it.next())
        {
            // The currently iterated node is available with it.item()
            // We dump an ascii representation of the current node.
            std::cout << graph.toString (it.item()) << std::endl;
        }
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Iterate the nodes of a De Bruijn graph in a multithread way

As the previous example, this snippet shows how to iterate all the nodes of a graph.

The difference here is that the iteration is parallelized, using all possible available cores, which should speed up the iteration.

WARNING ! don't forget this is parallel execution, so you have be careful if you want to modify the same object in different thread execution contexts.

Note: lambda expressions are used here to have code conciseness (which suppose to use an up-to-date compiler). You can have some information at http://stackoverflow.com/questions/7627098/what-is-a-lambda-expression-in-c11

Code is from example debruijn7.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                                                                              */
/* Graph loading from a HDF5 file, the iterate over the nodes, parallel version.*/
/*                                                                              */
/* Cmd-line: debruijn7 <h5 file>                                                */
/*                                                                              */
/* Sample: debruijn7 gatb-core/gatb-core/test/db/celegans_reads.h5              */
/*                                                                              */
/* WARNING ! THIS SNIPPET SHOWS ALSO HOW TO USE LAMBDA EXPRESSIONS, SO YOU NEED */
/* TO USE A COMPILER THAT SUPPORTS THIS FEATURE.                                */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be in HDF5 format).
    if (argc < 2)
    {
        std::cerr << "You must provide a HDF5 file." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We load the graph from the given graph file
        Graph graph = Graph::load (argv[1]);
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator ();
        // We choose how many cores we want to use.
        // By convention, a 0 value will use all available cores.
        size_t nbCores = 4;
        // We iterate all the nodes of the graph.
        // Note how we create a Dispatcher object (with the required number of cores) and
        // use it for dispatching the iteration in several threads.
        IDispatcher::Status status = Dispatcher(nbCores).iterate (it, [&graph] (const Node& node)
        {
            // This instruction block will be executed within the context of one of the 4 created threads.
            // We are given a current node and we can do some process on it in the current thread.
        });
        // We display some information about the dispatching
        std::cout << "we used " << status.nbCores << " cores, elapsed time is " << status.time << " msec" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Neighborhoods

Working with neighborhoods of nodes in the De Bruijn graph

This snippet shows how to use some methods related to neighbors in a graph.

We use a fake bank with one sequence of size 5 and use a kmer size of 4, so we will have 2 possible nodes in the graph.

We iterate these two nodes, and for one of them, we ask for its neighbors with the Graph::successors method. We can then check that the retrieved neighbor is the correct one by analyzing the node string representations.

In this example, we use the successors method, but note it is possible to get the incoming neighbors with the Graph::predecessors method. By the way, it is possible to know the in and out degree of a given node with the two corresponding methods Graph::indegree and Graph::outdegree

Code is from example debruijn9.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#undef NDEBUG
#include <cassert>
/********************************************************************************/
/*                          Node neighbors management                           */
/*                                                                              */
/* Cmd-line: debruijn9 (takes no argument)                                      */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We create the graph with a bank holding one sequence, and use a specific kmer size and kmer solid abundance to 1
        Graph graph = Graph::create (new BankStrings ("AATGC", NULL), "-kmer-size 4  -abundance-min 1  -verbose 0");
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator();
        // We check that we have only two possible nodes
        assert (it.size() == 2);
        // We loop each node.
        for (it.first(); !it.isDone(); it.next())
        {
            // A shortcut.
            Node& current = it.item();
            // We get the ascii representation of the current iterated node
            std::string s = graph.toString (current);
            // We check that it is correct.
            assert (s=="AATG" || s=="ATGC");
            if (s=="AATG")
            {
                // We get the neighbors of this specific current
                GraphVector<Node> neighbors = graph.successors(current);
                // We check that we got only one successor
                assert (neighbors.size() == 1);
                // Another way to check the number of successors
                assert (graph.outdegree(current) == 1);
                // We check this is the correct neighbor
                assert (graph.toString(neighbors[0]) == "ATGC");
            }
        }
        std::cout << "Test OK" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Working with neighborhoods of nodes in the De Bruijn graph (continued)

This snippet shows how to use some methods related to neighbors in a graph.

We do the same work as the previous example. The only difference is that we retrieve the neighbors as Edge objects rather than Node objects, so we will have the full information about each transition between the source node and the retrieved neighbors (including the transition nucleotide for instance).

In particular, we use some of the Edge attributes (Edge::to, Edge::nt)

Code is from example debruijn10.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#undef NDEBUG
#include <cassert>
/********************************************************************************/
/*                          Node neighbors management                           */
/*                                                                              */
/* Cmd-line: debruijn10 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We create the graph with a bank holding one sequence, and use a specific kmer size and solid kmer abundance to 1
        Graph graph = Graph::create (new BankStrings ("AATGC", NULL), "-kmer-size 4  -abundance-min 1  -verbose 0");
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator ();
        // We check that we have only two possible nodes
        assert (it.size() == 2);
        // We loop each node.
        for (it.first(); !it.isDone(); it.next())
        {
            // A shortcut.
            Node& current = it.item();
            // We get the ascii representation of the current iterated node
            std::string s = graph.toString (current);
            // We check that it is correct.
            assert (s=="AATG" || s=="ATGC");
            if (s=="AATG")
            {
                // We get the neighbors of this specific current
                // Note that we get here the full edge information
                GraphVector<Edge> neighbors = graph.successorsEdge (current);
                // We check that we got only one successor
                assert (neighbors.size() == 1);
                // We check that the target node is ok
                assert (graph.toString(neighbors[0].to) == "ATGC");
                // We check that the transition nucleotide is ok
                assert (neighbors[0].nt == NUCL_C);
            }
        }
        std::cout << "Test OK" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Working with a specific neighbor for a specific node

This snippet shows how to get a specific neighbor for a specific node.

Sometimes, it is interesting to ask the graph for only one neighbor for a given node. It may happen when one has already got neighbors information through a Graph::neighbors call and memorized only the transition nucleotides for valid neighbors.

The Graph::neighbor fulfills this need. This method has two forms, with or without check to graph membership, according to performance considerations.

Code is from example debruijn11.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#undef NDEBUG
#include <cassert>
/********************************************************************************/
/*                          Node neighbors management                           */
/*                                                                              */
/* Cmd-line: debruijn11 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We create the graph with a bank holding one sequence, and use a specific kmer size and solid kmer abundance to 1
        Graph graph = Graph::create (new BankStrings ("AATGC", NULL), "-kmer-size 4  -abundance-min 1  -verbose 0");
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator ();
        // We check that we have only two possible nodes
        assert (it.size() == 2);
        // We loop each node.
        for (it.first(); !it.isDone(); it.next())
        {
            // A shortcut.
            Node& current = it.item();
            // We get the ascii representation of the current iterated node
            std::string s = graph.toString (current);
            if (s=="AATG")
            {
                // We suppose that we know the only possible successor transition from this neighbor (nucl C)
                // It could have been retrieved by an previous call to Graph::neighbors<Edge> method
                // Now, we want to get the successor of the current node, only by giving the transition nucleotide.
                Node neighbor = graph.successor (current, NUCL_C);
                // WARNING ! This Graph::successor method doesn't check whether the neighbor actually belongs to the graph.
                // It is supposed here that the client knows perfectly that its transition nucleotide is valid.
                // This possibility is available for performance matters since checking graph membership may take some time.
                // We check the result
                assert (graph.toString (neighbor) == "ATGC");
                // There is an overloaded form for the Graph::successor method, where the user can provide an extra boolean.
                // With this implementation, a check is done to be sure whether the neighbor surely belongs to the graph.
                // If not so, the extra boolean is set to false, true otherwise.
                bool exists;
                Node potentialNeighbor;
                potentialNeighbor = graph.successor (current, NUCL_A, exists);
                assert (exists == false);
                potentialNeighbor = graph.successor (current, NUCL_C, exists);
                assert (exists == true);
                assert (graph.toString (potentialNeighbor) == "ATGC");
                potentialNeighbor = graph.successor (current, NUCL_G, exists);
                assert (exists == false);
                potentialNeighbor = graph.successor (current, NUCL_T, exists);
                assert (exists == false);
            }
        }
        std::cout << "Test OK" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Branching nodes

In the De Bruijn graph, we can define two types of nodes:

a node N is 'simple' <=> indegree(N)==1 && outdegree(N)
a node N is 'branching' <=> N is not simple

Branching nodes set is an important subset in the whole nodes set of the De Bruijn graph, so it is of most interest to have some graph methods that work on such nodes. In particular, we can:

iterate all the branching nodes of the De Bruijn graph
get the branching neighbors of some node

Remarks: - With this definition, a branching node may have 0 outcoming neighbors or 0 incoming neighbors.; - Since we are considering assembly matters, we should have few branching nodes compared to the simple nodes.

Iterate the branching nodes of a De Bruijn graph

This snippet shows how to iterate the branching nodes of a graph (the graph is loaded from a graph file).

The idea is to get an iterator from the graph and use it to get each branching node of the graph.

Code is from example debruijn8.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/* Graph loading from a HDF5 file, the iterate over the branching nodes.        */
/*                                                                              */
/* Cmd-line: debruijn8 <h5 file>                                                */
/*                                                                              */
/* Sample: debruijn8 gatb-core/gatb-core/test/db/celegans_reads.h5              */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We check that the user provides at least one option (supposed to be in HDF5 format).
    if (argc < 2)
    {
        std::cerr << "You must provide a HDF5 file." << std::endl;
        return EXIT_FAILURE;
    }
    try
    {
        // We load the graph from the given graph file
        Graph graph = Graph::load (argv[1]);
        // We get an iterator for branching nodes of the graph.
        GraphIterator<BranchingNode> it = graph.iteratorBranching ();
        // We loop each node. Note the structure of the for loop.
        for (it.first(); !it.isDone(); it.next())
        {
            // The currently iterated branching node is available with it.item()
            // We dump an ascii representation of the current node.
            std::cout << "[" << it.rank() << "] " << graph.toString (it.item()) << std::endl;
        }
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Working with branching neighbors of a node

This snippet shows how to get the branching neighbors of a node. Such neighbors are computed as follow:

the immediate neighbors of the node are retrieved
a simple path is done from each neighbor in order to reach the first non simple node

Here, we use directly the Graph::successors<BranchingNode> method that encapsulates this behavior.

Code is from example debruijn16.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                          Branching nodes neighbors                           */
/*                                                                              */
/* This snippet shows how to detect a specific pattern in the graph.            */
/*                                                                              */
/* Cmd-line: debruijn16 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    size_t kmerSize = 7;
    // We define some sequences used for building our test graph.
    // Note that the sequences have a difference at index==kmerSize,
    // so the initial node AGGCGCT has 3 outcoming neighbors (and
    // therefore is a branching node)
    const char* sequences[] =
    {
        //      x <- difference here
        "AGGCGCTAGGGAGAGGATGATGAAA",
        "AGGCGCTCGGGAGAGGATGATGAAA",
        "AGGCGCTTGGGAGAGGATGATGAAA"
    };
    try
    {
        // We create the graph.
        Graph graph = Graph::create (new BankStrings (sequences, ARRAY_SIZE(sequences)),  "-kmer-size %d  -abundance-min 1  -verbose 0", kmerSize);
        // We get the first node (should be AGGCGCT); this is a branching node.
        Node node = graph.buildNode (sequences[0]);
        // We retrieve the branching neighbors for the node.
        GraphVector<BranchingNode> branchingNeighbors = graph.successorsBranching (node);
        std::cout << "We found " << branchingNeighbors.size() << " branching neighbors from node " << graph.toString(node) << std::endl;
        // We loop over the branching neighbors. Here, we should have 3 branching neighbors, being the same GGGAGAG
        for (size_t i=0; i<branchingNeighbors.size(); i++)
        {
            std::cout << graph.toString (branchingNeighbors[i])  << std::endl;
        }
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Working with branching neighbors of a node (continued)

This snippet shows how to get the branching neighbors of a node.

It is similar to the previous snippet. The difference here is that we retrieve BranchingEdge objects. A BranchingEdge object is made of:

the source branching node
the target branching node
the direction of the neighbors (in/out coming)
the nucleotide of the transition between the initial branching node and the first neighbor on the simple path between the two branching nodes.
the number of transitions that link the two branching nodes.

Code is from example debruijn17.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                          Branching nodes neighbors                           */
/*                                                                              */
/* This snippet shows how to detect a specific pattern in the graph.            */
/*                                                                              */
/* Cmd-line: debruijn17 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    size_t kmerSize = 7;
    // We define some sequences used for building our test graph.
    // Note that the sequences have a difference at index==kmerSize,
    // so the initial node AGGCGCT has 3 outcoming neighbors (and
    // therefore is a branching node)
    const char* sequences[] =
    {
        //      x <- difference here
        "AGGCGCTAGGGAGAGGATGATGAAA",
        "AGGCGCTCGGGAGAGGATGATGAAA",
        "AGGCGCTTGGGAGAGGATGATGAAA"
    };
    try
    {
        // We create the graph.
        Graph graph = Graph::create (new BankStrings (sequences, ARRAY_SIZE(sequences)),  "-kmer-size %d  -abundance-min 1  -verbose 0", kmerSize);
        // We get the first node (should be AGGCGCT); this is a branching node.
        Node node = graph.buildNode (sequences[0]);
        // We retrieve the branching neighbors for the node.
        GraphVector<BranchingEdge> branchingNeighbors = graph.successorsBranchingEdge (node);
        std::cout << "We found " << branchingNeighbors.size() << " branching neighbors from node " << graph.toString(node) << std::endl;
        // We loop over the branching neighbors. Here, we should have 3 branching neighbors, being the same GGGAGAG
        for (size_t i=0; i<branchingNeighbors.size(); i++)
        {
            // Note: we don't display all the transition nucleotides, only the first transition nucleotide.
            // We also display the number of transitions needed to link the two branching nodes.
            std::cout << graph.toString (branchingNeighbors[i])  << std::endl;
        }
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Simple path

Iterating simple path from a node

As we saw previously, a simple node is defined as having indegree==1 and outdegree==1. It is often useful to iterate successive simple nodes in order to build some path in the De Bruijn graph.

This snippet shows how to iterate such a simple path. Here, the iterated items are the successive nodes of the path.

Code is from example debruijn14.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                          Simple path iteration (with Node)                   */
/*                                                                              */
/* Cmd-line: debruijn14 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        size_t kmerSize = 11;
        const char* seq = "AGGCGCTAGGGTAGAGGATGATGA";
        std::cout << "The initial sequence is '" << seq << "', kmer size is " << kmerSize << std::endl;
        // We create the graph from a given sequence, and for a given kmer size
        Graph graph = Graph::create (new BankStrings (seq, NULL),  "-kmer-size %d  -abundance-min 1  -verbose 0", kmerSize);
        // We get the first node of the sequence.
        Node node = graph.buildNode (seq);
        // We create a Node iterator that iterates all the simple nodes from the first node
        // Remember that a simple node has indegree==1 and outdegree==1
        GraphIterator<Node> path = graph.simplePath (node, DIR_OUTCOMING);
        // We iterate the simple path.
        for (path.first(); !path.isDone(); path.next())
        {
            std::cout << "   [" << path.rank() << "]  current item is " << graph.toString (path.item()) << std::endl;
        }
        std::cout << "The simple path was " << path.rank() << " long" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Iterating simple path from a node (continued)

Like the previous example, this snippet shows how to iterate a simple path. Here, the iterated items are the successive edges of the path. If we note E an edge in this path, we will have:

outdegree(E.from)==1 && indegree(E.to)==1

Code is from example debruijn15.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                          Simple path iteration (with Edge)                   */
/*                                                                              */
/* Cmd-line: debruijn15 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        size_t kmerSize = 11;
        char* seq = (char*) "AGGCGCTAGGGTAGAGGATGATGA";
        std::cout << "The initial sequence is '" << seq << "', kmer size is " << kmerSize << std::endl;
        // We create the graph from a given sequence, and for a given kmer size
        Graph graph = Graph::create (new BankStrings (seq, NULL),  "-kmer-size %d  -abundance-min 1  -verbose 0", kmerSize);
        // We get the first node of the sequence.
        Node node = graph.buildNode (seq);
        // We create a Edge iterator that iterates all the simple edges from the first node
        // Recall that a simple node has indegree==1 and outdegree==1
        GraphIterator<Edge> path = graph.simplePathEdge (node, DIR_OUTCOMING);
        // We iterate the simple path.
        for (path.first(); !path.isDone(); path.next())
        {
            std::cout << "   [" << path.rank() << "]  current item is " << graph.toString (path.item()) << std::endl;
        }
        std::cout << "The simple path was " << path.rank() << " long" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Miscellanous

Playing with node strands

A Node object is fully defined by a kmer value and a strand that disambiguates how to interpret the kmer.

It is often required to change the reading strand of a node. This can be done with the Graph::reverse method.

Code is from example debruijn12.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#undef NDEBUG
#include <cassert>
/********************************************************************************/
/*                          Node strands management                             */
/*                                                                              */
/* Cmd-line: debruijn12 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We create the graph with a bank holding one sequence, and use a specific kmer size and solid kmer abundance to 1
        Graph graph = Graph::create (new BankStrings ("AATGC", NULL), "-kmer-size 5  -abundance-min 1  -verbose 0");
        // We get an iterator for all nodes of the graph.
        GraphIterator<Node> it = graph.iterator ();
        // We check that we have only one possible node
        assert (it.size() == 1);
        // We get the first (and only) node.
        it.first();  Node& current = it.item();
        // We get the ascii representation of the current iterated node
        std::string s = graph.toString (current);
        // We check the string value
        assert (s == "AATGC");
        // We reverse the node, which will be change its strand
        Node other = graph.reverse (current);
        // We check the string value of the reverse of the current node
        assert (graph.toString(other) == "GCATT");
        // We also check that the two nodes share the same kmer value
        assert (current.kmer == other.kmer);
        std::cout << "Test OK" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Playing with fake nodes

Sometimes, it is useful to build "fake" nodes from a simple sequence, without having a graph holding true data.

It is possible to get an empty Graph object (its kmer size must be nevertheless specified), and then use the Graph::buildNode to get a node based on a Data object.

Code is from example debruijn13.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#undef NDEBUG
#include <cassert>
/********************************************************************************/
/*                          Fake nodes management                               */
/*                                                                              */
/* Cmd-line: debruijn13 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We create an empty graph with a given kmer size
        Graph graph = Graph::create (7);
        // We create a sequence
        const char* seq = "ATGCATCGTA";
        // We ask for a fake node, starting at position 0 in the data
        Node n0 = graph.buildNode (seq+0);
        assert (graph.toString(n0) == "ATGCATC");
        // We ask for another fake node, starting at position 1 in the data
        Node n1 = graph.buildNode (seq+1);
        assert (graph.toString(n1) == "TGCATCG");
        std::cout << "Test OK" << std::endl;
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

To go further...

Getting branching nodes statistics in a parallel way

This example is a little bit harder than the previous ones. Its purpose is to show how to use the graph API for extracting some information. In particular, we will try to take care about time execution by using available cores.

The idea here is to compute information about branching nodes and their in and out degrees. For instance, we want to know how many branching nodes have indegree==2 and outdegree=3. We compute therefore the number of branching nodes having indegree==X and outdegree==Y, with X and Y in [0..4] and the constraint that we can't have X==Y==1 (otherwise the node wouldn't be a branching one).

We can do it easily by using the methods:

Graph::successors<BranchingNode>
Graph::predecessors<BranchingNode>

Moreover, we want to do it in a parallel way in order to speed up the computation. The idea is to get an iterator over the branching nodes and iterate it through a Dispatcher object; such a dispatcher will create as many threads as wanted and will feed each threads with branching nodes. Note that this scheme can work here because the processing of one branching node is independant of the others.

We need also some container for the in/out degrees statistics. A natural choice is to use a map, with the key being a unique identifier for a couple (indegree/outdegree) and the value the number of occurrences for the key. The idea is to use one map per thread and to merge the N maps into a global one after the parallel iteration of the branching nodes. We use here a ThreadObject object that allows to do it in a simple way. This object clones N time the global map and each clone is used in a specific thread. The ThreadObject class allows to hide many cumbersome points for the parallelization.

In this example, we also use progress notification feature (with ProgressIterator) in order to have some user feedback during the iteration of the branching nodes.

Code is from example debruijn18.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <map>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
using namespace std;
/********************************************************************************/
/*                   Statistics about branching nodes in/out degrees.           */
/*                                                                              */
/* WARNING ! THIS SNIPPET SHOWS ALSO HOW TO USE LAMBDA EXPRESSIONS, SO YOU NEED */
/* TO USE A COMPILER THAT SUPPORTS THIS FEATURE.                                */
/*                                                                              */
/* Cmd-line: debruijn18 -graph <h5 file>                                        */
/*                                                                              */
/* Sample: debruijn18 -graph gatb-core/gatb-core/test/db/celegans_reads.h5      */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    OptionsParser parser ("GraphStats");
    parser.push_back (new OptionOneParam (STR_URI_GRAPH, "graph input",  true));
    parser.push_back (new OptionOneParam (STR_NB_CORES,  "nb cores",     false, "0"));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        // We load the graph
        Graph graph = Graph::load (options->getStr(STR_URI_GRAPH));
        // We set the number of cores to be used. Use all available cores if set to 0.
        size_t nbCores = options->getInt(STR_NB_CORES);
        // We get an iterator for branching nodes of the graph.
        // We use a progress iterator to get some progress feedback
        ProgressGraphIterator<BranchingNode,ProgressTimer>  itBranching (graph.iteratorBranching(), "statistics");
        // We define some kind of unique identifier for a couple (indegree,outdegree)
        typedef pair<size_t,size_t> InOut_t;
        // We want to gather some statistics during the iteration.
        // Note the use of ThreadObject: this object will be cloned N times (one object per thread) and each clone will
        // be reachable within the iteration block through ThreadObject::operator()
        ThreadObject <map <InOut_t, size_t> > topology;
        // We dispatch the iteration on several cores. Note the usage of lambda expression here.
        IDispatcher::Status status = Dispatcher(nbCores).iterate (itBranching, [&] (const BranchingNode& node)
        {
            // We retrieve the current instance of map <InOut_t,size_t> for the current running thread.
            map <InOut_t,size_t>& localTopology = topology();
            // We get branching nodes neighbors for the current branching node.
            GraphVector<BranchingEdge> successors   = graph.successorsBranchingEdge ((Node&)node); /* node casting is necessary due to incomplete Graph.hpp api for successorsBranchingEdge */
            GraphVector<BranchingEdge> predecessors = graph.predecessorsBranchingEdge ((Node&)node); 
            // We increase the occurrences number for the current couple (in/out) neighbors
            localTopology [make_pair(predecessors.size(), successors.size())] ++;
        });
        // Now, the parallel processing is done. We want now to aggregate the information retrieved
        // in each thread in a single map.
        // We get each map<InOut_t,size_t> object filled in each thread, and we add its data into the "global" map.
        // The global map is reachable through the ThreadObject::operator*. The "topology.foreach" will loop over
        // all cloned object used in the threads.
        topology.foreach ([&] (const map <InOut_t, size_t>& t)
        {
            // We update the occurrence of the current couple (in/out)
            for_each (t.begin(), t.end(), [&] (const pair<InOut_t, size_t>& p) { (*topology)[p.first] += p.second;  });
        });
        // We sort the statistics by decreasing occurrence numbers. Since map have its own ordering, we need to put all
        // the data into a vector and sort it with our own sorting criteria.
        vector < pair<InOut_t,size_t> >  stats;
        for (auto it = topology->begin(); it != topology->end(); it++)  { stats.push_back (*it); }
        sort (stats.begin(), stats.end(), [=] (const pair<InOut_t,size_t>& a, const pair<InOut_t,size_t>& b) { return a.second > b.second; });
        printf ("\nThere are %d branching nodes with the following distribution: \n", itBranching.size());
        size_t sum=0;
        for (size_t i=0; i<stats.size(); i++)
        {
            sum += stats[i].second;
            printf ("    [in=%d out=%d]  nb=%7d  percent=%5.2f  distrib=%5.2f\n",
                stats[i].first.first,
                stats[i].first.second,
                stats[i].second,
                100.0*(float)stats[i].second / (float)itBranching.size(),
                100.0*(float)sum             / (float)itBranching.size()
            );
        }
        printf ("\nDone on %d cores in %.2f sec\n\n", status.nbCores, (float)status.time/1000.0);
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Computing connected components of the branching nodes sub graph

Code is from example debruijn19.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <map>
#include <cstdio>
#include <cstdlib>
#include <set>
#undef NDEBUG
#include <cassert>
using namespace std;
/********************************************************************************/
//
//
/********************************************************************************/
/*  BFS exploration of a graph.                                                 */
/*                                                                              */
/*  We define a class that marks nodes in a graph and call tell whether a given */
/*  node is marked or not.                                                      */
/*  We use a map for the implementation (could be not optimal).                 */
/*                                                                              */
/* Cmd-line: debruijn19 -graph <h5 file>                                        */
/*                                                                              */
/* Sample: debruijn19 -graph gatb-core/gatb-core/test/db/celegans_reads.h5      */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/*                                                                              */
/********************************************************************************/
class GraphMarker
{
public:
    GraphMarker (const Graph& graph) : graph(graph)
    {
        // We insert all the nodes into our map.
        auto iter = [&] (const BranchingNode& item)  {  this->markMap[item] = false;  };
        graph.iteratorBranching().iterate (iter);
    }
    void mark (const BranchingNode& item)  {  markMap [item] = true;  }
    void mark (const set<BranchingNode>& items)
    {
        for (typename set<BranchingNode>::const_iterator it=items.begin(); it != items.end(); ++it)  {  mark (*it); }
    }
    bool isMarked (const BranchingNode& item) const  {  return markMap.find(item)->second;  }
private:
    const Graph& graph;
    map<BranchingNode,bool>  markMap;
};
/********************************************************************************/
// We define a class that performs a breadth first search in a graph from a starting node.
// We need a marker to globally mark the visited nodes from one call to another.
// The nodes visit is handled by a std::queue object.
class BFS
{
public:
    BFS (const Graph& graph) : graph(graph) {}
    // Process the BFS started from the provided node. As a result, we get the nodes number in the
    // found connected component.
    const set<BranchingNode>& run (const BranchingNode& node)
    {
        // ALGORITHM (recursion on n)
        //    We define as C(n) the set of nodes of the connected component to be computed.
        //    We define as F(n) the set of nodes for which we get neighbors for extending the BFS
        //    We use the function N(E) that returns the set of neighbors nodes of items in set E
        //
        //  Init:
        //      C(0)={b} and F(0)={b},  where b is the initial node
        //  Recursion:
        //      F(n+1) = N(F(n)) - C(n)
        //      C(n+1) = N(F(n)) + C(n)
        //  End criteria:
        //      card(F(n)) = 0
        // We set the initial state for F(0)
        set<BranchingNode> frontline;
        frontline.insert (node);
        // We set the initial state for C(0)
        connectedComponent.clear();
        connectedComponent.insert (node);
        // We launch the recursion.
        while (!frontline.empty())
        {
            // We get the neighbors for the current front line, ie. we get N(F(n))
            set<BranchingNode> neighbors = graph.neighbors (frontline.begin(), frontline.end()); // this is a very special overload of neighbors() from Graph.hpp
            // We reset the current front line => we reuse it for computing F(n+1)
            frontline.clear();
            // We compute the recursion for F(n+1) and C(n+1)
            for (typename set<BranchingNode>::iterator it = neighbors.begin(); it != neighbors.end(); it++)
            {
                if (connectedComponent.find (*it) == connectedComponent.end())
                {
                    // F(n+1) = N(F(n)) - C(n)
                    frontline.insert (*it);
                    // C(n+1) = N(F(n)) + C(n)
                    connectedComponent.insert (*it);
                }
            }
        }
        // We return the number of nodes for this connected component
        return connectedComponent;
    }
    // We provide an accessor to the nodes of the found connected component
    const set<BranchingNode>& get() const { return connectedComponent; }
private:
    const Graph& graph;
    set<BranchingNode>       connectedComponent;
};
/********************************************************************************/
/*        Computing connected components of the branching nodes subgraph.       */
/*                                                                              */
/* WARNING ! THIS SNIPPET SHOWS ALSO HOW TO USE LAMBDA EXPRESSIONS, SO YOU NEED */
/* TO USE A COMPILER THAT SUPPORTS THIS FEATURE.                                */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    OptionsParser parser ("GraphStats");
    parser.push_back (new OptionOneParam (STR_URI_GRAPH, "graph input",  true));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        // We load the graph
        Graph graph = Graph::load (options->getStr(STR_URI_GRAPH));
        // We create a graph marker.
        GraphMarker marker (graph);
        // We create an object for Breadth First Search for the de Bruijn graph.
        BFS bfs (graph);
        // We want to compute the distribution of connected components of the branching nodes.
        //    - key is a connected component class (for a given number of branching nodes for this component)
        //    - value is the number of times this component class occurs in the branching sub graph
        map<size_t,size_t> distrib;
        // We get an iterator for branching nodes of the graph. We use a progress iterator to get some progress feedback
        ProgressGraphIterator<BranchingNode,ProgressTimer>  itBranching (graph.iteratorBranching(), "statistics");
        // We want to know the number of connected components
        size_t nbConnectedComponents = 0;
        // We want time duration of the iteration
        TimeInfo ti;
        ti.start ("compute");
        // We loop the branching nodes
        for (itBranching.first(); !itBranching.isDone(); itBranching.next())
        {
            // We skip already visited nodes.
            if (marker.isMarked (*itBranching))  { continue; }
            // We launch the breadth first search; we get as a result the set of branching nodes in this component
            const set<BranchingNode>& component = bfs.run (*itBranching);
            // We mark the nodes for this connected component
            marker.mark (component);
            // We update our distribution
            distrib[component.size()] ++;
            // We update the number of connected components.
            nbConnectedComponents++;
        }
        ti.stop ("compute");
        // We compute the total number of branching nodes in all connected components.
        size_t sum = 0;   for (map<size_t,size_t>::iterator it = distrib.begin(); it != distrib.end(); it++)  {  sum += it->first*it->second; }
        // Note: it must be equal to the number of branching nodes of the graph
        assert (sum == itBranching.size());
        // We aggregate the computed information
        Properties props ("connected_components");
        props.add (1, "graph_name",              "%s", graph.getName().c_str());
        props.add (1, "nb_branching_nodes",      "%d", sum);
        props.add (1, "nb_connected_classes",    "%d", distrib.size());
        props.add (1, "nb_connected_components", "%d", nbConnectedComponents);
        for (map<size_t,size_t>::iterator it = distrib.begin(); it!=distrib.end(); it++)
        {
            props.add (2, "component");
            props.add (3, "nb_nodes",    "%d", it->first);
            props.add (3, "nb_occurs",   "%d", it->second);
            props.add (3, "freq_nodes",  "%f", 100.0*(float)(it->first*it->second) / (float)sum);
            //props.add (3, "freq_occurs", "%f", 100.0*(float)it->second / (float)sum);
        }
        props.add (1, ti.getProperties("time"));
        // We dump the results in a XML file in the current directory
        XmlDumpPropertiesVisitor v (graph.getName() + ".xml", false);
        props.accept (&v);
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Generate dot file from a graph

This snippet generates a dot file from a de Bruijn graph. It is then possible to generate a pdf output of the graph (dot -Tpdf graph.dot -o graph.pdf)

Code is from example debruijn24.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <fstream>
#include <queue>
#include <stack>
#include <map>
using namespace std;
#define DEBUG(a)  //a
#define INFO(a)   //a
/********************************************************************************/
/* Cmd-line: debruijn24 -graph <h5 file> [-out <output file>]                   */
/*                                                                              */
/* Sample: debruijn24 -graph gatb-core/gatb-core/test/db/celegans_reads.h5      */
/*                                                                              */
/* Note:                                                                        */
/*     - '.h5' file contains the HDF5 formatted representation of a de bruijn   */
/*     graph created from a set of reads.                                       */
/*     - a '.h5' file is created using dbgh5 program provided with GATB-Core.   */
/*     Basic use is as follows:                                                 */
/*        dbgh5 -in <fasta/q file> -out <h5 file>                               */
/*     You can also control kmer-size and kmer abundance, see dbgh5 help.       */
/********************************************************************************/
const char* STR_NODE_TYPE = "-type";
/********************************************************************************/
class DotGeneratorTool : public Tool
{
public:
    // Constructor
    DotGeneratorTool () : Tool ("DotGenerator")
    {
        _parser->push_front (new OptionOneParam (STR_URI_GRAPH,  "graph file", true ));
        _parser->push_front (new OptionOneParam (STR_URI_OUTPUT, "dot file",  false ));
        _parser->push_front (new OptionOneParam (STR_NODE_TYPE,  "node type (0: all,  1:branching)", false, "1" ));
    }
    void processNode(Graph &graph, ofstream &output)
    {
        map<Node, u_int64_t> mapping;
        u_int64_t count = 0;
    
        GraphIterator<Node> itMap = graph.iterator ();
        for (itMap.first(); !itMap.isDone(); itMap.next())  { mapping[itMap.item()] = count++; }
        ProgressGraphIterator<Node,ProgressTimer> it = graph.iterator ();
        for (it.first(); !it.isDone(); it.next())
        {
            Node current = it.item();
            GraphVector<Node> neighbors = graph.neighbors(current.kmer);
            for (size_t i=0; i<neighbors.size(); i++)
            {
                output << mapping[current.kmer] << " -> " <<  mapping[neighbors[i].kmer] << " ;\n";
            }
        }
        output << "}\n";
        output.close();
    }
    void processBranchingNode(Graph &graph, ofstream & output)
    {
        map<Node, u_int64_t> mapping;
        u_int64_t count = 0;
    
        GraphIterator<BranchingNode> itMap = graph.iteratorBranching();
        for (itMap.first(); !itMap.isDone(); itMap.next())  { mapping[itMap.item()] = count++; }
        ProgressGraphIterator<BranchingNode,ProgressTimer> it = graph.iteratorBranching ();
        for (it.first(); !it.isDone(); it.next())
        {
            BranchingNode current = it.item();
            GraphVector<BranchingNode> neighbors = graph.neighborsBranching (current.kmer);
            for (size_t i=0; i<neighbors.size(); i++)
            {
                output << mapping[current.kmer] << " -> " <<  mapping[neighbors[i].kmer] << " ;\n";
            }
        }
        output << "}\n";
        output.close();
    }
 
    // Actual job done by the tool is here
    void execute ()
    {
        string outputFile = getInput()->get(STR_URI_OUTPUT) ?
            getInput()->getStr(STR_URI_OUTPUT) :
            (System::file().getBaseName(getInput()->getStr(STR_URI_GRAPH)) + ".dot");
        ofstream output (outputFile.c_str());
        // We load the graph
        Graph graph = Graph::load (getInput()->getStr(STR_URI_GRAPH));
        switch (getInput()->getInt(STR_NODE_TYPE))
        {
            case 0: 
                output << "digraph " << "all" << "{\n";
                processNode(graph, output);
                break;
            case 1: 
                output << "digraph " << "branching" << "{\n";
                processBranchingNode(graph, output);  
                break;
            default: break;
        }
     }
};
/********************************************************************************/
/*                                                                              */
/*                   Generate dot file from a graph.                            */
/*                                                                              */
/*  This snippet generates a dot file from a graph file. You can then generate  */
/*  a pdf file with "dot -Tpdf graph.dot -o graph.pdf"                          */
/*                                                                              */
/*  NOTE: de Bruijn graphs may be huge and complex, so dot is not the best tool */
/*  to display such graphs. You should use it on small graphs with only a few   */
/*  hundreds of nodes.                                                          */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    try
    {
        // We run the tool with the provided command line arguments.
        DotGeneratorTool().run (argc, argv);
    }
    catch (Exception& e)
    {
        std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Retrieve extra information from a node

This snippet shows how to use the 'queryAbundance' method that returns the occurrences number of the node in the initial set of reads.

Code is from example debruijn26.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
// We use the required packages
using namespace std;
/********************************************************************************/
/*                   Getting abundances for nodes graph.                        */
/*                                                                              */
/* This snippet shows how to retrieve the abundance of a node in the graph.     */
/* This feature is enabled only when the "-mphf emphf" is set during graph      */
/* creation.                                                                    */
/*                                                                              */
/* Cmd-line: debruijn26 (takes no argument)                                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    const char* seq = "AAAAACTACATTACCCGTTTGCGAGACAGGTA";
    size_t kmerSize = strlen (seq) - 1;
    // We create a fake bank with 4 times the same sequence
    IBank* bank = new BankStrings (seq, seq, seq, seq, 0);
    // We create the graph. (it needs a mphf, but now mphf is created by default)
    Graph graph = Graph::create (bank, "-kmer-size %d  -abundance-min 1 -verbose 0", kmerSize);
    // We build a fake node (we are sure that it will be in the graph).
    Node node = graph.buildNode (seq);
    // We query its abundance.
    cout << "abundance=" << graph.queryAbundance(node) << endl;
}

[go back to top]

Table of Contents

Presentation

Build / Load De Bruijn graphs

Building a De Bruijn graph from command line options

Building a De Bruijn graph from a command-line-like string

Building a De Bruijn graph from a bank object

Building a De Bruijn graph from a fake bank object

Load a De Bruijn graph from a graph file

Iterating nodes

Iterate the nodes of a De Bruijn graph

Iterate the nodes of a De Bruijn graph in a multithread way

Neighborhoods

Working with neighborhoods of nodes in the De Bruijn graph

Working with neighborhoods of nodes in the De Bruijn graph (continued)

Working with a specific neighbor for a specific node

Branching nodes

Iterate the branching nodes of a De Bruijn graph

Working with branching neighbors of a node

Working with branching neighbors of a node (continued)

Simple path

Iterating simple path from a node

Iterating simple path from a node (continued)

Miscellanous

Playing with node strands

Playing with fake nodes

To go further...

Getting branching nodes statistics in a parallel way

Computing connected components of the branching nodes sub graph

Generate dot file from a graph

Retrieve extra information from a node