Presentation
Using a kmer model
Computing kmers with a model
Iterating kmers from a sequence
Iterating kmers from one or several banks
Computing statistics about minimizers
Checking span of kmers model
Other kind of statistics about minimizers
Setting custom minimizers definition
Counting kmers from a bank with the sorting count algorithm
Reading a file generated by the sorting count algorithm

Presentation

This page presents some code snippets related to the use of k-mer API.

Some of the snippets presented below can be used online.

Additional snippets are available in directory: gatb-core/gatb-core/examples/kmer.

Using a kmer model

This snippet shows how to create kmer models.

Code is from example kmer1.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                              Kmer management                                 */
/*                                                                              */
/* This snippet shows how instantiate the Model class and how to get info       */
/* about it.                                                                    */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We declare a default kmer model
    Kmer<>::ModelCanonical model;
    // We get some information about the model.
    std::cout << "----------   MODEL   ----------"               << std::endl;
    std::cout << "span:             " << model.getSpan()         << std::endl;
    std::cout << "kmer size:        " << model.getKmerSize()     << std::endl;
    std::cout << "kmer memory size: " << model.getMemorySize()   << std::endl;
}

[go back to top]

Computing kmers with a model

This snippet shows how to get kmers from a model. Here, we can see that we can use 3 different kinds of models, giving different kinds of kmers:

ModelDirect : direct kmers
ModelCanonical : minimum value of the direct kmer and its reverse complement
ModelMinimizer : provides also a minimizer for the kmer

The snippet shows different methods usable for each kind of model.

Code is from example kmer2.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                              Kmer management                                 */
/*                                                                              */
/* This snippet shows how to get kmer information from a raw ascii sequence.    */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
#define KSIZE_32 (KSIZE_LIST == 32)                                                                                                                                                                               
        std::cout << "this example cannot be run when gatb-core is compiled for kmers <= 32" << std::endl;
#if KSIZE_32                                                                                                                                                                                                      
#else  
    // We set the maximal span of the kmers. We use here a constant of gatb-core
    // that gives a default span for kmers up to the second kmer size value
    const size_t span = KMER_SPAN(1);
    // We define some nucleotides sequence.
    const char* seq = argc<2 ? "CTACGAATT" : argv[1];
    std::cout << std::endl << "INITIAL STRING IS '" << seq << "'" << std::endl;
    // We set the kmer size as the length of our sequence.
    size_t kmerSize = strlen(seq);
    if (kmerSize >= span)
    {
        std::cerr << "STRING TOO BIG (" << kmerSize << "),  must be less than " << span << std::endl;
        return EXIT_FAILURE;
    }
    // Once we have defined our span, we define some typedefs
    // Note : such definitions are not mandatory but provides better code readability
    typedef Kmer<span>::ModelDirect                    ModelDirect;
    typedef Kmer<span>::ModelCanonical                 ModelCanonical;
    typedef Kmer<span>::ModelMinimizer<ModelCanonical> ModelMinimizer;
    // FIRST EXAMPLE : model direct
    {
        // We declare a kmer model with kmer size big enough to represent our sequence.
        ModelDirect model (kmerSize);
        // We compute the kmer for a given sequence
        ModelDirect::Kmer kmer = model.codeSeed (seq, Data::ASCII);
        std::cout << std::endl;
        std::cout << "-------------------- DIRECT --------------------" << std::endl;
        std::cout << "kmer  value is: " << kmer.value()                 << std::endl;
        std::cout << "kmer string is: " << model.toString(kmer.value()) << std::endl;
    }
    // SECOND EXAMPLE : model canonical
    {
        // We declare a kmer model with kmer size big enough to represent our sequence.
        ModelCanonical model (kmerSize);
        // We compute the kmer for a given sequence
        ModelCanonical::Kmer kmer = model.codeSeed (seq, Data::ASCII);
        std::cout << std::endl;
        std::cout << "-------------------- CANONICAL --------------------" << std::endl;
        std::cout << "kmer  value is: " << kmer.value()                    << std::endl;
        std::cout << "kmer string is: " << model.toString(kmer.value())    << std::endl;
        std::cout << "forward value  is: " << kmer.forward()                    << std::endl;
        std::cout << "forward string is: " << model.toString(kmer.forward())    << std::endl;
        std::cout << "revcomp value  is: " << kmer.revcomp()                    << std::endl;
        std::cout << "revcomp string is: " << model.toString(kmer.revcomp())    << std::endl;
        std::cout << "used strand is   : " << toString(kmer.strand())           << std::endl;
    }
    // THIRD EXAMPLE : model minimizer
    {
        // We declare a kmer model with kmer size big enough to represent our sequence.
        // Note that we give a second size, which is the size of the minimizers
        ModelMinimizer model (kmerSize, 8);
        // We get a reference on the minimizer model, which will be useful for dumping
        // string value of a minimizer. Recall that 'model' is a model configured with
        // 'kmerSize' but has also to deal with mmers of size kmerSize/2. The 'getMmersModel'
        // method just provides access to this inner model.
        const ModelCanonical& modelMinimizer = model.getMmersModel();
        // We compute the kmer for a given sequence
        ModelMinimizer::Kmer kmer = model.codeSeed (seq, Data::ASCII);
        std::cout << std::endl;
        std::cout << "-------------------- MINIMIZER --------------------" << std::endl;
        std::cout << "kmer  value is: " << kmer.value()                    << std::endl;
        std::cout << "kmer string is: " << model.toString(kmer.value())    << std::endl;
        // With this model, we have extra information.
        std::cout << "forward value  is: " << kmer.forward()                    << std::endl;
        std::cout << "forward string is: " << model.toString(kmer.forward())    << std::endl;
        std::cout << "revcomp value  is: " << kmer.revcomp()                    << std::endl;
        std::cout << "revcomp string is: " << model.toString(kmer.revcomp())    << std::endl;
        std::cout << "used strand is   : " << toString(kmer.strand())           << std::endl;
        // We can also have information about minimizers.
        // Note :  kmer.minimizer() is of type ModelCanonical, ie the type provided as
        // template argument of the ModelMinimizer class.
        std::cout << "minimizer model size       : " << modelMinimizer.getKmerSize() << std::endl;
        std::cout << "minimizer value is         : " << kmer.minimizer().value()     << std::endl;
        std::cout << "minimizer string is        : " << modelMinimizer.toString(kmer.minimizer().value()) << std::endl;
        std::cout << "minimizer position in kmer : " << kmer.position()   << std::endl;
        std::cout << "minimizer changed          : " << kmer.hasChanged() << std::endl;
    }
#endif
}

[go back to top]

Iterating kmers from a sequence

This snippet shows how to iterate the kmers from a sequence, for a given model.

Code is from example kmer3.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <iostream>
/********************************************************************************/
/*                              Kmer management                                 */
/*                                                                              */
/* This snippet shows how to iterate the kmers of a given string.               */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We define a sequence of nucleotides
    const char* seq = "CATTGATAGTGGATGGT";
    std::cout << "Initial sequence: " << seq << std::endl;
    // We configure a data object with a sequence (in ASCII format)
    Data data ((char*)seq);
    // We declare a kmer model with kmer of size 5.
    // Note that we want "direct" kmers, not the min(forward,revcomp) default behavior.
    Kmer<>::ModelDirect model (5);
    // We declare an iterator on a given sequence.
    Kmer<>::ModelDirect::Iterator it (model);
    // We configure the iterator with our sequence
    it.setData (data);
    // We iterate the kmers.
    for (it.first(); !it.isDone(); it.next())
    {
        std::cout << "kmer " << model.toString(it->value()) << ",  value " << it->value() << std::endl;
    }
}

[go back to top]

Iterating kmers from one or several banks

This snippet shows how to iterate the kmers from banks. In particular, we use two iterators and two loops:

outer loop on sequences of the bank
inner loop on kmer on the current sequence from the outer loop

Code is from example kmer4.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <iostream>
#include <string.h>
// We use the required packages
using namespace std;
static const size_t span = KMER_SPAN(0);
/********************************************************************************/
/*                              Kmer management                                 */
/*                                                                              */
/* This snippet shows how to iterate the kmers of the sequences from a bank.    */
/*                                                                              */
/* Cmd-line: kmer4 -in <fasta/q file> -kmer-size <value> [-verbose]             */
/*                                                                              */
/* Sample: kmer4 -in gatb-core/gatb-core/test/db/reads1.fa -kmer-size 11        */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    OptionsParser parser ("KmerTest");
    parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank input",    true));
    parser.push_back (new OptionOneParam (STR_KMER_SIZE, "kmer size",     true));
    parser.push_back (new OptionNoParam  (STR_VERBOSE,   "display kmers", false));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        // We define the max size of a data line in the FASTA output file
        size_t kmerSize = options->getInt(STR_KMER_SIZE);
        bool verbose = options->get(STR_VERBOSE) != 0;
        u_int64_t nbSequences = 0;
        u_int64_t nbKmers     = 0;
        // We open the input bank
        IBank* bank = Bank::open (options->getStr(STR_URI_INPUT));
        LOCAL (bank);
        // We declare a kmer model with a given span size.
        Kmer<span>::ModelDirect model (kmerSize);
        // We declare an iterator on a given sequence.
        Kmer<span>::ModelDirect::Iterator itKmer (model);
        // We create an iterator over this bank.
        ProgressIterator<Sequence> itSeq (*bank);
        // We loop over sequences.
        for (itSeq.first(); !itSeq.isDone(); itSeq.next())
        {
            // We set the data from which we want to extract kmers.
            itKmer.setData (itSeq->getData());
            // We iterate the kmers.
            for (itKmer.first(); !itKmer.isDone(); itKmer.next())
            {
                if (verbose)  {  cout << model.toString (itKmer->value()) << endl;  }
                nbKmers++;
            }
            //  We increase the sequences counter.
            nbSequences++;
        }
        // We dump some information about the iterations
        cout << "FOUND " << nbKmers << " kmers in " << nbSequences << " sequences" << endl;
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Computing statistics about minimizers

This snippet shows iterate the kmers of an input bank and computes some statistics about the iterated minimizers.

It also computes the following distribution : number of times a read has X different minimizers (in other words, the number of super kmers).

Code is from example kmer5.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <iostream>
#include <fstream>
#include <string.h>
/********************************************************************************/
/*                              Minimizers                                      */
/*                                                                              */
/* This snippet shows how to iterate the kmers and have minimizer statistics.   */
/*                                                                              */
/* Cmd-line: kmer5 -in <fasta/q file> -kmer-size <value> ...                    */
/*                  -minimizer-size <size>                                      */
/*                                                                              */
/* Sample: kmer5 -in gatb-core/gatb-core/test/db/reads1.fa -kmer-size 11 ...    */
/*               -minimizer-size 11                                             */
/*                                                                              */
/********************************************************************************/
// We use the required packages
using namespace std;
const size_t span = KMER_SPAN(0);
typedef Kmer<span>::ModelDirect    ModelDirect;
typedef Kmer<span>::ModelMinimizer<ModelDirect> ModelMinimizer;
/********************************************************************************/
int main (int argc, char* argv[])
{
    static const char* STR_URI_DISTRIB = "-distrib";
    OptionsParser parser ("KmerTest");
    parser.push_back (new OptionOneParam (STR_URI_INPUT,      "bank input",     true));
    parser.push_back (new OptionOneParam (STR_KMER_SIZE,      "kmer size",      true));
    parser.push_back (new OptionOneParam (STR_MINIMIZER_SIZE, "minimizer size", true));
    parser.push_back (new OptionNoParam  (STR_URI_DISTRIB,    "compute distribution: number of times a read has X superkmers",  false));
    parser.push_back (new OptionNoParam  (STR_VERBOSE,        "display kmers",  false));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        string bankFilename = options->getStr(STR_URI_INPUT);
        // We get the kmer and minimizer sizes.
        size_t kmerSize = options->getInt(STR_KMER_SIZE);
        size_t mmerSize = options->getInt(STR_MINIMIZER_SIZE);
        // We define a try/catch block in case some method fails (bad filename for instance)
        u_int64_t nbSequences   = 0;
        u_int64_t nbKmers       = 0;
        u_int64_t nbMinimizers  = 0;
        u_int64_t nbMinimizers2 = 0;
        bool display = options->get(STR_VERBOSE) != 0;
        // We declare a bank instance defined by a list of filenames
        IBank* bank = Bank::open (bankFilename);
        // We declare a kmer model and a minimizer model
        ModelMinimizer model (kmerSize, mmerSize);
        // We get a reference on the minimizer model, which will be useful for dumping
        const ModelDirect& modelMinimizer = model.getMmersModel();
        // We compute a distribution : number of times a reads has X superkmers
        map<size_t,size_t> distribSuperKmers;
        // We create an iterator over this bank.
        ProgressIterator<Sequence> itSeq (*bank);
        // We loop over sequences.
        for (itSeq.first(); !itSeq.isDone(); itSeq.next())
        {
            size_t nbMinimizersPerRead = 0;
            // We iterate the kmers (and minimizers) of the current sequence.
            model.iterate (itSeq->getData(), [&] (const ModelMinimizer::Kmer& kmer, size_t idx)
            {
                // We could display some information about the kmer and its minimizer
                if (display)
                {
                    std::cout << "KMER=" << model.toString(kmer.value()) << "  "
                          << (kmer.hasChanged() ? "NEW" : "OLD") << " "
                          << "MINIMIZER=" << modelMinimizer.toString(kmer.minimizer().value()) << " "
                          << "at position " << kmer.position()
                          << std::endl;
                }
                // We may have to update the number of different minimizer in the current sequence
                if (kmer.hasChanged()==true)  {  nbMinimizersPerRead++;  }
                nbKmers ++;
            });
            distribSuperKmers [nbMinimizersPerRead]++;
            // We update global statistics
            nbSequences   ++;
            nbMinimizers  += nbMinimizersPerRead;
            nbMinimizers2 += nbMinimizersPerRead*nbMinimizersPerRead;
        }
        double mean  = nbSequences>0 ? (double)nbMinimizers  / (double) nbSequences : 0;
        double mean2 = nbSequences>0 ? (double)nbMinimizers2 / (double) nbSequences : 0;
        double devia = sqrt (mean2 - mean*mean);
        // We dump results
        Properties info;
        info.add (0, "info");
        info.add (1, "bank",          "%s",   bankFilename.c_str());
        info.add (1, "kmer_size",     "%ld",  kmerSize);
        info.add (1, "mmer_size",     "%ld",  mmerSize);
        info.add (1, "nb_sequences",  "%ld",  nbSequences);
        info.add (1, "nb_kmers",      "%ld",  nbKmers);
        info.add (1, "nb_minimizers", "%ld",  nbMinimizers);
        info.add (1, "mean",          "%.2f", mean);
        info.add (1, "deviation",     "%.2f", devia);
        cout << info << endl;
        // We dump the superkmers distribution if any
        if (options->get(STR_URI_DISTRIB))
        {
            string outputFilename = System::file().getBaseName(bankFilename) + string (".distrib");
            FILE* output = fopen (outputFilename.c_str(), "w");
            if (output)
            {
                for (map<size_t,size_t>::iterator it = distribSuperKmers.begin(); it != distribSuperKmers.end(); ++it)
                {
                    fprintf (output, "%ld %ld\n", it->first, it->second);
                }
                fclose (output);
            }
        }
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Checking span of kmers model

This snippet shows what is legal in terms of kmers size.

Actually, a kmers model has a type with a 'span' template. This span represents the maximum kmer size reachable for this type. For instance, a span of 32 allows up to kmer size of 31, a span of 64 up to 63, etc...

The 'span' value must be one of the project defined constants: KSIZE_1, KSIZE_2, KSIZE_3 and KSIZE_4. By default, we have KSIZE_n = 32*n. Note that it is possible to compile GATB-CORE with a bigger value of KSIZE_4=128. See Instructions to compile GATB-Core.

It is important to understand that each one of the four span values defines a specific kmers model with a specific integer type that represents the kmers values. For instance:

KSIZE_1=32 implies that we need 64 bits integers (2 bits per nucleotide), which is available as a native type on 64 bits architecture
KSIZE_2=64 implies that we need 128 bits integers which may (or not) be available as native integer type
for KSIZE_3 and KSIZE_4, we need to switch to specific large integer representations that are no more native on the system, which implies bigger computation times.

Code is from example kmer6.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
/********************************************************************************/
/*                              Kmer management                                 */
/*                                                                              */
/* A Model is defined:                                                          */
/*      - the maximum kmer size supported (the template argument)               */
/*      - the actual kmer size used (the constructor argument)                  */
/*                                                                              */
/* This snippet shows that the actual size can't exceed the max size.           */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We declare a kmer model that supports size up to 32
    // Unfortunately, we try to have a model with kmer of size 51
    // => we should get an exception
    try
    {
        Kmer<32>::ModelCanonical model (51);
    }
    catch (Exception& e)
    {
        std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    try
    {
        Kmer<64>::ModelCanonical model (51);
        std::cout << "It's OK..." << std::endl;
    }
    catch (Exception& e)
    {
        std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Other kind of statistics about minimizers

This snippet shows how to iterate the kmers of an input bank and computes some statistics about the iterated minimizers.

Code is from example kmer7.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <vector>
#include <memory>
using namespace std;
/********************************************************************************/
/*                              Kmer statistics                                 */
/*                                                                              */
/* This snippet is a little bit more complex and uses different features.       */
/*                                                                              */
/* Its purpose is to get the kmers distribution of a bank, for a given kmer     */
/* size K. Each sequence of size N is split in pieces of size K, for instance:  */
/*     ATCTCGGGCTAGCTCTCGATAAGC => for K=3, ATC TCG GGC TAG CTC TCG ATA AGC     */
/* Then we count the number of occurrences of these pieces, and sort the result.*/
/*                                                                              */
/* Cmd-line: kmer7 -in <fasta/q file> -kmer-size <value>                       */
/*                                                                              */
/* Sample: kmer7 -in gatb-core/gatb-core/test/db/reads1.fa -kmer-size 11        */
/*                                                                              */
/* Finally, this distribution is saved in a Storage object (HDF5 format here).  */
/* The resulting data can be shown with HDF5 tools (provided with GATB) by:     */
/*      h5dump output.h5                                                        */
/* You can directly get a gnuplot output with (4 here is the provided kmer size)*/
/*                                                                              */
/*      h5dump -y -w 1 -d distrib/4 -o out output.h5 > /dev/null                */
/*      gnuplot -p -e 'plot [][0:] "out" with lines'                            */
/*                                                                              */
/* NOTE: don't use too big kmer size because of the potential huge memory usage.*/
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    OptionsParser parser ("KmerStats");
    parser.push_back (new OptionOneParam (STR_URI_INPUT,  "bank reference", true));
    parser.push_back (new OptionOneParam (STR_KMER_SIZE,  "kmer size",      true));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        size_t kmerSize = options->getInt (STR_KMER_SIZE);
        size_t nbKmers  = 1 << (2*kmerSize);
        // We create a kmer model.
        Kmer<>::ModelDirect model (kmerSize);
        // We open the bank.
        IBank* bank = Bank::open (options->getStr(STR_URI_INPUT));
        LOCAL (bank);
        vector<NativeInt64> distrib (nbKmers);
        size_t totalKmers = 0;
        // We define an iterator that encapsulates the sequences iterator with progress feedback
        ProgressIterator<Sequence> iter (*bank, "iterate bank");
        // We loop the bank
        for (iter.first(); !iter.isDone(); iter.next())
        {
            // Shortcut
            Sequence& seq = iter.item();
            size_t len  = seq.getDataSize() / kmerSize;
            char*  data = seq.getDataBuffer();
            // We iterate the sequence data by block of size kmerSize
            for (size_t i=0; i<len; i++, data += kmerSize)
            {
                // We get the kmer value of the current block
                Kmer<>::ModelDirect::Kmer kmer = model.codeSeed (data, seq.getDataEncoding());
                // We update the occurrences number for this kmer value
                distrib [kmer.value().toInt()] += 1;
            }
        }
        // We sort the distribution
        std::sort (distrib.begin(), distrib.end());
        // We create the output storage object
        auto_ptr<Storage> storage (StorageFactory(STORAGE_HDF5).create ("output", true, false));
        // We create a data set in our storage object
        Collection<NativeInt64>& distribCollect = storage->getGroup("distrib").getCollection<NativeInt64> (options->getStr (STR_KMER_SIZE));
        // We tag our data set with the user parameters
        distribCollect.addProperty ("name",      options->getStr (STR_URI_INPUT));
        distribCollect.addProperty ("kmer_size", options->getStr (STR_KMER_SIZE));
        // We insert the distribution into our storage object
        distribCollect.insert (distrib);
        // We can get a gnuplot output with the following (for kmerSize=5)
        //      h5dump -y -w 1 -d distrib/5 -o out output.h5 > /dev/null ; gnuplot -p -e 'plot [][0:] "out" with lines'
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (cout);
    }
    catch (Exception& e)
    {
        std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Setting custom minimizers definition

This snippet shows how to configure custom minimizer definition through a functor.

Code is from example kmer8.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
#include <iostream>
#include <fstream>
#include <string.h>
/********************************************************************************/
/*                              Minimizers                                      */
/*                                                                              */
/* This snippet shows how to iterate the kmers and have minimizer statistics.   */
/*                                                                              */
/* Cmd-line: kmer5 -in <fasta/q file> -kmer-size <value> ...                    */
/*                  -minimizer-size <size>                                      */
/*                                                                              */
/* Sample: kmer5 -in gatb-core/gatb-core/test/db/reads1.fa -kmer-size 11 ...    */
/*               -minimizer-size 11                                             */
/*                                                                              */
/********************************************************************************/
// We use the required packages
using namespace std;
const size_t span = KMER_SPAN(0);
struct CustomMinimizer
{
    template<class Model>  void init (const Model& model, Kmer<span>::Type& optimum) const
    {
        optimum = model.getKmerMax();
    }
    bool operator() (const Kmer<span>::Type& current, const Kmer<span>::Type& optimum) const
    {
        return current < optimum;
    }
    void include_frequency (uint32_t* freq_order) {}
};
struct CustomMaximizer
{
    template<class Model>  void init (const Model& model, Kmer<span>::Type& optimum) const
    {
        optimum.setVal(0);
    }
    bool operator() (const Kmer<span>::Type& current, const Kmer<span>::Type& optimum) const
    {
        return !(current < optimum);
    }
    void include_frequency (uint32_t* freq_order) {}
};
typedef Kmer<span>::ModelDirect    ModelDirect;
typedef Kmer<span>::ModelMinimizer<ModelDirect,CustomMinimizer> ModelMinimizer;
typedef Kmer<span>::ModelMinimizer<ModelDirect,CustomMaximizer> ModelMaximizer;
typedef ModelMinimizer  Model;
/********************************************************************************/
int main (int argc, char* argv[])
{
    OptionsParser parser ("KmerTest");
    parser.push_back (new OptionOneParam (STR_URI_INPUT,      "bank input",     true));
    parser.push_back (new OptionOneParam (STR_KMER_SIZE,      "kmer size",      true));
    parser.push_back (new OptionOneParam (STR_MINIMIZER_SIZE, "minimizer size", true));
    parser.push_back (new OptionNoParam  (STR_VERBOSE,        "display kmers",  false));
    try
    {
        IProperties* options = parser.parse (argc, argv);
        // We get the kmer and minimizer sizes.
        size_t kmerSize = options->getInt(STR_KMER_SIZE);
        size_t mmerSize = options->getInt(STR_MINIMIZER_SIZE);
        // We define a try/catch block in case some method fails (bad filename for instance)
        u_int64_t nbKmers       = 0;
        bool display = options->get(STR_VERBOSE) != 0;
        // We declare a Bank instance defined by a list of filenames
        IBank* bank = Bank::open (options->getStr(STR_URI_INPUT));
        LOCAL (bank);
        // We declare a kmer model and a minimizer model
        Model model (kmerSize, mmerSize);
        // We get a reference on the minimizer model, which will be useful for dumping
       const ModelMinimizer::Model& modelMinimizer = model.getMmersModel();
        Kmer<span>::Type checksum;
        size_t nbChanged = 0;
        size_t nbInvalid = 0;
        // We define an iterator that encapsulates the sequences iterator with progress feedback
        ProgressIterator<Sequence> iter (*bank, "iterate bank");
        // We loop over sequences.
        for (iter.first(); !iter.isDone(); iter.next())
        {
            // Shortcut
            Sequence& seq = iter.item();
            // We iterate the kmers (and minimizers) of the current sequence.
            model.iterate (seq.getData(), [&] (const Model::Kmer& kmer, size_t idx)
            {
                nbKmers ++;
                if (kmer.hasChanged() == true)   { nbChanged++;  }
                if (kmer.isValid()    == false)  { nbInvalid++;  }
                checksum += kmer.minimizer().value();
            });
        }
        cout << "nbKmers   : " << nbKmers   << endl;
        cout << "nbInvalid : " << nbInvalid << endl;
        cout << "nbChanged : " << nbChanged << endl;
        cout << "ratio     : " << (nbChanged > 0 ? (double)nbKmers / (double)nbChanged : 0) << endl;
        cout << "checksum  : " << checksum  << endl;
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Counting kmers from a bank with the sorting count algorithm

This snippet shows how to use the SortingCountAlgorithm class for counting kmers in a bank

Code is from example kmer9.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
using namespace std;
/********************************************************************************/
/*                              Sorting count                                   */
/*                                                                              */
/* This snippet shows how to count the kmers of a bank by using the sorting     */
/* count algorithm.                                                             */
/*                                                                              */
/* Cmd-line: kmer9 -in <fasta/q file>                                           */
/*                                                                              */
/* Sample: kmer9 -in gatb-core/gatb-core/test/db/reads1.fa                      */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We create a command line parser.
    IOptionsParser* parser = SortingCountAlgorithm<>::getOptionsParser();
    LOCAL (parser);
    try
    {
        // We parse the user options.
        IProperties* options = parser->parse (argc, argv);
        // We open the input bank
        IBank* bank = Bank::open (options->getStr(STR_URI_INPUT));
        // We create a SortingCountAlgorithm instance.
        SortingCountAlgorithm<> algo (bank, options);
        // We launch the algorithm
        algo.execute();
        // We display the stats of the sorting count execution
        cout << *algo.getInfo();
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Reading a file generated by the sorting count algorithm

This snippet shows how to read the output of the SortingCountAlgorithm.

We use two ways for reading the couples [kmer,abundance]

we read all the couples with a single iterator
we read each solid collection and use an iterator on it

Code is from example kmer10.cpp:

// We include what we need for the test
#include <gatb/gatb_core.hpp>
using namespace std;
/********************************************************************************/
/*                              Sorting count                                   */
/*                                                                              */
/* This snippet shows how read the file generated by the sorting count algo.    */
/*                                                                              */
/* Cmd-line: kmer10 -in <fasta/q file>                                           */
/*                                                                              */
/* Sample: kmer10 -in gatb-core/gatb-core/test/db/reads1.fa                     */
/*                                                                              */
/********************************************************************************/
int main (int argc, char* argv[])
{
    // We create a command line parser.
    OptionsParser parser ("SortingCount");
    parser.push_back (new OptionOneParam (STR_URI_INPUT, "sorting count input", true));
    try
    {
        // Shortcuts.
        typedef Kmer<>::Count Count;
        typedef Kmer<>::Type  Type;
        // We parse the user options.
        IProperties* options = parser.parse (argc, argv);
        // We load the object storing the couples [kmer,abundance]
        Storage* storage = StorageFactory(STORAGE_HDF5).load (options->getStr(STR_URI_INPUT));   LOCAL (storage);
        // We get the group inside the storage object
        Group& dskGroup = storage->getGroup("dsk");
        // We retrieve the partition holding the couples [kmer,abundance]
        Partition<Count>& solidKmers = dskGroup.getPartition<Count> ("solid");
        // Now, we read the couples in two ways, computing a checksum in each case.
        Type checksum1, checksum2;
        // CASE 1: we read the couples [kmer,abundance] with an iterator over the whole partition
        Iterator<Count>* it = solidKmers.iterator();  LOCAL (it);
        for (it->first(); !it->isDone(); it->next())   {   checksum1 = checksum1 + it->item().value;  }
        // CASE 2: we read the couples [kmer,abundance] with an iterator over each collection of the partition
        for (size_t i=0; i<solidKmers.size(); i++)
        {
            // We get the current collection inside the partition
            Collection<Count>& collection = solidKmers [i];
            Iterator<Count>* it = collection.iterator();  LOCAL (it);
            for (it->first(); !it->isDone(); it->next())   {   checksum2 = checksum2 + it->item().value;  }
        }
        // We check that we got the same checksum
        cout << "checksum1=" << checksum1 << endl;
        cout << "checksum2=" << checksum1 << endl;
    }
    catch (OptionFailure& e)
    {
        return e.displayErrors (std::cout);
    }
    catch (Exception& e)
    {
        std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
    }
    return EXIT_SUCCESS;
}

[go back to top]

Table of Contents

Presentation

Using a kmer model

Computing kmers with a model

Iterating kmers from a sequence

Iterating kmers from one or several banks

Computing statistics about minimizers

Checking span of kmers model

Other kind of statistics about minimizers

Setting custom minimizers definition

Counting kmers from a bank with the sorting count algorithm

Reading a file generated by the sorting count algorithm