gatb.core-API-0.0.0
Bank snippets

Presentation

This page presents some code snippets related to the use of Bank API.

Some of the snippets presented below can be used online.

Additional snippets are available in directory: gatb-core/gatb-core/examples/bank.

Parsing a single FASTA bank without checks

This snippet shows how to read one FASTA bank in a simple way. No check is done about the correctness of the FASTA bank file path.

Some information of each iterated sequence are diplayed as output.

Code is from example bank1.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank management */
/* */
/* This snippet shows how to open a FASTA bank and iterate its sequences. */
/* Some attributes of the iterated Sequence objects are used. */
/* */
/* Cmd-line: bank1 <fasta/q file> */
/* */
/* Sample: bank1 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
// We get the file name from the user arguments
const char* filename = argc >= 2 ? argv[1] : "";
// We declare a Bank instance.
BankFasta b (filename);
// We create an iterator over this bank.
BankFasta::Iterator it (b);
// We loop over sequences.
for (it.first(); !it.isDone(); it.next())
{
// In the following, see how we access the current sequence information through
// the -> operator of the iterator
// We dump the data size and the comment
std::cout << "[" << it->getDataSize() << "] " << it->getComment() << std::endl;
// We dump the data
std::cout << it->toString() << std::endl;
}
}

[go back to top]

Parsing several banks

This snippet shows how to read one ore more banks in a simple way. The idea is to use the Bank::open method that analyzes the provided uri and get the correct IBank handle for one or more banks. For instance, one can run this snippet with:

  • bank2 reads1.fa,reads2.fa,reads3.fa

Some information of each iterated sequence are diplayed as output.

Code is from example bank2.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank with exception management */
/* */
/* This snippet shows how to open a bank and iterate its sequences. */
/* Note: we use here a try/catch block in case the bank opening doesn't work. */
/* */
/* Cmd-line: bank2 <fasta/q file> */
/* */
/* Sample: bank2 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We create an iterator over this bank.
Iterator<Sequence>* it = inputBank->iterator();
LOCAL (it);
// We loop over sequences.
for (it->first(); !it->isDone(); it->next())
{
// Shortcut
Sequence& seq = it->item();
// We dump the data size and the comment
std::cout << "[" << seq.getDataSize() << "] " << seq.getComment() << std::endl;
// We dump the data
std::cout << seq.toString() << std::endl;
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]

Parsing a FASTA bank in a different way

This snippet shows how to read one or more FASTA banks in a "push" model; it means that the sequence iterator calls some function for each sequence.

This is another way to iterate items and opposite to the "pull" model where the iterator is called to provide the current item, instead of calling some function to do as we do in this sample.

Code is from example bank3.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank iteration with a functor */
/* */
/* This snippet shows how to iterate sequences from a FASTA through a functor. */
/* Note that this approach is necessary when one wants to parallelize the */
/* iteration (see snippets on multithreading). */
/* */
/* Cmd-line: bank3 <fasta/q file> */
/* */
/* Sample: bank3 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
// We define a functor that will be called for every iterated sequence.
// The advantages are:
// - we have a code that focuses on the treatment to do on the sequence;
// this may be interesting when the corresponding code is long and is
// likely to be moved in an independent method.
// - such a functor can be reused in other contexts.
struct Functor { void operator () (Sequence& s) const
{
// We dump the data size and the comment
std::cout << "[" << s.getDataSize() << "] " << s.getComment() << std::endl;
// We dump the data
std::cout << s.toString () << std::endl;
}};
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We create an iterator over this bank.
Iterator<Sequence>* it = inputBank->iterator();
LOCAL (it);
// We loop over sequences in a "push" fashion (a functor is called for each sequence)
Functor fct;
it->iterate (fct);
}

[go back to top]

Parsing a FASTA bank and getting progress information

This snippet shows how to create an iterator on something (here sequences from a FASTA file) and encapsulate it with another iterator that adds the possibility to notify some listener every 10 iterated sequences (used here for showing some progression during the iteration).

Note: the "notifying" iterator is generic and could be reused to send progress notification with any kind of iterator, not only on sequences.

Code is from example bank4.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank iteration with progress information */
/* */
/* This snippet shows how to iterate sequences from a FASTA with some progress */
/* information. In this example, we provide our own progress manager. */
/* */
/* Cmd-line: bank4 <fasta/q file> */
/* */
/* Sample: bank4 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
// We a define a functor that will be called during bank parsing
struct ProgressFunctor : public IteratorListener { void inc (u_int64_t current) { std::cout << "."; } };
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// Note also that we have to parameterize the SubjectIterator by the kind of iterated
// items (Sequence) and the processing that has to be done on each iteration (ProgressFunctor).
SubjectIterator<Sequence> iter (inputBank->iterator(), 10);
// We create some listener to be notified every 10 iterations and attach it to the iterator.
iter.addObserver (new ProgressFunctor());
// We loop over sequences.
for (iter.first(); !iter.isDone(); iter.next())
{
// Note that we do nothing inside the sequence iterating loop about the progression management.
// In other words, we don't "pollute" the code inside this loop by presentation concerns and
// we can therefore focus on the job to be done on the iterated sequences.
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]

Parsing a FASTA bank and getting percentage progress information

This snippet shows how to read one or more FASTA banks and get a percentage progress information during the iteration.

We use there a ProgressIterator on Sequence. By default, we got progress information with remaining time estimation and resources usage (cpu and memory). Note that the ProgressIterator class has a second template parameter that can provide other progress information, try for instance:

  • ProgressIterator<Sequence,ProgressTimer>

Code is from example bank5.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank iteration with progress information */
/* */
/* This snippet shows how to iterate sequences from a bank with some progress */
/* information. In this example, we use some pre defined progress manager. */
/* */
/* Cmd-line: bank5 <fasta/q file> */
/* */
/* Sample: bank5 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We create a sequence iterator for the bank with progress information
ProgressIterator<Sequence> iter (*inputBank, "Iterating sequences");
// We loop over sequences.
for (iter.first(); !iter.isDone(); iter.next())
{
// Note that we do nothing inside the sequence iterating loop
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]


Output a FASTA file with data line of given size

This snippet provides a small utility for cutting lines of data with a given size.

Code is from example bank6.cpp:

// We include what we need for the test
#include <iostream>
#include <fstream>
#include <string>
// We use the required packages
using namespace std;
/********************************************************************************/
/* Bank copy */
/* */
/* This snippet shows how to copy a bank into another one, with some possible */
/* modifications. Note that this snippet can be really useful for re-formatting */
/* FASTA banks */
/* */
/* Cmd-line: bank6 <fasta/q file> */
/* */
/* Sample: bank6 gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 2)
{
cerr << "you must provide at least the FASTA file path. Arguments are:" << endl;
cerr << " 1) FASTA file path" << endl;
cerr << " 2) data lines size (60 by default, 0 means only one line for sequence data)" << endl;
cerr << " 3) type of comments: 0 for new comments, 1 for only ids, 2 (default) for full comments" << endl;
cerr << " 4) number of sequences to keep (all by default)" << endl;
return EXIT_FAILURE;
}
// We get the file name from the user arguments
const char* filename = argv[1];
// We define the max size of a data line in the FASTA output file
size_t dataLineSize = argc >= 3 ? atoi(argv[2]) : 60;
// By convention, setting data line size to 0 will dump data on a single line
if (dataLineSize==0) { dataLineSize = ~0; }
// We define the kind of output comments
BankFasta::Iterator::CommentMode_e mode = BankFasta::Iterator::FULL;
if (argc >= 4)
{
int m = atoi (argv[3]);
if (m==0) { mode = BankFasta::Iterator::NONE; }
else if (m==1) { mode = BankFasta::Iterator::IDONLY; }
}
// We set the number of sequences to be kept.
u_int64_t nbSequences = (argc >= 5 ? atol (argv[4]) : ~0);
try {
// We declare a Bank instance.
BankFasta b (filename);
// We create an iterator over this bank.
// Note : here, we must use specifically BankFasta::Iterator in order to set the mode
BankFasta::Iterator itSeq (b, mode);
// We encapsulate it with a truncation iterator
TruncateIterator<Sequence> it (itSeq, nbSequences);
size_t idxSeq = 1;
// We loop over sequences.
for (it.first(); !it.isDone(); it.next(), idxSeq++)
{
// We dump the comment into the file according to the user mode
if (!it->getComment().empty()) { cout << ">" << it->getComment() << endl; }
else { cout << ">seq=" << idxSeq << " len=" << it.item().getDataSize() << endl; }
// shortcut
size_t len = it->getDataSize();
// We dump the data with fixed sized columns
for (size_t i=0; i<len; )
{
for (size_t j=0; j<dataLineSize && i<len; j++, i++)
{
cout << (char) (it->getData() [i]);
}
cout << endl;
}
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
return EXIT_SUCCESS;
}

[go back to top]


Filter sequences with a given data size

This snippet shows how to parse a bank with a functor used to filter out some items.

Code is from example bank7.cpp:

// We include what we need for the test
#include <iostream>
// We use the required packages
using namespace std;
size_t threshold = 500;
// We a define a functor that will be called during iteration for filtering odd items.
// Here, we keep sequences whose data size is greater than 500.
struct FilterFunctor { bool operator () (Sequence& seq) const { return seq.getDataSize() >= threshold; } };
/********************************************************************************/
/* Bank filtering */
/* */
/* This snippet shows how to iterate a bank and filter some sequences through */
/* a functor. */
/* Note: lambda expressions could be used for the functor (in case the used */
/* compiler supports it) */
/* */
/* Cmd-line: bank7 <fasta/q file> <threshold> */
/* */
/* Sample: bank7 gatb-core/gatb-core/test/db/reads1.fa 500 */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide at least the size threshold and the bank file path. Arguments are:" << endl;
cerr << " 1) bank" << endl;
cerr << " 2) minimum size threshold (default 500)" << endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We set the threshold with given parameter
threshold = atoi (argv[2]);
// We use another iterator for filtering out some sequences.
FilterIterator<Sequence,FilterFunctor> itSeq (inputBank->iterator(), FilterFunctor());
// We loop over sequences.
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
// We dump the data size and the comment
cout << "[" << itSeq->getDataSize() << "] " << itSeq->getComment() << endl;
// We dump the data
cout << itSeq->toString() << endl;
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Filter sequences using Phred quality

This snippet shows how to parse a FastQ file and filtering out reads by Phred quality.

Code is from example bank24.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank management */
/* */
/* This snippet shows how to open a FastQ file and iterate its sequences */
/* using a filter discarding all reads having a Phred score >= 30. */
/* */
/* Cmd-line: bank24 <fastq file> */
/* */
/* Sample: bank23 gatb-core/gatb-core/test/db/sample.fastq */
/* */
/********************************************************************************/
int threshold = 30;
int computeMeanPhredScore(const std::string& quality){
int score=0;
// quality information is supposed to be FastQ-Sanger-encoded:
// each letter in quality string is an ASCII code ranging from 33 to 126.
for(char c : quality){
score += (c-33);
}
return score / quality.size();
}
struct QualityFilter {
bool operator () (Sequence& seq) const {
return computeMeanPhredScore(seq.getQuality()) >= threshold;
}
};
/********************************************************************************/
// START Application
int main (int argc, char* argv[])
{
// We check that the user provides at least one option: a Fasta/FastQ file.
// Online GATB-Tutorial: this argument is automatically filled in with an
// appropriate file.
if (argc < 2)
{
std::cerr << "Please, provide a sequence file." << std::endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We create an iterator over this bank using some filtering system
FilterIterator<Sequence,QualityFilter> it (inputBank->iterator(), QualityFilter());
// We loop over sequences.
for (it.first(); !it.isDone(); it.next())
{
// Shortcut
Sequence& seq = it.item();
// We dump the sequence quality
std::cout << "[" << seq.getQuality() << "] " << computeMeanPhredScore(seq.getQuality()) << std::endl;
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]


Conversion of a FASTA bank to a binary format

This snippet shows how to parse a nucleic bank and convert it to a binary format.

Code is from example bank8.cpp:

// We include what we need for the test
#include <iostream>
#include <iomanip>
// We use the required packages
using namespace std;
/********************************************************************************/
/* Bank conversion to binary format */
/* */
/* This snippet shows how to convert a bank from a format to another one. The */
/* main idea is to iterate the input bank and to insert each sequence into the */
/* output bank. We use progress information to get feedback about the iteration */
/* progression. */
/* */
/* Cmd-line: bank8 <fasta/q file> <outfile> */
/* */
/* Sample: bank8 gatb-core/gatb-core/test/db/reads1.fa reads1.bin */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide an input and output banks names:" << endl;
cerr << " 1) bank file path" << endl;
cerr << " 2) binary file path" << endl;
cerr << " 3) dump binary output: 0 for no (default), 1 for yes" << endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We declare an output Bank
BankBinary outputBank (argv[2]);
// We create a sequence iterator on the input bank (with progress information).
ProgressIterator<Sequence> itSeq (*inputBank, "Converting input file into binary format");
// We insert each sequence of the input bank into the output bank.
for (itSeq.first(); !itSeq.isDone(); itSeq.next()) { outputBank.insert (itSeq.item()); }
// We make sure that the output bank is flushed correctly.
outputBank.flush ();
if (argc >= 4 && atoi(argv[3])==1)
{
// We create an iterator on our binary bank
Iterator<Sequence>* itBinary = outputBank.iterator();
LOCAL (itBinary);
// We iterate the sequences whose data is binary coded
for (itBinary->first(); !itBinary->isDone(); itBinary->next())
{
// Shortcut
Sequence& seq = itBinary->item();
// We display some (dummy) comment
cout << seq.getComment() << endl;
// In binary format, each byte holds 4 nucleotides.
for (size_t i=0; i< (seq.getDataSize()+3)/4; i++)
{
cout << "[" << setfill('0') << setw(2) << hex << (int) (seq.getData()[i] & 0xFF) << " ";
// We convert back from binary to ASCII
char b = (seq.getData()[i] & 0xFF) ;
static char tableASCII[] = {'A', 'C', 'T', 'G'};
cout << tableASCII[(b>>6)&3]
<< tableASCII[(b>>4)&3]
<< tableASCII[(b>>2)&3]
<< tableASCII[(b>>0)&3]
<< "] ";
if ((i+1)%16==0) { cout << endl; }
}
cout << endl;
}
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Conversion of a bank with some filtering

This snippet shows how to parse a bank, check whether the sequences match a criteria and dump the matching sequences into an output FASTA bank.s

Code is from example bank9.cpp:

// We include what we need for the test
#include <iostream>
// We use the required packages
using namespace std;
/********************************************************************************/
/* Bank conversion with some filtering */
/* */
/* This snippet shows how to copy a bank to another one with some filtering */
/* criteria. The criteria is described through a functor. */
/* Here, we can keep only every 'modulo' sequence that have no 'N' in data. */
/* */
/* Cmd-line: bank9 <fasta/q file> <outfile> <modulo> */
/* */
/* Sample: bank9 gatb-core/gatb-core/test/db/reads1.fa reads1_filtered.fa 5 */
/* */
/********************************************************************************/
// We define a functor for our sequence filtering.
struct FilterFunctor
{
int modulo;
FilterFunctor (int modulo) : modulo(modulo) {}
bool operator () (Sequence& seq)
{
return (seq.getIndex() % modulo == 0) && (seq.toString().find("N") == string::npos);
}
};
/********************************************************************************/
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide an input and output banks names:" << endl;
cerr << " 1) input URI" << endl;
cerr << " 2) output URI" << endl;
cerr << " 3) modulo (ie keep sequence id % modulo)" << endl;
return EXIT_FAILURE;
}
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (argv[1]);
LOCAL (inputBank);
// We declare an output Bank
BankFasta outputBank (argv[2]);
// We get the modulo value (1 by default), defined as static to be retrieved in the functor
int modulo = argc >= 4 ? atoi (argv[3]) : 1;
if (modulo<=0) { modulo=1; }
// We create a filter on the input bank.
// Note the first argument of the FilterIterator is an iterator of the input bank and the
// second argument is the filtering functor.
FilterIterator<Sequence,FilterFunctor>* itFilter = new FilterIterator<Sequence,FilterFunctor>(
inputBank->iterator(),
FilterFunctor (modulo)
);
// We create an iterator that will provide progress notification.
// Note the estimated number of items takes into account the modulo
ProgressIterator<Sequence> itSeq (itFilter, "Filtering bank", inputBank->estimateNbItems()/modulo );
// We loop over sequences.
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
// We insert the current sequence into the output bank.
outputBank.insert (*itSeq);
}
// We make sure that the output bank is flushed correctly.
outputBank.flush ();
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Split a bank

This snippet shows how to split an input bank into parts. It creates an output album bank by using the BankAlbum class. Example of use:

bank10 -in reads.fa -max-size 10000000

This example will create a directory 'reads_S1000000' where we can find:

  • album.txt => album file containing all reads_X split parts
  • reads_0
  • reads_1
  • ...
  • reads_N

All the 'reads_X' files are about 1000000 bytes. Now, the generated album.txt file can be used as a bank and could be the input bank of the snippet bank14 for instance (and we should get the same results as using directly the 'reads.fa' bank)

Code is from example bank10.cpp:

// We include what we need for the test
#include <iostream>
#include <sstream>
// We use the required packages
using namespace std;
static const char* STR_MAX_INPUT_SIZE = "-max-size";
static const char* STR_OUTPUT_FASTQ = "-fastq";
static const char* STR_OUTPUT_GZ = "-gz";
/********************************************************************************/
/* Bank split */
/* */
/* This snippet shows how to split a bank into smaller banks and how to create */
/* an album bank (ie a list of URL of banks). Such an album bank could be used */
/* as bank input by other tools. */
/* Note: all the generated files are put in a directory created by the snippet. */
/* */
/* Cmd-line: -in <fasta/q file> -max-size <nb. nucl.> */
/* [-out-dir <some_dir> -fastq 1 -gz 1 ] */
/* */
/* Sample: bank10 -in gatb-core/gatb-core/test/db/reads1.fa */
/* -out-dir /tmp */
/* -max-size 5000 */
/* output: a new directory within ${out-dir} with sequences splitted */
/* by chunck of 5000 nucleotides */
/********************************************************************************/
int main (int argc, char* argv[])
{
OptionsParser parser ("BankSplitter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank reference", true));
parser.push_back (new OptionOneParam (STR_MAX_INPUT_SIZE, "average db size per split", true));
parser.push_back (new OptionOneParam (STR_URI_OUTPUT_DIR, "output directory", false, "."));
parser.push_back (new OptionNoParam (STR_OUTPUT_FASTQ, "fastq output", false));
parser.push_back (new OptionNoParam (STR_OUTPUT_GZ, "gzip output", false));
// We define a try/catch block in case some method fails (bad filename for instance)
try
{
IProperties* options = parser.parse (argc, argv);
u_int64_t maxDbSize = options->getInt(STR_MAX_INPUT_SIZE);
// We declare an input Bank
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
LOCAL (inputBank);
// We get the basename of the input bank.
string inputBasename = System::file().getBaseName (options->getStr(STR_URI_INPUT));
stringstream ss; ss << inputBasename << "_S" << maxDbSize;
string outputDirName = ss.str();
string outputDir = options->getStr(STR_URI_OUTPUT_DIR) + "/" + outputDirName;
System::file().mkdir (outputDir, S_IRWXU);
// We create the album bank.
BankAlbum album (outputDir + "/album.txt");
u_int64_t number, totalSize, maxSize;
inputBank->estimate (number, totalSize, maxSize);
u_int64_t estimationNbSeqToIterate = number;
// We create an iterator over the input bank
ProgressIterator<Sequence> itSeq (*inputBank, "split");
// We loop over sequences to get the exact number of sequences.
int64_t nbBanksOutput = -1;
u_int64_t nbSequences = 0;
u_int64_t dbSize = ~0;
bool isFastq = options->get(STR_OUTPUT_FASTQ) != 0;
bool isGzipped = options->get(STR_OUTPUT_GZ) != 0;
IBank* currentBank = 0;
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
if (dbSize > maxDbSize)
{
if (currentBank != 0) { currentBank->flush(); currentBank->finalize(); }
nbBanksOutput ++;
stringstream ss; ss << inputBasename << "_" << nbBanksOutput << (isFastq ? ".fastq" : ".fasta");
if (isGzipped) { ss << ".gz"; }
currentBank = album.addBank (outputDir, ss.str(), isFastq, isGzipped);
dbSize = 0;
}
dbSize += itSeq->getDataSize();
currentBank->insert (*itSeq);
}
if (currentBank != 0) { currentBank->flush(); }
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Iterating a bank whose sequences are kmers

This snippet shows how to iterate a BankKmers. Such a bank iterates all possible kmers for a given kmers size.

Each sequence is saved into an output FASTA file.

Code is from example bank11.cpp:

// We include what we need for the test
#include <iostream>
// We use the required packages
using namespace std;
/********************************************************************************/
/* Iterate a bank whose sequences are each kmer of a model. */
/* */
/* This snippet shows how iterate all possible kmers of a given size and dump */
/* them as output. This program can be viewed as a kmer generator. */
/* */
/* Cmd-line: bank11 -kmer-size <size> -out <fasta file> */
/* */
/* Sample: bank11 -kmer-size 4 -out /tmp/kmer.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
OptionsParser parser ("KmerIteration");
parser.push_back (new OptionOneParam (STR_KMER_SIZE, "kmer size", true));
parser.push_back (new OptionOneParam (STR_URI_OUTPUT, "output fasta bank", false));
try
{
IProperties* options = parser.parse (argc, argv);
BankKmers bank (options->getInt(STR_KMER_SIZE));
string outputName = options->get(STR_URI_OUTPUT) ?
options->getStr(STR_URI_OUTPUT) :
Stringify::format ("bank_k%d.fa", options->getInt(STR_KMER_SIZE));
BankFasta outputBank (outputName);
ProgressIterator<Sequence> it (bank);
for (it.first(); !it.isDone(); it.next()) { outputBank.insert (it.item()); }
outputBank.flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Extracting sequences with specific ids with FilterIterator

This snippet shows how to extract sequences from a bank. A sequence is kept if its index in its bank belongs to a set (provided as an input file holding one index per line)

It uses the FilterIterator feature.

Code is from example bank12.cpp:

// We include what we need for the test
#include <iostream>
// We use the required packages
using namespace std;
static const char* STR_URI_SEQ_IDS = "-seq-ids";
// We use a filtering functor that knows which sequence indexes have to be kept.
struct FilterFunctor
{
set<size_t>& indexes;
FilterFunctor (set<size_t>& indexes) : indexes(indexes) {}
bool operator () (Sequence& seq) const { return indexes.find (seq.getIndex()) != indexes.end(); }
};
/********************************************************************************/
/* Extract sequences from a bank. */
/* */
/* This snippet shows how to extract some sequences for a given list of indexes.*/
/* An "index" i means the i-th sequence inside the input file */
/* These indexes are read from an input file, one index per line. */
/* E.g. if the index files contains:
* 1
5 */
/* Then read number 1 and number 5 will be returned */
/* */
/* Cmd-line: bank12 -in <fasta/q file> -seq-ids <seq-ids file> */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
OptionsParser parser ("BankFilter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank reference", true));
parser.push_back (new OptionOneParam (STR_URI_SEQ_IDS, "file holding indexes of bank", true));
try
{
IProperties* options = parser.parse (argc, argv);
set<size_t> indexes;
FILE* file = fopen (options->getStr(STR_URI_SEQ_IDS).c_str(), "r");
if (file != 0)
{
char buffer[128];
while (fgets (buffer, sizeof(buffer), file)) { indexes.insert (atoi(buffer)); }
fclose (file);
}
cout << "found " << indexes.size() << " indexes" << endl;
string outputBankUri = options->getStr(STR_URI_INPUT) + "_" + System::file().getBaseName (options->getStr(STR_URI_SEQ_IDS));
BankFasta outputBank(outputBankUri);
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
LOCAL (inputBank);
FilterIterator<Sequence,FilterFunctor> itSeq (inputBank->iterator(), FilterFunctor(indexes));
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
outputBank.insert (itSeq.item());
}
outputBank.flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Extracting sequences with too many N characters

This snippet shows how to extract sequences that don't have too many N.

It uses the FilterIterator feature.

Code is from example bank13.cpp:

// We include what we need for the test
#include <iostream>
// We use the required packages
using namespace std;
static const char* STR_FILTER_RATIO = "-filter-ratio";
/********************************************************************************/
/* Filter sequences having too many bad letters. */
/* */
/* This snippet shows how to keep sequences that have enough "good" data. */
/* */
/* A sequence is retained only if nb(A,C,G,T)/seq_size>=filter-ratio . */
/* */
/* Cmd-line: bank13 -in <fasta/q file> -filter-ratio <[0.0..1.0]> */
/* */
/* Sample: bank13 -in gatb-core/gatb-core/test/db/reads1.fa -filter-ratio 0.8 */
/* (output bank will be file: reads1.fa_filtered) */
/* */
/* WARNING ! THIS SNIPPET SHOWS ALSO HOW TO USE LAMBDA EXPRESSIONS, SO YOU NEED */
/* TO USE A COMPILER THAT SUPPORTS THIS FEATURE. */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
OptionsParser parser ("BankFilter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank input", true));
parser.push_back (new OptionOneParam (STR_FILTER_RATIO, "skip a sequence if 'good letters number / seq.len > X'", false, "0.8"));
try
{
IProperties* options = parser.parse (argc, argv);
double percentThreshold = options->getDouble(STR_FILTER_RATIO);
IBank* inBank = Bank::open (options->getStr(STR_URI_INPUT));
LOCAL (inBank);
IBank* outBank = new BankFasta (options->getStr(STR_URI_INPUT) + "_filtered");
LOCAL (outBank);
inBank->iterate ([&] (Sequence& s)
{
char* data = s.getDataBuffer();
size_t nbOK = 0;
for (size_t i=0; i<s.getDataSize(); i++)
{
if (data[i]=='A' || data[i]=='C' || data[i]=='G' || data[i]=='T') { nbOK++; }
}
if ((double)nbOK / (double)s.getDataSize() > percentThreshold) { outBank->insert (s); }
});
outBank->flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}

[go back to top]


Computing statistics on a bank

This snippet shows how to read a bank and get statistics on it.

Code is from example bank14.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Bank management */
/* */
/* This snippet shows how to open a FASTA bank and iterate its sequences */
/* to provide some stats: data size, nb. sequences, etc. */
/* */
/* Cmd-line: bank14 -in <fasta/q file> */
/* */
/* Sample: bank14 -in gatb-core/gatb-core/test/db/reads1.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
OptionsParser parser ("BankStats");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank input", true));
try
{
IProperties* options = parser.parse (argc, argv);
// We get information about the bank.
u_int64_t nbSequences=0, dataSize=0, seqMaxSize=0, seqMinSize=~0;
// We declare an input Bank and use it locally
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
LOCAL (inputBank);
ProgressIterator<Sequence> it (*inputBank, "iterate");
for (it.first(); !it.isDone(); it.next())
{
Data& data = it.item().getData();
nbSequences ++;
if (data.size() > seqMaxSize) { seqMaxSize = data.size(); }
if (data.size() < seqMinSize) { seqMinSize = data.size(); }
dataSize += data.size ();
}
std::cout << "data size : " << dataSize << std::endl;
std::cout << "sequence number : " << nbSequences << std::endl;
std::cout << "sequence max size : " << seqMaxSize << std::endl;
std::cout << "sequence min size : " << seqMinSize << std::endl;
}
catch (OptionFailure& e)
{
return e.displayErrors (std::cout);
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]


Iterating paired end banks

This snippet shows how to read two banks in the same time, and each item is a pair of items of bank1 and bank2.

This is good example how to reads paired end banks.

Code is from example bank18.cpp:

// We include what we need for the test
#include <iostream>
/********************************************************************************/
/* Read two banks */
/* */
/* This snippet shows how to read two banks at the same time. Iterated items */
/* are pairs of two sequences. This may be useful to read pair ends banks for */
/* instance. */
/* */
/* This code produces on std::out the sequence description of each sequence */
/* contained in the two sequence files. */
/* */
/* See also bank17.cpp where we do the same using a single album file. */
/* */
/* Cmd-line: bank18 -one <fasta/q file> -two <fasta/q file> */
/* */
/* Sample: bank18 -one gatb-core/gatb-core/test/db/reads1.fa */
/* -two gatb-core/gatb-core/test/db/reads2.fa */
/* */
/********************************************************************************/
int main (int argc, char* argv[])
{
static const char* STR_BANK1 = "-one";
static const char* STR_BANK2 = "-two";
OptionsParser parser ("TwoBanks");
parser.push_back (new OptionOneParam (STR_BANK1, "bank one", true));
parser.push_back (new OptionOneParam (STR_BANK2, "bank two", true));
try
{
IProperties* options = parser.parse (argc, argv);
// We open the two banks
IBank* bank1 = Bank::open (options->getStr(STR_BANK1)); LOCAL (bank1);
IBank* bank2 = Bank::open (options->getStr(STR_BANK2)); LOCAL (bank2);
// We iterate the two banks. Note how we provide two iterators from the two banks.
PairedIterator<Sequence> * itPair = new PairedIterator<Sequence> (bank1->iterator(), bank2->iterator());
LOCAL(itPair);
ProgressIterator< std::pair <Sequence, Sequence>> progress_iter (itPair, "coucou la paire", bank1->estimateNbItems());
for (progress_iter.first(); !progress_iter.isDone(); progress_iter.next())
{
Sequence& s1 = itPair->item().first;
Sequence& s2 = itPair->item().second;
std::cout << "seq1.len=" << s1.getDataSize() << " seq2.len=" << s2.getDataSize() << std::endl;
}
}
catch (OptionFailure& e)
{
return e.displayErrors (std::cout);
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}

[go back to top]