Presentation
This page presents some code snippets related to the use of Bank API.
Some of the snippets presented below can be used online.
Additional snippets are available in directory: gatb-core/gatb-core/examples/bank.
Parsing a single FASTA bank without checks
This snippet shows how to read one FASTA bank in a simple way. No check is done about the correctness of the FASTA bank file path.
Some information of each iterated sequence are diplayed as output.
Code is from example bank1.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
const char* filename = argc >= 2 ? argv[1] : "";
BankFasta b (filename);
BankFasta::Iterator it (b);
for (it.first(); !it.isDone(); it.next())
{
std::cout << "[" << it->getDataSize() << "] " << it->getComment() << std::endl;
std::cout << it->toString() << std::endl;
}
}
[go back to top]
Parsing several banks
This snippet shows how to read one ore more banks in a simple way. The idea is to use the Bank::open method that analyzes the provided uri and get the correct IBank handle for one or more banks. For instance, one can run this snippet with:
- bank2 reads1.fa,reads2.fa,reads3.fa
Some information of each iterated sequence are diplayed as output.
Code is from example bank2.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
Iterator<Sequence>* it = inputBank->iterator();
for (it->first(); !it->isDone(); it->next())
{
Sequence& seq = it->item();
std::cout << "[" << seq.getDataSize() << "] " << seq.getComment() << std::endl;
std::cout << seq.toString() << std::endl;
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]
Parsing a FASTA bank in a different way
This snippet shows how to read one or more FASTA banks in a "push" model; it means that the sequence iterator calls some function for each sequence.
This is another way to iterate items and opposite to the "pull" model where the iterator is called to provide the current item, instead of calling some function to do as we do in this sample.
Code is from example bank3.cpp:
#include <iostream>
struct Functor { void operator () (Sequence& s) const
{
std::cout << "[" << s.getDataSize() << "] " << s.getComment() << std::endl;
std::cout << s.toString () << std::endl;
}};
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
IBank* inputBank = Bank::open (argv[1]);
Iterator<Sequence>* it = inputBank->iterator();
Functor fct;
it->iterate (fct);
}
[go back to top]
Parsing a FASTA bank and getting progress information
This snippet shows how to create an iterator on something (here sequences from a FASTA file) and encapsulate it with another iterator that adds the possibility to notify some listener every 10 iterated sequences (used here for showing some progression during the iteration).
Note: the "notifying" iterator is generic and could be reused to send progress notification with any kind of iterator, not only on sequences.
Code is from example bank4.cpp:
#include <iostream>
struct ProgressFunctor : public IteratorListener { void inc (u_int64_t current) { std::cout << "."; } };
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
SubjectIterator<Sequence> iter (inputBank->iterator(), 10);
iter.addObserver (new ProgressFunctor());
for (iter.first(); !iter.isDone(); iter.next())
{
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]
Parsing a FASTA bank and getting percentage progress information
This snippet shows how to read one or more FASTA banks and get a percentage progress information during the iteration.
We use there a ProgressIterator on Sequence. By default, we got progress information with remaining time estimation and resources usage (cpu and memory). Note that the ProgressIterator class has a second template parameter that can provide other progress information, try for instance:
- ProgressIterator<Sequence,ProgressTimer>
Code is from example bank5.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "you must provide a bank." << std::endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
ProgressIterator<Sequence> iter (*inputBank, "Iterating sequences");
for (iter.first(); !iter.isDone(); iter.next())
{
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]
Output a FASTA file with data line of given size
This snippet provides a small utility for cutting lines of data with a given size.
Code is from example bank6.cpp:
#include <iostream>
#include <fstream>
#include <string>
int main (int argc, char* argv[])
{
if (argc < 2)
{
cerr << "you must provide at least the FASTA file path. Arguments are:" << endl;
cerr << " 1) FASTA file path" << endl;
cerr << " 2) data lines size (60 by default, 0 means only one line for sequence data)" << endl;
cerr << " 3) type of comments: 0 for new comments, 1 for only ids, 2 (default) for full comments" << endl;
cerr << " 4) number of sequences to keep (all by default)" << endl;
return EXIT_FAILURE;
}
const char* filename = argv[1];
size_t dataLineSize = argc >= 3 ? atoi(argv[2]) : 60;
if (dataLineSize==0) { dataLineSize = ~0; }
BankFasta::Iterator::CommentMode_e mode = BankFasta::Iterator::FULL;
if (argc >= 4)
{
int m = atoi (argv[3]);
if (m==0) { mode = BankFasta::Iterator::NONE; }
else if (m==1) { mode = BankFasta::Iterator::IDONLY; }
}
u_int64_t nbSequences = (argc >= 5 ? atol (argv[4]) : ~0);
try {
BankFasta b (filename);
BankFasta::Iterator itSeq (b, mode);
TruncateIterator<Sequence> it (itSeq, nbSequences);
size_t idxSeq = 1;
for (it.first(); !it.isDone(); it.next(), idxSeq++)
{
if (!it->getComment().empty()) { cout << ">" << it->getComment() << endl; }
else { cout << ">seq=" << idxSeq << " len=" << it.item().getDataSize() << endl; }
size_t len = it->getDataSize();
for (size_t i=0; i<len; )
{
for (size_t j=0; j<dataLineSize && i<len; j++, i++)
{
cout << (char) (it->getData() [i]);
}
cout << endl;
}
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
return EXIT_SUCCESS;
}
[go back to top]
Filter sequences with a given data size
This snippet shows how to parse a bank with a functor used to filter out some items.
Code is from example bank7.cpp:
#include <iostream>
size_t threshold = 500;
struct FilterFunctor { bool operator () (Sequence& seq) const { return seq.getDataSize() >= threshold; } };
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide at least the size threshold and the bank file path. Arguments are:" << endl;
cerr << " 1) bank" << endl;
cerr << " 2) minimum size threshold (default 500)" << endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
threshold = atoi (argv[2]);
FilterIterator<Sequence,FilterFunctor> itSeq (inputBank->iterator(), FilterFunctor());
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
cout << "[" << itSeq->getDataSize() << "] " << itSeq->getComment() << endl;
cout << itSeq->toString() << endl;
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Filter sequences using Phred quality
This snippet shows how to parse a FastQ file and filtering out reads by Phred quality.
Code is from example bank24.cpp:
#include <iostream>
int threshold = 30;
int computeMeanPhredScore(const std::string& quality){
int score=0;
for(char c : quality){
score += (c-33);
}
return score / quality.size();
}
struct QualityFilter {
bool operator () (Sequence& seq) const {
return computeMeanPhredScore(seq.getQuality()) >= threshold;
}
};
int main (int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "Please, provide a sequence file." << std::endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
FilterIterator<Sequence,QualityFilter> it (inputBank->iterator(), QualityFilter());
for (it.first(); !it.isDone(); it.next())
{
Sequence& seq = it.item();
std::cout << "[" << seq.getQuality() << "] " << computeMeanPhredScore(seq.getQuality()) << std::endl;
}
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]
Conversion of a FASTA bank to a binary format
This snippet shows how to parse a nucleic bank and convert it to a binary format.
Code is from example bank8.cpp:
#include <iostream>
#include <iomanip>
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide an input and output banks names:" << endl;
cerr << " 1) bank file path" << endl;
cerr << " 2) binary file path" << endl;
cerr << " 3) dump binary output: 0 for no (default), 1 for yes" << endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
BankBinary outputBank (argv[2]);
ProgressIterator<Sequence> itSeq (*inputBank, "Converting input file into binary format");
for (itSeq.first(); !itSeq.isDone(); itSeq.next()) { outputBank.insert (itSeq.item()); }
outputBank.flush ();
if (argc >= 4 && atoi(argv[3])==1)
{
Iterator<Sequence>* itBinary = outputBank.iterator();
for (itBinary->first(); !itBinary->isDone(); itBinary->next())
{
Sequence& seq = itBinary->item();
cout << seq.getComment() << endl;
for (size_t i=0; i< (seq.getDataSize()+3)/4; i++)
{
cout << "[" << setfill('0') << setw(2) << hex << (int) (seq.getData()[i] & 0xFF) << " ";
char b = (seq.getData()[i] & 0xFF) ;
static char tableASCII[] = {'A', 'C', 'T', 'G'};
cout << tableASCII[(b>>6)&3]
<< tableASCII[(b>>4)&3]
<< tableASCII[(b>>2)&3]
<< tableASCII[(b>>0)&3]
<< "] ";
if ((i+1)%16==0) { cout << endl; }
}
cout << endl;
}
}
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Conversion of a bank with some filtering
This snippet shows how to parse a bank, check whether the sequences match a criteria and dump the matching sequences into an output FASTA bank.s
Code is from example bank9.cpp:
#include <iostream>
struct FilterFunctor
{
int modulo;
FilterFunctor (int modulo) : modulo(modulo) {}
bool operator () (Sequence& seq)
{
return (seq.getIndex() % modulo == 0) && (seq.toString().find("N") == string::npos);
}
};
int main (int argc, char* argv[])
{
if (argc < 3)
{
cerr << "you must provide an input and output banks names:" << endl;
cerr << " 1) input URI" << endl;
cerr << " 2) output URI" << endl;
cerr << " 3) modulo (ie keep sequence id % modulo)" << endl;
return EXIT_FAILURE;
}
try
{
IBank* inputBank = Bank::open (argv[1]);
BankFasta outputBank (argv[2]);
int modulo = argc >= 4 ? atoi (argv[3]) : 1;
if (modulo<=0) { modulo=1; }
FilterIterator<Sequence,FilterFunctor>* itFilter = new FilterIterator<Sequence,FilterFunctor>(
inputBank->iterator(),
FilterFunctor (modulo)
);
ProgressIterator<Sequence> itSeq (itFilter, "Filtering bank", inputBank->estimateNbItems()/modulo );
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
outputBank.insert (*itSeq);
}
outputBank.flush ();
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Split a bank
This snippet shows how to split an input bank into parts. It creates an output album bank by using the BankAlbum class. Example of use:
bank10 -in reads.fa -max-size 10000000
This example will create a directory 'reads_S1000000' where we can find:
- album.txt => album file containing all reads_X split parts
- reads_0
- reads_1
- ...
- reads_N
All the 'reads_X' files are about 1000000 bytes. Now, the generated album.txt file can be used as a bank and could be the input bank of the snippet bank14 for instance (and we should get the same results as using directly the 'reads.fa' bank)
Code is from example bank10.cpp:
#include <iostream>
#include <sstream>
static const char* STR_MAX_INPUT_SIZE = "-max-size";
static const char* STR_OUTPUT_FASTQ = "-fastq";
static const char* STR_OUTPUT_GZ = "-gz";
int main (int argc, char* argv[])
{
OptionsParser parser ("BankSplitter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank reference", true));
parser.push_back (new OptionOneParam (STR_MAX_INPUT_SIZE, "average db size per split", true));
parser.push_back (new OptionOneParam (STR_URI_OUTPUT_DIR, "output directory", false, "."));
parser.push_back (new OptionNoParam (STR_OUTPUT_FASTQ, "fastq output", false));
parser.push_back (new OptionNoParam (STR_OUTPUT_GZ, "gzip output", false));
try
{
IProperties* options = parser.parse (argc, argv);
u_int64_t maxDbSize = options->getInt(STR_MAX_INPUT_SIZE);
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
string inputBasename = System::file().getBaseName (options->getStr(STR_URI_INPUT));
stringstream ss; ss << inputBasename << "_S" << maxDbSize;
string outputDirName = ss.str();
string outputDir = options->getStr(STR_URI_OUTPUT_DIR) + "/" + outputDirName;
System::file().mkdir (outputDir, S_IRWXU);
BankAlbum album (outputDir + "/album.txt");
u_int64_t number, totalSize, maxSize;
inputBank->estimate (number, totalSize, maxSize);
u_int64_t estimationNbSeqToIterate = number;
ProgressIterator<Sequence> itSeq (*inputBank, "split");
int64_t nbBanksOutput = -1;
u_int64_t nbSequences = 0;
u_int64_t dbSize = ~0;
bool isFastq = options->get(STR_OUTPUT_FASTQ) != 0;
bool isGzipped = options->get(STR_OUTPUT_GZ) != 0;
IBank* currentBank = 0;
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
if (dbSize > maxDbSize)
{
if (currentBank != 0) { currentBank->flush(); currentBank->finalize(); }
nbBanksOutput ++;
stringstream ss; ss << inputBasename << "_" << nbBanksOutput << (isFastq ? ".fastq" : ".fasta");
if (isGzipped) { ss << ".gz"; }
currentBank = album.addBank (outputDir, ss.str(), isFastq, isGzipped);
dbSize = 0;
}
dbSize += itSeq->getDataSize();
currentBank->insert (*itSeq);
}
if (currentBank != 0) { currentBank->flush(); }
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Iterating a bank whose sequences are kmers
This snippet shows how to iterate a BankKmers. Such a bank iterates all possible kmers for a given kmers size.
Each sequence is saved into an output FASTA file.
Code is from example bank11.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
OptionsParser parser ("KmerIteration");
parser.push_back (new OptionOneParam (STR_KMER_SIZE, "kmer size", true));
parser.push_back (new OptionOneParam (STR_URI_OUTPUT, "output fasta bank", false));
try
{
IProperties* options = parser.parse (argc, argv);
BankKmers bank (options->getInt(STR_KMER_SIZE));
string outputName = options->get(STR_URI_OUTPUT) ?
options->getStr(STR_URI_OUTPUT) :
Stringify::format ("bank_k%d.fa", options->getInt(STR_KMER_SIZE));
BankFasta outputBank (outputName);
ProgressIterator<Sequence> it (bank);
for (it.first(); !it.isDone(); it.next()) { outputBank.insert (it.item()); }
outputBank.flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Extracting sequences with specific ids with FilterIterator
This snippet shows how to extract sequences from a bank. A sequence is kept if its index in its bank belongs to a set (provided as an input file holding one index per line)
It uses the FilterIterator feature.
Code is from example bank12.cpp:
#include <iostream>
static const char* STR_URI_SEQ_IDS = "-seq-ids";
struct FilterFunctor
{
set<size_t>& indexes;
FilterFunctor (set<size_t>& indexes) : indexes(indexes) {}
bool operator () (Sequence& seq) const { return indexes.find (seq.getIndex()) != indexes.end(); }
};
int main (int argc, char* argv[])
{
OptionsParser parser ("BankFilter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank reference", true));
parser.push_back (new OptionOneParam (STR_URI_SEQ_IDS, "file holding indexes of bank", true));
try
{
IProperties* options = parser.parse (argc, argv);
set<size_t> indexes;
FILE* file = fopen (options->getStr(STR_URI_SEQ_IDS).c_str(), "r");
if (file != 0)
{
char buffer[128];
while (fgets (buffer, sizeof(buffer), file)) { indexes.insert (atoi(buffer)); }
fclose (file);
}
cout << "found " << indexes.size() << " indexes" << endl;
string outputBankUri = options->getStr(STR_URI_INPUT) + "_" + System::file().getBaseName (options->getStr(STR_URI_SEQ_IDS));
BankFasta outputBank(outputBankUri);
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
FilterIterator<Sequence,FilterFunctor> itSeq (inputBank->iterator(), FilterFunctor(indexes));
for (itSeq.first(); !itSeq.isDone(); itSeq.next())
{
outputBank.insert (itSeq.item());
}
outputBank.flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Extracting sequences with too many N characters
This snippet shows how to extract sequences that don't have too many N.
It uses the FilterIterator feature.
Code is from example bank13.cpp:
#include <iostream>
static const char* STR_FILTER_RATIO = "-filter-ratio";
int main (int argc, char* argv[])
{
OptionsParser parser ("BankFilter");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank input", true));
parser.push_back (new OptionOneParam (STR_FILTER_RATIO, "skip a sequence if 'good letters number / seq.len > X'", false, "0.8"));
try
{
IProperties* options = parser.parse (argc, argv);
double percentThreshold = options->getDouble(STR_FILTER_RATIO);
IBank* inBank = Bank::open (options->getStr(STR_URI_INPUT));
IBank* outBank = new BankFasta (options->getStr(STR_URI_INPUT) + "_filtered");
inBank->iterate ([&] (Sequence& s)
{
char* data = s.getDataBuffer();
size_t nbOK = 0;
for (size_t i=0; i<s.getDataSize(); i++)
{
if (data[i]=='A' || data[i]=='C' || data[i]=='G' || data[i]=='T') { nbOK++; }
}
if ((double)nbOK / (double)s.getDataSize() > percentThreshold) { outBank->insert (s); }
});
outBank->flush();
}
catch (OptionFailure& e)
{
return e.displayErrors (cout);
}
catch (Exception& e)
{
cerr << "EXCEPTION: " << e.getMessage() << endl;
}
}
[go back to top]
Computing statistics on a bank
This snippet shows how to read a bank and get statistics on it.
Code is from example bank14.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
OptionsParser parser ("BankStats");
parser.push_back (new OptionOneParam (STR_URI_INPUT, "bank input", true));
try
{
IProperties* options = parser.parse (argc, argv);
u_int64_t nbSequences=0, dataSize=0, seqMaxSize=0, seqMinSize=~0;
IBank* inputBank = Bank::open (options->getStr(STR_URI_INPUT));
ProgressIterator<Sequence> it (*inputBank, "iterate");
for (it.first(); !it.isDone(); it.next())
{
Data& data = it.item().getData();
nbSequences ++;
if (data.size() > seqMaxSize) { seqMaxSize = data.size(); }
if (data.size() < seqMinSize) { seqMinSize = data.size(); }
dataSize += data.size ();
}
std::cout << "data size : " << dataSize << std::endl;
std::cout << "sequence number : " << nbSequences << std::endl;
std::cout << "sequence max size : " << seqMaxSize << std::endl;
std::cout << "sequence min size : " << seqMinSize << std::endl;
}
catch (OptionFailure& e)
{
return e.displayErrors (std::cout);
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]
Iterating paired end banks
This snippet shows how to read two banks in the same time, and each item is a pair of items of bank1 and bank2.
This is good example how to reads paired end banks.
Code is from example bank18.cpp:
#include <iostream>
int main (int argc, char* argv[])
{
static const char* STR_BANK1 = "-one";
static const char* STR_BANK2 = "-two";
OptionsParser parser ("TwoBanks");
parser.push_back (new OptionOneParam (STR_BANK1, "bank one", true));
parser.push_back (new OptionOneParam (STR_BANK2, "bank two", true));
try
{
IProperties* options = parser.parse (argc, argv);
IBank* bank1 = Bank::open (options->getStr(STR_BANK1));
LOCAL (bank1);
IBank* bank2 = Bank::open (options->getStr(STR_BANK2));
LOCAL (bank2);
PairedIterator<Sequence> * itPair = new PairedIterator<Sequence> (bank1->iterator(), bank2->iterator());
ProgressIterator< std::pair <Sequence, Sequence>> progress_iter (itPair, "coucou la paire", bank1->estimateNbItems());
for (progress_iter.first(); !progress_iter.isDone(); progress_iter.next())
{
Sequence& s1 = itPair->item().first;
Sequence& s2 = itPair->item().second;
std::cout << "seq1.len=" << s1.getDataSize() << " seq2.len=" << s2.getDataSize() << std::endl;
}
}
catch (OptionFailure& e)
{
return e.displayErrors (std::cout);
}
catch (Exception& e)
{
std::cerr << "EXCEPTION: " << e.getMessage() << std::endl;
}
}
[go back to top]