vg
tools for working with variation graphs
vg::HaplotypeIndexer Class Reference

#include <haplotype_indexer.hpp>

Inherits vg::Progressive.

Public Member Functions

 HaplotypeIndexer ()
 Perform initialization of backing libraries.
 
size_t parse_vcf (const PathHandleGraph *graph, map< string, Path > &alt_paths, const vector< path_handle_t > &contigs, vcflib::VariantCallFile &variant_file, std::vector< std::string > &sample_names, const function< void(size_t, const gbwt::VariantPaths &, gbwt::PhasingInformation &)> &handle_contig_haplotype_batch)
 
tuple< vector< string >, size_t, vector< string > > generate_threads (const PathHandleGraph *graph, map< string, Path > &alt_paths, bool index_paths, const string &vcf_filename, const vector< string > &aln_filenames, const string &aln_format, const function< void(size_t)> &bit_width_ready, const function< void(const gbwt::vector_type &, const gbwt::size_type(&)[4])> &each_thread)
 
unique_ptr< gbwt::DynamicGBWT > build_gbwt (const PathHandleGraph *graph, map< string, Path > &alt_paths, bool index_paths, const string &vcf_filename, const vector< string > &aln_filenames, const string &aln_format)
 
Public Member Functions inherited from vg::Progressive
void preload_progress (const string &message)
 
void create_progress (const string &message, long count)
 
void create_progress (long count)
 
void update_progress (long i)
 
void increment_progress ()
 
void destroy_progress (void)
 

Public Attributes

bool warn_on_missing_variants = true
 Print a warning if variants in the VCF can't be found in the graph.
 
size_t found_missing_variants = 0
 
size_t max_missing_variant_warnings = 10
 Report at most this many missing-variant warnings.
 
map< string, string > path_to_vcf
 
bool rename_variants = true
 
string batch_file_prefix = ""
 
bool index_paths = false
 
bool phase_homozygous = true
 Phase homozygous unphased variants.
 
bool force_phasing = false
 Arbitrarily phase all unphased variants.
 
bool discard_overlaps = false
 Join together overlapping haplotypes.
 
size_t samples_in_batch = 200
 Number of samples to process together in a haplotype batch.
 
size_t gbwt_buffer_size = gbwt::DynamicGBWT::INSERT_BATCH_SIZE / gbwt::MILLION
 Size of the GBWT buffer in millions of nodes.
 
size_t id_interval = gbwt::DynamicGBWT::SAMPLE_INTERVAL
 Interval at which to sample for GBWT locate.
 
pair< size_t, size_t > sample_range = pair<size_t, size_t>(0, numeric_limits<size_t>::max())
 Range of VCF samples to process (first to past-last).
 
map< string, pair< size_t, size_t > > regions
 
unordered_set< string > excluded_samples
 
Public Attributes inherited from vg::Progressive
bool show_progress = false
 

Detailed Description

Allows indexing haplotypes, either to pre-parsed haplotype files or to a GBWT.

Constructor & Destructor Documentation

◆ HaplotypeIndexer()

vg::HaplotypeIndexer::HaplotypeIndexer ( )

Perform initialization of backing libraries.

Member Function Documentation

◆ build_gbwt()

unique_ptr< gbwt::DynamicGBWT > vg::HaplotypeIndexer::build_gbwt ( const PathHandleGraph *  graph,
map< string, Path > &  alt_paths,
bool  index_paths,
const string &  vcf_filename,
const vector< string > &  aln_filenames,
const string &  aln_format 
)

Build a GBWT from the given haplotype sources.

graph is the graph to operate on.

alt_paths is a map of pre-extracted alt paths. If not filled in, alt paths will be extracted. The map will be cleared when the function returns.

index_paths is a flag for whether to include non-alt paths in the graph as haplotypes in the GBWT.

If vcf_filename is set, includes haplotypes from the VCF in the GBWT. If batch_file_prefix is set on the object, also dumps VCF parse information.

If aln_filenames is nonempty, includes alignment paths from those files as haplotypes. In that case, index_paths must be false and vcf_filename must be empty.

aln_format can be "GAM" or "GAF".

Respects excluded_samples and does not produce threads for them.
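
The rule for combining haplotype sources can be checked before calling build_gbwt(). The following is a minimal sketch of that check only; haplotype_sources_compatible is a hypothetical helper, not part of vg, and it encodes nothing beyond the constraint stated above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of vg: returns true when the combination of
// haplotype sources satisfies the rule documented for build_gbwt().
bool haplotype_sources_compatible(bool index_paths,
                                  const std::string& vcf_filename,
                                  const std::vector<std::string>& aln_filenames) {
    if (aln_filenames.empty()) {
        // Without alignment files, graph paths and a VCF may be combined.
        return true;
    }
    // Alignment input excludes both of the other sources.
    return !index_paths && vcf_filename.empty();
}
```

A caller could run this check first and report a usage error instead of passing an unsupported combination to the indexer.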

◆ generate_threads()

tuple< vector< string >, size_t, vector< string > > vg::HaplotypeIndexer::generate_threads ( const PathHandleGraph *  graph,
map< string, Path > &  alt_paths,
bool  index_paths,
const string &  vcf_filename,
const vector< string > &  aln_filenames,
const string &  aln_format,
const function< void(size_t)> &  bit_width_ready,
const function< void(const gbwt::vector_type &, const gbwt::size_type(&)[4])> &  each_thread 
)

Collect haplotype threads and metadata by combining haplotype sources.

graph is the graph to operate on.

alt_paths is a map of pre-extracted alt paths. If not filled in, alt paths will be extracted. The map will be cleared when the function returns.

index_paths is a flag for whether to include non-alt paths in the graph as haplotypes in the GBWT.

If vcf_filename is set, includes haplotypes from the VCF in the GBWT. If batch_file_prefix is set on the object, also dumps VCF parse information.

If aln_filenames is nonempty, includes alignment paths from those files as haplotypes. In that case, index_paths must be false and vcf_filename must be empty.

aln_format can be "GAM" or "GAF".

First, determines the bit width necessary to encode the threads that will be produced, and announces it to the bit_width_ready callback.

Then, for each thread in serial (describing a contiguous portion of a haplotype on a contig), calls each_thread with the thread data itself and an array of numbers describing the thread name.

Respects excluded_samples and does not produce threads for them.

Returns the sample names, the total haplotype count, and the contig names.
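
The callback order described above (bit width first, then one call per thread, in serial) can be illustrated with stand-in types. This sketch does not use GBWT or vg: NamedThread, bits_needed, and emit_threads are hypothetical names, plain 64-bit integers stand in for gbwt::vector_type and gbwt::size_type, and the bit width is assumed to be driven by the largest value in any thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one thread plus the four numbers naming it.
struct NamedThread {
    std::vector<std::uint64_t> nodes;
    std::uint64_t name[4];
};

// Number of bits needed to represent value (at least 1).
std::size_t bits_needed(std::uint64_t value) {
    std::size_t bits = 1;
    while (value >>= 1) {
        ++bits;
    }
    return bits;
}

// Announce the bit width once, then hand over each thread in order.
void emit_threads(
    const std::vector<NamedThread>& threads,
    const std::function<void(std::size_t)>& bit_width_ready,
    const std::function<void(const std::vector<std::uint64_t>&,
                             const std::uint64_t (&)[4])>& each_thread) {
    std::uint64_t max_value = 0;
    for (const NamedThread& thread : threads) {
        for (std::uint64_t node : thread.nodes) {
            max_value = std::max(max_value, node);
        }
    }
    bit_width_ready(bits_needed(max_value));
    for (const NamedThread& thread : threads) {
        each_thread(thread.nodes, thread.name);
    }
}
```

Announcing the width before any thread lets a consumer size its buffers once, up front, rather than resizing as threads arrive.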

◆ parse_vcf()

size_t vg::HaplotypeIndexer::parse_vcf ( const PathHandleGraph *  graph,
map< string, Path > &  alt_paths,
const vector< path_handle_t > &  contigs,
vcflib::VariantCallFile &  variant_file,
std::vector< std::string > &  sample_names,
const function< void(size_t, const gbwt::VariantPaths &, gbwt::PhasingInformation &)> &  handle_contig_haplotype_batch 
)

Parse a VCF file into the types needed for GBWT indexing.

Takes a graph, a map of alt paths by name that will be extracted from the graph if not populated, a vector of contigs in the graph to process, in order, and the corresponding VCF file, already open. Sample parsing on the VCF file should be turned off.

Uses the given vector of sample names, which must be pre-populated with any other samples (such as "ref") already in a GBWTBuilder that you are using with the results of this function. Sample names from the VCF will be added.

Calls the callback serially, once per batch of samples, with the contig number, that contig's gbwt::VariantPaths, and the batch's gbwt::PhasingInformation. The gbwt::PhasingInformation is not const because the GBWT library needs to modify it in order to generate haplotypes from it efficiently.

If batch_file_prefix is set on the object, also dumps VCF parse information.

Doesn't create threads for embedded graph paths itself.

Ignores excluded_samples.

Returns the number of haplotypes created (2 per sample). This number will need to be adjusted if any samples' haplotypes are filtered out later. This function ignores any sample filters and processes the entire VCF.
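
The bookkeeping around sample names and the returned count can be sketched in isolation. Both functions below are hypothetical stand-ins, not vg code; they assume diploid samples (2 haplotypes each), as the return value description states.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in: append the VCF's sample names to a list that already
// holds any earlier samples (such as "ref"), and return the haplotype count.
std::size_t register_vcf_samples(std::vector<std::string>& sample_names,
                                 const std::vector<std::string>& vcf_samples) {
    sample_names.insert(sample_names.end(), vcf_samples.begin(), vcf_samples.end());
    return 2 * vcf_samples.size();
}

// Hypothetical stand-in: correct the parsed count after some samples have
// been filtered out downstream.
std::size_t adjusted_haplotype_count(std::size_t parsed_haplotypes,
                                     std::size_t filtered_samples) {
    return parsed_haplotypes - 2 * filtered_samples;
}
```

For example, starting from a list holding only "ref" and a VCF with three samples, the list grows to four names and the count is 6; filtering one sample later brings the count to 4.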

Member Data Documentation

◆ batch_file_prefix

string vg::HaplotypeIndexer::batch_file_prefix = ""

If batch_file_prefix is nonempty, a file for each contig is saved to PREFIX_VCFCONTIG, and files for each batch of haplotypes are saved to files named like PREFIX_VCFCONTIG_STARTSAMPLE_ENDSAMPLE. Otherwise, the batch files are still saved, but to temporary files.
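
The naming pattern can be written out as two small functions. These helpers are hypothetical and only follow the pattern given above; the exact formatting vg uses, and whether the end sample is inclusive or past-the-end, are not confirmed here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: per-contig file name, PREFIX_VCFCONTIG.
std::string contig_file_name(const std::string& prefix,
                             const std::string& vcf_contig) {
    return prefix + "_" + vcf_contig;
}

// Hypothetical helper: per-batch file name,
// PREFIX_VCFCONTIG_STARTSAMPLE_ENDSAMPLE.
std::string batch_file_name(const std::string& prefix,
                            const std::string& vcf_contig,
                            std::size_t start_sample,
                            std::size_t end_sample) {
    return contig_file_name(prefix, vcf_contig) + "_" +
           std::to_string(start_sample) + "_" + std::to_string(end_sample);
}
```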

◆ discard_overlaps

bool vg::HaplotypeIndexer::discard_overlaps = false

Join together overlapping haplotypes.

◆ excluded_samples

unordered_set<string> vg::HaplotypeIndexer::excluded_samples

Excluded VCF sample names, for which threads will not be generated. Ignored during VCF parsing.

◆ force_phasing

bool vg::HaplotypeIndexer::force_phasing = false

Arbitrarily phase all unphased variants.

◆ found_missing_variants

size_t vg::HaplotypeIndexer::found_missing_variants = 0

Tracks the number of variants in the phasing VCF that aren't found in the graph. TODO: Make atomic?

◆ gbwt_buffer_size

size_t vg::HaplotypeIndexer::gbwt_buffer_size = gbwt::DynamicGBWT::INSERT_BATCH_SIZE / gbwt::MILLION

Size of the GBWT buffer in millions of nodes.

◆ id_interval

size_t vg::HaplotypeIndexer::id_interval = gbwt::DynamicGBWT::SAMPLE_INTERVAL

Interval at which to sample for GBWT locate.

◆ index_paths

bool vg::HaplotypeIndexer::index_paths = false

If set to true, store paths from the graph alongside haplotype threads from the VCF, if any.

◆ max_missing_variant_warnings

size_t vg::HaplotypeIndexer::max_missing_variant_warnings = 10

Report at most this many missing-variant warnings.

◆ path_to_vcf

map<string, string> vg::HaplotypeIndexer::path_to_vcf

Path names in the graph are mapped to VCF contig names via path_to_vcf, or used as-is if no entry there is found.
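
The lookup with fallback amounts to a few lines. This is a minimal sketch; vcf_contig_name is a hypothetical free function, not a member of the class.

```cpp
#include <map>
#include <string>

// Hypothetical helper: translate a graph path name to a VCF contig name,
// falling back to the path name itself when no mapping exists.
std::string vcf_contig_name(const std::map<std::string, std::string>& path_to_vcf,
                            const std::string& path_name) {
    auto found = path_to_vcf.find(path_name);
    return found == path_to_vcf.end() ? path_name : found->second;
}
```

With a map holding only chr1 -> 1, the path chr1 resolves to 1 and an unmapped path such as chrX is returned unchanged.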

◆ phase_homozygous

bool vg::HaplotypeIndexer::phase_homozygous = true

Phase homozygous unphased variants.
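
One way to read the interplay of phase_homozygous and force_phasing is as a decision on each diploid genotype string. The sketch below is an interpretation, not vg or GBWT code: treat_as_phased is a hypothetical function, genotypes are assumed to be VCF-style strings such as "0|1" or "1/1", and missing calls are not handled specially.

```cpp
#include <string>

// Hypothetical helper: decide whether a genotype is handled as phased.
bool treat_as_phased(const std::string& genotype,
                     bool phase_homozygous,
                     bool force_phasing) {
    std::string::size_type sep = genotype.find_first_of("|/");
    if (sep == std::string::npos) {
        return true;  // Single allele: there is nothing to phase.
    }
    if (genotype[sep] == '|') {
        return true;  // Already phased in the VCF.
    }
    if (force_phasing) {
        return true;  // Unphased, but phased arbitrarily on request.
    }
    // Unphased: only a homozygous call can be phased without guessing,
    // because both haplotypes carry the same allele.
    std::string left = genotype.substr(0, sep);
    std::string right = genotype.substr(sep + 1);
    return phase_homozygous && left == right;
}
```

Under the defaults shown on this page (phase_homozygous = true, force_phasing = false), this reading treats "1/1" as phased and "0/1" as unphased.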

◆ regions

map<string, pair<size_t, size_t> > vg::HaplotypeIndexer::regions

Region restrictions for contigs, in VCF name space, as 0-based exclusive-end ranges.
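
A 0-based exclusive-end range [first, second) contains a position pos when first <= pos < second. The sketch below applies that test; in_region is a hypothetical helper, and treating a contig with no entry as unrestricted is an assumption. Since VCF POS values are 1-based, a record at POS p corresponds to the 0-based position p - 1.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: test a 0-based position against the region
// restriction for a contig, named in VCF name space.
bool in_region(const std::map<std::string, std::pair<std::size_t, std::size_t>>& regions,
               const std::string& vcf_contig,
               std::size_t pos) {
    auto found = regions.find(vcf_contig);
    if (found == regions.end()) {
        return true;  // Assumption: no entry means the contig is unrestricted.
    }
    return pos >= found->second.first && pos < found->second.second;
}
```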

◆ rename_variants

bool vg::HaplotypeIndexer::rename_variants = true

Use graph path names instead of VCF path names when composing variant alt paths.

◆ sample_range

pair<size_t, size_t> vg::HaplotypeIndexer::sample_range = pair<size_t, size_t>(0, numeric_limits<size_t>::max())

Range of VCF samples to process (first to past-last).

◆ samples_in_batch

size_t vg::HaplotypeIndexer::samples_in_batch = 200

Number of samples to process together in a haplotype batch.
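
Together, sample_range and samples_in_batch determine how samples are grouped. The sketch below shows one plausible way to derive the batches; sample_batches is a hypothetical function, and clamping the range to the number of samples in the VCF is an assumption suggested by the default upper bound of numeric_limits<size_t>::max().

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split a first-to-past-last sample range into batches
// of at most samples_in_batch samples, each also first-to-past-last.
std::vector<std::pair<std::size_t, std::size_t>> sample_batches(
    std::pair<std::size_t, std::size_t> sample_range,
    std::size_t num_samples,
    std::size_t samples_in_batch) {
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    if (samples_in_batch == 0) {
        return batches;  // Avoid an endless loop on a zero batch size.
    }
    // Clamp the open-ended default to the samples actually present.
    std::size_t end = std::min(sample_range.second, num_samples);
    std::size_t start = sample_range.first;
    while (start < end) {
        std::size_t batch_end = start + std::min(samples_in_batch, end - start);
        batches.emplace_back(start, batch_end);
        start = batch_end;
    }
    return batches;
}
```

With the default range, 450 samples, and the default batch size of 200, this yields the batches [0, 200), [200, 400), and [400, 450).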

◆ warn_on_missing_variants

bool vg::HaplotypeIndexer::warn_on_missing_variants = true

Print a warning if variants in the VCF can't be found in the graph.
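
The three attributes warn_on_missing_variants, found_missing_variants, and max_missing_variant_warnings suggest a capped warning counter. The sketch below shows one way they could interact; MissingVariantReporter is a hypothetical type, the message text is invented for illustration, and counting missing variants even when warnings are disabled is an assumption.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical stand-in mirroring the three attributes on this page.
struct MissingVariantReporter {
    bool warn_on_missing_variants = true;
    std::size_t found_missing_variants = 0;
    std::size_t max_missing_variant_warnings = 10;

    // Record one missing variant; returns true if a warning was printed.
    bool report(const std::string& variant_id) {
        ++found_missing_variants;
        if (warn_on_missing_variants &&
            found_missing_variants <= max_missing_variant_warnings) {
            std::cerr << "warning: variant " << variant_id
                      << " not found in graph" << std::endl;
            return true;
        }
        return false;
    }
};
```

Capping the output keeps a VCF that largely mismatches the graph from flooding the log, while the counter still records the full extent of the problem.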

