The experimental Parquet reader class for optimally reading Parquet files subject to highly selective filters, an operation called a Hybrid Scan.
#include <hybrid_scan.hpp>
Public Member Functions
hybrid_scan_reader (cudf::host_span< uint8_t const > footer_bytes, parquet_reader_options const &options)
Constructor for the experimental Parquet reader class to optimally read Parquet files subject to highly selective filters.
~hybrid_scan_reader ()
Destructor for the experimental Parquet reader class.
FileMetaData | parquet_metadata () const
Get the Parquet file footer metadata.
byte_range_info | page_index_byte_range () const
Get the byte range of the page index in the Parquet file.
void | setup_page_index (cudf::host_span< uint8_t const > page_index_bytes) const
Set up the page index within the Parquet file metadata (FileMetaData).
std::vector< size_type > | all_row_groups (parquet_reader_options const &options) const
Get all available row groups from the Parquet file.
size_type | total_rows_in_row_groups (cudf::host_span< size_type const > row_group_indices) const
Get the total number of top-level rows in the row groups.
std::vector< size_type > | filter_row_groups_with_stats (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
Filter the input row groups using column chunk statistics.
std::pair< std::vector< byte_range_info >, std::vector< byte_range_info > > | secondary_filters_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.
std::vector< size_type > | filter_row_groups_with_dictionary_pages (cudf::host_span< rmm::device_buffer > dictionary_page_data, cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
Filter the row groups using column chunk dictionary pages.
std::vector< size_type > | filter_row_groups_with_bloom_filters (cudf::host_span< rmm::device_buffer > bloom_filter_data, cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
Filter the row groups using column chunk bloom filters.
std::pair< std::unique_ptr< cudf::column >, std::vector< thrust::host_vector< bool > > > | filter_data_pages_with_stats (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
Filter data pages of filter columns using page statistics from page index metadata.
std::vector< byte_range_info > | filter_column_chunks_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
Get byte ranges of column chunks of filter columns.
table_with_metadata | materialize_filter_columns (cudf::host_span< thrust::host_vector< bool > const > page_mask, cudf::host_span< size_type const > row_group_indices, std::vector< rmm::device_buffer > column_chunk_buffers, cudf::mutable_column_view row_mask, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
Materializes filter columns and updates the input row mask to only the rows that exist in the output table.
std::vector< byte_range_info > | payload_column_chunks_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
Get byte ranges of column chunks of payload columns.
table_with_metadata | materialize_payload_columns (cudf::host_span< size_type const > row_group_indices, std::vector< rmm::device_buffer > column_chunk_buffers, cudf::column_view row_mask, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
Materializes payload columns and applies the row mask to the output table.
The experimental parquet reader class to optimally read parquet files subject to highly selective filters, called a Hybrid Scan operation.
This class is designed to exploit reductive optimization techniques to speed up reading Parquet files subject to highly selective filters. The Parquet file contents are read in two passes. In the first pass, only the filter columns (i.e. columns that appear in the filter expression) are read, allowing row groups and filter column data pages to be pruned using the filter expression. In the second pass, only the payload columns (i.e. columns that do not appear in the filter expression) are read, applying the surviving row mask from the first pass to prune payload column data pages.
The following steps describe how to use the experimental Parquet reader.
Start by constructing an instance of the experimental reader from a span of Parquet file footer bytes and Parquet reader options.
Metadata handling (OPTIONAL): Get a materialized Parquet file footer metadata struct (FileMetaData) from the reader to gain insight into the Parquet data as needed. Optionally, set up the page index to materialize the page-level statistics used for data page pruning.
Row group pruning (OPTIONAL): Start with either a custom list of row group indices or all row group indices in the Parquet file, and optionally filter the list against the filter expression using column chunk statistics, dictionary pages, and bloom filters. The byte ranges of the column chunk dictionary pages and bloom filters within the Parquet file may be obtained via the secondary_filters_byte_ranges() function. These byte ranges may be read into corresponding vectors of device buffers and passed to the corresponding row group filtering functions.
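The statistics-based part of row group pruning can be pictured with a small host-only model. This is a sketch of the idea rather than libcudf code: `ChunkStats` and `prune_row_groups_less_than` are hypothetical names, the filter is assumed to be `column < threshold` on a single integer column, and min/max statistics are assumed to be available for every row group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the min/max statistics of one column chunk.
struct ChunkStats {
  std::int64_t min_value;
  std::int64_t max_value;
};

// Returns the subset of `row_group_indices` that may contain a row
// satisfying `column < threshold`. `stats[i]` describes row group `i`.
std::vector<int> prune_row_groups_less_than(std::vector<int> const& row_group_indices,
                                            std::vector<ChunkStats> const& stats,
                                            std::int64_t threshold)
{
  std::vector<int> surviving;
  for (int rg : row_group_indices) {
    assert(rg >= 0 && static_cast<std::size_t>(rg) < stats.size());
    // If even the smallest value is >= threshold, no row in this group can match.
    if (stats[static_cast<std::size_t>(rg)].min_value < threshold) { surviving.push_back(rg); }
  }
  return surviving;
}
```

Pruning of this kind is conservative: a surviving row group may still contain no matching rows, which is why the later steps refine the result further.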
Filter column page pruning (OPTIONAL): Once the row groups are filtered, the next step is to optionally prune the data pages within the current span of row groups, subject to the same filter expression, using the page statistics contained in the page index of the Parquet file. To get started, set up the page index using the setup_page_index() function if not previously done, and then filter the data pages using the filter_data_pages_with_stats() function. This function returns a row mask, i.e. a BOOL8 column indicating which rows may survive in the materialized table of filter columns (first reader pass), and a data page mask, i.e. a vector of boolean host vectors indicating which data pages of each filter column need to be processed to materialize the table of filter columns (first reader pass).
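The relationship between the row mask and the data page mask can be illustrated with a host-only model. This is a sketch under stated assumptions, not the library's implementation: `PageStats`, `PagePruningResult` and `prune_pages_less_than` are hypothetical names, the filter is assumed to be a conjunction of `column_c < thresholds[c]` terms, and the pages of every filter column are assumed to cover exactly `total_rows` rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one data page's entry in the page index.
struct PageStats {
  int num_rows;
  std::int64_t min_value;
  std::int64_t max_value;
};

struct PagePruningResult {
  std::vector<bool> row_mask;                // one flag per row
  std::vector<std::vector<bool>> page_mask;  // one vector per filter column, one flag per page
};

// `columns[c]` lists the pages of filter column `c` in row order.
PagePruningResult prune_pages_less_than(std::vector<std::vector<PageStats>> const& columns,
                                        std::vector<std::int64_t> const& thresholds,
                                        int total_rows)
{
  assert(columns.size() == thresholds.size());
  PagePruningResult result;
  result.row_mask.assign(static_cast<std::size_t>(total_rows), true);

  // Step 1: a row may survive only if its page in every filter column may hold a match.
  for (std::size_t c = 0; c < columns.size(); ++c) {
    std::size_t row = 0;
    for (auto const& page : columns[c]) {
      bool const may_match = page.min_value < thresholds[c];
      for (int r = 0; r < page.num_rows; ++r, ++row) {
        assert(row < result.row_mask.size());
        if (!may_match) { result.row_mask[row] = false; }
      }
    }
  }

  // Step 2: a page must be decoded only if it covers at least one surviving row.
  for (auto const& column : columns) {
    std::vector<bool> mask;
    std::size_t row = 0;
    for (auto const& page : column) {
      bool needed = false;
      for (int r = 0; r < page.num_rows; ++r, ++row) {
        if (result.row_mask[row]) { needed = true; }
      }
      mask.push_back(needed);
    }
    result.page_mask.push_back(mask);
  }
  return result;
}
```

Because page boundaries need not line up across columns, a page in one column can be skipped purely because a different column ruled out all of its rows.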
Materialize filter columns: Once row groups and filter column data pages have been pruned, the next step is to materialize the filter columns into a table (first reader pass) using the materialize_filter_columns() function. This function requires a vector of device buffers containing column chunk data for the current list of row groups, and the data page and row masks obtained from the page pruning step. The function returns a table of materialized filter columns and also updates the row mask column so that only the rows satisfying the filter expression remain set. If no row group pruning is needed, pass a span of all row group indices from the all_row_groups() function as the current list of row groups. Similarly, if no page pruning is desired, pass an empty span as the data page mask and, as the row mask, a mutable view of a BOOL8 column containing all true values whose size equals the total number of rows in the current row groups list (computed by total_rows_in_row_groups()). The byte ranges for the required column chunk data may be obtained using the filter_column_chunks_byte_ranges() function and read into a corresponding vector of device buffers.
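The row mask update performed by the first reader pass can be modelled on the host as follows. This is a sketch only: `make_all_true_row_mask` and `materialize_filter_column` are hypothetical helpers, a single already-decoded integer column stands in for the filter columns, and the filter is assumed to be `value < threshold`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the all-true row mask used when no page pruning was performed.
std::vector<bool> make_all_true_row_mask(std::size_t total_rows)
{
  return std::vector<bool>(total_rows, true);
}

// Models the first reader pass for one decoded filter column: returns the
// surviving values and clears the mask entry of every row that is not in
// the output.
std::vector<std::int64_t> materialize_filter_column(std::vector<std::int64_t> const& decoded,
                                                    std::vector<bool>& row_mask,
                                                    std::int64_t threshold)
{
  assert(decoded.size() == row_mask.size());
  std::vector<std::int64_t> output;
  for (std::size_t i = 0; i < decoded.size(); ++i) {
    bool const keep = row_mask[i] && decoded[i] < threshold;
    row_mask[i]     = keep;
    if (keep) { output.push_back(decoded[i]); }
  }
  return output;
}
```

After the call, the number of set entries in the row mask equals the number of rows in the output, which is the property the second pass relies on.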
Materialize payload columns: Once the filter columns are materialized, the final step is to materialize the payload columns into another table (second reader pass) using the materialize_payload_columns() function. This function requires a vector of device buffers containing column chunk data for the current list of row groups, and the updated row mask from materialize_filter_columns(). The function uses the row mask to internally prune payload column data pages and to mask the materialized payload columns down to the desired rows. If no pruning is desired, the row mask may be a BOOL8 column of all true values whose size equals the total number of rows in the current row groups list. As in the first reader pass, the byte ranges for the required column chunk data may be obtained using the payload_column_chunks_byte_ranges() function and read into a corresponding vector of device buffers.
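The two uses of the row mask in the second pass, skipping payload pages and masking the output rows, can be sketched with a host-only model. The names `payload_pages_to_decode` and `apply_row_mask` are hypothetical, and a string column with known per-page row counts is assumed purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Flags the payload pages that cover at least one surviving row.
// `page_row_counts` lists the row count of each page in row order.
std::vector<bool> payload_pages_to_decode(std::vector<std::size_t> const& page_row_counts,
                                          std::vector<bool> const& row_mask)
{
  std::vector<bool> decode;
  std::size_t row = 0;
  for (std::size_t count : page_row_counts) {
    assert(row + count <= row_mask.size());
    bool needed = false;
    for (std::size_t r = 0; r < count; ++r) {
      if (row_mask[row + r]) { needed = true; }
    }
    decode.push_back(needed);
    row += count;
  }
  return decode;
}

// Keeps only the payload values whose row is set in the row mask.
std::vector<std::string> apply_row_mask(std::vector<std::string> const& values,
                                        std::vector<bool> const& row_mask)
{
  assert(values.size() == row_mask.size());
  std::vector<std::string> output;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (row_mask[i]) { output.push_back(values[i]); }
  }
  return output;
}
```

The more selective the filter, the more payload pages have no surviving rows at all, which is where the two-pass design is expected to save the most work.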
Once both reader passes are complete, the filter and payload column tables may be trivially combined by releasing the columns from both tables and moving them into a new cudf table.
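Combining the two results amounts to moving owning column pointers from both tables into one container. The sketch below models this with standard-library types only; `Column`, `Columns` and `combine_columns` are hypothetical stand-ins, and in libcudf the column lists would instead be the `std::vector<std::unique_ptr<cudf::column>>` obtained from `cudf::table::release()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an owning column.
struct Column {
  std::string name;
  std::vector<std::int64_t> values;
};

// Hypothetical stand-in for the column list released from a table.
using Columns = std::vector<std::unique_ptr<Column>>;

// Moves the columns of both reader passes into one list without copying data.
Columns combine_columns(Columns filter_columns, Columns payload_columns)
{
  Columns combined;
  combined.reserve(filter_columns.size() + payload_columns.size());
  for (auto& column : filter_columns) {
    assert(column != nullptr);
    combined.push_back(std::move(column));
  }
  for (auto& column : payload_columns) {
    assert(column != nullptr);
    combined.push_back(std::move(column));
  }
  return combined;
}
```

Filter columns come first in this model; any other column order would need an explicit reordering step after the move.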
cudf::io::read_parquet() function.
Definition at line 266 of file hybrid_scan.hpp.
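explicit cudf::io::parquet::experimental::hybrid_scan_reader::hybrid_scan_reader(cudf::host_span<uint8_t const> footer_bytes, parquet_reader_options const& options)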
Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters.
footer_bytes | Host span of parquet file footer bytes |
options | Parquet reader options |
std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::all_row_groups(parquet_reader_options const& options) const
Get all available row groups from the Parquet file.
options | Parquet reader options |
std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_reader::filter_column_chunks_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options) const
Get byte ranges of column chunks of filter columns.
row_group_indices | Input row group indices |
options | Parquet reader options |
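Turning a list of byte ranges into one buffer per range can be sketched as below. This is a host-only model under stated assumptions: `ByteRange` and `fetch_byte_ranges` are hypothetical names, the whole file is assumed to be resident in a host byte vector, and the buffers are host vectors, whereas the reader itself expects `rmm::device_buffer` objects holding the same bytes in device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range: offset and size within the file.
struct ByteRange {
  std::size_t offset;
  std::size_t size;
};

// Copies each requested range out of `file_bytes` into its own buffer,
// preserving the order of `ranges`.
std::vector<std::vector<std::uint8_t>> fetch_byte_ranges(
  std::vector<std::uint8_t> const& file_bytes, std::vector<ByteRange> const& ranges)
{
  std::vector<std::vector<std::uint8_t>> buffers;
  buffers.reserve(ranges.size());
  for (auto const& range : ranges) {
    assert(range.offset <= file_bytes.size());
    assert(range.size <= file_bytes.size() - range.offset);
    auto const first = file_bytes.begin() + static_cast<std::ptrdiff_t>(range.offset);
    buffers.emplace_back(first, first + static_cast<std::ptrdiff_t>(range.size));
  }
  return buffers;
}
```

Keeping the buffers in the same order as the returned byte ranges matters, because the buffers are matched to column chunks by position.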
std::pair<std::unique_ptr<cudf::column>, std::vector<thrust::host_vector<bool>>> cudf::io::parquet::experimental::hybrid_scan_reader::filter_data_pages_with_stats(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
Filter data pages of filter columns using page statistics from page index metadata.
row_group_indices | Input row group indices |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_bloom_filters(cudf::host_span<rmm::device_buffer> bloom_filter_data, cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options, rmm::cuda_stream_view stream) const
Filter the row groups using column chunk bloom filters.
The bloom_filter_data device buffers must be allocated using a 32-byte aligned memory resource.
bloom_filter_data | Device buffers containing bloom filter data of column chunks with an equality predicate |
row_group_indices | Input row group indices |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_dictionary_pages(cudf::host_span<rmm::device_buffer> dictionary_page_data, cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options, rmm::cuda_stream_view stream) const
Filter the row groups using column chunk dictionary pages.
dictionary_page_data | Device buffers containing dictionary page data of column chunks with an (in)equality predicate |
row_group_indices | Input row group indices |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
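The reasoning behind dictionary-based pruning for equality and inequality predicates can be modelled on the host. This is a sketch, not the library's implementation: `prune_row_groups_equal` and `prune_row_groups_not_equal` are hypothetical names, each dictionary is assumed to be already decoded into a vector of integers, and every column chunk is assumed to be fully dictionary encoded so that its dictionary lists all distinct values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Dictionary = std::vector<std::int64_t>;

// Filter `column == literal`: keep row groups whose dictionary contains the literal.
// `dictionaries[i]` holds the distinct values of the filter column in row group `i`.
std::vector<int> prune_row_groups_equal(std::vector<int> const& row_group_indices,
                                        std::vector<Dictionary> const& dictionaries,
                                        std::int64_t literal)
{
  std::vector<int> surviving;
  for (int rg : row_group_indices) {
    assert(rg >= 0 && static_cast<std::size_t>(rg) < dictionaries.size());
    auto const& dict = dictionaries[static_cast<std::size_t>(rg)];
    if (std::find(dict.begin(), dict.end(), literal) != dict.end()) { surviving.push_back(rg); }
  }
  return surviving;
}

// Filter `column != literal`: keep row groups with at least one value other than the literal.
std::vector<int> prune_row_groups_not_equal(std::vector<int> const& row_group_indices,
                                            std::vector<Dictionary> const& dictionaries,
                                            std::int64_t literal)
{
  std::vector<int> surviving;
  for (int rg : row_group_indices) {
    assert(rg >= 0 && static_cast<std::size_t>(rg) < dictionaries.size());
    auto const& dict = dictionaries[static_cast<std::size_t>(rg)];
    bool const has_other =
      std::any_of(dict.begin(), dict.end(), [literal](std::int64_t v) { return v != literal; });
    if (has_other) { surviving.push_back(rg); }
  }
  return surviving;
}
```

Unlike min/max statistics, a dictionary lookup is exact for equality on a fully dictionary-encoded chunk: if the literal is absent from the dictionary, no row in that row group can match.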
std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_stats(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options, rmm::cuda_stream_view stream) const
Filter the input row groups using column chunk statistics.
row_group_indices | Input row group indices |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_filter_columns(cudf::host_span<thrust::host_vector<bool> const> page_mask, cudf::host_span<size_type const> row_group_indices, std::vector<rmm::device_buffer> column_chunk_buffers, cudf::mutable_column_view row_mask, parquet_reader_options const& options, rmm::cuda_stream_view stream) const
Materializes filter columns and updates the input row mask to only the rows that exist in the output table.
page_mask | Boolean vectors indicating which data pages are not pruned, one per filter column. All data pages are considered not pruned if empty |
row_group_indices | Input row group indices |
column_chunk_buffers | Device buffers containing column chunk data of filter columns |
[in,out] row_mask | Mutable boolean column indicating surviving rows from page pruning |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_payload_columns(cudf::host_span<size_type const> row_group_indices, std::vector<rmm::device_buffer> column_chunk_buffers, cudf::column_view row_mask, parquet_reader_options const& options, rmm::cuda_stream_view stream) const
Materializes payload columns and applies the row mask to the output table.
row_group_indices | Input row group indices |
column_chunk_buffers | Device buffers containing column chunk data of payload columns |
row_mask | Boolean column indicating which rows need to be read. All rows read if empty |
options | Parquet reader options |
stream | CUDA stream used for device memory operations and kernel launches |
byte_range_info cudf::io::parquet::experimental::hybrid_scan_reader::page_index_byte_range() const
Get the byte range of the page index in the Parquet file.
FileMetaData cudf::io::parquet::experimental::hybrid_scan_reader::parquet_metadata() const
Get the Parquet file footer metadata.
Returns the materialized Parquet file footer metadata struct. The footer will contain the materialized page index if called after setup_page_index().
std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_reader::payload_column_chunks_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options) const
Get byte ranges of column chunks of payload columns.
row_group_indices | Input row group indices |
options | Parquet reader options |
std::pair<std::vector<byte_range_info>, std::vector<byte_range_info>> cudf::io::parquet::experimental::hybrid_scan_reader::secondary_filters_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const& options) const
Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.
row_group_indices | Input row group indices |
options | Parquet reader options |
void cudf::io::parquet::experimental::hybrid_scan_reader::setup_page_index(cudf::host_span<uint8_t const> page_index_bytes) const
Set up the page index within the Parquet file metadata (FileMetaData).
Materialize the ColumnIndex and OffsetIndex structs (collectively called the page index) within the Parquet file metadata struct (returned by parquet_metadata()). The statistics contained in the page index can be used to prune data pages before decoding.
page_index_bytes | Host span of Parquet page index buffer bytes |
size_type cudf::io::parquet::experimental::hybrid_scan_reader::total_rows_in_row_groups(cudf::host_span<size_type const> row_group_indices) const
Get the total number of top-level rows in the row groups.
row_group_indices | Input row group indices |
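The returned total is the size an all-true row mask must have for the selected row groups. A minimal host-only sketch of that computation follows; `total_rows_in` is a hypothetical helper and `rows_per_row_group` is an assumed stand-in for the per-row-group row counts stored in the file footer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums the row counts of the selected row groups.
// `rows_per_row_group[i]` is the number of top-level rows in row group `i`.
std::int64_t total_rows_in(std::vector<int> const& row_group_indices,
                           std::vector<std::int64_t> const& rows_per_row_group)
{
  std::int64_t total = 0;
  for (int rg : row_group_indices) {
    assert(rg >= 0 && static_cast<std::size_t>(rg) < rows_per_row_group.size());
    total += rows_per_row_group[static_cast<std::size_t>(rg)];
  }
  return total;
}
```

Only the selected row groups contribute, so the total shrinks as row group pruning removes entries from the list.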