
Table Creation Options

By default, ORC file properties are tuned to optimize HDFS read performance and storage usage. This doesn't always translate into optimized query response times.

An ORC file consists of one or more "stripes". These stripes contain rows that are grouped together and can be read independently of one another.

NEED TO VERIFY: ORC files are splittable at the stripe boundary. This means that a large ORC file can be read in parallel across several containers. If your goal is optimized query response times, the default stripe size of 256MB may be too large. You can adjust it through the "tblproperties" clause in the DDL.

create table mytest (col1 String) -- column list is illustrative
stored as orc
tblproperties ("orc.stripe.size"="67108864"); -- 64MB stripes

By default, each stripe contains a row index entry every 10,000 rows. The index is a data structure that includes the min and max values for each column and the row positions within each column (a bit field or bloom filter can also be included). The row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block.

When your data is efficiently organized, these index records can significantly improve query times by enabling Hive to skip large chunks of data that don't match your predicates, through a process called "predicate push down". This optimization is enabled at runtime by setting "hive.optimize.ppd" to "true".
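As a minimal sketch, predicate push down is enabled per session and then applies automatically to filtered queries. The table and predicate here are illustrative, reusing the weather tables defined later in this document; the station id value is hypothetical:

```sql
-- Enable predicate push down for this session.
SET hive.optimize.ppd=true;

-- With the row index in place, row groups whose min/max range for
-- station_id doesn't cover the predicate value can be skipped entirely.
SELECT hour, temp
FROM test
WHERE station_id = 'KPHX';  -- 'KPHX' is a hypothetical station id
```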

Known issue (affects Hive 0.13): Predicate Push Down / ORC / Column Type Issue

The default index stride of 10,000 rows can be adjusted to help increase performance. If the predicate values you're looking for span larger areas, increasing this value could have a positive effect.

tblproperties ("orc.row.index.stride"="50000"); -- 50K index stride

To maximize the capabilities of predicate push down and the row index, you should apply some type of secondary sort to the data while inserting it. Where the "primary" filter is usually the partition, sorting your data within the partition will increase the effectiveness of the row index structure, ensuring records are clustered so that large sections of data can be skipped.

ORC File Compression

Use SNAPPY for time-based (query) performance; use ZLIB for resource performance (drive space).

Populating ORC Tables

ORC tables are usually populated via an INSERT INTO statement, drawing from another source table/file in HDFS.  Another approach is through the new Hive Streaming API, but that is beyond the scope of this document.

Prior to Hive 0.13, you generally needed to include a DISTRIBUTE BY/SORT BY clause in order to entice the query planner to execute a reduce-side query plan. With Hive 0.13, you don't need to do this IF you're inserting into a partitioned table. The Hive 0.13 query planner understands this and will automatically distribute based on the partition field. Including these clauses will actually turn off some optimizations that allow for the efficient flushing of an ORC file to disk, an issue that, prior to Hive 0.13, could cause OOM errors on large inserts.
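As a sketch of the Hive 0.13+ behavior described above (using the test/test_source tables defined below; the SET flags are the standard Hive dynamic-partition switches):

```sql
-- Hive 0.13+: no DISTRIBUTE BY/SORT BY required; the planner
-- automatically distributes by the dynamic partition fields.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE test PARTITION (observation_year, observation_month)
SELECT station_id, hour, temp, precip,
       year(observation_dt),
       month(observation_dt)
FROM test_source;
```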

Open Question to Engineering

If DISTRIBUTE BY / SORT BY turn off certain optimizations that allow the FileSink to close ORC files early and flush memory faster, how can we control the secondary sorting of data within an ORC file during an insert? Without that, PPD isn't as efficient as it could be.

Table Definitions
create external table test_source (
observation_dt String,
station_id String,
hour Int,
temp Float,
precip Float)
LOCATION '/inbound_ingestion_datasets/wx';

create table test (
station_id String,
hour Int,
temp Float,
precip Float)
PARTITIONED BY (observation_year String, observation_month String)
STORED AS ORC
.... -- See below for tblproperties scenarios
Time Based Optimizations
tblproperties ("orc.compress"="SNAPPY", "orc.stripe.size"="67108864","orc.row.index.stride"="50000") 
Analytics Based Performance
tblproperties ("orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000") 
-- stripe size and index stride values are defaults and can be omitted. 
-- included here for clarity and comparison.

Build the partitioned table while providing a secondary sort on 'station_id', 'hour'. This secondary sorting will allow the row index to skip large sections of data for queries that filter on those columns.

Loading Option #1 - Pre Hive 0.13
	INSERT OVERWRITE TABLE test PARTITION (observation_year, observation_month)
	SELECT station_id, hour, temp, precip,
		year(observation_dt) as obs_year,
		month(observation_dt) as obs_month
	FROM test_source
	DISTRIBUTE BY obs_year, obs_month SORT BY station_id, hour;