By default, ORC file properties are configured to optimize HDFS read performance and usage. This doesn't always translate into optimized query response times.
An ORC file consists of one or more "stripes". These stripes contain rows that are grouped together and can be read independently of one another.
ORC files are splittable at the stripe level. This means that a large ORC file can be read in parallel across several containers. If your goal is optimized query response times, the default stripe size of 256MB may be too large. You can adjust it through the "tblproperties" clause in the DDL.
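For example, a smaller stripe size can be set at table creation time (the table name, columns, and the 64MB value below are illustrative, not recommendations):

```sql
CREATE TABLE test_small_stripe (
  station_id string,
  hour string
)
STORED AS ORC
tblproperties ("orc.stripe.size"="67108864"); -- 64MB stripes instead of the 256MB default
```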
Each stripe contains a row index entry every 10,000 rows, by default. The index is a data structure that includes the minimum and maximum values for each column, along with the row positions within each column (a bit field or bloom filter may also be included). The row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block.
When your data is efficiently organized, these index records can significantly improve query times by enabling the engine to skip large chunks of data that don't match your conditions, through a process called "predicate pushdown". This optimization is enabled at runtime by setting "hive.optimize.ppd" to "true".
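Predicate pushdown is controlled by session-level settings; "hive.optimize.ppd" is typically on by default, but it can be set explicitly:

```sql
SET hive.optimize.ppd=true;          -- enable predicate pushdown
SET hive.optimize.ppd.storage=true;  -- push predicates down to the storage layer as well
```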
The default index stride of 10,000 can be adjusted to help increase performance. If the predicate values you're looking for span larger areas, increasing this value could have a positive effect.
tblproperties ("orc.row.index.stride"="50000"); -- 50K index stride (TBLPROPERTIES values must be quoted strings)
To maximize the benefit of predicate pushdown and the row index, you should apply some type of secondary sorting to the data while inserting it. The "primary" filter is usually the partition; sorting your data within the partition will increase the effectiveness of the row index structure by clustering records so that large sections of data can be skipped.
ORC File Compression
SNAPPY for time-based performance (faster reads and writes), ZLIB for resource performance (less drive space).
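Compression is selected per table through the "orc.compress" table property (the table name and columns below are illustrative):

```sql
CREATE TABLE test_snappy (
  station_id string,
  temp string
)
STORED AS ORC
tblproperties ("orc.compress"="SNAPPY"); -- or "ZLIB" to favor drive space
```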
Populating ORC Tables
ORC tables are usually populated via an INSERT INTO statement, drawing from another source table/file in HDFS. Another approach is through the new Hive Streaming API, but that is beyond the scope of this document.
Prior to Hive 0.13, you generally needed to include a DISTRIBUTE BY/SORT BY clause to entice the query planner to execute a reduce-side query plan. With Hive 0.13, you don't need to do this, IF you're inserting into a partitioned table. The Hive 0.13 query planner understands this and will automatically distribute based on the partition field. Including these clauses will actually turn off some optimizations that allow for the efficient flushing of an ORC file to disk; prior to Hive 0.13, this issue would cause OOM errors on large inserts.
Open Question to Engineering
If DISTRIBUTE BY / SORT BY turn off certain optimizations that allow the FileSink to close ORC files early and flush memory faster, how can we control the secondary sorting of data within an ORC file during insert? Without that, PPD isn't as efficient as it could be.
create external table test_source (
  station_id string,   -- column list inferred from the INSERT example below
  hour string,
  temp string,
  precip string,
  observation_dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
create table test (
  station_id string,   -- column list inferred from the INSERT example below
  hour string,
  temp string,
  precip string
)
PARTITIONED BY (observation_year String, observation_month String)
STORED AS ORC
.... -- See below for scenarios
tblproperties ("orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000");
-- stripe size and index stride values are defaults and can be omitted;
-- included here for clarity and comparison.
Building a partitioned table, while providing a secondary sort on 'station_id','hour'. This secondary sorting will allow the row index to skip large sections of data when queries filter on those columns.
Loading Option #1 - Pre Hive 0.13
INSERT INTO TABLE test PARTITION (observation_year, observation_month)
SELECT
  station_id, hour, temp, precip,
  year(observation_dt) as obs_year,
  month(observation_dt) as obs_month
FROM test_source
DISTRIBUTE BY obs_year, obs_month
SORT BY station_id, hour;
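Loading Option #2 - Hive 0.13+
For comparison, a Hive 0.13+ load into the partitioned table can omit the DISTRIBUTE BY/SORT BY clauses, since the planner distributes on the partition fields automatically (a sketch; the open question above about controlling secondary sorting still applies):

```sql
INSERT INTO TABLE test PARTITION (observation_year, observation_month)
SELECT
  station_id, hour, temp, precip,
  year(observation_dt) as obs_year,
  month(observation_dt) as obs_month
FROM test_source;
```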