Hive Performance Tips

Working with Hive can be a frustrating experience if your used to traditional RDBM systems.  Due to the separation of concerns inherent to Hive and HDFS/MR, traditional statistics used by RDBM systems to assist with the planning and query optimization aren't available (currently1), so a lot of the magic is left up to the user in the way they organize their data and form the queries.

One of the most common hurdles I've seen in my travels is the basic lack of understanding about the data contents.  Making the transition to Hadoop, sometimes requires some "unlearning".  Making adjustments to techniques and processes that were popular in DW, don't always apply.

Configuration Parameters:

Various Hive Parameters:

Hive Indexes:

Hive Windowing and Analytics Functions:

Registering Permanent UDF (Functions) in Hive

Analytical Queries in Hive:

Hive Server2 Clients:

The table below is a listing of various Hive settings that can be used throughout this journey into Hive Optimization


Skew Merge Bucket Joins
set hive.optimize.skewjoin = true; 
set hive.skewjoin.key = skew_key_threshold

Join bottlenecked on the reducer who gets the
skewed key

Hadoop Summit Preso

Sort Merge Bucket Map Join
set hive.optimize.bucketmapjoin = true; 
set hive.optimize.bucketmapjoin.sortedmerge = true;

No limit on file/partition/table size.

1. Work together with bucket map join
2. Bucket columns == Join columns == sort columns

Hadoop Summit Preso 

Bucket Map Join
set hive.optimize.bucketmapjoin = true;

Total table/partition size is big, not good for mapjoin.

1. Work together with map join
2. All join tables are bucketized, and each small table’s bucket number can be divided by
big table’s bucket number.
3. Bucket columns == Join columns

Prevent MapJoins for Large Tables

Hive Join Optimization Doc

Controlling the CombinedHiveInputFormat Size

set mapred.max.split.size=268435456;
Hadoop v1.x  
set mapred.min.split.size=
set mapreduce.input.fileinputformat.split.maxsize=
Hadoop v2.x  
set mapreduce.input.fileinputformat.split.minsize=
0.10Use with above split size settings to help build larger processing blocks. 
Dynamic Partition Creation
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=500;
set = true;
set = 10000;

Hive Join Optimization Doc 

Control the Output Compression
set hive.exec.compress.output=true
Control the Interme
set hive.exec.compress.intermediate=true
set io.sort.mb=400 
 The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks. 
set hive.limit.pushdown.memory.usage=0.1f
That is used in ORDER BY LIMIT queries for pushing down the LIMIT clause.
select * from table order by key1 limit 10;
would use 10% of memory to store a Top-K, which would mean that the impl will keep an in-memory ordered list of 10 rows & replace/discard rows which fall outside the top-10.
This means Top-K is worst case of O(log(K)*n), while the unoptimized sorter+LIMIT is O(log(n)*n), which is a few magnitudes of performance gains when K is something like 100 and n is ~1+ million.

 set hive.optimize.correlation=true;0.12 Correlation Optimizer
set hive.vectorized.execution.enabled=true;
0.13 (Requires ORC Formatted Datasets)Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.

Vectorization Design

Vectorization Execution

set hive.mapjoin.lazy.hashtable=false;
The issue is with a single reducer vertex running both Shuffle and Map Joins in the same vertex.
As a step forward, you should try disabling the lazy hashtable implementation 
Hive Security[true|false]
  • false (when

When doAs is false, HS2 does NOT impersonate you on the cluster when running jobs, or accessing files on HDFS. Files created in Hive will be owned by Hive and not the user that ran the SQL. That's because the process is run as the "hive" user on the cluster. Hence, all files/directories created as owned by "hive".

This can cause considerable confusion for those that use both the "Hive CLI" and "HS2" to access the cluster. SQL issued via "hive cli" is run as the $USER. The underlying "authorization manager" is most likely "Storage Based". If the user is NOT a member of the "hdfs" group, they may not be able to access the underlying files.

This happens in reverse as well. If a table is created with the "hive cli", it's contents are owned by the user. Accessing the table via HS2 when in "mr" execution mode, will return permissions errors. When running "tez", the Tez Task will simply fail with no indication that it was permissions related (as of Hive 0.13.1)

Configuration Property Listing

Running Hive on Tez

set hive.optimize.ppd=[true,false]
set hive.execution.engine=[tez,mr]
set hive.optimize.index.filter=true;
 Setting this or the next value to false, will allow you to avoid HIVE-7147, but will cost in performance. 
 See Above. 
0.13Establish the Queue, other than the "default" to run the Tez containers in. 
0.13The tez setting for multi-tenancy,
which means that a container idling for more than 1 second will be killed.
This is ideal for re-use within a query, but will free up resources between queries. This has been fairly good for multi-tenancy and keeps reuse working within long-running queries (
But for a single query perf run, you can set this to 2 minutes for most queries to reuse containers from a previous query in the same session.
 0.13 - Note: This feature doesn't work too well yet


 <description>Enables container prewarm for tez (hadoop 2 only)</description>




With the addition of TEZ-766, Tez supports pre-warmed containers within the Tez session.

Allow hive users to enable and use this feature from within the hive shell.

Hive Stats
The items below will require the stats for each table.  Set the stats and 'analyze' the table/partitions as directed.
set hive.stats.autogather=true;
set hive.stats.dbclass=fs;
analyze table store_sales partition(ss_sold_date) compute statistics partialscan;

set hive.stats.autographer = [true, **false**]


While data is inserted 
set hive.compute.query.using.stats=true;

This optimizes "select count(1) from foo;" to run in ~1 second.

 set hive.fetch.task.conversion=more;0.13This optimizes "select x from foo limit 10;" to run <1 second. 
set hive.optimize.index.filter=true;
0.13This optimizes "select x from foo where y = 10;" on ORC tables. 
Hive CBO    
 hive.compute.query.using.stats = [true, **false**];0.13  
 hive.stats.fetch.column.stats = [true, **false**];0.13  
 hive.stats.fetch.partition.stats = [true, **false**];0.13  
 hive.cbo.enable = [true, **false**];0.13  
Hive Tuning    
hive.optimize.sort.dynamic.partition = [ **true**, false ]
Hive Server 2 0.13  
This setting determines whether Hive queries will be executed using Tez or MapReduce.If this value is set to "mr", Hive queries will be executed using MapReduce. If this value is set to "tez", Hive queries will be executed using Tez. All queries executed through HiveServer2 will use the specifiedhive.execution.engine setting. 
The memory (in MB) to be used for Tez tasks. If this is not specified (-1), the memory settings from the MapReduce configurations ( be used by default for map tasks.-1(not specified) If this is not specified, the memory settings from the MapReduce configurations ( be used by default.
Java command line options for Tez. If this is not specified, the MapReduce java opts settings ( will be used by default for map tasks.If this is not specified, the MapReduce java opts settings ( will be used by default. 
A comma-separated list of queues configured for the cluster.The default value is an empty string, which prevents execution of all queries. To enable query execution with Tez for HiveServer2, this parameter must configured. 
The number of sessions for each queue named in the hive.server2.tez.default.queues.1 Larger clusters may improve performance of HiveServer2 by increasing this number. 
Enables a user to use HiveServer2 without enabling Tez for HiveServer2. Users may potentially may want to run queries with Tez without a pool of sessions.false 
Required when the queue-related configurations above are used.false 
For each AM created during prewarm-up, create this many containers for each.  
ORC File Tuning    

Maximum fraction of heap that can be used by ORC file writers.

Can effect how "stripes" are written and effect the stripe size.

ORC File Versioning    
set hive.exec.orc.write.format="0.11"
 Set the specific ORC file version to write. 

Other Helpful Tips:

set -vShow all settings.