The table below is a listing of various Hive settings that can be used throughout this journey into Hive Optimization
|Skew Merge Bucket Joins|
set hive.optimize.skewjoin = true;
Join bottlenecked on the reducer who gets the
|Sort Merge Bucket Map Join|
set hive.optimize.bucketmapjoin = true;
|Bucket Map Join|
set hive.optimize.bucketmapjoin = true;
1. Work together with map join
|Prevent MapJoins for Large Tables|
|Controlling the CombinedHiveInputFormat Size|
|0.10||Use with above split size settings to help build larger processing blocks.|
|Dynamic Partition Creation|
set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 10000;
|Control the Output Compression|
|Control the Interme|
|The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.|
That is used in ORDER BY LIMIT queries for pushing down the LIMIT clause.
select * from table order by key1 limit 10;
would use 10% of memory to store a Top-K, which would mean that the impl will keep an in-memory ordered list of 10 rows & replace/discard rows which fall outside the top-10.
This means Top-K is worst case of O(log(K)*n), while the unoptimized sorter+LIMIT is O(log(n)*n), which is a few magnitudes of performance gains when K is something like 100 and n is ~1+ million.
|0.13 (Requires ORC Formatted Datasets)||Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.|
The issue is with a single reducer vertex running both Shuffle and Map Joins in the same vertex.
As a step forward, you should try disabling the lazy hashtable implementation
When doAs is false, HS2 does NOT impersonate you on the cluster when running jobs, or accessing files on HDFS. Files created in Hive will be owned by Hive and not the user that ran the SQL. That's because the process is run as the "hive" user on the cluster. Hence, all files/directories created as owned by "hive".
This can cause considerable confusion for those that use both the "Hive CLI" and "HS2" to access the cluster. SQL issued via "hive cli" is run as the $USER. The underlying "authorization manager" is most likely "Storage Based". If the user is NOT a member of the "hdfs" group, they may not be able to access the underlying files.
This happens in reverse as well. If a table is created with the "hive cli", it's contents are owned by the user. Accessing the table via HS2 when in "mr" execution mode, will return permissions errors. When running "tez", the Tez Task will simply fail with no indication that it was permissions related (as of Hive 0.13.1)
|Setting this or the next value to false, will allow you to avoid HIVE-7147, but will cost in performance.|
|0.13||Establish the Queue, other than the "default" to run the Tez containers in.|
|0.13||The tez setting for multi-tenancy,|
which means that a container idling for more than 1 second will be killed.
This is ideal for re-use within a query, but will free up resources between queries. This has been fairly good for multi-tenancy and keeps reuse working within long-running queries (http://people.apache.org/~gopalv/query10.svg).
But for a single query perf run, you can set this to 2 minutes for most queries to reuse containers from a previous query in the same session.
|0.13 - Note: This feature doesn't work too well yet|
The items below will require the stats for each table. Set the stats and 'analyze' the table/partitions as directed.
analyze table store_sales partition(ss_sold_date) compute statistics partialscan;
|While data is inserted|
This optimizes "select count(1) from foo;" to run in ~1 second.
|0.13||This optimizes "select x from foo limit 10;" to run <1 second.|
|0.13||This optimizes "select x from foo where y = 10;" on ORC tables.|
hive.optimize.sort.dynamic.partition = [ **true**, false ]
|Hive Server 2||0.13|
|This setting determines whether Hive queries will be executed using Tez or MapReduce.||If this value is set to "mr", Hive queries will be executed using MapReduce. If this value is set to "tez", Hive queries will be executed using Tez. All queries executed through HiveServer2 will use the specified|
|The memory (in MB) to be used for Tez tasks. If this is not specified (-1), the memory settings from the MapReduce configurations (|
|Java command line options for Tez. If this is not specified, the MapReduce java opts settings (||If this is not specified, the MapReduce java opts settings (|
|A comma-separated list of queues configured for the cluster.||The default value is an empty string, which prevents execution of all queries. To enable query execution with Tez for HiveServer2, this parameter must configured.|
|The number of sessions for each queue named in the |
|Enables a user to use HiveServer2 without enabling Tez for HiveServer2. Users may potentially may want to run queries with Tez without a pool of sessions.|
|Required when the queue-related configurations above are used.|
|For each AM created during prewarm-up, create this many containers for each.|
|ORC File Tuning|
Maximum fraction of heap that can be used by ORC file writers.
Can effect how "stripes" are written and effect the stripe size.
|ORC File Versioning|
|Set the specific ORC file version to write.|