Managing Immutable Data in Hive with Windowing Rank Function (Update)

An blog I wrote last year needs some updating.  See Managing Mutable Data in an Immutable Big Data World.

I'm often asked how to "update" records in Hive.  We'll you can't, at least not yet and not in the sense you maybe accustom in an RDBMS.

Store your record updates in a transactional table that includes some identifier that represents the most resent update.  For example:

Table Schema
create external table fitness (
	firstname string,
	updated date,
	weight double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '...';
Data
fitness.firstname  fitness.updated  fitness.weight  
-----------------  ---------------  --------------  
david              2015-01-12       210             
david              2015-04-03       205             
david              2015-06-02       200             
david              2015-08-23       195   

 

Hive now support "Windowing" functions which make it really simply to extract the "most recent" record from this table.  Use those results to build your new "entity" table.

Rank to Find Most Current
SELECT
  sub.firstname firstname,
  sub.updated updated,
  sub.weight weight
FROM
  (
    SELECT
      firstname,
      updated,
      weight,
      rank() over(partition BY firstname ORDER BY updated desc) rank
    FROM
      fitness
  ) sub
WHERE
  sub.rank=1;
Results
firstname  updated     weight  
---------  ----------  ------  
david      2015-08-23  195