Tuesday, May 30, 2017

Skewed tables in Hive

If you know a column is going to have heavy skew, you can specify this in the table's schema, for example:
CREATE TABLE Customers (
  id int,
  username string,
  zip int
)
SKEWED BY (zip) ON (57701, 57702)
STORED AS DIRECTORIES;


When you list the heavily skewed values, Hive automatically splits the matching records out into separate directories (that is what STORED AS DIRECTORIES asks for) and
takes this into account during queries so that it can skip whole directories where possible.
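
As a rough sketch of the kind of query that can benefit, the filter below matches one of the declared skewed values, so Hive has the chance to read only the data stored for that value. The SET lines are assumptions about the session: the property names are real Hive/Hadoop settings, but whether you need to set them, and their defaults, depend on your Hive version.

```sql
-- Assumed session settings for tables stored as directories;
-- check the defaults for your Hive version before relying on them.
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
SET hive.optimize.listbucketing=true;

-- The predicate is an equality on a declared skewed value (57701),
-- which is the case where unrelated data can be skipped.
SELECT id, username
FROM Customers
WHERE zip = 57701;
```

A filter on a zip that was not listed in the SKEWED BY clause would not get the same benefit, since all remaining values are stored together.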

In the Customers table above, records with a zip of 57701 or 57702 are stored in their own subdirectories, separate from all other records,
because the assumption is that there will be a large number of customers in those two ZIP codes.
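
To see the split happen, you could load the table from another table. This is only a sketch: customers_staging is a hypothetical source table with the same three columns, and the directory names in the comments are what list bucketing typically produces, not something taken from a real run.

```sql
-- customers_staging is a hypothetical table used only for illustration.
INSERT OVERWRITE TABLE Customers
SELECT id, username, zip
FROM customers_staging;

-- Under the table's location you would then expect a layout along these lines
-- (exact names may differ by Hive version):
--   zip=57701/                             rows with zip 57701
--   zip=57702/                             rows with zip 57702
--   HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/  all other rows
```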

To examine the table's skew properties, look at the detailed table information returned by the DESCRIBE FORMATTED Customers command.
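
For reference, the command is shown below, with comments on the entries to look for. The labels in the comments are what recent Hive versions usually print in the storage information section; treat the exact wording as an assumption and compare against your own output.

```sql
DESCRIBE FORMATTED Customers;

-- Entries related to skew (labels may vary slightly by version):
--   Stored As SubDirectories:  Yes
--   Skewed Columns:            [zip]
--   Skewed Values:             [[57701], [57702]]
```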