Thursday, July 11, 2019

Parquet issues - Hive vs Spark

Reading and writing a Parquet file works the same way in Hive and Spark up to Spark 1.4.

After that, Spark adopted the newer Parquet format, so if you write a Parquet file with a recent Spark version and build an external Hive table on top of that file, you run into errors like the one below.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
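To illustrate, here is a minimal sketch of the scenario in spark-shell style. The path, table name and schema are hypothetical; they only show how the format mismatch surfaces, typically with decimal columns.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-hive-demo").getOrCreate()
import spark.implicits._

// Decimal columns are the usual trigger: newer Spark writes them in an
// int-based Parquet encoding instead of fixed-length byte arrays.
val df = Seq((1, BigDecimal("10.25")), (2, BigDecimal("99.99"))).toDF("id", "amount")
df.write.mode("overwrite").parquet("/tmp/demo/amounts")

// An external Hive table created over the same files, e.g. from the Hive CLI:
//
//   CREATE EXTERNAL TABLE amounts (id INT, amount DECIMAL(38,18))
//   STORED AS PARQUET
//   LOCATION '/tmp/demo/amounts';
//
// can then fail on SELECT with the ParquetDecodingException shown above.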

The Spark documentation (link here) describes workarounds for these issues.
There are two main properties:

spark.sql.parquet.writeLegacyFormat (default: false)   -- If true, data will be written in the way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.

spark.sql.hive.convertMetastoreParquet (default: true)  -- When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.


Set these properties to true or false based on the requirement (in this scenario, writing Hive-readable files means setting spark.sql.parquet.writeLegacyFormat to true), as sketched below.
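Here is a minimal sketch of setting both properties when building the SparkSession. The application name is hypothetical; the property names come from the Spark documentation quoted above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-compatible-parquet")
  // Write decimals and similar types in the Spark 1.4 / Hive-compatible layout.
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  // Use the Hive SerDe for metastore Parquet tables instead of Spark's built-in reader.
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()

// The same settings can also be passed at submit time, e.g.:
//   spark-submit \
//     --conf spark.sql.parquet.writeLegacyFormat=true \
//     --conf spark.sql.hive.convertMetastoreParquet=false \
//     your-job.jar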
