Monday, July 29, 2019

Build a Scala FAT Jar Using SBT

Create a directory structure as below:

~/sbt_build  - This is the root directory
  - src/main/scala/   - Place the .scala files here (a minimal example follows)
  - lib     - Place any unmanaged JARs here (see the note on the Oracle driver below)
  - target  - Created automatically when the build runs
  - project - Created automatically when the build runs
  - build.sbt - Prepare this file and place it here
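
For reference, here is a minimal SampleApp.scala that could sit under src/main/scala/ (a hypothetical sketch; the object name matches the --class passed to spark-submit later):

import org.apache.spark.sql.SparkSession

object SampleApp {
  def main(args: Array[String]): Unit = {
    // On a cluster the master is supplied by spark-submit, so it is not set here
    val spark = SparkSession.builder()
      .appName("SampleApp")
      .enableHiveSupport()
      .getOrCreate()

    // Trivial sanity check: count a generated range of rows
    val count = spark.range(100).count()
    println(s"Row count: $count")

    spark.stop()
  }
}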

Contents of build.sbt

name := "SampleApp"

version := "1.0"
scalaVersion := "2.11.6"
val sparkVersion = "2.2.0"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
libraryDependencies += "oracle" % "ojdbc6" % "11.1.0.6"

libraryDependencies += "com.cedarsoftware" % "java-util" % "1.8.0"

Note that the Oracle JDBC driver (ojdbc6) is not available from any public Maven repository, so it cannot be resolved through libraryDependencies. Instead, place the ojdbc jar in the below directory; sbt treats any jar under lib/ as an unmanaged dependency and adds it to the final JAR.

~/sbt_build/lib

Also, there are a few ways to create JAR files using sbt. We will use the sbt-assembly plugin to build a single fat JAR. To add the plugin, place a file named "plugins.sbt" in the project folder.

~/sbt_build/project/plugins.sbt

Add the line below to the plugins.sbt file:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

From the root directory, run the command below:
~/sbt_build$ sbt assembly

Once the build succeeds, the jar file can be found at
~/sbt_build/target/scala-2.11/SampleApp-assembly-1.0.jar

Submit the jar on the cluster to run the application:

spark-submit --class SampleApp --master yarn target/scala-2.11/SampleApp-assembly-1.0.jar


Wednesday, July 24, 2019

Zeppelin - Additional Dependencies (Jars/Packages)

Zeppelin interpreter properties can be used to add additional JAR and Maven package dependencies.

Upload the needed JAR files to HDFS.
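
For example, a jar can be uploaded to HDFS as below (the local and HDFS paths are placeholders):

hdfs dfs -put ojdbc6.jar /user/zeppelin/jars/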

Add the properties below in the Livy interpreter settings:

  • livy.spark.jars - A comma-separated list of JAR files available on HDFS.
  • livy.spark.jars.packages - A comma-separated list of Maven coordinates. Maven Central is available by default in Zeppelin; other repositories can also be added for additional dependencies. Example values for both properties are shown below.
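
For illustration, hypothetical values for these properties might look like this (the HDFS paths and Maven coordinates are placeholders):

livy.spark.jars            hdfs:///user/zeppelin/jars/ojdbc6.jar,hdfs:///user/zeppelin/jars/custom-udfs.jar
livy.spark.jars.packages   com.databricks:spark-avro_2.11:4.0.0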

(Screenshot: Zeppelin Livy interpreter settings)



Thursday, July 11, 2019

Parquet issues - Hive vs Spark

Reading and writing Parquet files worked the same way in Hive and Spark up to Spark 1.4.

After that, Spark moved to newer Parquet releases, so if one writes a Parquet file with a recent Spark version and builds an external Hive table on top of it, errors like the one below appear.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1

The Spark SQL documentation describes workarounds for these issues.
There are two main properties:

spark.sql.parquet.writeLegacyFormat (default: false)   -- If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.

spark.sql.hive.convertMetastoreParquet (default: true)  -- When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support.


Set these values to true or false based on the requirement; a short spark-shell example follows.
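
As a minimal sketch (the output path and table name are placeholders): write Parquet in the legacy format so that Hive and Impala can read it, and read a metastore Parquet table through the Hive SerDe.

scala> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

scala> df.write.mode("overwrite").parquet("/tmp/hive_compatible_output")

scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> val hiveDf = spark.table("mydb.parquet_table")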

Wednesday, July 10, 2019

Google BigQuery Connection From a Spark-Shell Environment


Set the environment variable as below, pointing to the JSON credential file.

export GOOGLE_APPLICATION_CREDENTIALS="<path>/bigquery-auth.json"

The latest version of the connector jar (spark-bigquery-latest.jar) is available in Google Cloud Storage.
Download it and include it in the spark-shell; the connector's GitHub page linked at the bottom of this post has the download details.

Launch the spark-shell and try a few commands as shown below.

spark-shell --master local[*] --jars <path>/spark-bigquery-latest.jar

scala> import com.google.cloud.spark.bigquery._

scala> val df = spark.read.bigquery("<project_id>:<dataset_id>.<table_name>")

scala> df.count

scala> df.write.mode("Overwrite").format("com.databricks.spark.avro").save("adl:///testpath/bigquerydata")
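
Standard DataFrame operations work on the returned DataFrame, and the connector can push column selection and filters down to BigQuery. A small sketch (the column names here are hypothetical):

scala> val filtered = df.select("name", "id").filter("id > 100")

scala> filtered.show(5)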


The link below explains a few more details about setting up the connection:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector