Scala and Spark Notes (1)

In order to learn the important lessons in life, one must, each day, surmount a fear.

Ralph Waldo Emerson

I laid out the basic usage of Scala on my GitHub. In this article, I will mainly talk about how to use Scala to write Spark jobs, with some examples. This post has four parts. The first part introduces how to build a Scala project with IntelliJ IDEA. The second part explains the main concepts of Spark. The third part shows an example of using Spark with Scala and how to deploy the code to a production environment. Finally, I list some references that can help optimize the performance of Spark code.

Create executable JAR file with dependencies

Pre-installation

Make sure the versions of sbt and Scala are compatible. In this blog, I'm using sbt 1.3.8 and Scala 2.12.3.

  • Install Java

    sudo apt-get update
    sudo apt-get install default-jdk
  • Install sbt

    echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
    sudo apt-get update
    sudo apt-get install sbt
  • Install Scala

    sudo apt-get remove scala-library scala
    sudo wget http://scala-lang.org/files/archive/scala-2.12.1.deb
    sudo dpkg -i scala-2.12.1.deb
    sudo apt-get update
    sudo apt-get install scala
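
Once the three installs above have finished, it can help to confirm that each tool is actually on the PATH before moving on. The following is a minimal sketch, not part of the original setup: `check_tool` is a hypothetical helper name, and it only checks that the command exists, not that the versions are compatible.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# Check the three tools installed in the steps above.
for tool in java sbt scala; do
  check_tool "$tool"
done
```

If any tool is reported as missing, rerun the corresponding install step before continuing.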

Installation by using sbt assembly

  • Create a file named plugins.sbt in the project/ directory and add the line below, replacing x.y.z with the sbt-assembly version you want to use

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "x.y.z")
  • Add the block below to your build.sbt file

    // Decides how files that appear in more than one dependency JAR are merged.
    val defaultMergeStrategy: String => MergeStrategy = {
      // Concatenate configuration files.
      case x if Assembly.isConfigFile(x) =>
        MergeStrategy.concat
      // Rename README and license files so that each copy is kept.
      case PathList(ps @ _*) if Assembly.isReadme(ps.last) || Assembly.isLicenseFile(ps.last) =>
        MergeStrategy.rename
      case PathList("META-INF", xs @ _*) =>
        (xs map {_.toLowerCase}) match {
          // Drop manifests and index files.
          case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
            MergeStrategy.discard
          // Drop signature files.
          case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
            MergeStrategy.discard
          case "plexus" :: xs =>
            MergeStrategy.discard
          // Merge service registrations, keeping each distinct line once.
          case "services" :: xs =>
            MergeStrategy.filterDistinctLines
          case ("spring.schemas" :: Nil) | ("spring.handlers" :: Nil) =>
            MergeStrategy.filterDistinctLines
          case _ => MergeStrategy.deduplicate
        }
      case _ => MergeStrategy.deduplicate
    }
    // Note: this function only takes effect once it is assigned to the assemblyMergeStrategy setting.
  • Run sbt assembly from the project root to build the JAR
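
The final step above can be sketched as a small script. This is only an illustration under stated assumptions: the project name, version, and Scala binary version are hypothetical placeholders, `assembly_jar_path` is a helper invented for this example, and the path it builds assumes sbt-assembly's default output naming (target/scala-<binary version>/<name>-assembly-<version>.jar), which will differ if your build overrides the JAR name.

```shell
#!/bin/sh
# Hypothetical values; replace them with the name and version set in your build.sbt.
PROJECT_NAME="my-spark-job"
PROJECT_VERSION="0.1.0"
SCALA_BINARY_VERSION="2.12"

# Builds the JAR path, assuming sbt-assembly's default naming has not been overridden.
# $1 = project name, $2 = project version, $3 = Scala binary version
assembly_jar_path() {
  printf 'target/scala-%s/%s-assembly-%s.jar\n' "$3" "$1" "$2"
}

JAR="$(assembly_jar_path "$PROJECT_NAME" "$PROJECT_VERSION" "$SCALA_BINARY_VERSION")"

# Only run the build when sbt is installed and we are in a project root.
if command -v sbt >/dev/null 2>&1 && [ -f build.sbt ]; then
  sbt assembly && ls -lh "$JAR"
else
  echo "sbt or build.sbt not found; the JAR would be expected at $JAR"
fi
```

If the JAR does not appear at the printed path, check the sbt assembly output, which logs where the file was written.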