Spark is written in Scala, which is fast because it is statically typed and compiles in a known way to the JVM.
Spark has APIs for Scala, Python, Java, and R, but the most popularly used languages are the former two.
Java does not support a Read-Evaluate-Print Loop (REPL), and R is not a general-purpose language.
The data science community is divided into two camps: one favors Scala, the other Python.
Each has its pros and cons, and the final choice should depend on the target application.
Scala is often significantly faster than Python, sometimes by an order of magnitude.
Scala runs on the Java Virtual Machine (JVM), which gives it a speed advantage over Python in most cases.
Python is dynamically typed, and this reduces speed; compiled languages are generally faster than interpreted ones.
In the case of Python, calls into the Spark libraries require a lot of processing (data must move between the Python process and the JVM), hence slower performance.
Scala also works well when only a limited number of cores is available.
Moreover, Scala is native to Hadoop since it is based on the JVM.
Scala interacts with Hadoop via Hadoop's native Java API, and it is very easy to write native Hadoop applications in Scala.
Both are functional and object-oriented languages with similar syntax, and both have thriving support communities.
Scala may be a bit more complex to learn than Python due to its high-level functional features.
Python is preferable for simple, intuitive logic; it has simple syntax and good standard libraries in the data science community.
Scala is more useful for complex workflows.
Scala's many libraries allow quick integration with databases in Big Data ecosystems.
Scala allows writing code with multiple concurrency primitives, while Python has limited support for concurrency and multithreading.
This gives Scala better memory management and data processing under parallel workloads.
Technically, Python does support heavyweight process forking, but only one thread is active at a time because of the Global Interpreter Lock. So to scale out, more processes must be spawned, which increases the memory overhead.
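As a rough sketch of what Scala's concurrency primitives look like in practice, here is a small example using Futures from the standard library (the workload and thread pool here are illustrative, not taken from Spark):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two independent computations submitted to the global thread pool.
// Unlike Python threads under the GIL, these can run on separate cores.
val sumFuture     = Future { (1 to 1000000).map(_.toLong).sum }
val squaresFuture = Future { (1 to 1000000).map(i => i.toLong * i).sum }

// Combine the two results once both complete.
val combined = for {
  s  <- sumFuture
  sq <- squaresFuture
} yield s + sq

println(Await.result(combined, 10.seconds))
```

Futures begin running as soon as they are created, and the for-expression only sequences the combination of results, not the work itself.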
Both are expressive, and we can achieve a high level of functionality with them.
Python is more user-friendly and concise.
Scala is generally more powerful in terms of frameworks, libraries, implicits, macros, etc.
Scala works well within the MapReduce framework because of its functional nature.
Many Scala data frameworks follow similar abstract data types that are consistent with Scala's collections API.
Developers just need to learn the basic standard collections, which allow them to easily get acquainted with other libraries.
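For instance, the same collection operations work uniformly across the standard sequence types, so learning them once carries over (a minimal illustration):

```scala
// filter and map behave identically on List and Vector
val fromList   = List(1, 2, 3, 4).filter(_ % 2 == 0).map(_ * 10)
val fromVector = Vector(1, 2, 3, 4).filter(_ % 2 == 0).map(_ * 10)

println(fromList)   // List(20, 40)
println(fromVector) // Vector(20, 40)
```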
Spark is written in Scala, so knowing Scala lets you understand and modify what Spark does internally.
Moreover, many upcoming features get their Scala and Java APIs first, and the Python APIs evolve in later versions.
For NLP, however, Python is preferred, as Scala does not have many tools for machine learning or NLP.
Even for the standard Spark machine learning library, MLlib, Python is the data science community's preference.
Python's visualization libraries also complement PySpark, since neither Spark nor Scala has anything comparable.
Scala was created by Martin Odersky, who studied under Niklaus Wirth, who created Pascal and several other languages.
Mr. Odersky is one of the co-designers of Generic Java, and is also known as the “father” of the javac compiler.
Before we jump into the examples, here are a few important things to know about Scala:
One good thing to know up front is that comments in Scala are just like comments in Java (and many other languages):
// a single line comment
/*
* a multiline comment
*/
/**
* also a multiline comment
*/
val defines a fixed, immutable value and should be preferred.
var creates a mutable variable, and should only be used when there is a specific reason to use it.
val x = 1 // immutable
var y = 0 // mutable
x = 1 y = 0
0
In Scala, you typically create variables without declaring their type. When you do this, Scala can usually infer the data type for you. You could check this with Scala REPL or Spark Shell
val x = 1
val s = "a string"
x = 1 s = a string
a string
This feature is known as type inference, and it’s a great way to help keep your code concise. You can also explicitly declare a variable’s type, but that’s not usually necessary:
val x: Int = 1
val s: String = "a string"
x = 1 s = a string
a string
Scala comes with the standard numeric data types you’d expect. In Scala all of these data types are full-blown objects (not primitive data types).
val b: Byte = 1
val x: Int = 1
val l: Long = 1
val s: Short = 1
val d: Double = 2.0
b = 1 x = 1 l = 1 s = 1 d = 2.0
2.0
Because Int and Double are the default numeric types, you typically create them without explicitly declaring the data type:
val i = 123 // defaults to Int
val x = 1.0 // defaults to Double
i = 123 x = 1.0
1.0
For large numbers Scala also includes the types BigInt and BigDecimal.
A great thing about BigInt and BigDecimal is that they support all the operators you’re used to using with numeric types.
var b0 = BigInt(987654321)
var b1 = BigInt(1234567890)
var b2 = BigDecimal(123456.789)
b0 + b1
b0 += 1
b1 * b1
b0 = 987654322 b1 = 1234567890 b2 = 123456.789
1524157875019052100
The List class is a linear, immutable sequence.
All this means is that it’s a linked-list that you can’t modify.
Any time you want to add or remove List elements, you create a new List from an existing List.
val ints = List(1, 2, 3)
val names = List("Joel", "Chris", "Ed")
ints = List(1, 2, 3) names = List(Joel, Chris, Ed)
List(Joel, Chris, Ed)
You prepend elements to a List like this:
val b = 0 +: ints
b = List(0, 1, 2, 3)
List(0, 1, 2, 3)
val b2 = List(-1, 0) ++: ints
b2 = List(-1, 0, 1, 2, 3)
List(-1, 0, 1, 2, 3)
The : character represents the side that the sequence is on, so when you use +: you know that the existing list goes on the right.
Similarly, :+ appends, with the list on the left:
val c1 = ints :+ 4
c1 = List(1, 2, 3, 4)
List(1, 2, 3, 4)
val c2 = ints :+ List(4,5) // :+ appends a single element, so the List is nested; use ++ to concatenate
c2 = List(1, 2, 3, List(4, 5))
List(1, 2, 3, List(4, 5))
if/else
Scala’s if/else control structure is similar to other languages:
if (test1) {
doA()
} else if (test2) {
doB()
} else if (test3) {
doC()
} else {
doD()
}
However, unlike Java and many other languages, the if/else construct returns a value, so, among other things, you can use it in place of a ternary operator:
val a = 5
val b = 7
val x = if (a < b) a else b
a = 5 b = 7 x = 5
5
match expressions
Scala has a match expression, which in its most basic use is like a Java switch statement:
// val i = 1
val i = 5
val result = i match {
case 1 => "one"
case 2 => "two"
case _ => "not 1 or 2"
}
i = 5 result = not 1 or 2
not 1 or 2
The match expression isn't limited to integers; it can be used with any data type, including booleans:
val test = true
val boolean2String = test match {
case true => "it is true"
case false => "it is false"
}
test = true boolean2String = it is true
it is true
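Since match works with any data type, here is a small sketch matching on Strings, with a variable pattern as the default case (the describe method is made up for illustration):

```scala
// match on Strings; `other` binds whatever didn't match the literals
def describe(s: String): String = s match {
  case "apple"  => "a fruit"
  case "carrot" => "a vegetable"
  case other    => s"unknown: $other"
}

println(describe("apple")) // a fruit
println(describe("spark")) // unknown: spark
```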
for loops and expressions
// "x to y" syntax
for (i <- 0 to 5) println(i)
0 1 2 3 4 5
// "x to y by" syntax
for (i <- 0 to 10 by 2) println(i)
0 2 4 6 8 10
You can also add the yield keyword to for-loops to create for-expressions that yield a result, similar to list comprehensions in Python. Here's a for-expression that doubles each value in the sequence 1 to 5:
val x = for (i <- 1 to 5) yield i * 2
x = Vector(2, 4, 6, 8, 10)
Vector(2, 4, 6, 8, 10)
Here’s another for-expression that iterates over a list of strings:
val fruits = List("apple", "banana", "lime", "orange")
val fruitLengths = for {
f <- fruits
if f.length > 4
} yield f.length
fruits = List(apple, banana, lime, orange) fruitLengths = List(5, 6, 6)
List(5, 6, 6)
while and do/while
Scala also has while and do/while loops. Here's their general syntax:
// while loop
while(condition) {
statement(a)
statement(b)
}
// do-while
do {
statement(a)
statement(b)
}
while(condition)
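To make the general syntax above concrete, here is a small runnable sketch (the loop bounds are arbitrary; do/while shown is the Scala 2 form):

```scala
// while loop: prints 1, 2, 3
var i = 1
while (i <= 3) {
  println(i)
  i += 1
}

// do/while runs the body at least once, even when the condition is false
var j = 10
do {
  println(s"j = $j")
} while (j < 10)
```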
Here’s an example of a Scala class:
class Person(var firstName: String, var lastName: String) {
def printFullName() = println(s"$firstName $lastName")
}
defined class Person
val myname = new Person("Feng", "Li")
myname = Person@7aeeb4fc
Person@7aeeb4fc
println(myname.firstName)
myname.lastName = "LI"
myname.printFullName()
myname.lastName: String = LI
Feng Feng LI
Just like other OOP languages, Scala classes have methods, and this is what the Scala method syntax looks like:
def sum(a: Int, b: Int): Int = a + b
def concatenate(s1: String, s2: String): String = s1 + s2
sum: (a: Int, b: Int)Int concatenate: (s1: String, s2: String)String
You don’t have to declare a method’s return type, so it’s perfectly legal to write those two methods like this, if you prefer:
def sum(a: Int, b: Int) = a + b
def concatenate(s1: String, s2: String) = s1 + s2
sum: (a: Int, b: Int)Int concatenate: (s1: String, s2: String)String
This is how you call those methods:
val x = sum(1, 2)
val y = concatenate("foo", "bar")
x = 3 y = foobar
foobar
“Scala is faster and moderately easy to use, while Python is slower but very easy to use.”
More about Scala https://docs.scala-lang.org/overviews/scala-book/introduction.html
Scala libraries for Spark https://index.scala-lang.org/search?q=*&topics=spark