{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Scala\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Spark programming in Scala or Python? \n", "\n", "- Spark is written in Scala, which can be quite fast because it is statically typed and compiles in a known way to the JVM. \n", "\n", "- Although Spark has APIs for Scala, Python, Java, and R, the most widely used languages are the first two. \n", "\n", "- Spark provides no interactive **Read-Evaluate-Print-Loop** (REPL) shell for Java. R is not a general-purpose language. \n", "\n", "- The data science community is divided into two camps: \n", "\n", "    - one prefers Scala \n", "    - the other prefers Python. \n", "    \n", "- Each has its pros and cons, and the final choice should depend on the target application." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Performance\n", "\n", "- Scala is frequently cited as being around 10 times faster than Python, although the gap depends on the workload. \n", "\n", "- Scala runs on the Java Virtual Machine (JVM), which gives it a speed advantage over Python in most cases. \n", "\n", "- Python is dynamically typed, which reduces its speed. \n", "\n", "- Compiled languages are generally faster than interpreted ones. \n", "\n", "- With Python, calls into the Spark libraries require a lot of extra code processing, which slows performance. \n", "\n", "- Scala works well with a limited number of cores. \n", "\n", "- Moreover, Scala is native to Hadoop because it is based on the JVM. \n", "\n", "- Scala interacts with Hadoop through Hadoop's native Java API, so it is very easy to write native Hadoop applications in Scala."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning Curve\n", "\n", "- Both are functional and object-oriented languages with similar syntax and thriving support communities. \n", "\n", "- Scala may be a bit more complex to learn than Python because of its high-level functional features. \n", "\n", "- Python is preferable for simple, intuitive logic. Python has simple syntax and good standard libraries in the data science community.\n", "\n", "- Scala is more useful for complex workflows. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Concurrency\n", "\n", "- Scala has multiple standard libraries and cores, which allow quick integration of databases in Big Data ecosystems. \n", "\n", "- Scala allows writing code with multiple concurrency primitives. \n", "\n", "- Python has limited support for concurrency and multithreading. \n", "\n", "- Scala allows better memory management and data processing. \n", "\n", "- Technically, Python does support heavyweight process forking, but only one thread is active at a time. So whenever new code is deployed, more processes must be restarted, which increases the memory overhead." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Usability\n", "\n", "- Both are expressive, and we can achieve a high level of functionality with them. \n", "\n", "- Python is more user-friendly and concise. \n", "\n", "- Scala is generally more powerful in terms of frameworks, libraries, implicits, macros, etc. \n", "\n", "- Scala works well within the MapReduce framework because of its functional nature. \n", "\n", "- Many Scala data frameworks follow similar abstract data types that are consistent with Scala’s collection of APIs. 
\n", "\n", "- Developers just need to learn the basic standard collections, which makes it easy to get acquainted with other libraries. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Some advice\n", "\n", "\n", "- Spark is written in Scala, so knowing Scala will let you understand and modify what Spark does internally. \n", "\n", "- Moreover, many upcoming features first get their APIs in Scala and Java, and the Python APIs follow in later versions. \n", "\n", "- But for NLP, Python is preferred, as Scala doesn’t have many tools for machine learning or NLP. \n", "\n", "- Moreover, for standard Spark libraries such as MLlib, Python is preferred by the data science community. \n", "\n", "- Python’s visualization libraries complement PySpark, as neither Spark nor Scala has anything comparable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A bit of background\n", "\n", "- Scala was created by Martin Odersky, who studied under Niklaus Wirth, who created Pascal and several other languages. \n", "\n", "- Mr. Odersky is one of the co-designers of Generic Java, and is also known as the “father” of the javac compiler." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Overview of Scala\n", "\n", "Before we jump into the examples, here are a few important things to know about Scala:\n", "\n", "- It’s a high-level language\n", "- It’s statically typed\n", "- Its syntax is concise but still readable; we call it expressive\n", "- It supports the object-oriented programming (OOP) paradigm: every variable is an object, and every “operator” is a method.\n", "- It supports the functional programming (FP) paradigm: functions are also variables, and you can pass them into other functions. 
You can write your code using OOP, FP, or a hybrid style that combines them.\n", "- It has a sophisticated type inference system\n", "- Scala code results in .class files that run on the Java Virtual Machine (JVM)\n", "- It’s easy to use Java libraries in Scala" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Comments\n", "\n", "One good thing to know up front is that comments in Scala are just like comments in Java (and many other languages):\n", "\n", "    // a single line comment\n", "\n", "    /*\n", "     * a multiline comment\n", "     */\n", "\n", "    /**\n", "     * also a multiline comment\n", "     */" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Create variables\n", "\n", "- `val` defines a fixed **immutable value**, and should be preferred.\n", "- `var` creates a **mutable variable**, and should only be used when there is a specific reason to use it." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "x = 1\n", "y = 0\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val x = 1 // immutable\n", "var y = 0 // mutable" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In Scala, you typically create variables without declaring their type. When you do this, Scala can usually infer the data type for you. 
You can check this in the Scala REPL or the Spark shell." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "x = 1\n", "s = a string\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a string" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val x = 1\n", "val s = \"a string\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This feature is known as **type inference**, and it’s a great way to help keep your code concise. You can also explicitly declare a variable’s type, but that’s not usually necessary:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "x = 1\n", "s = a string\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a string" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val x: Int = 1\n", "val s: String = \"a string\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Scala built-in data types\n", "\n", "Scala comes with the standard numeric data types you’d expect. In Scala, all of these data types are full-blown objects (not primitive data types)."
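, "\n",
"\n",
"As a minimal sketch of what \"full-blown objects\" means in practice (the names `n`, `asText`, `total`, `larger` and `wide` are illustrative assumptions, not part of the course material), methods can be called directly on numeric values:\n",
"\n",
"```scala\n",
"val n: Int = 42\n",
"val asText: String = n.toString // method call on an Int value\n",
"val total: Int = n.+(1)         // operators are methods: same as n + 1\n",
"val larger: Int = n.max(100)    // the larger of n and 100\n",
"val wide: Long = n.toLong       // widen the value to a Long\n",
"```\n",
"\n",
"Writing `n + 1` is simply infix notation for the method call `n.+(1)`."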
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "b = 1\n", "x = 1\n", "l = 1\n", "s = 1\n", "d = 2.0\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "2.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val b: Byte = 1\n", "val x: Int = 1\n", "val l: Long = 1\n", "val s: Short = 1\n", "val d: Double = 2.0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Because Int and Double are the default numeric types, you typically create them without explicitly declaring the data type:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "i = 123\n", "x = 1.0\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "1.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val i = 123 // defaults to Int\n", "val x = 1.0 // defaults to Double" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- For large numbers Scala also includes the types `BigInt` and `BigDecimal`. \n", "\n", "- A great thing about BigInt and BigDecimal is that they support all the operators you’re used to using with numeric types." 
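, "\n",
"\n",
"To see why these types matter, here is a small sketch (the names `maxLong`, `wrapped` and `exact` are illustrative assumptions) contrasting fixed-width `Long` arithmetic, which silently wraps around on overflow, with `BigInt`, which grows as needed:\n",
"\n",
"```scala\n",
"val maxLong: Long = Long.MaxValue       // largest value a Long can hold\n",
"val wrapped: Long = maxLong + 1         // overflows and wraps to Long.MinValue\n",
"val exact: BigInt = BigInt(maxLong) + 1 // exact result, no overflow\n",
"```"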
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "b0 = 987654322\n", "b1 = 1234567890\n", "b2 = 123456.789\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "1524157875019052100" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var b0 = BigInt(987654321)\n", "var b1 = BigInt(1234567890)\n", "var b2 = BigDecimal(123456.789)\n", "\n", "b0 + b1\n", "b0 += 1 \n", "b1 * b1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## List\n", "\n", "- The List class is a linear, **immutable** sequence. \n", "\n", "- All this means is that it’s a linked-list that you can’t modify. \n", "\n", "- Any time you want to add or remove List elements, you create a new List from an existing List." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "ints = List(1, 2, 3)\n", "names = List(Joel, Chris, Ed)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(Joel, Chris, Ed)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val ints = List(1, 2, 3)\n", "val names = List(\"Joel\", \"Chris\", \"Ed\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You prepend elements to a `List` like this:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "b = List(0, 1, 2, 3)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(0, 1, 2, 3)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val b = 0 +: ints" ] }, { "cell_type": "code", "execution_count": 32, 
"metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "b2 = List(-1, 0, 1, 2, 3)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(-1, 0, 1, 2, 3)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val b2 = List(-1, 0) ++: ints" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- `:` character represents the side that the sequence is on.\n", "\n", "- so when you use `+:` you know that the list needs to be on the right" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "c1 = List(1, 2, 3, 4)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(1, 2, 3, 4)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val c1 = ints :+ 4" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "c2 = List(1, 2, 3, List(4, 5))\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(1, 2, 3, List(4, 5))" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val c2 = ints :+ List(4,5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Control structures\n", "\n", "- `if/else` Scala’s if/else control structure is similar to other languages:\n", "\n", " if (test1) {\n", " doA()\n", " } else if (test2) {\n", " doB()\n", " } else if (test3) {\n", " doC()\n", " } else {\n", " doD()\n", " }" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "However, unlike Java and many other languages, the if/else construct returns a value, so, among other things, you can use 
it as a ternary operator:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "a = 5\n", "b = 7\n", "x = 5\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val a = 5\n", "val b = 7\n", "val x = if (a < b) a else b" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `match` expressions\n", "\n", "Scala has a match expression, which in its most basic use is like a Java switch statement:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "i = 5\n", "result = not 1 or 2\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "not 1 or 2" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// val i = 1\n", "val i = 5\n", "\n", "val result = i match {\n", " case 1 => \"one\"\n", " case 2 => \"two\"\n", " case _ => \"not 1 or 2\"\n", "}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The match expression isn’t limited to just integers, it can be used with any data type, including booleans:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "test = true\n", "boolean2String = it is true\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "it is true" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val test = true\n", "\n", "val boolean2String = test match {\n", " case true => \"it is true\"\n", " case false => \"it is false\"\n", "}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": 
{ "slide_type": "slide" } }, "source": [ "## `for` loops and expressions" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n", "5\n" ] } ], "source": [ "// \"x to y\" syntax\n", "for (i <- 0 to 5) println(i)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "2\n", "4\n", "6\n", "8\n", "10\n" ] } ], "source": [ "// \"x to y by\" syntax\n", "for (i <- 0 to 10 by 2) println(i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can also add the `yield` keyword to for-loops to create for-expressions that yield a result. This is similar to a list comprehension in Python. Here’s a for-expression that doubles each value in the sequence 1 to 5:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "x = Vector(2, 4, 6, 8, 10)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Vector(2, 4, 6, 8, 10)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val x = for (i <- 1 to 5) yield i * 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here’s another for-expression that iterates over a list of strings and uses an `if` guard to keep only those longer than four characters:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "fruits = List(apple, banana, lime, orange)\n", "fruitLengths = List(5, 6, 6)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "List(5, 6, 6)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "val fruits = List(\"apple\", \"banana\", \"lime\", \"orange\")\n", "\n", "val fruitLengths = for {\n", " f <- fruits\n", " if f.length > 4\n", "} yield f.length" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `while` and `do/while`\n", "Scala also has while and do/while loops. Here’s their general syntax:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " // while loop\n", " while(condition) {\n", " statement(a)\n", " statement(b)\n", " }\n", "\n", " // do-while\n", " do {\n", " statement(a)\n", " statement(b)\n", " } \n", " while(condition)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Classes\n", "\n", "Here’s an example of a Scala class:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "defined class Person\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "class Person(var firstName: String, var lastName: String) {\n", " def printFullName() = println(s\"$firstName $lastName\")\n", "}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "myname = Person@7aeeb4fc\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Person@7aeeb4fc" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val myname = new Person(\"Feng\", \"Li\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "myname.lastName: String = LI\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Feng\n", "Feng LI\n" ] } ], "source": [ "println(myname.firstName)\n", 
"myname.lastName = \"LI\"\n", "myname.printFullName()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Scala methods\n", "\n", "Just like other OOP languages, Scala classes have methods, and this is what the Scala method syntax looks like:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "sum: (a: Int, b: Int)Int\n", "concatenate: (s1: String, s2: String)String\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def sum(a: Int, b: Int): Int = a + b\n", "def concatenate(s1: String, s2: String): String = s1 + s2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You don’t have to declare a method’s return type, so it’s perfectly legal to write those two methods like this, if you prefer:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "sum: (a: Int, b: Int)Int\n", "concatenate: (s1: String, s2: String)String\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def sum(a: Int, b: Int) = a + b\n", "def concatenate(s1: String, s2: String) = s1 + s2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is how you call those methods:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "x = 3\n", "y = foobar\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "foobar" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val x = sum(1, 2)\n", "val y = concatenate(\"foo\", \"bar\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Bottom-Line: 
Scala for Apache Spark\n", "\n", "- “Scala is faster and moderately easy to use, while Python is slower but very easy to use.”\n", "\n", "- More about Scala https://docs.scala-lang.org/overviews/scala-book/introduction.html\n", "\n", "- Scala library for Spark https://index.scala-lang.org/search?q=*&topics=spark" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "white" } }, "nbformat": 4, "nbformat_minor": 4 }