Welcome to our new series of articles about Apache Spark and the Big Data ecosystem.
Before we start our journey, the first question is “why?”. The Internet is already full of Spark tutorials, articles and manuals. Spark is another big thing in the world of data analysis, so why bother writing another post or series about it? Aren’t there already enough?
At Codete we believe that only a detail-oriented, deeper understanding of a given technology and problem really matters. Nowadays we tend to google through a bunch of simple tutorials and quick answers to our problems, without building a deeper understanding of the problem itself.
So if you ever wanted to check out Apache Spark but haven’t found a step-by-step guide, this series might be the place for you!


Although we will do our best to explain everything, minimal knowledge of functional programming is required. In this series we will mostly use Spark’s native language: Scala. But if you are a backend developer who knows the basic features of the Stream API from Java 8 and doesn’t panic after seeing another language’s syntax, you’ll probably be fine :).
In the first parts we will use Docker to emulate a Spark cluster, so getting up to date with it is also a good idea.
We assume that you use Linux, but a different system shouldn’t be a problem either.

Setting up a cluster in Docker

Before we get into coding, let’s prepare our environment for the future parts. The web is full of different Spark images, but we honestly recommend the following one: https://github.com/gettyimages/docker-spark
It contains a docker-compose file that starts a Spark master node and one worker node, which is the absolute minimum for our purposes. So how do we set it up? Type in the console:
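The original snippet isn’t preserved in this copy of the post; assuming you have Git and Docker Compose installed, the setup would look roughly like this:

```shell
# Clone the recommended image repository and start the cluster in the background
git clone https://github.com/gettyimages/docker-spark.git
cd docker-spark
docker-compose up -d   # starts the master and the worker defined in docker-compose.yml
```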

And that’s all – the magic of Docker. Of course, this is just a temporary approach and we will get into the details later. To verify that everything works, open localhost:8080 in the browser, or {docker-machine-ip}:8080 on non-Linux machines.
We should see the Spark UI with one alive worker:
That’s it for Docker for now; we will get more into it in later episodes. Now, let’s code.

Code overview

Now we’re going to jump into the Scala code and review the repository prepared for this course. It’s a simple analytics example: a word count. We implemented it in a TDD manner, also to show how easy it is to test Spark code. At the end we will review a few facts about Big Data analytics and their application.


Initializing SparkContext

At the beginning let’s look at App$Test.scala and how we initialize our local SparkContext:
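The repository’s exact code isn’t reproduced in this copy of the post, but based on the description it might look like this (the method name `initSparkContext` is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch of the initialization described below
def initSparkContext(): SparkContext = {
  val conf = new SparkConf()
    .setMaster("local[*]")     // local mode, all available cores; a cluster URL could go here instead
    .setAppName("spark-intro") // optional, but simplifies finding the app in the UI
  val sc = new SparkContext(conf)
  sc.setLogLevel("WARN")       // Spark is pretty verbose by default
  sc
}
```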

First, we declare that we want to use a local instance of Spark with all available cores on our machine, by adding a star in brackets (local[*]). We could instead provide a remote cluster address here; that option will be explained in further episodes.

We can also set a name for our application, which is not mandatory but simplifies a lot of things. We also set the logging level to WARN: Spark is by nature pretty verbose, and we do not need that level of detail.

In the end we return the SparkContext for further use. Note that it’s a heavyweight object and we should reuse it; we can also have only one open SparkContext per JVM.

Thanks to that short method, we are able to start a local, embedded Spark cluster that will be removed after the program stops. This approach is great for unit tests and eases the local development stage.
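As a sketch of what such a test can look like (the data and names here are made up, not taken from the course repository):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical unit-test shape: spin up an embedded context, run the logic, stop it
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("unit-test"))
val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect().toSeq
sc.stop() // release it: only one SparkContext may be active per JVM
println(doubled)
```

In a real suite you would typically create the context once per suite and stop it in an after-all hook rather than per test.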

What is an RDD?

Let’s pause for a moment and talk about Spark’s main abstraction: the RDD. What is it?

A resilient distributed dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Each RDD is an immutable collection, which we can transform using operations such as map and filter to obtain a new RDD.

First touch of RDD

Let’s go back to App$Test.scala and see where we create our first RDD.
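The repository itself isn’t reproduced here; the two standard ways to create an RDD look like this (the sample data and the file path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-demo"))

// From an existing driver-side collection:
val fromCollection = sc.parallelize(Seq("first line", "second line"))

// From an external dataset, e.g. a text file (one element per line);
// this is lazy, so nothing is read until an action runs:
val fromFile = sc.textFile("data/input.txt")

val lineCount = fromCollection.count() // action: triggers the computation
sc.stop()
```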

As we can see, creating an RDD is pretty straightforward. Yet it hides a powerful abstraction. Thanks to it, we don’t have to worry about many things from the world of distributed systems, such as:

  • Managing machines – every Spark query is managed by the Spark master, and we don’t have to distribute the query to many machines ourselves. The query is simply invoked on every machine against its part of the dataset, and the results are then gathered on the master node. Any new worker joining the cluster is discovered automatically.
  • Fault tolerance – in case of a node’s failure, the query will be rerun or invoked on a replica of our data.
  • Elasticity – the system stays responsive under varying workload, and the master node is responsible for the even distribution of work.

All we need to do is operate on our data as if it were a single stream.

Our first ‘query’

Now let’s move to App.scala and look at our query. People who know the Java 8 Stream API will find many similarities.
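The query from App.scala isn’t reproduced in this copy of the post; a word count matching the description below can be sketched like this (the input data is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("word-count"))
val lines = sc.parallelize(Seq("to be or not to be", "to be is to do"))

val topWord = lines
  .flatMap(_.split("\\s+"))        // split lines into words
  .filter(_.nonEmpty)              // drop empty words
  .map(word => (word, 1))          // map each word to a pair
  .reduceByKey(_ + _)              // sum occurrences per word
  .sortBy(_._2, ascending = false) // sort descending by count
  .first()                         // action: fetch only the top element

println(topWord) // (to,4)
sc.stop()
```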

What we do here is split the lines of our file into words, filter out empty words, then map each word to a pair and sum the occurrences. After that, we sort the summed data in descending order.

Then we can take the first element, which will be the most frequently used word. Note that collecting a large dataset into a list is not the right choice, as we might get an OutOfMemory error; in our case, though, the result is fairly small.

An interesting thing to point out here is the difference between transformations and actions:

  • Transformations – invoked on every machine against its own subset of the data. They are lazy and will only be executed when an action occurs. That way we avoid unnecessary operations and a lot of network traffic.
  • Actions – pull the data back to the master.
Transformations: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cogroup, cartesian, pipe, coalesce, repartition, repartitionAndSortWithinPartitions
Actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, countByKey, foreach
via: https://spark.apache.org/docs/latest/programming-guide.html#transformations
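To make the laziness concrete, here is a small sketch (not from the course repository): the two transformations below only build up a plan, and nothing is computed until the action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lazy-demo"))

// Transformations only record lineage; no work happens yet
val transformed = sc.parallelize(1 to 10)
  .map(_ * 2)     // lazy transformation
  .filter(_ > 10) // lazy transformation

// The action triggers the whole chain and pulls results back to the driver
val result = transformed.collect().toSeq
println(result)
sc.stop()
```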

Note that mixing invocations of transformations and actions is not a good practice, as it causes unnecessary network traffic: the data is literally flying all over the cluster.

Having a longer chain of transformations is also important for fault tolerance. Thanks to lazy evaluation and execution only upon an action, a failed query can be re-sent and rerun on different nodes.

As we saw in the example above, we invoke an action only at the end of our query.

Last note

When you review the other parts of the application by yourself, you will also discover an important caveat of the Big Data world. Everyone says that Spark is great because it’s blazingly fast, but when we run some unit tests and compare against a plain Scala solution, it turns out that the plain Scala one is faster. Why?

Bear in mind that Spark, and all other Big Data tools in general, were designed to handle vast amounts of data. They connect nodes into a cluster to meet the challenge and use the combined computation power, but, as with every specialized solution, this comes at a cost:

  • We need to prepare the whole infrastructure
  • Network connectivity and data transfer add overhead
  • Every task requires resources and takes time to spin up

Before applying Spark to our domain, it’s best to benchmark it well and honestly decide whether it is the right choice.

That’s all folks

OK folks, that’s all for this episode. Next time we are going to explore ways of reading our data into Spark, meet HDFS and play with Spark’s shell.

While waiting for a new episode, I recommend going through our code, playing with it and trying to implement your first queries of your own. There are a few more articles worth reading:


Experienced Java Developer interested in modern, high-availability applications and processing vast amounts of data. In private life enjoys sport climbing, cooking and his couch.