Spark - Cassandra Data Processing (Scala)

Faster data processing from Cassandra by leveraging Apache Spark's in-memory and distributed processing powers

Pavan Kulkarni

38 minute read

We will look at the basics of processing data from Cassandra using Apache Spark. Processing data from a NoSQL database becomes very efficient when we use a distributed processing system like Spark, here driven from Scala.

We will look into the basic and most commonly used operations for processing data from Cassandra. Due to its data model, Cassandra has a few limitations that often prevent analysts from using CQLSH to run ad-hoc queries directly against Cassandra. Some of the most widely seen limitations are:

  1. No join or subquery support
  2. Limited support for aggregation.
  3. Ordering is done per partition. The table design forces you to query by partition, even for simple CQL queries.

We can overcome these limitations by introducing… wait for it… Apache Spark

Apache Spark to the Rescue !!!

Using Spark’s in-memory and distributed computing, we can either

  1. Load a huge chunk of data (or an entire table) from Cassandra and perform all sorts of complex transformations, aggregations and data munging one can imagine.
  2. Use the driver’s built-in server-side processing abilities to pull only the required data. By limiting the data read, Spark jobs can be optimized even further.

Spark jobs can make use of Spark’s distributed nature to re-partition the data for better distribution among the executors. I find DataStax’s spark-cassandra-connector to be an optimized and robust driver for seamless data processing between Cassandra and Spark.

Let’s look at some examples of basic read and write operations to and from Cassandra in Scala. Full code is available on my GitHub.

Set up Spark-Cassandra Drivers

I will be using the latest version of spark-cassandra-connector. We have already seen how to run Spark in Eclipse using Gradle. Add the spark-cassandra-connector driver to build.gradle as shown below. You can check for the latest version here.

build.gradle

dependencies{

	provided 'org.apache.spark:spark-core_2.11:2.2.1'
	provided 'org.apache.spark:spark-sql_2.11:2.2.1'
	provided 'org.apache.spark:spark-catalyst_2.11:2.2.1'
	compile  group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.7'

}

I’m using Scala 2.11 and Spark 2.2.1 as these are the latest versions available as of the time when this post was published.

Initialize SparkSession

Starting with Apache Spark 2.0.0, the entry point to a Spark job changed from SparkContext to SparkSession. SparkSession combines the customization options of SparkContext and Spark SQL.

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession
            .builder()
            .master("local")
            .appName("Spark_Cassandra")
            .config("spark.cassandra.connection.host", "localhost")
            .getOrCreate()

In a single statement, I have

  1. Initialized the Spark Job entry point
  2. Set the Spark Master
  3. Initialized connection to Cassandra
  4. Set a name for my Spark job.

Read From Cassandra

Let’s use the SparkSession to read values from Cassandra. We have 2 tables in Cassandra; sample data can be seen in this post.

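import org.apache.spark.sql.cassandra._                // adds cassandraFormat to the reader and writer
import com.datastax.spark.connector.rdd.ReadConf

val studentsDF = spark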
            .read
            .cassandraFormat("students", "test_keyspace")
            .options(ReadConf.SplitSizeInMBParam.option(32))
            .load()

studentsDF.show(false)

val coursesDF = spark
            .read
            .cassandraFormat("courses", "test_keyspace")
            .options(ReadConf.SplitSizeInMBParam.option(32))
            .load()

coursesDF.show()


This gives us 2 DataFrames.

studentsDF.show(false)

+---+--------------+-------------------------------------------------------------+----------------+
|id |year_graduated|courses_registered                                           |name            |
+---+--------------+-------------------------------------------------------------+----------------+
|5  |2004          |[[CS004,Spring_2004], [CS005,Summer_2004], [CS003,Fall_2004]]|Sheldon Cooper  |
|5  |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Pika Achu       |
|10 |2009          |[[CS010,Spring_2009], [CS002,Summer_2009], [CS007,Fall_2009]]|Hermoine Granger|
|11 |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Peter Parker    |
|1  |2001          |[[CS001,Spring_2001], [CS002,Summer_2001], [CS001,Fall_2001]]|Tom Riddle      |
|8  |2007          |[[CS001,Spring_2007], [CS003,Summer_2007], [CS009,Fall_2007]]|Cerci Lannister |
|2  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Ned Stark       |
|4  |2003          |[[CS009,Spring_2003], [CS010,Summer_2003], [CS004,Fall_2003]]|Frodo Baggins   |
|7  |2006          |[[CS004,Spring_2006], [CS005,Summer_2006], [CS003,Fall_2006]]|Stephan Hawkings|
|6  |2005          |[[CS009,Spring_2005], [CS006,Summer_2005], [CS004,Fall_2005]]|Tony Stark      |
|9  |2008          |[[CS006,Spring_2008], [CS007,Summer_2008], [CS009,Fall_2008]]|Wonder Woman    |
|3  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Haan Solo       |
+---+--------------+-------------------------------------------------------------+----------------+

studentsDF.printSchema()

root
 |-- id: integer (nullable = true)
 |-- year_graduated: string (nullable = true)
 |-- courses_registered: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cid: string (nullable = true)
 |    |    |-- sem: string (nullable = true)
 |-- name: string (nullable = true)

coursesDF.show()

+-----+--------------------+
|  cid|               cname|
+-----+--------------------+
|CS002|                DBMS|
|CS009|Machine Learning ...|
|CS004|   Intro to Big Data|
|CS008|Science of Progra...|
|CS007|    Network Security|
|CS003|      Basics of Java|
|CS006|System Software A...|
|CS001|        Basics of CS|
|CS005|Design and Analys...|
|CS010|     Cloud Computing|
+-----+--------------------+

As mentioned earlier, to get more out of Spark, we can further tune the read using the SplitSizeInMBParam configuration. It sets the size of Cassandra data to be read in a single Spark task and thereby determines the number of partitions, but it is ignored if splitCount is set.
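To get a feel for how the split size affects parallelism, here is a back-of-the-envelope sketch in plain Scala. It assumes the number of Spark partitions is roughly the estimated table size divided by the split size. The connector works from its own size estimates, so treat the result as an approximation only. The object and method names are made up for illustration and are not part of the connector.

```scala
object SplitSizeSketch {
  // Rough estimate of how many Spark partitions a table of the given size
  // yields for a given split size. There is always at least one partition.
  // Assumption: partitions ~= ceil(table size / split size).
  def estimatedPartitions(tableSizeMB: Double, splitSizeMB: Int): Int = {
    require(splitSizeMB > 0, "split size must be positive")
    math.max(1, math.ceil(tableSizeMB / splitSizeMB).toInt)
  }
}
```

Under these assumptions, a 1024 MB table read with a 32 MB split size comes out at about 32 partitions, and halving the split size to 16 MB doubles that to about 64.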

Write Data to Cassandra

Writing to Cassandra can become tricky in a lot of situations, especially if you have a lot of UDTs (user-defined types) in Cassandra. Below I demonstrate how to write a simple data set (no UDT) and how to write a complex UDT to Cassandra.

Let’s see how to write simple data set.

E.g. : We wish to insert 2 new courses to our test_keyspace.courses table.

We first create a DataFrame of the desired data and specify the schema for this DataFrame. Please note that the column names of the DataFrame must be the same as the column names of the Cassandra table.

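import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val newCourses = Seq(Row("CS011", "ML - Advance"), Row("CS012", "Sentiment Analysis"))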
val courseSchema = List(StructField("cid", StringType, true), StructField("cname", StringType, true))
val newCoursesDF = spark.createDataFrame(spark.sparkContext.parallelize(newCourses), StructType(courseSchema))

Once we have the DataFrame ready, it’s just a matter of calling the write API with the spark-cassandra-connector format.

 newCoursesDF.write
            .mode(SaveMode.Append)
            .cassandraFormat("courses", "test_keyspace")
            .save()

I have used the SaveMode.Append to append data to the table. You can explore the other modes as per your requirements.

Let’s verify if this worked.

// show() returns Unit, so there is nothing useful to assign to a val here
spark
            .read
            .cassandraFormat("courses", "test_keyspace")
            .options(ReadConf.SplitSizeInMBParam.option(32))
            .load()
            .show()

+-----+--------------------+
|  cid|               cname|
+-----+--------------------+
|CS002|                DBMS|
|CS011|        ML - Advance|
|CS009|Machine Learning ...|
|CS004|   Intro to Big Data|
|CS008|Science of Progra...|
|CS007|    Network Security|
|CS012|  Sentiment Analysis|
|CS003|      Basics of Java|
|CS006|System Software A...|
|CS001|        Basics of CS|
|CS005|Design and Analys...|
|CS010|     Cloud Computing|
+-----+--------------------+

Voilà! We have successfully inserted 2 rows.

Let’s see how to insert a complex UDT into Cassandra. The concept remains the same: we create a DataFrame and write it to Cassandra. Creating a DataFrame for complex UDT types is tricky, though. For this we use a beautiful Scala concept called the case class. Since StructField does not support a custom DataType, we need to create case classes to build a DataFrame with a user-defined column. Let’s go ahead and create the case classes and the DataFrame for our data.

E.g. I want to insert a new row in our test_keyspace.students table.

Steps to follow are:

  1. Create case classes for the required UDT.

    case class students_cc(id : Int, year_graduated : String, courses_registered : List[cid_sem], name : String)
    case class cid_sem(cid : String, sem : String)
    
    
  2. Create a DataFrame based on the case classes

    val listOfCoursesSem = List(cid_sem("CS003", "Spring_2011"),
                cid_sem("CS006", "Summer_2011"),
                cid_sem("CS009", "Fall_2011")
            )
    
    val newStudents = Seq(students_cc(12, "2011", listOfCoursesSem, "Black Panther"))
    
    val newStudentsRDD = spark.sparkContext.parallelize(newStudents)
    val newStudentsDF = spark.createDataFrame(newStudentsRDD)
    
    
  3. Write the DataFrame to Cassandra.

    newStudentsDF.write
                .mode(SaveMode.Append)
                .cassandraFormat("students", "test_keyspace")
                .save()
    
    
  4. Verify if insert is successful

    // show() returns Unit, so the result is not assigned to a val
    spark
                            .read
                            .cassandraFormat("students", "test_keyspace")
                            .options(ReadConf.SplitSizeInMBParam.option(32))
                            .load()
                            .show(false)
    
    +---+--------------+-------------------------------------------------------------+----------------+
    |id |year_graduated|courses_registered                                           |name            |
    +---+--------------+-------------------------------------------------------------+----------------+
    |5  |2004          |[[CS004,Spring_2004], [CS005,Summer_2004], [CS003,Fall_2004]]|Sheldon Cooper  |
    |5  |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Pika Achu       |
    |10 |2009          |[[CS010,Spring_2009], [CS002,Summer_2009], [CS007,Fall_2009]]|Hermoine Granger|
    |11 |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Peter Parker    |
    |1  |2001          |[[CS001,Spring_2001], [CS002,Summer_2001], [CS001,Fall_2001]]|Tom Riddle      |
    |8  |2007          |[[CS001,Spring_2007], [CS003,Summer_2007], [CS009,Fall_2007]]|Cerci Lannister |
    |2  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Ned Stark       |
    |4  |2003          |[[CS009,Spring_2003], [CS010,Summer_2003], [CS004,Fall_2003]]|Frodo Baggins   |
    |7  |2006          |[[CS004,Spring_2006], [CS005,Summer_2006], [CS003,Fall_2006]]|Stephan Hawkings|
    |6  |2005          |[[CS009,Spring_2005], [CS006,Summer_2005], [CS004,Fall_2005]]|Tony Stark      |
    |9  |2008          |[[CS006,Spring_2008], [CS007,Summer_2008], [CS009,Fall_2008]]|Wonder Woman    |
    |12 |2011          |[[CS003,Spring_2011], [CS006,Summer_2011], [CS009,Fall_2011]]|Black Panther   |
    |3  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Haan Solo       |
    +---+--------------+-------------------------------------------------------------+----------------+
    

Here we go! Black Panther is now a part of test_keyspace.students.

Apart from using the write API, you can also use saveToCassandra to write RDDs to Cassandra. I prefer writing a DataFrame to Cassandra, as this enforces a data structure and reduces the chances of inserting bad data.

For a more complex scenario, you can add null checks while reading from and writing to Cassandra using Scala’s Option[Any] type. This provides a strong mechanism to avoid reading and writing NULL values. Nobody likes NULLs ;) DataFrames are known to throw exceptions for NULL values unless handled carefully.
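As a small illustration of the Option idea, the sketch below models a possibly missing column as an Option[String] field in a case class. The case class and helper names are hypothetical and not part of the example project. The point is only that a missing value becomes None instead of a null that can blow up later.

```scala
object NullSafetySketch {
  // `name` may be missing in the source data, so it is modelled as an Option.
  case class StudentRecord(id: Int, name: Option[String])

  // Write side: wrap a possibly-null raw value. Option(null) is None,
  // and blank strings are treated as missing as well.
  def fromRaw(id: Int, rawName: String): StudentRecord =
    StudentRecord(id, Option(rawName).map(_.trim).filter(_.nonEmpty))

  // Read side: supply an explicit default instead of risking a NullPointerException.
  def displayName(s: StudentRecord): String =
    s.name.getOrElse("<unknown>")
}
```

For example, NullSafetySketch.fromRaw(1, null) yields a record whose name is None, and displayName on that record returns the default instead of failing.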

Fun with Joins

Let’s take a simple use case and perform a simple join to show how easy and convenient it is to do data analysis on the fly once we have the data loaded in our DataFrames, studentsDF and coursesDF.

Use Case : Get the names of students, courses taken, semester, course id, year_graduated.

Notice that studentsDF has a column, courses_registered, that holds a list of user-defined values. Let’s go ahead and flatten the data so that we have a single, atomic value in each column. I use the DataFrame function explode to flatten the DataFrame. Once flattened, I drop the temporary flatDataCol column and the UDT column courses_registered.

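import org.apache.spark.sql.functions.explode
import spark.implicits._   // enables the $"column" syntax

val flatData = studentsDF.withColumn("flatDataCol", explode($"courses_registered"))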
            .withColumn("cid", $"flatDataCol".getItem("cid"))
            .withColumn("sem", $"flatDataCol".getItem("sem"))
            .drop("flatDataCol")
            .drop("courses_registered")

flatData.show(false)

+---+--------------+----------------+-----+-----------+
|id |year_graduated|name            |cid  |sem        |
+---+--------------+----------------+-----+-----------+
|5  |2004          |Sheldon Cooper  |CS004|Spring_2004|
|5  |2004          |Sheldon Cooper  |CS005|Summer_2004|
|5  |2004          |Sheldon Cooper  |CS003|Fall_2004  |
|5  |2010          |Pika Achu       |CS001|Spring_2010|
|5  |2010          |Pika Achu       |CS002|Summer_2010|
|5  |2010          |Pika Achu       |CS005|Fall_2010  |
|10 |2009          |Hermoine Granger|CS010|Spring_2009|
|10 |2009          |Hermoine Granger|CS002|Summer_2009|
|10 |2009          |Hermoine Granger|CS007|Fall_2009  |
|11 |2010          |Peter Parker    |CS001|Spring_2010|
|11 |2010          |Peter Parker    |CS002|Summer_2010|
|11 |2010          |Peter Parker    |CS005|Fall_2010  |
|1  |2001          |Tom Riddle      |CS001|Spring_2001|
|1  |2001          |Tom Riddle      |CS002|Summer_2001|
|1  |2001          |Tom Riddle      |CS001|Fall_2001  |
|8  |2007          |Cerci Lannister |CS001|Spring_2007|
|8  |2007          |Cerci Lannister |CS003|Summer_2007|
|8  |2007          |Cerci Lannister |CS009|Fall_2007  |
|2  |2002          |Ned Stark       |CS003|Spring_2002|
|2  |2002          |Ned Stark       |CS004|Summer_2002|
+---+--------------+----------------+-----+-----------+
only showing top 20 rows

A single row

+---+--------------+-------------------------------------------------------------+----------------+
|id |year_graduated|courses_registered                                           |name            |
+---+--------------+-------------------------------------------------------------+----------------+
|10 |2009          |[[CS010,Spring_2009], [CS002,Summer_2009], [CS007,Fall_2009]]|Hermoine Granger|
+---+--------------+-------------------------------------------------------------+----------------+

is split into three rows after flattening the data.

+---+--------------+----------------+-----+-----------+
|id |year_graduated|name            |cid  |sem        |
+---+--------------+----------------+-----+-----------+
|10 |2009          |Hermoine Granger|CS010|Spring_2009|
|10 |2009          |Hermoine Granger|CS002|Summer_2009|
|10 |2009          |Hermoine Granger|CS007|Fall_2009  |
+---+--------------+----------------+-----+-----------+
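To make the flattening step more tangible, here is a plain Scala sketch of what it computes, with no Spark involved. It is a model of the behaviour, not the DataFrame API: explode followed by pulling cid and sem out of the struct behaves much like a flatMap over the nested list. All type and method names here are invented for the illustration.

```scala
object ExplodeSketch {
  case class CidSem(cid: String, sem: String)
  case class Student(id: Int, yearGraduated: String,
                     coursesRegistered: List[CidSem], name: String)
  case class FlatStudent(id: Int, yearGraduated: String,
                         name: String, cid: String, sem: String)

  // One output row per element of the nested list; the other columns
  // are simply repeated on every produced row.
  def flatten(students: Seq[Student]): Seq[FlatStudent] =
    students.flatMap { s =>
      s.coursesRegistered.map { c =>
        FlatStudent(s.id, s.yearGraduated, s.name, c.cid, c.sem)
      }
    }
}
```

Note that a student with an empty course list produces no rows at all in this model, which matches how explode treats empty arrays (Spark's explode_outer keeps such rows instead).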

Now we can join the DataFrames flatData and coursesDF on the cid column to get the desired output.

val result = flatData.join(coursesDF, Seq("cid"))
            .select("name", "cname", "sem", "cid", "year_graduated")

result.show(false)


+----------------+---------------------------------+-----------+-----+--------------+
|name            |cname                            |sem        |cid  |year_graduated|
+----------------+---------------------------------+-----------+-----+--------------+
|Sheldon Cooper  |Intro to Big Data                |Spring_2004|CS004|2004          |
|Sheldon Cooper  |Design and Analysis of Algorithms|Summer_2004|CS005|2004          |
|Sheldon Cooper  |Basics of Java                   |Fall_2004  |CS003|2004          |
|Pika Achu       |Basics of CS                     |Spring_2010|CS001|2010          |
|Pika Achu       |DBMS                             |Summer_2010|CS002|2010          |
|Pika Achu       |Design and Analysis of Algorithms|Fall_2010  |CS005|2010          |
|Hermoine Granger|Cloud Computing                  |Spring_2009|CS010|2009          |
|Hermoine Granger|DBMS                             |Summer_2009|CS002|2009          |
|Hermoine Granger|Network Security                 |Fall_2009  |CS007|2009          |
|Peter Parker    |Basics of CS                     |Spring_2010|CS001|2010          |
|Peter Parker    |DBMS                             |Summer_2010|CS002|2010          |
|Peter Parker    |Design and Analysis of Algorithms|Fall_2010  |CS005|2010          |
|Tom Riddle      |Basics of CS                     |Spring_2001|CS001|2001          |
|Tom Riddle      |DBMS                             |Summer_2001|CS002|2001          |
|Tom Riddle      |Basics of CS                     |Fall_2001  |CS001|2001          |
|Cerci Lannister |Basics of CS                     |Spring_2007|CS001|2007          |
|Cerci Lannister |Basics of Java                   |Summer_2007|CS003|2007          |
|Cerci Lannister |Machine Learning - Intro         |Fall_2007  |CS009|2007          |
|Ned Stark       |Basics of Java                   |Spring_2002|CS003|2002          |
|Ned Stark       |Intro to Big Data                |Summer_2002|CS004|2002          |
+----------------+---------------------------------+-----------+-----+--------------+
only showing top 20 rows
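For readers who want to reason about the join without a cluster, the following plain Scala sketch models the same inner join on cid using ordinary collections. It is only a model of the semantics, not the Spark API, and it assumes cid is unique in the courses data. The names are hypothetical.

```scala
object JoinSketch {
  case class Enrollment(name: String, cid: String, sem: String)
  case class Course(cid: String, cname: String)
  case class Result(name: String, cname: String, sem: String, cid: String)

  // Inner join on cid: an enrollment whose cid has no matching course is dropped.
  def innerJoin(enrollments: Seq[Enrollment], courses: Seq[Course]): Seq[Result] = {
    // Build a lookup table from course id to course name (assumes unique cids).
    val nameByCid: Map[String, String] = courses.map(c => c.cid -> c.cname).toMap
    enrollments.flatMap { e =>
      nameByCid.get(e.cid).map(cname => Result(e.name, cname, e.sem, e.cid)).toList
    }
  }
}
```

Building the lookup map first and probing it per row is similar in spirit to a broadcast hash join, which is one of the strategies Spark can choose when one side of a join is small.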


In this particular example, we bring the data from Cassandra into Spark and then perform the join. This is ideal if you have a small dataset or a well-provisioned Spark infrastructure. However, for bigger tables or limited Spark resources, you can consider performing a server-side join using the joinWithCassandraTable API provided by spark-cassandra-connector. Here’s an example.

Run with spark-submit

You can run the same job using spark-submit.

First, let’s go ahead and build the jar.

Pavans-MacBook-Pro:Spark_Cassandra_Example pavanpkulkarni$ pwd
/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example
Pavans-MacBook-Pro:Spark_Cassandra_Example pavanpkulkarni$ gradle clean build

> Task :compileScala
Pruning sources from previous analysis, due to incompatible CompileSetup.


BUILD SUCCESSFUL in 11s
3 actionable tasks: 3 executed
Pavans-MacBook-Pro:Spark_Cassandra_Example pavanpkulkarni$

We now have build/libs/Spark_Cassandra_Example-1.0.jar ready for deployment.

Before running spark-submit make sure that you have the correct Spark and Scala versions. I had to downgrade my Spark to 2.2.1 and Scala to 2.11 to run the jar.
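One way to catch a version mismatch early is to compare the Scala suffix of an artifact name with the Scala version you are running. The sketch below is a hypothetical helper, not something provided by Spark or Gradle, and it assumes the usual Scala 2.x naming convention where artifacts end in a suffix such as _2.11.

```scala
object VersionCheckSketch {
  // "2.11.12" -> "2.11": for Scala 2.x the binary version is the first two components.
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  // "spark-cassandra-connector_2.11" -> Some("2.11"); None when there is no suffix.
  def artifactScalaSuffix(artifact: String): Option[String] = {
    val idx = artifact.lastIndexOf('_')
    if (idx >= 0 && idx < artifact.length - 1) Some(artifact.substring(idx + 1))
    else None
  }

  // True when the artifact was built for the given Scala version.
  def isCompatible(artifact: String, scalaVersion: String): Boolean =
    artifactScalaSuffix(artifact).contains(binaryVersion(scalaVersion))
}
```

A possible use is VersionCheckSketch.isCompatible("spark-cassandra-connector_2.11", scala.util.Properties.versionNumberString), which is true only when the running Scala version is a 2.11.x release.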

Pavans-MacBook-Pro:Spark_Cassandra_Example pavanpkulkarni$ /Users/pavanpkulkarni/Documents/spark/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --master local[4] --verbose --class com.pavanpkulkarni.cassandra.SparkCassandra build/libs/Spark_Cassandra_Example-1.0.jar
Using properties file: null
Parsed arguments:
  master                  local[4]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.pavanpkulkarni.cassandra.SparkCassandra
  primaryResource         file:/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example/build/libs/Spark_Cassandra_Example-1.0.jar
  name                    com.pavanpkulkarni.cassandra.SparkCassandra
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:



Main class:
com.pavanpkulkarni.cassandra.SparkCassandra
Arguments:

System properties:
(SPARK_SUBMIT,true)
(spark.app.name,com.pavanpkulkarni.cassandra.SparkCassandra)
(spark.jars,*********(redacted))
(spark.submit.deployMode,client)
(spark.master,local[4])
Classpath elements:
file:/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example/build/libs/Spark_Cassandra_Example-1.0.jar


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/26 16:06:11 INFO SparkContext: Running Spark version 2.2.1
18/04/26 16:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/26 16:06:11 INFO SparkContext: Submitted application: Spark_Cassandra
18/04/26 16:06:11 INFO SecurityManager: Changing view acls to: pavanpkulkarni
18/04/26 16:06:11 INFO SecurityManager: Changing modify acls to: pavanpkulkarni
18/04/26 16:06:11 INFO SecurityManager: Changing view acls groups to:
18/04/26 16:06:11 INFO SecurityManager: Changing modify acls groups to:
18/04/26 16:06:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(pavanpkulkarni); groups with view permissions: Set(); users  with modify permissions: Set(pavanpkulkarni); groups with modify permissions: Set()
18/04/26 16:06:12 INFO Utils: Successfully started service 'sparkDriver' on port 60436.
18/04/26 16:06:12 INFO SparkEnv: Registering MapOutputTracker
18/04/26 16:06:12 INFO SparkEnv: Registering BlockManagerMaster
18/04/26 16:06:12 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/04/26 16:06:12 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/04/26 16:06:12 INFO DiskBlockManager: Created local directory at /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/blockmgr-c84256e9-8a2c-443a-bbcc-c3d9cf98b1c9
18/04/26 16:06:12 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/04/26 16:06:12 INFO SparkEnv: Registering OutputCommitCoordinator
18/04/26 16:06:12 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/04/26 16:06:12 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.67:4040
18/04/26 16:06:12 INFO SparkContext: Added JAR file:/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example/build/libs/Spark_Cassandra_Example-1.0.jar at spark://10.0.0.67:60436/jars/Spark_Cassandra_Example-1.0.jar with timestamp 1524773172651
18/04/26 16:06:12 INFO Executor: Starting executor ID driver on host localhost
18/04/26 16:06:12 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60437.
18/04/26 16:06:12 INFO NettyBlockTransferService: Server created on 10.0.0.67:60437
18/04/26 16:06:12 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/04/26 16:06:12 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.67, 60437, None)
18/04/26 16:06:12 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.67:60437 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.67, 60437, None)
18/04/26 16:06:12 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.67, 60437, None)
18/04/26 16:06:12 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.67, 60437, None)
18/04/26 16:06:13 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example/spark-warehouse/').
18/04/26 16:06:13 INFO SharedState: Warehouse path is 'file:/Users/pavanpkulkarni/Documents/workspace/Spark_Cassandra_Example/spark-warehouse/'.
18/04/26 16:06:14 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
18/04/26 16:06:14 INFO ClockFactory: Using native clock to generate timestamps.
18/04/26 16:06:14 WARN NettyUtil: Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
18/04/26 16:06:14 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
18/04/26 16:06:14 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
18/04/26 16:06:16 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:17 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:17 INFO CodeGenerator: Code generated in 306.125232 ms
18/04/26 16:06:17 INFO CodeGenerator: Code generated in 29.836876 ms
18/04/26 16:06:18 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:38
18/04/26 16:06:18 INFO DAGScheduler: Got job 0 (show at SparkScalaCassandra.scala:38) with 1 output partitions
18/04/26 16:06:18 INFO DAGScheduler: Final stage: ResultStage 0 (show at SparkScalaCassandra.scala:38)
18/04/26 16:06:18 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:18 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at show at SparkScalaCassandra.scala:38), which has no missing parents
18/04/26 16:06:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 15.5 KB, free 366.3 MB)
18/04/26 16:06:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.5 KB, free 366.3 MB)
18/04/26 16:06:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.67:60437 (size: 7.5 KB, free: 366.3 MB)
18/04/26 16:06:18 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:18 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at show at SparkScalaCassandra.scala:38) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/04/26 16:06:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/04/26 16:06:18 INFO Executor: Fetching spark://10.0.0.67:60436/jars/Spark_Cassandra_Example-1.0.jar with timestamp 1524773172651
18/04/26 16:06:18 INFO TransportClientFactory: Successfully created connection to /10.0.0.67:60436 after 11 ms (0 ms spent in bootstraps)
18/04/26 16:06:18 INFO Utils: Fetching spark://10.0.0.67:60436/jars/Spark_Cassandra_Example-1.0.jar to /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-895883ed-54b7-4d41-ae05-7808595c73e2/userFiles-fbdf8a06-8f5b-4154-bd0f-9b0dbdffb9dd/fetchFileTemp6843564401530655139.tmp
18/04/26 16:06:18 INFO Executor: Adding file:/private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-895883ed-54b7-4d41-ae05-7808595c73e2/userFiles-fbdf8a06-8f5b-4154-bd0f-9b0dbdffb9dd/Spark_Cassandra_Example-1.0.jar to class loader
18/04/26 16:06:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1777 bytes result sent to driver
18/04/26 16:06:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 716 ms on localhost (executor driver) (1/1)
18/04/26 16:06:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/04/26 16:06:19 INFO DAGScheduler: ResultStage 0 (show at SparkScalaCassandra.scala:38) finished in 0.741 s
18/04/26 16:06:19 INFO DAGScheduler: Job 0 finished: show at SparkScalaCassandra.scala:38, took 1.008902 s
18/04/26 16:06:19 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:38
18/04/26 16:06:19 INFO DAGScheduler: Got job 1 (show at SparkScalaCassandra.scala:38) with 3 output partitions
18/04/26 16:06:19 INFO DAGScheduler: Final stage: ResultStage 1 (show at SparkScalaCassandra.scala:38)
18/04/26 16:06:19 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:19 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:19 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[4] at show at SparkScalaCassandra.scala:38), which has no missing parents
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.5 KB, free 366.3 MB)
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.5 KB, free 366.3 MB)
18/04/26 16:06:19 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.67:60437 (size: 7.5 KB, free: 366.3 MB)
18/04/26 16:06:19 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:19 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at show at SparkScalaCassandra.scala:38) (first 15 tasks are for partitions Vector(1, 2, 3))
18/04/26 16:06:19 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
18/04/26 16:06:19 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 1, NODE_LOCAL, 15336 bytes)
18/04/26 16:06:19 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
18/04/26 16:06:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1494 bytes result sent to driver
18/04/26 16:06:19 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, executor driver, partition 2, NODE_LOCAL, 17240 bytes)
18/04/26 16:06:19 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 161 ms on localhost (executor driver) (1/3)
18/04/26 16:06:19 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
18/04/26 16:06:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 1499 bytes result sent to driver
18/04/26 16:06:19 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, localhost, executor driver, partition 3, NODE_LOCAL, 6363 bytes)
18/04/26 16:06:19 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 169 ms on localhost (executor driver) (2/3)
18/04/26 16:06:19 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
18/04/26 16:06:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 1093 bytes result sent to driver
18/04/26 16:06:19 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 15 ms on localhost (executor driver) (3/3)
18/04/26 16:06:19 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/04/26 16:06:19 INFO DAGScheduler: ResultStage 1 (show at SparkScalaCassandra.scala:38) finished in 0.344 s
18/04/26 16:06:19 INFO DAGScheduler: Job 1 finished: show at SparkScalaCassandra.scala:38, took 0.369561 s
+---+--------------+-------------------------------------------------------------+----------------+
|id |year_graduated|courses_registered                                           |name            |
+---+--------------+-------------------------------------------------------------+----------------+
|5  |2004          |[[CS004,Spring_2004], [CS005,Summer_2004], [CS003,Fall_2004]]|Sheldon Cooper  |
|5  |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Pika Achu       |
|10 |2009          |[[CS010,Spring_2009], [CS002,Summer_2009], [CS007,Fall_2009]]|Hermoine Granger|
|11 |2010          |[[CS001,Spring_2010], [CS002,Summer_2010], [CS005,Fall_2010]]|Peter Parker    |
|1  |2001          |[[CS001,Spring_2001], [CS002,Summer_2001], [CS001,Fall_2001]]|Tom Riddle      |
|8  |2007          |[[CS001,Spring_2007], [CS003,Summer_2007], [CS009,Fall_2007]]|Cerci Lannister |
|2  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Ned Stark       |
|4  |2003          |[[CS009,Spring_2003], [CS010,Summer_2003], [CS004,Fall_2003]]|Frodo Baggins   |
|7  |2006          |[[CS004,Spring_2006], [CS005,Summer_2006], [CS003,Fall_2006]]|Stephan Hawkings|
|6  |2005          |[[CS009,Spring_2005], [CS006,Summer_2005], [CS004,Fall_2005]]|Tony Stark      |
|9  |2008          |[[CS006,Spring_2008], [CS007,Summer_2008], [CS009,Fall_2008]]|Wonder Woman    |
|12 |2011          |[[CS003,Spring_2011], [CS006,Summer_2011], [CS009,Fall_2011]]|Black Panther   |
|3  |2002          |[[CS003,Spring_2002], [CS004,Summer_2002], [CS005,Fall_2002]]|Haan Solo       |
+---+--------------+-------------------------------------------------------------+----------------+

root
 |-- id: integer (nullable = true)
 |-- year_graduated: string (nullable = true)
 |-- courses_registered: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cid: string (nullable = true)
 |    |    |-- sem: string (nullable = true)
 |-- name: string (nullable = true)

Courses are :
18/04/26 16:06:19 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:19 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:19 INFO CodeGenerator: Code generated in 9.961075 ms
18/04/26 16:06:19 INFO CodeGenerator: Code generated in 7.665927 ms
18/04/26 16:06:19 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:49
18/04/26 16:06:19 INFO DAGScheduler: Got job 2 (show at SparkScalaCassandra.scala:49) with 1 output partitions
18/04/26 16:06:19 INFO DAGScheduler: Final stage: ResultStage 2 (show at SparkScalaCassandra.scala:49)
18/04/26 16:06:19 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:19 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:19 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[9] at show at SparkScalaCassandra.scala:49), which has no missing parents
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.2 KB, free 366.2 MB)
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.9 KB, free 366.2 MB)
18/04/26 16:06:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.67:60437 (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:19 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:19 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[9] at show at SparkScalaCassandra.scala:49) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:19 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
18/04/26 16:06:19 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:19 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
18/04/26 16:06:19 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1392 bytes result sent to driver
18/04/26 16:06:19 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 160 ms on localhost (executor driver) (1/1)
18/04/26 16:06:19 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
18/04/26 16:06:19 INFO DAGScheduler: ResultStage 2 (show at SparkScalaCassandra.scala:49) finished in 0.161 s
18/04/26 16:06:19 INFO DAGScheduler: Job 2 finished: show at SparkScalaCassandra.scala:49, took 0.172135 s
18/04/26 16:06:19 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:49
18/04/26 16:06:19 INFO DAGScheduler: Got job 3 (show at SparkScalaCassandra.scala:49) with 3 output partitions
18/04/26 16:06:19 INFO DAGScheduler: Final stage: ResultStage 3 (show at SparkScalaCassandra.scala:49)
18/04/26 16:06:19 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:19 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:19 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[9] at show at SparkScalaCassandra.scala:49), which has no missing parents
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 11.2 KB, free 366.2 MB)
18/04/26 16:06:19 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.9 KB, free 366.2 MB)
18/04/26 16:06:19 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.0.0.67:60437 (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:19 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:19 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (MapPartitionsRDD[9] at show at SparkScalaCassandra.scala:49) (first 15 tasks are for partitions Vector(1, 2, 3))
18/04/26 16:06:19 INFO TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
18/04/26 16:06:19 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 5, localhost, executor driver, partition 1, NODE_LOCAL, 15336 bytes)
18/04/26 16:06:19 INFO Executor: Running task 0.0 in stage 3.0 (TID 5)
18/04/26 16:06:20 INFO Executor: Finished task 0.0 in stage 3.0 (TID 5). 1377 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 6, localhost, executor driver, partition 2, NODE_LOCAL, 17240 bytes)
18/04/26 16:06:20 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 5) in 105 ms on localhost (executor driver) (1/3)
18/04/26 16:06:20 INFO Executor: Running task 1.0 in stage 3.0 (TID 6)
18/04/26 16:06:20 INFO Executor: Finished task 1.0 in stage 3.0 (TID 6). 1229 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 7, localhost, executor driver, partition 3, NODE_LOCAL, 6363 bytes)
18/04/26 16:06:20 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 6) in 132 ms on localhost (executor driver) (2/3)
18/04/26 16:06:20 INFO Executor: Running task 2.0 in stage 3.0 (TID 7)
18/04/26 16:06:20 INFO Executor: Finished task 2.0 in stage 3.0 (TID 7). 1093 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 7) in 13 ms on localhost (executor driver) (3/3)
18/04/26 16:06:20 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
18/04/26 16:06:20 INFO DAGScheduler: ResultStage 3 (show at SparkScalaCassandra.scala:49) finished in 0.248 s
18/04/26 16:06:20 INFO DAGScheduler: Job 3 finished: show at SparkScalaCassandra.scala:49, took 0.253943 s
+-----+--------------------+
|  cid|               cname|
+-----+--------------------+
|CS002|                DBMS|
|CS011|        ML - Advance|
|CS009|Machine Learning ...|
|CS004|   Intro to Big Data|
|CS008|Science of Progra...|
|CS007|    Network Security|
|CS012|  Sentiment Analysis|
|CS003|      Basics of Java|
|CS006|System Software A...|
|CS001|        Basics of CS|
|CS005|Design and Analys...|
|CS010|     Cloud Computing|
+-----+--------------------+

18/04/26 16:06:20 INFO SparkContext: Starting job: runJob at RDDFunctions.scala:36
18/04/26 16:06:20 INFO DAGScheduler: Got job 4 (runJob at RDDFunctions.scala:36) with 1 output partitions
18/04/26 16:06:20 INFO DAGScheduler: Final stage: ResultStage 4 (runJob at RDDFunctions.scala:36)
18/04/26 16:06:20 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:20 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:20 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[15] at save at SparkScalaCassandra.scala:59), which has no missing parents
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 16.2 KB, free 366.2 MB)
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 8.0 KB, free 366.2 MB)
18/04/26 16:06:20 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.0.0.67:60437 (size: 8.0 KB, free: 366.3 MB)
18/04/26 16:06:20 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[15] at save at SparkScalaCassandra.scala:59) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:20 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/04/26 16:06:20 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, executor driver, partition 0, PROCESS_LOCAL, 5026 bytes)
18/04/26 16:06:20 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
18/04/26 16:06:20 INFO CodeGenerator: Code generated in 12.242964 ms
18/04/26 16:06:20 INFO CodeGenerator: Code generated in 36.992848 ms
18/04/26 16:06:20 INFO TableWriter: Wrote 2 rows to test_keyspace.courses in 0.105 s.
18/04/26 16:06:20 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1099 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 156 ms on localhost (executor driver) (1/1)
18/04/26 16:06:20 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
18/04/26 16:06:20 INFO DAGScheduler: ResultStage 4 (runJob at RDDFunctions.scala:36) finished in 0.157 s
18/04/26 16:06:20 INFO DAGScheduler: Job 4 finished: runJob at RDDFunctions.scala:36, took 0.183104 s
18/04/26 16:06:20 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:20 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:20 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:66
18/04/26 16:06:20 INFO DAGScheduler: Got job 5 (show at SparkScalaCassandra.scala:66) with 1 output partitions
18/04/26 16:06:20 INFO DAGScheduler: Final stage: ResultStage 5 (show at SparkScalaCassandra.scala:66)
18/04/26 16:06:20 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:20 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:20 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[22] at show at SparkScalaCassandra.scala:66), which has no missing parents
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.2 KB, free 366.2 MB)
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.9 KB, free 366.2 MB)
18/04/26 16:06:20 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 10.0.0.67:60437 (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:20 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[22] at show at SparkScalaCassandra.scala:66) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:20 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
18/04/26 16:06:20 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 9, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:20 INFO Executor: Running task 0.0 in stage 5.0 (TID 9)
18/04/26 16:06:20 INFO Executor: Finished task 0.0 in stage 5.0 (TID 9). 1392 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 9) in 121 ms on localhost (executor driver) (1/1)
18/04/26 16:06:20 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
18/04/26 16:06:20 INFO DAGScheduler: ResultStage 5 (show at SparkScalaCassandra.scala:66) finished in 0.122 s
18/04/26 16:06:20 INFO DAGScheduler: Job 5 finished: show at SparkScalaCassandra.scala:66, took 0.132220 s
18/04/26 16:06:20 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:66
18/04/26 16:06:20 INFO DAGScheduler: Got job 6 (show at SparkScalaCassandra.scala:66) with 3 output partitions
18/04/26 16:06:20 INFO DAGScheduler: Final stage: ResultStage 6 (show at SparkScalaCassandra.scala:66)
18/04/26 16:06:20 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:20 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:20 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[22] at show at SparkScalaCassandra.scala:66), which has no missing parents
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 11.2 KB, free 366.2 MB)
18/04/26 16:06:20 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.9 KB, free 366.2 MB)
18/04/26 16:06:20 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.0.0.67:60437 (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:20 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:20 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 6 (MapPartitionsRDD[22] at show at SparkScalaCassandra.scala:66) (first 15 tasks are for partitions Vector(1, 2, 3))
18/04/26 16:06:20 INFO TaskSchedulerImpl: Adding task set 6.0 with 3 tasks
18/04/26 16:06:20 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 10, localhost, executor driver, partition 1, NODE_LOCAL, 15336 bytes)
18/04/26 16:06:20 INFO Executor: Running task 0.0 in stage 6.0 (TID 10)
18/04/26 16:06:20 INFO Executor: Finished task 0.0 in stage 6.0 (TID 10). 1377 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 11, localhost, executor driver, partition 2, NODE_LOCAL, 17240 bytes)
18/04/26 16:06:20 INFO Executor: Running task 1.0 in stage 6.0 (TID 11)
18/04/26 16:06:20 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 10) in 96 ms on localhost (executor driver) (1/3)
18/04/26 16:06:20 INFO Executor: Finished task 1.0 in stage 6.0 (TID 11). 1229 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 12, localhost, executor driver, partition 3, NODE_LOCAL, 6363 bytes)
18/04/26 16:06:20 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 11) in 107 ms on localhost (executor driver) (2/3)
18/04/26 16:06:20 INFO Executor: Running task 2.0 in stage 6.0 (TID 12)
18/04/26 16:06:20 INFO Executor: Finished task 2.0 in stage 6.0 (TID 12). 1093 bytes result sent to driver
18/04/26 16:06:20 INFO TaskSetManager: Finished task 2.0 in stage 6.0 (TID 12) in 7 ms on localhost (executor driver) (3/3)
18/04/26 16:06:20 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
18/04/26 16:06:20 INFO DAGScheduler: ResultStage 6 (show at SparkScalaCassandra.scala:66) finished in 0.209 s
18/04/26 16:06:20 INFO DAGScheduler: Job 6 finished: show at SparkScalaCassandra.scala:66, took 0.222623 s
+-----+--------------------+
|  cid|               cname|
+-----+--------------------+
|CS002|                DBMS|
|CS011|        ML - Advance|
|CS009|Machine Learning ...|
|CS004|   Intro to Big Data|
|CS008|Science of Progra...|
|CS007|    Network Security|
|CS012|  Sentiment Analysis|
|CS003|      Basics of Java|
|CS006|System Software A...|
|CS001|        Basics of CS|
|CS005|Design and Analys...|
|CS010|     Cloud Computing|
+-----+--------------------+

18/04/26 16:06:21 INFO CodeGenerator: Code generated in 33.189671 ms
18/04/26 16:06:21 INFO SparkContext: Starting job: runJob at RDDFunctions.scala:36
18/04/26 16:06:21 INFO DAGScheduler: Got job 7 (runJob at RDDFunctions.scala:36) with 1 output partitions
18/04/26 16:06:21 INFO DAGScheduler: Final stage: ResultStage 7 (runJob at RDDFunctions.scala:36)
18/04/26 16:06:21 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:21 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:21 INFO DAGScheduler: Submitting ResultStage 7 (MapPartitionsRDD[28] at save at SparkScalaCassandra.scala:83), which has no missing parents
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 35.0 KB, free 366.1 MB)
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 13.5 KB, free 366.1 MB)
18/04/26 16:06:21 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.0.0.67:60437 (size: 13.5 KB, free: 366.2 MB)
18/04/26 16:06:21 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:21 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 7 (MapPartitionsRDD[28] at save at SparkScalaCassandra.scala:83) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:21 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
18/04/26 16:06:21 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 13, localhost, executor driver, partition 0, PROCESS_LOCAL, 5356 bytes)
18/04/26 16:06:21 INFO Executor: Running task 0.0 in stage 7.0 (TID 13)
18/04/26 16:06:21 INFO ContextCleaner: Cleaned accumulator 150
18/04/26 16:06:21 INFO CodeGenerator: Code generated in 22.929269 ms
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 10.0.0.67:60437 in memory (size: 5.9 KB, free: 366.2 MB)
18/04/26 16:06:21 INFO ContextCleaner: Cleaned accumulator 75
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 10.0.0.67:60437 in memory (size: 8.0 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO ContextCleaner: Cleaned accumulator 74
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.0.0.67:60437 in memory (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.0.0.67:60437 in memory (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.0.0.67:60437 in memory (size: 7.5 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO ContextCleaner: Cleaned accumulator 124
18/04/26 16:06:21 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 10.0.0.67:60437 in memory (size: 5.9 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO ContextCleaner: Cleaned accumulator 149
18/04/26 16:06:21 INFO TableWriter: Wrote 1 rows to test_keyspace.students in 0.033 s.
18/04/26 16:06:21 INFO Executor: Finished task 0.0 in stage 7.0 (TID 13). 1193 bytes result sent to driver
18/04/26 16:06:21 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 13) in 98 ms on localhost (executor driver) (1/1)
18/04/26 16:06:21 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
18/04/26 16:06:21 INFO DAGScheduler: ResultStage 7 (runJob at RDDFunctions.scala:36) finished in 0.099 s
18/04/26 16:06:21 INFO DAGScheduler: Job 7 finished: runJob at RDDFunctions.scala:36, took 0.127054 s
18/04/26 16:06:21 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:21 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:21 INFO CodeGenerator: Code generated in 28.493441 ms
18/04/26 16:06:21 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:90
18/04/26 16:06:21 INFO DAGScheduler: Got job 8 (show at SparkScalaCassandra.scala:90) with 1 output partitions
18/04/26 16:06:21 INFO DAGScheduler: Final stage: ResultStage 8 (show at SparkScalaCassandra.scala:90)
18/04/26 16:06:21 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:21 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:21 INFO DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[35] at show at SparkScalaCassandra.scala:90), which has no missing parents
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 15.5 KB, free 366.2 MB)
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 7.5 KB, free 366.2 MB)
18/04/26 16:06:21 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.0.0.67:60437 (size: 7.5 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:21 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[35] at show at SparkScalaCassandra.scala:90) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:21 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks
18/04/26 16:06:21 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 14, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:21 INFO Executor: Running task 0.0 in stage 8.0 (TID 14)
18/04/26 16:06:21 INFO Executor: Finished task 0.0 in stage 8.0 (TID 14). 1734 bytes result sent to driver
18/04/26 16:06:21 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 14) in 114 ms on localhost (executor driver) (1/1)
18/04/26 16:06:21 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
18/04/26 16:06:21 INFO DAGScheduler: ResultStage 8 (show at SparkScalaCassandra.scala:90) finished in 0.115 s
18/04/26 16:06:21 INFO DAGScheduler: Job 8 finished: show at SparkScalaCassandra.scala:90, took 0.122350 s
18/04/26 16:06:21 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:90
18/04/26 16:06:21 INFO DAGScheduler: Got job 9 (show at SparkScalaCassandra.scala:90) with 3 output partitions
18/04/26 16:06:21 INFO DAGScheduler: Final stage: ResultStage 9 (show at SparkScalaCassandra.scala:90)
18/04/26 16:06:21 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:21 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:21 INFO DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[35] at show at SparkScalaCassandra.scala:90), which has no missing parents
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 15.5 KB, free 366.2 MB)
18/04/26 16:06:21 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 7.5 KB, free 366.2 MB)
18/04/26 16:06:21 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 10.0.0.67:60437 (size: 7.5 KB, free: 366.3 MB)
18/04/26 16:06:21 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:21 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 9 (MapPartitionsRDD[35] at show at SparkScalaCassandra.scala:90) (first 15 tasks are for partitions Vector(1, 2, 3))
18/04/26 16:06:21 INFO TaskSchedulerImpl: Adding task set 9.0 with 3 tasks
18/04/26 16:06:21 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 15, localhost, executor driver, partition 1, NODE_LOCAL, 15336 bytes)
18/04/26 16:06:21 INFO Executor: Running task 0.0 in stage 9.0 (TID 15)
18/04/26 16:06:21 INFO Executor: Finished task 0.0 in stage 9.0 (TID 15). 1494 bytes result sent to driver
18/04/26 16:06:21 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 16, localhost, executor driver, partition 2, NODE_LOCAL, 17240 bytes)
18/04/26 16:06:21 INFO Executor: Running task 1.0 in stage 9.0 (TID 16)
18/04/26 16:06:21 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 15) in 93 ms on localhost (executor driver) (1/3)
18/04/26 16:06:22 INFO Executor: Finished task 1.0 in stage 9.0 (TID 16). 1499 bytes result sent to driver
18/04/26 16:06:22 INFO TaskSetManager: Starting task 2.0 in stage 9.0 (TID 17, localhost, executor driver, partition 3, NODE_LOCAL, 6363 bytes)
18/04/26 16:06:22 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 16) in 118 ms on localhost (executor driver) (2/3)
18/04/26 16:06:22 INFO Executor: Running task 2.0 in stage 9.0 (TID 17)
18/04/26 16:06:22 INFO Executor: Finished task 2.0 in stage 9.0 (TID 17). 1093 bytes result sent to driver
18/04/26 16:06:22 INFO TaskSetManager: Finished task 2.0 in stage 9.0 (TID 17) in 7 ms on localhost (executor driver) (3/3)
18/04/26 16:06:22 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
18/04/26 16:06:22 INFO DAGScheduler: ResultStage 9 (show at SparkScalaCassandra.scala:90) finished in 0.218 s
18/04/26 16:06:22 INFO DAGScheduler: Job 9 finished: show at SparkScalaCassandra.scala:90, took 0.227824 s
+---+--------------+--------------------+----------------+
| id|year_graduated|  courses_registered|            name|
+---+--------------+--------------------+----------------+
|  5|          2004|[[CS004,Spring_20...|  Sheldon Cooper|
|  5|          2010|[[CS001,Spring_20...|       Pika Achu|
| 10|          2009|[[CS010,Spring_20...|Hermoine Granger|
| 11|          2010|[[CS001,Spring_20...|    Peter Parker|
|  1|          2001|[[CS001,Spring_20...|      Tom Riddle|
|  8|          2007|[[CS001,Spring_20...| Cerci Lannister|
|  2|          2002|[[CS003,Spring_20...|       Ned Stark|
|  4|          2003|[[CS009,Spring_20...|   Frodo Baggins|
|  7|          2006|[[CS004,Spring_20...|Stephan Hawkings|
|  6|          2005|[[CS009,Spring_20...|      Tony Stark|
|  9|          2008|[[CS006,Spring_20...|    Wonder Woman|
| 12|          2011|[[CS003,Spring_20...|   Black Panther|
|  3|          2002|[[CS003,Spring_20...|       Haan Solo|
+---+--------------+--------------------+----------------+

18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:22 INFO CodeGenerator: Code generated in 9.09734 ms
18/04/26 16:06:22 INFO CodeGenerator: Code generated in 8.48585 ms
18/04/26 16:06:22 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:100
18/04/26 16:06:22 INFO DAGScheduler: Got job 10 (show at SparkScalaCassandra.scala:100) with 1 output partitions
18/04/26 16:06:22 INFO DAGScheduler: Final stage: ResultStage 10 (show at SparkScalaCassandra.scala:100)
18/04/26 16:06:22 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:22 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:22 INFO DAGScheduler: Submitting ResultStage 10 (MapPartitionsRDD[41] at show at SparkScalaCassandra.scala:100), which has no missing parents
18/04/26 16:06:22 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 21.6 KB, free 366.2 MB)
18/04/26 16:06:22 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 9.6 KB, free 366.2 MB)
18/04/26 16:06:22 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 10.0.0.67:60437 (size: 9.6 KB, free: 366.3 MB)
18/04/26 16:06:22 INFO SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:22 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (MapPartitionsRDD[41] at show at SparkScalaCassandra.scala:100) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:22 INFO TaskSchedulerImpl: Adding task set 10.0 with 1 tasks
18/04/26 16:06:22 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 18, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:22 INFO Executor: Running task 0.0 in stage 10.0 (TID 18)
18/04/26 16:06:22 INFO CodeGenerator: Code generated in 16.150243 ms
18/04/26 16:06:22 INFO Executor: Finished task 0.0 in stage 10.0 (TID 18). 1948 bytes result sent to driver
18/04/26 16:06:22 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 18) in 158 ms on localhost (executor driver) (1/1)
18/04/26 16:06:22 INFO TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool
18/04/26 16:06:22 INFO DAGScheduler: ResultStage 10 (show at SparkScalaCassandra.scala:100) finished in 0.160 s
18/04/26 16:06:22 INFO DAGScheduler: Job 10 finished: show at SparkScalaCassandra.scala:100, took 0.174911 s
+---+--------------+----------------+-----+-----------+
|id |year_graduated|name            |cid  |sem        |
+---+--------------+----------------+-----+-----------+
|5  |2004          |Sheldon Cooper  |CS004|Spring_2004|
|5  |2004          |Sheldon Cooper  |CS005|Summer_2004|
|5  |2004          |Sheldon Cooper  |CS003|Fall_2004  |
|5  |2010          |Pika Achu       |CS001|Spring_2010|
|5  |2010          |Pika Achu       |CS002|Summer_2010|
|5  |2010          |Pika Achu       |CS005|Fall_2010  |
|10 |2009          |Hermoine Granger|CS010|Spring_2009|
|10 |2009          |Hermoine Granger|CS002|Summer_2009|
|10 |2009          |Hermoine Granger|CS007|Fall_2009  |
|11 |2010          |Peter Parker    |CS001|Spring_2010|
|11 |2010          |Peter Parker    |CS002|Summer_2010|
|11 |2010          |Peter Parker    |CS005|Fall_2010  |
|1  |2001          |Tom Riddle      |CS001|Spring_2001|
|1  |2001          |Tom Riddle      |CS002|Summer_2001|
|1  |2001          |Tom Riddle      |CS001|Fall_2001  |
|8  |2007          |Cerci Lannister |CS001|Spring_2007|
|8  |2007          |Cerci Lannister |CS003|Summer_2007|
|8  |2007          |Cerci Lannister |CS009|Fall_2007  |
|2  |2002          |Ned Stark       |CS003|Spring_2002|
|2  |2002          |Ned Stark       |CS004|Summer_2002|
+---+--------------+----------------+-----+-----------+
only showing top 20 rows

18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: []
18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(cid)]
18/04/26 16:06:22 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(cid)]
18/04/26 16:06:22 INFO CodeGenerator: Code generated in 14.307183 ms
18/04/26 16:06:22 INFO CodeGenerator: Code generated in 7.461505 ms
18/04/26 16:06:22 INFO SparkContext: Starting job: run at ThreadPoolExecutor.java:1149
18/04/26 16:06:22 INFO DAGScheduler: Got job 11 (run at ThreadPoolExecutor.java:1149) with 4 output partitions
18/04/26 16:06:22 INFO DAGScheduler: Final stage: ResultStage 11 (run at ThreadPoolExecutor.java:1149)
18/04/26 16:06:22 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:22 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:22 INFO DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[47] at run at ThreadPoolExecutor.java:1149), which has no missing parents
18/04/26 16:06:22 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 12.1 KB, free 366.1 MB)
18/04/26 16:06:22 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 6.1 KB, free 366.1 MB)
18/04/26 16:06:22 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 10.0.0.67:60437 (size: 6.1 KB, free: 366.2 MB)
18/04/26 16:06:22 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:22 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 11 (MapPartitionsRDD[47] at run at ThreadPoolExecutor.java:1149) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
18/04/26 16:06:22 INFO TaskSchedulerImpl: Adding task set 11.0 with 4 tasks
18/04/26 16:06:22 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 19, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:22 INFO Executor: Running task 0.0 in stage 11.0 (TID 19)
18/04/26 16:06:22 INFO Executor: Finished task 0.0 in stage 11.0 (TID 19). 1521 bytes result sent to driver
18/04/26 16:06:22 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 20, localhost, executor driver, partition 1, NODE_LOCAL, 15336 bytes)
18/04/26 16:06:22 INFO TaskSetManager: Finished task 0.0 in stage 11.0 (TID 19) in 101 ms on localhost (executor driver) (1/4)
18/04/26 16:06:22 INFO Executor: Running task 1.0 in stage 11.0 (TID 20)
18/04/26 16:06:23 INFO Executor: Finished task 1.0 in stage 11.0 (TID 20). 1463 bytes result sent to driver
18/04/26 16:06:23 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 21, localhost, executor driver, partition 2, NODE_LOCAL, 17240 bytes)
18/04/26 16:06:23 INFO Executor: Running task 2.0 in stage 11.0 (TID 21)
18/04/26 16:06:23 INFO TaskSetManager: Finished task 1.0 in stage 11.0 (TID 20) in 97 ms on localhost (executor driver) (2/4)
18/04/26 16:06:23 INFO Executor: Finished task 2.0 in stage 11.0 (TID 21). 1315 bytes result sent to driver
18/04/26 16:06:23 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 22, localhost, executor driver, partition 3, NODE_LOCAL, 6363 bytes)
18/04/26 16:06:23 INFO Executor: Running task 3.0 in stage 11.0 (TID 22)
18/04/26 16:06:23 INFO TaskSetManager: Finished task 2.0 in stage 11.0 (TID 21) in 99 ms on localhost (executor driver) (3/4)
18/04/26 16:06:23 INFO Executor: Finished task 3.0 in stage 11.0 (TID 22). 1179 bytes result sent to driver
18/04/26 16:06:23 INFO TaskSetManager: Finished task 3.0 in stage 11.0 (TID 22) in 6 ms on localhost (executor driver) (4/4)
18/04/26 16:06:23 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool
18/04/26 16:06:23 INFO DAGScheduler: ResultStage 11 (run at ThreadPoolExecutor.java:1149) finished in 0.300 s
18/04/26 16:06:23 INFO DAGScheduler: Job 11 finished: run at ThreadPoolExecutor.java:1149, took 0.312731 s
18/04/26 16:06:23 INFO CodeGenerator: Code generated in 5.810994 ms
18/04/26 16:06:23 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 16.0 MB, free 350.1 MB)
18/04/26 16:06:23 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 689.0 B, free 350.1 MB)
18/04/26 16:06:23 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 10.0.0.67:60437 (size: 689.0 B, free: 366.2 MB)
18/04/26 16:06:23 INFO SparkContext: Created broadcast 12 from run at ThreadPoolExecutor.java:1149
18/04/26 16:06:23 INFO CodeGenerator: Code generated in 15.327347 ms
18/04/26 16:06:23 INFO CodeGenerator: Code generated in 8.675457 ms
18/04/26 16:06:23 INFO SparkContext: Starting job: show at SparkScalaCassandra.scala:107
18/04/26 16:06:23 INFO DAGScheduler: Got job 12 (show at SparkScalaCassandra.scala:107) with 1 output partitions
18/04/26 16:06:23 INFO DAGScheduler: Final stage: ResultStage 12 (show at SparkScalaCassandra.scala:107)
18/04/26 16:06:23 INFO DAGScheduler: Parents of final stage: List()
18/04/26 16:06:23 INFO DAGScheduler: Missing parents: List()
18/04/26 16:06:23 INFO DAGScheduler: Submitting ResultStage 12 (MapPartitionsRDD[51] at show at SparkScalaCassandra.scala:107), which has no missing parents
18/04/26 16:06:23 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 26.2 KB, free 350.1 MB)
18/04/26 16:06:23 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 11.1 KB, free 350.1 MB)
18/04/26 16:06:23 INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on 10.0.0.67:60437 (size: 11.1 KB, free: 366.2 MB)
18/04/26 16:06:23 INFO SparkContext: Created broadcast 13 from broadcast at DAGScheduler.scala:1006
18/04/26 16:06:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 12 (MapPartitionsRDD[51] at show at SparkScalaCassandra.scala:107) (first 15 tasks are for partitions Vector(0))
18/04/26 16:06:23 INFO TaskSchedulerImpl: Adding task set 12.0 with 1 tasks
18/04/26 16:06:23 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 23, localhost, executor driver, partition 0, NODE_LOCAL, 17001 bytes)
18/04/26 16:06:23 INFO Executor: Running task 0.0 in stage 12.0 (TID 23)
18/04/26 16:06:23 INFO CodeGenerator: Code generated in 10.479872 ms
18/04/26 16:06:23 INFO Executor: Finished task 0.0 in stage 12.0 (TID 23). 2509 bytes result sent to driver
18/04/26 16:06:23 INFO TaskSetManager: Finished task 0.0 in stage 12.0 (TID 23) in 138 ms on localhost (executor driver) (1/1)
18/04/26 16:06:23 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
18/04/26 16:06:23 INFO DAGScheduler: ResultStage 12 (show at SparkScalaCassandra.scala:107) finished in 0.139 s
18/04/26 16:06:23 INFO DAGScheduler: Job 12 finished: show at SparkScalaCassandra.scala:107, took 0.150774 s
+----------------+---------------------------------+-----------+-----+--------------+
|name            |cname                            |sem        |cid  |year_graduated|
+----------------+---------------------------------+-----------+-----+--------------+
|Sheldon Cooper  |Intro to Big Data                |Spring_2004|CS004|2004          |
|Sheldon Cooper  |Design and Analysis of Algorithms|Summer_2004|CS005|2004          |
|Sheldon Cooper  |Basics of Java                   |Fall_2004  |CS003|2004          |
|Pika Achu       |Basics of CS                     |Spring_2010|CS001|2010          |
|Pika Achu       |DBMS                             |Summer_2010|CS002|2010          |
|Pika Achu       |Design and Analysis of Algorithms|Fall_2010  |CS005|2010          |
|Hermoine Granger|Cloud Computing                  |Spring_2009|CS010|2009          |
|Hermoine Granger|DBMS                             |Summer_2009|CS002|2009          |
|Hermoine Granger|Network Security                 |Fall_2009  |CS007|2009          |
|Peter Parker    |Basics of CS                     |Spring_2010|CS001|2010          |
|Peter Parker    |DBMS                             |Summer_2010|CS002|2010          |
|Peter Parker    |Design and Analysis of Algorithms|Fall_2010  |CS005|2010          |
|Tom Riddle      |Basics of CS                     |Spring_2001|CS001|2001          |
|Tom Riddle      |DBMS                             |Summer_2001|CS002|2001          |
|Tom Riddle      |Basics of CS                     |Fall_2001  |CS001|2001          |
|Cerci Lannister |Basics of CS                     |Spring_2007|CS001|2007          |
|Cerci Lannister |Basics of Java                   |Summer_2007|CS003|2007          |
|Cerci Lannister |Machine Learning - Intro         |Fall_2007  |CS009|2007          |
|Ned Stark       |Basics of Java                   |Spring_2002|CS003|2002          |
|Ned Stark       |Intro to Big Data                |Summer_2002|CS004|2002          |
+----------------+---------------------------------+-----------+-----+--------------+
only showing top 20 rows

18/04/26 16:06:30 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
18/04/26 16:06:31 INFO SparkContext: Invoking stop() from shutdown hook
18/04/26 16:06:31 INFO SerialShutdownHooks: Successfully executed shutdown hook: Clearing session cache for C* connector
18/04/26 16:06:31 INFO SparkUI: Stopped Spark web UI at http://10.0.0.67:4040
18/04/26 16:06:31 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/26 16:06:31 INFO MemoryStore: MemoryStore cleared
18/04/26 16:06:31 INFO BlockManager: BlockManager stopped
18/04/26 16:06:31 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/26 16:06:31 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/26 16:06:31 INFO SparkContext: Successfully stopped SparkContext
18/04/26 16:06:31 INFO ShutdownHookManager: Shutdown hook called
18/04/26 16:06:31 INFO ShutdownHookManager: Deleting directory /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-895883ed-54b7-4d41-ae05-7808595c73e2
Pavans-MacBook-Pro:Spark_Cassandra_Example pavanpkulkarni$

The jar build/libs/Spark_Cassandra_Example-1.0.jar is ready for deployment on servers.
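
Before shipping the jar, it can help to keep environment-specific values out of the code. The snippet below is only a minimal sketch of that idea, not the code from this post: the object name `DeployableCassandraJob`, the helper `cassandraOptions`, and the `<keyspace> <table>` argument order are all hypothetical. It assumes the spark-cassandra-connector is on the classpath, and that `spark.cassandra.connection.host` and the master URL are supplied when the job is submitted rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used for illustration only.
object DeployableCassandraJob {

  // Builds the read options the connector's DataFrame source expects.
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  def main(args: Array[String]): Unit = {
    // Assumed argument order: <keyspace> <table>
    require(args.length == 2, "usage: DeployableCassandraJob <keyspace> <table>")
    val Array(keyspace, table) = args

    // No .master(...) call here: the master is expected to come from the
    // submit command, so the same jar can run locally or on a cluster.
    val spark = SparkSession.builder()
      .appName("Spark_Cassandra_Example")
      .getOrCreate()

    // The Cassandra host is read from the Spark configuration
    // (spark.cassandra.connection.host); it is not set in code.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(cassandraOptions(keyspace, table))
      .load()

    // Print the first 20 rows without truncating long column values.
    df.show(20, false)

    spark.stop()
  }
}
```

With this shape, moving from a laptop to a server should only require different submit-time settings, not a rebuild of the jar.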
