Zeppelin tutorial example using Python instead of Scala for Spark SQL

Parallel Python / Scala version of the Zeppelin Tutorial

My latest notebook reproduces the original Scala-based Spark SQL tutorial in Python. Above you can see the two versions side by side.

Python Spark SQL Tutorial Code

Here is the resulting Python data loading code.  The SQL code is identical to the Tutorial notebook, so copy and paste if you need it.

I would have liked to make this look a little cleaner, but a Python lambda is limited to a single expression, so some lines get a little long. Improvements invited!

%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# sqlContext = SQLContext(sc) # Removed with latest version I tested. 
# Note: Prior to 0.6.0 the sqlContext variable is called sqlc in %pyspark

zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))

bankdf = sqlContext.createDataFrame(bank,bankSchema)
# registerTempTable replaces the deprecated registerAsTable
bankdf.registerTempTable("bank")
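One possible way around the long lambda lines is to move the parsing into small named functions and pass those to map and filter. The sketch below is only a suggestion: the helper names (strip_quotes, parse_bank_line, is_header) and the sample line are my own illustrations, not part of the Zeppelin tutorial. It runs as plain Python, so the parsing can be checked without a Spark context.

```python
def strip_quotes(field):
    """Remove the double quotes that wrap string fields in the CSV."""
    return field.replace('"', '')


def parse_bank_line(line):
    """Turn one semicolon-separated line into (age, job, marital, education, balance)."""
    cols = line.split(";")
    return (
        int(cols[0]),           # age
        strip_quotes(cols[1]),  # job
        strip_quotes(cols[2]),  # marital
        strip_quotes(cols[3]),  # education
        int(cols[5]),           # balance (cols[4] is skipped, as in the lambda version)
    )


def is_header(line):
    """True for the header row, which starts with the quoted column name "age"."""
    return line.startswith('"age"')


# In a %pyspark paragraph these helpers would stand in for the lambdas
# (assumes bankText is the RDD loaded with sc.textFile as in the notebook):
# bank = bankText.filter(lambda line: not is_header(line)).map(parse_bank_line)

# Illustrative line in the same format as the data file
sample = '58;"management";"married";"tertiary";"no";2143;"yes";"no"'
print(parse_bank_line(sample))
```

The resulting tuples have the same shape as before, so the createDataFrame call and the schema should not need to change.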

Update: In a Zeppelin 0.6.0 snapshot, the line “sqlContext = SQLContext(sc)” still ran in the Python interpreter, but I had to remove it so that Zeppelin could share its sqlContext object with the %sql interpreter. Zeppelin already initializes sqlContext behind the scenes, so you should not overwrite it here.

If you don’t comment it out, Zeppelin will report an error such as:
Table "bank" does not exist
I assume this behaviour is newer than the last time I used Zeppelin and will persist going forward, so I’ve commented the line out to hopefully ease your pain. (Thanks Matt S. for the tip!)

About Tyler Mitchell

Director Product Marketing @ OmniSci.com GPU-accelerate data analytics | Sr. Product Manager @ Couchbase.com - next generation Data Platform for System of Engagement! Former Eng. Director @Actian.com, author and technology writer in NoSQL, big data, graph analytics, geospatial and Internet of Things. Follow me @1tylermitchell or get my book from http://locatepress.com/.