Zeppelin lets you change the storage backend of its notebook system so that notebooks are kept on AWS S3. I’m not sure how long this option has been around, but I know it isn’t particularly new. I wanted to make a note of it anyway, as more users of cluster environments are spinning up resources on demand and only on demand.

In this case, higher-performance persistent storage is not usually advised, because on Amazon AWS it keeps incurring costs while the system is offline. By leaving notebooks on S3, the system can launch, access notebooks, run analysis, save changes, and shut down.

I came across Dominic Murphy’s very detailed and helpful tutorial that gets you up and running on AWS, in an EMR environment, with notebooks stored on S3.  I won’t dive into it here but the salient points taken from his blog that you need to know are:

  1. define the storage method in the zeppelin-env.sh startup environment
  2. add bucket access details into zeppelin-site.xml

Zeppelin-env.sh settings for S3 Access

export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
export ZEPPELIN_NOTEBOOK_S3_BUCKET=<myZeppelinBucket>
export ZEPPELIN_NOTEBOOK_S3_USER=<myZeppelinUser>

Zeppelin-site.xml settings for S3 Access

<!--If you use S3 for storage, the following folder structure is necessary: bucket_name/username/notebook/-->
<property>
  <name>zeppelin.notebook.s3.user</name>
  <value><myZeppelinUser></value>
  <description>user name for S3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value><myZeppelinBucket></value>
  <description>bucket name for notebook storage</description>
</property>
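
The folder structure mentioned in the comment above (bucket_name/username/notebook/) can be sketched as a small shell helper. This is only an illustration: the function name zeppelin_notebook_prefix is hypothetical and not part of Zeppelin or of Dominic's tutorial, and the bucket and user values are the same placeholders used in the settings above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Zeppelin): prints the S3 prefix implied by
# the bucket_name/username/notebook/ layout described in zeppelin-site.xml.
zeppelin_notebook_prefix() {
  bucket="$1"
  user="$2"
  # Both values are required; fail loudly rather than print a broken path.
  if [ -z "$bucket" ] || [ -z "$user" ]; then
    echo "usage: zeppelin_notebook_prefix <bucket> <user>" >&2
    return 1
  fi
  printf 's3://%s/%s/notebook/\n' "$bucket" "$user"
}

# Example, assuming the AWS CLI is installed and has access to the bucket:
#   aws s3 ls "$(zeppelin_notebook_prefix myZeppelinBucket myZeppelinUser)"
```

Listing that prefix before starting Zeppelin is one quick way to confirm that the bucket and user values in the two files point at the location you expect.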

Read all the details on Dominic’s blog and let me know if you get to give it a try. I haven’t done much with EMR, so I am curious how easy you find it to spin up using his CloudFormation scripts along with the S3 accessibility.

I’m also interested to hear whether you are doing your Spark-based analytics in a persistent manner or with an on-demand approach. As mentioned above, I’ve seen an increase in the on-demand approach, particularly because people want both large clusters and reduced costs. What do you think?

 

About Tyler Mitchell

Director Product Marketing @ OmniSci.com GPU-accelerate data analytics | Sr. Product Manager @ Couchbase.com - next generation Data Platform for System of Engagement! Former Eng. Director @Actian.com, author and technology writer in NoSQL, big data, graph analytics, geospatial and Internet of Things. Follow me @1tylermitchell or get my book from http://locatepress.com/.