Storing Zeppelin Notebooks in AWS S3 Buckets

Zeppelin lets you change where it stores its notebooks, and AWS S3 is one of the supported options. I’m not sure how long this has been around, but I know it isn’t particularly new. I wanted to make a note of it anyway, as more users of cluster environments are spinning up resources on demand and only on demand.

In that case, higher-performance persistent storage is usually not advised, because on AWS it keeps costing money while the system is offline. By leaving notebooks on S3, the system can launch, access the notebooks, run analysis, save changes and shut down.

I came across Dominic Murphy’s very detailed and helpful tutorial that gets you up and running on AWS, in an EMR environment, with notebooks stored on S3. I won’t dive into it here but the salient points taken from his blog that you need to know are:

define the storage method in the zeppelin-env.sh startup environment
add bucket access details into zeppelin-site.xml

zeppelin-env.sh settings for S3 Access

export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
export ZEPPELIN_NOTEBOOK_S3_BUCKET=<myZeppelinBucket>
export ZEPPELIN_NOTEBOOK_S3_USER=<myZeppelinUser>
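
If one of these exports is missing, it is easy to overlook until notebooks fail to show up. As a rough sketch (not something from Dominic’s tutorial), a small check like the one below could be run before starting Zeppelin. The function name require_vars and the bucket value are my own placeholders.

```sh
#!/bin/sh
# Hypothetical helper: report any of the named environment variables
# that are unset or empty. Returns 0 only when all of them have a value.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example with placeholder values; add the user variable from your own
# zeppelin-env.sh to the list as well.
ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
ZEPPELIN_NOTEBOOK_S3_BUCKET=example-bucket
require_vars ZEPPELIN_NOTEBOOK_STORAGE ZEPPELIN_NOTEBOOK_S3_BUCKET \
  && echo "S3 notebook settings look complete"
```

This only confirms that the variables have values. It does not check that the bucket exists or that the instance is allowed to reach it.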

zeppelin-site.xml settings for S3 Access

<!--If you use S3 for storage, the following folder structure is necessary: bucket_name/username/notebook/-->
<property>
  <name>zeppelin.notebook.s3.user</name>
  <value><myZeppelinUser></value>
  <description>user name for S3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value><myZeppelinBucket></value>
  <description>bucket name for notebook storage</description>
</property>
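
The folder structure mentioned in the comment above is easy to get wrong, so here is a minimal sketch that builds the expected S3 prefix from a bucket and user name. The function name notebook_prefix and the example values are placeholders of mine, and the commented commands assume the AWS CLI is installed with credentials configured.

```sh
#!/bin/sh
# Hypothetical helper: print the S3 prefix that matches the
# bucket_name/username/notebook/ layout described above.
notebook_prefix() {
  bucket="$1"
  user="$2"
  printf 's3://%s/%s/notebook/\n' "$bucket" "$user"
}

# Example with placeholder names (not real resources).
notebook_prefix "example-bucket" "example-user"

# With the AWS CLI available, the prefix could be listed or created like so:
#   aws s3 ls "$(notebook_prefix example-bucket example-user)"
#   aws s3api put-object --bucket example-bucket --key "example-user/notebook/"
```

The put-object call with a key ending in a slash just creates an empty placeholder object, which is how the S3 console represents a folder.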

Read all the details on Dominic’s blog and let me know if you give it a try. I haven’t done much with EMR, so I am curious how easy you find it to spin up using his CloudFormation scripts along with the S3 access.

I’m also interested to hear whether you are doing your Spark-based analytics in a persistent manner or with an on-demand approach. As mentioned above, I’ve seen an increase in the on-demand approach, particularly because people want both large clusters and reduced costs. What do you think?

About the Author

About Tyler Mitchell

Director Product Marketing @ OmniSci.com GPU-accelerate data analytics | Sr. Product Manager @ Couchbase.com - next generation Data Platform for System of Engagement! Former Eng. Director @Actian.com, author and technology writer in NoSQL, big data, graph analytics, geospatial and Internet of Things. Follow me @1tylermitchell or get my book from http://locatepress.com/.

Storing Zeppelin Notebooks in AWS S3 Buckets - June 7, 2016