Zeppelin lets you change the storage backend of its notebook system so that notebooks live in AWS S3. I’m not sure how long this option has been around, but I know it isn’t particularly new. I wanted to make a note of it because more users of cluster environments are spinning up resources to be used on-demand and only on-demand.
In this case, higher-performance persistent storage is not usually advised because of what it costs on AWS while the system is offline. By leaving notebooks on S3, the system can launch, access notebooks, run analysis, save changes, and shut down.
I came across Dominic Murphy’s very detailed and helpful tutorial that gets you up and running on AWS, in an EMR environment, with notebooks stored on S3. I won’t dive into it here but the salient points taken from his blog that you need to know are:
- define the storage method in the zeppelin-env.sh startup environment
- add bucket access details into zeppelin-site.xml
Zeppelin-env.sh settings for S3 Access
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
export ZEPPELIN_NOTEBOOK_S3_BUCKET=<myZeppelinBucket>
export ZEPPELIN_NOTEBOOK_USER=<myZeppelinUser>
Zeppelin-site.xml settings for S3 Access
<!--If you use S3 for storage, the following folder structure is necessary: bucket_name/username/notebook/-->
<property>
  <name>zeppelin.notebook.s3.user</name>
  <value><myZeppelinUser></value>
  <description>user name for S3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value><myZeppelinBucket></value>
  <description>bucket name for notebook storage</description>
</property>
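To make the folder structure from that comment concrete, here is a minimal sketch that builds the expected notebook prefix from the two environment variables set in zeppelin-env.sh. The function name notebook_prefix is my own hypothetical helper and not part of Zeppelin. It only assembles the bucket_name/username/notebook/ path and does not contact AWS.

```shell
# Hypothetical helper (not part of Zeppelin): prints the S3 prefix where
# notebooks are expected, following bucket_name/username/notebook/.
notebook_prefix() {
  # Fail with a message if either variable is missing or empty.
  if [ -z "${ZEPPELIN_NOTEBOOK_S3_BUCKET:-}" ] || [ -z "${ZEPPELIN_NOTEBOOK_USER:-}" ]; then
    echo "ZEPPELIN_NOTEBOOK_S3_BUCKET and ZEPPELIN_NOTEBOOK_USER must be set" >&2
    return 1
  fi
  printf 's3://%s/%s/notebook/\n' "$ZEPPELIN_NOTEBOOK_S3_BUCKET" "$ZEPPELIN_NOTEBOOK_USER"
}
```

If the AWS CLI is installed and has credentials for the bucket (both are assumptions on my part), something like aws s3 ls "$(notebook_prefix)" should list whatever is stored under that prefix, which is a quick way to confirm the bucket and user values line up before starting Zeppelin.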
Read all the details on Dominic’s blog and let me know if you get to give it a try. I haven’t done much with EMR so am curious how easy you find it to spin up using his CloudFormation scripts along with the S3 accessibility.
I’m also interested to hear whether you are doing your Spark-based analytics in a persistent manner or with an on-demand approach. As mentioned above, I’ve seen an increase in the on-demand approach, particularly because people want both large clusters and reduced costs. What do you think?
- Storing Zeppelin Notebooks in AWS S3 Buckets - June 7, 2016