Monday, August 15, 2016

Install iPython (Jupyter) Notebook on Amazon EMR


  1. Install iPython Notebook with the bootstrap script at this link: https://github.com/awslabs/emr-bootstrap-actions/tree/master/ipython-notebook
  2. The iPython server is now running, but it is not yet integrated with Spark. To integrate it, follow the instructions in this blog post: https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
  3. Create the initial SparkContext and SQLContext as follows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# 'local' runs Spark inside the notebook process; 'pyspark' is the application name
sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)
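
Step 2 mostly comes down to making the pyspark package importable from the notebook kernel. Here is a minimal sketch of that idea, not the exact method from the blog post. It assumes Spark lives under /usr/lib/spark unless SPARK_HOME says otherwise (check this on your cluster) and that py4j ships as a versioned zip under python/lib. The helper names spark_python_paths and add_spark_to_sys_path are my own.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and zip files that must be on sys.path to import pyspark."""
    python_dir = os.path.join(spark_home, "python")
    # py4j ships as a versioned zip, so match it with a glob instead of hard-coding the version.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home=None):
    """Prepend the Spark Python paths to sys.path, skipping entries already present."""
    # Assumption: /usr/lib/spark is only a fallback; SPARK_HOME wins when it is set.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/lib/spark")
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
    return spark_home
```

Calling add_spark_to_sys_path() in the first notebook cell, before the pyspark imports above, should be enough if the paths match your installation.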

Friday, August 12, 2016

MySQL Driver Error in Apache Spark

I was following the Spark example for loading data from a MySQL database. See http://spark.apache.org/examples.html

Executing it failed with this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 233, ip-172-22-11-249.ap-southeast-1.compute.internal): java.lang.IllegalStateException: Did not find registered driver with class com.mysql.jdbc.Driver

To force Spark to load com.mysql.jdbc.Driver, add the "driver" option shown in the code below:
val df = sqlContext
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .option("driver", "com.mysql.jdbc.Driver") // name the JDBC driver class explicitly
  .load()
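
For anyone hitting the same error from a PySpark notebook, the same fix can be sketched in Python. This assumes the MySQL connector JAR is already available to Spark and that sql_context is a SQLContext like the one created in the iPython post. The helper names jdbc_options and load_jdbc_table are hypothetical, and the URL is whatever JDBC URL you already use.

```python
def jdbc_options(url, table, driver="com.mysql.jdbc.Driver"):
    """Build the option map for Spark's JDBC data source, always naming the driver class."""
    if not url.startswith("jdbc:"):
        raise ValueError("expected a JDBC URL, got %r" % url)
    return {"url": url, "dbtable": table, "driver": driver}


def load_jdbc_table(sql_context, url, table):
    """Load a table as a DataFrame through the JDBC source with the explicit driver option."""
    options = jdbc_options(url, table)
    return sql_context.read.format("jdbc").options(**options).load()
```

Keeping the options in a plain dict makes it easy to reuse them for several tables without forgetting the "driver" entry.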

Wednesday, August 10, 2016

Install New Interpreter in Zeppelin 0.6.x

In Zeppelin 0.6.x, you can install additional interpreters as follows:


  • List all available interpreters:
    /usr/lib/zeppelin/bin/install-interpreter.sh --list
  • Install specific interpreters:
    /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc,hbase,postgresql

Friday, August 5, 2016

IAM Errors when Creating Amazon EMR

I got errors about missing permissions in EMR_EC2_DefaultRole whenever I launched an Amazon EMR cluster. After some searching on the support forum, I learned that the default EMR roles may not be created automatically for you. Hence, I removed the old default role and created a new one as follows:
  1. Create the default roles:
    • aws emr create-default-roles
  2. Create the instance profile:
    • aws iam create-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  3. Verify that the instance profile exists but doesn't have any roles yet:
    • aws iam get-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  4. Add the role to the instance profile:
    • aws iam add-role-to-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
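
Steps 3 and 4 can also be scripted. The sketch below is an assumption-laden alternative to the CLI, not something I ran for this post: it expects an object that behaves like boto3.client("iam") and only attaches the role when the profile does not already list it. The function name attach_role_to_profile is hypothetical.

```python
def attach_role_to_profile(iam, name="EMR_EC2_DefaultRole"):
    """Attach the role `name` to the instance profile `name` unless it is already attached.

    `iam` is assumed to behave like boto3.client("iam").
    Returns True if the role was added, False if it was already there.
    """
    # Step 3: look at the profile and the roles it currently carries.
    profile = iam.get_instance_profile(InstanceProfileName=name)["InstanceProfile"]
    attached = [role["RoleName"] for role in profile.get("Roles", [])]
    if name in attached:
        return False
    # Step 4: add the role, mirroring `aws iam add-role-to-instance-profile`.
    iam.add_role_to_instance_profile(InstanceProfileName=name, RoleName=name)
    return True
```

Checking first keeps the script safe to re-run, since adding a role to a profile that already has one is rejected by IAM.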

Thursday, July 7, 2016

Unstuck Spark/Zeppelin Jobs on Amazon EMR

Apache Zeppelin + Apache Spark is a perfect match. Basically, you can do the following in one console:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration
As Zeppelin is still under incubation, its error handling is not yet rock solid. I have often seen Spark jobs get stuck for a long time. Usually, restarting the Spark interpreter does the trick. However, there are times when this simple trick won't work and the only way out is to restart the Zeppelin daemon. From a shell on the Amazon EMR master node, run the following:
  1. /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop
  2. /usr/lib/zeppelin/bin/zeppelin-daemon.sh start
If you wish to run the scripts as the zeppelin account, which has a nologin shell, execute the following instead:
  1. sudo -u zeppelin /bin/bash -c '/usr/lib/zeppelin/bin/zeppelin-daemon.sh stop'
  2. sudo -u zeppelin /bin/bash -c '/usr/lib/zeppelin/bin/zeppelin-daemon.sh start'
If you encounter this Java connection error: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method), it's probably because Zeppelin starts the Spark interpreter in a different process. To work around it:

  1. Edit /etc/spark/conf/spark-defaults.conf
  2. Comment out the following line and restart Zeppelin:

#spark.driver.memory              5g
Reference: http://stackoverflow.com/questions/32735645/hello-world-in-zeppelin-failed
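
If you need to apply the spark-defaults.conf edit on several clusters, it can be sketched as a small text transformation. This is only an illustration under the assumption that settings are written as a key followed by whitespace and a value, as in the line above. The function name comment_out_setting is hypothetical, and writing the result back to /etc/spark/conf/spark-defaults.conf is left to you.

```python
def comment_out_setting(conf_text, key="spark.driver.memory"):
    """Return conf_text with every active line for `key` prefixed by '#'."""
    out = []
    for line in conf_text.splitlines(keepends=True):
        fields = line.split()
        # Only touch lines whose first field is exactly the key;
        # lines that are already commented start with '#' and are left alone.
        if fields and fields[0] == key:
            line = "#" + line
        out.append(line)
    return "".join(out)
```

Running it twice on the same text gives the same result, so it is safe to include in a bootstrap or maintenance script.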

    Tuesday, May 31, 2016

    Multiple JSON Configurations for Amazon EMR cluster

    To use multiple JSON configurations when you launch a new Amazon EMR cluster, put them in a single JSON array. In this example, I configure Spark to use dynamic allocation of executors and Zeppelin to store its notebooks on S3. Replace the placeholder values below, such as your-s3-bucket, according to your S3 bucket location. For the values shown, create the folder '/user/notebook' under your-s3-bucket. As you create new Zeppelin notebooks, you'll see a new note.json for each one under that S3 folder.
    [
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
                "spark.dynamicAllocation.enabled": "true"
            },
            "configurations": []
        },
        {
            "classification": "zeppelin-env",
            "properties": {},
            "configurations": [
                {
                    "classification": "export",
                    "properties": {
                        "ZEPPELIN_NOTEBOOK_S3_BUCKET": "your-s3-bucket",
                        "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
                        "ZEPPELIN_NOTEBOOK_USER": "user"
                    }
                }
            ]
        }
    ]
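
Hand-editing nested JSON is easy to get wrong, so here is a rough sketch that generates the same structure from Python data and lets the json module take care of brackets and commas. The function names emr_configurations and render_configurations are hypothetical, and the bucket and user arguments stand in for the placeholder values above.

```python
import json


def emr_configurations(bucket, user):
    """Build the two-classification configuration list as plain Python data."""
    return [
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
                "spark.dynamicAllocation.enabled": "true",
            },
            "configurations": [],
        },
        {
            "classification": "zeppelin-env",
            "properties": {},
            # The export block is nested inside zeppelin-env, as in the JSON above.
            "configurations": [
                {
                    "classification": "export",
                    "properties": {
                        "ZEPPELIN_NOTEBOOK_S3_BUCKET": bucket,
                        "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
                        "ZEPPELIN_NOTEBOOK_USER": user,
                    },
                }
            ],
        },
    ]


def render_configurations(bucket, user):
    """Serialize the configuration list to an indented JSON string."""
    return json.dumps(emr_configurations(bucket, user), indent=4)
```

Printing render_configurations("your-s3-bucket", "user") gives text you can paste into the cluster's software settings or save to a file.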