Thursday, July 7, 2016

Unstuck Spark/Zeppelin Jobs on Amazon EMR

Apache Zeppelin + Apache Spark is a perfect match. Basically, you can do the following in one console:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration
As Zeppelin is still an Apache incubator project, its error handling is not yet rock solid. I have often seen Spark jobs get stuck for a long time. Usually, restarting the Spark interpreter does the trick. However, there are times when this does not work and the only remedy is to restart the Zeppelin daemon. From a shell session on the Amazon EMR node running Zeppelin, run the following:
  1. /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop
  2. /usr/lib/zeppelin/bin/zeppelin-daemon.sh start
If you wish to run the scripts as the zeppelin account, which has a nologin shell, run the following instead:
  1. sudo -u zeppelin /bin/bash -c '/usr/lib/zeppelin/bin/zeppelin-daemon.sh stop'
  2. sudo -u zeppelin /bin/bash -c '/usr/lib/zeppelin/bin/zeppelin-daemon.sh start'
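The stop/start pair can also be wrapped in a small helper. This is only a minimal Python sketch, assuming the daemon script sits at the path used above and that `sudo -u` is permitted for the calling user; the names `daemon_command` and `restart_zeppelin` are made up for illustration and are not part of Zeppelin.

```python
import subprocess

# Path taken from the commands above; adjust if your installation differs.
ZEPPELIN_DAEMON = "/usr/lib/zeppelin/bin/zeppelin-daemon.sh"


def daemon_command(action, user=None):
    """Build the argument list for one daemon action ("stop" or "start").

    If `user` is given, the command is run through bash as that account
    via `sudo -u`, which works even when the account has a nologin shell.
    """
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    if user is None:
        return [ZEPPELIN_DAEMON, action]
    return ["sudo", "-u", user, "/bin/bash", "-c", f"{ZEPPELIN_DAEMON} {action}"]


def restart_zeppelin(user=None):
    """Stop, then start the daemon; raises CalledProcessError if a step fails."""
    for action in ("stop", "start"):
        subprocess.run(daemon_command(action, user), check=True)
```

Calling `restart_zeppelin(user="zeppelin")` would run both steps as the zeppelin account; calling it without arguments runs them as the current user.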
If you encounter this Java connection error: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method), it is probably because Zeppelin starts the interpreter in a separate process and tries to connect to it using the Thrift protocol. When that process does not come up, the connection is refused. The following workaround helped:

  1. Edit /etc/spark/conf/spark-defaults.conf
  2. Comment out the following line and restart Zeppelin

#spark.driver.memory              5g
Reference: http://stackoverflow.com/questions/32735645/hello-world-in-zeppelin-failed
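
The edit in step 2 can be scripted if you need to repeat it on several clusters. Below is a minimal Python sketch that works on the file's text only; the function name `comment_out_property` is hypothetical, and reading and writing /etc/spark/conf/spark-defaults.conf (which normally needs root) is left to the caller.

```python
import re


def comment_out_property(conf_text, key):
    """Return conf_text with every active line that sets `key` prefixed by '#'.

    Lines that are already commented out, and keys that merely start with
    `key` (for example a longer property name), are left untouched.
    """
    # The key must be followed by whitespace, '=', or end of line.
    pattern = re.compile(r"^\s*" + re.escape(key) + r"(\s|=|$)")
    lines = []
    for line in conf_text.splitlines():
        lines.append("#" + line if pattern.match(line) else line)
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline
```

Running it twice on the same text changes nothing the second time, so it should be safe to re-apply.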

    Tuesday, May 31, 2016

    Java errors when using Apache Spark on Amazon EMR

    When I attempted the Apache Spark Streaming Twitter sentiment analysis tutorial using an Apache Zeppelin notebook on Amazon EMR, I often encountered a strange Java error like this:

    java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method)

    After some searching on the AWS forum, I found that you should enter the following JSON configuration when you launch a new Amazon EMR cluster. The config below is meant for m3.xlarge instances.
    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.driver.memory": "5G",
          "spark.executor.memory": "9542M",
          "spark.executor.cores": "4"
        }
      },
      {
        "Classification": "zeppelin-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "ZEPPELIN_MEM": "-Xmx5G"
            }
          }
        ],
        "Properties": {}
      }
    ]
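
If you launch clusters on other instance types, it may be easier to generate this JSON than to edit it by hand. The following is a minimal Python sketch; the function name `build_emr_config` and its parameters are made up for illustration, and the values passed in are simply the m3.xlarge figures from the configuration above, not tuned recommendations.

```python
import json


def build_emr_config(driver_memory, executor_memory, executor_cores, zeppelin_mem):
    """Return the EMR configuration list with the given memory and core settings."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.dynamicAllocation.enabled": "true",
                "spark.driver.memory": driver_memory,
                "spark.executor.memory": executor_memory,
                # EMR expects property values as strings.
                "spark.executor.cores": str(executor_cores),
            },
        },
        {
            "Classification": "zeppelin-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"ZEPPELIN_MEM": zeppelin_mem},
                }
            ],
            "Properties": {},
        },
    ]


# Values from the m3.xlarge configuration above.
config = build_emr_config("5G", "9542M", 4, "-Xmx5G")
print(json.dumps(config, indent=2))
```

The printed JSON can then be pasted into the configuration field when launching the cluster.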

    Saturday, February 27, 2016

    Run Node.JS within Sublime Text editor

    Step 1: Open "Sublime Text 2" editor
    Step 2: Tools -> Build System -> New Build System
    Step 3: New tab appears. Replace the content with the following lines.

    {
    "cmd": ["node", "$file", "$file_base_name"],
    "working_dir": "${project_path:${folder}}",
    "selector": "source.js"
    }

    Step 4: Save the file as "NodeJS.sublime-build".
    Step 5: Select "Tools -> Build System -> NodeJS"
    Step 6: Go to your source program and press "Ctrl-B" to run code.

    Wednesday, February 3, 2016

    Connect to WS2012 WSUS Internal Database

    To connect to the Windows Internal Database (WID) used by WSUS, install SQL Server Management Studio and connect to the server using:


    \\.\pipe\MICROSOFT##WID\tsql\query

    You may confirm the SQL instance name (MICROSOFT##WID in the path above) against the description of the corresponding service in services.msc.

    As I kept getting the error "WSUS server is still processing a previous configuration change", this is what I had to execute on the database instance:

    USE SUSDB;
    UPDATE tbSingletonData SET ResetStateMachineNeeded = 0;

    Monday, August 24, 2015

    Making Sense of "User-Based Recommender in 5 minutes"

    Have you wondered how Amazon recommends new items to you? This is an example of a Machine Learning implementation, which is a type of Artificial Intelligence. I followed an introductory example from Apache Mahout and shared it on LinkedIn.