Running a distributed application on a local machine might seem like a tedious task, but it turns out to be pretty easy.
1. Download and extract the binary
The binary can be downloaded from here. Extract it somewhere; let’s call that location $SPARK_HOME.
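As a rough sketch of this step, assuming the download is a gzipped tarball whose first entry is the distribution's top-level directory. The function name, archive name, and target directory below are placeholders for illustration, not something taken from the download page:

```shell
# Extract a Spark archive into a target directory and print the resulting path.
# Both arguments are placeholders; substitute the file you actually downloaded.
extract_spark() {
    archive="$1"   # path to the downloaded .tgz
    target="$2"    # directory to extract into
    mkdir -p "$target" || return 1
    tar -xzf "$archive" -C "$target" || return 1
    # The first entry of the archive names its top-level directory.
    top=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)
    printf '%s/%s\n' "$target" "$top"
}

# Usage (hypothetical file name):
#   export SPARK_HOME="$(extract_spark "$HOME/Downloads/spark.tgz" "$HOME/opt")"
```

Exporting the printed path as SPARK_HOME lets the later commands refer to it directly.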
2. Running the master
cd $SPARK_HOME/sbin
./start-master.sh
If the script will not run, check that it has execute permission.
The script will emit some log statements and the URL at which the master is running. On a local machine this will be of the format:
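For orientation, a standalone master listens on port 7077 unless configured otherwise, so the URL usually looks like spark://<hostname>:7077. Below is a minimal sketch that builds such a string. The helper name is made up for illustration, and the URL printed in the master's own log remains the authoritative value:

```shell
# Build a master URL of the form spark://HOST:PORT.
# 7077 is the default standalone master port; pass another port if yours differs.
master_url() {
    host="${1:-localhost}"
    port="${2:-7077}"
    printf 'spark://%s:%s\n' "$host" "$port"
}

master_url localhost   # prints spark://localhost:7077

# Usage with the machine's own hostname:
#   SPARK_URL=$(master_url "$(hostname)")
```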
3. Verifying with Web UI
Spark ships with a web UI as well, which is started automatically (with default settings) on port 8080, so you can navigate to localhost:8080 and check that things are running fine.
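If you prefer to check from the terminal, something along these lines should work. It is only a sketch: it assumes curl is installed and that the UI is on the default port, and the function name is invented for this example:

```shell
# Return success if an HTTP endpoint answers; assumes curl is installed.
ui_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$1"
}

if ui_is_up "http://localhost:8080"; then
    echo "master UI is reachable"
else
    echo "master UI is not reachable"
fi
```

A non-zero exit status only tells you that nothing answered at that address, not why, so the master's log is still the place to look when the check fails.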
4. Running Slaves
Run the slaves/workers by executing:
./start-slave.sh $SPARK_URL -m 512m -c 1
The first parameter is the URL at which the Spark master is running.
You can also set the memory and CPU cores the worker will take; in this example it's 512 MB and 1 core. The default is (total memory - 1 GB) of memory and all available cores, which is not recommended on a local machine.
You can verify in the UI that the slave got attached to the master.
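To make the resource flags concrete, here is a small dry-run sketch that only prints the worker command rather than running it. Both helper names are made up for illustration, and the default-memory arithmetic simply mirrors the "total memory minus 1 GB" rule described above:

```shell
# Print (do not run) the worker start command with explicit resource caps.
worker_cmd() {
    url="$1"            # master URL, e.g. the value of $SPARK_URL
    mem="${2:-512m}"    # worker memory, passed to -m
    cores="${3:-1}"     # worker cores, passed to -c
    printf './start-slave.sh %s -m %s -c %s\n' "$url" "$mem" "$cores"
}

# Memory (in MB) a worker would claim by default on a machine with the given RAM.
default_worker_mem_mb() {
    echo $(( $1 - 1024 ))
}

worker_cmd "spark://localhost:7077"   # prints ./start-slave.sh spark://localhost:7077 -m 512m -c 1
default_worker_mem_mb 8192            # prints 7168
```

On an 8 GB laptop the default would hand about 7 GB to the worker, which is why capping it explicitly is the safer choice here. Note that newer Spark releases ship this script under the name start-worker.sh, so adjust the command if start-slave.sh is missing from your sbin directory.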
To stop, execute ./stop-all.sh in the same directory. If things get crazy you can of course find the processes using ps aux | grep "spark" and kill them individually.
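A plain grep for "spark" also matches the grep process itself and anything else with "spark" in its command line. As a slightly narrower sketch, the filter below looks for the standalone daemon classes (org.apache.spark.deploy.master.Master and org.apache.spark.deploy.worker.Worker) and prints only their PIDs. The function name is hypothetical, and it assumes ps aux-style input with the PID in the second column:

```shell
# Read `ps aux`-style lines on stdin and print the PID (2nd column) of every
# line that mentions a Spark standalone master or worker class.
spark_pids() {
    awk '/org\.apache\.spark\.deploy\.(master\.Master|worker\.Worker)/ { print $2 }'
}

# Usage:
#   ps aux | spark_pids     # list the PIDs, then kill each one individually
```

Because the pattern is written with escaped dots, the awk process's own command line does not match it, so the filter does not report itself.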
Detailed information can be found here.