Hadoop MapReduce vs Spark

(1) Spark can not only utilize HDFS as a data source but also run inside Hadoop YARN. So, Spark is in fact a MapReduce alternative within the Hadoop ecosystem, not a “Hadoop alternative”.

(2) Apache Spark processes data in memory, while Hadoop MapReduce writes results back to disk after each map or reduce step, so Spark should generally outperform Hadoop MapReduce.

Nonetheless, Spark needs a lot of memory. Much like a standard database, it loads data into memory and keeps it there for caching until told otherwise. If Spark runs on Hadoop YARN alongside other resource-demanding services, or if the data is too big to fit entirely in memory, its performance can degrade significantly.

Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn’t fit in memory, and it runs well alongside other services.
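To make the disk-versus-memory difference in point (2) concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code: the function names `run_with_disk` and `run_in_memory` and the JSON files are assumptions made up for illustration. The first function writes every intermediate result to a file and reads it back for the next stage, roughly the way MapReduce persists between steps; the second keeps everything in a list in memory.

```python
import json
import os
import tempfile


def square(x):
    return x * x


def add_one(x):
    return x + 1


def run_with_disk(data, stages, workdir):
    """Run the stages in order, persisting each stage's output to a file."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(list(data), f)
    for i, stage in enumerate(stages, start=1):
        # Read the previous stage's output back from disk.
        with open(path) as f:
            records = json.load(f)
        # Write this stage's output to a new file.
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump([stage(r) for r in records], f)
    with open(path) as f:
        return json.load(f)


def run_in_memory(data, stages):
    """Run the same stages, keeping intermediate results in memory."""
    records = list(data)
    for stage in stages:
        records = [stage(r) for r in records]
    return records
```

Both functions return the same result; the difference is only in how many times the data crosses the disk boundary. The in-memory version is faster as long as `records` fits in RAM, which is exactly the trade-off described above.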

(3) Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to program, but higher-level tools such as Pig and Hive make it easier. Engines such as Impala, Presto and Tez offer faster, more interactive querying on Hadoop.
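For readers who have not written a MapReduce job, the sketch below shows the shape of the programming model using a word count in plain Python. It is only an illustration: the names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical and are not Hadoop APIs. In Hadoop itself the map and reduce steps are typically written as separate Mapper and Reducer classes, and the framework performs the grouping between them.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Every job has to be expressed as this map, group, reduce sequence, and multi-step jobs chain several such rounds. Spark lets the same logic be written as a short chain of transformations, which is a large part of why it feels easier to program.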

(4) There is a wide array of Hadoop-as-a-service offerings and Hadoop-based services, which remove the need for your own hardware and staff. In comparison, there are few Spark-as-a-service options and they are all very new.

(5) Spark and Hadoop MapReduce both have good fault tolerance, but Hadoop MapReduce is slightly more tolerant. That is because Hadoop replicates data across nodes and, as noted in (2), writes results to disk after each step.

(6) Spark security is still in its infancy; Hadoop MapReduce has more security features and projects. It benefits from all the Hadoop security features and integrates with Hadoop security projects such as Knox Gateway and Sentry.


