{"id":2453,"date":"2024-02-21T20:05:05","date_gmt":"2024-02-21T15:05:05","guid":{"rendered":"https:\/\/afzalbadshah.com\/?p=2453"},"modified":"2024-02-21T20:05:08","modified_gmt":"2024-02-21T15:05:08","slug":"setting-up-apache-spark-in-google-colab","status":"publish","type":"post","link":"https:\/\/afzalbadshah.com\/index.php\/2024\/02\/21\/setting-up-apache-spark-in-google-colab\/","title":{"rendered":"Setting up Apache Spark in Google Colab"},"content":{"rendered":"\n<p>Apache Spark is a powerful distributed computing framework that is widely used for big data processing and analytics. In this tutorial, we will walk through the steps to set up and configure Apache Spark in Google Colab, a free cloud-based notebook environment provided by Google.<\/p>\n\n\n\n<p><strong>Step 1: Install Java Development Kit (JDK)<\/strong><\/p>\n\n\n\n<p>The first step is to install the Java Development Kit (JDK) which is required for running Apache Spark.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!apt-get install openjdk-8-jdk-headless -qq &gt; \/dev\/null<\/code><\/pre>\n\n\n\n<p>This command installs the JDK silently without producing any output.<\/p>\n\n\n\n<p><strong>Step 2: Download and Extract Apache Spark<\/strong><\/p>\n\n\n\n<p>Next, we need to download the Apache Spark distribution and extract it. 
Here, we&#8217;ll use Spark version 2.2.1 with Hadoop version 2.7.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!wget -q http:\/\/apache.osuosl.org\/spark\/spark-2.2.1\/spark-2.2.1-bin-hadoop2.7.tgz\n!tar xf spark-2.2.1-bin-hadoop2.7.tgz<\/code><\/pre>\n\n\n\n<p>If the download fails, you can upload the Spark distribution manually instead:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Download the Spark distribution from the <a href=\"http:\/\/spark.apache.org\/downloads.html\">Apache Spark website<\/a>.<\/li>\n\n\n\n<li>Upload the downloaded <code>spark-2.2.1-bin-hadoop2.7.tgz<\/code> file to Google Colab using the file upload feature.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>from google.colab import files\n\n# Upload the file\nuploaded = files.upload()<\/code><\/pre>\n\n\n\n<p>If you uploaded the file manually, you still need to extract the Spark archive:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!tar xf spark-2.2.1-bin-hadoop2.7.tgz<\/code><\/pre>\n\n\n\n<p><strong>Step 3: Install findspark<\/strong><\/p>\n\n\n\n<p>Now, we&#8217;ll install the <code>findspark<\/code> library, which locates the Spark installation and makes it available in the Python environment.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>!pip install -q findspark<\/code><\/pre>\n\n\n\n<p><strong>Step 4: Initialize Spark Environment<\/strong><\/p>\n\n\n\n<p>We&#8217;ll use the <code>findspark<\/code> library to initialize the Spark environment. 
This sets <code>SPARK_HOME<\/code> and adds Spark&#8217;s Python libraries to the Python path.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import findspark\nfindspark.init(\"spark-2.2.1-bin-hadoop2.7\")<\/code><\/pre>\n\n\n\n<p><strong>Step 5: Create Spark Session<\/strong><\/p>\n\n\n\n<p>Finally, we&#8217;ll create a SparkSession object, which serves as the entry point to Spark.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import SparkSession\n\n# Create Spark session\nspark = SparkSession.builder \\\n    .appName(\"Spark_Colab\") \\\n    .getOrCreate()<\/code><\/pre>\n\n\n\n<p>If all of the steps above run without errors, Apache Spark is set up in Google Colab, and you can start using it for your data processing and analysis tasks.<\/p>\n\n\n\n<p>That&#8217;s it! You&#8217;ve now learned how to set up Apache Spark in Google Colab.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark is a powerful distributed computing framework that is widely used for big data processing and analytics. In this tutorial, we will walk through the steps to set up and configure Apache Spark in Google Colab, a free cloud-based notebook environment provided by Google. Step 1: Install Java Development Kit (JDK) The first step is to install the Java Development Kit (JDK) which is required for running Apache Spark. 
This command installs the JDK silently without producing any output&#8230;.<\/p>\n<p class=\"read-more\"><a class=\"btn btn-default\" href=\"https:\/\/afzalbadshah.com\/index.php\/2024\/02\/21\/setting-up-apache-spark-in-google-colab\/\"> Read More<span class=\"screen-reader-text\">  Read More<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2455,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[486],"tags":[487,488],"class_list":["post-2453","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark-courses","tag-apache-spark","tag-google-colab"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/afzalbadshah.com\/wp-content\/uploads\/2024\/02\/download.png?fit=298%2C169&ssl=1","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pf3emP-Dz","jetpack-related-posts":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/posts\/2453","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\
/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/comments?post=2453"}],"version-history":[{"count":2,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/posts\/2453\/revisions"}],"predecessor-version":[{"id":2456,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/posts\/2453\/revisions\/2456"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/media\/2455"}],"wp:attachment":[{"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/media?parent=2453"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/categories?post=2453"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/afzalbadshah.com\/index.php\/wp-json\/wp\/v2\/tags?post=2453"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}