diff --git a/tutorial_gmail_automation/README.md b/tutorial_gmail_automation/README.md
new file mode 100644
index 0000000000..22f270de3d
--- /dev/null
+++ b/tutorial_gmail_automation/README.md
@@ -0,0 +1,32 @@
+# Gmail Automation
+
+This project is a Python-based utility to connect to Gmail, authenticate using OAuth 2.0, and retrieve emails based on a custom search query. The results are processed and displayed in a Pandas DataFrame. The utility also extracts unique email addresses from senders and email content.
+
+## Features
+
+- OAuth 2.0 authentication using `credentials.json`
+- Retrieve emails using Gmail API with custom search queries
+- Display email fields (sender, subject, date, body) in a DataFrame
+- Extract and list unique email addresses from sender fields and message bodies
+- Designed to work inside a Jupyter Notebook environment
+
+---
+
+## Google Cloud Setup
+
+1. Go to [Google Cloud Console](https://console.cloud.google.com/).
+2. Create a new project (or select an existing one).
+3. Enable the **Gmail API** for the project.
+4. Navigate to **APIs & Services > Credentials**, then click **Create Credentials > OAuth client ID**.
+5. Choose **Desktop App** as the application type.
+6. Download the client secrets JSON file, save it as `credentials.json`, and place it in the root directory of your project (same folder as the notebook).
+
+
+## Usage
+
+1. Clone this repository or download the notebook.
+2. Make sure `credentials.json` is in the project directory.
+3. Open `tutorial_gmail_automation.ipynb` in Jupyter Notebook and run all cells:
+   - The first time, you will be prompted to authorize access via a browser.
+   - Once authenticated, a `token.json` file is saved for future sessions.
+4. Modify the `search_query` variable inside the notebook to filter emails (e.g., `subject:invoice`, `after:2024/01/01`).
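The `search_query` filters mentioned in step 4 use Gmail's search operators, and several of them can be combined in one string. The sketch below is one hypothetical way to assemble such a string. `build_gmail_query` and its parameters are illustrative names that do not appear in the notebook, and the helper assumes single-word subject terms.

```python
from datetime import date
from typing import Optional


def build_gmail_query(
    subject: Optional[str] = None,
    sender: Optional[str] = None,
    after: Optional[date] = None,
    before: Optional[date] = None,
) -> str:
    """Join optional filters into one Gmail search string (hypothetical helper)."""
    parts = []
    if subject:
        parts.append(f"subject:{subject}")
    if sender:
        parts.append(f"from:{sender}")
    if after:
        # Gmail expects dates as YYYY/MM/DD.
        parts.append(f"after:{after:%Y/%m/%d}")
    if before:
        parts.append(f"before:{before:%Y/%m/%d}")
    # Space-separated terms are combined as an implicit AND.
    return " ".join(parts)


search_query = build_gmail_query(subject="invoice", after=date(2024, 1, 1))
print(search_query)  # prints: subject:invoice after:2024/01/01
```

The resulting string can be assigned to `search_query` in the notebook unchanged. An empty string applies no filter.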
diff --git a/tutorial_gmail_automation/tutorial_gmail_automation.ipynb b/tutorial_gmail_automation/tutorial_gmail_automation.ipynb
new file mode 100644
index 0000000000..6caacee3a8
--- /dev/null
+++ b/tutorial_gmail_automation/tutorial_gmail_automation.ipynb
@@ -0,0 +1,939 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "CONTENTS:\n",
+ " - [Gmail Email Query & Processing](#gmail-email-query-&-processing)\n",
+ " - [Importing all the necessary libraries](#importing-all-the-necessary-libraries)\n",
+ " - [fetch_emails(query: str) supports flexible search queries](#fetch_emails(query:-str)-supports-flexible-search-queries)\n",
+ " - [Trying out with a keyword \"interview\"](#trying-out-with-a-keyword-\"interview\")\n",
+ " - [Trying out with a dates](#trying-out-with-a-dates)\n",
+ " - [Cleaning the dataset](#cleaning-the-dataset)\n",
+ " - [Extracting unique email address](#extracting-unique-email-address)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "20c23aeb",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Gmail Email Query & Processing"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dba266e8",
+ "metadata": {},
+ "source": [
+ "This notebook demonstrates a Python-based utility for connecting to Gmail, retrieving emails using flexible search queries and processing them within a Jupyter notebook environment.\n",
+ "\n",
+ "**Features:**\n",
+ "- Authenticate with Gmail using OAuth2 and the Gmail API\n",
+ "- Flexible search queries\n",
+ "- Results displayed in a Pandas DataFrame: Sender, Subject, Date, Body\n",
+ "- Clean email bodies (plain text/HTML)\n",
+ "- Extract unique email addresses from sender and body fields\n",
+ "- Notebook-friendly, modular code"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "0f5f4ab4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: google-api-python-client in c:\\users\\maithili\\anaconda3\\lib\\site-packages (2.169.0)\n",
+ "Collecting google-api-python-client\n",
+ " Downloading google_api_python_client-2.170.0-py3-none-any.whl (13.5 MB)\n",
+ " --------------------------------------- 13.5/13.5 MB 50.1 MB/s eta 0:00:00\n",
+ "Requirement already satisfied: google-auth-httplib2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (0.2.0)\n",
+ "Requirement already satisfied: google-auth-oauthlib in c:\\users\\maithili\\anaconda3\\lib\\site-packages (1.2.2)\n",
+ "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (2.24.2)\n",
+ "Requirement already satisfied: uritemplate<5,>=3.0.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (4.1.1)\n",
+ "Requirement already satisfied: google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (2.17.3)\n",
+ "Requirement already satisfied: httplib2<1.0.0,>=0.19.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (0.22.0)\n",
+ "Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth-oauthlib) (1.3.1)\n",
+ "Requirement already satisfied: requests<3.0.0,>=2.18.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.32.3)\n",
+ "Requirement already satisfied: proto-plus<2.0.0,>=1.22.3 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.23.0)\n",
+ "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.19.5 in c:\\users\\maithili\\appdata\\roaming\\python\\python39\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (3.20.3)\n",
+ "Requirement already satisfied: googleapis-common-protos<2.0.0,>=1.56.2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.70.0)\n",
+ "Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (0.2.8)\n",
+ "Requirement already satisfied: rsa<5,>=3.1.4 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (4.9)\n",
+ "Requirement already satisfied: six>=1.9.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (1.16.0)\n",
+ "Requirement already satisfied: cachetools<6.0,>=2.0.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (5.3.0)\n",
+ "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from httplib2<1.0.0,>=0.19.0->google-api-python-client) (3.0.9)\n",
+ "Requirement already satisfied: oauthlib>=3.0.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib) (3.2.2)\n",
+ "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from pyasn1-modules>=0.2.1->google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (0.4.8)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2024.2.2)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.26.20)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.6)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.0.4)\n",
+ "Installing collected packages: google-api-python-client\n",
+ " Attempting uninstall: google-api-python-client\n",
+ " Found existing installation: google-api-python-client 2.169.0\n",
+ " Uninstalling google-api-python-client-2.169.0:\n",
+ " Successfully uninstalled google-api-python-client-2.169.0\n",
+ "Successfully installed google-api-python-client-2.170.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b1f116a0",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Importing all the necessary libraries "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "caa0cb7c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os.path"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "0a736608",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from google.auth.transport.requests import Request\n",
+ "from google.oauth2.credentials import Credentials\n",
+ "from google_auth_oauthlib.flow import InstalledAppFlow\n",
+ "from googleapiclient.discovery import build\n",
+ "from googleapiclient.errors import HttpError"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "5b1ca840",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: numexpr in c:\\users\\maithili\\anaconda3\\lib\\site-packages (2.10.2)\n",
+ "Requirement already satisfied: bottleneck in c:\\users\\maithili\\anaconda3\\lib\\site-packages (1.4.2)\n",
+ "Collecting bottleneck\n",
+ " Downloading bottleneck-1.5.0-cp39-cp39-win_amd64.whl (112 kB)\n",
+ " -------------------------------------- 112.1/112.1 kB 2.2 MB/s eta 0:00:00\n",
+ "Requirement already satisfied: numpy>=1.23.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from numexpr) (1.24.4)\n",
+ "Installing collected packages: bottleneck\n",
+ " Attempting uninstall: bottleneck\n",
+ " Found existing installation: Bottleneck 1.4.2\n",
+ " Uninstalling Bottleneck-1.4.2:\n",
+ " Successfully uninstalled Bottleneck-1.4.2\n",
+ "Successfully installed bottleneck-1.5.0\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install --upgrade numexpr bottleneck\n",
+ "\n",
+ "#Upgrading numexpr and bottleneck ensures optimal performance and compatibility for pandas operations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "f8426aad",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import base64\n",
+ "import pandas as pd\n",
+ "from email import policy\n",
+ "import email\n",
+ "\n",
+ "from bs4 import BeautifulSoup\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "916ed81c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "SCOPES=['https://www.googleapis.com/auth/gmail.readonly']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "79632735",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Refresh failed, regenerating token from scratch...\n",
+ "Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=265589690150-l4lpc8b29q6nb31afis0k72v7e0nbbld.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A62732%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.readonly&state=PWn22TXVDudt7nzzDCBgmzJz9y01IO&access_type=offline\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "from google_auth_oauthlib.flow import InstalledAppFlow\n",
+ "from google.oauth2.credentials import Credentials\n",
+ "from google.auth.transport.requests import Request\n",
+ "\n",
+ "SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']\n",
+ "\n",
+ "def authenticate_gmail():\n",
+ "    creds = None\n",
+ "\n",
+ "    # Reuse a cached token if one exists and is readable.\n",
+ "    if os.path.exists('token.json'):\n",
+ "        try:\n",
+ "            creds = Credentials.from_authorized_user_file('token.json', SCOPES)\n",
+ "        except Exception:\n",
+ "            print(\"Corrupted token.json, deleting and regenerating...\")\n",
+ "            os.remove('token.json')\n",
+ "            creds = None\n",
+ "\n",
+ "    if not creds or not creds.valid:\n",
+ "        if creds and creds.expired and creds.refresh_token:\n",
+ "            try:\n",
+ "                creds.refresh(Request())\n",
+ "            except Exception:\n",
+ "                print(\"Refresh failed, regenerating token from scratch...\")\n",
+ "                os.remove('token.json')\n",
+ "                creds = None\n",
+ "\n",
+ "        # Fall back to the browser flow whenever there is no usable token.\n",
+ "        if not creds or not creds.valid:\n",
+ "            # The README asks for the client secrets file to be named credentials.json.\n",
+ "            flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)\n",
+ "            creds = flow.run_local_server(port=0)\n",
+ "        with open('token.json', 'w') as token:\n",
+ "            token.write(creds.to_json())\n",
+ "\n",
+ "    return creds\n",
+ "\n",
+ "creds = authenticate_gmail()\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "c69601f5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "service = build('gmail', 'v1', credentials=creds)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e9d4321",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### fetch_emails(query: str) supports flexible search queries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "89dd5027",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import base64\n",
+ "import pandas as pd\n",
+ "from email import policy\n",
+ "import email\n",
+ "\n",
+ "def fetch_emails(service, query='', max_results=100):\n",
+ "    all_emails = []\n",
+ "    next_page_token = None\n",
+ "    fetched = 0\n",
+ "\n",
+ "    while True:\n",
+ "        response = service.users().messages().list(\n",
+ "            userId='me',\n",
+ "            q=query,\n",
+ "            pageToken=next_page_token,\n",
+ "            maxResults=min(100, max_results - fetched)\n",
+ "        ).execute()\n",
+ "        messages = response.get('messages', [])\n",
+ "        if not messages:\n",
+ "            break\n",
+ "\n",
+ "        for msg_meta in messages:\n",
+ "            try:\n",
+ "                msg = service.users().messages().get(\n",
+ "                    userId='me',\n",
+ "                    id=msg_meta['id'],\n",
+ "                    format='raw'\n",
+ "                ).execute()\n",
+ "                raw_bytes = base64.urlsafe_b64decode(msg['raw'])\n",
+ "                mime_msg = email.message_from_bytes(raw_bytes, policy=policy.default)\n",
+ "                sender = mime_msg.get('From', '')\n",
+ "                subject = mime_msg.get('Subject', '')\n",
+ "                date = mime_msg.get('Date', '')\n",
+ "\n",
+ "                # Extracting body\n",
+ "                body = \"\"\n",
+ "                if mime_msg.is_multipart():\n",
+ "                    for part in mime_msg.walk():\n",
+ "                        ctype = part.get_content_type()\n",
+ "                        payload = part.get_payload(decode=True)\n",
+ "                        if ctype == 'text/plain' and payload:\n",
+ "                            body = payload.decode(errors='replace')\n",
+ "                            break\n",
+ "                        elif ctype == 'text/html' and payload and not body:\n",
+ "                            body = payload.decode(errors='replace')\n",
+ "                else:\n",
+ "                    payload = mime_msg.get_payload(decode=True)\n",
+ "                    if payload:\n",
+ "                        body = payload.decode(errors='replace')\n",
+ "\n",
+ "                all_emails.append([sender, subject, date, body])\n",
+ "                fetched += 1\n",
+ "                if fetched >= max_results:\n",
+ "                    break\n",
+ "            except Exception as e:\n",
+ "                print(f\"Error processing message {msg_meta['id']}: {e}\")\n",
+ "                continue\n",
+ "\n",
+ "        if fetched >= max_results:\n",
+ "            break\n",
+ "        next_page_token = response.get('nextPageToken')\n",
+ "        if not next_page_token:\n",
+ "            break\n",
+ "\n",
+ "    df = pd.DataFrame(all_emails, columns=['Sender', 'Subject', 'Date', 'Body'])\n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45acffec",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Trying out with a keyword \"interview\"\n",
+ "\n",
+ "* This returns a DataFrame of emails whose subject contains \"interview\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "f4070e44",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr>\n",
+ "      <th></th>\n",
+ "      <th>Sender</th>\n",
+ "      <th>Subject</th>\n",
+ "      <th>Date</th>\n",
+ "      <th>Body</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>Citizens &lt;noreply@mail.modernhire.com&gt;</td>\n",
+ "      <td>Your Citizens interview link is a click away! ...</td>\n",
+ "      <td>Fri, 25 Apr 2025 20:47:19 +0000</td>\n",
+ "      <td>&lt;style type=\"text/css\"&gt;#EmailBody a {color:#02...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>Quora Digest &lt;english-personalized-digest@quor...</td>\n",
+ "      <td>How do I get a 3.6 in a Google interview?</td>\n",
+ "      <td>Thu, 24 Apr 2025 17:09:35 +0000</td>\n",
+ "      <td>Top stories for Maithili\\r\\n\\r\\n-----\\r\\n\\r\\nQ...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>Team Unstop &lt;noreply@unstop.news&gt;</td>\n",
+ "      <td>Top Companies are Hiring [2025]! - Ace Your In...</td>\n",
+ "      <td>Thu, 27 Feb 2025 13:33:03 +0530</td>\n",
+ "      <td>&lt;!DOCTYPE html&gt;\\n\\n&lt;html&gt;&lt;head&gt;&lt;title&gt;&lt;/title&gt;...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>\"Atria Convergence Technologies (ACT)\" &lt;udita....</td>\n",
+ "      <td>\u2709\ufe0f Walk-in interview | Collection Executive in...</td>\n",
+ "      <td>Mon, 24 Feb 2025 17:49:37 +0530</td>\n",
+ "      <td>\\r\\n&lt;!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>Katie Warren from TopResume &lt;katie@topresume.com&gt;</td>\n",
+ "      <td>Your interview</td>\n",
+ "      <td>Thu, 13 Feb 2025 07:13:06 -0500</td>\n",
+ "      <td>A few minutes of this will improve your chance...</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Sender \\\n",
+ "0 Citizens \n",
+ "1 Quora Digest \n",
+ "3 \"Atria Convergence Technologies (ACT)\" \n",
+ "\n",
+ " Subject \\\n",
+ "0 Your Citizens interview link is a click away! ... \n",
+ "1 How do I get a 3.6 in a Google interview? \n",
+ "2 Top Companies are Hiring [2025]! - Ace Your In... \n",
+ "3 \u2709\ufe0f Walk-in interview | Collection Executive in... \n",
+ "4 Your interview \n",
+ "\n",
+ " Date \\\n",
+ "0 Fri, 25 Apr 2025 20:47:19 +0000 \n",
+ "1 Thu, 24 Apr 2025 17:09:35 +0000 \n",
+ "2 Thu, 27 Feb 2025 13:33:03 +0530 \n",
+ "3 Mon, 24 Feb 2025 17:49:37 +0530 \n",
+ "4 Thu, 13 Feb 2025 07:13:06 -0500 \n",
+ "\n",
+ " Body \n",
+ "0 \n",
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr>\n",
+ "      <th></th>\n",
+ "      <th>Sender</th>\n",
+ "      <th>Subject</th>\n",
+ "      <th>Date</th>\n",
+ "      <th>Body</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>Caleb Ralphs &lt;caleb.ralphs@valisinsights.com&gt;</td>\n",
+ "      <td>Data Science Intern - VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 23:24:51 +0000</td>\n",
+ "      <td>Hello Maithili,\\r\\n\\r\\nThank you for applying ...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>Workable &lt;noreply@candidates.workablemail.com&gt;</td>\n",
+ "      <td>Thanks for applying to VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 22:24:30 +0000</td>\n",
+ "      <td>VALIS Insights\\r\\n\\r\\n------------------------...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>\"AncestryRecruiting@ancestry.com\" &lt;AncestryRec...</td>\n",
+ "      <td>Thank you for applying to Ancestry!</td>\n",
+ "      <td>Wed, 30 Apr 2025 14:34:03 -0700</td>\n",
+ "      <td>&lt;!doctype html&gt;&lt;html lang=en xmlns=\"http://www...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>Stack Sports &lt;do-not-reply@mail.paylocity.com&gt;</td>\n",
+ "      <td>Thank you for applying!</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:27:12 +0000</td>\n",
+ "      <td>&lt;html style=\"background-color: #F4F6F8;\"&gt;&lt;head...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>no-reply@us.greenhouse-mail.io</td>\n",
+ "      <td>Thank you for applying to DeepIntent</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:23:04 +0000</td>\n",
+ "      <td>Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Sender \\\n",
+ "0 Caleb Ralphs \n",
+ "1 Workable \n",
+ "2 \"AncestryRecruiting@ancestry.com\" \n",
+ "4 no-reply@us.greenhouse-mail.io \n",
+ "\n",
+ " Subject Date \\\n",
+ "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n",
+ "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n",
+ "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n",
+ "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n",
+ "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n",
+ "\n",
+ " Body \n",
+ "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n",
+ "1 VALIS Insights\\r\\n\\r\\n------------------------... \n",
+ "2 \n",
+ "### Cleaning the dataset "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "59d53423",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from bs4 import BeautifulSoup\n",
+ "import re\n",
+ "\n",
+ "def is_css_or_junk(text):\n",
+ "    t = text.strip()\n",
+ "    if not t:\n",
+ "        return True\n",
+ "    if re.match(r'^([.#]?\\w+\\s*\\{)', t):\n",
+ "        return True\n",
+ "    if len(t) < 40 and all(c in '{};:. \\r\\n\\t' for c in t):\n",
+ "        return True\n",
+ "    lines = t.splitlines()\n",
+ "    if lines:\n",
+ "        css_lines = [l for l in lines if l.strip().endswith('{') or l.strip().endswith('}')]\n",
+ "        if len(css_lines) / len(lines) > 0.7:\n",
+ "            return True\n",
+ "    return False\n",
+ "\n",
+ "def clean_html_body(html_body):\n",
+ "    soup = BeautifulSoup(html_body, \"html.parser\")\n",
+ "    for s in soup([\"script\", \"style\"]):\n",
+ "        s.decompose()\n",
+ "    text = soup.get_text(separator=\"\\n\", strip=True)\n",
+ "    text = re.sub(r'\\n+', '\\n', text)\n",
+ "    text = re.sub(r'[ \\t]+', ' ', text)\n",
+ "    return text.strip()\n",
+ "\n",
+ "def clean_body(text):\n",
+ "    if not isinstance(text, str):\n",
+ "        return \"\"\n",
+ "    if is_css_or_junk(text):\n",
+ "        return \"\"\n",
+ " if '\n",
+ "\n",
+ "\n",
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr>\n",
+ "      <th></th>\n",
+ "      <th>Sender</th>\n",
+ "      <th>Subject</th>\n",
+ "      <th>Date</th>\n",
+ "      <th>Body</th>\n",
+ "      <th>Body_Clean</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>Caleb Ralphs &lt;caleb.ralphs@valisinsights.com&gt;</td>\n",
+ "      <td>Data Science Intern - VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 23:24:51 +0000</td>\n",
+ "      <td>Hello Maithili,\\r\\n\\r\\nThank you for applying ...</td>\n",
+ "      <td>Hello Maithili,\\r\\n\\r\\nThank you for applying ...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>Workable &lt;noreply@candidates.workablemail.com&gt;</td>\n",
+ "      <td>Thanks for applying to VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 22:24:30 +0000</td>\n",
+ "      <td>VALIS Insights\\r\\n\\r\\n------------------------...</td>\n",
+ "      <td>VALIS Insights\\r\\n\\r\\n------------------------...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>\"AncestryRecruiting@ancestry.com\" &lt;AncestryRec...</td>\n",
+ "      <td>Thank you for applying to Ancestry!</td>\n",
+ "      <td>Wed, 30 Apr 2025 14:34:03 -0700</td>\n",
+ "      <td>&lt;!doctype html&gt;&lt;html lang=en xmlns=\"http://www...</td>\n",
+ "      <td>Hi Maithili,\\nThank you for taking the time to...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>Stack Sports &lt;do-not-reply@mail.paylocity.com&gt;</td>\n",
+ "      <td>Thank you for applying!</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:27:12 +0000</td>\n",
+ "      <td>&lt;html style=\"background-color: #F4F6F8;\"&gt;&lt;head...</td>\n",
+ "      <td>Dear Maithili,Thank you for your interest in a...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>no-reply@us.greenhouse-mail.io</td>\n",
+ "      <td>Thank you for applying to DeepIntent</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:23:04 +0000</td>\n",
+ "      <td>Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...</td>\n",
+ "      <td>Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Sender \\\n",
+ "0 Caleb Ralphs \n",
+ "1 Workable \n",
+ "2 \"AncestryRecruiting@ancestry.com\" \n",
+ "4 no-reply@us.greenhouse-mail.io \n",
+ "\n",
+ " Subject Date \\\n",
+ "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n",
+ "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n",
+ "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n",
+ "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n",
+ "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n",
+ "\n",
+ " Body \\\n",
+ "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n",
+ "1 VALIS Insights\\r\\n\\r\\n------------------------... \n",
+ "2 \n",
+ "### Extracting unique email address "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "a23559da",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_unique_email(text):\n",
+ "    if not isinstance(text, str):\n",
+ "        return []\n",
+ "    return list(set(re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', text)))\n",
+ "\n",
+ "\n",
+ "def add_extracted_emails_column(df):\n",
+ "    def get_emails(row):\n",
+ "        emails = extract_unique_email(row['Sender']) + extract_unique_email(row['Body_Clean'])\n",
+ "        return list(set(emails))\n",
+ "    df['Extracted_Emails'] = df.apply(get_emails, axis=1)\n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "6fa984f8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr>\n",
+ "      <th></th>\n",
+ "      <th>Sender</th>\n",
+ "      <th>Subject</th>\n",
+ "      <th>Date</th>\n",
+ "      <th>Body</th>\n",
+ "      <th>Body_Clean</th>\n",
+ "      <th>Extracted_Emails</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>Caleb Ralphs &lt;caleb.ralphs@valisinsights.com&gt;</td>\n",
+ "      <td>Data Science Intern - VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 23:24:51 +0000</td>\n",
+ "      <td>Hello Maithili,\\r\\n\\r\\nThank you for applying ...</td>\n",
+ "      <td>Hello Maithili,\\r\\n\\r\\nThank you for applying ...</td>\n",
+ "      <td>[caleb.ralphs@valisinsights.com]</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>Workable &lt;noreply@candidates.workablemail.com&gt;</td>\n",
+ "      <td>Thanks for applying to VALIS Insights</td>\n",
+ "      <td>Wed, 30 Apr 2025 22:24:30 +0000</td>\n",
+ "      <td>VALIS Insights\\r\\n\\r\\n------------------------...</td>\n",
+ "      <td>VALIS Insights\\r\\n\\r\\n------------------------...</td>\n",
+ "      <td>[noreply@candidates.workablemail.com, maithili...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>\"AncestryRecruiting@ancestry.com\" &lt;AncestryRec...</td>\n",
+ "      <td>Thank you for applying to Ancestry!</td>\n",
+ "      <td>Wed, 30 Apr 2025 14:34:03 -0700</td>\n",
+ "      <td>&lt;!doctype html&gt;&lt;html lang=en xmlns=\"http://www...</td>\n",
+ "      <td>Hi Maithili,\\nThank you for taking the time to...</td>\n",
+ "      <td>[maithili.a7@gmail.com, AncestryRecruiting@anc...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>Stack Sports &lt;do-not-reply@mail.paylocity.com&gt;</td>\n",
+ "      <td>Thank you for applying!</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:27:12 +0000</td>\n",
+ "      <td>&lt;html style=\"background-color: #F4F6F8;\"&gt;&lt;head...</td>\n",
+ "      <td>Dear Maithili,Thank you for your interest in a...</td>\n",
+ "      <td>[do-not-reply@mail.paylocity.com]</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>no-reply@us.greenhouse-mail.io</td>\n",
+ "      <td>Thank you for applying to DeepIntent</td>\n",
+ "      <td>Wed, 30 Apr 2025 21:23:04 +0000</td>\n",
+ "      <td>Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...</td>\n",
+ "      <td>Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...</td>\n",
+ "      <td>[no-reply@us.greenhouse-mail.io]</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Sender \\\n",
+ "0 Caleb Ralphs \n",
+ "1 Workable \n",
+ "2 \"AncestryRecruiting@ancestry.com\" \n",
+ "4 no-reply@us.greenhouse-mail.io \n",
+ "\n",
+ " Subject Date \\\n",
+ "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n",
+ "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n",
+ "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n",
+ "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n",
+ "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n",
+ "\n",
+ " Body \\\n",
+ "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n",
+ "1 VALIS Insights\\r\\n\\r\\n------------------------... \n",
+ "2