From d1beb28c41da6426a41cfa8d7ee3673771cdd04c Mon Sep 17 00:00:00 2001 From: maithili74 Date: Mon, 26 May 2025 13:27:40 -0400 Subject: [PATCH] Added gmail automation code MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pre-commit checks: All checks passed ✅ --- tutorial_gmail_automation/README.md | 32 + .../tutorial_gmail_automation.ipynb | 939 ++++++++++++++++++ 2 files changed, 971 insertions(+) create mode 100644 tutorial_gmail_automation/README.md create mode 100644 tutorial_gmail_automation/tutorial_gmail_automation.ipynb diff --git a/tutorial_gmail_automation/README.md b/tutorial_gmail_automation/README.md new file mode 100644 index 0000000000..22f270de3d --- /dev/null +++ b/tutorial_gmail_automation/README.md @@ -0,0 +1,32 @@ +# Gmail Automation + +This project is a Python-based utility to connect to Gmail, authenticate using OAuth 2.0, and retrieve emails based on a custom search query. The results are processed and displayed in a Pandas DataFrame. The utility also extracts unique email addresses from senders and email content. + +## Features + +- OAuth 2.0 authentication using `credentials.json` +- Retrieve emails using Gmail API with custom search queries +- Display email metadata (sender, subject, date, snippet) in a DataFrame +- Extract and list unique email addresses from sender fields and message bodies +- Designed to work inside a Jupyter Notebook environment + +--- + +## Google Cloud Setup + +1. Go to [Google Cloud Console](https://console.cloud.google.com/). +2. Create a new project (or select an existing one). +3. Enable the **Gmail API** for the project. +4. Navigate to **APIs & Services > Credentials**, click **Create Credentials > OAuth 2.0 Client ID**. +5. Choose **Desktop App** as the application type. +6. Download the `credentials.json` file and place it in the root directory of your project (same folder as the notebook). + + +## Usage + +1. Clone this repository or download the notebook. +2. 
Make sure `credentials.json` is in the project directory. +3. Open `tutorial_gmail_automation.ipynb` in Jupyter Notebook and run all cells: +   - The first time, you will be prompted to authorize access via a browser. +   - Once authenticated, a `token.json` will be saved for future sessions. +4. Modify the `search_query` variable inside the notebook to filter emails (e.g., `'subject:invoice'`, `'after:2024/01/01'`). diff --git a/tutorial_gmail_automation/tutorial_gmail_automation.ipynb b/tutorial_gmail_automation/tutorial_gmail_automation.ipynb new file mode 100644 index 0000000000..6caacee3a8 --- /dev/null +++ b/tutorial_gmail_automation/tutorial_gmail_automation.ipynb @@ -0,0 +1,939 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "CONTENTS:\n", + " - [Gmail Email Query & Processing](#gmail-email-query-&-processing)\n", + " - [Importing all the necessary libraries](#importing-all-the-necessary-libraries)\n", + " - [fetch_emails(query: str) supports flexible search queries](#fetch_emails(query:-str)-supports-flexible-search-queries)\n", + " - [Trying out with a keyword \"interview\"](#trying-out-with-a-keyword-\"interview\")\n", + " - [Trying out with a dates](#trying-out-with-a-dates)\n", + " - [Cleaning the dataset](#cleaning-the-dataset)\n", + " - [Extracting unique email address](#extracting-unique-email-address)" + ] + }, + { + "cell_type": "markdown", + "id": "20c23aeb", + "metadata": {}, + "source": [ + "\n", + "### Gmail Email Query & Processing" + ] + }, + { + "cell_type": "markdown", + "id": "dba266e8", + "metadata": {}, + "source": [ + "This notebook demonstrates a Python-based utility for connecting to Gmail, retrieving emails using flexible search queries and processing them within a Jupyter notebook environment.\n", + "\n", + "**Features:**\n", + "- Authenticate with Gmail using OAuth2 and the Gmail API\n", + "- Flexible search queries\n", + "- Results displayed in a Pandas DataFrame: Sender, Subject, Date, Body\n", + "- Clean 
email bodies (plain text/HTML)\n", + "- Extract unique email addresses from sender and body fields\n", + "- Notebook-friendly, modular code" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "0f5f4ab4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: google-api-python-client in c:\\users\\maithili\\anaconda3\\lib\\site-packages (2.169.0)\n", + "Collecting google-api-python-client\n", + " Downloading google_api_python_client-2.170.0-py3-none-any.whl (13.5 MB)\n", + " --------------------------------------- 13.5/13.5 MB 50.1 MB/s eta 0:00:00\n", + "Requirement already satisfied: google-auth-httplib2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (0.2.0)\n", + "Requirement already satisfied: google-auth-oauthlib in c:\\users\\maithili\\anaconda3\\lib\\site-packages (1.2.2)\n", + "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (2.24.2)\n", + "Requirement already satisfied: uritemplate<5,>=3.0.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (4.1.1)\n", + "Requirement already satisfied: google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (2.17.3)\n", + "Requirement already satisfied: httplib2<1.0.0,>=0.19.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-python-client) (0.22.0)\n", + "Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth-oauthlib) (1.3.1)\n", + "Requirement already satisfied: requests<3.0.0,>=2.18.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.32.3)\n", + "Requirement already satisfied: 
proto-plus<2.0.0,>=1.22.3 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.23.0)\n", + "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.19.5 in c:\\users\\maithili\\appdata\\roaming\\python\\python39\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (3.20.3)\n", + "Requirement already satisfied: googleapis-common-protos<2.0.0,>=1.56.2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.70.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (0.2.8)\n", + "Requirement already satisfied: rsa<5,>=3.1.4 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (4.9)\n", + "Requirement already satisfied: six>=1.9.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (1.16.0)\n", + "Requirement already satisfied: cachetools<6.0,>=2.0.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (5.3.0)\n", + "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from httplib2<1.0.0,>=0.19.0->google-api-python-client) (3.0.9)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib) (3.2.2)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in 
c:\\users\\maithili\\anaconda3\\lib\\site-packages (from pyasn1-modules>=0.2.1->google-auth!=2.24.0,!=2.25.0,<3.0.0,>=1.32.0->google-api-python-client) (0.4.8)\n", + "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2024.2.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (1.26.20)\n", + "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.6)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from requests<3.0.0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0,>=1.31.5->google-api-python-client) (2.0.4)\n", + "Installing collected packages: google-api-python-client\n", + " Attempting uninstall: google-api-python-client\n", + " Found existing installation: google-api-python-client 2.169.0\n", + " Uninstalling google-api-python-client-2.169.0:\n", + " Successfully uninstalled google-api-python-client-2.169.0\n", + "Successfully installed google-api-python-client-2.170.0\n" + ] + } + ], + "source": [ + "!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib" + ] + }, + { + "cell_type": "markdown", + "id": "b1f116a0", + "metadata": {}, + "source": [ + "\n", + "### Importing all the necessary libraries " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "caa0cb7c", + "metadata": {}, + "outputs": [], + "source": [ + "import os.path" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "0a736608", + 
"metadata": {}, + "outputs": [], + "source": [ + "from google.auth.transport.requests import Request\n", + "from google.oauth2.credentials import Credentials\n", + "from google_auth_oauthlib.flow import InstalledAppFlow\n", + "from googleapiclient.discovery import build\n", + "from googleapiclient.errors import HttpError" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5b1ca840", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: numexpr in c:\\users\\maithili\\anaconda3\\lib\\site-packages (2.10.2)\n", + "Requirement already satisfied: bottleneck in c:\\users\\maithili\\anaconda3\\lib\\site-packages (1.4.2)\n", + "Collecting bottleneck\n", + " Downloading bottleneck-1.5.0-cp39-cp39-win_amd64.whl (112 kB)\n", + " -------------------------------------- 112.1/112.1 kB 2.2 MB/s eta 0:00:00\n", + "Requirement already satisfied: numpy>=1.23.0 in c:\\users\\maithili\\anaconda3\\lib\\site-packages (from numexpr) (1.24.4)\n", + "Installing collected packages: bottleneck\n", + " Attempting uninstall: bottleneck\n", + " Found existing installation: Bottleneck 1.4.2\n", + " Uninstalling Bottleneck-1.4.2:\n", + " Successfully uninstalled Bottleneck-1.4.2\n", + "Successfully installed bottleneck-1.5.0\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --upgrade numexpr bottleneck\n", + "\n", + "#Upgrading numexpr and bottleneck ensures optimal performance and compatibility for pandas operations" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "f8426aad", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import base64\n", + "import pandas as pd\n", + "from email import policy\n", + "import email\n", + "\n", + "from bs4 import BeautifulSoup\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "916ed81c", + "metadata": {}, + "outputs": [], 
+ "source": [ + "SCOPES=['https://www.googleapis.com/auth/gmail.readonly']" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "79632735", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Refresh failed, regenerating token from scratch...\n", + "Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=265589690150-l4lpc8b29q6nb31afis0k72v7e0nbbld.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A62732%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.readonly&state=PWn22TXVDudt7nzzDCBgmzJz9y01IO&access_type=offline\n" + ] + } + ], + "source": [ + "import os\n", + "from google_auth_oauthlib.flow import InstalledAppFlow\n", + "from google.oauth2.credentials import Credentials\n", + "from google.auth.transport.requests import Request\n", + "\n", + "SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']\n", + "\n", + "def authenticate_gmail():\n", + "    creds = None\n", + "\n", + "    if os.path.exists('token.json'):\n", + "        try:\n", + "            creds = Credentials.from_authorized_user_file('token.json', SCOPES)\n", + "        except Exception as e:\n", + "            print(\"Corrupted token.json, deleting and regenerating...\")\n", + "            os.remove('token.json')\n", + "            creds = None\n", + " \n", + "    if not creds or not creds.valid:\n", + "        if creds and creds.expired and creds.refresh_token:\n", + "            try:\n", + "                creds.refresh(Request())\n", + "            except Exception as e:\n", + "                print(\"Refresh failed, regenerating token from scratch...\")\n", + "                os.remove('token.json')\n", + "                creds = None\n", + "\n", + "        if not creds:\n", + "            flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)\n", + "            creds = flow.run_local_server(port=0)\n", + "        with open('token.json', 'w') as token:\n", + "            token.write(creds.to_json())\n", + "\n", + "    return creds \n", + "\n", + "creds = authenticate_gmail()\n" + ] + }, + { + "cell_type": "code", + 
"execution_count": 9, + "id": "c69601f5", + "metadata": {}, + "outputs": [], + "source": [ + "service = build('gmail', 'v1', credentials=creds)" + ] + }, + { + "cell_type": "markdown", + "id": "2e9d4321", + "metadata": {}, + "source": [ + "\n", + "### fetch_emails(query: str) supports flexible search queries" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "89dd5027", + "metadata": {}, + "outputs": [], + "source": [ + "import base64\n", + "import pandas as pd\n", + "from email import policy\n", + "import email\n", + "\n", + "def fetch_emails(service, query='', max_results=100):\n", + "    all_emails = []\n", + "    next_page_token = None\n", + "    fetched = 0\n", + "\n", + "    while True:\n", + "        response = service.users().messages().list(\n", + "            userId='me',\n", + "            q=query, \n", + "            pageToken=next_page_token,\n", + "            maxResults=min(100, max_results - fetched)\n", + "        ).execute()\n", + "        messages = response.get('messages', [])\n", + "        if not messages:\n", + "            break\n", + "\n", + "        for msg_meta in messages:\n", + "            try:\n", + "                msg = service.users().messages().get(\n", + "                    userId='me',\n", + "                    id=msg_meta['id'],\n", + "                    format='raw'\n", + "                ).execute()\n", + "                raw_bytes = base64.urlsafe_b64decode(msg['raw'])\n", + "                mime_msg = email.message_from_bytes(raw_bytes, policy=policy.default)\n", + "                sender = mime_msg.get('From', '')\n", + "                subject = mime_msg.get('Subject', '')\n", + "                date = mime_msg.get('Date', '')\n", + "\n", + "                # Extracting body\n", + "                body = \"\"\n", + "                if mime_msg.is_multipart():\n", + "                    for part in mime_msg.walk():\n", + "                        ctype = part.get_content_type()\n", + "                        payload = part.get_payload(decode=True)\n", + "                        if ctype == 'text/plain' and payload:\n", + "                            body = payload.decode(errors='replace')\n", + "                            break\n", + "                        elif ctype == 'text/html' and payload and not body:\n", + "                            body = payload.decode(errors='replace')\n", + "                else:\n", + "                    payload = mime_msg.get_payload(decode=True)\n", + "                    if payload:\n", + "                        body = 
payload.decode(errors='replace')\n", + "\n", + "                all_emails.append([sender, subject, date, body])\n", + "                fetched += 1\n", + "                if fetched >= max_results:\n", + "                    break\n", + "            except Exception as e:\n", + "                print(f\"Error processing message {msg_meta['id']}: {e}\")\n", + "                continue\n", + "\n", + "        if fetched >= max_results:\n", + "            break\n", + "        next_page_token = response.get('nextPageToken')\n", + "        if not next_page_token:\n", + "            break\n", + "\n", + "    df = pd.DataFrame(all_emails, columns=['Sender', 'Subject', 'Date', 'Body'])\n", + "    return df" + ] + }, + { + "cell_type": "markdown", + "id": "45acffec", + "metadata": {}, + "source": [ + "\n", + "### Trying out with a keyword \"interview\"\n", + "\n", + "* It will give us a dataframe where interview is in the subject of the mail" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f4070e44", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SenderSubjectDateBody
0Citizens <noreply@mail.modernhire.com>Your Citizens interview link is a click away! ...Fri, 25 Apr 2025 20:47:19 +0000<style type=\"text/css\">#EmailBody a {color:#02...
1Quora Digest <english-personalized-digest@quor...How do I get a 3.6 in a Google interview?Thu, 24 Apr 2025 17:09:35 +0000Top stories for Maithili\\r\\n\\r\\n-----\\r\\n\\r\\nQ...
2Team Unstop <noreply@unstop.news>Top Companies are Hiring [2025]! - Ace Your In...Thu, 27 Feb 2025 13:33:03 +0530<!DOCTYPE html>\\n\\n<html><head><title></title>...
3\"Atria Convergence Technologies (ACT)\" <udita....\u2709\ufe0f Walk-in interview | Collection Executive in...Mon, 24 Feb 2025 17:49:37 +0530\\r\\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1...
4Katie Warren from TopResume <katie@topresume.com>Your interviewThu, 13 Feb 2025 07:13:06 -0500A few minutes of this will improve your chance...
\n", + "
" + ], + "text/plain": [ + " Sender \\\n", + "0 Citizens \n", + "1 Quora Digest \n", + "3 \"Atria Convergence Technologies (ACT)\" \n", + "\n", + " Subject \\\n", + "0 Your Citizens interview link is a click away! ... \n", + "1 How do I get a 3.6 in a Google interview? \n", + "2 Top Companies are Hiring [2025]! - Ace Your In... \n", + "3 \u2709\ufe0f Walk-in interview | Collection Executive in... \n", + "4 Your interview \n", + "\n", + " Date \\\n", + "0 Fri, 25 Apr 2025 20:47:19 +0000 \n", + "1 Thu, 24 Apr 2025 17:09:35 +0000 \n", + "2 Thu, 27 Feb 2025 13:33:03 +0530 \n", + "3 Mon, 24 Feb 2025 17:49:37 +0530 \n", + "4 Thu, 13 Feb 2025 07:13:06 -0500 \n", + "\n", + " Body \n", + "0 \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SenderSubjectDateBody
0Caleb Ralphs <caleb.ralphs@valisinsights.com>Data Science Intern - VALIS InsightsWed, 30 Apr 2025 23:24:51 +0000Hello Maithili,\\r\\n\\r\\nThank you for applying ...
1Workable <noreply@candidates.workablemail.com>Thanks for applying to VALIS InsightsWed, 30 Apr 2025 22:24:30 +0000VALIS Insights\\r\\n\\r\\n------------------------...
2\"AncestryRecruiting@ancestry.com\" <AncestryRec...Thank you for applying to Ancestry!Wed, 30 Apr 2025 14:34:03 -0700<!doctype html><html lang=en xmlns=\"http://www...
3Stack Sports <do-not-reply@mail.paylocity.com>Thank you for applying!Wed, 30 Apr 2025 21:27:12 +0000<html style=\"background-color: #F4F6F8;\"><head...
4no-reply@us.greenhouse-mail.ioThank you for applying to DeepIntentWed, 30 Apr 2025 21:23:04 +0000Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...
\n", + "" + ], + "text/plain": [ + " Sender \\\n", + "0 Caleb Ralphs \n", + "1 Workable \n", + "2 \"AncestryRecruiting@ancestry.com\" \n", + "4 no-reply@us.greenhouse-mail.io \n", + "\n", + " Subject Date \\\n", + "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n", + "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n", + "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n", + "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n", + "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n", + "\n", + " Body \n", + "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n", + "1 VALIS Insights\\r\\n\\r\\n------------------------... \n", + "2 \n", + "### Cleaning the dataset " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "59d53423", + "metadata": {}, + "outputs": [], + "source": [ + "from bs4 import BeautifulSoup\n", + "import re\n", + "\n", + "def is_css_or_junk(text):\n", + " t = text.strip()\n", + " if not t:\n", + " return True\n", + " if re.match(r'^([.#]?\\w+\\s*\\{)', t):\n", + " return True\n", + " if len(t) < 40 and all(c in '{};:. 
\\r\\n\\t' for c in t):\n", + " return True\n", + " lines = t.splitlines()\n", + " if lines:\n", + " css_lines = [l for l in lines if l.strip().endswith('{') or l.strip().endswith('}')]\n", + " if len(css_lines) / len(lines) > 0.7:\n", + " return True\n", + " return False\n", + "\n", + "def clean_html_body(html_body):\n", + " soup = BeautifulSoup(html_body, \"html.parser\")\n", + " for s in soup([\"script\", \"style\"]):\n", + " s.decompose()\n", + " text = soup.get_text(separator=\"\\n\", strip=True)\n", + " text = re.sub(r'\\n+', '\\n', text)\n", + " text = re.sub(r'[ \\t]+', ' ', text)\n", + " return text.strip()\n", + "\n", + "def clean_body(text):\n", + " if not isinstance(text, str):\n", + " return \"\"\n", + " if is_css_or_junk(text):\n", + " return \"\"\n", + " if '\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SenderSubjectDateBodyBody_Clean
0Caleb Ralphs <caleb.ralphs@valisinsights.com>Data Science Intern - VALIS InsightsWed, 30 Apr 2025 23:24:51 +0000Hello Maithili,\\r\\n\\r\\nThank you for applying ...Hello Maithili,\\r\\n\\r\\nThank you for applying ...
1Workable <noreply@candidates.workablemail.com>Thanks for applying to VALIS InsightsWed, 30 Apr 2025 22:24:30 +0000VALIS Insights\\r\\n\\r\\n------------------------...VALIS Insights\\r\\n\\r\\n------------------------...
2\"AncestryRecruiting@ancestry.com\" <AncestryRec...Thank you for applying to Ancestry!Wed, 30 Apr 2025 14:34:03 -0700<!doctype html><html lang=en xmlns=\"http://www...Hi Maithili,\\nThank you for taking the time to...
3Stack Sports <do-not-reply@mail.paylocity.com>Thank you for applying!Wed, 30 Apr 2025 21:27:12 +0000<html style=\"background-color: #F4F6F8;\"><head...Dear Maithili,Thank you for your interest in a...
4no-reply@us.greenhouse-mail.ioThank you for applying to DeepIntentWed, 30 Apr 2025 21:23:04 +0000Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...
\n", + "" + ], + "text/plain": [ + " Sender \\\n", + "0 Caleb Ralphs \n", + "1 Workable \n", + "2 \"AncestryRecruiting@ancestry.com\" \n", + "4 no-reply@us.greenhouse-mail.io \n", + "\n", + " Subject Date \\\n", + "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n", + "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n", + "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n", + "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n", + "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n", + "\n", + " Body \\\n", + "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n", + "1 VALIS Insights\\r\\n\\r\\n------------------------... \n", + "2 \n", + "### Extracting unique email address " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "a23559da", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_unique_email(text):\n", + "    if not isinstance(text, str):\n", + "        return []\n", + "    return list(set(re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', text)))\n", + "\n", + "\n", + "def add_extracted_emails_column(df):\n", + "    def get_emails(row):\n", + "        emails = extract_unique_email(row['Sender']) + extract_unique_email(row['Body_Clean'])\n", + "        return list(set(emails))\n", + "    df['Extracted_Emails'] = df.apply(get_emails, axis=1)\n", + "    return df" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "6fa984f8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SenderSubjectDateBodyBody_CleanExtracted_Emails
0Caleb Ralphs <caleb.ralphs@valisinsights.com>Data Science Intern - VALIS InsightsWed, 30 Apr 2025 23:24:51 +0000Hello Maithili,\\r\\n\\r\\nThank you for applying ...Hello Maithili,\\r\\n\\r\\nThank you for applying ...[caleb.ralphs@valisinsights.com]
1Workable <noreply@candidates.workablemail.com>Thanks for applying to VALIS InsightsWed, 30 Apr 2025 22:24:30 +0000VALIS Insights\\r\\n\\r\\n------------------------...VALIS Insights\\r\\n\\r\\n------------------------...[noreply@candidates.workablemail.com, maithili...
2\"AncestryRecruiting@ancestry.com\" <AncestryRec...Thank you for applying to Ancestry!Wed, 30 Apr 2025 14:34:03 -0700<!doctype html><html lang=en xmlns=\"http://www...Hi Maithili,\\nThank you for taking the time to...[maithili.a7@gmail.com, AncestryRecruiting@anc...
3Stack Sports <do-not-reply@mail.paylocity.com>Thank you for applying!Wed, 30 Apr 2025 21:27:12 +0000<html style=\"background-color: #F4F6F8;\"><head...Dear Maithili,Thank you for your interest in a...[do-not-reply@mail.paylocity.com]
4no-reply@us.greenhouse-mail.ioThank you for applying to DeepIntentWed, 30 Apr 2025 21:23:04 +0000Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...Maithili,\\r\\n\\r\\nThanks for applying to DeepIn...[no-reply@us.greenhouse-mail.io]
\n", + "
" + ], + "text/plain": [ + " Sender \\\n", + "0 Caleb Ralphs \n", + "1 Workable \n", + "2 \"AncestryRecruiting@ancestry.com\" \n", + "4 no-reply@us.greenhouse-mail.io \n", + "\n", + " Subject Date \\\n", + "0 Data Science Intern - VALIS Insights Wed, 30 Apr 2025 23:24:51 +0000 \n", + "1 Thanks for applying to VALIS Insights Wed, 30 Apr 2025 22:24:30 +0000 \n", + "2 Thank you for applying to Ancestry! Wed, 30 Apr 2025 14:34:03 -0700 \n", + "3 Thank you for applying! Wed, 30 Apr 2025 21:27:12 +0000 \n", + "4 Thank you for applying to DeepIntent Wed, 30 Apr 2025 21:23:04 +0000 \n", + "\n", + " Body \\\n", + "0 Hello Maithili,\\r\\n\\r\\nThank you for applying ... \n", + "1 VALIS Insights\\r\\n\\r\\n------------------------... \n", + "2