leapcell/puppeteer-crawler

Web Scraper with Puppeteer

This project demonstrates how to deploy a web scraper that collects all the links from a given webpage using Puppeteer in a Node.js environment. It is designed to run on Leapcell (leapcell.io) and serves as a starting point for deploying projects that depend on web scraping.

Prerequisites

Before running the application, you need to prepare the Puppeteer environment. To do so, execute the following script:

sh prepare_puppeteer_env.sh

This will:

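  1. Install Puppeteer without downloading its bundled Chromium, since a system-installed browser is used instead.
  2. Install the fonts and shared libraries the browser needs.
  3. Install Google Chrome on amd64, or Chromium on arm64, and move the executable to ./google-chrome-stable.

The script relies on apt-get and dpkg, so it expects a Debian-based system and enough privileges to install packages.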

Project Structure

.
├── LICENSE                           # License file for the project
├── package.json                      # Contains metadata and dependencies for the Node.js project
├── prepare_puppeteer_env.sh           # Script for setting up the Puppeteer environment
└── src
    ├── app.js                        # Main application entry point using Express and Puppeteer
    └── views
        ├── error.ejs                 # Error page template displayed when something goes wrong
        ├── partials
        │   └── header.ejs            # Header template shared across pages
        └── success.ejs               # Success page template, showing the scraped links
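
To illustrate how the two page templates might be used, here is a minimal sketch of a helper that picks between success.ejs and error.ejs. The function name `renderOutcome`, the shape of `outcome`, and the template variables `links` and `message` are assumptions made for illustration; the actual code in src/app.js may be organized differently.

```javascript
// Hypothetical helper: choose the template for a scrape outcome.
// `res` is an Express-style response object exposing status() and render().
// `outcome` is assumed to be either { links: [...] } or { error: Error }.
function renderOutcome(res, outcome) {
  if (outcome.error) {
    // Render views/error.ejs with a short message (variable name is an assumption).
    return res.status(500).render('error', { message: outcome.error.message });
  }
  // Render views/success.ejs with the scraped links (variable name is an assumption).
  return res.render('success', { links: outcome.links });
}
```

In an Express route, this could be called as `renderOutcome(res, { links })` after a successful scrape, or as `renderOutcome(res, { error })` from a catch block.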

Running the Application

Once you've prepared the environment, you can start the web service with the following command:

npm start

The service will be available at http://localhost:3000. Enter the URL of the page you want to scrape, and the service returns a list of all links on that page.
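
The link collection itself can be sketched roughly as follows. This is not the project's src/app.js but a minimal illustration under stated assumptions: the function names `uniqueHttpLinks` and `collectLinks` are hypothetical, the `--no-sandbox` flags assume a container running as root, and the `networkidle2` wait condition is one reasonable choice among several.

```javascript
// Keep only absolute http(s) links and drop duplicates, preserving order.
function uniqueHttpLinks(hrefs) {
  const seen = new Set();
  const result = [];
  for (const href of hrefs) {
    let parsed;
    try {
      parsed = new URL(href);
    } catch {
      continue; // skip values that are not absolute URLs
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    if (!seen.has(parsed.href)) {
      seen.add(parsed.href);
      result.push(parsed.href);
    }
  }
  return result;
}

// Open a page in the prepared browser and return the links found on it.
// Assumption: the setup script has placed the browser at ./google-chrome-stable.
async function collectLinks(pageUrl, executablePath = './google-chrome-stable') {
  // Loaded lazily so uniqueHttpLinks can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({
    executablePath,
    headless: true,
    // Assumption: sandboxing is disabled because the process runs as root in a container.
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  try {
    const page = await browser.newPage();
    await page.goto(pageUrl, { waitUntil: 'networkidle2' });
    // Read the resolved href of every anchor element on the page.
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    return uniqueHttpLinks(hrefs);
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

Closing the browser in a `finally` block keeps a failed navigation from leaving a stray browser process behind, which matters in a long-running web service.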


Explanation of prepare_puppeteer_env.sh

This script sets up the environment Puppeteer needs to run. The full script is shown below, followed by a summary of its main steps:

#!/bin/bash

# Exit immediately if a command exits with a non-zero status
set -e

# --- 1. Common Setup ---
# Install Puppeteer without downloading its bundled Chromium
PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true npm install puppeteer

# Update apt list and install common fonts and libraries required by both browsers
echo "INFO: Installing common fonts and libraries..."
apt-get update
apt-get install -y \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-thai-tlwg \
    fonts-kacst \
    fonts-freefont-ttf \
    libxss1 \
    --no-install-recommends

# --- 2. Install Browser Based on Architecture ---
ARCH=$(dpkg --print-architecture)
echo "INFO: Detected architecture: $ARCH"

if [ "$ARCH" = "amd64" ]; then
    # For amd64 (x86_64) architecture, install Google Chrome
    echo "INFO: Installing Google Chrome for amd64..."
    apt-get install -y wget gnupg
    wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
    echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google.list
    apt-get update
    apt-get install -y google-chrome-stable --no-install-recommends
    BROWSER_EXEC="google-chrome-stable"

elif [ "$ARCH" = "arm64" ]; then
    # For arm64 architecture, install Chromium
    # Google Chrome is not available for arm64, so we install the open-source version, Chromium
    echo "INFO: Installing Chromium for arm64..."
    apt-get install -y chromium --no-install-recommends
    BROWSER_EXEC="chromium"

else
    echo "ERROR: Unsupported architecture: $ARCH" >&2
    exit 1
fi

# --- 3. Cleanup and Verification ---
# Clean up apt cache to reduce image size
echo "INFO: Cleaning up apt cache..."
rm -rf /var/lib/apt/lists/*

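# Find the path of the installed browser executable.
# '|| true' keeps 'set -e' from aborting here, so the check below can report the error.
chrome_path=$(command -v "$BROWSER_EXEC" || true)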

# Verify if the browser was installed successfully and move the executable
if [ -n "$chrome_path" ]; then
    echo "INFO: Browser executable found at: $chrome_path"
    
    # --- START: MODIFICATION ---
    # On arm64, rename 'chromium' to 'google-chrome-stable' for compatibility with the JS code.
    # On amd64, this just moves 'google-chrome-stable' to the current directory.
    mv "$chrome_path" ./google-chrome-stable
    echo "INFO: Moved executable to ./google-chrome-stable"
    # --- END: MODIFICATION ---

else
    echo "ERROR: Browser executable '$BROWSER_EXEC' not found in PATH." >&2
    exit 1
fi

echo "✅ Setup complete. The browser executable is now available at ./google-chrome-stable"

  • PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true npm install puppeteer: Installs Puppeteer without downloading its bundled Chromium, because a system-installed browser is used instead.

  • apt-get update and the first apt-get install: Refresh the package list and install the fonts and the libxss1 library that the browser needs.

  • On amd64, the script installs wget and gnupg, adds Google's signing key and repository, and then installs google-chrome-stable. On arm64, where Google Chrome is not available, it installs chromium instead. Any other architecture aborts with an error.

  • rm -rf /var/lib/apt/lists/*: Removes the downloaded apt package lists to reduce image size.

  • Finally, the script locates the installed browser executable and moves it to ./google-chrome-stable in the current directory, so Puppeteer can use the same path on both architectures.
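
Because Puppeteer is installed without a bundled browser, the application has to tell it where the executable is. A minimal sketch of building launch options for the moved executable is shown below; the function name `browserLaunchOptions` and the `baseDir` parameter are hypothetical, and the project's own code may resolve the path differently.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Build Puppeteer launch options that point at the executable the setup
// script leaves in the project directory. `baseDir` is a hypothetical
// parameter that defaults to the current working directory.
function browserLaunchOptions(baseDir = process.cwd()) {
  const executablePath = path.join(baseDir, 'google-chrome-stable');
  if (!fs.existsSync(executablePath)) {
    // Fail early with a hint instead of a less obvious launch error.
    throw new Error(
      `Browser not found at ${executablePath}; run prepare_puppeteer_env.sh first.`
    );
  }
  return { executablePath, headless: true };
}
```

The returned object can be passed directly to `puppeteer.launch()`. Checking for the file up front gives a clearer message when the setup script has not been run.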


Contact Support

If you have any issues or questions, feel free to reach out to support@leapcell.io.
