Content Factory is a flexible scraping and automation Actor that turns arbitrary web URLs into structured content output via the Apify platform. Whether you need to fetch HTML, parse data, or run custom extraction logic, this tool makes it easy, and you can run it programmatically via the API or integrate it into larger workflows.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for Content Factory, you've just found your team. Let's chat!
Sometimes you don't need a scraper tailored to one site; you need a versatile tool that can fetch and extract from very different pages depending on your use case. Content Factory does exactly that. It wraps generic web fetching and content extraction into one reusable Actor, giving you a straightforward way to retrieve data from arbitrary URLs.
Perfect for automation pipelines, data ingestion, or building content-driven apps that rely on dynamic web sources.
- Works with almost any website or URL; not limited to a specific domain.
- Lets you programmatically trigger extraction via the Apify API (Python, JavaScript, CLI, plain HTTP, and more); see the sketch after this list.
- Outputs structured dataset results, ready for analysis or further processing.
- Great foundational tool that can be extended or combined with other Actors or AI workflows.
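For example, a minimal run via the JavaScript API client might look like the following sketch. The Actor ID `bitbash/content-factory` and the `startUrls` input field are assumptions for illustration; substitute the real values from the Apify Console.

```javascript
// npm install apify-client
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
// Actor ID and input fields below are illustrative placeholders.
const run = await client.actor('bitbash/content-factory').call({
    startUrls: [{ url: 'https://example.com/article/123' }],
});

// Fetch the structured results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```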
| Feature | Description |
|---|---|
| Generic Web Fetching | Loads given URLs using headless browser or HTTP requests depending on configuration. |
| Flexible Content Extraction | Returns page content, metadata or structured data depending on the site and use case. |
| API Integration | Easily invoked via the Apify HTTP API, CLI, or official SDKs (Python / JavaScript). |
| Dataset Output | Stores results in Apify dataset; can be exported to JSON, CSV, or other supported formats. |
| Multipurpose | Can be used for data collection, content scraping, web monitoring, or as part of larger automation workflows. |
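The exact input schema depends on the Actor build, so treat the following as a hypothetical illustration of the browser/HTTP switch and proxy support rather than the actual schema; all field names are assumptions, so check the Actor's input tab in the Apify Console.

```javascript
// Hypothetical input object (field names are assumptions, not the confirmed schema).
const input = {
    startUrls: [{ url: 'https://example.com/article/123' }],
    useBrowser: false,                            // switch between HTTP and headless-browser fetching
    maxRetries: 3,                                // illustrative, not a confirmed default
    proxyConfiguration: { useApifyProxy: true },  // standard Apify proxy convention
};
```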
Each item in the output dataset has the following shape:

| Field Name | Description |
|---|---|
| url | The URL that was fetched. |
| content | Raw HTML / text content of the page (or processed data if custom logic is applied). |
| metadata | Optional; page metadata such as title, headers, and status code. |
When using custom parsing logic or downstream processing, the output may include additional structured fields as needed.
```json
[
  {
    "url": "https://example.com/article/123",
    "content": "<html>…full HTML of page…</html>",
    "metadata": {
      "statusCode": 200,
      "retrievedAt": "2025-12-05T10:15:23Z",
      "title": "Example Article"
    }
  }
]
```
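Since the Actor returns raw HTML by default, pulling structured fields out of `content` is left to downstream code. A minimal sketch of such a step, assuming the `cheerio` library and the record shape shown above; the extracted fields are illustrative:

```javascript
// npm install cheerio
import * as cheerio from 'cheerio';

// `item` is one record from the dataset, shaped like the example above.
function extractArticle(item) {
    const $ = cheerio.load(item.content);
    return {
        url: item.url,
        title: $('title').text().trim(),
        headline: $('h1').first().text().trim(),
        links: $('a[href]').map((_, el) => $(el).attr('href')).get(),
    };
}
```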
```
content-factory/
├── src/
│   ├── main.js
│   ├── fetcher/
│   │   ├── http_fetch.js
│   │   └── browser_fetch.js
│   ├── parsers/                  # optional custom parsing logic
│   ├── utils/
│   │   ├── logger.js
│   │   └── proxy_handler.js      # for proxy support if used
│   └── config/
│       └── settings.example.json
├── package.json                  # or requirements.txt depending on SDK
└── README.md
```
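For orientation, here is a minimal sketch of what a fetcher module like `src/fetcher/http_fetch.js` could contain, assuming Node 18+ with native `fetch`; the retry and timeout values are illustrative, not the Actor's actual defaults:

```javascript
// src/fetcher/http_fetch.js (illustrative sketch, not the Actor's actual code)
export async function httpFetch(url, { retries = 3, timeoutMs = 30000 } = {}) {
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
            const content = await res.text();
            return {
                url,
                content,
                metadata: { statusCode: res.status, retrievedAt: new Date().toISOString() },
            };
        } catch (err) {
            if (attempt === retries) throw err; // give up after the last retry
        }
    }
}
```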
- Data ingestion pipelines: automatically pull content from arbitrary websites to feed into your database or data warehouse.
- Content monitoring: track webpages for changes, scrape updates, or archive page snapshots.
- Web-driven automation workflows: integrate as a first step before running site-specific parsing, AI analysis, or transformations.
- Rapid prototyping: test on random URLs before building dedicated scrapers.
- Research & analysis: collect raw HTML or content across varied sites for text analysis, NLP pipelines, or scraping experiments.
**Can I call Content Factory programmatically?**
Yes, you can trigger it via the HTTP API or with the Apify SDKs (Python or JavaScript).
**Does it require specifying site-specific parsing logic?**
No. By default it fetches raw content; if you need structured output, you can add your own parser logic on top.
**Which output formats are supported?**
Since results go into an Apify dataset, you can export them as JSON, CSV, Excel, or other supported formats.
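For example, a dataset can be downloaded in a given format through the standard Apify dataset-items endpoint; the dataset ID below is a placeholder:

```javascript
// Download a run's dataset as CSV via the Apify HTTP API.
const datasetId = 'YOUR_DATASET_ID'; // placeholder, taken from the run details
const res = await fetch(
    `https://api.apify.com/v2/datasets/${datasetId}/items?format=csv&token=${process.env.APIFY_TOKEN}`
);
const csv = await res.text();
```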
**Is it suitable for dynamic or JS-heavy sites?**
Yes. With proper configuration, or by using browser-based fetching, it can handle sites that require JavaScript execution.
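For reference, browser-based fetching of a JS-heavy page typically boils down to something like this Playwright sketch; it illustrates the approach, not necessarily the Actor's internal implementation:

```javascript
// npm install playwright
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
// Wait until network activity settles so JS-rendered content is present.
await page.goto('https://example.com/article/123', { waitUntil: 'networkidle' });
const html = await page.content();
await browser.close();
```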
- **Primary metric:** Fetches and outputs raw page content in under 2 seconds per URL (assuming standard site and network conditions).
- **Reliability metric:** Handles common network errors and retries automatically, ensuring a high success rate across varied web pages.
- **Efficiency metric:** Lightweight, with minimal overhead compared to full-fledged site-specific scrapers, making it efficient for bulk URL processing.
- **Quality metric:** Consistently returns full page content and metadata in a stable output format suitable for downstream pipelines.
