Smart Article Extractor extracts articles from any academic, scientific, or news website with a single click. It automatically identifies article pages and pulls structured content, saving time and improving research efficiency.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
Smart Article Extractor is a powerful tool for automatically scraping articles from websites, blogs, and academic sources. It solves the challenge of manually gathering content from multiple sources, making it ideal for researchers, journalists, and analysts.
- Extracts articles with one click from any website.
- Recognizes which pages contain articles automatically.
- Supports custom scraping and additional filters like date and word count.
- Can bypass some paywalls by sending Googlebot request headers.
- Returns data in multiple formats such as JSON, CSV, XML, Excel, and RSS.
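As a rough illustration of the Googlebot-header option listed above, the sketch below builds request headers that present Googlebot's published user-agent string. The helper name `buildRequestHeaders` and the `useGoogleBotHeaders` flag are hypothetical and are not taken from this project's code. Many sites also verify crawler IP addresses, so this approach works only on some paywalls.

```javascript
// Googlebot's publicly documented crawler user-agent string.
const GOOGLEBOT_UA =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// Hypothetical helper: returns the headers to send with a page request.
function buildRequestHeaders({ useGoogleBotHeaders = false, extraHeaders = {} } = {}) {
  const headers = { Accept: 'text/html,application/xhtml+xml', ...extraHeaders };
  if (useGoogleBotHeaders) {
    // Only override the user agent when the option is switched on.
    headers['User-Agent'] = GOOGLEBOT_UA;
  }
  return headers;
}

const googleBotHeaders = buildRequestHeaders({ useGoogleBotHeaders: true });
console.log(googleBotHeaders['User-Agent']);
```

With Puppeteer, the same string could be applied through `page.setUserAgent(GOOGLEBOT_UA)` before navigating to the page.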
| Feature | Description |
|---|---|
| Browser Support | Opens pages with a browser (Puppeteer) to scrape dynamic content. |
| Multi-URL Extraction | Allows scraping of articles from any number of URLs. |
| Smart Article Recognition | Automatically detects which pages are articles; the detection rules can be customized. |
| Advanced Filters | Filters by date, minimum word count, and other criteria. |
| Custom Scraping | Add or overwrite fields using your own parsing logic. |
| Google Bot Headers | Bypasses some paywalls by sending Googlebot request headers. |
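The date and minimum word count filters could work roughly as follows. This is a minimal sketch: `passesFilters`, `minWords`, and `dateFrom` are assumed names for illustration, not the tool's actual option names.

```javascript
// Count whitespace-separated words in the extracted text.
function countWords(text) {
  const trimmed = (text || '').trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical filter: keep an article only if it passes every configured check.
function passesFilters(article, { minWords = 0, dateFrom = null } = {}) {
  if (countWords(article.text) < minWords) return false;
  if (dateFrom !== null) {
    const published = Date.parse(article.date);
    // A missing or unparseable date is rejected when a date filter is set.
    if (Number.isNaN(published) || published < Date.parse(dateFrom)) return false;
  }
  return true;
}

const articles = [
  { title: 'Recent long read', date: '2020-07-07T12:13:00.000Z', text: 'one two three four five' },
  { title: 'Old note', date: '2019-01-01T00:00:00.000Z', text: 'one two three four five' },
  { title: 'Short blurb', date: '2020-07-08T00:00:00.000Z', text: 'too short' },
];

const kept = articles.filter((a) => passesFilters(a, { minWords: 3, dateFrom: '2020-01-01' }));
console.log(kept.map((a) => a.title)); // [ 'Recent long read' ]
```

Rejecting articles without a parseable date is one possible design choice; a real configuration might prefer to keep them instead.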
| Field Name | Field Description |
|---|---|
| url | Original article URL. |
| loadedUrl | Fully loaded page URL after any redirects. |
| title | Full title of the article. |
| softTitle | Simplified or cleaned-up version of the title. |
| date | Publication date of the article. |
| author | List of authors. |
| publisher | Publisher name if available. |
| copyright | Copyright information. |
| favicon | Website favicon path. |
| description | Short summary of the article content. |
| lang | Language code of the article. |
| canonicalLink | Canonical URL of the article. |
| tags | Tags or keywords associated with the article. |
| image | Main image URL of the article. |
| videos | Embedded video URLs if any. |
| links | Internal or external links found in the article. |
| text | Full text content of the article. |
| pageTitle | Page HTML title, added via custom output function. |
| originalDate | Original parsed date before any transformation. |
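The `pageTitle` and `originalDate` fields come from custom scraping, where user-supplied logic adds or overwrites fields. A sketch of how such a merge might look is given below; `applyCustomOutput`, `myOutputFn`, and the shape of the context object are assumptions made for illustration, not the tool's real interface.

```javascript
// Hypothetical merge step: fields returned by the custom function are added to,
// or overwrite, the automatically extracted fields.
function applyCustomOutput(article, context, customOutputFn) {
  const extra = customOutputFn({ article, ...context }) || {};
  return { ...article, ...extra };
}

// Example custom function: adds pageTitle, preserves the parsed date as
// originalDate, and overwrites date with a date-only string.
function myOutputFn({ article, pageTitle }) {
  return {
    pageTitle,
    originalDate: article.date,
    date: article.date ? article.date.slice(0, 10) : null,
  };
}

const merged = applyCustomOutput(
  { url: 'https://example.com/a', title: 'Example', date: '2020-07-07T12:13:00.000Z' },
  { pageTitle: 'Example - Example Site' },
  myOutputFn,
);
console.log(merged.date, merged.originalDate);
// 2020-07-07 2020-07-07T12:13:00.000Z
```

Spreading the custom fields last means they win over extracted fields with the same name, which matches the "add or overwrite" behavior described in the feature table.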
[
{
"url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
"loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
"title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
"softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
"date": "2020-07-07T12:13:00.000Z",
"author": ["Fariha Karim"],
"publisher": null,
"copyright": "Times Newspapers Limited 2020",
"favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
"description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
"lang": "en",
"canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
"tags": [],
"image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
"videos": [],
"links": [],
"text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room",
"pageTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ - The Times",
"originalDate": "2020-07-07T12:13:00.000Z"
}
]
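Records shaped like the sample above can be flattened into CSV, one of the listed output formats. The sketch below is one possible approach and is not the project's `exporters.js`; the function names and the choice to join array fields with `; ` are assumptions.

```javascript
// Format a single CSV cell: null becomes empty, arrays are joined with "; ",
// and cells containing quotes, commas, or line breaks are quoted with
// embedded quotes doubled.
function toCsvCell(value) {
  if (value === null || value === undefined) return '';
  const str = Array.isArray(value) ? value.join('; ') : String(value);
  return /[",\n\r]/.test(str) ? `"${str.replace(/"/g, '""')}"` : str;
}

// Build a CSV document with a header row from the chosen field names.
function toCsv(records, fields) {
  const header = fields.join(',');
  const rows = records.map((rec) => fields.map((f) => toCsvCell(rec[f])).join(','));
  return [header, ...rows].join('\n');
}

const csv = toCsv(
  [{ title: 'A "quoted", title', author: ['Fariha Karim'], publisher: null, lang: 'en' }],
  ['title', 'author', 'publisher', 'lang'],
);
console.log(csv);
// title,author,publisher,lang
// "A ""quoted"", title",Fariha Karim,,en
```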
smart-article-extractor-scraper/
├── src/
│   ├── runner.js
│   ├── extractors/
│   │   ├── article_parser.js
│   │   └── utils.js
│   ├── outputs/
│   │   └── exporters.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── inputs.sample.txt
│   └── sample.json
├── package.json
└── README.md
- Researchers use it to collect multiple academic articles, so they can build datasets and corpora efficiently.
- Journalists use it to gather news articles, so they can perform text analysis and monitor media trends.
- Analysts use it to track online content, so they can identify misinformation or trends quickly.
- Students use it to download reference material, so they can save time in research and citation tasks.
Q: Can I scrape articles behind paywalls? A: The tool can send Googlebot headers, which bypass some paywalls, but results vary depending on each website's restrictions.
Q: What formats can I get the data in? A: The scraper supports JSON, CSV, XML, Excel, RSS, and other standard formats for easy integration.
Q: Is it legal to use this tool? A: Scraping publicly available articles is generally legal, but you should respect copyright and each site's terms of use before publishing any extracted content.
Q: How many articles can I scrape at once? A: Thousands of results can be returned, but the actual number may vary based on website limitations, input complexity, and dynamic content.
- Primary Metric: Average scraping speed of 15–20 articles per second for standard news websites.
- Reliability Metric: 95% success rate on stable websites with dynamic content handling.
- Efficiency Metric: Low CPU usage when using Puppeteer headless mode with concurrency limits.
- Quality Metric: 99% accuracy in article detection, with smart recognition reducing false positives significantly.
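The concurrency limits mentioned in the efficiency metric can be pictured with a small worker-pool sketch. `mapWithConcurrency` is a hypothetical helper, not part of this project; it shows one common way to cap how many pages are processed at once.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Usage (extractArticle is a hypothetical per-URL extraction function):
// const articles = await mapWithConcurrency(urls, 5, (url) => extractArticle(url));
```

A lower limit reduces CPU and memory pressure from simultaneously open browser pages, at the cost of throughput.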
