brutuscat/medusa-crawler

Medusa is a Ruby framework that crawls websites and collects useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

  • Choose the links to follow on each page with focus_crawl

  • Multi-threaded design for high performance

  • Tracks 301 HTTP redirects

  • Allows exclusion of URLs based on regular expressions

  • Records response time for each page

  • Obeys robots.txt directives (optional, but recommended)

  • In-memory or persistent storage of pages during crawl, provided by Moneta

  • Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).

Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]

Medusa is versatile and meant to be used programmatically. You can start with one or multiple URIs:

require 'medusa'

Medusa.crawl('https://www.example.com', depth_limit: 2)
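
To make the effect of depth_limit concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over an in-memory link graph. It is not Medusa's implementation: LINK_GRAPH and crawl_with_depth_limit are hypothetical names invented for this illustration, and the sketch assumes that start URLs sit at depth 0 and that links are not followed from pages that have reached the limit.

```ruby
# Hypothetical link graph standing in for real HTTP fetches (assumption).
LINK_GRAPH = {
  'https://www.example.com/'    => ['https://www.example.com/a', 'https://www.example.com/b'],
  'https://www.example.com/a'   => ['https://www.example.com/a/1'],
  'https://www.example.com/b'   => ['https://www.example.com/'],
  'https://www.example.com/a/1' => ['https://www.example.com/a/1/x']
}.freeze

# Breadth-first traversal that stops following links at depth_limit.
# Accepts one start URL or an array of them.
# Returns a hash of visited URL => depth at which it was first seen.
def crawl_with_depth_limit(start_urls, depth_limit:, graph: LINK_GRAPH)
  visited = {}
  queue = Array(start_urls).map { |url| [url, 0] }

  until queue.empty?
    url, depth = queue.shift
    next if visited.key?(url) # each page is visited only once

    visited[url] = depth
    next if depth >= depth_limit # do not follow links from this page

    graph.fetch(url, []).each { |link| queue << [link, depth + 1] }
  end

  visited
end

result = crawl_with_depth_limit('https://www.example.com/', depth_limit: 2)
result.each { |url, depth| puts "#{depth} #{url}" }
```

Under these assumptions the page at /a/1/x is never visited, because it sits three links away from the start URL.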

Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:

require 'medusa'

Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
  crawler.discard_page_bodies = some_flag

  # Persist all the pages state across crawl-runs.
  crawler.clear_on_startup = false
  crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')

  crawler.skip_links_like(/private/)

  crawler.on_pages_like(/public/) do |page|
    logger.debug "[public page]  #{page.url} took #{page.response_time} found #{page.links.count}"
  end

  # Use arbitrary logic, page by page, to further customize the crawl.
  crawler.focus_crawl(/public/) do |page|
    page.links.first
  end
end
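
The way link filters such as skip_links_like and focus_crawl narrow down what gets followed can be sketched in plain Ruby. This is an illustration only, not Medusa's code: select_links is a hypothetical helper, and the sketch assumes that a focus block runs first to pick candidate links and that skip patterns are then tested against each link's path.

```ruby
require 'uri'

# Hypothetical helper (not part of Medusa).
# An optional block narrows the candidate links first (the "focus" step);
# links whose path matches any skip pattern are then dropped.
def select_links(links, skip_patterns: [])
  candidates = block_given? ? Array(yield(links)) : links
  candidates.reject do |link|
    skip_patterns.any? { |pattern| pattern.match?(link.path) }
  end
end

links = %w[
  https://www.example.com/public/index.html
  https://www.example.com/private/admin
  https://www.example.com/public/about
].map { |url| URI(url) }

# Skip everything under /private/.
followed = select_links(links, skip_patterns: [/private/])
followed.each { |link| puts link }

# Focus on the first two links only, then apply the same skip pattern.
focused = select_links(links, skip_patterns: [/private/]) { |all| all.first(2) }
focused.each { |link| puts link }
```

Keeping the focus step and the skip step separate means each one can be reasoned about on its own: the block decides what is interesting, the patterns decide what is off limits.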

Medusa depends on the following gems:

  • moneta for the key/value storage adapters

  • nokogiri for parsing the HTML of webpages

  • robotex for support of the robots.txt directives

To test and develop this gem, additional requirements are:

  • rspec

  • webmock

Medusa is a revamped version of the defunct anemone gem.

Copyright © 2009 Vertive, Inc.

Copyright © 2020 Mauro Asprea

Released under the MIT License
