16 changes: 16 additions & 0 deletions Pipfile
@@ -0,0 +1,16 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
scrapy = "*"
scrapyd = "*"
pymongo = "*"
attrs = "*"
scrapy-fake-useragent = "*"

[dev-packages]

[requires]
python_version = "3.9"
531 changes: 531 additions & 0 deletions Pipfile.lock

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions requirements.txt
@@ -0,0 +1,46 @@
#
# These requirements were autogenerated by pipenv
# To regenerate from the project's Pipfile, run:
#
# pipenv lock --requirements
#

-i https://pypi.org/simple
attrs==21.2.0
automat==20.2.0
cffi==1.14.5
constantly==15.1.0
cryptography==3.4.7; python_version >= '3.6'
cssselect==1.1.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
fake-useragent==0.1.11
faker==8.1.4; python_version >= '3.6'
h2==3.2.0
hpack==3.0.0
hyperframe==5.2.0
hyperlink==21.0.0
idna==3.1; python_version >= '3.4'
incremental==21.3.0
itemadapter==0.2.0; python_version >= '3.6'
itemloaders==1.0.4; python_version >= '3.6'
jmespath==0.10.0; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
lxml==4.6.3; platform_python_implementation == 'CPython'
parsel==1.6.0
priority==1.3.0
protego==0.1.16; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
pyasn1-modules==0.2.8
pyasn1==0.4.8
pycparser==2.20; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
pydispatcher==2.0.5; platform_python_implementation == 'CPython'
pymongo==3.11.4
pyopenssl==20.0.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
python-dateutil==2.8.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
queuelib==1.6.1
scrapy-fake-useragent==1.4.4
scrapy==2.5.0
scrapyd==1.2.1
service-identity==21.1.0
six==1.16.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
text-unidecode==1.3
twisted[http2]==21.2.0; python_full_version >= '3.5.4'
w3lib==1.22.0
zope.interface==5.4.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
55 changes: 55 additions & 0 deletions scrawlinkinpark/README.md
@@ -0,0 +1,55 @@
>_Scrawling in my scheme_
_These quotes, they will not yield..._

# BIT Webcrawler Challenge

## Functional Requirements

1. Build a web crawler with the [Scrapy](https://scrapy.org/) _framework_.
2. Collect the quotes from the page _http://quotes.toscrape.com_.
3. Store the quote (string), the author (dictionary) with name (string) and bio URL (string), and the _tags_ (array) in a _MongoDB_ database.
4. Answer, through _MongoDB_ queries:
   - How many quotes were collected?
   - How many distinct tags were collected?
   - How many quotes per author were collected? _(example below)_

![](https://github.com/b2w-atech/desafio-webcrawler/raw/master/mongodb_aggregate.png)

## Non-Functional Requirements

1. Build a crawler that is robust, fast, and resilient to changes in the underlying technology stack.
2. Structure the components cohesively, without burdening them with excessive coupling.
3. Keep the in-code documentation concise, readable, and careful.

## Installation

Although nothing strictly prevents a global installation, using a Python virtual environment is strongly encouraged. The most convenient mechanism is [Pipenv](https://pipenv.pypa.io/en/latest/install/): running `pipenv install && pipenv shell` installs every dependency (except _MongoDB_) and activates the environment automatically. Alternatively, `pip3` can be used via `pip3 install -r `[`requirements.txt`](../requirements.txt). Just make sure the Python version is compatible.

## Running

Using [Scrapy](https://scrapy.org/) 2.5.0 on [Python 3.9.5](https://www.python.org/downloads/release/python-395/), alongside [MongoDB 4.4.2](https://docs.mongodb.com/manual/installation/), a simple _pipeline_ was built to scrape the quotes site and store the results in a database named `quotestoscrape`, in a collection named `ramon_melo`. The crawler stores the quote text as a string; the author as a dictionary with the keys `name` and `url`; and the tags as a list.
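For illustration, a stored document should look roughly like the following (a made-up example, not actual scraped data; MongoDB additionally adds an `_id` field on insertion):

```python
# Hypothetical document in the `quotestoscrape.ramon_melo` collection;
# the values below are illustrative only.
{
    'title': 'An example quote about the world.',   # quote text (string)
    'author': {                                     # author (dictionary)
        'name': 'Some Author',
        'url': 'http://quotes.toscrape.com/author/Some-Author',
    },
    'tags': ['example', 'world'],                    # tags (list)
}
```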

The test queries are recorded in [`queries.js`](queries.js). The results were obtained with the following command:

> [`scrawlinkinpark/scrawlinkinpark/spiders`](scrawlinkinpark/spiders/quotes_spyder.py)`$ scrapy runspider quotes_spyder.py -O `[`../../../tests/quotes.json`](../tests/quotes.json)


Pay close attention to the directory the program is run from. It writes its results both to the file [`tests/quotes.json`](../tests/quotes.json) and to the database at the default address `mongodb://127.0.0.1:27017`, which allows a direct, and even automated, comparison between the two. If a different address is used, update the `MONGO_URI` variable at the end of [`settings.py`](scrawlinkinpark/settings.py). Using the _MongoDB Shell_ across successive runs, the two outputs were found to match completely.
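For reference, the relevant end of [`settings.py`](scrawlinkinpark/settings.py) can be sketched as follows. The pipeline class name comes from `pipelines.py`; the priority value and the exact database name are assumptions matching the defaults described above, so check the actual file before relying on them:

```python
# Sketch only; see scrawlinkinpark/settings.py for the actual values.
ITEM_PIPELINES = {
    'scrawlinkinpark.pipelines.ScrawlinkinparkPipeline': 300,
}
MONGO_URI = 'mongodb://127.0.0.1:27017'
MONGO_DATABASE = 'quotestoscrape'
```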

The command above also supports crawling a single category of the site by passing the `-a tag=CATEGORIA` argument (see the sketch below). The settings additionally include privacy-oriented customizations for the crawler, since sites commonly try to block scraping once it is noticed: the program can rotate among different _user-agent_ strings and common browser headers, falling back to the default configuration when needed.
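Since `quotes_spyder.py` is not reproduced in this README, the snippet below is only a hypothetical sketch of how a Scrapy spider can receive the `tag` argument passed with `-a`; the class and attribute names are assumptions, not the project's actual implementation:

```python
# Hypothetical sketch of spider-argument handling, not the real spider.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def __init__(self, tag=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        base = 'http://quotes.toscrape.com'
        # With `-a tag=CATEGORIA`, start from that category page only;
        # otherwise crawl the whole site.
        self.start_urls = [f'{base}/tag/{tag}/' if tag else base]
```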

## Dependencies

- [Python 3.9](https://www.python.org/downloads/release/python-395/)
- [Scrapy 2.5](https://scrapy.org/)
- [MongoDB 4.4](https://docs.mongodb.com/manual/installation/)
- [PyMongo 3.11](https://pymongo.readthedocs.io/en/stable/installation.html)
- [Attrs 21.2](https://www.attrs.org/en/stable/index.html#getting-started) (improved readability)
- [scrapy-fake-useragent](https://github.com/alecxe/scrapy-fake-useragent) (privacy)

## Conclusion

This project was a well-spent opportunity to use a cold Sunday to warm up my gears and do a bit of a _refresher_. Thank you in advance for the challenge and for all the attention given to it.

>_This spyder endlessly has pulled the Web upon me_
_Distracting... Overwriting..._
20 changes: 20 additions & 0 deletions scrawlinkinpark/queries.js
@@ -0,0 +1,20 @@
// Question #1: How many quotes were collected?
// NOTE (2021-05-16 18:29:28): the official docs recommend avoiding `.count()`
// https://docs.mongodb.com/manual/reference/method/db.collection.count/
db.ramon_melo.countDocuments({})

// Question #2: How many distinct tags were collected?
db.ramon_melo.distinct('tags').length

// Question #3: How many quotes per author were collected?
db.ramon_melo.aggregate([
  {
    $group: {
      _id: '$author.name',
      qtd: { $sum: 1 }
    }
  },
  {
    $sort: { qtd: -1 }
  }
])
11 changes: 11 additions & 0 deletions scrawlinkinpark/scrapy.cfg
@@ -0,0 +1,11 @@
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scrawlinkinpark.settings

[deploy]
#url = http://localhost:6800/
project = scrawlinkinpark
Empty file.
13 changes: 13 additions & 0 deletions scrawlinkinpark/scrawlinkinpark/items.py
@@ -0,0 +1,13 @@
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import attr


@attr.s
class QuoteItem:
    # A scraped quote: its text, its author (name + bio URL) and its tags.
    title: str = attr.ib(default='')
    author: dict = attr.ib(factory=lambda: {'name': '', 'url': ''})
    tags: list = attr.ib(factory=list)
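# Usage sketch (not part of the original file): because QuoteItem is an
# attrs class, itemadapter can turn it into a plain dict, which is how the
# MongoDB pipeline serializes it, e.g.:
#
#   from itemadapter import ItemAdapter
#   item = QuoteItem(title='An example quote.', tags=['example'])
#   ItemAdapter(item).asdict()  # -> {'title': ..., 'author': ..., 'tags': ...}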
103 changes: 103 additions & 0 deletions scrawlinkinpark/scrawlinkinpark/middlewares.py
@@ -0,0 +1,103 @@
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ScrawlinkinparkSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrawlinkinparkDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
35 changes: 35 additions & 0 deletions scrawlinkinpark/scrawlinkinpark/pipelines.py
@@ -0,0 +1,35 @@
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymongo


class ScrawlinkinparkPipeline:
    collection_name = 'ramon_melo'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item