diff --git a/README.md b/README.md index dc8bae8..2ef4a60 100644 --- a/README.md +++ b/README.md @@ -4,43 +4,6 @@ ## Overview -The goal of this project is for you to practice what you have learned in the APIs and Web Scraping chapter of this program. For this project, you will choose both an API to obtain data from and a web page to scrape. For the API portion of the project will need to make calls to your chosen API, successfully obtain a response, request data, convert it into a Pandas data frame, and export it as a CSV file. For the web scraping portion of the project, you will need to scrape the HTML from your chosen page, parse the HTML to extract the necessary information, and either save the results to a text (txt) file if it is text or into a CSV file if it is tabular data. +The following project, which I call Scrapper, uses the Wikipedia API to list the details of Fortune 500 companies. The output is written to a .json file; the larger the number of companies you choose, the larger the output file will be. For this particular case I chose 50 companies from the top 500 list. -**You will be working individually for this project**, but we'll be guiding you along the process and helping you as you go. Show us what you've got! - --- - -## Technical Requirements - -The technical requirements for this project are as follows: - -* You must obtain data from an API using Python. -* You must scrape and clean HTML from a web page using Python. -* The results should be two files - one containing the tabular results of your API request and the other containing the results of your web page scrape. -* Your code should be saved in a Jupyter Notebook and your results should be saved in a folder named output. -* You should include a README.md file that describes the steps you took and your thought process for obtaining data from the API and web page. 
- -## Necessary Deliverables - -The following deliverables should be pushed to your Github repo for this chapter. - -* **A Jupyter Notebook (.ipynb) file** that contains the code used to work with your API and scrape your web page. -* **An output folder** containing the outputs of your API and scraping efforts. -* **A ``README.md`` file** containing a detailed explanation of your approach and code for retrieving data from the API and scraping the web page as well as your results, obstacles encountered, and lessons learned. - -## Suggested Ways to Get Started - -* **Find an API to work with** - a great place to start looking would be [API List](https://apilist.fun/) and [Public APIs](https://github.com/toddmotto/public-apis). If you need authorization for your chosen API, make sure to give yourself enough time for the service to review and accept your application. Have a couple back-up APIs chosen just in case! -* **Find a web page to scrape** and determine the content you would like to scrape from it - blogs and news sites are typically good candidates for scraping text content, and [Wikipedia](https://www.wikipedia.org/) is usually a good source for HTML tables (search for "list of..."). -* **Break the project down into different steps** - note the steps covered in the API and web scraping lessons, try to follow them, and make adjustments as you encounter the obstacles that are inevitable due to all APIs and web pages being different. -* **Use the tools in your tool kit** - your knowledge of intermediate Python as well as some of the things you've learned in previous chapters. This is a great way to start tying everything you've learned together! -* **Work through the lessons in class** & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... _procrastinating_. -* **Commit early, commit often**, don’t be afraid of doing something incorrectly because you can always roll back to a previous version. 
-* **Consult documentation and resources provided** to better understand the tools you are using and how to accomplish what you want. - -## Useful Resources - -* [Requests Library Documentation: Quickstart](http://docs.python-requests.org/en/master/user/quickstart/) -* [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) -* [Stack Overflow Python Requests Questions](https://stackoverflow.com/questions/tagged/python-requests) -* [StackOverflow BeautifulSoup Questions](https://stackoverflow.com/questions/tagged/beautifulsoup) +I called the retrieved details wiki_data; they contain several key components such as the founder, net worth, equity, HQ, and so on. Because this project is tied to the Fortune 500, the number of companies is limited; however, obtaining a wiki_data page for, say, a football player would also be possible, provided the API works at the same level. It was a fun project! I had to read a lot, but I enjoyed it nonetheless :). \ No newline at end of file diff --git a/Scrapper.ipynb b/Scrapper.ipynb new file mode 100644 index 0000000..6a712d9 --- /dev/null +++ b/Scrapper.ipynb @@ -0,0 +1,896 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# before this step, run pip install wptools and wikipedia so the program can execute\n", + "\n", + "import json\n", + "import wptools\n", + "import wikipedia\n", + "import pandas as pd\n", + "import urllib.request" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('fortune_500_companies.csv', )" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# importing a csv file with the list of Fortune 500 companies\n", + "\n", + "url = 'https://raw.githubusercontent.com/MonashDataFluency/python-web-scraping/master/data/fortune_500_companies.csv'\n", + "\n", + 
"urllib.request.urlretrieve(url, 'fortune_500_companies.csv')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "ncompany = 'fortune_500_companies.csv'\n", + "df = pd.read_csv(ncompany)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# we specify the number of companies and use iloc to slice and copy into a new data frame, converting the resulting column to a list\n", + "\n", + "num_of_companies = 50\n", + "df2 = df.iloc[:num_of_companies, :].copy()\n", + "companies = df2['company_name'].tolist()" ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# now we want to attach the Wikipedia article to each company name\n", + "\n", + "wiki_search = [{company: wikipedia.search(company)} for company in companies]\n", + "\n", + "# while doing the project I realized Wikipedia may treat Apple as, literally, the fruit, so to make sure we get the company:\n", + "\n", + "companies[companies.index('Apple')] = 'Apple Inc.'" ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "wiki_data = []\n", + "# since we want to match the info contained in the wiki infoboxes, we make a list of the elements we want to check\n", + "\n", + "elementos = ['founder','location_country','revenue','operating_income','net_income','assets','equity','type','industry','products','num_employees']" ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stderr",
image, image_size,...\n", + " iwlinks: https://commons.wikimedia.org/wiki/Category:W...\n", + " pageid: 33589\n", + " parsetree: