Skip to content

CastillaAlvaro/ih_datamadpt1120_project_m1

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ih_datamadpt1120_project_m1

This project consists mainly in extracting some data from three differente sources, transforme it and be able to obtain insights.

  • The first source used to retrieve some data, is a .db file. In this file you will find a sample of different european citizens answering several questions related to politics. Furthermore, you will see some other relevant information such as gender, age, country codes, level of studies, etc.

  • The second source, is an API (Open Skills Project). From this fount has been retrieved the job names that correspond to the several job codes that appear in the first source, the .db file. Otherwise, it would be difficult to understand which was the professional background of the people surveyed.

  • Third source is the Eurostat website. In this site, it has been used some web scraping techniques in order to bring to our dataset the names of the countries, to understand from where the people surveyed come from. As mentioned before, in the first source we have the country codes, but in order to see the data more clear,it was needed to retrieve the names.


Once checked the information provided by the API, I figured out that all job codes contained in the first dataset, were related to Data Analysis. The main objective was to be able to extract a table where it could be easy to see how many people living in a rural environment were employed, and what kind of profile could be found between the different countries.

Libraries used

  1. Sqlalchemy
  2. Argparse
  3. BeautifulSoup
  4. Pandas
  5. Rquests
  6. Numpy
  7. Re

🔧 Configuration

Install Python 3.7 and mandatory dependencies listed before.

If you are using the Anaconda distribution. Run the following command to activate the environment where you have all these dependencies installed.

conda activate name_env

🙈 Usage

  • Clone this repo locally.

  • Open a terminal, activate the appropiate environment and navigate to the repo's path.

  • As we are using argparse you have to specify to parameter to run the script:

     '-p' '--path' (required) / '-c' '--country' (optional - Default=all)
     
      path = raw_data_project_m1.db      country_choices=['Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czechia', 'Germany', 'Denmark', 'Estonia',
                                         'Spain', 'Finland', 'France','United Kingdom', 'Greece', 'Croatia', 'Hungary', 
                                         'Ireland', 'Italy', 'Lithuania','Luxembourg', 'Latvia', 'Malta','Netherlands', 
                                         'Poland','Portugal', 'Romania', 'Sweden', 'Slovenia', 'Slovakia']
    
  • In the terminal write: python main.py -p raw_data_project_m1.db or main.py -p raw_data_project_m1.db -c 'specific country'

Output

  • Three different csv files with the information obtained from the different sources (Query = df_sql_query.csv , APi = df_api.csv', Eurostat = df_countries.csv)

  • Aditional csv with all Dataframes merged and with the data processed (df_countries.csv)

  • A csv with the result. Two options:

    1. With all countries (df_rural.csv)
    2. With specific country (df_specific country.csv)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages