Overarching goal: A user should be able to trigger a process on the server that pulls data from the COPA website and imports new Allegations into the database.
Things to keep in mind:
- The COPA website is: https://data.cityofchicago.org/Public-Safety/COPA-Cases-Summary/mft5-nfa8
- A data summary of copa complaints can be found here: https://data.cityofchicago.org/Public-Safety/COPA-Cases-Summary-Dashboard/uei2-mi82
- COPA has an API that we should use (see the fetch sketch after this list)
- Further information on our goals is at the bottom of this issue.
- If any step fails, that should not end the pipeline. Perhaps parallelize these steps on a per-row basis.
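
A minimal fetch sketch, assuming the dataset is served through the standard Socrata SODA endpoint for mft5-nfa8 (the function name and page size below are illustrative, not existing code):

```python
# Pull the COPA Cases Summary dataset via the Socrata SODA API.
# An app token (X-App-Token header) is optional but avoids throttling.
import requests

COPA_ENDPOINT = "https://data.cityofchicago.org/resource/mft5-nfa8.json"

def fetch_copa_records(page_size=1000):
    """Yield all COPA case-summary records, paging with $limit/$offset."""
    offset = 0
    while True:
        resp = requests.get(
            COPA_ENDPOINT,
            params={"$limit": page_size, "$offset": offset},
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from batch
        offset += page_size

if __name__ == "__main__":
    records = list(fetch_copa_records())
    print(f"Fetched {len(records)} COPA records")
```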
Goals:
- From the UI, a user should be able to initiate a COPA job and be redirected to the live status page after it starts
- Stages of a COPA job:
  - Initial data - download data from the COPA site and store it in Google Cloud Storage under initial_data/ (see the initial-data sketch after this list)
  - Phantom rows:
    - Clean - Split the data based on assignment (see the clean sketch after this list)
      - All rows with assignment 'copa' should be saved under cleaned/copa.csv
      - All rows without assignment 'copa' should be saved under cleaned/other-assignment.csv
    - Transform - Create database rows from the raw data (see the transform sketch after this list)
      - All rows with the assignment 'copa' should be transformed into data_allegation rows and saved under transformed/copa.csv
      - A separate CSV should be made that ties log_no from COPA to all information in the original dataset that is not part of the data_allegation row, and saved under transformed/misc-data.csv
      - Note: not all of the columns will be filled
      - Note: If any rows cannot be transformed, they should be saved under errors/transform_error.csv and shown in the UI. Needs work to detect a failed transform and produce the error file. The only way the transform can fail is if API endpoints are removed or changed; no rows will be returned in those cases, so the errors file should record the API error that was given.
    - Augment - Replace columns with foreign key references (see the augment sketch after this list)
      - All transformed COPA rows should replace the current_category column with a reference to the data_allegationcategory table for that particular category
        - Note: A row failing to augment should not end the pipeline
        - Note: if any rows cannot be augmented, they should be saved under errors/augment_failures.csv and shown in the UI
    - Load - Load augmented rows into the database (see the load sketch after this list)
      - Check if there already exists a row with that log_no.
        - If there is, verify that all the data matches, including:
          - finding_code should match against data_officerallegation.final_finding. Requires a new entity type (data_officerallegation) plus logic to match the scraped COPA column final_finding with data_officerallegation.final_finding.
          - All data fields that are in data_allegation and also in the COPA response (log_no, current_category, beat)
          - If any data does not match, save it as the file "changed-allegation.csv" under errors/.
            - This should appear in the UI - able to use loader.changed_allegations. Needs a UI page to display data saved on the loader object.
          - If all data matches, disregard this row
        - If there does not exist a row with that log_no, add the row
      - Note: if any rows cannot be turned into an entity object, they should be saved under errors/entity_failures.csv and shown in the UI
  - Data validation:
    - Check for missing records
      - Check if any allegations present in COPA are missing from the original database
      - Display a list of these in the UI - able to use loader.db_rows_added to show in the UI. Needs a UI page to display data saved on the loader object.
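
Initial-data sketch: one way the raw export could be written to Google Cloud Storage under initial_data/, assuming the google-cloud-storage client; the bucket name is a placeholder and credentials come from the environment as usual:

```python
# Initial data: write the raw records fetched from the SODA API to GCS.
import csv
import io

from google.cloud import storage  # pip install google-cloud-storage

def store_initial_data(records, bucket_name="copa-pipeline"):
    """Serialize the raw records to CSV and upload to gs://<bucket>/initial_data/copa.csv."""
    fieldnames = sorted({key for row in records for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

    blob = storage.Client().bucket(bucket_name).blob("initial_data/copa.csv")
    blob.upload_from_string(buffer.getvalue(), content_type="text/csv")
```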
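
Clean sketch: a pandas split on the assignment column; the lower-cased column name is an assumption about how the dataset's ASSIGNMENT field comes through, and the paths mirror the ones listed above:

```python
# Clean: split the initial CSV into COPA-assigned rows and everything else.
import pandas as pd

def clean(initial_csv="initial_data/copa.csv"):
    df = pd.read_csv(initial_csv)
    # Rows with a missing assignment end up in other-assignment.csv.
    is_copa = df["assignment"].str.strip().str.lower() == "copa"
    df[is_copa].to_csv("cleaned/copa.csv", index=False)
    df[~is_copa].to_csv("cleaned/other-assignment.csv", index=False)
```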
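
Transform sketch: keep the columns that plausibly map onto data_allegation and route everything else, keyed by log_no, into transformed/misc-data.csv; the column list below is an assumption and not all of these will be filled:

```python
# Transform: produce data_allegation-shaped rows plus a misc-data CSV keyed by log_no.
import pandas as pd

# Columns assumed to map onto data_allegation fields.
ALLEGATION_COLUMNS = ["log_no", "complaint_date", "current_category", "finding_code", "beat"]

def transform(cleaned_csv="cleaned/copa.csv"):
    df = pd.read_csv(cleaned_csv)
    kept = [c for c in ALLEGATION_COLUMNS if c in df.columns]
    df[kept].to_csv("transformed/copa.csv", index=False)

    # Everything not used for data_allegation, tied back to the allegation by log_no.
    misc = ["log_no"] + [c for c in df.columns if c not in kept]
    df[misc].to_csv("transformed/misc-data.csv", index=False)
```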
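
Augment sketch: swap the current_category text for a foreign key into data_allegationcategory. AllegationCategory and its category field are assumed Django model names for that table, and augmented/copa.csv is a hypothetical intermediate path (the issue only specifies the error file):

```python
# Augment: replace the current_category text with a data_allegationcategory id.
import pandas as pd

from data.models import AllegationCategory  # assumed model behind data_allegationcategory

def augment(transformed_csv="transformed/copa.csv"):
    df = pd.read_csv(transformed_csv)
    category_ids = dict(AllegationCategory.objects.values_list("category", "id"))

    df["category_id"] = df["current_category"].map(category_ids)

    # A row that fails to augment should not end the pipeline: record it and move on.
    failures = df[df["category_id"].isna()]
    if not failures.empty:
        failures.to_csv("errors/augment_failures.csv", index=False)

    df[df["category_id"].notna()].drop(columns=["current_category"]).to_csv(
        "augmented/copa.csv", index=False)  # hypothetical intermediate output
```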
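
Load sketch: the log_no check against data_allegation, with the changed_allegations and db_rows_added bookkeeping the UI is meant to read off the loader. Allegation, its crid field, and the compared-field list are assumptions about the existing models:

```python
# Load: insert new phantom allegations, flag existing ones whose data no longer matches.
import pandas as pd

from data.models import Allegation  # assumed Django model behind data_allegation

# Fields assumed to exist both on data_allegation and in the augmented COPA rows.
COMPARED_FIELDS = ["beat", "category_id"]

class Loader:
    def __init__(self):
        self.changed_allegations = []  # rows whose existing DB values differ (for the UI)
        self.db_rows_added = []        # log_no values that were newly inserted (for the UI)

    def load(self, augmented_csv="augmented/copa.csv"):
        for row in pd.read_csv(augmented_csv).to_dict("records"):
            existing = Allegation.objects.filter(crid=row["log_no"]).first()
            if existing is None:
                # New phantom allegation: only the COPA-provided fields are available.
                Allegation.objects.create(crid=row["log_no"])
                self.db_rows_added.append(row["log_no"])
            elif any(getattr(existing, field, None) != row.get(field)
                     for field in COMPARED_FIELDS):
                self.changed_allegations.append(row)
            # If everything matches, the row is disregarded.

        if self.changed_allegations:
            pd.DataFrame(self.changed_allegations).to_csv(
                "errors/changed-allegation.csv", index=False)
```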
The business need:
From Rajiv:
The primary purpose of this COPA Data Portal data capture step is to create incomplete/phantom complaint records in our database (for new complaints since our last successful FOIA response) so that we can have some matching data for the new documents that are being picked up by our crawlers/scrapers (https://cpdp.co/crawlers and https://cpdp.co/documents).
The second purpose is to compare against the data that we have received via FOIA responses to see whether we are missing any records (i.e., were any responsive complaint records omitted from our original dataset, and if so, which ones).
The third purpose is to compare different versions/snapshots of the dataset over time and see what’s changing (is it just new records being added on to the end, or are older records being added, removed, or altered).
From Basecamp:
The Civilian Office of Police Accountability (COPA) has just posted a new live data feed to the City's Open Data Portal that goes back 10 years. Here are a few early questions to investigate.
- Are there CRs that appear here during the comparable time period (i.e., before October 2016) that don't appear in our FOIA'd datasets (which were produced in October 2016)? If so, how many and are there any revealing common characteristics amongst them to suggest why they may have been excluded from the dataset we received in response to our FOIA requests but not excluded from this public release on the City's public data portal. More likely is the inverse, i.e., complaints that we know of through our FOIA request, but that were excluded from the City's public data portal even during the overlapping time period of November 2007 – November 2016.
- For all the CRs that exist both in the City Data Portal and in our FOIA'd datasets, how many rows have conflicting values for the dynamic data fields, such as CURRENT_STATUS (which we expect to change over time for open cases), and for data fields that we might not expect to change, such as COMPLAINT_DATE? What can we learn from any patterns amongst these kinds of unexpected discrepancies, particularly when they occur in cases that are already closed?
- Are there any reasons not to import all these data and overwrite the conflicting fields in our existing dataset with more "up-to-date" information from the City's data portal (of course, any new CRs would be missing all officer-identifying data and other fields that are not being published to the data portal, until our next FOIA request)? The City Data Portal has a relatively robust API and supports numerous open standards for public APIs. Can we do all this importing and merging programmatically and run it on the Civis Platform on a routine basis? Is there any equivalent to cron built into the Platform? Apart from sanity checks, what kinds of issues will we run into that require human intervention/judgment (no officer-identifying data also means no officer profile matching challenges)?