Skip to content

Migrate from popit-django to django-popolo / multiple-django-popolo-sources#1181

Open
mhl wants to merge 51 commits intomasterfrom
migrate-to-django-popolo
Open

Migrate from popit-django to django-popolo / multiple-django-popolo-sources#1181
mhl wants to merge 51 commits intomasterfrom
migrate-to-django-popolo

Conversation

@mhl
Copy link
Collaborator

@mhl mhl commented Sep 6, 2016

PopIt is being deprecated, so it no longer makes sense for write-it / WriteInPublic to depend on PopIt instances as sources of data, or for it to depend on the deprecated PopIt client, popit-django. It makes more sense now to move to a model where data sources are all URLs of Popolo JSON, and to use the django-popolo project to store the data from such Popolo JSON sources. multiple-django-popolo-sources can be used to keep the data from each of these sources distinct.

(There is also the PopIt 2.0 / NG project from Sinar now; so long as that exposes an endpoint similar to PopIt's export.json, this can be still be used - we should check that, and make a feature request if not.)

This pull request migrates from using popit-django and PopIt to using django-popolo, multiple-django-popolo-sources and Popolo JSON URLs instead.

Notes for reviewers / self review

This is a large pull request, so at the request of @chrismytton I'm going to split it into smaller ones targetting a branch called reviewed-migrate-to-django-popolo, since it still needs to be deployed in one go. Many of the commits after the data migration (i.e. after "Remove a now unneeded popitapiinstance field") could be squashed into one commit which represents updates to the code and tests for the new models, but this would make it much harder to review.

Another consideration is that there are a lot of instances in writeinpublic.com - I've tried the data migration on various sources locally, but:

  • I don't know which live instances are actually used or are particularly important, we'd want to at least check that those are migrated fine.
  • I think it would be a good idea to set up a staging site with a replica of the live data (with DEBUG=False, but in test mode) so that we can try it on real data.

Sub-Pull Requests

@mhl mhl added the in progress label Sep 6, 2016
@coveralls
Copy link

coveralls commented Sep 6, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling f352344 on migrate-to-django-popolo into 7d76154 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from f352344 to 07352bd Compare September 22, 2016 14:34
@coveralls
Copy link

coveralls commented Sep 22, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling 07352bd on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from 07352bd to 500c1db Compare September 22, 2016 15:40
@coveralls
Copy link

coveralls commented Sep 22, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling 500c1db on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from 500c1db to 4c5fa8d Compare September 23, 2016 07:58
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling 4c5fa8d on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl changed the title WIP: Migrate from popit-django to django-popolo Migrate from popit-django to django-popolo / multiple-django-popolo-sources Sep 23, 2016
@mhl mhl force-pushed the migrate-to-django-popolo branch from 4c5fa8d to 2763382 Compare September 23, 2016 08:31
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling 2763382 on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from 2763382 to 1211646 Compare September 23, 2016 09:38
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling 1211646 on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from 1211646 to ae24194 Compare September 23, 2016 13:33
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling ae24194 on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from ae24194 to a8a4f80 Compare September 23, 2016 14:08
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling a8a4f80 on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from a8a4f80 to d212d48 Compare September 23, 2016 15:41
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage increased (+0.07%) to 98.518% when pulling d212d48 on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from d212d48 to dc98bea Compare September 23, 2016 17:54
@coveralls
Copy link

coveralls commented Sep 23, 2016

Coverage Status

Coverage decreased (-74.8%) to 23.627% when pulling dc98bea on migrate-to-django-popolo into 7b1b984 on master.

@mhl mhl force-pushed the migrate-to-django-popolo branch from dc98bea to 0ecc77e Compare September 23, 2016 18:07
@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 0ecc77e on migrate-to-django-popolo into 7b1b984 on master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 0ecc77e on migrate-to-django-popolo into 7b1b984 on master.

return url


def forwards(apps, schema_editor):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woow this is incredibly cautious particularly in preserving the old data!
I get the impression that this is a pretty interesting piece of code and we should be aware of what comes out of processing this migration. Anyhow, 👍

Copy link
Collaborator Author

@mhl mhl Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄 The scariest bit of this migration for me (although this whole PR is quite scary!) is actually the update_source_url function above, since there are so many different PopIt URLs in the live database, lots of which don't work at the moment anyway. I hope that update_source_url will produce working URLs for those were working previously, but I think it's particularly worth double-checking that.

class PopoloPersonQuerySet(models.QuerySet):

def get_from_api_uri(self, uri_from_api):
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be achieved by doing something like
PopoloPerson.objects.filter(identifiers__identifier=uri_from_api,identifiers__scheme__in=['popit_url', 'popolo_uri' ]).first()
the difference that I can see in both is that this one would return None if the Person isn't found, while in the current code it would raise PopoloPerson.DoesNotExist.
If the exception is there for a reason that I'm not seeing then it makes total sense.

Copy link
Collaborator Author

@mhl mhl Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wrote it that way to be clear that you should use the popit_url for finding people, if it exists, in preference to the popolo_uri - you wouldn't be sure of that using scheme_in unless you use an order_by on scheme as well, which I don't like much.

That's almost certainly being excessively defensive, however - given how they are formed, I don't think there should ever be two different PopoloPerson objects where one has a popolo_uri which is the popit_url of the other.

(I wanted it to raise a DoesNotExist if the person couldn't be found because the code that calls this is replacing a use of .get, so it seemed best to keep that behaviour.)

But yeah, I think we could save a query in lots of cases by doing what you suggest with get instead of filter & first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abosolutely understood 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the fix that you suggested in this fixup: d579dab

@coveralls
Copy link

coveralls commented Oct 25, 2016

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 2a6c2a2 on migrate-to-django-popolo into 7b1b984 on master.

mhl added 17 commits October 25, 2016 17:25
This include a migration that changes some foreign keys to use the
PopoloPerson proxy model instead of the base Person model.
Rather than using the /persons collection endpoint of PopIt's API, the
code now just uses export.json to extract Popolo JSON from PopIt.  This
commit has the results of re-recording these fixtures (using
roughly the procedure here:

   https://gist.github.com/mhl/848f7980d256ca38ff603375be4f264e

). The exception was persons_with_null_values.yaml which seems to be
missing the source JSON MongoDB fixture, so this was created by hand
based on one of the others.
The start_local_popit_api.bash script is designed to make it easy to set
up a local version of PopIt for testing against; this is still useful
for re-recording the fixtures (see the commit message for 967f842529
and https://gist.github.com/mhl/848f7980d256ca38ff603375be4f264e ).

However, this script installed a very old version of PopIt - this
commit updates it to use the most recent version. (A key problem with
the old version of this script is that it sets up a version of PopIt
without the export.json endpoints.)
This doesn't seem to matter, and since there is no Identifier associated
with a ContactDetail in django-popolo, it'd be awkward to preserve the
ContactDetail identifiers. It doesn't seem to be actually used anyway.
@mhl mhl force-pushed the migrate-to-django-popolo branch from 5afb702 to 66160ec Compare October 25, 2016 16:25
@coveralls
Copy link

coveralls commented Oct 25, 2016

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 66160ec on migrate-to-django-popolo into 0830395 on master.

It's possible for the test:

  nuntium.tests.home_view_tests:HomeViewTestCase.test_list_instances

... to fail because the order that instances are listed is not
specified.  This commit orders the instances by their slug.
@coveralls
Copy link

coveralls commented Oct 27, 2016

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 1c63a15 on migrate-to-django-popolo into 0830395 on master.

@coveralls
Copy link

coveralls commented Oct 27, 2016

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling d579dab on migrate-to-django-popolo into 0830395 on master.

mhl added 5 commits January 11, 2017 14:11
Since the migration to syncing people using django-popolo and
multiple-django-popolo-sources, we're now syncing all people (even
historic representatives) from EveryPolitician, not just those in the
current term(s).  However, we should only be allowing people to write
to current representatives.

The problem here is that the old behaviour was just to include everyone
from the data source who had a contact and hadn't been deliberately
disabled - some (or perhaps most) of these non-EveryPolitician data
sources won't have any way to tell who is current or not; the assumption
should continue that they should all be made available to write to.

We could divide these cases into sources that look like they're from
EveryPolitician or not from the URL, but that wouldn't help with people
serving EP data from elsewhere, or after modification.  Probably the
best approach would be to only show current representatives for sources
that have Event objects with classification "legislative period", and
then decide who is current by looking at those who have memberships with
a legislative_period_id corresponding to a current legislative period.

Unfortunately, at the moment django-popolo does not include the Event
class, and we need a workaround quickly for the Alpaca project, so this
commit uses a different heuristic for when to attempt to filter people
for being current memberships of a legislature: if the people from that
source have any memberships at all, it only includes those people who
have a current membership (of any organisation - of course this might
be completely spurious). This is good enough for the Alpaca case, which
will give us enough time to update django-popolo and
multiple-django-popolo-sources so that we can do this properly for the
main production writeinpublic site.

This commit also doesn't address whether historic representatives are
available via the API.
This makes the handling of a missing 'identifiers' array for people less
confusing and adds the option to preserve particular IDs on syncing.
The default behaviour is to update identifiers to remove any not in the
upstream data source when syncing, but we've deliberately created some
legacy IDs in migration that we want to keep.

(n.b. the popolo:person scheme is used is implicitly preserved because
we specified an id_prefix of 'popolo:')
We were getting one new (identical) popolo_uri Identifier with each sync
of the data; this commit should fix that.
@coveralls
Copy link

coveralls commented Jan 11, 2017

Coverage Status

Coverage increased (+0.07%) to 98.52% when pulling 3c70828 on migrate-to-django-popolo into 0830395 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants