Skip to content

Conversation

@laols574
Copy link

@laols574 laols574 commented Feb 4, 2020

ADDITIONALLY:
On comments.py, there is a section of code that begins with "if back". This part is checking whether or not it needs to iterate upwards to get the rest of the comments in the replies. However, for some comments, like

https://mbasic.facebook.com/comment/replies/?ctoken=10162169751605725_10162170377070725&p=129&count=168&pc=1&ft_ent_identifier=10162169751605725&gfid=AQBjT1xFFeGcZxyW&refid=52&__tn__=R

which IS the first link visited from the main comment page because FB displays a middle comment on the main page due to its popularity. In order to prevent missing out on scraping these entries,
you should change:

back = response.xpath('//div[contains(@id,"comment_replies_more_1")]/a/@href').extract()

to

back = response.xpath('//div[contains(@id,"comment_replies_more_2")]/a/@href').extract()

in order to get the algorithm to iterate forwards as well. After, you have to merge these two separately generated csv files. This ended up being the easiest solution for me, but it's definitely possible to be done within a single program

added additional items to gather more information about the profile of the user and the specific reactions
see comments for details on changes, but I add random time pauses, crawled specific reaction and profile data
updated error handling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant