Updated the players.py file to parse correctly. #5
matthewstirling wants to merge 97 commits into mdgoldberg:ncaaf
…six for compatibility)
…ls are out of date.
Not sure what the correct way to do it is, but the ncaaf branch really could use an updated utils file. After the utils are updated, it may need a naming refactor. I've not done all that before, so I'm not sure how it works. I only edited the players file for now because it's what I need to use, but I have some other team stuff I could edit and work on. I'd rather work on this with an updated utils that more closely reflects master's 0.7.15 version.
… of annual leaders.
I believe the correct way to do that would be to just merge the master branch into the ncaaf branch. A possible issue (although I don't see this being an issue in this case) is that this will get updated copies of all possible files, not only utils.py. If you really want to update only utils.py, then you can do the following (make sure you're on the ncaaf branch first):
Even more explicit in identifying the h1 containing the player's name by using [itemprop="name"]
mdgoldberg left a comment
I left pretty granular feedback, so I'd just look at that. Your code quality is good imo, most of the comments are about codebase organization and/or correctness. Thanks so much again for contributing!
rawPos = re.search(r'Position : (\S+)', rawText, re.I).group(1)
allPositions = rawPos.split('/')
# TODO: returning just the first position for those with
# multiple positions. Should return the last position played
Interesting, why do you think I should return the last position played? Regardless, as you can see, this can become a bit of an issue. In most cases positions are clear-cut, but sometimes you run into issues with guys like Jabrill Peppers. I am currently leaning towards just returning a list regardless (e.g., return ['QB'] for Brady and ['DB', 'S', 'LB'] for Sanders), but curious to hear your thoughts.
I was thinking that players are typically referenced on most websites by one position. That position is usually the last position played. I'm thinking of Terrelle Pryor, but don't know a ton of these examples. Peppers is a pretty good example with all those positions. I'm fine with a list as I can just parse out the last position played from annual data.
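For reference, the list-returning approach discussed here could be sketched roughly as follows. This is only an illustration, not the code that was merged: `parse_positions` and the sample strings are hypothetical, and the regex is the one from the diff excerpt above.

```python
import re


def parse_positions(raw_text):
    """Return every listed position as a list (hypothetical helper)."""
    match = re.search(r'Position : (\S+)', raw_text, re.I)
    if match is None:
        # No position line found; return an empty list rather than raising.
        return []
    return match.group(1).split('/')


print(parse_positions('Position : QB'))       # ['QB']
print(parse_positions('Position : DB/S/LB'))  # ['DB', 'S', 'LB']
```

A caller who wants a single position can then index the list, taking whichever element matches the ordering on the page.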
doc = self.getDoc()
table = doc('#passing')
df = sportsref.utils.parseTable(table)
if df.empty and table.length > 0:
Sorry you had to go through this, but this stuff is fixed in the new utils file, as you seem to have discovered. Want to merge in the current master branch to get the newest decorators, utils, etc and then replace this with the original code without this edge case? Same with the other functions.
Sounds good, I'll look into a merge.
I merged and this works without the if/then at this point.
"""
doc = self.getDoc()
- table = doc('#rushing')
+ table = doc('#all_rushing')
Whose page are you using to look at the HTML? The pages I've looked at are all either 'table#rushing' or 'table#receiving'; I haven't seen any all_rushing or all_receiving.
So the reason I am using all_whatever is that plain rushing or passing works fine for the top-most table, but I could not get pyquery to find the tables with those ids on the lower portion of the player pages. For example, for Jabrill Peppers, the methods cannot see the 'rushing' id. For some reason it gets lost after the enormous comments included on those lower tables. That's also why I included the following:
    if df.empty and table.length > 0:
        tree = etree.fromstring(str(table))
        comments = tree.xpath('//comment()')
        comment = etree.tostring(comments[0])
        contents = comment.replace("<!--", "").replace("-->", "")
        table = pq(contents)
        df = sportsref.utils.parseTable(table)
Those lines parse the comment from the 'all_rushing' table which is the only way I could get at that data. It's pretty hacky, but you are welcome to try the other way. I got no results.
The new utils.get_html function takes care of this (just strips literally all comments). I realize that stripping all comments could be an issue, but for now it works. I actually think I may rewrite the relevant parts of the library using lxml instead of pyquery, because I've seen some benchmarks that it's a bit faster and I've been meaning to learn XPath. Anyways, for now, the new utils.get_html handles this, so using table#rushing should be fine.
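To illustrate the idea only (this is not the actual utils.get_html code), here is a minimal sketch that assumes "stripping comments" means removing the `<!--` and `-->` delimiters while keeping what is inside them, so that a table wrapped in a comment becomes ordinary markup. `uncomment_html` and the sample page are hypothetical.

```python
import re


def uncomment_html(html):
    """Drop the <!-- and --> markers but keep what was inside them."""
    return re.sub(r'<!--|-->', '', html)


# Made-up fragment shaped like the case described in this thread:
# the real table sits inside an HTML comment within a wrapper div.
page = ('<div id="all_rushing"><!-- <table id="rushing">'
        '<tr><td>1</td></tr></table> --></div>')
cleaned = uncomment_html(page)
print(cleaned)
```

After this step a selector such as 'table#rushing' has a real element to match, at the cost of also exposing any markup that was commented out on purpose.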
All this is removed now and seems to work well. Thanks!!
"""
doc = self.getDoc()
- table = doc('#defense')
+ table = doc('#all_defense')
Same here, seems to me like it should be #defense but maybe I'm wrong
see rushing comment above
All the table definitions are now of the type
table = doc('table#defense')
:returns: Pandas DataFrame with defensive stats.
"""
doc = self.getDoc()
table = doc('#all_scoring')
see rushing comment above
see comment above. now fixed
dfAll = dfAll.merge(dfScoring, 'outer', mergeList)
return dfAll


def get_college_leaders_one_year(year):
This is the sort of thing where I'd prefer to make a ncaaf/seasons.py file with a Season class (compare to nba.Season in the nba subpackage). This would contain all info on/related to pages like this one. Thank you for making it though! Seems like it would be useful for a draft forecasting model :)
Oh yea, that sounds great. I didn't know where to put this as it is appropriate in a seasons class but didn't have a home right now since that class doesn't exist. I agree with you completely. At first I was going to use this method in my forecasting model, but now I'm not, so I don't mind if it's tossed out for now.
In that case, feel free to just toss it here, and I can refactor it to a Season class when I get to that
:param year: the year of the roster
:return: a dataframe with the player_id, class, and position
"""
url = sportsref.ncaaf.BASE_URL +\
I think using \ here is frowned upon in Python (that's at least my impression). I think using parentheses is preferred.
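As a quick sketch of the parenthesized form: the URL path and the `roster_url` helper below are hypothetical and only show the continuation style, and the `BASE_URL` value is assumed from the docstring elsewhere in this PR.

```python
# Assumed value; in the library this would be sportsref.ncaaf.BASE_URL.
BASE_URL = 'http://www.sports-reference.com/cfb'


def roster_url(school_id, year):
    # Implicit line continuation inside parentheses; no trailing backslash.
    return (BASE_URL +
            '/schools/{}/{}-roster.html'.format(school_id, year))


print(roster_url('michigan', 2016))
```

The parentheses let the expression span lines without the risk of a stray space after a backslash breaking the statement.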
df = sportsref.utils.parseTable(table)
return df


def get_all_college_teams():
This is fine here; I had a similar function in the nba.teams module, but then I realized the teams change each year as teams are added. If this is a concern for ncaaf, it may make sense to make a similar function as a method within the ncaaf.Season class (that's how I solved it for nba, and I believe nfl).
The way the cfb site is organized, this page contains all teams that have ever played, along with the year they started and the year they stopped. I guess there are multiple ways to skin this cat. I'm also OK trying to do it with the year. I think I may have added such an approach in the nfl sportsref. Which way should I code this up?
I suppose it depends on the use case for this function. I guess there's no harm in leaving this the way it is now; if I change my mind, I may move things around, but for now, this is fine with me.
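If a per-year view is ever wanted, one possible sketch is to filter the all-teams data by the start and stop years mentioned above. Everything here is an assumption for illustration: the field names `team_id`, `year_min`, and `year_max`, the `teams_in_year` helper, and the placeholder rows are not taken from the library or the site.

```python
def teams_in_year(teams, year):
    """Keep teams whose first/last seasons bracket the given year."""
    return [t['team_id'] for t in teams
            if t['year_min'] <= year <= t['year_max']]


# Placeholder rows, not real data.
all_teams = [
    {'team_id': 'team-a', 'year_min': 1900, 'year_max': 2016},
    {'team_id': 'team-b', 'year_min': 1950, 'year_max': 1980},
]
print(teams_in_year(all_teams, 2000))  # ['team-a']
```

This keeps the single all-teams function as it is and leaves the year filter as a thin layer that a future Season class could call.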
""" Returns all the college teams from
http://www.sports-reference.com/cfb

:return: A dataframe with
with what?? the suspense is killing me 😃
@@ -61,14 +61,14 @@ def parseTable(table):
return pd.DataFrame()
Again, these changes should instead be added by merging master into ncaaf (or by git checkout master sportsref/utils.py if necessary).
I had to move the step that ignores *, +, and other annotation characters within the parse_table method in utils in order to strip the * notation from players' years (it indicates that the player's bowl stats are included in the stat line).
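Roughly, the kind of cleanup I mean looks like the sketch below. It is not the actual parse_table code: `clean_year` is a hypothetical helper, and it assumes the marks only ever appear at the end of the cell text.

```python
import re


def clean_year(cell_text):
    """Strip trailing annotation marks such as * or + from a year cell."""
    return re.sub(r'[*+]+$', '', cell_text.strip())


print(clean_year('2015*'))  # 2015
print(clean_year('2016'))   # 2016
```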
# Conflicts:
#	sportsref/ncaaf/players.py
So I think with the merge, this is now looking better. Utils is up-to-date and I changed the tables.
Note: this branch's utils are out of date.