Updated the players.py file to parse correctly. by matthewstirling · Pull Request #5 · mdgoldberg/sportsref

matthewstirling · 2017-01-15T06:52:39Z

Note: this branch's utils are out of date.


matthewstirling · 2017-01-15T06:54:49Z

Not sure what the correct way to do it is, but the ncaaf branch really could use an updated utils file. After the utils are updated, it may need a naming refactor. I've not done all that before, so I'm not sure how it works. I only edited the players file for now because it's what I need to use, but I have some other team stuff I could edit and work on. I'd rather work on this with an updated utils that more closely reflects master's 0.7.15 version.

mdgoldberg · 2017-01-17T09:02:26Z

I believe the correct way to do that would be to just merge the master branch into the ncaaf branch. A possible issue (although I don't see this being an issue in this case) is that this will get updated copies of all possible files, not only utils.py. If you really want to update only utils.py, then you can do the following (make sure you're on the ncaaf branch first): git checkout master sportsref/utils.py. That should do the trick!

mdgoldberg

I left pretty granular feedback, so I'd just look at that. Your code quality is good imo, most of the comments are about codebase organization and/or correctness. Thanks so much again for contributing!

mdgoldberg · 2017-01-17T09:12:27Z

+        rawPos = re.search(r'Position : (\S+)', rawText, re.I).group(1)
+        allPositions = rawPos.split('/')
+        # TODO: returning just the first position for those with
+        # multiple positions. Should return the last position played


interesting, why do you think I should return the last position played? Regardless, as you can see, this can become a bit of an issue. In most cases positions are clear-cut, but sometimes you run into issues with guys like Jabrill Peppers. I am currently leaning towards just returning a list regardless (e.g., return ['QB'] for Brady and ['DB', 'S', 'LB'] for Sanders), but curious to hear your thoughts.

I was thinking that players are typically referenced on most websites by one position. That position is usually the last position played. I'm thinking of Terrelle Pryor, but don't know a ton of these examples. Peppers is a pretty good example with all those positions. I'm fine with a list as I can just parse out the last position played from annual data.

Changed to a list.
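
A minimal sketch of the list-returning approach discussed above, assuming the player's info text contains a line such as `Position: DB/S/LB`. The function name `parse_positions` and the sample strings are hypothetical; this is not the code in players.py.

```python
import re


def parse_positions(raw_text):
    """Return every listed position as a list, e.g. ['DB', 'S', 'LB']."""
    # Allow optional whitespace on either side of the colon.
    match = re.search(r'Position\s*:\s*(\S+)', raw_text, re.I)
    if match is None:
        return []  # no position found in the text
    return match.group(1).split('/')


print(parse_positions('Position : QB'))      # ['QB']
print(parse_positions('Position: DB/S/LB'))  # ['DB', 'S', 'LB']
```

A caller that wants a single position can still take the first or last element of the returned list.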

mdgoldberg · 2017-01-17T09:31:35Z

        doc = self.getDoc()
        table = doc('#passing')
        df = sportsref.utils.parseTable(table)
+        if df.empty and table.length > 0:


Sorry you had to go through this, but this stuff is fixed in the new utils file, as you seem to have discovered. Want to merge in the current master branch to get the newest decorators, utils, etc and then replace this with the original code without this edge case? Same with the other functions.

Sounds good, I'll look into a merge.

I merged and this works without the if/then at this point.

mdgoldberg · 2017-01-17T09:34:00Z

        """
        doc = self.getDoc()
-        table = doc('#rushing')
+        table = doc('#all_rushing')


whose page are you using to look at the HTML? The pages I've looked at are all either 'table#rushing' or 'table#receiving', I haven't seen any all_rushing or all_receiving

So the reason I am using all_whatever is that plain rushing or passing works fine for the top-most table, but I could not get pyquery to find the tables with those ids on the lower portion of the player pages. For example, for Jabrill Peppers, the methods cannot see the 'rushing' id. For some reason it gets lost after the enormous comments included on those lower tables. That's also why I included the:

    if df.empty and table.length > 0:
        tree = etree.fromstring(str(table))
        comments = tree.xpath('//comment()')
        comment = etree.tostring(comments[0])
        contents = comment.replace("<!--", "").replace("-->", "")
        table = pq(contents)
        df = sportsref.utils.parseTable(table)

Those lines parse the comment from the 'all_rushing' table which is the only way I could get at that data. It's pretty hacky, but you are welcome to try the other way. I got no results.

The new utils.get_html function takes care of this (just strips literally all comments). I realize that stripping all comments could be an issue, but for now it works. I actually think I may rewrite the relevant parts of the library using lxml instead of pyquery, because I've seen some benchmarks that it's a bit faster and I've been meaning to learn XPath. Anyways, for now, the new utils.get_html handles this, so using table#rushing should be fine.

All this is removed now and seems to work well. Thanks!!
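
The comment-stripping idea can be sketched with the standard library alone. This is not the actual `utils.get_html` implementation, and it assumes "strips all comments" means removing the comment delimiters so the wrapped markup becomes live. `uncomment`, `TableIdCollector`, and `table_ids` are hypothetical names, and the sample HTML is a made-up fragment shaped like the pages described above, where a lower table is wrapped in an HTML comment.

```python
import re
from html.parser import HTMLParser


def uncomment(html):
    """Remove comment delimiters so commented-out markup becomes live."""
    return re.sub(r'<!--|-->', '', html)


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_ids.append(dict(attrs).get('id'))


def table_ids(html):
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.table_ids


# Made-up fragment: the second table is hidden inside a comment.
SAMPLE = (
    '<div id="all_passing"><table id="passing"></table></div>'
    '<div id="all_rushing"><!-- <table id="rushing"></table> --></div>'
)

print(table_ids(SAMPLE))             # ['passing']
print(table_ids(uncomment(SAMPLE)))  # ['passing', 'rushing']
```

Once the delimiters are gone, a selector like `table#rushing` has something to match, which is why the `all_rushing` workaround is no longer needed.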

mdgoldberg · 2017-01-17T09:34:39Z

        """
        doc = self.getDoc()
-        table = doc('#defense')
+        table = doc('#all_defense')


Same here, seems to me like it should be #defense but maybe I'm wrong

see rushing comment above

All the table definitions are now of the type
table = doc('table#defense')

mdgoldberg · 2017-01-17T09:34:59Z

+        :returns: Pandas DataFrame with defensive stats.
+        """
+        doc = self.getDoc()
+        table = doc('#all_scoring')


see rushing comment above

see comment above. now fixed

mdgoldberg · 2017-01-17T09:42:30Z

+            dfAll = dfAll.merge(dfScoring, 'outer', mergeList)
+        return dfAll
+
+def get_college_leaders_one_year(year):


This is the sort of thing where I'd prefer to make an ncaaf/seasons.py file with a Season class (compare to nba.Season in the nba subpackage). This would contain all info on/related to pages like this one. Thank you for making it though! Seems like it would be useful for a draft forecasting model :)

Oh yea, that sounds great. I didn't know where to put this as it is appropriate in a seasons class but didn't have a home right now since that class doesn't exist. I agree with you completely. At first I was going to use this method in my forecasting model, but now I'm not, so I don't mind if it's tossed out for now.

In that case, feel free to just toss it here, and I can refactor it to a Season class when I get to that

mdgoldberg · 2017-01-17T09:44:56Z

+        :param year: the year of the roster
+        :return: a dataframe with the player_id, class, and position
+        """
+        url = sportsref.ncaaf.BASE_URL +\


I think using \ here is frowned upon in Python (that's at least my impression). I think using parentheses is preferred.
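
For illustration, the same kind of concatenation can be wrapped in parentheses, which lets the expression span lines without a backslash. `BASE_URL` below is a stand-in constant, and the function name and path pieces (`/schools/`, `-roster.html`) are hypothetical, not the real URL scheme used in the PR.

```python
# Stand-in for sportsref.ncaaf.BASE_URL (assumed value).
BASE_URL = 'http://www.sports-reference.com/cfb'


def roster_url(team_id, year):
    # Parentheses give implicit line continuation; no trailing backslash.
    url = (BASE_URL +
           '/schools/{}/{}-roster.html'.format(team_id, year))
    return url


print(roster_url('example-team', 2016))
```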

mdgoldberg · 2017-01-17T09:47:05Z

+        df = sportsref.utils.parseTable(table)
+        return df
+
+def get_all_college_teams():


This is fine here; I had a similar function in the nba.teams module, but then I realized the teams change each year as teams are added. If this is a concern for ncaaf, it may make sense to make a similar function as a method within the ncaaf.Season class (that's how I solved it for nba, and I believe nfl).

The way the cfb site is organized, this page contains all teams that have ever played, along with the year each started and the year each stopped. I guess there are multiple ways to skin this cat. I'm also ok trying to do it with the year. I think I may have added such an approach in the nfl sportsref. Which way should I code this up?

I suppose it depends on the use case for this function. I guess there's no harm in leaving this the way it is now; if I change my mind, I may move things around, but for now, this is fine with me.
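
If a per-year view is ever needed, it could be derived from the all-teams listing rather than scraped separately. A minimal sketch, assuming each row carries the first and last year the team played. The field names `team_id`, `from_year`, and `to_year` and the rows themselves are placeholders, not real data or column names from the page.

```python
def teams_active_in(teams, year):
    """Return the ids of teams whose playing span includes the given year."""
    return [
        team['team_id']
        for team in teams
        if team['from_year'] <= year <= team['to_year']
    ]


# Placeholder rows; the real table's column names may differ.
ALL_TEAMS = [
    {'team_id': 'team-a', 'from_year': 1900, 'to_year': 2016},
    {'team_id': 'team-b', 'from_year': 1950, 'to_year': 1980},
    {'team_id': 'team-c', 'from_year': 2010, 'to_year': 2016},
]

print(teams_active_in(ALL_TEAMS, 1975))  # ['team-a', 'team-b']
print(teams_active_in(ALL_TEAMS, 2015))  # ['team-a', 'team-c']
```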

mdgoldberg · 2017-01-17T09:47:29Z

+    """ Returns all the college teams from
+    http://www.sports-reference.com/cfb
+
+    :return: A dataframe with


with what?? the suspense is killing me 😃

mdgoldberg · 2017-01-17T09:49:43Z

@@ -61,14 +61,14 @@ def parseTable(table):
        return pd.DataFrame()


Again, these changes should instead be added by merging master into ncaaf (or by git checkout master sportsref/utils.py if necessary).

matthewstirling · 2017-01-25T20:04:26Z

I had to move the code that ignores *, +, and other annotation characters within the parse_table method in utils, in order to get rid of the * notations on players' years (the * designates that the player's bowl stats are included in the stat line).
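
As a rough illustration of this kind of cleanup (not the actual parse_table code), a year cell such as `2015*` can be normalized by stripping annotation characters before further parsing. The function name and the exact set of characters are assumptions.

```python
import re


def strip_annotations(cell):
    """Drop footnote markers such as * and + from a table cell's text."""
    return re.sub(r'[*+]', '', cell).strip()


print(strip_annotations('2015*'))   # 2015
print(strip_annotations('*2014+'))  # 2014
print(strip_annotations('2013'))    # 2013
```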

matthewstirling · 2017-01-25T20:20:31Z

So I think with the merge, this is now looking better. Utils is up-to-date and I changed the tables.

mdgoldberg added 30 commits July 4, 2016 18:23

got rid of default year in nfl

931a6e2

got rid of default year in nba

9b4c32d

updated cache dir name

580ef79

fixed problems from the PFR redesign for nfl.players

6241ddc

fixed problems from the PFR redesign for nfl.teams

857f16b

changed from requests to selenium, added parseInfoTable utils function

dc09951

added ascii encoding in cacheHTML decorator

f9c48fe

fixed problems from the PFR redesign for nfl.boxscores and nfl.pbp

375a0d8

added year param to nfl.teams functions, other small changes

62df006

updated, refactored constants and imports in nfl.finders

2747398

more refactoring, fixed up finders

aa9b4b9

small fixes, lots of updates to nfl.teams

03b253a

encode UTF8 fix in cacheHTML

97ea7f5

updated cacheValid function for PFR

9836890

added hashing to cacheHTML and small change to memoized

949883b

updated import style to relative imports

b54d79f

a bunch of implementations

0dc38f3

added plays functions to nfl.Player

f210c73

implemented nfl.Player splits, some refactoring

26e1f5c

advanced splits in nfl.Player, changed sportsref.nfl to underscore_case

15d7574

made sportsref.nba underscore_case

79c724f

got rid of ncaaf and ncaab

633da79

changing module to (mostly) standardized underscore_case for methods

e572ec3

simple time-based caching for now. cache functions are complicated

7db63bf

added str and repr, changed bsID to boxscoreID, fixed bug in nfl.Team.boxscores

c69fa25

updated README

2652901

Merge branch 'develop' for next release (0.7)

539fee6

updated setup.py for 0.7.0

b790ff3

added self parameter

e99a3ca

version 0.7.1

16b577d

mdgoldberg and others added 18 commits December 20, 2016 23:40

got rid of license

0c77589

Make sure develop branch is updated

cfc871b

fixed 404 error handling

02ddba6

NOW fixed unicode problems (I hope)

dc61114

added conditions to only change column names if flatten=True in parse_table

f8d25df

fixed outdated parts of nba.BoxScore

74dda6d

added sphinx for documentation

febd951

moved doc directory to project root

6983394

changed 'memoized' to 'memoize'

f5ef3de

added class_memoize function for memoizing class instances

e52e07c

edited gitignore, added docs folder

467f3f9

finally got class memoization right, using the mementos package (and six for compatibility)

4ed37ee

got rid of doc folder, changed it to docs earlier

6da7b28

updated docs folder to have source and build as separate folders

4e1b41d

added gh-pages make target

0c7759d

Fixed memoizing issues, adding some dependencies. Added documentation.

6bfa6af

version 0.7.15

f26833f

Updated the players.py file to parse correctly. Note: this branch utils are out of date.

ef0c173

Added methods to get team roster, a list of teams in ncaa, and a list of annual leaders.

4e76953

Edited nfl.Player.name

a2435fc

Even more explicit in identifying the h1 containing the player's name by using [itemprop="name"]

mdgoldberg requested changes Jan 17, 2017

matthewstirling added 3 commits January 23, 2017 13:40

initial merge of master into ncaaf

068af98

fixed broken utils references.

4901ab6

changed the position return to a list.

60134ba

matthewstirling added 2 commits January 25, 2017 13:13

made fixes per the refactor or utils and also the remaining pull request notes.

afd1994

Merge remote-tracking branch 'origin/ncaaf' into ncaaf

838b2da

# Conflicts: # sportsref/ncaaf/players.py
