Skip to content

Affiliation info: cites, pubs and fields #27

@f-hafner

Description

@f-hafner

I think we can make the tables for paper/citation counts at affiliation-year-field0 as well as the keyword list at the same level in one go:

  1. Create temporary table with unique PaperId, Year, AffiliationId
  2. summarise paper outcomes at AffiliationId-Field0-Year level
  3. summarise paper keywords at AffiliationId-Field0-Year level

To consider/note

  • We will replace the table affiliation_outcomes, which is only at the AffiliationId level, with the output from step 2 above. I don't know right now where we use this old table as an input, and we should check
  • There are two ways of assigning papers to "departments" at institutions: from the author's main field, or from the paper's main field. The latter is probably more accurate
  • The output from step 3 contains additionally the columns FieldOfStudyId and a Score. We can use the next lower integer of the score as a frequency weight to calculate tf-idf (but not sure how to exactly implement frequency weights)
  • Not sure how to add the count of researcher in this query

Here are the queries, which we can use to replace the query in affiliation_outcomes.py.

-- ## 1. create temp table 
CREATE TEMP TABLE paper_affiliation_year AS 
SELECT DISTINCT AffiliationId, Year, PaperId
FROM (
    SELECT a.AuthorId, a.AffiliationId, a.Year, b.Paperid
    FROM AuthorAffiliation a -- ## if an author has 2 main affiliations in the same year, we count their papers at both institutions
    INNER JOIN (
        SELECT PaperId, AuthorId, Year
        FROM PaperAuthorUnique
        INNER JOIN (
            SELECT PaperId, Year
            FROM Papers
        ) USING(PaperId)
    ) b
    ON a.AuthorId=b.AuthorId AND a.Year=b.Year
    -- reduces size of the data set 
    INNER JOIN (
        SELECT PaperId
        FROM paper_outcomes
    ) USING(PaperId)
)

CREATE INDEX ON idx_paper_temp ON paper_affiliation_year (PaperId)

-- ## 2. create table with citation/paper counts
CREATE TABLE affiliation_outcomes AS  -- this is already defined, where? can we replace it? where do we use it?? 
SELECT COUNT(PaperId) AS PaperCount
    , SUM(CitationCount_y10) AS CitationCount_y10
    , AffiliationId
    , Year 
    , Field0
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, CitationCount_y10 
    FROM paper_outcomes 
) USING(PaperId)
INNER JOIN ( -- each field is unique per paper, so it is ok to join only here 
    SELECT PaperId, Field0 
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year

CREATE UNIQUE ON idx_affo_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0)


-- ## 3. table with keywords
CREATE TABLE affiliation_fields AS 
SELECT SUM(Score) AS Score
    , FieldOfStudyId
    , AffiliationId 
    , Field0
    , Year 
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, FieldOfStudyId, Score
    FROM PaperFieldsOfStudy 
    INNER JOIN (
        SELECT FieldOfStudyId 
        FROM FieldsOfStudy 
        WHERE level < 2 -- choose appropriate level 
    ) USING(FieldOfStudyId)
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year

CREATE UNIQUE ON idx_afff_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0)

Metadata

Metadata

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions