This project aims to predict personally identifiable information (PII), such as the age and gender of an author, by extracting features from their personal blog texts. We intend to define the state of the art in the field and overcome the shortcomings of prior work on personality recognition tasks. This report describes our progress so far and outlines our future workflow.
Authorship Profiling, PII, Blogosphere
Despite the enormous impact of social media on our daily lives, little is known about the people who create its content. Author profiling addresses this gap by trying to determine the gender, age, native language, or personality type of authors from their published texts. In this study, we focus on building a system that identifies only the gender and age of authors; other authorship details are left to future work. Author profiling is of growing importance. From a marketing viewpoint, for example, companies may want to know the demographics of their target group in order to achieve better market segmentation. From a forensic viewpoint, determining the linguistic profile of the person who wrote a "suspicious text" may provide valuable background information.
This study is targeted towards partial fulfillment of the requirements for CS60057: Speech & Natural Language Processing during Fall 2015, under the guidance of Prof. Pawan Goyal.
The remainder of this paper is organized as follows. Section 2 describes the corpus, Section 3 covers the proposed approach, Section 4 presents the results obtained so far, Section 5 discusses the evaluation measures, Section 6 details the future work plan, and Section 7 concludes by listing the work done by each team member.