Demographic segmentation, a pillar of cost-efficient marketing, has assumed a new order of complexity with the rise of social media and Big Data. Reddit is of particular interest given its rapidly growing user base, topical community structure, and close temporal alignment with pop culture. However, Reddit’s strict stance on user privacy and its minimalist platform design have made it far harder to analyze than competitors such as Facebook and Twitter. Consequently, a large portion of online discussion and interaction has been largely ignored by the commercial sector.
To overcome Reddit’s design and privacy barriers, we constructed a new lexicon-based demographic inference tool. In this pilot study, we demonstrate the ability to classify a Reddit user’s gender based solely on publicly available comment history, achieving approximately 85% accuracy with logistic regression. The model is trained on a data set of more than 250,000 comments from 11,000 unique users. Additionally, we showcase a proprietary natural language processing tool, LegendaryTokenizer, and characterize the predictive value of several textual features. Specifically, we highlight negation and n-gram usage as significant features in characterizing gender-specific lexicon. Finally, we apply this framework to a subset of Reddit’s topical communities (subreddits) in a case study on gender differences across those communities.
To our knowledge, this is the first approach to incorporate Reddit text data and demonstrate its value, despite the platform’s claimed anonymity and API limitations. Continued development of this method offers substantial promise for understanding online behavior and analyzing consumer trends.
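As a rough illustration of the negation and n-gram features mentioned above, the following minimal Python sketch marks tokens that fall within the scope of a negation cue and then counts unigrams and bigrams. It is not LegendaryTokenizer, whose internals are not described here; the cue list, the `NOT_` prefix convention, the clause-boundary rule, and the function names `tokenize`, `mark_negation`, and `ngram_features` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical negation cues; the cue list actually used in the study is not given.
NEGATION_CUES = {"not", "no", "never", "cannot"}
# Assumption: a negation's scope ends at the next punctuation mark.
CLAUSE_END = re.compile(r"[.,;:!?]")


def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation marks."""
    return re.findall(r"[a-z']+|[.,;:!?]", text.lower())


def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with NOT_ until the clause ends."""
    marked = []
    negated = False
    for tok in tokens:
        if CLAUSE_END.fullmatch(tok):
            negated = False
            continue  # punctuation only delimits scope; it is not kept as a feature
        if tok in NEGATION_CUES or tok.endswith("n't"):
            negated = True
            marked.append(tok)
            continue
        marked.append("NOT_" + tok if negated else tok)
    return marked


def ngram_features(text, max_n=2):
    """Count all n-grams (n = 1..max_n) over the negation-marked tokens."""
    tokens = mark_negation(tokenize(text))
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


if __name__ == "__main__":
    print(ngram_features("I do not like this. It is fine."))
```

Count vectors of this kind, aggregated over a user's comment history, could then serve as input to a logistic regression classifier; that step is omitted here because the training configuration is not specified.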