ATTention: Understanding Authors and Topics in Context of Temporal Evolution

Understanding thematic trends and user roles is an important challenge in information retrieval. In this paper, we present a novel probabilistic model for capturing the evolution of users' interests in terms of the content they generate over time. Our approach, ATTention (a name derived from the analysis of Authors and Topics in the Temporal context), addresses this problem through Bayesian modeling of the relations between authors, latent topics and temporal data. We report experiments on scientific publication datasets, which show that the ATTention model can capture changes in users' interests over time. We also discuss opportunities for using the model in mining and recommendation scenarios.


INTRODUCTION
The internet provides a platform for content sharing, where people share views, participate in discussions, and publish domain-specific technical blogs and research papers, thereby contributing a tremendous amount of online content on different topics. It has been observed that topics discussed in collaborative social networks exhibit spikes (sudden topics that are linked to current events or enjoy limited-time interest) and chatter (recurring, long-term topics), indicating strong correlation. Consider a user who wants to trace a particular topic back through its emergence, growth patterns, popularity and the key players contributing to it over a period of time. Manually analyzing such a large amount of text to find latent topics, capture topic evolution and identify authors' interests is expensive in time and labor. The challenge, therefore, is to provide an approach that helps users filter information and find other users with similar interests. Such an approach can also help to find the most influential authors at different stages of topic evolution, and thus to characterize authors as pioneers, mainstream or laggards in different subject areas.
To tackle these challenges, we exploit probabilistic methods, which in recent years have proved very successful for modeling topics in document collections. One such method is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a Bayesian multinomial mixture model that has become popular in text analysis because it produces interpretable and semantically coherent topics.
Because LDA is highly modular and hierarchical, it can easily be extended, and many extensions of the basic model have been proposed to incorporate document metadata. Examples include the Topics over Time model (Wang et al., 2006), the Group-Topic model, the Author-Topic model (Rosen et al., 2004), and the Linked Topic and Interest Model (Cheng et al., 2008). None of these approaches models documents and authors together with temporal information. In this paper, we propose ATTention (a name derived from the analysis of Authors and Topics in the Temporal context), a model of topic dynamics in social media that connects temporal topic dependency with social actors, thereby providing insight into the evolution of topics over time while capturing author interests for a given time period.
In probabilistic topic modeling, a "topic" is a multinomial distribution over a vocabulary that assigns high probability to a set of words that tend to appear in similar documents. A qualitatively "better" topic is one whose high-probability words are semantically related, so that a human subject can say "these words are about X", where X can be any domain such as business, computer science or chemistry. There is no consensus in the literature on a formal definition of a topic model. We therefore view a "topic model" as a model of the generative process by which documents are created, one that captures word co-occurrence patterns in a document corpus to produce semantically coherent topics. A variety of statistical models have been proposed for topic-based analysis and modeling of text documents.
Among them are the unigram model, the mixture of unigrams model (Nigam et al., 2000), probabilistic latent semantic analysis (pLSA) (Hofmann, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). LDA is a Bayesian multinomial mixture model that has become a state-of-the-art and popular method in text analysis because of its ability to produce interpretable and semantically coherent topics. It uses a Dirichlet distribution to model the distribution of topics for each document, and each word is sampled from the multinomial distribution over words of the topic assigned to it. LDA is a well-defined generative model and generalizes easily to new documents without overfitting.
Many extensions of the basic LDA model incorporate document metadata. The simplest way to do so in a generative topic model is to generate both the words and the metadata simultaneously, given hidden topic variables. In this type of model, each topic has a distribution over words, as in the standard model, as well as a distribution over metadata values. Examples include the Topics over Time model of Wang and McCallum (Wang et al., 2006), Continuous Time Dynamic Topic Models (Wang et al., 2008), the Group-Topic model of Wang, Mohanty and McCallum, the Author-Topic model of Rosen-Zvi, Griffiths, Steyvers and Smyth (Rosen et al., 2004), and the Linked Topic and Interest Model of Cheng and Li (Cheng et al., 2008).
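For reference, the description of LDA above can be written as a joint distribution for a single document. The expression below is a sketch in this paper's symbols (θ for the topic mixture, φ for the topic distributions over words); the index n, the document length N, and the per-word variables z_n and w_n are notation introduced here for illustration only.

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \phi)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid \phi_{z_n}),
\qquad \theta \sim \mathrm{Dir}(\alpha), \quad \phi_k \sim \mathrm{Dir}(\beta)
```

Metadata-aware extensions of the kind listed above typically add one more factor per word or per document for the observed metadata value, conditioned on the same hidden topic variable.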

THE ATTention MODEL
We extend LDA by incorporating document metadata, namely the author and the timestamp of each document. Figure 1 gives graphical representations of LDA and ATTention. In the model, each author has a distribution over topics and each topic has a distribution over words. The ATTention model has three sets of unknown parameters: the author distributions over topics θ, the topic distributions over words φ, and the topic distributions over time ψ. Both θ and φ are multinomial distributions with symmetric Dirichlet priors with hyperparameters α and β, respectively. To avoid discretizing time, we use a continuous per-topic parametric Beta distribution ψ over absolute time values in the generative process, which gives a natural distribution of topics over time. We normalize the timestamps to values between 0 and 1 for parameter estimation. The generative process of the ATTention model shown in Figure 1, which corresponds to the process used in Gibbs sampling for parameter estimation, is as follows.
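· Draw θ ~ Dir(α)
· Draw φ ~ Dir(β)
· For each document d, pick an author from the list of authors a_d and draw a multinomial θ_d from the Dirichlet prior α; then, for each of the N_d words w_di:
  - Draw a topic z_di from multinomial θ_d
  - Draw a word w_di from multinomial φ_{z_di}
  - Draw a timestamp t_di from Beta ψ_{z_di}
In the ATTention model, the three parameters θ, φ and ψ are estimated. Exact inference of the parameters of LDA-type models is intractable; we therefore use Gibbs sampling to perform approximate inference. The model has three latent variables z, x and t. Each triple (z_i, x_i, t_i) of these latent variables is drawn as a block, conditioned on all other variables. We begin with the joint probability of the dataset and, using the chain rule, obtain the conditional probability
$$p(z_i = j,\, x_i = k,\, t_i = l \mid w_i = m,\, z_{-i},\, x_{-i},\, t_{-i},\, w_{-i},\, a_d) \tag{1}$$
where z_i, x_i and t_i are the topic, author and time assigned to word w_i, whereas z_{-i}, x_{-i} and t_{-i} are all other topic, author and time assignments, excluding the current one. w_{-i} represents all other words in the document set, and a_d denotes the observed authors of the document. Learning the joint probabilities of these three latent variables enables us to query the model conditioned on any combination of them using Bayes' rule. For example, given an author and a time, we can find the author's interests in that time period, P(φ | a, t); or, given a topic and a time, we can find the top authors contributing to the topic at that time, P(θ | z, t).
The presented approach can be used for a variety of applications. For example, authors who have high probability for a topic when it starts emerging can be seen as "topic pioneers" who conduct innovative research in that topic. Moreover, active authors who frequently change their topics of interest can be considered "trend setters" in the respective research community. On the other hand, authors who have high probability at the peak of topic activity can be seen as "mainstream" researchers who follow the general trends and interests of the community. Finally, authors who have time-independent profiles with stable topics of interest can be recognized as foundational researchers who act independently of fluctuating trends and popular issues. From the application perspective, this knowledge can be exploited in a variety of ways, e.g. for advanced impact ranking, similarity-based contact recommendation for future collaborations, or better summarization of recent research trends and prediction of their further evolution.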
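The generative process listed above can be illustrated with a short Python sketch using NumPy. It is a minimal sketch under stated assumptions, not necessarily the implementation used for the experiments: the function name `sample_corpus` and its arguments are hypothetical, θ is drawn once per author and a document uses the θ of the author picked for it, and the Beta parameters ψ are supplied as one (a, b) pair per topic.

```python
import numpy as np

def sample_corpus(author_lists, doc_lengths, n_topics, vocab_size,
                  alpha, beta, psi, seed=0):
    """Sample tokens from the generative process (illustrative sketch).

    author_lists: one list of author ids per document (a_d)
    doc_lengths:  number of words N_d in each document
    psi:          array of shape (n_topics, 2) holding Beta parameters per topic
    """
    rng = np.random.default_rng(seed)
    n_authors = 1 + max(a for authors in author_lists for a in authors)

    # Draw theta ~ Dir(alpha): one topic distribution per author (assumption).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors)
    # Draw phi ~ Dir(beta): one word distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    tokens = []  # tuples of (document, author, topic, word, timestamp)
    for d, (authors, n_words) in enumerate(zip(author_lists, doc_lengths)):
        x = int(rng.choice(authors))  # pick an author from the list a_d
        for _ in range(n_words):
            z = int(rng.choice(n_topics, p=theta[x]))    # topic from theta
            w = int(rng.choice(vocab_size, p=phi[z]))    # word from phi_z
            t = float(rng.beta(psi[z, 0], psi[z, 1]))    # normalized timestamp
            tokens.append((d, x, z, w, t))
    return theta, phi, tokens


# Toy example: 2 documents, 3 authors, 4 topics, 20 vocabulary terms.
K = 4
psi = np.array([[2.0, 5.0], [5.0, 2.0], [1.0, 1.0], [3.0, 3.0]])
theta, phi, tokens = sample_corpus(
    author_lists=[[0, 1], [2]], doc_lengths=[6, 4],
    n_topics=K, vocab_size=20, alpha=50 / K, beta=0.01, psi=psi)
```

Because the timestamps are drawn from a Beta distribution, they fall in the unit interval, which matches the normalization of timestamps to values between 0 and 1 described for parameter estimation.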

RESULTS AND DISCUSSION
We apply our approach to a subset of the CiteSeer dataset consisting of the abstracts and titles of research papers published by authors with more than 150 publications from 2001 to 2009. The minimum of 150 publications ensures sufficient text for capturing author interests over time. The dataset is preprocessed to remove stop words and noise by removing highly frequent terms and terms occurring in fewer than 10 documents. We set the number of topics to K = 100 and fix the hyperparameters at α = 50/K and β = 0.01. The results shown are obtained by sampling from the 2000th iteration of the Gibbs sampler. Due to space constraints, we show four topics with their most influential authors and the Beta PDF modeling each topic's distribution over time. Tables 1 to 4 show the top 5 terms and the top 4 authors for each topic. An interesting observation is that activity in the Semantic Web and Database Systems topics is correlated: as one topic gains activity, activity in the other decreases. The results also show that, as the Semantic Web topic started to emerge, influential authors in the Database Systems topic shifted to the Semantic Web topic.
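The preprocessing and hyperparameter settings described above can be sketched as follows. This is a minimal sketch: the function name `preprocess`, the whitespace tokenization, the stop-word list and the `max_doc_ratio` cutoff for "highly frequent" terms are assumptions made for illustration, since the text specifies only the minimum document frequency of 10.

```python
from collections import Counter

def preprocess(docs, stop_words, min_docs=10, max_doc_ratio=0.5):
    """Tokenize documents and drop stop words, rare terms and very frequent terms.

    min_docs follows the text (terms in fewer than 10 documents are removed);
    max_doc_ratio is a hypothetical cutoff for "highly frequent" terms.
    """
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in docs]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    max_docs = max_doc_ratio * len(docs)
    keep = {w for w, df in doc_freq.items() if min_docs <= df <= max_docs}
    return [[w for w in tokens if w in keep] for tokens in tokenized]

# Hyperparameters as stated in the text.
K = 100
alpha = 50 / K   # symmetric Dirichlet prior on the author-topic distributions
beta = 0.01      # symmetric Dirichlet prior on the topic-word distributions

# Toy usage with lowered thresholds so that the tiny corpus is not emptied.
docs = ["the semantic web ontology", "the database query", "the semantic web query"]
cleaned = preprocess(docs, stop_words={"a"}, min_docs=2, max_doc_ratio=0.9)
```

In the toy corpus, "the" occurs in every document and is dropped as highly frequent, while "ontology" and "database" occur in only one document each and are dropped as rare.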

CONCLUSION
In this paper, we have presented a probabilistic method for modeling text, authors and timestamps in a given set of text data, which enables us to identify the temporal activity of topics and to find influential users for the identified topics. Jointly modeling text, author and time, and learning their posterior probabilities, allows us to query the model for any combination of these variables conditioned on the others, in order to find how authors' interests change over time and how activity in topics changes with the emergence of new topics. Results from applying the model to the CiteSeer dataset indicate its applicability to arbitrary document collections that carry author and temporal information, for detecting topic trends, topic evolution and authors' interests.