We used two approaches for applying the LDA model to our dataset.
LingPipe is a Java library that provides many of the functions required in NLP. We used LingPipe in the following manner.
We used symmetric KL divergence to calculate the similarity of a paper with respect to each of the seed papers. Symmetric KL divergence compares two probability distributions.
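As a sketch of the measure (not LingPipe's API), the symmetrization used here is assumed to be the sum of the two directed divergences; some conventions use the mean instead, which differs only by a constant factor and does not affect the ranking:

```python
import math

def kl_divergence(p, q):
    """Directed KL divergence D(P || Q) for discrete distributions (lists of probabilities)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence: D(P || Q) + D(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Identical distributions score 0, and the score grows as the distributions diverge, which is why the minimum over seed papers (below) corresponds to maximum similarity.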
For each paper in the DBLP+Citation dataset, we do the following:
- The bag of words for each paper again consists of the paper's title, its abstract, and the titles of its reference papers, if available.
- We apply stop-word removal and stemming to filter the text, which is then tokenized and fed into the LDA model. Tokenization ensures that the input to the Bayes estimator consists only of words already present in the LDA model.
- If no tokens are generated from the paper, it is directly assigned a very high KL divergence value.
- A Bayes estimate of the topic distribution of the paper is made. A topic may be assigned zero probability, which would make the KL divergence of the corresponding paper infinite. To avoid this, if the probability of a topic is estimated to be zero, we assign it a very low probability of 10^-7.
- The similarity of the paper is then computed by calculating the KL divergence with respect to the topic distribution of each seed paper.
- We assign the paper the minimum score (maximum similarity) among the similarity scores obtained from comparisons with all seed papers.
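The scoring steps above can be sketched as follows. This is an illustrative reconstruction, not LingPipe code; the epsilon floor and the sentinel score for token-less papers follow the description above, the distributions are not renormalized after smoothing, and `NO_TOKEN_SCORE` is a hypothetical constant standing in for "a very high KL divergence value":

```python
import math

EPS = 1e-7            # floor assigned to zero-probability topics
NO_TOKEN_SCORE = 1e9  # sentinel for papers that produce no tokens

def kl(p, q):
    """Directed KL divergence D(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence: D(P || Q) + D(Q || P)."""
    return kl(p, q) + kl(q, p)

def smooth(dist):
    """Replace zero topic probabilities with a small epsilon to keep KL finite."""
    return [max(pi, EPS) for pi in dist]

def score_paper(paper_topics, seed_topic_dists):
    """Minimum symmetric KL divergence of a paper against all seed papers."""
    if not paper_topics:  # no tokens survived filtering
        return NO_TOKEN_SCORE
    p = smooth(paper_topics)
    return min(symmetric_kl(p, smooth(s)) for s in seed_topic_dists)
```

A paper whose estimated topic distribution matches any one seed paper closely receives a score near zero and therefore sorts to the top of the list.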
After this process, we sort the list of papers by the calculated KL divergence, and a human evaluation picks out the most promising papers based on their titles.
Mahout is a machine learning library built on top of the Hadoop cloud-computing platform.
In this approach, we apply LDA to the DBLP dataset instead of the seed set. The steps are as follows:
- To make the dataset manageable, we first run it through a keyword filter that discards all papers that do not contain keywords related to computational sustainability.
- We then run the dataset through a stop-word removal filter.
- We run LDA on this set, augmented with the seed paper set, to obtain a unified topic distribution model over the entire collection.
- We then use three similarity measures to generate three different lists, each sorted by its respective measure.
- Finally, a human evaluation picks out the most promising papers based on their titles.
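The two preprocessing filters in this pipeline might be sketched as below. The stop-word set and keyword set here are small illustrative placeholders, not the actual lists used, and papers are represented simply as strings:

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative subset
KEYWORDS = {"sustainability", "energy", "conservation"}          # hypothetical keyword list

def keyword_filter(papers):
    """Keep only papers whose text mentions at least one target keyword."""
    return [p for p in papers if KEYWORDS & set(p.lower().split())]

def remove_stop_words(text):
    """Drop stop words from a paper's text before it is passed to LDA."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
```

Running the keyword filter first keeps the corpus small enough that the subsequent LDA run over the filtered set plus the seed papers remains tractable.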