Sort daily papers by learning users' topic preferences

<p><span class="drop-cap">I</span>t's been a while since we at easytechgreen first wanted to start working with Artificial Intelligence. In the middle of last year (2019) we got in touch with a friend from college.</p> <p><a href="https://easytechgreen.com/ezequiel-alvarez">Sequi</a>, our friend and now easytechgreen's AI R&amp;D Director, is a particle physicist who has been working with neural networks and machine learning techniques for a while. He proposed that we write a joint paper, which was released in February 2020 and was written by Ezequiel Alvarez, Federico Lamagna, Cesar Miquel and Manuel Szewc. </p> <blockquote> <p>The idea was to apply a <a href="https://dl.acm.org/doi/10.1145/2133806.2133826">Topic Model</a> to scientific literature and use its outcome to create a new tool for sorting papers according to each user's topic preferences.</p> </blockquote> <p>You can read part of the paper and its main ideas below. </p> <p>A variety of algorithms have been applied to scientific papers and scientific literature. 
Among them we can mention <a href="http://arxitics.com/">Arxitics.com</a>, which lets users share votes, reviews and comments to provide an enhanced interface for reading and discussing the Arxiv; <a href="https://scirate.com/">Scirate.com</a>, which sorts papers according to ratings from the community; <a href="https://drops.dagstuhl.de/opus/volltexte/2015/5477/">Mining scientific articles powered by machine learning techniques</a>, which mines scientific articles to recommend them to users based on abstract content, using a personal collection of references; <strong>CiteULike</strong>, which allowed users to share preferences on scientific papers; <a href="https://www.mendeley.com/">Mendeley.com</a>, which is a complete desktop service for sorting and archiving bibliography and generating bibliography for given articles, among other services; and the work of <a href="https://dl.acm.org/profile/81100221035">Chong Wang</a> and <a href="https://dl.acm.org/profile/81100028344">David Meir Blei</a>, which uses <a href="https://dl.acm.org/doi/10.1145/2020408.2020480">collaborative filtering within a framework of topic modeling</a> to recommend articles to users. Some of these and other cases use rating algorithms, hand-made functions, and/or Machine Learning techniques to provide scientists with better access to bibliography. However, to the best of the authors' knowledge, there is not yet an available algorithm that learns from each user's personal preferences and sorts the scientific papers accordingly for each user, which is the main goal that drives the content of this work.</p> <p>Our starting point for tackling this problem is that in many cases the creation of a new scientific paper can be modeled as putting together scientific knowledge from different topics into a new problem. This understanding of a scientific paper corresponds to modeling the documents within a corpus so that each document is a specific mixture of topics in given proportions. 
The <strong>Latent Dirichlet Allocation</strong> (<a href="https://www.jmlr.org/papers/v3/blei03a">LDA</a>) algorithm is an unsupervised <strong>Machine Learning</strong> framework for extracting the topics and their weights from this kind of corpus. In addition, and solely for the purposes of this article, we may also model the interests of scientists through these same topics: each scientist has a weight on each topic that represents their interest in it. Given this model, the <strong>LDA</strong> algorithm is a natural basis for classifying papers and Arxiv readers.</p> <h3>Topics Model in Latent Dirichlet Allocation (LDA)</h3> <p>Topic Modeling, or Topics Model, is a general framework of statistical models that aims to infer abstract “topics” from a given corpus of unlabeled documents. These abstract topics can be thought of as the generators of the corpus and can thus be assumed to encode all the necessary information about the corpus. The topics can be used to label the documents according to some criteria, or to perform a dimensional reduction from a large corpus to a relatively small number of topics over which to conduct different operations, in a similar way to other unsupervised techniques such as Principal Component Analysis (PCA).</p> <p>While PCA works with correlations to find the principal components that encode the variance of the data, topic modeling aims to find clusters of words with semantic meaning (although this is not guaranteed, as we are using abstract topics). With these criteria in mind, we focus on probabilistic topic modeling, where we assume a generative model that encodes the semantic structure of the corpus. These generative models can capture inter- and intra-document statistical structure better than non-generative models such as Term Frequency times Inverse Document Frequency (TF-IDF) and are thus better suited for semantic clustering. 
From the generative model, we can derive a posterior using Bayes' theorem which, although intractable, can be approximated by algorithms such as Gibbs Sampling or Variational Bayes. The estimated posterior then provides the semantic information from which we can obtain the topic distributions over the vocabulary and each document's distribution over the topics. One example of this, and the one we focus on in this work, is the <a href="https://www.jmlr.org/papers/v3/blei03a">Latent Dirichlet Allocation</a> (LDA).</p> <p>In <em>LDA</em> we assume a mixed membership model with a fixed number <em>K</em> of topics, in which all of the <em>D</em> documents are composed of every topic and every topic contains all the <em>N</em> words in the vocabulary. To account for this, we assume each document <em>d</em> has a multinomial probability distribution θ<sub>d</sub> over the topic space and each topic κ has a multinomial distribution β<sub>κ</sub> over the vocabulary. In turn, these probability distributions are sampled from two Dirichlet distributions with hyperparameters α and η respectively. In this context, θ, β, α and η play the role of latent variables which generate the corpus but are not directly observable. 
The Dirichlet distribution is the conjugate prior of the multinomial distribution, which allows Bayesian inference to preserve the form of the distributions, albeit with updated probabilities assigned to each category.</p> <center><h3>Read the full paper  <a href="https://share.hsforms.com/1Wi68U7tEQASudFgQZKRq3w1fvjv" target="_blank"><button type="button">Download Paper!</button></a></h3></center> <h3>IArxiv.org</h3> <p>To allow users to navigate the corpus of papers from <a href="http://Arxiv.org">Arxiv.org</a> according to their preferences, we built an application where they can register and view the listings from four of the Arxiv categories: astro-ph, gr-qc, hep-ph and hep-th.</p> <img alt="iarxiv.org screenshot" data-entity-type="file" data-entity-uuid="c2f14849-a6e2-4fe5-aba4-74878e5caf3b" src="/sites/default/files/inline-images/Screen%20Shot%202020-10-17%20at%2000.29.19.png" class="align-center" /><p>We developed this application using <a href="https://reactjs.org/">ReactJS</a> for the frontend and a backend server written in <a href="https://golang.org/">Golang</a>. All the information is stored in a PostgreSQL database that is accessed through the Golang backend. On a nightly basis, a Python process reads all the new papers from the Arxiv and adds them to the database. The same process computes the paper vectors. The database model is relatively straightforward: it has several tables to store users, papers, authors, paper vectors and user vectors. The information is normalized to reduce redundancy. </p> <center><h3>Want to Try it? <a href="https://iarxiv.org/" target="_blank"><button type="button">Go to IArxiv</button></a></h3></center>
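<p>As a rough illustration of how paper vectors and user vectors over the same topics could drive the ordering of a daily listing, the following Python sketch scores each paper by the dot product between a user vector and that paper's topic vector, then sorts by score. The scoring rule, the function name <code>sort_papers</code>, the paper identifiers and all the numbers are assumptions made for this example; they are not taken from the paper or from the IArxiv code.</p>

```python
from typing import Dict, List, Tuple


def sort_papers(
    user_vector: List[float],
    paper_vectors: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Return (paper_id, score) pairs, highest score first.

    Assumption: the score is the dot product between the user's
    topic weights and the paper's topic proportions.
    """
    scored = []
    for paper_id, topics in paper_vectors.items():
        if len(topics) != len(user_vector):
            raise ValueError(f"topic count mismatch for {paper_id}")
        score = sum(u * t for u, t in zip(user_vector, topics))
        scored.append((paper_id, score))
    # Sort by score, most relevant paper first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical data: three topics, one user, three papers.
user = [0.7, 0.2, 0.1]
papers = {
    "paper-a": [0.1, 0.8, 0.1],
    "paper-b": [0.9, 0.05, 0.05],
    "paper-c": [0.3, 0.3, 0.4],
}

ranking = sort_papers(user, papers)
for paper_id, score in ranking:
    print(f"{paper_id}: {score:.3f}")
```

<p>With these made-up numbers, the paper dominated by the user's preferred topic comes first. A real system would also need a way to learn the user vector from the user's feedback, which this sketch does not cover.</p>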