Dec. 19th, 2018
During my junior year, I took Dr. Anupam Basu's Introduction to Digital Humanities course, in which we discussed topic models. In our class, we explored a tool called Serendip, which visualizes topics as a matrix of relationships between texts in the corpora and topics. Then, this semester, I took Dr. Neumann's Analysis of Network Data class, and it occurred to me that Serendip's matrix presentation of the topic model is visually similar to an adjacency matrix.
I wanted to take a distinctly different approach than Serendip, though. While Serendip focuses on drilling down into specific relationships and presenting a large volume of data at once, I wanted to focus more on exploring the topic-story relationship in a more approachable, intuitive manner.
As I'm implementing a different style of topic modeling (LDA modeling, this time), I'm becoming more and more curious about how one can tune topic models for fast computation and reliable convergence.
My existing implementation of a topic-model algorithm is strictly random and heuristic. There is no cleverness in the implementation -- it simply assigns words to random topics and calls it a day. LDA randomly distributes words between topics with duplicates, then iterates over the data set, processing it an arbitrary number of times. Ideally, by the time the modeling is over, the topics have converged towards unique terms and the model is stable.
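To make that loop concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling. This assumes a Gibbs-style sampler, which may not match the implementation in my project; the function name `ldaGibbs`, its parameters, and the default values for `alpha` and `beta` are all hypothetical and chosen only for illustration.

```javascript
// Illustrative collapsed Gibbs sampler for LDA (a sketch, not the project's code).
// docs: array of documents, each an array of word strings.
// K: number of topics; alpha, beta: symmetric Dirichlet priors (assumed values).
function ldaGibbs(docs, K, iterations = 50, alpha = 0.1, beta = 0.01, random = Math.random) {
  // Build a vocabulary index so words can be counted by integer id.
  const vocab = new Map();
  const ids = docs.map(doc => doc.map(w => {
    if (!vocab.has(w)) vocab.set(w, vocab.size);
    return vocab.get(w);
  }));
  const V = vocab.size;

  // Count tables: document-topic, topic-word, and per-topic totals.
  const docTopic = docs.map(() => new Array(K).fill(0));
  const topicWord = Array.from({ length: K }, () => new Array(V).fill(0));
  const topicTotal = new Array(K).fill(0);

  // Step 1: assign every token to a random topic.
  const z = ids.map((doc, d) => doc.map(w => {
    const k = Math.floor(random() * K);
    docTopic[d][k]++; topicWord[k][w]++; topicTotal[k]++;
    return k;
  }));

  // Step 2: repeatedly resample each token's topic given all other assignments.
  for (let it = 0; it < iterations; it++) {
    for (let d = 0; d < ids.length; d++) {
      for (let n = 0; n < ids[d].length; n++) {
        const w = ids[d][n];
        let k = z[d][n];
        // Remove the token's current assignment from the counts.
        docTopic[d][k]--; topicWord[k][w]--; topicTotal[k]--;

        // Unnormalized conditional probability of each topic for this token.
        const p = new Array(K);
        let sum = 0;
        for (let t = 0; t < K; t++) {
          p[t] = (docTopic[d][t] + alpha) * (topicWord[t][w] + beta) / (topicTotal[t] + V * beta);
          sum += p[t];
        }

        // Draw a new topic in proportion to those probabilities.
        let r = random() * sum;
        k = 0;
        while (k < K - 1 && r >= p[k]) { r -= p[k]; k++; }

        // Record the new assignment.
        z[d][n] = k;
        docTopic[d][k]++; topicWord[k][w]++; topicTotal[k]++;
      }
    }
  }

  // Smoothed document-topic weights; each row sums to 1.
  const theta = docTopic.map((row, d) =>
    row.map(c => (c + alpha) / (ids[d].length + K * alpha)));
  return { theta, topicWord, vocab: [...vocab.keys()] };
}
```

The number of iterations is the "arbitrary number of times" mentioned above: too few passes and the topic assignments are still close to their random starting point, which is one way a model can end up looking unstable.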
That process, however, raises the question of what an unstable model would look like. I'm interested in exploring that with the implementation of LDA topic modeling that I've pulled into my project.
For my initial corpus, I arbitrarily chose the twenty most popular texts on Project Gutenberg, selected from this page. Unfortunately, JavaScript isn't quite as efficient as the C++ and Python topic modeling implementations that I'm familiar with, and it's turned out to be extraordinarily slow to process so many records with the LDA algorithm.
To facilitate faster development, I actually took advantage of the flexibility of my approach and turned to analyzing the five most recent State of the Union addresses. However, this decision turned out to be more informative as to the thesis of my project than I had initially anticipated.
Figure 1: A prototype colour-coded network graph of State of the Union addresses.
As I began working on my prototype with the more politically-oriented data set, I had something of a small epiphany. In order to stay true to what I intuited the purpose of my project to be, I had to actively dissuade myself from delving into enhancing the model with functionality specific to that data set. For example, colour-coding nodes based on their party affiliation, while interesting in the context of a politically informed data set, would be nigh-completely irrelevant for a data set like the Gutenberg Top 20 that I started and finished with.
This intuitive discouragement, however, made little sense in the context of the assignment that we, as a class, had been given. This disjunction finally crystallized the purpose of my application. Rather than creating a bespoke visualization for a specific set of data, my intention was to create a tool to visualize and explore topic models themselves.
For further development of the data set, it may be worth swapping out the Top-20 texts for something smaller. Song lyrics come to mind, as do public addresses, such as presidential addresses and the like. However, given that I'd purely be swapping one set of raw text files for another, the Top-20 corpus is more than sufficient for project development. The flexibility of the application to accommodate many different kinds of input data is also one of the great strengths of topic modeling as a means of distant reading.
As I've mentioned, I've used topic modeling tools in the past, specifically as part of Dr. Anupam Basu's Introduction to Digital Humanities class as a means of "distant reading". Conveniently enough, though, familiarity with the algorithms involved is all that's necessary for analysis of a topic model. The rest of the data analysis is tied up exclusively with displaying it as whatever sort of visualization one desires. Topic models are one of those pesky data structures that are quite simple in theory (it's just a list of weights between stories and topics, after all) and yet also exceptionally difficult to visualize in one's head.
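As a rough illustration of how simple that structure is, the sketch below turns a matrix of story-topic weights into the node and link lists that a network visualization typically consumes. The function name `buildGraph`, the shape of `weights` (one row per document, one column per topic), and the `minWeight` cutoff are assumptions made for this example rather than the application's actual data model.

```javascript
// Convert a document-topic weight matrix into nodes and links for a network graph.
// weights[d][t] is the weight between document d and topic t (hypothetical shape).
function buildGraph(docTitles, topicLabels, weights, minWeight = 0) {
  // Topics and documents both become nodes, tagged by kind.
  const nodes = [
    ...topicLabels.map((label, t) => ({ id: `topic-${t}`, kind: "topic", label })),
    ...docTitles.map((title, d) => ({ id: `doc-${d}`, kind: "document", label: title })),
  ];

  // Every weight above the cutoff becomes a document-to-topic link.
  const links = [];
  weights.forEach((row, d) => {
    row.forEach((weight, t) => {
      if (weight > minWeight) {
        links.push({ source: `doc-${d}`, target: `topic-${t}`, weight });
      }
    });
  });
  return { nodes, links };
}
```

With d3-force, links in this shape could be passed to something like `d3.forceLink(links).id(n => n.id).strength(l => l.weight)`, so that heavier weights pull a story closer to its topic; whether a raw weight makes a good link strength is a tuning question this sketch does not settle.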
Fortunately, the data itself requires very little pre-processing. All that was required for the Gutenberg Top 20 was the removal of the Project Gutenberg header and footer from the documents after they were retrieved. The State of the Union addresses required even less cleanup. Again, this versatility is one of the strong points of my application's design. The data set can easily be expanded or altered entirely with a single-line modification to the application and minimal manipulation of the initial texts.
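The header and footer removal could be automated along the following lines. This is only a sketch: it assumes the texts carry the usual "*** START OF THIS PROJECT GUTENBERG EBOOK ..." and "*** END OF THIS PROJECT GUTENBERG EBOOK ..." marker lines, whose exact wording varies between files, and the function name `stripGutenbergBoilerplate` is hypothetical.

```javascript
// Strip the Project Gutenberg header and footer from a plain-text book.
// Assumes "*** START OF ..." and "*** END OF ..." marker lines; wording varies by file.
function stripGutenbergBoilerplate(text) {
  const startMarker = /\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*\n/i;
  const endMarker = /\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK/i;

  let body = text;

  // Drop everything up to and including the start marker line, if present.
  const start = body.match(startMarker);
  if (start) body = body.slice(start.index + start[0].length);

  // Drop everything from the end marker onward, if present.
  const end = body.match(endMarker);
  if (end) body = body.slice(0, end.index);

  return body.trim();
}
```

If neither marker is found, the text is returned unchanged apart from trimming, which suits sources such as the State of the Union addresses that have no such wrapper.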
The original impetus behind this assignment was to create a visual model to frame the relationship between documents in a corpus and the topics extrapolated from the corpus. When I was at the point of framing my project proposal, the motivation towards displaying a topic model as a force-network was simply based on the notion that different documents, or stories, have different affinities towards different topics.
The unrefined notion of the force-network for a topic model was immediately appealing to me, because the force-network facilitated the spatial expression of the "pull" that each topic has on each story. It's exceedingly rare to have a topic model wherein a topic has no relation to a story, but topic models can often have a story that's drawn particularly strongly to an individual topic.
Figure 2: The initial force-network layout with absolute equivalence between documents and topics.
With these characteristics in mind, I continued with the development of the force-network model. My initial intent was to create a model that established a visual equivalence between the topics and the documents as elements of the network. This equalizing intent was based on the notion that both data structures are fundamentally ordered sets of words, though the documents in the data set are created by humans and the topics are the results of a data analysis algorithm.
However, as the design progressed, I quickly found that framing stories and topics as visual equals diluted the meaning of the topic model to the point of being nigh-useless. Without a visual context for the information in the network, the network ceased to have any meaning.
My first method of addressing the lack of information was to encode the roles of the nodes in the graph with both colour and size. Somewhat arbitrarily, topics became green and large, whereas documents remained blue and small.
Figure 3: The force-network with topics configured as static "force-generators".
As the network model matured, I developed the concept of having force-generator nodes and force-receiver nodes as a means of re-asserting the directed relationship between documents and topics. Force-generator nodes (topics) were assigned a fixed position based on their order in the list of topics, and force-receiver nodes were set in the middle of the graph. d3's force-network simulation then handled the distribution of the force-receivers and allowed for repositioning of the topic nodes.
This framework of force-generator nodes allowed for a clearer expression of hierarchy in the display of the topic model, and additionally forced the documents into a somewhat logical spatial distribution in relation to the topics that they're most attracted to.
Ultimately, however, the force-generator model turned out to be a step too far towards over-constraining the force-network model. While documents were arranged according to the topic that they most closely corresponded with, fixing the location of topics in the network prevented the topics from distributing themselves towards the documents that they most closely correspond with. While this seems like a natural and somewhat obvious consequence of fixing the topics, preventing the topics from clustering around stories robbed the force-network of crucial spatial data.
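A minimal sketch of the force-generator placement might look like the following. It assumes the fixed positions form a ring ordered by topic index, and it relies on d3-force's convention that a node with `fx` and `fy` set is held at that position during the simulation. The function name `placeTopicsOnRing` and its parameters are hypothetical.

```javascript
// Place topic nodes evenly on a ring around (cx, cy), ordered by list position.
// d3-force holds a node in place when its fx/fy properties are set,
// and lets it move freely when they are null.
function placeTopicsOnRing(topics, cx, cy, radius, pin = true) {
  topics.forEach((node, i) => {
    const angle = (2 * Math.PI * i) / topics.length;
    node.x = cx + radius * Math.cos(angle);
    node.y = cy + radius * Math.sin(angle);
    if (pin) {
      // Force-generator behaviour: the topic stays exactly where it was placed.
      node.fx = node.x;
      node.fy = node.y;
    } else {
      // Only seed the starting position; the simulation may move the topic.
      node.fx = null;
      node.fy = null;
    }
  });
  return topics;
}
```

Calling it with `pin` set to `false` keeps the same starting layout but leaves the topics free to drift once the simulation runs.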
Figure 4: The force-network with topics visually labeled but without the force-generator model.
When topics are allowed to shift and pull closer to the documents that they most closely correspond with, it gives the network a spatial distribution of information. When the topics are fixed, there is no meaningful data encoded in the spatial shape of the network. Rather, all of the important data is encoded in the x-y coordinates of the documents.
Now, with the movable topics, the spatial distribution of colour and shapes within the network conveys important information about the topic model. Of particular interest in topic modelling is the case when a topic is generated that is characteristic of only a certain subset of the corpus and loosely connected to the rest of the documents. In the force-generator model, the only impact of this node is that it shifts the distribution of a subset of the documents slightly, perhaps imperceptibly. However, in the model where topics are allowed to move freely, topics that are only loosely connected to the rest of the graph naturally drift away from it and cluster with the subset of the corpus that they correspond with. This behaviour is sufficiently different from the normal behaviour of a more general topic that it serves as a visual indicator of important information about the model.
Thus, I settled on the final iteration of the design.
The LDA topic modelling algorithms are derived from open-source implementations in JavaScript that were refactored and re-implemented for this project. They are incorporated into the LDA class of the application. Because these algorithms run in the client browser, they're quite slow compared to better optimized implementations in faster languages. However, this allows the users of the application to regenerate different topic models and explore the behaviour of LDA topic modelling at their own leisure.
Modifying and implementing LDA turned out to be a bigger challenge than I was expecting. Unfortunately, because of this, work on the topic modelling algorithm consumed more of the project than I had budgeted for at the outset.
Ultimately, the following features were implemented:
Notably missing from the must-have features is a details pane to display more information about topics or documents when they are selected. The functionality of this requirement was split into two features: tooltips and a "Selected" sub-window. With another window, the interface to the application became too cluttered to use efficiently. The tooltips in the project deliver the same information that would have been contained in a Details window. For topics, the tooltips display the ten most important words in the topic; for documents, they display the title and author of the document.
Furthermore, the alternate content view explicitly for stories made the details pane largely redundant, as there was no information that appeared in the details of the story that did not also appear in the story cards.
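The tooltip contents described above can be sketched as follows. The node shape used here (a `kind` field, a `words` map from word to weight for topics, and `title` and `author` fields for documents) is an assumption made for illustration, as are the function names `topWords` and `tooltipText`.

```javascript
// Return the n highest-weighted words of a topic.
// wordWeights: object mapping word -> weight (hypothetical shape).
function topWords(wordWeights, n = 10) {
  return Object.entries(wordWeights)
    .sort((a, b) => b[1] - a[1]) // heaviest words first
    .slice(0, n)
    .map(([word]) => word);
}

// Build the tooltip string for a node: top words for topics,
// title and author for documents.
function tooltipText(node) {
  return node.kind === "topic"
    ? topWords(node.words).join(", ")
    : `${node.title} by ${node.author}`;
}
```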
In the bonus features, the preset configuration for the network inserts the topics in a ring around the stories, then allows the force-simulation to do the rest of the work. The initial distribution of the topics is most evident in Figure 3, where the ring distribution is visible thanks to the static position of the topics in the network.
Otherwise, all the features listed are implemented in straightforward ways.
While this project has accomplished the desired goal of the proposal, in-browser exploration and visualization of LDA topic models, there remains ample opportunity to extend the project. The existing code base could easily be expanded to accommodate multiple different kinds of topic models. With a real back-end, users could upload and analyze their own corpora.
Another relatively low-burden way to enhance and strengthen the tool would be to expose tunable parameters for the topic model generation, as well as different kinds of topic models. One might consider arbitrarily placing words in certain topics, for example, as another topic model engine to implement.
Ultimately, this project suffered most from a lack of hours spent working. The proposal would have been a reasonable task for a group of three working part-time on the project, or the full focus of a single group member. However, with the kind of split focus that the end of the semester demands, the project ultimately didn't receive the attention necessary to take it from a functional state to a polished one. Rough edges abound.
I think, however, that the tool is an effective means of visualizing the abstract data embedded in a topic model, specifically with regard to visually communicating the shape of the topic model in relation to the corpus.
This project was certainly a success in creating a visualization of LDA topic models as a sensible force-network in d3.
All things considered, I'm proud of the tool that I created, despite the fact that it's a bit of an ugly duckling. A data visualization only a mother could love. :)