Jing-Mao Ho
One of the research questions I address in my dissertation project is: What does the term "statistics" mean? I approach this question from a number of perspectives, each corresponding to a context in which the meaning of the term is created, appropriated, and reproduced. In this document, I focus on the political context, because official statistics are ubiquitous in society and national governments are key actors that keep and disseminate statistical numbers. To understand the meaning of the term in this context, I collected the official documents in which 192 countries' national statistical offices (NSOs) introduce their missions and tasks. Methodologically, I chose an approach that is commonly used in natural language processing: Latent Dirichlet Allocation (LDA) topic models.
import nltk
# Read the corpus of NSO mission statements into a single string
with open("E:\\Topic\\organizations.txt") as f:
    raw = f.read()
Next, I tokenize the document; tokenization is one of the most basic steps in natural language processing.
# Tokenize the raw text and wrap the tokens in an NLTK Text object
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text
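As a quick sanity check, one can also look at how large the tokenized corpus is and how many distinct word types it contains:
# Total number of tokens and number of distinct word types
len(tokens)
len(set(tokens))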
NLTK provides a very useful method that lists all the lines containing a given word. In my research, I am particularly interested in the word "census."
text.concordance('census')
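The Text object offers a few related methods as well; for example, one can count how many times "census" occurs, or ask which other words tend to appear in similar contexts:
# Number of occurrences of "census", and words used in similar contexts
text.count('census')
text.similar('census')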
To extract the information I need from the tokenized data, I have to create a feature matrix. To do so, one needs a Python package: scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
Next, I used the CountVectorizer class to turn the tokens into a matrix of token counts.
# Build the vocabulary over the tokens with a CountVectorizer
vectorize = CountVectorizer()
vectorize.fit(tokens)
One can also get all the features' names.
features = vectorize.get_feature_names_out()  # older scikit-learn versions used get_feature_names()
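For instance, one can check how many features were extracted and preview the first few (feature names are returned in alphabetical order):
# Number of features and the first ten feature names
len(features)
features[:10]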
Very common words that carry little information, such as "the" and "of," are usually filtered out; these are called "stop words." The scikit-learn package has a built-in list of common English stop words, so I removed them from my data.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# min_df and max_df drop terms that occur too rarely or too often;
# stop_words="english" removes scikit-learn's built-in English stop words
vectorize = CountVectorizer(min_df=100, max_df=10000, stop_words="english").fit(tokens)
tokens_reduced = vectorize.transform(tokens)
print("tokens without stop words:\n{}".format(repr(tokens_reduced)))
As a result, I got a feature matrix based on the data with the stop words filtered out.
tokens_reduced
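For readers curious about what exactly gets removed, the built-in stop-word list can be inspected directly; it is a frozen set of a few hundred very common English words:
# Size of scikit-learn's built-in English stop-word list and a few sample entries
len(ENGLISH_STOP_WORDS)
sorted(ENGLISH_STOP_WORDS)[:10]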
To fit LDA topic models, one first needs to import the LDA class from scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
I opted for 5 topics and a maximum of 10 iterations.
# n_components (called n_topics in older scikit-learn releases) sets the number of topics
lda = LatentDirichletAllocation(n_components=5, learning_method="batch", max_iter=10, random_state=0)
Then I fit the LDA model:
topics = lda.fit_transform(tokens_reduced)
lda.components_.shape
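The components_ attribute has one row per topic and one column per feature, and each entry can be read as an unnormalized pseudo-count of how strongly a word is associated with a topic. As a rough sketch, each row can be normalized to obtain a per-topic word distribution:
import numpy as np
# Normalize each row of components_ so it sums to 1,
# giving an approximate word distribution for each topic
topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
topic_word_dist.shape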
Next, I sorted the features within each topic from most to least important.
import numpy as np
# For each topic, get the feature indices sorted from largest to smallest weight
components_sorted = np.argsort(lda.components_, axis=1)[:, ::-1]
Now we can take a look at the features we have:
# get_feature_names_out replaces get_feature_names in newer scikit-learn versions
features = np.array(vectorize.get_feature_names_out())
features
Finally, using the machine learning helper library mglearn, I laid out all the topics that the LDA model discovered.
import mglearn
mglearn.tools.print_topics(topics=range(5), feature_names=features, sorting=components_sorted, topics_per_chunk=5, n_words=9)
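mglearn.tools.print_topics is only a convenience function; if the mglearn package is not available, a similar table of the top words per topic can be printed with plain NumPy. The snippet below sketches an equivalent display:
# Print the nine highest-weighted words for each of the five topics
for topic_idx in range(5):
    top_words = features[components_sorted[topic_idx, :9]]
    print("Topic {}: {}".format(topic_idx, " ".join(top_words)))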