Making Sense of Organizational Documents by Using Natural Language Processing Tools

Jing-Mao Ho

Introduction

One of the research questions that I address in my dissertation project is: What does the term "statistics" mean? I approach this question from a number of perspectives, each corresponding to a context in which the meaning of the term is created, appropriated, and reproduced. In this document, I focus on the political context: official statistics are everywhere in society, and national governments are the key actors that keep and disseminate statistical numbers. To understand the meaning of the term statistics in this context, I collected official documents from 192 countries' national statistical offices (NSOs) that introduce their missions and tasks. Methodologically, I chose an approach that has been commonly used in the field of natural language processing: LDA topic models.

Processing the Data

Loading Data

To begin, I loaded the "nltk" package, a popular Python library for natural language processing, and read in the text data that I had already collected.

In [1]:
import nltk
# Read in the collected NSO documents from a single text file
f = open("E:\\Topic\\organizations.txt")
raw = f.read()
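
A caveat: open() here uses the platform's default encoding. If the file contains non-ASCII characters, it may be safer to specify the encoding explicitly; a minimal sketch, assuming the file is saved as UTF-8:

# Assumption: the file is UTF-8 encoded
with open("E:\\Topic\\organizations.txt", encoding="utf-8") as f:
    raw = f.read()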

Tokenization

Next, I tokenized the document; tokenization is the most basic step in natural language processing.

In [2]:
tokens = nltk.word_tokenize(raw)  # split the raw text into word tokens
text = nltk.Text(tokens)          # wrap the tokens for corpus exploration
In [4]:
text
Out[4]:
<Text: Country Profile of Afghanistan Main statistical agency Main...>
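
As a quick sanity check, one can look at the size of the tokenized corpus; a sketch:

print(len(tokens))   # 320083 tokens here, matching the rows of the feature matrix built below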

There is a very useful method, concordance(), that lists every occurrence of a given word together with its surrounding context. In my research, I am particularly interested in the word "census."

In [5]:
text.concordance('census')
Displaying 25 of 1059 matches:
 5 . Finance and Administration 6 . Census and Surveys 7 . Provincial Supervis
a collection Most recent population census 1 April 2001 Access to administrati
tional Commission of the Population Census ( N.C.P.C ) was created , and the f
 created , and the first population census was conducted in 1966 . In 1971 , i
n 1972-1975 , the second population census in 1977 , and the census and househ
population census in 1977 , and the census and household consumption survey in
a collection Most recent population census 16-30 April 2008 Access to administ
a collection Most recent population census 2009 Data confidentiality The Andor
available as a result of occasional census , detailed trade data collection an
. The next household and population census was conducted in 2001 , which provi
a collection Most recent population census 9 May 2001 Access to administrative
a collection Most recent population census 28 May 2001 Data confidentiality In
ruise Ship Arrivals 2006 , Complete Census Summary Report 2001 , The 2005 Top 
 1869 the first National Population Census was carried out throughout the enti
pervision of all Federal survey and census operations , as well as the elabora
a collection Most recent population census 18 November 2001 Data dissemination
esults of the National Agricultural Census . Year 2002 . Cereal produce , by c
tion , there is a Law on Population Census . The Law “ On Population Census ” 
on Census . The Law “ On Population Census ” ( adopted on October 12th , 1999 
e the guidelines for the population census , which was organised in 2001 . The
nd implementation of the population census . It also regulates financial means
data gathered during the population census . The Law on Civil Service , adopte
er 2002 . The Law “ On Agricultural Census ” has been adopted by the RA Govern
a collection Most recent population census 10-19 October 2001 . Access to admi
a collection Most recent population census 14 October 2000 Access to administr
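
By default, concordance() displays up to 25 matches with roughly 80 characters of context each; both limits are adjustable through NLTK's width and lines parameters:

text.concordance('census', width=100, lines=50)   # wider context, more matches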

Feature Extraction

Transforming Tokens to Matrices

To extract the information I need from the tokenized data, I have to create a feature matrix. To do so, I used the Python machine-learning package scikit-learn.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

Next, I fit a CountVectorizer on the tokens; fit() learns the vocabulary that later maps the tokens to a count matrix.

In [7]:
vectorize = CountVectorizer()
vectorize.fit(tokens)
Out[7]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
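
Note that because I passed the token list directly, each individual token is treated as a separate "document" by the vectorizer. For contrast, a toy sketch (with made-up strings) of the more common whole-document usage:

# Hypothetical two-document corpus, for illustration only
docs = ["the census was conducted in 2001",
        "the statistical office publishes annual reports"]
toy = CountVectorizer()
X = toy.fit_transform(docs)   # shape: (2 documents, vocabulary size)
print(X.toarray())            # raw term counts per document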

One can also retrieve the names of all the features.

In [8]:
features = vectorize.get_feature_names()

Filtering Out Uninformative Features

Very common but uninformative words (features), such as "the" and "of," are called "stop words." scikit-learn has a built-in list of common English stop words, which I used to filter them out of my data; I also set frequency thresholds (min_df and max_df) to drop very rare and very frequent features.

In [9]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Keep features occurring in at least 100 and at most 10,000 "documents",
# and drop scikit-learn's built-in English stop words
vectorize = CountVectorizer(min_df=100, max_df=10000, stop_words="english").fit(tokens)
tokens_reduced = vectorize.transform(tokens)
print("tokens without stop words:\n{}".format(repr(tokens_reduced)))
tokens without stop words:
<320083x289 sparse matrix of type '<class 'numpy.int64'>'
	with 100258 stored elements in Compressed Sparse Row format>
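
The imported ENGLISH_STOP_WORDS frozenset can also be inspected directly; a quick sketch:

print(len(ENGLISH_STOP_WORDS))         # size of scikit-learn's built-in list
print(sorted(ENGLISH_STOP_WORDS)[:8])  # a few sample stop words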

As a result, I got a feature matrix based on the data with the stop words filtered out.

In [10]:
tokens_reduced
Out[10]:
<320083x289 sparse matrix of type '<class 'numpy.int64'>'
	with 100258 stored elements in Compressed Sparse Row format>
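
Summing the sparse matrix column-wise gives each remaining feature's total count, a quick way to sanity-check the reduced vocabulary; a sketch:

import numpy as np
feature_names = vectorize.get_feature_names()
counts = np.asarray(tokens_reduced.sum(axis=0)).ravel()  # total count per feature
for i in np.argsort(counts)[::-1][:10]:                  # ten most frequent features
    print(feature_names[i], counts[i])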

LDA Topic Models

To fit LDA topic models, one first needs to import the LatentDirichletAllocation class from scikit-learn.

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

I opted for 5 topics and a maximum of 10 iterations.

In [12]:
lda = LatentDirichletAllocation(n_topics=5, learning_method="batch", max_iter=10, random_state=0)

Then I fit the LDA models:

In [13]:
topics = lda.fit_transform(tokens_reduced)
c:\users\hojin\anaconda3\pkgs\python-3.7.0-hea74fb7_0\lib\site-packages\sklearn\decomposition\online_lda.py:314: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21
  DeprecationWarning)
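
As the warning indicates, n_topics was renamed to n_components in scikit-learn 0.19; the equivalent, warning-free call is:

lda = LatentDirichletAllocation(n_components=5, learning_method="batch",
                                max_iter=10, random_state=0)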
In [14]:
lda.components_.shape
Out[14]:
(5, 289)
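
The components_ matrix has one row per topic and one column per vocabulary feature. Normalizing each row turns the raw weights into a topic-word distribution; a sketch:

import numpy as np
# Each row of word_dist sums to 1: roughly P(word | topic)
word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
word_dist.shape   # still (5, 289)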

Next, I sorted the features within each topic by weight, in descending order.

In [15]:
import numpy as np
# For each topic (row), sort feature indices by weight in descending order
components_sorted = np.argsort(lda.components_, axis=1)[:, ::-1]

Now we can take a look at the features we have:

In [16]:
features = np.array(vectorize.get_feature_names())
features
Out[16]:
array(['10', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', 'access', 'accordance', 'according', 'accounts',
       'act', 'activities', 'activity', 'address', 'administration',
       'administrative', 'advisory', 'affairs', 'agencies', 'agency',
       'agricultural', 'agriculture', 'analysis', 'annual', 'annually',
       'appointed', 'approved', 'area', 'areas', 'article', 'authorities',
       'authority', 'availability', 'available', 'azerbaijan',
       'background', 'bank', 'based', 'basic', 'basis', 'board', 'bodies',
       'body', 'brief', 'budget', 'bulletin', 'bureau', 'business',
       'cabinet', 'calendar', 'carried', 'cd', 'census', 'censuses',
       'central', 'collect', 'collected', 'collection', 'commission',
       'committee', 'compilation', 'conditions', 'conduct', 'conducted',
       'conducting', 'confidential', 'confidentiality', 'consumer',
       'cooperation', 'coordination', 'council', 'country', 'created',
       'cso', 'data', 'databanks', 'databases', 'decree', 'demographic',
       'department', 'departments', 'deputy', 'des', 'development',
       'different', 'director', 'directorate', 'disseminated',
       'dissemination', 'division', 'divisions', 'duties', 'economic',
       'economy', 'education', 'employees', 'employment', 'en', 'english',
       'ensure', 'enterprises', 'environment', 'established',
       'establishment', 'establishments', 'et', 'european', 'executive',
       'existence', 'external', 'federal', 'field', 'finance',
       'financial', 'following', 'foreign', 'form', 'functions',
       'general', 'georgia', 'geostat', 'gov', 'government',
       'governmental', 'head', 'health', 'history', 'household',
       'housing', 'http', 'implementation', 'including', 'income',
       'index', 'indicators', 'individual', 'industrial', 'industry',
       'ine', 'information', 'institute', 'institution', 'institutions',
       'international', 'issued', 'issues', 'la', 'labour', 'languages',
       'law', 'legal', 'legislation', 'level', 'local', 'main', 'making',
       'management', 'members', 'methodology', 'microdata', 'minister',
       'ministers', 'ministries', 'ministry', 'monthly', 'multi',
       'national', 'necessary', 'needs', 'new', 'non', 'nso', 'number',
       'observations', 'office', 'offices', 'official', 'online',
       'operations', 'order', 'organisation', 'organisations',
       'organization', 'organizational', 'organizations', 'paper',
       'period', 'person', 'persons', 'plan', 'planning', 'policy',
       'population', 'position', 'preparation', 'president', 'price',
       'prices', 'private', 'procedures', 'process', 'processing',
       'produced', 'producers', 'production', 'profile', 'program',
       'programme', 'programs', 'protection', 'provide', 'provided',
       'provides', 'providing', 'public', 'publication', 'publications',
       'publish', 'published', 'purpose', 'purposes', 'quality',
       'quarterly', 'recent', 'records', 'regional', 'register',
       'related', 'release', 'relevant', 'report', 'reporting', 'reports',
       'republic', 'required', 'research', 'resources', 'respondents',
       'responsibility', 'responsible', 'results', 'rom', 'scientific',
       'section', 'sector', 'service', 'services', 'set', 'shall',
       'social', 'socio', 'sources', 'special', 'staff', 'standards',
       'state', 'statistical', 'statistician', 'statistics', 'structure',
       'studies', 'subject', 'support', 'survey', 'surveys', 'technical',
       'term', 'time', 'trade', 'training', 'transport', 'ukraine',
       'unit', 'units', 'use', 'used', 'users', 'various', 'web',
       'website', 'work', 'www', 'year', 'yearbook', 'years'],
      dtype='<U15')

Finally, using the machine-learning utility library mglearn, I laid out all the topics that the LDA model discovered.

In [17]:
import mglearn
mglearn.tools.print_topics(topics=range(5), feature_names=features, sorting=components_sorted, topics_per_chunk=5, n_words=9)
topic 0       topic 1        topic 2       topic 3       topic 4
--------      --------       --------      --------      --------
published     national       statistical   data          main
calendar      census         statistics    state         annual
paper         official       website       population    online
department    economic       information   government    databanks
act           release        publications  rom           existence
legal         cd             disseminated  ministry      division
bodies        office         databases     agency        general
surveys       administrative activities    annually      social
law           dissemination  council       public        basis
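
For readers without mglearn installed, a plain-NumPy sketch prints a similar table of top words per topic:

for t in range(5):
    top_words = [features[i] for i in components_sorted[t, :9]]
    print("topic {}: {}".format(t, "  ".join(top_words)))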