Where Does World Society Come From?

A Methodological Appendix

Jing-Mao Ho

Automated Data Collection

To collect data from the online version of the Year Book of International Organizations, I wrote an automated program that downloads every relevant page from the Year Book's website. Below I demonstrate how to use a web crawler to collect the raw data (web pages).

To start off, import the packages that we need:

In [2]:
import urllib.request
import time
import random

Next, I use a simple web crawler to scrape all of the relevant pages from the website:

In [ ]:
i=0
while i < 68710:
    i+=1
    id_str = str(1100000000 + i)     # page ids run from 1100000001 upward
    r = urllib.request.urlopen('http://ybio.brillonline.com/s/or/en/'+id_str).read()
    w = open(id_str+".txt", 'wb')    # urlopen().read() returns bytes, so write in binary mode
    w.write(r)
    w.close()
    time.sleep(random.randint(0,60)) # random pause between requests to avoid overloading the server
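The loop above assumes that every request succeeds. As a minimal sketch of a more defensive version for a single page (the URL pattern and file naming follow the loop above, while the helper name, retry limit, and timeout are illustrative assumptions, not part of the original crawler), one could wrap the request like this:

In [ ]:
import random
import time
import urllib.error
import urllib.request

def fetch_page(org_id, max_retries=3):
    """Download one organization page and save it as <id>.txt (illustrative helper)."""
    url = 'http://ybio.brillonline.com/s/or/en/' + org_id
    for attempt in range(max_retries):
        try:
            html = urllib.request.urlopen(url, timeout=30).read()
            with open(org_id + ".txt", 'wb') as f:
                f.write(html)
            return True
        except urllib.error.URLError:
            time.sleep(random.randint(1, 60))   # back off before retrying
    return False

fetch_page("1100000001")

The random pause serves the same purpose as the time.sleep call in the loop above: it spreads requests out so the crawler does not overload the server.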

Data Wrangling: Parsing Raw Data (HTML Files)

Data wrangling refers to the process of transforming raw data into a format that contains the information we need for later analysis.

In this case, the raw data are HTML files, which are difficult for humans to read, so the useful information has to be extracted from them.

To do so, I use an HTML parser: the Python package BeautifulSoup.

In [4]:
from bs4 import BeautifulSoup
import bs4 as bs
import pandas as pd
import os
import timeit
import re

Next, we parse the HTML files to extract the relevant information, which in this first case is the names of international organizations. Two details require attention. First, names come in different formats, so extra, unnecessary characters (tabs, line breaks, repeated spaces) must be removed. Second, a single organization can have several different names, so these alternative names have to be identified, as the sketch below illustrates.
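To make these cleanup steps concrete, here is a minimal, self-contained sketch that applies the same regular expressions to a toy HTML fragment; the fragment and the organization name are invented for illustration, while the h1.title selector matches the one used in the full loop.

In [ ]:
import re
from bs4 import BeautifulSoup

# Toy HTML fragment; the organization name is invented for illustration.
html = "<h1 class='title'>\n  International   Example \t Union\r\n (IEU)\n</h1>"

soup = BeautifulSoup(html, "html.parser")
title = soup.select('h1.title')[0].text.strip()
title = re.sub(r'[\t\r\n]', ' ', title)   # replace tabs and line breaks with spaces
title = re.sub(' +', ' ', title)          # collapse repeated spaces

title_short = re.sub(r"[\(\[].*?[\)\]]", "", title).rstrip()  # name without "(IEU)"
acro = title[title.find("(") + 1:title.find(")")]             # the acronym "IEU"

print(title_short)   # International Example Union
print(acro)          # IEU

The full loop below applies the same steps to every downloaded file and also collects the alternative names: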

In [ ]:
ids = []; titles = []; name2s = []; name3s = []

##--------------------------
## get the org names
##--------------------------
count = 0
for j in range(1100000001, 1100068711):   # ids of the downloaded pages
    if count % 10000 == 0:
        # write the names collected so far to disk, then reset the lists
        link_df = pd.DataFrame({"id": ids, "name": titles})
        link_df.to_csv('E:\\INGO\\data\\idtitle_df_'+str(count)+".csv", encoding="utf-8")
        ids = []; titles = []
    count += 1

    id_str = str(j)
    org = open("E:\\INGO\\data\\"+id_str+".txt", "rb")
    soup = BeautifulSoup(org, "html.parser")

    ##--------------------------
    ## get the title
    ##--------------------------
    title = soup.select('h1.title')[0].text.strip()
    title = re.sub(r'[\t\r\n]', ' ', title)   # replace \t, \r, \n with spaces
    title = re.sub(' +', ' ', title)          # collapse multiple spaces

    # get the title without parentheses
    title_short = re.sub(r"[\(\[].*?[\)\]]", "", title)   # drop text between parentheses
    title_short = title_short.rstrip()                    # drop trailing whitespace

    # get the acronym
    acro = title[title.find("(")+1:title.find(")")]       # text between parentheses

    titles.append(title_short)
    ids.append(int(j))

    # get name 3 (the text following the first <br>, if any)
    na3 = soup.br.next_sibling if soup.br else None

    na_extra = soup.find('p', style=re.compile(r'font-style:italic;color:#666;margin-top:-20px;'))

    na2 = []

    # name 2 = extra-name block minus name 3
    name_extra = na_extra.get_text() if na_extra else ""
    if name_extra:
        if na3:
            na3 = str(na3)
            if na3 in name_extra:
                na2 = name_extra.replace(na3, "")
                name3s.append(na3)
            else:
                name3s.append("")
        else:
            na2 = name_extra
            name3s.append("")
    else:
        name3s.append("")

    if na2:
        name2s.append(na2)
    else:
        name2s.append("")

Then I combine all the data extracted from HTML files:

In [ ]:
fnames = os.listdir("E:\\INGO\\data\\")
idtitle_df_list = []

fnames = [m for m in fnames if m.startswith("idtitle_df_")]   # the chunked files written above
for fname in fnames:
    idtitle_df_list.append(pd.read_csv("E:\\INGO\\data\\"+fname, index_col=False))
    print(fname)

idtitle_dfs = pd.concat(idtitle_df_list)

idtitle_dfs['id'] = idtitle_dfs['id'] - 1100000000   # strip the constant prefix from the page ids

idtitle_dfs.to_csv("E:\\INGO\\data\\idtitlev.csv", index=False)

Data Wrangling: Merging Data Parsed from HTML Files

Now that all the HTML files scraped from the website have been parsed, the next step is to merge the extracted data. Because the same organization may be recorded under its primary name, a secondary name, or a third name, each name column has to be matched against the attribute table separately before the results are combined, as the sketch below illustrates.
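As a hedged sketch of why three separate merges are needed, the toy example below matches each name variant against a small attribute table and stacks the results; the column values are invented for illustration.

In [ ]:
import pandas as pd

# Toy stand-ins for the parsed-name table (df1) and the attribute table (df2).
df1 = pd.DataFrame({'id':    [1, 2],
                    'name':  ['Org Alpha', 'Org Beta'],
                    'name2': ['Alpha Organization', None],
                    'name3': [None, 'Organisation Beta']})
df2 = pd.DataFrame({'name':  ['Alpha Organization', 'Organisation Beta'],
                    'found': [1920, 1955]})

# Match each name variant against the table, then stack the results.
m1 = pd.merge(df1, df2, on='name')                          # primary name
m2 = pd.merge(df1, df2, left_on='name2', right_on='name')   # secondary name
m3 = pd.merge(df1, df2, left_on='name3', right_on='name')   # third name

combined = pd.concat([m1, m2, m3], sort=False).sort_values('id')
print(combined[['id', 'found']])

The actual merges over the parsed files follow the same pattern: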

In [ ]:
df1=pd.read_csv("E:\\INGO\\data\\idtitlev.csv", index_col=False) # id
df2=pd.read_csv("E:\\INGO\\data\\ingo.table.csv", index_col=False) #table

result1 = pd.merge(df1, df2, left_on='name', right_on='name')

result2 = pd.merge(df1, df2, left_on='name2', right_on='name')
result2.columns.values[3] = 'name'
result2 = result2.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)

result3 = pd.merge(df1, df2, left_on='name3', right_on='name')
result3.columns.values[3] = 'name'
result3 = result3.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)

frames = [result1,result2,result3]
result = pd.concat(frames).reindex(columns=result1.columns)   # align all pieces to result1's columns
result = result.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)
result = result.sort_values('id')

##----------------------------------------------------
## merge the extra with table    
##----------------------------------------------------
df1=pd.read_csv("E:\\INGO\\data\\idtitlev_extra.csv", index_col=False)
df2=pd.read_csv("E:\\INGO\\data\\ingo.table.csv", index_col=False)

cols_to_use = df2.columns.difference(df1.columns)

result_extra1 = pd.merge(df1, df2, left_on='name', right_on='name')

result_extra2 = pd.merge(df1, df2, left_on='name2', right_on='name')
result_extra2.columns.values[3] = 'name'
result_extra2 = result_extra2.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)

result_extra3 = pd.merge(df1, df2, left_on='name3', right_on='name')
result_extra3.columns.values[3] = 'name'
result_extra3 = result_extra3.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)

frames = [result_extra1,result_extra2,result_extra3]
result_extra = pd.concat(frames).reindex(columns=result_extra1.columns)   # align to result_extra1's columns
result_extra = result_extra.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)
result_extra = result_extra.sort_values('id')

result_extra.to_csv('E:\\INGO\\data\\io4_extra.all.csv')

##----------------------------------------------------
## concat result and result_extra  
##----------------------------------------------------
frames = [result,result_extra]
result_entire = pd.concat(frames).reindex(columns=result.columns)   # align to result's columns

result_entire.to_csv('E:\\INGO\\data\\io.all.csv')

Preparing Data for Longitudinal Network Analysis (in Python)

Now I prepare the data for further analysis. First, because my goal is to model the founding of international organizations at the country level, I combine the data collected from the Year Book with a preexisting dataset on countries around the world.

In [ ]:
country = pd.read_csv("E:\\INGO\\data\\member_all.newname.csv", index_col=0)
country_name = country.country.unique()
country_name = country_name.tolist()
df = pd.read_csv("E:\\INGO\\data\\io.country.csv")

# replace missing membership entries with 0
for item in country_name:
    df[item].fillna(0, inplace=True)

# total number of member states per organization
df_country = df[country_name].copy()
df_country['state.total'] = df_country[country_name].sum(axis=1)

df_country['id'] = df['id']
colnames = df_country.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]   # move 'id' to the front
df_country = df_country[colnames]

df_country['state.total'].describe()
df['state.total'] = df_country['state.total']
df['state.total'].describe()
df.to_csv('E:\\INGO\\data\\io.country.total.csv', index=False)

Second, since I will model the founding of IOs from 1800 to the 2000s, I build a data set for longitudinal analysis. More precisely, it is a country-year panel, with one row per country per year, as sketched below.
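As a minimal sketch of the target structure (with invented country names and independence years), the panel contains one row per country per year from the year of independence through 2016:

In [ ]:
import pandas as pd

# Illustrative independence years; the real values come from the country file below.
indep = {'Country A': 1830, 'Country B': 1960}

rows = []
for name, start in indep.items():
    for year in range(start, 2017):
        rows.append({'country': name, 'year': year})

panel = pd.DataFrame(rows)
print(panel.groupby('country')['year'].agg(['min', 'max', 'count']))

The code that builds this structure from the actual country and IO files follows: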

In [ ]:
import numpy as np

country = pd.read_csv("E:\\INGO\\data\\country.io.2.csv")
io = pd.read_csv("E:\\INGO\\data\\io.country.total.csv")

io_name = country.columns.tolist()
n = 5
io_name = io_name[n:]                    # the IO columns start after the first five columns
head = ['ccode', 'country', 'year']
head += io_name

## build one country-year dataframe per country ##
for i in range(13, len(country)):
    if not np.isnan(country.loc[i, 'indepy18']):
        a = 2017 - int(country.loc[i, 'indepy18'])       # number of observed years
        country_year = pd.DataFrame(index=range(a), columns=head)
        country_year['ccode'] = country.loc[i, 'ccode']
        country_year['country'] = country.loc[i, 'country']
        country_year['year'] = list(range(int(country.loc[i, 'indepy18']), 2017))

        for j in range(0, len(io)):                      # loop over the rows of io
            id = str(int(io.loc[j, 'id']))
            if country.loc[i, id] == 1:                  # the country is a member of this IO
                if not np.isnan(io.loc[j, 'found']):
                    start = int(io.loc[j, 'found'])                        # founding year
                    start_index = int(start - country.loc[i, 'indepy18'])  # founding index
                    if not np.isnan(io.loc[j, 'diss']):
                        end = int(io.loc[j, 'diss'])                       # dissolution year
                    else:
                        end = 2017                                         # still active: right-censored
                    end_index = int(end - country.loc[i, 'indepy18'])      # end index
                    country_year.loc[start_index:end_index, id] = 1        # membership spell = 1
        country_year.to_csv('E:\\INGO\\data\\country.year_'+country.loc[i, 'country']+'.csv')

Third, I also use data on intergovernmental organizations from the 1800s to the 2000s. The first step is to split the data by year so that network measures can be computed for each annual network. To begin, I import the libraries I need:

In [6]:
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms import *
import pandas as pd
import numpy as np
import scipy as sp
import os

Next, we load data from a local folder into a Pandas dataframe object in Python:

In [7]:
df2 = pd.read_csv("E:\\IGO\\data\\igo.all.csv")
In [8]:
df2[:5]
Out[8]:
ccode year AAAID AAB AACB AACarib AALCO AARO AATA AATPO ... WEU WHO WIPO WMO WNF WPact WTO WTOURO Wassen igosum
0 100.0 1832.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 100.0 1833.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100.0 1834.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100.0 1835.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 100.0 1836.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 532 columns

Then I "divide" the dataframe, "df2," into smaller dataframes by year, and write them back to csv files in another folder.

In [ ]:
y = df2['year'].unique().tolist()
for item in y:
    df = df2[df2.year==item]
    year_index = str(int(item))
    df.to_csv('E:\\IGO2\\data\\network by time\\igo_'+year_index+'.csv',encoding='utf-8',index=False)

Now we have 100+ networks in a local folder.

Next, we generate network measures, such as centrality and clustering coefficients, for further analysis. An important step is to build each bipartite (country-by-IGO) graph from its biadjacency matrix and then project it onto the country nodes, so that the measures can be computed on an ordinary one-mode network. The procedure is sketched on a toy example below and then implemented in the following chunk of code:
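Here is a hedged, self-contained sketch of that step on a toy biadjacency matrix (three countries by three IGOs, values invented for illustration): the bipartite graph is built from the matrix and then projected onto the country nodes, with edge weights equal to the number of IGOs two countries share.

In [ ]:
import numpy as np
import networkx as nx
from networkx.algorithms import bipartite
from scipy.sparse import csr_matrix

# Toy biadjacency matrix: 3 countries (rows) x 3 IGOs (columns).
biadj = csr_matrix(np.array([[1, 1, 0],
                             [1, 0, 1],
                             [0, 1, 1]]))

B = bipartite.from_biadjacency_matrix(biadj)
countries = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}

# One-mode projection: edge weight = number of IGOs two countries share.
GW = bipartite.weighted_projected_graph(B, countries)
print(list(GW.edges(data=True)))
print(nx.degree_centrality(GW))

The same conversion and projection are applied to every yearly network in the loop below: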

In [ ]:
fnames = os.listdir("E:\\IGO2\\data\\network by time\\")
fnames = [m for m in fnames if m.startswith("igo_")]
for fname in fnames:
    file = str(fname)
    df_time = pd.read_csv('E:\\IGO2\\data\\network by time\\'+file)
    df_test = df_time.iloc[:, 2:]               # drop ccode and year, keep the IGO membership columns
    df_test = df_test.drop(['igosum'],axis=1)
    df_matrix = sp.sparse.csr_matrix(df_test.values)
    B=bipartite.from_biadjacency_matrix(df_matrix)
    nx.is_connected(B)
    top_nodes = set(n for n,d in B.nodes(data=True) if d['bipartite']==0)
    bottom_nodes = set(B) - top_nodes
    GW = bipartite.weighted_projected_graph(B, top_nodes)
    GW2 = bipartite.overlap_weighted_projected_graph(B, top_nodes)
    GW3 = bipartite.collaboration_weighted_projected_graph(B, top_nodes)

    df_netV = df_time.iloc[:, 0:2]              # keep ccode and year
    df_netV['igosum'] = df_time['igosum']
    df_netV['total'] = df_time['igosum'].sum()

    d_cen = degree_centrality(GW)                   # degree centrality
    df_netV['d_cen'] = pd.Series(d_cen)             # align values by node label
    c_cen = closeness_centrality(GW)                # closeness centrality
    df_netV['c_cen'] = pd.Series(c_cen)
    b_cen = betweenness_centrality(GW)              # betweenness centrality
    df_netV['b_cen'] = pd.Series(b_cen)

    e_cen_w = eigenvector_centrality_numpy(GW, weight='weight')     # eigenvector, simple weights
    df_netV['e_cen_w'] = pd.Series(e_cen_w)
    e_cen_w_2 = eigenvector_centrality_numpy(GW2, weight='weight')  # eigenvector, overlap weights
    df_netV['e_cen_w_2'] = pd.Series(e_cen_w_2)
    e_cen_w_3 = eigenvector_centrality_numpy(GW3, weight='weight')  # eigenvector, collaboration weights
    df_netV['e_cen_w_3'] = pd.Series(e_cen_w_3)

    cluster = clustering(GW)                        # clustering coefficient
    df_netV['cluster'] = pd.Series(cluster)

    df_netV.to_csv('E:\\IGO2\\data\\network by time\\'+'netV_'+file,encoding='utf-8',index=False)
    
netv_igo_list=[]
fnames = os.listdir("E:\\IGO2\\data\\network by time\\")
fnames = [m for m in fnames if m.startswith("netV_igo")]
for fname in fnames:
    netv_igo_list.append(pd.read_csv("E:\\IGO2\\data\\network by time\\"+fname))

netV_igo_all = pd.concat(netv_igo_list)
netV_igo_all.to_csv('E:\\IGO2\\data\\network by time\\netV_igo.csv',encoding='utf-8',index=False)    

Preparing Data for Panel Regression Analysis (in R)

Finally, to conduct the panel regression analysis, I build a data set that combines longitudinal data from a number of different sources. (Note: this chunk of code is written in R.)

In [ ]:
library(foreign)

country_year <- read.csv("E:/INGO/data/country.year.HQ.3.t.csv",stringsAsFactors = FALSE)
polity <- read.csv("E:/INGO/data/newpolity.csv",stringsAsFactors = FALSE)
nmc <- read.csv("E:/INGO/data/nmc.csv",stringsAsFactors = FALSE)
ngo <- read.csv("E:/INGO/data/ngo.csv",stringsAsFactors = FALSE)
igo <- read.csv("E:/IGO2/data/network by time/netV_igo.csv",stringsAsFactors = FALSE)
Banks <- read.csv("E:/INGO/data/banks.6.csv",stringsAsFactors = FALSE)
maddison <- read.csv("E:/INGO/data/maddison.csv")
gdp.mix <- read.csv("E:/INGO/data/gdp.mix.csv")
war <- read.csv("E:/INGO/data/war/war.year.all.csv")
trade <- read.csv("E:/INGO/data/country_trade.csv")

all <- merge(polity, country_year, by=c("ccode","year"))
all <- merge(all, nmc, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, ngo, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, igo, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, Banks, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, maddison, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, gdp.mix, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, war, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, trade, by=c("ccode","year"), all.x=TRUE)

write.csv(all, "E:/INGO/data/merge.2.csv",row.names=FALSE)

For the panel regression analysis, please see the document: Replication Code