To collect data from the Year Book of International Organizations (the online version), I wrote an automated program that downloads every page of the Year Book's website. Below I demonstrate how to use a web crawler to collect the raw data (web pages).
To start off, import the packages that we need:
import urllib.request
import time
import random
Next, I use a simple web crawler to scrape all the pages on the website:
for i in range(1, 68711):
    # build the ten-digit page ID: fixed prefix "11000" + zero-padded counter
    id_str = "11000" + "{:05d}".format(i)
    # download the page and save the raw bytes locally
    r = urllib.request.urlopen('http://ybio.brillonline.com/s/or/en/' + id_str).read()
    with open(id_str + ".txt", 'wb') as w:
        w.write(r)
    # pause for a random interval to avoid overloading the server
    time.sleep(random.randint(0, 60))
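For reference, each page ID built above is ten digits long: the fixed prefix 11000 followed by the zero-padded counter. This is also why the IDs are later shifted back by subtracting 1100000000.
print("11000" + "{:05d}".format(1))      # 1100000001, the first page ID
print("11000" + "{:05d}".format(68710))  # 1100068710, the last page ID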
Data wrangling refers to the process of transforming raw data into a format that contains the information we need for later analysis.
In this case, the raw data are HTML files, which are difficult for humans to read, so the relevant information has to be extracted from them. To do this, I use an HTML parser: the Python package BeautifulSoup.
from bs4 import BeautifulSoup
import pandas as pd
import os
import timeit
import re
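As a quick illustration of how BeautifulSoup locates elements, here is a minimal example on a made-up HTML snippet (not the actual Yearbook markup); the real pages are handled the same way in the loop below.
from bs4 import BeautifulSoup
html = '<html><body><h1 class="title"> Example  Organization (EO) </h1></body></html>'
soup = BeautifulSoup(html, "html.parser")
print(soup.select('h1.title')[0].text.strip())   # Example  Organization (EO)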
Next, we parse all the HTML files to extract the relevant information (here, the names of international organizations). Two details need attention. First, names come in different formats, so extra, unnecessary characters must be stripped out. Second, a single organization can have several different names, so those redundant names have to be identified.
ids = []; titles = []; name2s = []; name3s = []
##--------------------------
## get the org names
##--------------------------
count = 0
for j in range(1100000001, 1100068711):   # the page IDs produced by the crawler above
    if count % 10000 == 0:
        # checkpoint: write the current batch to disk and start a new one
        link_df = pd.DataFrame({"id": ids, "name": titles, "name2": name2s, "name3": name3s})
        link_df.to_csv('E:\\INGO\\data\\idtitle_df_' + str(count) + ".csv", encoding="utf-8")
        ids = []; titles = []; name2s = []; name3s = []
    count += 1
    id_str = str(j)
    # the crawler stored each page locally, so open the saved file directly
    with open("E:\\INGO\\data\\" + id_str + ".txt", encoding="utf-8") as org:
        soup = BeautifulSoup(org, "html.parser")
    ##--------------------------
    ## get the title
    ##--------------------------
    hits = soup.select('h1.title')
    if not hits:                               # skip pages without an organization title
        continue
    title = hits[0].text.strip()
    title = re.sub(r'[\t\r\n]', ' ', title)    # remove \t\r\n
    title = re.sub(' +', ' ', title)           # collapse multiple spaces
    # get the title without parentheses
    title_short = re.sub(r"[\(\[].*?[\)\]]", "", title)  # drop text between parentheses/brackets
    title_short = title_short.rstrip()                   # drop trailing whitespace
    # get the acronym
    acro = title[title.find("(") + 1:title.find(")")]    # text between the parentheses
    titles.append(title_short)
    ids.append(int(j))
    # get name3: the text right after the first <br>, if any
    na3 = soup.br.next_sibling if soup.br else None
    na3_text = str(na3).strip() if na3 else ""
    # the italic paragraph holds the extra names
    na_extra = soup.find('p', style=re.compile(r'font-style:italic;color:#666;margin-top:-20px;'))
    name_extra = na_extra.get_text() if na_extra else ""
    na2 = ""
    # name2 = extra names minus name3
    if name_extra:
        print("yes")   # the extra-name paragraph is present
        if na3_text:
            if na3_text in name_extra:
                na2 = name_extra.replace(na3_text, "")
                name3s.append(na3_text)
            else:
                name3s.append("")
        else:
            na2 = name_extra
            name3s.append("")
    else:
        name3s.append("")
    name2s.append(na2 if na2 else "")
# write out the final, partially filled batch
link_df = pd.DataFrame({"id": ids, "name": titles, "name2": name2s, "name3": name3s})
link_df.to_csv('E:\\INGO\\data\\idtitle_df_' + str(count) + ".csv", encoding="utf-8")
Then I combine all the data extracted from HTML files:
fnames = os.listdir("E:\\INGO\\data\\")
idtitle_df_list = []
fnames = [m for m in fnames if m.startswith("idtitle_df_")]   # the batch files written above
for fname in fnames:
    idtitle_df_list.append(pd.read_csv("E:\\INGO\\data\\" + fname, index_col=False))
    print(fname)
idtitle_dfs = pd.concat(idtitle_df_list)
idtitle_dfs['id'] = idtitle_dfs['id'] - 1100000000   # shift the page IDs so they start at 1
idtitle_dfs.to_csv("E:\\INGO\\data\\idtitlev.csv", index=False)
Now that all the HTML files scraped by the crawler have been parsed, the next step is to merge the data parsed from the HTML files with the other tables. Since the data come in different formats, it is important to account for those differences.
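One detail worth spelling out: when both data frames contain a column called name and it is not the join key on both sides, pandas suffixes the overlapping columns with _x and _y, which is why the code below renames and drops them. A toy example with made-up frames:
import pandas as pd
left = pd.DataFrame({'name': ['Alpha'], 'name2': ['Beta']})
right = pd.DataFrame({'name': ['Beta'], 'founded': [1950]})
print(pd.merge(left, right, left_on='name2', right_on='name').columns.tolist())
# ['name_x', 'name2', 'name_y', 'founded']
With that in mind, the actual merges are: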
df1=pd.read_csv("E:\\INGO\\data\\idtitlev.csv", index_col=False) # id
df2=pd.read_csv("E:\\INGO\\data\\ingo.table.csv", index_col=False) #table
result1 = pd.merge(df1, df2, on='name')                          # match on the primary name
result2 = pd.merge(df1, df2, left_on='name2', right_on='name')   # match on the second name
result2 = result2.rename(columns={'name_x': 'name'})             # keep the parsed name as 'name'
result2 = result2.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)
result3 = pd.merge(df1, df2, left_on='name3', right_on='name')   # match on the third name
result3 = result3.rename(columns={'name_x': 'name'})
result3 = result3.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)
frames = [result1,result2,result3]
result = pd.concat(frames).reindex(columns=result1.columns)      # align all frames to result1's columns
result = result.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)
result = result.sort_values('id')
##----------------------------------------------------
## merge the extra with table
##----------------------------------------------------
df1=pd.read_csv("E:\\INGO\\data\\idtitlev_extra.csv", index_col=False)
df2=pd.read_csv("E:\\INGO\\data\\ingo.table.csv", index_col=False)
cols_to_use = df2.columns.difference(df1.columns)   # columns that appear only in df2
result_extra1 = pd.merge(df1, df2, on='name')
result_extra2 = pd.merge(df1, df2, left_on='name2', right_on='name')
result_extra2 = result_extra2.rename(columns={'name_x': 'name'})
result_extra2 = result_extra2.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)
result_extra3 = pd.merge(df1, df2, left_on='name3', right_on='name')
result_extra3 = result_extra3.rename(columns={'name_x': 'name'})
result_extra3 = result_extra3.drop(['Unnamed: 0','Unnamed: 0.1','name_y'], axis=1)
frames = [result_extra1,result_extra2,result_extra3]
result_extra = pd.concat(frames).reindex(columns=result_extra1.columns)
result_extra = result_extra.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)
result_extra = result_extra.sort_values('id')
result_extra.to_csv('E:\\INGO\\data\\io4_extra.all.csv')
##----------------------------------------------------
## concat result and result_extra
##----------------------------------------------------
frames = [result,result_extra]
result_entire = pd.concat(frames).reindex(columns=result.columns)
result_entire.to_csv('E:\\INGO\\data\\io.all.csv')
Now I prepare the data for further analysis. First, because my goal is to model the founding of international organizations at the country level, I combine the data collected from the Year Book with a preexisting data set on countries around the world.
country = pd.read_csv("E:\\INGO\\data\\member_all.newname.csv", index_col=0)
country_name = country.country.unique().tolist()            # list of country column names
df = pd.read_csv("E:\\INGO\\data\\io.country.csv")
df[country_name] = df[country_name].fillna(0)                # missing membership = 0
df_country = df[country_name].copy()
df_country['state.total'] = df_country[country_name].sum(axis=1)   # total member states per IO
df_country['id'] = df['id']
colnames = df_country.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]                     # move 'id' to the front
df_country = df_country[colnames]
df_country['state.total'].describe()
df['state.total'] = df_country['state.total']
df['state.total'].describe()
df.to_csv('E:\\INGO\\data\\io.country.total.csv', index=False)
Second, since I will model the founding of IOs from 1800 to the 2000s, I build a data set for longitudinal analysis; more precisely, a country-year panel.
import numpy as np
country = pd.read_csv("E:\\INGO\\data\\country.io.2.csv")
io = pd.read_csv("E:\\INGO\\data\\io.country.total.csv")
country.loc[0, 'ccode']            # quick check that the country codes loaded correctly
io_name = country.columns.tolist()
n = 5
io_name = io_name[n:]              # the IO id columns start after the first five columns
head = ['ccode', 'country', 'year']
head += io_name
## create an empty country-year dataframe for each country ##
for i in range(13, len(country)):
    if not np.isnan(country.loc[i, 'indepy18']):
        a = 2017 - int(country.loc[i, 'indepy18'])            # number of years since independence
        country_year = pd.DataFrame(index=range(a), columns=head)
        country_year['ccode'] = country.loc[i, 'ccode']
        country_year['country'] = country.loc[i, 'country']
        country_year['year'] = list(range(int(country.loc[i, 'indepy18']), 2017))
        for j in range(0, len(io)):                           # loop over the rows of io
            io_id = str(int(io.loc[j, 'id']))
            ##print("id=", io_id)
            if country.loc[i, io_id] == 1:                    # the country is linked to this IO
                if not np.isnan(io.loc[j, 'found']):
                    if not np.isnan(io.loc[j, 'diss']):
                        start = int(io.loc[j, 'found'])                        ## founding year
                        start_index = int(start - country.loc[i, 'indepy18'])  ## founding index
                        end = int(io.loc[j, 'diss'])                           ## dissolution year
                        end_index = int(end - country.loc[i, 'indepy18'])      ## dissolution index
                        country_year.loc[start_index:end_index, io_id] = 1     ## assign 1 between founding and dissolution
                    else:
                        start = int(io.loc[j, 'found'])                        ## founding year
                        start_index = int(start - country.loc[i, 'indepy18'])  ## founding index
                        end = 2017                                             ## still active: end at 2017
                        end_index = int(end - country.loc[i, 'indepy18'])      ## end index
                        country_year.loc[start_index:end_index, io_id] = 1     ## assign 1 from founding onward
        country_year.to_csv('E:\\INGO\\data\\country.year_' + country.loc[i, 'country'] + '.csv')
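As a quick sanity check (assuming the loop above has finished), one can list the per-country files that were written and inspect one of them:
import os
import pandas as pd
out_files = [f for f in os.listdir('E:\\INGO\\data\\') if f.startswith('country.year_')]
print(len(out_files), "country-year files written")
sample = pd.read_csv('E:\\INGO\\data\\' + out_files[0])
print(sample[['ccode', 'country', 'year']].head())   # the panel identifiers for the first country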
Third, I will also use data on intergovernmental organizations from the 1800s to the 2000s. The first step is to split the data into smaller pieces so that network measures can be computed for each year's network. First, I import the libraries I need:
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms import *   # exposes degree_centrality, clustering, etc. unqualified
import pandas as pd
import numpy as np
import scipy as sp
import scipy.sparse                 # needed so that sp.sparse is available below
import os
Next, we load the data from a local folder into a pandas DataFrame:
df2 = pd.read_csv("E:\\IGO\\data\\igo.all.csv")
df2[:5]   # peek at the first five rows
Then I divide the dataframe df2 into smaller dataframes by year and write them out as CSV files in another folder.
y = df2['year'].unique().tolist()
for item in y:
    df = df2[df2.year == item]
    year_index = str(int(item))
    df.to_csv('E:\\IGO2\\data\\network by time\\igo_' + year_index + '.csv', encoding='utf-8', index=False)
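An equivalent, slightly more idiomatic way to write this split is to iterate over groupby('year'), which yields one sub-frame per year:
for year, df_year in df2.groupby('year'):
    df_year.to_csv('E:\\IGO2\\data\\network by time\\igo_' + str(int(year)) + '.csv',
                   encoding='utf-8', index=False)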
Now we have 100+ networks in a local folder.
Next, we generate network measures, such as centrality and clustering coefficients, for further analysis. An important step is to turn each year's table into a biadjacency matrix, build the corresponding bipartite graph, and project it onto one of the two node sets (here, the rows) before computing the measures. The procedure is implemented in the following chunk of code:
fnames = os.listdir("E:\\IGO2\\data\\network by time\\")
fnames = [m for m in fnames if m.startswith("igo_")]
for fname in fnames:
    file = str(fname)
    df_time = pd.read_csv('E:\\IGO2\\data\\network by time\\' + file)
    # the membership columns start at the third column; drop the row-sum column
    df_test = df_time.iloc[:, 2:]
    df_test = df_test.drop(['igosum'], axis=1)
    # biadjacency matrix: rows on one side of the bipartite graph, IGOs on the other
    df_matrix = sp.sparse.csr_matrix(df_test.values)
    B = bipartite.from_biadjacency_matrix(df_matrix)
    nx.is_connected(B)   # connectivity check (result not stored)
    top_nodes = set(n for n, d in B.nodes(data=True) if d['bipartite'] == 0)   # the row nodes
    bottom_nodes = set(B) - top_nodes
    # project the bipartite graph onto the row nodes, with three weighting schemes
    GW = bipartite.weighted_projected_graph(B, top_nodes)
    GW2 = bipartite.overlap_weighted_projected_graph(B, top_nodes)
    GW3 = bipartite.collaboration_weighted_projected_graph(B, top_nodes)
    df_netV = df_time.iloc[:, 0:2].copy()
    df_netV['igosum'] = df_time['igosum']
    df_netV['total'] = df_time['igosum'].sum()
    # pd.Series(dict) aligns each measure to the row index by node id
    d_cen = degree_centrality(GW)                                  # degree centrality
    df_netV['d_cen'] = pd.Series(d_cen)
    c_cen = closeness_centrality(GW)                               # closeness
    df_netV['c_cen'] = pd.Series(c_cen)
    b_cen = betweenness_centrality(GW)                             # betweenness
    df_netV['b_cen'] = pd.Series(b_cen)
    e_cen_w = eigenvector_centrality_numpy(GW, weight='weight')    # eigenvector, simple weights
    df_netV['e_cen_w'] = pd.Series(e_cen_w)
    e_cen_w_2 = eigenvector_centrality_numpy(GW2, weight='weight') # eigenvector, overlap weights
    df_netV['e_cen_w_2'] = pd.Series(e_cen_w_2)
    e_cen_w_3 = eigenvector_centrality_numpy(GW3, weight='weight') # eigenvector, collaboration weights
    df_netV['e_cen_w_3'] = pd.Series(e_cen_w_3)
    cluster = clustering(GW)                                       # clustering coefficient
    df_netV['cluster'] = pd.Series(cluster)
    df_netV.to_csv('E:\\IGO2\\data\\network by time\\' + 'netV_' + file, encoding='utf-8', index=False)
netv_igo_list=[]
fnames = os.listdir("E:\\IGO2\\data\\network by time\\")
fnames = [m for m in fnames if m.startswith("netV_igo")]
for fname in fnames:
    netv_igo_list.append(pd.read_csv("E:\\IGO2\\data\\network by time\\" + fname))
netV_igo_all = pd.concat(netv_igo_list)
netV_igo_all.to_csv('E:\\IGO2\\data\\network by time\\netV_igo.csv',encoding='utf-8',index=False)
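To make the projection step above concrete, here is a toy example with a made-up 3-by-2 biadjacency matrix: rows that share at least one column end up connected in the projected graph, and the edge weight counts how many columns they share.
import numpy as np
import scipy.sparse as sparse
from networkx.algorithms import bipartite
M = sparse.csr_matrix(np.array([[1, 1],
                                [1, 0],
                                [0, 1]]))          # rows on one side, columns on the other
B = bipartite.from_biadjacency_matrix(M)
rows = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}
P = bipartite.weighted_projected_graph(B, rows)
print(list(P.edges(data=True)))
# e.g. [(0, 1, {'weight': 1}), (0, 2, {'weight': 1})]: row 0 shares one column with row 1 and one with row 2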
Finally, to conduct the panel regression analysis, I build a data set that combines longitudinal data from a number of different sources. (Note: this chunk of code is written in R.)
library(foreign)
country_year <- read.csv("E:/INGO/data/country.year.HQ.3.t.csv",stringsAsFactors = FALSE)
polity <- read.csv("E:/INGO/data/newpolity.csv",stringsAsFactors = FALSE)
nmc <- read.csv("E:/INGO/data/nmc.csv",stringsAsFactors = FALSE)
ngo <- read.csv("E:/INGO/data/ngo.csv",stringsAsFactors = FALSE)
igo <- read.csv("E:/IGO2/data/network by time/netV_igo.csv",stringsAsFactors = FALSE)
Banks <- read.csv("E:/INGO/data/banks.6.csv",stringsAsFactors = FALSE)
maddison <- read.csv("E:/INGO/data/maddison.csv")
gdp.mix <- read.csv("E:/INGO/data/gdp.mix.csv")
war <- read.csv("E:/INGO/data/war/war.year.all.csv")
trade <- read.csv("E:/INGO/data/country_trade.csv")
all <- merge(polity, country_year, by=c("ccode","year"))
all <- merge(all, nmc, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, ngo, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, igo, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, Banks, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, maddison, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, gdp.mix, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, war, by=c("ccode","year"), all.x=TRUE)
all <- merge(all, trade, by=c("ccode","year"), all.x=TRUE)
write.csv(all, "E:/INGO/data/merge.2.csv",row.names=FALSE)
For the panel regression analysis, please see the document: Replication Code