African Conflicts

A Series Exploring Conflicts on the Continent

Ravi 2018-03-22


The data is from the Armed Conflict Location and Event Data Project (ACLED) and can be downloaded from Kaggle: https://www.kaggle.com/jboysen/african-conflicts

From the Kaggle Description:

“This dataset codes the dates and locations of all reported political violence and protest events in dozens of developing countries in Africa. Political violence and protest includes events that occur within civil wars and periods of instability, public protest and regime breakdown. The project covers all African countries from 1997 to the present.”

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.options.display.max_columns = 999
df = pd.read_csv('african_conflicts.csv', encoding ='latin1', low_memory = False)
df.head()

	ACTOR1	ACTOR1_ID	ACTOR2	ACTOR2_ID	ACTOR_DYAD_ID	ADMIN1	ADMIN2	ADMIN3	ALLY_ACTOR_1	ALLY_ACTOR_2	COUNTRY	EVENT_DATE	EVENT_ID_CNTY	EVENT_ID_NO_CNTY	EVENT_TYPE	FATALITIES	GEO_PRECISION	GWNO	INTER1	INTER2	INTERACTION	LATITUDE	LOCATION	LONGITUDE	NOTES	SOURCE	TIME_PRECISION	YEAR
0	Police Forces of Algeria (1999-)	NaN	Civilians (Algeria)	NaN	NaN	Tizi Ouzou	Beni-Douala	NaN	NaN	Berber Ethnic Group (Algeria)	Algeria	18/04/2001	1416RTA	NaN	Violence against civilians	1	1	615	1	7	17	36.61954	Beni Douala	4.08282	A Berber student was shot while in police cust...	Associated Press Online	1	2001
1	Rioters (Algeria)	NaN	Police Forces of Algeria (1999-)	NaN	NaN	Tizi Ouzou	Tizi Ouzou	NaN	Berber Ethnic Group (Algeria)	NaN	Algeria	19/04/2001	2229RTA	NaN	Riots/Protests	0	3	615	5	1	15	36.71183	Tizi Ouzou	4.04591	Riots were reported in numerous villages in Ka...	Kabylie report	1	2001
2	Protesters (Algeria)	NaN	NaN	NaN	NaN	Bejaia	Amizour	NaN	Students (Algeria)	NaN	Algeria	20/04/2001	2230RTA	NaN	Riots/Protests	0	1	615	6	0	60	36.64022	Amizour	4.90131	Students protested in the Amizour area. At lea...	Crisis Group	1	2001
3	Rioters (Algeria)	NaN	Police Forces of Algeria (1999-)	NaN	NaN	Bejaia	Amizour	NaN	Berber Ethnic Group (Algeria)	NaN	Algeria	21/04/2001	2231RTA	NaN	Riots/Protests	0	1	615	5	1	15	36.64022	Amizour	4.90131	Rioters threw molotov cocktails, rocks and bur...	Kabylie report	1	2001
4	Rioters (Algeria)	NaN	Police Forces of Algeria (1999-)	NaN	NaN	Tizi Ouzou	Beni-Douala	NaN	Berber Ethnic Group (Algeria)	NaN	Algeria	21/04/2001	2232RTA	NaN	Riots/Protests	0	1	615	5	1	15	36.61954	Beni Douala	4.08282	Rioters threw molotov cocktails, rocks and bur...	Kabylie report	1	2001

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165808 entries, 0 to 165807
Data columns (total 28 columns):
ACTOR1              165808 non-null object
ACTOR1_ID           140747 non-null float64
ACTOR2              122255 non-null object
ACTOR2_ID           140747 non-null float64
ACTOR_DYAD_ID       140747 non-null object
ADMIN1              165808 non-null object
ADMIN2              165676 non-null object
ADMIN3              86933 non-null object
ALLY_ACTOR_1        28144 non-null object
ALLY_ACTOR_2        19651 non-null object
COUNTRY             165808 non-null object
EVENT_DATE          165808 non-null object
EVENT_ID_CNTY       165808 non-null object
EVENT_ID_NO_CNTY    140747 non-null float64
EVENT_TYPE          165808 non-null object
FATALITIES          165808 non-null int64
GEO_PRECISION       165808 non-null int64
GWNO                165808 non-null int64
INTER1              165808 non-null int64
INTER2              165808 non-null int64
INTERACTION         165808 non-null int64
LATITUDE            165808 non-null object
LOCATION            165805 non-null object
LONGITUDE           165808 non-null object
NOTES               155581 non-null object
SOURCE              165635 non-null object
TIME_PRECISION      165808 non-null int64
YEAR                165808 non-null int64
dtypes: float64(3), int64(8), object(17)
memory usage: 35.4+ MB

There are a lot of features in this data set. I’m particularly interested in the country where a conflict occurred, which countries or groups of people were involved, when the event took place, how many fatalities there were, some notes on the conflict, and what type of conflict it was.

So I am going to limit our data set to these columns:

df = df.loc[:,['ACTOR1', 'ACTOR2', 'ADMIN1', 'ADMIN2','COUNTRY', 'LOCATION', 'EVENT_DATE', 'EVENT_TYPE', 'INTERACTION', 'FATALITIES', 'NOTES']]
df.head()

	ACTOR1	ACTOR2	ADMIN1	ADMIN2	COUNTRY	LOCATION	EVENT_DATE	EVENT_TYPE	INTERACTION	FATALITIES	NOTES
0	Police Forces of Algeria (1999-)	Civilians (Algeria)	Tizi Ouzou	Beni-Douala	Algeria	Beni Douala	18/04/2001	Violence against civilians	17	1	A Berber student was shot while in police cust...
1	Rioters (Algeria)	Police Forces of Algeria (1999-)	Tizi Ouzou	Tizi Ouzou	Algeria	Tizi Ouzou	19/04/2001	Riots/Protests	15	0	Riots were reported in numerous villages in Ka...
2	Protesters (Algeria)	NaN	Bejaia	Amizour	Algeria	Amizour	20/04/2001	Riots/Protests	60	0	Students protested in the Amizour area. At lea...
3	Rioters (Algeria)	Police Forces of Algeria (1999-)	Bejaia	Amizour	Algeria	Amizour	21/04/2001	Riots/Protests	15	0	Rioters threw molotov cocktails, rocks and bur...
4	Rioters (Algeria)	Police Forces of Algeria (1999-)	Tizi Ouzou	Beni-Douala	Algeria	Beni Douala	21/04/2001	Riots/Protests	15	0	Rioters threw molotov cocktails, rocks and bur...

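The INTERACTION column actually tells us quite a bit. Its two digits are the types of the two actors involved, with the lower code written first, so the first digit is not always ACTOR1 (in row 1 above, rioters clashing with police is coded 15). From the user guide: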

A numeric code indicating the interaction between types of ACTOR1 and ACTOR2. Coded as an interaction between actor types, and recorded as lowest joint number:

1 Government/Military/Police
2 Rebel group
3 Political Militia
4 Communal Militia
5 Rioters
6 Protestors
7 Civilians
8 Other (e.g. Regional groups such as AFICOM; or UN)

e.g. When the action is between a government and a rebel group, this will be coded as 12; when a political militia attacks civilians, it is coded as 37.
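To make the coding concrete, here is a minimal sketch of how a two-digit INTERACTION value could be split back into its two actor types. The helper name decode_interaction and the ACTOR_TYPES dictionary are made up for illustration and are not part of the dataset or the user guide. It assumes a digit of 0 means no second actor was recorded, as with code 60 for protesters acting alone.

```python
# Actor-type codes copied from the user guide excerpt above.
ACTOR_TYPES = {
    1: "Government/Military/Police",
    2: "Rebel group",
    3: "Political Militia",
    4: "Communal Militia",
    5: "Rioters",
    6: "Protestors",
    7: "Civilians",
    8: "Other",
}


def decode_interaction(code):
    """Split a two-digit INTERACTION code into its two actor types.

    Assumption: a digit of 0 means no actor was recorded in that slot.
    The pair is in code order, not necessarily (ACTOR1, ACTOR2) order.
    """
    first, second = divmod(int(code), 10)
    return (ACTOR_TYPES.get(first, "None recorded"),
            ACTOR_TYPES.get(second, "None recorded"))


print(decode_interaction(12))  # government vs. rebel group
print(decode_interaction(37))  # political militia vs. civilians
print(decode_interaction(60))  # protesters acting alone
```

Something like df['INTERACTION'].map(decode_interaction) would then attach readable labels to every row.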

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165808 entries, 0 to 165807
Data columns (total 11 columns):
ACTOR1         165808 non-null object
ACTOR2         122255 non-null object
ADMIN1         165808 non-null object
ADMIN2         165676 non-null object
COUNTRY        165808 non-null object
LOCATION       165805 non-null object
EVENT_DATE     165808 non-null object
EVENT_TYPE     165808 non-null object
INTERACTION    165808 non-null int64
FATALITIES     165808 non-null int64
NOTES          155581 non-null object
dtypes: int64(2), object(9)
memory usage: 13.9+ MB

Surprisingly, the above shows that we don’t have a lot of null values, only in places we might expect, such as NOTES or ACTOR2. I know I want to explore conflicts as they progress over time, so I’d better encode EVENT_DATE as a datetime object.

# The dates are written day/month/year (e.g. 18/04/2001), so parse them day-first
df['EVENT_DATE'] = pd.to_datetime(df['EVENT_DATE'], dayfirst=True)
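The sample rows above store dates as day/month/year (18/04/2001), which pandas can misread as month/day whenever the day is 12 or less. Below is a minimal sketch of explicit parsing, run on two made-up date strings rather than the real file. It assumes every row follows the %d/%m/%Y layout seen in the head of the table.

```python
import pandas as pd

# Two illustrative dates in the dataset's day/month/year layout.
sample = pd.Series(['18/04/2001', '03/04/2001'])

# An explicit format leaves no room for month-first guessing.
parsed = pd.to_datetime(sample, format='%d/%m/%Y')

print(parsed.dt.month.tolist())  # months of the two rows
print(parsed.dt.day.tolist())    # days of the two rows
```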

Exploratory Data Analysis

Off the bat here are some things I’m curious about:

In the last 5 years, which African country had the most conflict events?
Which had the most fatalities?
What type of events were those?

# dates of interest: everything after the start of 2012
doi = df['EVENT_DATE'] > '2012'
df[doi].COUNTRY.value_counts().head(10)

Somalia                         16916
Nigeria                          8933
South Africa                     8425
Sudan                            7849
Egypt                            7829
Democratic Republic of Congo     6368
Libya                            5752
South Sudan                      5083
Burundi                          4479
Tunisia                          3740
Name: COUNTRY, dtype: int64

doi = df['EVENT_DATE'] > '2012'
plt.figure(figsize=(8,6))
sns.countplot(x = 'COUNTRY',data = df[doi], order=df[doi].COUNTRY.value_counts().head(10).index,
             )
plt.xticks(rotation = 90)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)

png

In the last 5 years Somalia has had by far the most events, nearly double Nigeria’s count and more than double every other country’s. The list isn’t too surprising if you’ve been paying attention to the news. We see some North African countries (Libya, Tunisia, Egypt), so this data shows the aftereffects of the Arab Spring. Remember also that Gaddafi was overthrown in 2011.

Let’s see this list by fatalities.

plt.figure(figsize=(8,6))
sns.barplot(x='COUNTRY', y='FATALITIES', data = df[doi],
            estimator = sum,
            ci = None,
           order = df[doi].groupby('COUNTRY')['FATALITIES'].sum().sort_values(ascending = False).head(10).index)

plt.xticks(rotation = 90)
plt.ylabel('Total Sum of Fatalities')

<matplotlib.text.Text at 0x1fcdf569668>

png

Very, very interesting! Although Somalia had the highest number of incidents, Nigeria had by far the most fatalities. I wonder what type of conflicts occurred there? Let’s take a look!

dfdoi = df[doi]

nigeria = dfdoi['COUNTRY']=='Nigeria'
plt.figure(figsize = (12,6))
sns.countplot(x = 'INTERACTION', data = dfdoi[nigeria])

<matplotlib.axes._subplots.AxesSubplot at 0x1fcdf1b77f0>

png

Code 60 refers to protesters acting alone. That is actually a decent measure of unrest in a country, and it is entirely possible there were no fatalities during many of these protests. Let’s do a quick pandas query, summing fatalities for each INTERACTION type and sorting the totals.

dfdoi[nigeria].groupby('INTERACTION')['FATALITIES'].sum().sort_values(ascending = False)

INTERACTION
  9807
  8388
  7781
  5730
  3951
  1432
  1285
   681
   588
   428
   390
   339
   284
   238
   189
    69
    46
    44
    41
    32
    14
    10
     8
     8
     5
     5
     2
     2
     2
     1
     1
     0
     0
     0
     0
     0
     0
Name: FATALITIES, dtype: int64

AHA! This tells us much more. It turns out there were only 2 fatalities in all of those protests, while almost 10,000 people were killed in military vs. rebel conflicts.

Also take a look at the next 3 numbers. There’s a common element: the digit ‘7’, which unfortunately represents civilians. In other words, there was a vast amount of fatal violence directed towards civilians by military, rebel, and militia groups, an unfortunate reality that is all too common in places where multiple factions are fighting for a piece of the pie.

This was just a brief introduction to the dataset. I am definitely going to make this a series where I’ll gather insights, throw in some interactive plots, and definitely add maps to help visualize the conflicts that have ravaged one of the most beautiful continents on Earth.

Mosquitos and Heat Maps

Predicting the Geographical Spread of West Nile Virus in Chicago

Ravi 2018-03-10

This project was particularly fascinating to me for a multitude of reasons.

Having actually suffered and recovered from malaria, I was interested in how mosquito vectors could spread other diseases, and with my background in public health and medicine I wanted to know how West Nile spread in Chicago.

I got the inspiration from browsing Kaggle datasets, as we data scientists do, looking for interesting things to play around with. This is the Kaggle competition for those interested.

First things first: Heatmaps year by year for the training data

The training data came from the odd years from 2007 to 2013. The testing data set contained the even years, and I treated it as sacred.

2007

2009

2011

2013

And here is where they placed the traps:

traps

Description of Project and some Results

The task at hand was to predict where West Nile Virus would show up based on data collected from the various mosquito traps dispersed through the greater Chicago region. There were 8 species of mosquitoes collected, but the vast majority were some type of Culex pipiens/restuans. Here is the distribution of positive West Nile cases by species, counted every time data was collected from a mosquito trap.

species

Additionally, I had weather data available for the same time period in which West Nile was running rampant through Chicago. This was vitally important, since mosquito breeding season falls in the summer months; mosquitoes prefer humid, tropical climates, and they don’t like rain. Here is the weather data, with min and max temperature, plus wet bulb and dew point as humidity analogs.

weather

So on to the modeling! After cleaning, feature engineering, and merging the weather data with the training data collected from the traps, I decided to build a random forest classifier. However, we had one issue: some seriously imbalanced classes.

West Nile negative: 9955 cases
West Nile positive: 551 cases.

To combat this and prevent my model from just predicting zeros and calling it a day, I oversampled the minority class. Essentially, I artificially created more positive cases to reach a 1:1 ratio with the negative cases. But first I split my data, oversampled only one split, and kept the other split as a holdout set to test my model against.
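The order of operations matters here: split first, then oversample only the training split, so duplicated positives never leak into the holdout set. The sketch below shows one way to do that with plain NumPy. It uses random duplication on synthetic data with the class counts quoted above; the original resampling code is not shown in this post, so the technique, the 25% holdout size, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the class counts quoted above (not the real trap data).
y = np.array([0] * 9955 + [1] * 551)
X = rng.normal(size=(len(y), 3))

# 1) Split first, so the holdout set never sees duplicated rows.
idx = rng.permutation(len(y))
n_holdout = len(y) // 4  # assumed 25% holdout
holdout_idx, train_idx = idx[:n_holdout], idx[n_holdout:]

# 2) Oversample the positive class in the training split only.
pos = train_idx[y[train_idx] == 1]
neg = train_idx[y[train_idx] == 0]
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
balanced_idx = np.concatenate([neg, pos, extra])

X_train, y_train = X[balanced_idx], y[balanced_idx]
X_holdout, y_holdout = X[holdout_idx], y[holdout_idx]

print(np.bincount(y_train))    # class counts after oversampling
print(np.bincount(y_holdout))  # holdout keeps its natural imbalance
```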

Here are my results for the Random Forest Classifier:

Random Forest

I also tried out a neural network, with these results:

Neural Network

Battle of the Clusters

A Comparison of 3 Common Clustering Algorithms

Ravi 2018-03-08


I wanted to visually test three commonly used clustering algorithms on 7 data sets that are specifically designed to evaluate clustering algorithm effectiveness.

The three algorithms I’m going to use are:

K-means: k-means clustering

Agglomerative clustering: hierarchical clustering (bottom up)

DBSCAN: density-based clustering.

Clustering, I believe, is particularly suited to visual representation, and it’s very easy to gain an intuition for how each algorithm performs based on visuals.

For ease of use, the 7 datasets I generated have 2 dimensions, an x and a y coordinate. Remember that in unsupervised learning methods like clustering, there generally will not be “true labels.” I provided them here as colors simply as a convenience, to be able to compare against the algorithms’ output.

Here are the 7 datasets clustered by color. We are going to see how close the clustering algorithms can get to these images.

png

Flame

Which algorithm (visually) performs best?

flame

DBSCAN performed the best. The great thing about DBSCAN is that we can really tune its hyperparameters: epsilon (the maximum distance between two samples for them to count as neighbors) and the minimum number of samples needed in a neighborhood for a point to count as a core point.

Interestingly, K-means and the hierarchical algorithm performed similarly. The centers that K-means eventually converged on are ‘off’ compared to how we as humans visually see a flame.

Agg

Which algorithm (visually) performs best?

flame

You can really tune DBSCAN. Here I set epsilon (our distance between samples in a cluster) to 0.218 and the minimum number of samples to 13.
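For reference, here is roughly what those two settings look like in code. This is a sketch on made-up blobs, not the benchmark data set, and it assumes scikit-learn's DBSCAN, which this post does not name explicitly.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)

# Toy data: two tight blobs plus one far-away point (not one of the 7 data sets).
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=(2.0, 2.0), scale=0.05, size=(50, 2))
outlier = np.array([[10.0, 10.0]])
X = np.vstack([blob_a, blob_b, outlier])

# eps: neighborhood radius; min_samples: neighbors needed to be a core point.
labels = DBSCAN(eps=0.218, min_samples=13).fit_predict(X)

# Label -1 marks points DBSCAN treats as noise rather than cluster members.
print(sorted(set(labels.tolist())))
```

Shrinking eps or raising min_samples makes the algorithm stricter about what counts as dense, which is where most of the tuning effort goes.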

Out of the box, though, K-means and hierarchical clustering didn’t do a bad job. You can see from how they grouped their clusters that both rely primarily on distance metrics. K-means uses distances from a center point ‘X’, while agglomerative/hierarchical clustering finds the points that are closest together, creates a group, and then adds members by calculating the shortest distance from any member of the new group to a point not yet in the group. We set the number of groups to 7, so the algorithm stops when it has seven groups.
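The bottom-up idea described above can be sketched in a few lines of plain Python. This version uses the "shortest distance between any two members" rule (often called single linkage). The function single_linkage is a hypothetical helper written for illustration, not the library call used for the plots, and library implementations may default to a different merge rule.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest groups until only n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own group
    while len(clusters) > n_clusters:
        # Distance between groups = shortest distance between any two members.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
for group in single_linkage(points, n_clusters=3):
    print(group)
```

The brute-force search is far too slow for real data, but it shows why the stopping condition is simply the requested number of groups.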

Comp

Which algorithm (visually) performs best?

flame

DBSCAN performs best, but really, none of them are perfect. This demonstrates something we should be aware of when using unsupervised clustering algorithms: clusters with different distances between points, or different densities, will be seen as different groups. In this case, the red dots in the DBSCAN image are considered outliers, not belonging to any group.

Jain

flame

I found this one really fascinating. I could probably have fine-tuned DBSCAN even more, but regardless it found 3 groups instead of two, again highlighting the issue with clusters that have different distances between their points. In other words, if points in cluster A are about 1 apart and points in cluster B are about 2 apart, a single epsilon struggles to fit both.

I like how K-means and agglomerative/hierarchical clustering attempt this. You can clearly see K-means takes a mean-centered approach and hierarchical takes a bottom-up approach.

Pathbased

flame

DBSCAN for the win. There’s a general theme, isn’t there?

r15

My favorite data set of them all. It reminds me of playing paintball.

flame

K-means and agglomerative/hierarchical clustering perform perfectly right out of the box. DBSCAN does well here too. I’d say this is a three-way tie.

Spiral

The spiral is where we can really see DBSCAN shine and gain some intuition for how the algorithm works.

flame

Future considerations

I’d love to see how Affinity Propagation holds up on these data sets. Maybe in the future I’ll do a 4-way battle, Mario Kart style!

Iowa Liquor Sales

What do they drink?

Ravi 2018-02-24

Forecasting Liquor Sales in the State of Iowa

COMIC

Exploring the data

I performed some exploratory statistical analysis and made some plots, such as histograms of transaction totals, bottles sold, etc.

I had to clean and munge the data first, checking for missing values and NaNs. I try to stay away from code snippets in my blog posts, but I wanted to show one here to give the reader a better sense of the table’s columns and the datatypes they were encoded in.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df.info() 
# Looks like some nulls in County and County Number. Let's use City and/or Zip Code instead
# and see how much each city/zip code sold.
# Change all prices into floats.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2709552 entries, 695077 to 2709551
Data columns (total 24 columns):
Invoice/Item Number      object
Date                     datetime64[ns]
Store Number             int64
Store Name               object
Address                  object
City                     object
Zip Code                 object
Store Location           object
County Number            float64
County                   object
Category                 float64
Category Name            object
Vendor Number            int64
Vendor Name              object
Item Number              int64
Item Description         object
Pack                     int64
Bottle Volume (ml)       int64
State Bottle Cost        object
State Bottle Retail      object
Bottles Sold             int64
Sale (Dollars)           object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtypes: datetime64[ns](1), float64(4), int64(6), object(13)
memory usage: 516.8+ MB

Sales, Cost, and Retail are stored as non-numeric objects, so let’s get rid of the dollar sign and store them as floats.

# regex=False treats '$' as a literal dollar sign rather than the regex end-of-string anchor
df['Sale (Dollars)'] = df['Sale (Dollars)'].str.replace('$', '', regex=False).astype('float64')
df['State Bottle Cost'] = df['State Bottle Cost'].str.replace('$', '', regex=False).astype('float64')
df['State Bottle Retail'] = df['State Bottle Retail'].str.replace('$', '', regex=False).astype('float64')

When working with strings I tend to keep them all lowercase or all uppercase, so that if people have entered a city with or without the first letter capitalized, the computer won’t treat them as different cities.

df['City'] = df['City'].map(lambda x: x.upper())

I saw that some of the cities were misspelled. In the long run this probably won’t make a difference, but I renamed them anyway.

# Expand abbreviations and fix known misspellings
df['City'] = df['City'].map(lambda x: x.replace('MT','MOUNT') if x[0] == 'M' else x)
df['City'] = df['City'].map(lambda x: x.replace('ST ','ST. ') if x[0] == 'S' else x)
df['City'] = df['City'].map(lambda x: x.replace('OTTUWMA','OTTUMWA') if x == 'OTTUWMA' else x)
df['City'] = df['City'].map(lambda x: x.replace('LEMARS','LE MARS') if x == 'LEMARS' else x)
df['City'] = df['City'].map(lambda x: x.replace('LECLAIRE','LE CLAIRE') if x == 'LECLAIRE' else x)
df['City'] = df['City'].map(lambda x: x.replace('GUTTENBURG','GUTTENBERG') if x == 'GUTTENBURG' else x)

Let’s show the top 25 cities by sales.

In keeping with the xkcd comic above, let’s use xkcd style in our graphs!

Cities

Here are a few stores with the best sales for the year of 2015

Cities

Avg Retail Price per Bottle

Cities

Cost of Bottle per store

Cities

Cost and retail plotted on the same graph. I included some code to show how seaborn uses matplotlib functionality.

fig, ax = plt.subplots(figsize = (12,8))
#c1, c2, c3 = sns.color_palette("Set1", 3)
ax.set_xlim(xmin = 0, xmax = 40)
plt.xkcd()
sns.distplot(df.groupby('Store Number')['State Bottle Retail'].mean(),
             hist_kws={"histtype": "stepfilled", "color": "yellow"},
             norm_hist = False,
             kde = True,
             ax = ax,
             label = 'Retail Price')

sns.distplot(df.groupby('Store Number')['State Bottle Cost'].mean(),
             hist_kws={"histtype": "stepfilled", "color": "deepskyblue", 'alpha': .25},
             norm_hist = False,
             kde = True,
             ax = ax,
             label = 'Cost for Store')

plt.title('Retail Price and Cost of Bottles')
plt.ylabel('Density')
plt.xlabel('Each Stores Average Price')
plt.legend()

<matplotlib.legend.Legend at 0x2b2d40024e0>

Cities

fig, ax = plt.subplots(figsize = (18,12))
df[df['Category Name'] == 'CANADIAN WHISKIES']['Item Description'].value_counts().head(10).plot(kind = 'barh', color = 'gold', fontsize = 24)

<matplotlib.axes._subplots.AxesSubplot at 0x2b2d4133c88>

Cities

fig, ax = plt.subplots(figsize = (18,12))
df[df['Category Name'] == 'STRAIGHT BOURBON WHISKIES']['Item Description'].value_counts().head(10).plot(kind = 'barh', color = 'purple', fontsize = 24)

<matplotlib.axes._subplots.AxesSubplot at 0x2b2a37898d0>

Cities

EDA Findings summarized

The top 5 stores with the highest total sales are:

City	Store Number	Total Sales (Dollars)
DES MOINES	2633	12282646.26
DES MOINES	4829	11085530.53
IOWA CITY	2512	5206377.22
CEDAR RAPIDS	3385	4759187.79
WINDSOR HEIGHTS	3420	4018414.55

The average number of bottles sold per transaction, by store, was 10, though with a heavy positive skew.

The average retail price was $14.70 and the average store cost was $9.80, a margin of about $4.90 on each bottle sold.

80-proof vodka, Canadian whiskies, and straight bourbon whiskies were the most popular types of liquor.

Hawkeye Vodka, Black Velvet, and Jim Beam were the top sellers in those respective categories.

Mining and Refining the data

Now I’m ready to compute the variables I’ll use for my regression from the data. For example, I computed the total sales per store from Jan to March of 2015, mean price per bottle, etc.

I added my predictors piecemeal to a pandas DataFrame.

I also started looking for relationships within the data and amongst the variables to get a better sense of what I wanted to include in my final model

	Store Number	Jan to March Sales	Avg Bottles Sold	Avg Bottle Retail	Avg Bottle Cost	Avg Volume Sold (Liters)
0	2106	337166.53	19.514209	16.126917	10.744967	18.355608
1	2113	22351.86	4.667047	15.963609	10.632919	4.623090
2	2130	277764.46	18.523256	15.386269	10.251927	16.787416
3	2152	16805.11	4.024110	13.139489	8.742015	4.195431
4	2178	54411.42	7.677192	15.127811	10.068841	8.070549
5	2190	255939.81	8.285090	18.142489	12.086029	5.056179
6	2191	319020.69	13.863643	17.125039	11.411090	14.230320
7	2200	45340.33	4.001452	17.425620	11.606667	4.477271
8	2205	57849.23	6.048682	15.036204	10.013212	5.043834
9	2228	51031.04	5.333907	15.000341	9.989567	5.354428
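A table like the one above could be assembled along these lines. This is a sketch on a tiny made-up sample that borrows the column names from the df.info() output earlier; the actual feature-building code is not shown in this post, so the date filter and the aggregation choices are assumptions.

```python
import pandas as pd

# Tiny made-up sample using column names from the df.info() output.
sales = pd.DataFrame({
    'Store Number': [2106, 2106, 2113, 2113],
    'Date': pd.to_datetime(['2015-01-05', '2015-02-10', '2015-03-01', '2015-06-15']),
    'Sale (Dollars)': [100.0, 50.0, 20.0, 999.0],
    'Bottles Sold': [10, 4, 2, 60],
    'State Bottle Retail': [15.0, 17.0, 12.0, 18.0],
})

# Keep only January through March of 2015.
q1 = sales[(sales['Date'] >= '2015-01-01') & (sales['Date'] < '2015-04-01')]

# One row per store, one column per predictor.
X = q1.groupby('Store Number').agg(**{
    'Jan to March Sales': ('Sale (Dollars)', 'sum'),
    'Avg Bottles Sold': ('Bottles Sold', 'mean'),
    'Avg Bottle Retail': ('State Bottle Retail', 'mean'),
}).reset_index()

print(X)
```

The June row for store 2113 is dropped by the date filter, so it does not inflate that store's first-quarter total.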

Refine the data

fig, ax = plt.subplots(figsize = (18,12))
sns.set(font_scale=3)
sns.regplot(x = 'Jan to March Sales', y = 'Sale (Dollars)', data = X.merge(target, how = 'left', on = ('Store Number')))

Cities

Here’s a Heat Map demonstrating correlation of my predictors

Cities

#this takes a while to run, it just visualizes the above correlation matrix
sns.set()
sns.pairplot(X.merge(target, how = 'left', on = ('Store Number')))

Cities

Building the model

I used sklearn to build a regression model and evaluated its performance.

I standardized my values before I threw them into my model. This is almost always a good idea, and it is necessary when using regularization, because it prevents variables with larger numerical scales from overshadowing variables with smaller numerical scales.

I also cross-validated to find the best hyperparameters, in this case ‘C’, the regularization strength of the regularized regression model. I’m not going to put the code here; you can see it all on my github.
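A minimal sketch of that workflow on synthetic data follows. It assumes scikit-learn's StandardScaler and RidgeCV, where the regularization strength is called alpha; the feature names, the candidate alphas, and all the numbers are made up for illustration and are not taken from the original notebook.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic predictors on very different scales (dollars vs. bottle counts).
q1_sales = rng.uniform(10_000, 300_000, size=200)
avg_bottles = rng.uniform(2, 20, size=200)
X = np.column_stack([q1_sales, avg_bottles])
y = 4.0 * q1_sales + 500.0 * avg_bottles + rng.normal(0, 1_000, size=200)

# Standardize the features, then let RidgeCV pick alpha by 5-fold cross-validation.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

best_alpha = model.named_steps['ridgecv'].alpha_
print(best_alpha)         # the alpha chosen by cross-validation
print(model.score(X, y))  # R^2 on the data the model was fit to
```

Without the scaler, the penalty would fall much harder on the small-scale bottle-count coefficient than on the dollar-scale one.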

Results

I plotted the predictions of my models below

Cities

Summary

On average, yearly sales for Iowa liquor stores differed by 9,215 between 2015 and 2016; that’s an average increase of $9,215 per store in 2016. There was very little difference between my Ridge regression and linear regression models, most likely because January to March sales were so heavily correlated with total sales.