No country for XGB Tree.

May 11, 2017

Is altitude a good predictor of election results?

It’s now a well-anchored habit in France: before each election, the media (re)discover the rise of the extreme right. Afterward, maps flood the internet to provide the best analysis, such as:

Using some socio-economic features, we too can build a predictive model of election outcomes.

A look at the original artwork French voters produced during the first round of the presidential election:

Hommage à Jackson Pollock: best first-round candidate per town

And focusing on the Front National:

Elementary, my dear Pearson.

Using data retrieved from the French national statistics office (INSEE), we focus on the two regions where the Front National got its best results:

The north (Hauts-de-France)
The south (Provence-Alpes-Côte d’Azur)

We can plot the most highly correlated factors for each region:

In Provence, Le Pen scores high where:

average education level is low,
the number of private nurses is low,
the service sector is sparse,
and, weirdly, altitude is low.

This last element deserves a short digression: André Siegfried, son of a minister, was a French sociologist and geographer at the beginning of the 20th century. After losing an election campaign, Siegfried investigated the possible relationship between geology and political orientation and came to the following conclusion: “granite votes right, limestone votes left”.

In the North, Le Pen scores high where:

average education level is low,
median income is low,
unemployment is high.
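As a rough illustration of how such a ranking can be produced, here is a minimal sketch, assuming the factors are ranked by Pearson correlation with the Front National score (as the section title suggests). The function names and the toy numbers are hypothetical and are not taken from the INSEE data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Sort features by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Synthetic toy values for five imaginary towns (illustration only).
toy_score = [30, 25, 20, 15, 10]
toy_features = {
    "education_level": [1, 2, 3, 4, 5],
    "elevation": [10, 50, 20, 80, 60],
}

ranked = rank_features(toy_features, toy_score)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

Sorting on the absolute value keeps strong negative relationships (such as education here) at the top of the list alongside strong positive ones.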

***

Out of sheer curiosity, how about the relationship between the Front National vote and the percentage of immigrants?

It seems that the more immigrants in town, the less prone voters are to choose the Front National.

***

And how about population density?

As pointed out by Hervé Le Bras, big cities offer more opportunities, thus lowering the Front National score. But interestingly enough, this is not the case in Provence.

***

Education seems to be the most prominent factor for Le Pen. How does it compare with Macron?

***

And incomes?

As Hervé Le Bras already mentioned, the poor have a low turnout. In the North, Le Pen scores well where unemployment is high.

***

Finally, how about elevation?

Indeed, looking at another region with a steep elevation gradient (Rhône-Alpes), Le Pen does score poorly at high altitude.

Another interesting observation concerns the area of Nice, where unemployment is low and Le Pen scores high. This supports the thesis of a two-headed Front National: one in the North, more social, and another in the South, closer to Poujade. And following the latest developments in the French political landscape (as of May 2017), it might augur a schism within the Front National.

***

Creating a model

Well, now we have a dataset with 35,000 cities and, for each, 170 predictors (such as the ratio of campsites per inhabitant, the proportion of students or the local GDP).

Instead of going through the painful process of feature selection (as seen before, there is a lot of multicollinearity here) and regularization, we prefer to hop in the Land Cruiser of machine learning: XGB Tree.

Load the dataset, tune the hyperparameters and off we go!

For the detailed implementation of the algorithm, see here
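To give a feel for what happens under the hood, here is a minimal sketch of gradient boosting with one-split trees ("stumps") on a single feature. It is a simplified stand-in and not the XGBoost algorithm itself, which adds second-order gradients, regularization and multi-class objectives. The data are synthetic, and the names `fit_stump`, `boost` and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (left if x <= threshold else right)
                      for threshold, left, right in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic numbers for illustration only, not INSEE or election data.
toy_elevation = [10, 50, 120, 300, 600, 900, 1400]
toy_score = [34.0, 33.0, 30.0, 27.0, 22.0, 18.0, 15.0]

model = boost(toy_elevation, toy_score)
mean_score = sum(toy_score) / len(toy_score)
baseline_mse = mse(toy_score, [mean_score] * len(toy_score))
boosted_mse = mse(toy_score, [predict(model, x) for x in toy_elevation])
print(f"baseline MSE: {baseline_mse:.2f}, boosted MSE: {boosted_mse:.2f}")
```

The number of rounds and the learning rate are the kind of hyperparameters that get tuned: more rounds or a larger rate fit the training data more closely, at the risk of overfitting.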

***

Confusion Matrix and Statistics

prediction/real   FILLON   LE.PEN   MACRON   MÉLENCHON
FILLON               995      358      205          48
LE.PEN               454     4854      429         204
MACRON               229      278     1319         286
MÉLENCHON             46      111      197         529

Overall Statistics
                                          
               Accuracy : 0.7401          
                 95% CI : (0.7316, 0.7484)
    No Information Rate : 0.5313          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5864          
 Mcnemar's Test P-Value : 6.207e-11       

Statistics by Class:

                     Class: FILLON Class: LE.PEN Class: MACRON Class: MÉLENCHON
Sensitivity                0.59605        0.8698        0.6212          0.53140
Specificity                0.93704        0.7925        0.9104          0.95703
Pos Pred Value             0.64918        0.8261        0.6397          0.58214
Neg Pred Value             0.92229        0.8430        0.9037          0.94772
Prevalence                 0.16350        0.5313        0.2039          0.10125
Detection Rate             0.09746        0.4621        0.1267          0.05381
Detection Prevalence       0.15012        0.5594        0.1980          0.09243
Balanced Accuracy          0.76655        0.8311        0.7658          0.74421

74% accuracy on the testing set is not so bad, given that the model does not take into account local specificities and history (although we do use the outcome of the previous presidential election). But what is striking is the high sensitivity for Le Pen compared with her challengers, which might be related to the unbalanced nature of the dataset: she finished first in half of the towns in the first round.
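For readers who want to see where these figures come from, here is a minimal sketch of how accuracy, the no-information rate, sensitivity and specificity can be derived from a confusion matrix. It assumes rows are predictions and columns are real outcomes, as in the table header; the function names are hypothetical.

```python
LABELS = ["FILLON", "LE.PEN", "MACRON", "MÉLENCHON"]

# Rows are predictions, columns are real outcomes, copied from the table as printed.
MATRIX = [
    [995, 358, 205, 48],
    [454, 4854, 429, 204],
    [229, 278, 1319, 286],
    [46, 111, 197, 529],
]


def total(matrix):
    return sum(sum(row) for row in matrix)


def accuracy(matrix):
    """Share of towns on the diagonal (prediction equals real outcome)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / total(matrix)


def no_information_rate(matrix):
    """Accuracy obtained by always predicting the most frequent real class."""
    column_sums = [sum(row[k] for row in matrix) for k in range(len(matrix))]
    return max(column_sums) / total(matrix)


def sensitivity(matrix, k):
    """True positives for class k divided by all towns really in class k."""
    actual = sum(row[k] for row in matrix)
    return matrix[k][k] / actual


def specificity(matrix, k):
    """True negatives for class k divided by all towns not really in class k."""
    actual = sum(row[k] for row in matrix)
    predicted = sum(matrix[k])
    negatives = total(matrix) - actual
    false_positives = predicted - matrix[k][k]
    return (negatives - false_positives) / negatives


print(f"accuracy: {accuracy(MATRIX):.4f}")
print(f"no-information rate: {no_information_rate(MATRIX):.4f}")
for k, label in enumerate(LABELS):
    print(f"{label}: sensitivity {sensitivity(MATRIX, k):.4f}, "
          f"specificity {specificity(MATRIX, k):.4f}")
```

Applied to the table as printed, this gives an accuracy of about 0.73 and a no-information rate of about 0.531, close to, though not exactly equal to, the reported figures. The no-information rate is the useful yardstick here: a model that always answered "Le Pen" would already be right in a little over half of the towns.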

Regarding feature weights, the 2012 results are overwhelmingly the best predictors. Then come:

the ratio of self-employed in the active population,
population density,
ratio of university degrees,
latitude,
elevation,
average income per household.

So, on the scale of France, elevation does indeed play a significant role in the vote. I’ll quote Hervé Le Bras: for communities away from the main communication axes and thus less prone to mobility, social interactions are stronger and rumours less likely to spread.

CL
