No country for XGB Tree.

May 11, 2017

Is altitude a good predictor of election results ?

It’s now a well-anchored habit in France : before each election, the media (re)discover the rise of the extreme right. Afterward, maps flood the internet, each claiming to provide the best analysis, such as:

Source : Renaud Epstein

Using some socio-economic features, we too can build a predictive model for the election outcomes.

A look at the original artwork French voters produced during the first round of the Presidential election :

Hommage à Jackson Pollock : 1st round best candidate per town

And focusing on the Front National :

Elementary, my dear Pearson.

Using data retrieved from the French national statistics office (INSEE), we focus on the two regions where the Front National got its best results :

  • The north (Hauts de France)
  • The south (Provence-Alpes-Côte d’Azur)

We can plot the most strongly correlated factors for each region :

In Provence, Le Pen scores high where :

  • average education level is low,
  • the number of private nurses is low,
  • service sector is sparse,
  • and, weirdly, altitude is low.

This last element deserves a short digression : André Siegfried, son of a Minister, was a French sociologist and geographer at the beginning of the 20th century. After losing a campaign, Siegfried investigated the possible relationship between geology and political orientation and came to the following conclusion : “granite votes right, limestone votes left”.


In the North, Le Pen scores high where :

  • average education level is low,
  • median income is low,
  • unemployment is high.
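
The correlation rankings above can be reproduced in spirit with a few lines of Python. This is only a minimal sketch: the feature names (`elevation_m`, `pct_degree`, `unemployment`) and every number in it are hypothetical toy values, not INSEE data, and the original analysis may have been done with other tools.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data, one value per town (NOT real INSEE figures).
towns = {
    "elevation_m": [5, 40, 120, 600, 1200],
    "pct_degree": [12, 15, 20, 28, 35],
    "unemployment": [14, 9, 13, 8, 11],
}
le_pen_share = [34, 31, 27, 20, 15]

ranking = rank_features(towns, le_pen_share)
for name, r in ranking:
    print(f"{name:>14}: r = {r:+.2f}")
```

Sorting by absolute value keeps strong negative correlations (such as education or altitude in Provence) at the top of the list alongside strong positive ones.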


Out of sheer curiosity, how about the relationship between Front National and the percentage of immigrants ?

It seems that the more immigrants there are in a town, the less prone its voters are to choose the Front National.


And how about population density ?

As pointed out by Hervé Le Bras, big cities offer more opportunities, thus lowering the Front National score. But interestingly enough, this is not the case in Provence.


Education seems to be the most prominent factor for Le Pen. How does it compare with Macron ?


And incomes ?

Northern France

As Hervé Le Bras already mentioned, poor people have a low turnout. In the north, Le Pen scores well where unemployment is high.


Finally, how about elevation ?

Provence-Alpes-Côte d'Azur

Indeed, compared with another region with a steep elevation gradient (Rhône-Alpes), Le Pen does score poorly at high altitude.

Another interesting observation concerns the area of Nice, where unemployment is low and the Le Pen vote is high. This supports the thesis of a two-headed Front National, one in the North, more social, and another in the South, closer to Poujade. And following the latest developments in the French political landscape (as of May 2017), it might augur a schism within the Front National.


Creating a model

Well, now we have a dataset with 35000 cities and, for each, 170 predictors (such as the ratio of campsites per inhabitant, the proportion of students or the local GDP).

Instead of going through the painful process of feature selection (as seen before, there is a lot of multicollinearity here) and regularization, we prefer to hop in the Land Cruiser of Machine Learning : XGB Tree.

Load the dataset, tune hyperparameters and off we go !

For the detailed implementation of the algorithm, see here
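
To give an intuition of what a boosted tree model does, here is a deliberately tiny sketch of gradient boosting with one-split trees (stumps) on squared error. It is not the XGBoost library and not the model used in this post: it leaves out regularization, second-order gradients and multi-class outputs, and it predicts a vote share rather than a winning candidate. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for f in range(len(xs[0])):
        values = sorted(set(row[f] for row in xs))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for row, r in zip(xs, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(xs, residuals) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((r - left_mean) ** 2 for r in left)
            error += sum((r - right_mean) ** 2 for r in right)
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("every feature is constant, nothing to split on")
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_value, right_value = stump
    return left_value if row[f] <= threshold else right_value


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, row)
            for p, row in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy data: [pct_degree, elevation_m] per town, target = Le Pen share.
xs = [[12, 5], [15, 40], [20, 120], [28, 600], [35, 1200], [18, 300]]
ys = [34, 31, 27, 20, 15, 25]

model = fit_boosting(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
model_mse = mean_squared_error(ys, [predict(model, row) for row in xs])
print(f"baseline MSE = {baseline_mse:.2f}, boosted MSE = {model_mse:.2f}")
```

Because each tree only corrects what the previous ones got wrong, correlated predictors tend to share the work instead of destabilising the fit, which is one reason to skip manual feature selection here.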


Confusion Matrix and Statistics

            Reference
Prediction   FILLON LE.PEN MACRON MÉLENCHON
  FILLON        995    358    205        48
  LE.PEN        454   4854    429       204
  MACRON        229    278   1319       286
  MÉLENCHON      46    111    197       529

Overall Statistics
               Accuracy : 0.7401          
                 95% CI : (0.7316, 0.7484)
    No Information Rate : 0.5313          
    P-Value [Acc > NIR] : < 2.2e-16       
                  Kappa : 0.5864          
 Mcnemar's Test P-Value : 6.207e-11       

Statistics by Class:

                     Class: FILLON Class: LE.PEN Class: MACRON Class: MÉLENCHON
Sensitivity                0.59605        0.8698        0.6212          0.53140
Specificity                0.93704        0.7925        0.9104          0.95703
Pos Pred Value             0.64918        0.8261        0.6397          0.58214
Neg Pred Value             0.92229        0.8430        0.9037          0.94772
Prevalence                 0.16350        0.5313        0.2039          0.10125
Detection Rate             0.09746        0.4621        0.1267          0.05381
Detection Prevalence       0.15012        0.5594        0.1980          0.09243
Balanced Accuracy          0.76655        0.8311        0.7658          0.74421

74% accuracy on the testing set is not so bad, given that the model does not take into account local specificities and history (although we do use the outcome of the previous presidential election). But what is striking is the high sensitivity for Le Pen compared with her challengers, which might be related to the class imbalance of the dataset : she finished first in half of the towns in the first round.
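
To make the link between imbalance and these figures concrete, the sketch below recomputes accuracy, the No Information Rate and the per-class sensitivity from the confusion matrix shown earlier. It assumes caret's usual layout (rows are predictions, columns are the reference classes). On the matrix as printed, accuracy comes out near 0.73 and the No Information Rate near 0.53; the small gaps with the reported statistics may mean that the matrix and the summary come from slightly different runs.

```python
# Confusion matrix as printed in this post. Assumption: rows are predictions
# and columns are the reference (true) classes, as in caret's usual layout.
labels = ["FILLON", "LE.PEN", "MACRON", "MÉLENCHON"]
matrix = [
    [995, 358, 205, 48],
    [454, 4854, 429, 204],
    [229, 278, 1319, 286],
    [46, 111, 197, 529],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Column sums: number of towns actually won by each candidate.
reference_totals = [sum(row[j] for row in matrix) for j in range(len(labels))]

# No Information Rate: accuracy of always predicting the most frequent class.
no_information_rate = max(reference_totals) / total

# Sensitivity (recall): correct predictions / true members of the class.
sensitivity = {
    label: matrix[j][j] / reference_totals[j] for j, label in enumerate(labels)
}

print(f"accuracy            = {accuracy:.4f}")
print(f"no information rate = {no_information_rate:.4f}")
for label, value in sensitivity.items():
    print(f"sensitivity {label:<10} = {value:.4f}")
```

A model that always answered "Le Pen" would already be right in about 53% of the towns, so the accuracy is best read against that floor rather than against zero.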

Regarding the feature weights, the 2012 results are overwhelmingly the best predictors. Then come :

  • the ratio of self-employed in the active population,
  • population density,
  • ratio of university degrees,
  • latitude,
  • elevation,
  • average income per household.

So, on the scale of France, elevation does indeed play a significant role in the vote. I’ll quote Hervé Le Bras : for communities away from the main communication axes and thus less prone to mobility, social interactions are stronger and rumours less likely to spread.

References :