Python for Data Science Final Project

Machine Learning to Identify Fraud in the Enron Corpus

1. Context

In late 2001, Enron, an American energy company, filed for bankruptcy after one of the largest financial scandals in corporate history. After the company's collapse, over 600,000 emails generated by 158 Enron employees - now known as the Enron Corpus - were acquired by the Federal Energy Regulatory Commission during its investigation. The data was then uploaded online, and since then, a number of people and organizations have graciously prepared, cleaned and organized the dataset that is available to the public today (a few years later, financial data of top Enron executives were released following their trial).

2. Project

The aim of this project is to apply machine learning techniques to build a predictive model that identifies Enron employees that may have committed fraud based on their financial and email data.
The objective is to achieve precision and recall scores of at least 0.42.

Task 1: Select what features we'll use



Features of this dataset fall into three major types: financial features, email features, and the POI label, which is our target.
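As a sketch, the three feature groups can be laid out as plain Python lists (the feature names follow the dataset's conventions; the exact subset of financial features listed here is illustrative, not the full set):

```python
# POI label: the boolean target feature (person of interest or not).
target = "poi"

# Financial features (payments, fees and stock) -- illustrative subset.
financial_features = [
    "salary", "bonus", "total_payments", "total_stock_value",
    "exercised_stock_options", "restricted_stock",
]

# Email features describing message counts and POI interaction.
email_features = [
    "to_messages", "from_messages", "from_poi_to_this_person",
    "from_this_person_to_poi", "shared_receipt_with_poi",
]

# The conventional layout puts the target label first.
features_list = [target] + financial_features + email_features
```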


**Missing values for the financial features (payments, fees and stock) represent 0, not unknown quantities, as shown in [Enron's public data](https://github.com/louisedietrich/Enron-Financial-Dataset). We will therefore first replace all of these missing values with 0.**


**For the email features (to_messages, from_messages, from_poi_to_this_person, from_this_person_to_poi, shared_receipt_with_poi), missing values are genuinely unknown information. We don't want to drop those records, as we are already working on a rather small dataset and need data to build our models. Instead, we will replace the missing values with the average for the person's profile (i.e. the mean of each feature over POIs or non-POIs, as appropriate).**

Now that all NaN values have been replaced by 0, we will focus on the 5 variables describing each person's interaction with POIs:<br/>
to_messages, from_messages, from_poi_to_this_person, from_this_person_to_poi, shared_receipt_with_poi.

This email data is meaningful for predicting whether an employee is a POI, so for each of these features we will replace the previously missing values with the mean of the variable grouped by POI profile.
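The two imputation rules can be sketched in pandas as follows. The DataFrame here is a small toy stand-in for the real data; the column names match the dataset, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Enron data (values are invented).
df = pd.DataFrame({
    "poi": [True, False, False, True],
    "salary": [200000, np.nan, 150000, np.nan],
    "to_messages": [50, np.nan, 30, 70],
})

# Financial NaNs mean 0, not "unknown" -> replace with 0.
df["salary"] = df["salary"].fillna(0)

# Email NaNs are unknown -> impute with the mean of the person's
# POI group (mean over POIs for POIs, over non-POIs otherwise).
df["to_messages"] = df.groupby("poi")["to_messages"] \
                      .transform(lambda s: s.fillna(s.mean()))
```

`groupby(...).transform` keeps the original row order, so the group means land exactly where the NaNs were.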

Task 2: Remove outliers

The TOTAL row shown in the outlier plots is removed, as it is not relevant information and would distort our results.

Considering the visualization of our outliers, we try to identify people we could remove from our model based on their high salary and/or bonus who have not been identified as POIs.

From this list of people with very high salaries and/or bonuses who are not identified as POIs, we decided to remove four who held upper-level executive positions and are not representative of the average Enron employee:


As noted above, we will also remove the TOTAL row from our dataset.

Two other already-identified outliers are "THE TRAVEL AGENCY IN THE PARK", which is not a person, and "LOCKHART EUGENE E", who has no feature values.
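Removing these entries is a matter of popping the keys from the project's `{name: {feature: value}}` dictionary. The sketch below uses a toy `data_dict` with invented values; in the real project the dictionary is loaded from the pickle file:

```python
# Toy stand-in for the project's data_dict (values invented).
data_dict = {
    "TOTAL": {"salary": 26704229},
    "THE TRAVEL AGENCY IN THE PARK": {"other": 362096},
    "LOCKHART EUGENE E": {},
    "SKILLING JEFFREY K": {"salary": 1111258},
}

outliers = [
    "TOTAL",                          # spreadsheet aggregation row
    "THE TRAVEL AGENCY IN THE PARK",  # not a person
    "LOCKHART EUGENE E",              # no feature values at all
]
for key in outliers:
    data_dict.pop(key, None)  # default of None: no error if absent
```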

Task 3: Create new feature(s)

After computing our model with different numbers of features, we found that 4 variables gave the best results with our data. We used SelectKBest to get the list of the 4 highest-scoring features.
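The selection step can be sketched as below. Synthetic data stands in for the Enron features and labels, since the real project would pass its own `X` and `y`; `f_classif` is assumed as the score function, which is SelectKBest's default:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the Enron features/labels.
X, y = make_classification(n_samples=140, n_features=10,
                           n_informative=4, random_state=42)

# Keep the k=4 highest-scoring features by ANOVA F-value.
selector = SelectKBest(score_func=f_classif, k=4)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (140, 4)
```

`selector.scores_` then gives the per-feature scores used to rank and report the winning features.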

Task 4: Using an algorithm

We decided to pursue our classification with logistic regression, as it presented the best precision and recall metrics for our data.

Task 5: Tune your classifier to achieve better than .42 precision and recall using the testing script

Parameter tuning & optimization with GridSearchCV

In order to find the best parameters for our logistic regression, we used GridSearchCV, which also performs cross-validation with the StratifiedKFold cross-validator.

We chose to optimize the F1 score because it conveys the balance between precision and recall.
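A minimal sketch of this tuning setup is shown below. The parameter grid is illustrative, not the project's actual grid, and synthetic imbalanced data stands in for the Enron features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in (~87% negative class, like the
# POI/non-POI imbalance in the Enron data).
X, y = make_classification(n_samples=140, weights=[0.87],
                           random_state=42)

# Illustrative grid -- not the project's actual parameter values.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced"],
}

# StratifiedKFold preserves the class ratio in each fold, which
# matters with so few positive examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="f1", cv=cv)
grid.fit(X, y)
print(grid.best_params_)
```

`scoring="f1"` makes the search rank parameter combinations by the balance of precision and recall rather than raw accuracy.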

The code below was used for tuning the random forest. However, it should not be reused, as we ultimately chose logistic regression, which gave better results.

Task 6: Dump your classifier, dataset, and features_list so anyone can check your results
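The dump step can be sketched with `pickle`, assuming the Udacity starter code's filename conventions (`my_classifier.pkl`, `my_dataset.pkl`, `my_feature_list.pkl`); the helper name here is hypothetical:

```python
import pickle

def dump_results(clf, dataset, features_list):
    """Serialize the classifier, dataset and feature list so the
    testing script can reload and evaluate them."""
    with open("my_classifier.pkl", "wb") as f:
        pickle.dump(clf, f)
    with open("my_dataset.pkl", "wb") as f:
        pickle.dump(dataset, f)
    with open("my_feature_list.pkl", "wb") as f:
        pickle.dump(features_list, f)
```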

Conclusions

The feature selection with SelectKBest resulted in 4 features we would not have intuitively thought of:

'exercised_stock_options', 'salary', 'total_stock_value', and 'bonus'.

Indeed, the features that would come naturally to mind are ones such as the number of emails to or from a POI sent or received by an Enron employee.

We can consider that using this selection algorithm enabled us to achieve higher metrics, just as GridSearchCV did for tuning our logistic regression parameters.

Our final results have a precision score of 0.424 and a recall score of 0.675. This means that of all the people identified as POIs by our model, 42.4% were indeed POIs, and that our model identified 67.5% of the POIs present in the dataset.
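To make that interpretation concrete, here is a toy example of the two definitions (the labels below are invented, not the project's actual predictions):

```python
from sklearn.metrics import precision_score, recall_score

# Invented toy labels: 3 true POIs, 8 people total.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # 2 TP, 1 FN, 1 FP

# precision = TP / (TP + FP) = 2/3: share of flagged people who are POIs
# recall    = TP / (TP + FN) = 2/3: share of true POIs the model flags
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```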

These scores might appear quite low, but we had to build our model on a relatively small dataset. Moreover, the number of non-POIs in the dataset is far higher than the number of POIs (128 to 18), which makes it more difficult to create a reliable algorithm to detect POIs.

Another way to improve these scores would be to explore the email data we have not yet handled, which might reveal new interesting patterns and improve the accuracy of our predictions.