7 Simple TRICEPS That Will Kill Your Data Science Project
By Brian Dorricott, Head of Data Science, BoomData
Data Science can be embarrassing…
Heard the one about IBM’s “Watson for Oncology”, cancelled after $62 million had been spent because of unsafe treatment recommendations? Or Apple’s Face ID being defeated by a 3D mask? Or Amazon’s facial recognition software falsely matching 28 members of the US Congress with criminal mugshots? Frightening? Or predictable? Applying the principle behind each letter of TRICEPS would have reduced the embarrassment and loss of face for each of these data science projects. Let’s look in detail at each of the seven areas that businesses should consider when embarking on an advanced analytics initiative:
Technology is not the goal. Today there are many advanced analytics tools and techniques available (TensorFlow, PyTorch, NLP, CNNs, etc.), each with multiple parameters. There are also many “magic” Artificial Intelligence (AI) systems that make analysis “easy” by applying hundreds of techniques to the data (e.g. Alteryx, RapidMiner, DataRobot, Feature Labs, etc.). Simply plugging the data in, letting the system figure out the answer and writing the report is a recipe for disaster. Without knowing how the chosen solution works, or what its biases and limitations are, major errors can occur. Further, how can the results be explained or substantiated? Start with simple models such as linear regression or decision trees, since these will often deliver the required results and are far easier to explain.
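To make the “start simple” advice concrete, here is a minimal sketch of a baseline linear regression written by hand, with no framework involved. The spend and sales figures are invented purely for illustration, and `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: advertising spend (x) against weekly sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(spend, sales)
print(f"sales ~ {slope:.2f} * spend + {intercept:.2f}")
```

The whole model is two numbers that can be read, questioned and explained to a stakeholder. If a more complex technique cannot clearly beat this kind of baseline, the extra complexity is probably not earning its keep.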
Removing outliers. Outliers, or out-of-band data, are measurements that appear to be a long way from all the other data that has been collected. In “In the Plex: How Google Thinks, Works, and Shapes Our Lives”, Reese explains how this error cost Google’s ISP millions. Google needed to transmit huge quantities of data from the west coast to the east coast of the USA, so it rented a fibre connection. Full usage of the fibre would have cost $250,000 per month, but Google exploited a loophole in the billing process: the ISP removed all “outlier” bandwidth measurements and charged for the remainder. So Google transferred all its data within 24 hours each month, which meant that, once the outliers were removed, it had apparently used no bandwidth. The ISP charged them zero! The lesson: find out why the outliers exist before removing or ignoring them.
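The loophole is easy to reproduce. The sketch below assumes a billing rule that discards the top 5% of hourly bandwidth samples and charges on the highest remaining one. That rule, the sample rates and the `billable_rate` helper are all assumptions made for illustration; the exact rule the ISP used is not given here.

```python
def billable_rate(samples_mbps, drop_fraction=0.05):
    """Discard the highest `drop_fraction` of samples as "outliers" and
    return the largest remaining sample as the billable rate."""
    ordered = sorted(samples_mbps)
    n_drop = int(len(ordered) * drop_fraction)
    kept = ordered[: len(ordered) - n_drop]
    return kept[-1]


HOURS_IN_MONTH = 30 * 24  # 720 hourly samples

# Steady customer: 100 Mbps around the clock.
steady = [100.0] * HOURS_IN_MONTH

# Bursty customer: everything sent in 24 hours at full speed, idle otherwise.
bursty = [10_000.0] * 24 + [0.0] * (HOURS_IN_MONTH - 24)

print("steady billable rate:", billable_rate(steady))
print("bursty billable rate:", billable_rate(bursty))
print("bursty moved more data:", sum(bursty) > sum(steady))
```

In this toy setup the bursty customer moves more data than the steady one yet has a billable rate of zero, because every one of its busy hours falls inside the discarded “outliers”.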
Ignoring the collectors. So, you have a pile of data to analyse. Do you know who collected it and for what purpose? Was it compiled ad hoc or systematically? Is any of it missing? Did the units of measurement change? Did different people collect it over time? An example: consider an IoT sensor out in the field that collects humidity data. Having only 2% of measurements missing due to radio transmission errors doesn’t seem to be a problem until you realise that transmission only fails when it rains. Without understanding these attributes of the data, the analysis can discover erroneous correlations and make incorrect predictions that reflect the collection methodology rather than the thing being measured.
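A quick way to catch the humidity sensor problem is to break the missing rate down by the conditions under which the data was collected, rather than looking only at the overall figure. The records below are invented to match the 2% example, and `missing_rate` is a hypothetical helper.

```python
# Invented hourly records: (humidity_percent, raining).
# A humidity of None means the reading was lost in transmission.
records = (
    [(55.0, False)] * 90   # dry hours, all received
    + [(95.0, True)] * 8   # rainy hours that were received
    + [(None, True)] * 2   # rainy hours that were lost
)


def missing_rate(rows):
    """Fraction of rows whose humidity value was never received."""
    return sum(1 for value, _ in rows if value is None) / len(rows)


overall = missing_rate(records)
dry = missing_rate([r for r in records if not r[1]])
rainy = missing_rate([r for r in records if r[1]])

print(f"overall missing:      {overall:.0%}")
print(f"missing when dry:     {dry:.0%}")
print(f"missing when raining: {rainy:.0%}")
```

An overall rate of 2% looks harmless. Split by weather, nothing is missing on dry hours and a fifth of the rainy hours are gone, which is exactly when humidity is highest.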
Confusing Correlation and Causation. Correlation is a measure of the relationship or connection between two or more things. Causation is where one event contributes to the production of another. Just because two things are highly correlated does not mean one caused the other. The problem arises because it is extremely easy to find correlations between measures: simply do enough analysis and you’ll find some correlations at random. For example, “Spurious correlations: Margarine linked to divorce?” describes a 99% correlation between margarine consumption and divorce. Does that mean there is a link? Not unless there is a plausible mechanism connecting the two.
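To see how easily chance correlations appear, the sketch below compares one series of pure random noise against 200 other series of pure random noise and reports the strongest match. The series lengths, counts and seed are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
N_YEARS = 10            # ten yearly observations per series
N_CANDIDATES = 200      # unrelated measures to compare against

# Every series is independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(N_YEARS)]
candidates = [
    [rng.gauss(0, 1) for _ in range(N_YEARS)] for _ in range(N_CANDIDATES)
]

best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest correlation found by chance: {best:.2f}")
```

With only ten data points per series, trawling through a couple of hundred unrelated measures will typically turn up at least one that looks strongly correlated, even though every series here is noise by construction.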
Evading recalibration or ignoring drift. Shops are required to recalibrate their scales every year, and you would not want it any other way. The same applies in data science, and it matters even more: as well as the underlying data changing because of the model’s own success, external variables (e.g. demographics, preferences, new options) may change too. In data science, this phenomenon is called drift. In the simplest case, imagine a model that predicts which books people are going to read next. If it is not recalibrated, it will miss the most recently published books and market trends, so its predictions will get worse over time. To avoid this problem, set a recalibration schedule to ensure the model captures changes in the underlying data.
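Alongside a fixed schedule, a simple check can flag when recalibration is overdue. The sketch below measures how far the average of a recent batch has moved from the training data, in units of the training data's standard deviation. The feature (publication year of books bought), the numbers and the threshold of 2.0 are all assumptions for illustration, and real drift monitoring usually looks at more than one statistic.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has shifted."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)


# Invented feature: publication year of the books customers bought.
training_years = [2015, 2016, 2017, 2018, 2019]  # data the model was built on
recent_years = [2021, 2022, 2023, 2024, 2025]    # what customers buy now

DRIFT_THRESHOLD = 2.0  # arbitrary cut-off chosen for this sketch

score = drift_score(training_years, recent_years)
needs_recalibration = score > DRIFT_THRESHOLD

print(f"drift score: {score:.2f}")
print("recalibrate now" if needs_recalibration else "model still in range")
```

A check like this is cheap enough to run on every new batch of data, so the model gets recalibrated when the world changes rather than only when the calendar says so.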
Poor model validation. How do you know that the model you have built is working correctly? You validate it. Sounds simple? Considerable thought must be given to the data used to validate a model, as Amazon discovered with its new recruiting engine. It turned out that the majority of the training and validation data came from men, so the AI algorithm ended up being biased against women. The AI project was ultimately abandoned. To improve model validation, consider the issue at the beginning of the project: does your data represent the real world, or does it have in-built bias that your predictions could exacerbate?
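One practical check is to report validation results per group instead of as a single headline number. The sketch below is not how Amazon evaluated its system; the groups, labels and counts are invented, and `accuracy_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_group(rows):
    """rows: iterable of (group, actual, predicted).
    Returns {group: (row_count, accuracy)}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, actual, predicted in rows:
        totals[group] += 1
        correct[group] += int(actual == predicted)
    return {g: (totals[g], correct[g] / totals[g]) for g in totals}


# Invented validation set: 80 rows for men, 20 for women.
validation = (
    [("men", 1, 1)] * 72      # men, predicted correctly
    + [("men", 1, 0)] * 8     # men, predicted wrongly
    + [("women", 1, 1)] * 10  # women, predicted correctly
    + [("women", 1, 0)] * 10  # women, predicted wrongly
)

overall = sum(actual == pred for _, actual, pred in validation) / len(validation)
report = accuracy_by_group(validation)

print(f"overall accuracy: {overall:.0%}")
for group, (count, accuracy) in report.items():
    print(f"{group}: {count} rows, accuracy {accuracy:.0%}")
```

The overall figure of 82% hides two separate problems that the breakdown exposes: one group is badly under-represented, and the model performs far worse on it.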
Specification. Getting the specification right throughout any implementation project is key to its success. This extends from the start, where it is critical to understand the business needs and gain the support of the key stakeholders, through designing the data flows and data storage, to processing and displaying the desired results. Each element of the process from conception to delivery is worth an article in its own right! Avoid this error by thinking deeply about the business question you are looking to answer and how you will know you have been successful.
Next time you are designing an advanced analytics or AI project, don’t forget to consider each of the above TRICEPS to avoid wasting time and money. Forewarned is forearmed!