Codete Blog

Garbage in, Garbage out – Preparing Your Data Set for Machine Learning

10/06/2021

9 min read

Kacper Łukawski

Machine Learning processes rely on data, and so does the GIGO rule, which matters in the development of virtually any project. Today we will discuss the importance of well-organized datasets and their impact on the effectiveness of Machine Learning models.

As a refresher, let’s start with a brief review of the Garbage in, Garbage out (GIGO) concept.

  1. What is GIGO?
  2. Garbage In, Garbage Out meaning for business
  3. What is the most important GIGO principle?

What is GIGO?

The concept of Garbage In Garbage Out (GIGO) states that the output of an algorithm is only as good as the quality of the input it receives. Think of how we raise our children. We can't expect them to know much if they don't have access to appropriate educational materials. The same rule applies to Machine Learning models: If we don't provide quality data (in terms of validity or accuracy), our models won't work in the real world.

Basically, feeding the model bad data leads to useless results that cannot be trusted. If you ignore the GIGO principle, it is impossible to design a solution that works well. You're putting in unfiltered garbage, so don't expect anything better on the output. It's usually a waste of people's time. And money.

That's the lesson every data worker needs to understand to avoid erroneous, incomplete, or incoherent results. When it comes to using algorithms in the real world, this is the foundation. And of course, the GIGO principle is critical in any machine learning effort. We cannot increase accuracy if we have lost control over the quality and consistency of the data that is collected, stored, and analyzed.

The garbage in, garbage out rule is also critical to virtually any project development process. It's often applied in the early stages to standardize input so that multiple teams can work in parallel without disrupting each other's workflow. Personally, I can't imagine working without it.

Garbage In, Garbage Out meaning for business

When it comes to managing operations related to business growth, GIGO is an extremely important rule to follow, because any improvement or change in the business should be supported by analysis based on verified data sets. If the input is flawed, a decision built on it may turn out to be detrimental to the company.

To avoid this, you should pay close attention to how your Machine Learning models are built and trained. Let's say your client owns an online store and needs a model that categorizes products by understanding their metadata. As a result of training, it already recognizes that:

  • 65”, 55”, Sony, QLED, 4K, Smart TV, Crystal Display, Quantum HDR - are mapped as “TVs”
  • Stainless steel, built-in grill, defrost, control lock, programmable menu - are mapped as “Microwave oven”

Following the standard approach, incoming queries are associated with one of the above categories so that the customer can easily find the desired product. Now, let's imagine that our perfectly trained ML algorithm is being tested on a completely new database, e.g. by handling search queries in a clothing store.

When queries unknown to the model come in, users can be sent all over the website and shown random results, because both the input and the output are now garbage from the model's point of view. That could harm sales. And scare off even the most loyal customers.

The example above shows that an ML model is only as good as the "material" we feed it. It will only recognize the kinds of queries and examples you provide it with. Consequently, any deviation from the training scenario (even if the model works seamlessly in tests) will be treated as garbage, resulting in incomprehensible output.
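One way to soften this failure mode is to let the model admit that it does not know. The sketch below is only an illustration and not a real trained model: the keyword sets, the `categorize` function, and the `min_hits` threshold are all hypothetical, loosely based on the store example above.

```python
# Hypothetical keyword vocabulary, standing in for what a trained model "knows".
CATEGORY_KEYWORDS = {
    "TVs": {"sony", "qled", "4k", "smart", "tv", "hdr"},
    "Microwave oven": {"stainless", "steel", "grill", "defrost", "microwave"},
}


def categorize(query, min_hits=1):
    """Return the best-matching category, or "unknown" for unfamiliar input."""
    tokens = set(query.lower().split())
    # Score each category by the number of known keywords found in the query.
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Refuse to guess when nothing in the query is familiar.
    if scores[best] < min_hits:
        return "unknown"
    return best


tv_result = categorize("Sony 4K smart TV")
clothing_result = categorize("red cotton dress")
```

Here the clothing query ends up as "unknown" instead of being forced into one of the electronics categories, which gives the application a chance to react sensibly.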

This raises the question, "What can we do to overcome GIGO barriers?"

WRONG DATA

Before we get to that, let's talk about the importance of data. The performance of Machine Learning models is determined by training data. Bad data leads to bad results. Worse, it flows through ML systems and constantly feeds incorrect information into the models.

But what is behind the term bad (or noisy) data?

  • Data that is biased. When biases enter the data used to train machine learning models, data integrity is compromised and predictions become inaccurate. It could be something as simple as returning male names when searching for a female contact.
  • Data that is incorrect. It is best to clean the data before training a predictive model in an ML project. However, cleansing does not always identify or correct all errors, so the data can still be compromised. And even the smallest error could be fatal.
  • Data that is missing. Predictions made by machine learning are difficult to achieve with missing or incomplete data (see the example above).
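Two of these problems, missing values and a skewed label distribution, can be spotted with a very small report. This is a minimal sketch using only the standard library; the `quality_report` helper and the sample rows are made up for illustration.

```python
from collections import Counter


def quality_report(rows, label_key, required_keys):
    """Count missing required fields and compute the share of each label."""
    missing = Counter()
    for row in rows:
        for key in required_keys:
            if row.get(key) in (None, ""):
                missing[key] += 1
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) not in (None, "")
    )
    total = sum(labels.values())
    shares = {label: count / total for label, count in labels.items()} if total else {}
    return {"missing": dict(missing), "label_share": shares}


# Made-up sample rows: one missing price and a heavily dominant category.
rows = [
    {"name": "Sony QLED", "price": 999.0, "category": "TVs"},
    {"name": "Smart TV 55", "price": None, "category": "TVs"},
    {"name": "Crystal Display", "price": 799.0, "category": "TVs"},
    {"name": "Grill oven", "price": 120.0, "category": "Microwave oven"},
]
report = quality_report(rows, "category", ["name", "price", "category"])
```

A label share that is heavily tilted toward one class is not proof of bias, but it is a good reason to ask how the data was collected.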

What is the most important GIGO principle?

It is quite difficult to develop a foolproof set of best practices because GIGO should not be considered a technique. It is essentially a rule of thumb, a first step in the process of developing an ML model. GIGO cannot be avoided. Your work won't succeed if you don't put it into practice.

While you are preparing your dataset for machine learning, it is always good to do the following:

1. Use EDA as the initial phase of the project and work with experts in the field

EDA (exploratory data analysis) is used to thoroughly explore the purpose of the model by describing, representing, and analyzing the data obtained. Discuss the meaning of the data with other departments in your company to gain a better understanding of how the users of your model behave.
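A first EDA pass can be as simple as a per-column summary. The sketch below assumes a numeric column given as a plain list with `None` for missing entries; the `summarize` helper and the price values are hypothetical.

```python
import statistics


def summarize(values):
    """Describe a numeric column, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": statistics.median(present),
        "mean": statistics.fmean(present),
        "max": max(present),
    }


# Made-up product prices with one missing value and one suspicious entry.
prices = [999.0, 799.0, None, 120.0, 15000.0]
summary = summarize(prices)
```

A mean far above the median, as in this sample, hints at a skewed column or a data entry error, which is exactly the kind of finding worth discussing with domain experts.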

Spend some time predicting alternative input data variations (both hypothetical and quite possible) that your model would encounter in the real world. Gathering multiple opinions will help you detail the requirements of the model and identify possible problems, such as missing queries, lack of precision, or flawed logic. Your model must always reflect the reality of the users.

It's also worth documenting and detailing the processes for all implementations in one place. In corporate communications, for example, it's about describing all the quality metrics and translating them into the terms of a specific domain. For example, what does 94 percent accuracy mean? And what counts as accurate at all? What impact does that have on our solution? When is our system most flawed?
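The question about 94 percent accuracy can be made concrete with a small worked example. The labels below are invented: 6 of 100 items are "faulty", and a hypothetical model simply predicts "ok" every time.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Define the ratio as 0.0 when there is nothing to divide by.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented, imbalanced data: only 6 of 100 items belong to the positive class.
y_true = ["faulty"] * 6 + ["ok"] * 94
y_pred = ["ok"] * 100  # a "model" that never detects a faulty item
metrics = binary_metrics(y_true, y_pred, positive="faulty")
```

The accuracy is 0.94 while the recall for "faulty" is 0.0, so the headline number says nothing about the cases the business probably cares about most.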

2. Focus on data cleansing and processing

When you begin to engage with a dataset, it is critical to thoroughly examine the data it contains. It's fair to say that poorly maintained data integrity leads to many pitfalls, which in turn leads to schedule delays. It's like trying to drive a car without putting gas in it - with an empty tank, you can turn the steering wheel and push the car down the hill occasionally, but it won't move forward.

Only after you have examined the data's quality and potential for improvement can you begin modeling. It all starts with good management of the dataset: if the data is meaningful, the process goes quickly, but if we are working in a chaotic environment, we will spend most of our time cleaning it up.

To achieve this, you should always double-check your dataset and make sure that:

  • It is fully normalized: each variable holds only its own values, and each observation (the set of variable values describing one unit) is kept together. If you are dealing with larger variables, specify whether they are fixed. Also, keep checking whether the values of the data points you are collecting have changed.
  • There are no embedded special characters, tabs, additional spaces, or line breaks in any of the text inputs. Additionally, ensure that all blanks and missing values are filled or eliminated. Consider removing outliers and noisy data. Check the data format.
  • You have an intuitive understanding of what the data indicates and what the absence of certain information may mean. Sometimes you need to look at the data collection process itself to make sure the collected set isn't skewed, for instance when surveys were collected only from dissatisfied customers.
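Some of the checks in this list are easy to automate. The following is a minimal sketch, not a complete cleaning pipeline: `clean_text` and `drop_outliers` are hypothetical helpers, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import re
import statistics


def clean_text(value):
    """Normalize whitespace and strip non-printable characters."""
    # Collapse tabs, line breaks, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", value)
    # Drop remaining non-printable control characters.
    text = "".join(ch for ch in text if ch.isprintable())
    return text.strip()


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


cleaned = clean_text("  Sony\tQLED \n 4K\x00 ")
kept = drop_outliers([100, 110, 120, 130, 140, 15000])
```

Whether an outlier should be removed or kept is a domain decision; the code only makes the candidates visible.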

Unfortunately, with large amounts of data, the data cleaning process can take a long time. However, if you don't ensure that your data is clean, valuable, and easy to use, low-quality results can cripple your machine learning attempts.

3. Focus on testing

In the real world, clean data doesn't always lead to the desired results. Even when the training data is properly structured, a model commonly struggles with poorly formed input. People misspell words. They change the order of words. They use synonyms. They search for colors. They click randomly on anything.

Remember, some of those inputs could be considered garbage by models. And according to the GIGO rule, they respond with even less useful output. Consequently, you can't rely on the organized scenario of "A leads to B (only)" when testing. Because that's how you produce a bunch of nonsensical results.

A model is only as good as the examples it has been trained with. Teach your ML model how to handle non-standard queries by training it under different circumstances based on real user behavior. And go beyond that by preparing it for (still) imaginary ones. Finally, models don't learn to solve problems in and of themselves; rather, they learn to minimize a loss on training datasets, thereby inadvertently yielding generalizable values.
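Messy input can be simulated when building test cases. The sketch below generates a few noisy variants of a clean query; the helper names are hypothetical, and real user logs remain a better source of test data whenever they are available.

```python
import random


def swap_typo(word, rng):
    """Swap two adjacent characters to imitate a typing mistake."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def noisy_variants(query, rng):
    """Return the query with shuffled words, with typos, and in upper case."""
    words = query.split()
    shuffled = words[:]
    rng.shuffle(shuffled)
    with_typos = [swap_typo(w, rng) for w in words]
    return [" ".join(shuffled), " ".join(with_typos), query.upper()]


# A seeded generator keeps the generated test cases reproducible.
variants = noisy_variants("sony smart tv", random.Random(0))
```

Each variant can then be fed to the model alongside the clean query to check that the predicted category stays the same.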

4. Avoid data leakage

Data leakage in ML means training the model with data that will not be available later, when the model is used to recognize new examples. Make the effort to ensure that best practices for data entry are followed.
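A common source of leakage is computing preprocessing statistics on the whole dataset before splitting it. The sketch below, with made-up numbers and hypothetical helper names, fits a simple standardizer on the training part only and then reuses it for the held-out part.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std or 1.0  # avoid division by zero for constant columns


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]

scaler = fit_scaler(train)             # correct: the test data stays unseen
leaky_scaler = fit_scaler(values)      # leaky: the test value shifts the mean
test_scaled = transform(test, scaler)
```

In the leaky variant the mean moves from 2.5 to 22.0 because of a value the model should never have seen during training.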

Review your data sets regularly to correct any inaccuracies, and modify data entry to ensure that it does not produce incorrect results. It's also beneficial to version the data you're working on, take care of the techniques for capturing it, and implement what's called a data dictionary to ensure everyone on the team has a consistent understanding of what certain information means.
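A data dictionary can also be kept in a machine-readable form, so that records are checked against it automatically. The fields, descriptions, and the `validate` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical data dictionary: one entry per field, shared by the whole team.
DATA_DICTIONARY = {
    "name": {"type": str, "required": True, "description": "Product name"},
    "price": {"type": float, "required": True, "description": "Gross price"},
    "category": {"type": str, "required": False, "description": "Target label"},
}


def validate(record, dictionary):
    """Return a list of problems found in a single record."""
    errors = []
    for field, spec in dictionary.items():
        value = record.get(field)
        if value is None:
            if spec["required"]:
                errors.append(f"{field}: missing")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"{field}: expected {spec['type'].__name__}")
    return errors


good = validate({"name": "Sony QLED", "price": 999.0}, DATA_DICTIONARY)
bad = validate({"name": "Sony QLED", "price": "999"}, DATA_DICTIONARY)
```

Running such a check at data entry time catches mismatches, like a price stored as text, before they reach the training set.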

GIGO - CONCLUSION

I can't imagine working on any project without taking data quality into account. In the realm of machine learning, quality is key. That is why GIGO is the dread of all data scientists. It's a rule I always apply when I start a new project.

Also, remember that the models are what they eat. If you feed them dodgy, biased data, they will produce absolute nonsense. And if you can't trust the data used to train machine learning algorithms, you can't trust the decisions made by ML models.

Kacper Łukawski

Software Engineer. Big fan of AI and applying machine learning methods in real-life problems, with an experience in web development and databases. Currently involved in Big Data projects as well as in internal research at Codete.
