Clean Data for AI & Automation: Best Practices & Avoiding Automation Failures

One of the first steps to ensure your success when automating a process or using AI in your workflow is to have clean data.

What do I mean by clean data?

Clean data is data that can be easily manipulated by software, such as AI, to provide meaningful and accurate results.

Let me tell you about an experience I had with data that wasn’t clean. I was tasked with automating the migration of data out of an old billing system into a new one. In the old system, “Contracts” were stored as free-form data. The client name and the products they purchased were also free-form text. This meant that if “Client A” originally purchased “Product A,” and later wanted “Product C” as well, an additional contract was created. This time the client was listed as “Client A LLC” with a product called “Service.” How would you go about determining what the client is contracted for? Well, it wouldn’t take much for someone to recognize that “Client A” and “Client A LLC” are actually the same company. I could have made part of the import process remove common suffixes like “LLC” when matching everything up. However, everything we sold was a “Service.” The only way to determine which service was purchased was to hope the salesperson remembered what it was the client bought. This was just one of many problems encountered with the data.

Even if we weren’t migrating to a new system, generating reports to get a holistic view of a client was impossible, apart from someone who knew the clients inside and out spending hours manually compiling the data.

Fortunately, the new system had restrictions in place to discourage things like that from happening again. As a result, we only had to go through the painful process of cleaning the data once. The new system creates contracts for clients rather than having you manually type the client name. Instead, you selected the client from a list of existing clients. If it’s a new client, you would create the client first. It also included checks to ensure that you don’t create duplicate clients. Products were selected from a list of known products. Reports were then automatically generated in seconds, rather than the hours it previously took.

Get it clean.

The process of cleaning up data can be daunting. Good news is there are tools that can help. The bad news is, depending on the level of dirt, manual review may be the only option.

Data Cleansing Tools

There are tools out there specifically designed to help in the process, such as OpenRefine. Yes, some even use AI to aid in the cleansing! If it had existed back when I was dealing with the whole “Client A” “Client A LLC” issue, it would have saved me the time I spend building a custom solution.

Manual Review

Unfortunately, there are some things you will just have to fix and review yourself, or pay somebody else to do it. Take for instance the “Service” product, how would a cleansing tool know what to do with that?

Keep it clean.

Data cleansing can be a lengthy and painful process. To prevent it from becoming dirty again, make sure to implement processes that address this issue. Here are some considerations, but remember they are guidelines not rules:

Avoid using free form text fields unless absolutely necessary.

Do you recall the “Service” example we discussed previously? Allowing users the autonomy to make errors ultimately leads to their inevitability.

Split the data out into as many individual components as possible.

You might be storing an address in a single field. What if you need to run a report that breaks down details by country or state? In that case, it’s safer to store the address as separate fields such as street, city, state, zip code, and country.

Don’t rely on naming conventions.

Naming conventions can sometimes be broken. For instance, you might consider denoting a company’s subsidiary as <ParentName> - <SubsidiaryName>. A better approach is to store the parent as a separate field. If display purposes require, you can show it in the specified convention.

If you are storing the same data twice, something probably went wrong.

The issue arises when a change is necessary. For example, if a company changes its name, you need to update it in all instances rather than just one place.

You might not be able to do all of this.

Maybe your current system doesn’t support storing a particular piece of data you need to keep track of. You might be tempted to use another unused field for that purpose. Please bear in mind that it may have unintended consequences. What if you need that field later for its intended purpose? It probably doesn’t have the structures in place to prevent more dirty data from being created. In the end, you may have no other alternative apart from getting some custom development done or switching to a new system. When in doubt, that’s where people like me come in to help.

Why it matters.

The short answer is, dirty data cost you money.

The longer answer is that dirty data makes automation difficult, if not impossible. This means someone is likely spending time manually dealing with it. Dirty data also leave more room for errors. Take the original example for instance, how would they even know if they are billing the customer for the correct amount? After cleaning the data, I build a reconciliation process to compare what services the customers were being billed compared to what services the customers were using, and found tens of thousand of dollars in lost revenue.

As with many things involving computers, AI follows the principle of “garbage in - garbage out.” If you are considering utilizing AI, you will want clean data.

Not sure where to start?

Evaluating data cleanliness can be a daunting task, especially for people that aren’t used to creating automation around datasets. This is were I can help. For the last 20 years I have spent an enormous amount of time cleaning and processing data in many different format. Schedule a free 30 minute consultation now!