Gable Innovation
Back to Insights
CRM Optimization

How to Clean CRM Data Without Losing History

Sales teams sit on CRMs packed with duplicate contacts, outdated titles, and leads from three years ago - but hesitate to clean them because they fear breaking pipeline history, attribution reports, or the audit trail finance relies on for revenue recognition. That fear is mostly unfounded. Updating contact details in Salesforce or HubSpot doesn't corrupt past activity records; the platform simply refreshes the current information while preserving historical interactions. The real challenge isn't whether to clean CRM data - it's knowing what to update, what to merge, and what to delete without disrupting how sales and the broader organization actually use that data. Done right, teams reclaim accurate reporting and usable records without sacrificing the integrity of what came before.

Why CRM data gets messy in the first place

This mess all starts with data entry. A sales rep will type "Apple Inc." into the first field of a new record but then remember that Apple was called "Apple Computer" back in the 90s for that record. Then as part of a large import of new companies that the Marketing team has been compiling, they will all be entered in ALL CAPS for the company name. Meanwhile an SDR has been entering in all of the companies for a large corp and for some reason has decided to abbreviate "International" to "Intl." for half of the companies and spell it out in full for the rest.

People are the culprit here. Everyone is doing their job to the best of their ability. However, they are being put under an extreme amount of pressure to perform in a very fragmented market with very little process around data entry.

Integrations further compound the problem. The company's webinar software is connected to HubSpot, so all contacts from webinars are automatically added to HubSpot in a certain format. LinkedIn Lead Gen Forms are also automatically connected to HubSpot in a different format. The company's payment processor, Stripe, is connected to the company's Salesforce account, and all transaction data is automatically written to the accounts in Salesforce in yet another format. So the same account is written three times: "Apple Inc.", "Apple, Inc." and "Apple Inc " (note the trailing space). None of these are ever matched up within the system.

Here are the reasons why teams can't simply merge all duplicates:

  • Historical reporting affected by changed fields: All historical reports using the fields in question (e.g. company name, address etc.) will be affected. So quarterly revenue for Q1 suddenly drops to $0 and the CFO wonders what happened.
  • Integration dependencies are broken: Even if teams manage to merge all the records from all the systems, there are often marketing automation processes that still reference the old Record ID of the "merged" Account. Delete the "old" Account and these processes fail silently - without any indication of what went wrong.
  • Activity history vanishes: The activity history (email threads, call logs, meeting notes) of the "wrong" record will be deleted. The team members who need this information to close deals will be very sad.
  • Custom fields contain unique data: This is the worst case scenario for sales and marketing teams. The record(s) that teams don't want to delete (due to incorrect contact information) contains the correct information for the custom fields such as the industry, phone numbers, etc.

Another huge error that can occur is when someone creates custom picklist values for a field within Salesforce. For example someone created a value for "Lead Source" for "Website - Contact Form" and a few weeks later another person created "Contact Form (Website)" for "Lead Source" because they couldn't find the previously created value. And then months later someone just picks "Website" because it's easier. This means that lead source reports are now useless because there are 3 values that all mean the same thing.

Step 1: Audit your data before you touch anything

Before even starting to clear out and reorganize CRM data, make a full backup of ALL data. Teams have no idea when they might discover months later that some workflow has been secretly reading a field that was assumed all along to be blank. To export data from within the Salesforce interface, go to Setup → Data Loader → Export Objects and choose to export all of the standard objects for the organization. Then, for each object, choose to export all fields (do not limit to the fields that teams immediately think they will need for cleanup work). Each object and each field will be exported as a separate.CSV file, which can be named before export. The timestamp on the.CSV file name will serve as proof that the file was exported at the exact time and date that it was exported. The HubSpot export function is found at Settings → Data Management → Export Data. From here, teams can choose to export all data for all of the objects in the HubSpot database, including all contacts, all companies, all deals, all tickets and all custom objects as well.

The biggest CRM cleanup disaster is removing a field that looks inactive but is secretly being used by a report or integration. The Field Usage tool can help identify fields that are not used for a long time but this must be verified for reports and integrations.

Document which fields are actually used

Salesforce Field Usage: The last time a field was used is displayed in the Field Usage tool found in Object Manager → Field & Relationships for a specific object. This tool is sorted by the last time a field was modified in descending order. Thus, fields that have been blank for an extended period can be marked for removal after verification to see if the field is used for quarterly reports, etc.

See also how each field is used within HubSpot. From the Objects view in Settings → Data Management for a particular object (e.g. Contacts), click on a particular property (e.g. Phone) from the Properties view for that object. At the bottom of the resulting page for that property is a Usage view. In this view, teams can see a List of Objects (e.g. Workflows, Forms, Reports) that use that field, as well as the Last Updated date for that field. If the List of Objects is empty for a property and the Last Updated date is a long time ago for that property then it is likely that the field can be removed. Some fields only need to be updated on a quarterly basis for example and teams don't want to remove a field 2 weeks before Q4 reporting starts. Document all of this in a spreadsheet: field name, last modified date, used in workflows/reports, and the decision (keep/archive/delete). This will be the best guide throughout the process for cleaning data in the CRM.

Map your integration dependencies

Before removing any duplicate records or cleaning up any fields in the CRM, determine which external systems are adding data to the CRM. For example, teams may find that a marketing automation platform is adding a field for Lead Source to every new Lead record created in the CRM.

Step 2: Standardize field formats and values

First, one must clean up the data in the database and then after that apply the new standards to newly entered data with the newly created workflows for validation of the new data against the newly found standards.

The biggest mistake people make when trying to clean up inconsistent data is to enable validation before they have a chance to clean up the existing data. This can very quickly lead to a situation where teams are locked out of performing a bulk update on a set of records, for example because there is a validation rule which fails on the first record and prevents the rest from being updated. Even worse, teams can get into an infinite loop of a workflow automating updates to a field, which in turn triggers the update in the workflow that automates the update to that field, and so on. This is a worst case scenario that teams really want to avoid at all costs.

Phone numbers and addresses

Another item to think about when cleaning up data are phone numbers and addresses. As mentioned earlier, it's typical to clean up or reformat phone numbers and addresses within a field, but often it's helpful to keep the original data in the field for auditing and compliance reasons. Salesforce and HubSpot enable teams to keep the original data in a field and also create a clean version of that data as well.

In Salesforce, to remove non-numeric characters and then reformat a field to a new phone number field (i.e. (555) 123-4567), create a Formula Field. The Formula field expression to remove non-numeric characters and then reformat a field to a new phone number field is as follows: LEFT(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(Phone, "(", ""), ")", ""), "-", ""), 10). This will return just the digits from the original phone number field. Teams can then use the TEXT function and the MID function to then reformat these digits to whatever format is required.

HubSpot formatting of data is slightly different to how data is formatted in Salesforce. A workflow can be created to copy information from a field and then use the action 'Format data' to reformat the information as required. The newly formatted information is then written to a separate clean field. Integrations then should be updated to reference the clean field of information instead of the original field of information.

The same process is used to standardize all address fields on all objects, by first setting up a clean field to store the standard addresses. A Flow or a workflow is then set up to process each current address and transform it as needed to be standard. This then is the field that all integrations are then set up to pull from, and as with any data transform, it is best to test the transform on a sample of the affected records first to check for any issues.

Picklist and dropdown cleanup

Salesforce's Schema Builder allows teams to view all picklist values in the org and where they are being used. The 'Replace' function in the 'Picklist Value Sets' area of Setup for a picklist can be used to merge two values for a field (e.g. 'California' and 'CA'). By selecting the picklist value to retire and choosing a replacement value, Salesforce will go through and update all records where the picklist value was previously used with the new value. Unlike editing a field on a record, this process will not create a field history entry for the old value name. Thus, reports filtered by the old value names will continue to work.

Native support to merge property values doesn't exist in HubSpot. Teams would export the contacts in HubSpot with the property values and then in the resulting CSV file do a find-and-replace to clean up the values (e.g. "CA" to "California"). Then in HubSpot import this CSV file with the option "update existing records" turned on. This is pretty labor intensive for what should be a one-time fix for data that was set up incorrectly.

Step 3: Deduplicate contacts and companies safely

Merging contacts seems to be such a simple thing until teams merge 2 objects and kill an integration that was long forgotten about.

After the two records are merged, one of the original records is marked as the 'master record' and the other record is removed from the database. However, the record that was removed still has an ID and that ID is stored in all systems where that ID was used before (marketing automation, support software etc.) including all integrations that were set up with Zapier using that ID of that removed record. And when those integrations start firing, they will fire to a record that no longer exists in the database. And because those integrations stop working silently (i.e. without throwing an error) teams will typically not even know that they have stopped working until someone alerts them to the fact that a record cannot be found in the CRM or marketing automation software, or that weeks ago a workflow stopped firing for unknown reasons.

Not great.

Identify all the duplicate contacts, and then decide which contact is the most important one to save.

Salesforce deduplication mechanics

When Salesforce identifies potential duplicates they are added to a list for review. Teams are able to add rules for the application of these potential duplicates such as an exact email match, a fuzzy match of name and company, phone number match etc. Each set of duplicates can then be viewed side by side to determine which record to make the master record. No automatic merge is provided.

When two records are merged together, the fields on the master record will be copied over by default. However, teams can choose to manually select fields from the other record to copy as well. For example, let's say there are two records for a contact, one with the contact's most up-to-date email address information and the other with the contact's outdated information. Teams would choose to keep the most up-to-date information in the fields on the other record before merging the two records together. Child records (such as Opportunities, Cases, Activities) will automatically re-parent to the master record after a merge.

However, if the wrong record is picked as the master record for the merged records (e.g. an old record with outdated contact information as opposed to a newer record with the correct contact information), then teams would have to go back in and manually update all of the fields from the merged in record. Also, note that Salesforce only allows for 3 record merges at a time. Therefore, for example, if there are 5 records that are duplicates of each other, teams would have to perform 3 record merges to merge the 3 records on the left side of the page, and then perform another 2 record merge to merge the resulting record with the last 2 records on the right side of the page.

To do a successful merge of contacts, teams have to make sure that the ID of the record they are going to "lose" is up to date in all integrations. So before starting the merge process export all contacts with IDs and check for each record which ID appears in all integrations (e.g. in Zapier, in HubSpot, in the support system etc.). If the "wrong" record (the one about to be deleted) is the one stored in all these integrations outside of the CRM first update all these external systems with the correct ID of the record to be "kept" after the merge.

HubSpot deduplication mechanics

The HubSpot duplicate merge tool has a merge preview that lists out all of the fields that will be kept from the merged records. In addition, by default the most recently updated value for each field will be kept during the merge. However, each field can be set to keep the opposite record for the field during the merge.

Step 4: Handle incomplete and outdated records

Archive incomplete records (i.e., records that have had some history of interaction) instead of deleting them as teams will lose history of prior interaction. Only remove from sales & marketing workflows.

The best approach for records that teams want to keep but are missing some information is to use data enrichment on these records of value. Data enrichment on partial data records of value such as leads who have been added to the CRM pipeline recently and accounts with open deals in the CRM will greatly increase the accuracy of the information contained in these records. Clearbit, a data provider, is natively integrated within Salesforce so it can be used to update the email and job title of a contact for instance. There are also third party data providers, such as ZoomInfo that can be used to update incomplete information in both Salesforce and HubSpot. Additionally, many CRMs also have native data enrichment built into them as well. As an example, in Salesforce teams can pull in company information and contacts' social media information.

Only enrich leads that are in the pipeline and are still relevant. In this case, those would be leads that were added to the pipeline recently and accounts that have active deals. This will ensure that teams are not wasting their API credits enriching old stale contacts that are already not going to be followed up with.

Some data in the CRM, such as old leads, closed accounts and bounced contacts will not benefit from enrichment. Archiving these records in the CRM is better than deleting them. This allows historical reporting to still reference the old data (for example, auditing the reason a deal didn't complete 2 years ago). In Salesforce teams can create custom record types (in this case 'Archived') and add fields (such as a checkbox field) to the record type. Then update the record type of relevant records and check the box to mark them as inactive. Teams can then create list views to filter out these inactive records to stop them appearing in searches carried out by reps in the CRM.

For records in HubSpot's CRM pipeline that will be removed from the pipeline (either by the user or automatically), it is best to archive them (as opposed to deleting them). Archived records can be moved to a custom lifecycle stage (e.g. "Archived - No Longer Relevant") or to a custom contact property (e.g. "Archive Status"). Teams can then create active lists that include only archived contacts and companies (for example, for follow-up research) and also create active lists that exclude archived contacts and companies (for example, for sales contacts and companies that are active). Set default views for sales contacts and companies to only include those that are active and sales teams will no longer see contacts or companies that have been archived.

The audit trail is important to track down who the contact was for the deal that closed in the past for example.

When should teams actually delete? These scenarios only:

  • Test records and junk data: If there is a test environment or sandbox then teams will often create test records in order to test functionality in the CRM and also in order to test out import processes etc. The problem is that such records often contain junk data (for example a record for "Bob Smith" with a phone number of "1-800-CRM-US-1" and an email address of "bsmith@crmusa.com" etc - in other words records that are obvious test records but which are often not removed in a sandbox environment). Obviously such records would have no value historically and would likely cause issues.

Step 5: Lock in your clean data with validation and automation

The hard work of removing duplicate records and changing various versions of a name such as John Smith (John S., John, J. Smith, J. S. etc.) to one version will not last long.

The mess returns faster than expected. People will find ways to route around restrictions that they find annoying. New integrations will import data however they want. And before teams know it, two months will have passed and they'll be dealing with a database full of phone numbers that have parentheses in them and others with dashes in them.

Validation at point of entry. This simply refers to the rules that are put in place as data is entered into the database via a form (e.g. contact form, opportunity form). Such rules are typically implemented on the form itself. Each CRM supports such rules in various ways. Salesforce, for example, supports complex validation rules that can, for example, specify what to do with a record that does not contain specified information (e.g. reject the record). HubSpot, for example, supports required fields (i.e. fields that MUST be completed) as well as specifying default values for form fields and for record creation in workflows.

Start with these automation patterns:

  • Auto-format phone numbers and addresses. The example above shows how records with phone numbers in parentheses (e.g. (555) 123-4567) will be auto-formatted once the automation is enabled (e.g. 555-123-4567). This can be done using a Salesforce Flow or a HubSpot workflow triggered off of the saving of a record.
  • Auto-capitalize names and company fields. As mentioned before, a small feature but a big time saver. For example: " john smith" will automatically be changed to "John Smith" when added as a new contact in a Salesforce or HubSpot database.
  • Auto-assign values based on other fields. When someone selects "Enterprise" for deal size, automatically populate minimum revenue for that deal. Also, when someone types in a company's domain, automatically look up the industry for that company from an enrichment API (like Clearbit or LeadIQ).
  • Set up the CRM's dedupe settings for the form. The Web-to-Lead form in Salesforce, for example, has duplicate rules for email and company name. HubSpot forms are set up to do a dedupe on email before creating a new contact. (Zapier and Make also have steps for deduping in workflows but these check for an existing record and then teams decide whether to create a new one).

Systems must be balanced and allow enough flexibility to enable users to add information that is necessary to complete a sale. Validation rules at the point of entry of data are better than none, so start with the fields where inconsistent data will cause the greatest harm in reports, such as phone number fields being stored with parentheses, with dashes, etc., or for deal stages, etc. Then, add more rules as necessary.

Some fields are more important to clean than others, and typically these are the fields that create the biggest headaches in reports. For example, phone number fields (with or without parentheses, with or without dashes, etc.), country code fields, and deal stage fields. However, teams also have to be mindful of how many required fields they add to a record, because if it is made too hard for reps to add a new customer, then they'll just start filling in required fields with 'N/A' or 'Unknown' in order to close the dialog box and get back to work.

Frequently Asked Questions

How long does it take to clean CRM data?

The amount of time required to clean CRM data depends on the amount of data stored in the CRM (number of records). For CRMs holding moderate amounts of data, it will take several weeks to complete the clean of the data. However, only a portion of this time will actually be spent actually cleaning the data. Automated deduplication tools are available for CRMs and can complete the deduplication process within a few days, again allowing for time to set up the matching logic and for dealing with the odd edge case that has probably been missed.

Can I clean CRM data without deleting records?

Yes. Teams should, actually.

Deleting records can be very problematic for reporting as well as cause 'orphaned' data for email opens and deal stages etc. Therefore, it is best to mark them as inactive in a custom field (e.g. Status = 'Archived' or 'Duplicate - Do Not Use') and then filter these out in views and workflows.

What's the difference between deduplication and data standardization?

Deduplication and data standardization are two different processes in CRM data cleaning. Deduplication of data in a CRM system removes and/or merges duplicate records from the database. Data standardization is a subtler issue but has much greater impact over the long term. The problem is that all of the data in the database is entered in different formats.

Can you use third-party data cleaning tools for CRM data?

Deduplication: This is a process that's best carried out by third party tools. This is because it is so much faster and more accurate at matching up records than a human manually trying to work out the best match. Tools like Insycle for HubSpot CRM and DemandTools for Salesforce CRM are two examples of third party tools that can match up on many different fields. Data Standardization: As with the Deduplication, third party tooling is worth paying for for large amounts of data. However for smaller amounts of data within the CRM, the native CRM functions and processes such as a bulk update of data to Standardized data would be sufficient.

Gable Innovation is a technology consultancy that works with growing businesses to select the right CRM, AI and automation tools. For businesses with a messy Salesforce or HubSpot CRM database and aren't sure where to start with the clean up process, Gable can complete an audit of the issues with data. They can then work with businesses to create a plan of action and complete the most critical elements of the clean up. They can also help create processes and workflows to maintain clean data on an ongoing basis. Gable is offering free 30 minute discovery sessions to discuss requirements and create a plan that fits needs. Book a session here: gableinnovation.com.

Ready to put this into practice?

We help growing businesses implement CRM, build custom software, and deploy AI tools that actually work.

Book a Discovery Call