
Safeguarding Privacy with Synthetic Data

Q&A with Alys Woodward, Sr Director Analyst, Gartner



A major problem with AI development today is the burden of obtaining real-world data and labeling it. In fact, data availability was named one of the top five barriers to implementing generative AI (GenAI) in a Gartner survey of 644 organizations conducted in the fourth quarter of 2023. Synthetic data can help solve this problem. With orders of magnitude less privacy risk than real data, synthetic data opens up opportunities to train machine learning (ML) models and analyze data that would not be available if real data were the only option.


In this Q&A, Alys Woodward, Sr Director Analyst at Gartner, explains how synthetic data can overcome privacy, compliance, and data anonymization challenges, and discusses the issues impeding its widespread adoption.


Q: How can synthetic data help organizations address privacy challenges while training their AI/ML or computer vision (CV) models?


A: Synthetic data can bridge information silos by acting as a substitute for real data and not revealing sensitive information, such as personal details and intellectual property. Since synthetic datasets maintain statistical properties that closely resemble the original data, they can produce precise training and testing data that is crucial for model development.
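To make the idea of "maintaining statistical properties" concrete, here is a minimal sketch, not a description of any particular vendor's method. It assumes a toy two-column table of hypothetical (age, income) pairs and a simple Gaussian model. Real generators use far richer models, and this sketch by itself offers no formal privacy guarantee. All names and values below are illustrative assumptions.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate the means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted model; no real row is copied."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 and z2 so that x and y have correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(max(0.0, 1 - rho * rho)) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)
# Hypothetical "real" data: (age, income) pairs with a strong positive relationship.
ages = [rng.gauss(40, 10) for _ in range(2000)]
real = [(a, 1000 * a + rng.gauss(0, 5000)) for a in ages]

real_params = fit_gaussian_2d(real)
synthetic = sample_synthetic(real_params, 2000, rng)
synthetic_params = fit_gaussian_2d(synthetic)

print("real correlation:     ", round(real_params[4], 3))
print("synthetic correlation:", round(synthetic_params[4], 3))
```

The synthetic rows share the means, spread and correlation of the source table without repeating any individual record, which is what lets them stand in for the original in training and testing.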


Training CV models often requires a large and diverse set of labeled data to build highly accurate models. Obtaining and using real data for this purpose can be challenging, especially when it involves personally identifiable information (PII).


Two common use cases that require PII data are ID verification and advanced driver assistance systems (ADAS), which monitor movements and actions in the driver’s area. In these situations, synthetic data can be useful for generating a range of facial expressions, skin colors and textures, as well as additional objects like hats, masks, and sunglasses. ADAS also requires AI to be trained for low-light conditions, such as driving in the dark.


Q: How can synthetic data reduce the challenges associated with data anonymization?


A: Efforts to manually anonymize and deidentify datasets (that is, to remove information that links a data record to a specific individual) are often time-consuming, labor-intensive, and prone to errors. Ultimately, this can delay projects and lengthen the iteration cycle for developing machine learning (ML) algorithms and models. Synthetic data can overcome many of these pitfalls by providing faster, cheaper and easier access to data that is similar to the original source, suitable for use, and privacy-preserving.


Furthermore, if manually anonymized data is combined with other publicly available data sources, there's a risk it could inadvertently reveal information that could lead to data reidentification, thus breaching data privacy. Leaders can use techniques such as differential privacy to ensure any synthetic data generated from real data is at very low risk of deanonymization.
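One common way to apply differential privacy when generating synthetic data is to add calibrated noise to summary counts and then sample from the noisy summary. The sketch below uses the Laplace mechanism on a histogram of a single categorical column. It is an illustration under stated assumptions, not the specific technique the interview refers to. The column values, the `domain` list and the `epsilon` setting are hypothetical, and the category domain is assumed to be known in advance rather than derived from the data.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_histogram(values, domain, epsilon, rng):
    """Count each category, then add Laplace noise with scale 1/epsilon to every count.

    Adding or removing one record changes a single count by 1, so the
    sensitivity of the histogram is 1. Values outside `domain` raise KeyError.
    """
    counts = {category: 0 for category in domain}
    for v in values:
        counts[v] += 1
    # Clamping negatives to zero is post-processing and does not weaken the guarantee.
    return {c: max(0.0, n + laplace_noise(1 / epsilon, rng)) for c, n in counts.items()}


def sample_from_histogram(noisy, n, rng):
    """Generate n synthetic values in proportion to the noisy counts."""
    categories = list(noisy)
    weights = [noisy[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


rng = random.Random(42)
domain = ["basic", "plus", "premium"]  # hypothetical, fixed in advance
real = ["basic"] * 600 + ["plus"] * 300 + ["premium"] * 100

noisy = dp_histogram(real, domain, epsilon=1.0, rng=rng)
synthetic = sample_from_histogram(noisy, 1000, rng)
print({c: round(n, 1) for c, n in noisy.items()})
```

A smaller `epsilon` means more noise and stronger privacy. Because the synthetic values are drawn only from the noisy counts, they inherit the same guarantee.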


Q: Despite the clear benefits of using synthetic data, what are some of the challenges hindering its widespread adoption?


A: Creating a synthetic tabular dataset involves striking a balance between privacy and utility, ensuring the data remains useful and accurately represents the original dataset. If the utility is too high, privacy may be compromised, especially for unique or distinctive records, as the synthetic dataset could be matched with other data sources. Conversely, methods to enhance privacy, such as disconnecting certain attributes or introducing ‘noise’ via differential privacy, can inherently diminish the dataset’s utility.
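The risk to unique or distinctive records can be checked directly. The sketch below is a simple illustration rather than a complete privacy audit. It flags synthetic rows that reproduce a combination of quasi-identifiers held by exactly one person in the original table. The column names and rows are hypothetical.

```python
from collections import Counter


def unique_record_matches(original, synthetic, quasi_identifiers):
    """Return synthetic rows whose quasi-identifier values match a row
    that is unique in the original dataset."""

    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    frequency = Counter(key(row) for row in original)
    unique_keys = {k for k, n in frequency.items() if n == 1}
    return [row for row in synthetic if key(row) in unique_keys]


# Hypothetical original table; zip and age act as quasi-identifiers.
original = [
    {"zip": "10001", "age": 34, "job": "nurse"},
    {"zip": "10001", "age": 34, "job": "teacher"},
    {"zip": "10002", "age": 71, "job": "pilot"},   # unique on (zip, age)
    {"zip": "10003", "age": 29, "job": "clerk"},   # unique on (zip, age)
]
synthetic = [
    {"zip": "10001", "age": 34, "job": "clerk"},   # matches a group of two: lower risk
    {"zip": "10002", "age": 71, "job": "pilot"},   # reproduces a unique individual
    {"zip": "10004", "age": 45, "job": "nurse"},   # matches nobody
]

risky = unique_record_matches(original, synthetic, ("zip", "age"))
print(len(risky), "synthetic row(s) match a unique original record")
```

Rows flagged this way are the ones most likely to be matched against other data sources. Suppressing or perturbing them improves privacy at some cost to utility.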


Over the past decades of data management, low-quality transaction data has been an ongoing challenge. For example, call center agents might fail to enter complete address data or customer information. This missing data can prevent analysis. To counteract this, IT organizations have had to educate business users on how important good data quality is to both applications and analytics. “Garbage in means garbage out” was the commonly accepted principle. However, this now shapes people’s attitudes to synthetic data: they believe it must be inferior because it’s not real data, which delays adoption. In reality, synthetic data can be better than real data, not in how it represents the current world, but in how it can train AI models to work with the ideal or future world.


A synthetic dataset mirrors the original dataset. Therefore, if the original does not include unusual occurrences or “edge cases,” these won’t appear in the synthetic dataset either. This is particularly important for image and video synthetic data in areas like autonomous driving, where many hours of driving footage are used to train the AI. Unusual situations like emergency vehicles, driving in snow or animals on the road rarely appear in that footage, so they need to be created deliberately.
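The point about edge cases can be shown with a toy generator. The sketch below works on scene labels rather than images and is only an illustration. A model fitted to observed frequencies cannot emit a condition it has never seen, so edge cases have to be added explicitly. The labels, counts and the 15% share are hypothetical.

```python
import random
from collections import Counter


def fit_frequencies(observed):
    """Estimate the probability of each condition from observed data."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {condition: n / total for condition, n in counts.items()}


def sample(model, n, rng):
    """Draw n conditions according to the model's probabilities."""
    return rng.choices(list(model), weights=list(model.values()), k=n)


def add_edge_cases(model, edge_cases, share):
    """Reserve `share` of the probability mass, split evenly, for edge cases."""
    augmented = {c: p * (1 - share) for c, p in model.items()}
    for c in edge_cases:
        augmented[c] = augmented.get(c, 0.0) + share / len(edge_cases)
    return augmented


rng = random.Random(7)
footage = ["clear_day"] * 900 + ["rain"] * 100  # hypothetical recorded conditions

baseline = fit_frequencies(footage)
baseline_samples = sample(baseline, 10_000, rng)

augmented = add_edge_cases(
    baseline, ["snow", "emergency_vehicle", "animal_on_road"], share=0.15
)
augmented_samples = sample(augmented, 10_000, rng)

print("snow in baseline: ", "snow" in baseline_samples)
print("snow in augmented:", "snow" in augmented_samples)
```

The baseline generator never produces snow because the source footage contained none. Only the explicitly augmented model covers the rare conditions.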

2026 @ Inno-Thought and its affiliates. All rights reserved.
