Synthetic data, as the name suggests, is data that is artificially created rather than generated by actual events. It is often created with the help of algorithms and is used for a wide range of purposes, including test data for new products and tools, model validation, and training AI models.
The Importance of Synthetic Data
Synthetic data is important because it can be generated to meet very specific needs or conditions that are not available in existing (real) data. This can be useful when either privacy needs limit the availability or usage of the data or when the data needed for a test environment simply does not exist.
Although synthetic data was first used in the 1990s, the abundant computing power and storage space of the 2010s brought it into more widespread use.
Business functions that can benefit from synthetic data include:
- Machine learning: Self-driving car simulations pioneered the use of synthetic data.
- Agile development and DevOps: When it comes time for software testing and quality assurance, artificially generated data (often called ‘test data’ in this context) is frequently the better choice because it eliminates the need to wait for ‘real’ data. This can ultimately lead to decreased test time and increased flexibility and agility during development.
- Clinical and scientific trials: Synthetic data can be used as a baseline for future studies and testing when no real data yet exists.
- Research: To help better understand the format of real data not yet recorded, develop understanding of its specific statistical properties, tune parameters for related algorithms, or build preliminary models.
And some industries that can benefit from synthetic data:
- Financial services: Fraud protection is a major part of any financial service and with synthetic data, new fraud detection methods can be tested and evaluated for their effectiveness.
- Healthcare: Synthetic data enables healthcare data professionals to make record data available for public use while still maintaining patient confidentiality.
Generally speaking, synthetic data allows us to continue developing new and innovative products and solutions when the data necessary to do so otherwise wouldn’t be present or available.
Data is used in applications, and the most direct measure of data quality is how effective the data is in use. Machine learning is one of the most common use cases for data today. MIT scientists wanted to measure whether machine learning models built from synthetic data could perform as well as models built from real data. In a 2017 study, they split data scientists into two groups: one using synthetic data and another using real data. 70% of the time, the group using synthetic data was able to produce results on par with the group using real data.
Being able to generate data that mimics the real thing may seem like a limitless way to create scenarios for testing and development. While there is much truth to this, it is important to remember that any model derived from real data can replicate only specific properties of that data, so the synthetic data it produces will ultimately simulate only general trends.
However, there are still a number of benefits that synthetic data has over real data:
- Overcoming real data usage restrictions: Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can replicate all important statistical properties of real data without exposing real data, thereby eliminating the issue.
- Creating data to simulate not yet encountered conditions: Where real data does not exist, synthetic data is the only solution.
- Immunity to some common statistical problems: These can include item nonresponse, skip patterns, and other logical constraints.
- Focuses on relationships: Synthetic data aims to preserve the multivariate relationships between variables instead of specific statistics alone.
These benefits suggest that the creation and usage of synthetic data will only grow as our data becomes more complex and more closely guarded.
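To make the point about preserving relationships concrete, here is a minimal sketch that fits a bivariate normal distribution to two related columns and samples new pairs from it. The age and income figures, and the helper names `covariance`, `correlation`, and `synthesize_pairs`, are all hypothetical, and the choice of a normal model is an assumption made for illustration.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def synthesize_pairs(xs, ys, n, rng):
    """Sample n (x, y) pairs from a bivariate normal fitted to the real pairs."""
    mx, my = mean(xs), mean(ys)
    sx, sy = math.sqrt(covariance(xs, xs)), math.sqrt(covariance(ys, ys))
    rho = correlation(xs, ys)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y is what carries the x-y correlation over.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        pairs.append((x, y))
    return pairs

# Hypothetical real data: age and annual income (in thousands).
ages = [25, 32, 41, 38, 52, 47, 29, 60, 35, 44]
incomes = [38, 45, 61, 52, 80, 66, 40, 85, 50, 63]

synthetic = synthesize_pairs(ages, incomes, 5000, random.Random(0))
syn_ages = [a for a, _ in synthetic]
syn_incomes = [i for _, i in synthetic]
```

Comparing `correlation(ages, incomes)` with `correlation(syn_ages, syn_incomes)` shows whether the relationship between the two columns survived, even though no synthetic pair corresponds to a real person.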
When determining the best method for creating synthetic data, it is important to first consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with their own benefits and drawbacks:
Fully synthetic: This data does not contain any original data. This means that re-identification of any single unit is almost impossible and all variables are still fully available. The trade-off is a heavy dependency on the imputation model used to generate the data.
Partially synthetic: Only data that is sensitive is replaced with synthetic data. This leads to decreased model dependence, but does mean that some disclosure is possible owing to the true values that remain within the dataset.
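The difference between the two categories can be sketched in a few lines. The records, field names, and both helper functions below are hypothetical, and the models are deliberately naive (a normal distribution for the numeric field, independent sampling for everything else), so treat this as an illustration rather than a recommended method.

```python
import random
from statistics import NormalDist

# Hypothetical patient records; "weight_kg" is treated as the sensitive field.
records = [
    {"age_group": "30-39", "region": "North", "weight_kg": 71.2},
    {"age_group": "40-49", "region": "South", "weight_kg": 84.5},
    {"age_group": "30-39", "region": "South", "weight_kg": 66.8},
    {"age_group": "50-59", "region": "North", "weight_kg": 90.1},
    {"age_group": "40-49", "region": "North", "weight_kg": 77.3},
]

def partially_synthesize(rows, sensitive_field, seed=0):
    """Replace only the sensitive field; all other true values remain."""
    model = NormalDist.from_samples([row[sensitive_field] for row in rows])
    draws = model.samples(len(rows), seed=seed)
    return [{**row, sensitive_field: round(v, 1)} for row, v in zip(rows, draws)]

def fully_synthesize(rows, n, seed=0):
    """Build n brand-new rows; no original record is carried over.
    Each field is sampled independently, which is a deliberate simplification."""
    rng = random.Random(seed)
    weight_model = NormalDist.from_samples([row["weight_kg"] for row in rows])
    weights = weight_model.samples(n, seed=seed)
    return [
        {
            "age_group": rng.choice([row["age_group"] for row in rows]),
            "region": rng.choice([row["region"] for row in rows]),
            "weight_kg": round(w, 1),
        }
        for w in weights
    ]

partial = partially_synthesize(records, "weight_kg")
full = fully_synthesize(records, n=8)
```

In `partial`, the true age group and region of every original record are still present, which is where the residual disclosure risk comes from. In `full`, every value comes from the fitted model, so the output is only as good as that model.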
Two general strategies for building synthetic data include:
Drawing numbers from a distribution: This method works by observing the statistical distributions of real data and drawing fake data from those distributions. This can also include the creation of generative models.
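As a minimal sketch of this strategy, the snippet below fits a normal distribution to a handful of observations and draws new values from it using Python's standard `statistics.NormalDist`. The order values are made up, and assuming the data is normally distributed is a simplification chosen for the example.

```python
from statistics import NormalDist

# Hypothetical "real" observations, e.g. order values in dollars.
real_orders = [23.5, 19.0, 31.2, 27.8, 22.1, 25.4, 29.9, 21.7, 26.3, 24.6]

# Step 1: observe the distribution of the real data (assumed normal here).
fitted = NormalDist.from_samples(real_orders)

# Step 2: draw as many synthetic values as needed from the fitted distribution.
synthetic_orders = fitted.samples(1000, seed=42)
```

Ten real observations become a thousand synthetic ones with roughly the same mean and spread. Real data is often skewed or multimodal, in which case a different distribution or a generative model would be a better fit.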
Agent-based modeling: In this method, a model is created that explains an observed behavior, and the same model is then used to generate random data. It emphasizes understanding the effects that interactions between agents have on the system as a whole.
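A toy version of agent-based modeling might look like the following. The scenario (agents adopting a product, with each agent more likely to adopt as more peers have done so), the function name `simulate_adoption`, and all parameter values are assumptions invented for this sketch.

```python
import random

def simulate_adoption(n_agents=200, n_days=30, base_rate=0.01, peer_effect=0.3, seed=1):
    """Each agent adopts a product with a probability that grows with the
    share of agents who have already adopted (the interaction between agents).
    Returns the number of new adopters per day as the synthetic series."""
    rng = random.Random(seed)
    adopted = [False] * n_agents
    daily_new = []
    for _ in range(n_days):
        # Every agent sees the same adoption share from the start of the day.
        share = sum(adopted) / n_agents
        p = base_rate + peer_effect * share
        new_today = 0
        for i in range(n_agents):
            if not adopted[i] and rng.random() < p:
                adopted[i] = True
                new_today += 1
        daily_new.append(new_today)
    return daily_new

daily_new_adopters = simulate_adoption()
```

The output is a synthetic daily time series produced by the rules of the model rather than by sampling from a fitted distribution. Changing `peer_effect` changes how strongly the agents influence one another and therefore the shape of the series.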
Though there are a wide range of benefits that can be derived with the aid of synthetic data, it is not without its own challenges. Some of these challenges include:
- Reluctance among users to accept synthetic data as valid
- Difficulty in generating synthetic data
- Dependency on the quality of the data model
- Inconsistencies when trying to replicate complexities within original datasets
- Difficulty in tracking all necessary features required to replicate the data
- The presence of bias within the synthetic data
- The need for validation against real-world data
- Simplified representations within datasets can have hidden effects on the performance of an algorithm when used in a real-world setting
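Validation against real-world data, mentioned in the list above, can start with something as simple as comparing distributions. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand. The three samples are randomly generated stand-ins, not real measurements, and a single statistic like this is only one of many possible checks.

```python
import random
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(7)
real = [rng.gauss(50, 10) for _ in range(500)]              # stand-in for real data
good_synthetic = [rng.gauss(50, 10) for _ in range(500)]    # same distribution
biased_synthetic = [rng.gauss(58, 10) for _ in range(500)]  # shifted mean

good_gap = ks_statistic(real, good_synthetic)
biased_gap = ks_statistic(real, biased_synthetic)
```

A larger gap for the biased sample than for the well-matched one is the kind of signal that flags bias in synthetic data before it reaches a model.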
Machine Learning and Synthetic Data: Building AI
The role of synthetic data in machine learning is increasing rapidly. This is because machine learning algorithms are trained on very large amounts of data, which could be extremely difficult to obtain or generate without synthetic data. Synthetic data can also play an important role in the creation of algorithms for image recognition and similar tasks that are becoming the baseline for AI.
There are a number of additional benefits to using synthetic data to aid in the development of machine learning:
- Ease in data production once an initial synthetic model/environment has been established
- Accuracy in labeling that would be expensive or even impossible to obtain by hand
- The flexibility of the synthetic environment to be adjusted as needed to improve the model
- Usability as a substitute for data that contains sensitive information
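The labeling benefit in the list above is easy to see in a toy generator: because the program places the object, it already knows the correct label. The function `make_labeled_image` and the rectangle-on-a-grid setup are hypothetical and far simpler than a real rendering pipeline.

```python
import random

def make_labeled_image(size=16, rng=random):
    """Draw one filled rectangle on a blank size x size grid and return the
    image together with its exact bounding box (top, left, bottom, right)."""
    height = rng.randint(2, size // 2)
    width = rng.randint(2, size // 2)
    top = rng.randint(0, size - height)
    left = rng.randint(0, size - width)
    image = [[0] * size for _ in range(size)]
    for row in range(top, top + height):
        for col in range(left, left + width):
            image[row][col] = 1
    label = (top, left, top + height - 1, left + width - 1)
    return image, label

rng = random.Random(3)
dataset = [make_labeled_image(rng=rng) for _ in range(100)]
```

Every image in `dataset` comes with a pixel-exact bounding box at no labeling cost, and adjusting the generator (object sizes, grid size, number of samples) is a matter of changing parameters.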
Two synthetic data use cases that are gaining widespread adoption in their respective machine learning communities are:
Self-driving simulations:
Learning from real-life experiments is hard for people and hard for algorithms as well. It is especially hard on the people who end up getting hit by self-driving cars, as in Uber’s deadly crash in Arizona. While Uber scales back its Arizona operation, it should probably ramp up its simulations to train its models.
Industry leaders such as Google have been relying on simulations to create millions of hours of synthetic driving data to train their algorithms.
Generative Adversarial Networks (GAN):
Generative adversarial networks (GANs) were introduced by Ian Goodfellow et al. in 2014 and are a recent breakthrough in image recognition. A GAN is composed of one discriminator network and one generator network. While the generator network produces synthetic images that are as close to reality as possible, the discriminator network aims to tell real images from synthetic ones. The two networks are trained against each other, each adjusting its weights to become better at its task.
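The contest between the two networks can be written compactly. The value function below is the one given in the original 2014 GAN paper, where $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the distribution of real images, and $p_z$ is the noise distribution fed to the generator:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to raise $V$ by scoring real images near 1 and generated images near 0, while the generator tries to lower it by producing images that the discriminator scores as real.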
While this method is popular in neural networks used in image recognition, it has uses beyond neural networks and can be applied to other machine learning approaches as well. The general approach is called Turing learning, in reference to the Turing test, in which a human converses with an unseen partner and tries to determine whether it is a machine or a human.
The tools related to synthetic data are often developed to meet one of the following needs:
- Test data for software development and similar uses
- The creation of machine learning models (referred to in the chart as ‘training data’)
Some common vendors that are working to create tools in this space include:
|Name|Founded|Status|Number of Employees|
|---|---|---|---|
|CA Technologies Datamaker|1976|Public|10,001+|
|Deep Vision Data by Kinetic Vision|1985|Private|51-200|
|Delphix Test Data Management|2008|Private|201-500|
|Informatica Test Data Management Tool|1993|Private|1,001-5,000|
These tools are just a small representation of a growing market of tools and platforms related to the creation and usage of synthetic data.
Synthetic data is one way that the world is evolving to deal not only with an increasing volume of data, but also with data that is often sensitive and requires additional protections. To learn more about related topics on data, be sure to see our blog’s data section.