Natural language has long served as a powerful bridge between humans and machines, enabling intuitive interaction and precise control over AI systems. While fields such as image editing, audio synthesis, and video generation thrive on abundant data–text pairs for training powerful generative models, many critical domains face a significant challenge: the scarcity of labeled data. Complex fields such as molecular research, motion generation, and time series often lack sufficient textual annotations, limiting the effectiveness and potential of current generative AI methods.
To overcome this critical gap, our team at Salesforce Research introduces Text2Data, an innovative framework designed to generate high-quality data under textual control even when labeled text-data pairs are scarce. Text2Data effectively addresses the complexities inherent in low-resource scenarios, making generative AI accessible to more specialized, challenging applications.
Why Text2Data Matters
Current methods typically rely on substantial labeled training data to achieve effective text-to-data control. In practice, however, labeling is often costly or impractical, which restricts supervised learning and limits the use of advanced generative models for text-to-data generation tasks. When generative models are trained with limited labeled data, issues such as poor generation quality, overfitting, bias, and lack of diversity arise. Traditional strategies, such as data augmentation and semi-supervised learning, often fall short, due to the nuances and ambiguities of natural language, computational inefficiency, or catastrophic forgetting, where previously learned information deteriorates as new data is introduced.
How Text2Data Works

Figure 1: Overview of Text2Data. The model first leverages unlabeled data (blue module) to discern the overall data distribution, yielding the optimal set of model parameters Θ. It is then fine-tuned on labeled data (red module) via constraint optimization, which yields parameters in Θ ∩ Θ′, where Θ′ is the optimal parameter set obtained when the model is fine-tuned without the constraint.
As illustrated in Figure 1, Text2Data introduces a two-step approach leveraging powerful unsupervised diffusion models:
- Unsupervised Distribution Mastery: Using unlabeled data, Text2Data first captures the inherent data distribution without any textual annotations, laying a robust foundation for the subsequent fine-tuning (a training-loop sketch of this stage follows this list).
- Controllable Fine-tuning: The model is then carefully fine-tuned on the limited textual labels through a novel constraint-optimization strategy. This strategy keeps the model's parameters close to the originally learned distribution, effectively mitigating catastrophic forgetting.
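To make the first stage concrete, here is a minimal PyTorch-style sketch of a standard denoising-diffusion training loop on unlabeled data. The model interface (num_steps, q_sample, a cond argument) is illustrative and stands in for whatever diffusion backbone is used; it is not our released implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, text_emb=None):
    """Denoising loss: predict the noise injected at a random timestep.

    `model` is assumed to expose `num_steps`, a forward process `q_sample`,
    and a denoiser call `model(x_t, t, cond=...)` (illustrative interface).
    With text_emb=None this is the unconditional loss L(θ); with a text
    embedding it becomes the text-conditioned loss L′(θ).
    """
    t = torch.randint(0, model.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = model.q_sample(x0, t, noise)      # forward process q(x_t | x_0)
    pred = model(x_t, t, cond=text_emb)     # denoiser ε_θ
    return F.mse_loss(pred, noise)

def pretrain(model, unlabeled_loader, optimizer):
    """Stage 1: capture the data distribution p(x) from unlabeled data."""
    for x0 in unlabeled_loader:
        optimizer.zero_grad()
        diffusion_loss(model, x0).backward()
        optimizer.step()
```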
From a theoretical perspective, our method casts controllable fine-tuning as the following constrained optimization of the learning objective, where L′(θ) is the text-conditioned generation loss on labeled data and L(θ) is the unconditional loss used during pre-training:
min_θ L′(θ)
s.t. L(θ) ≤ ξ,
ξ = L(θ̂), where θ̂ = argmin_θ L(θ)
- The first line is the main learning objective of a controllable generative model.
- The second and third lines keep the parameters from deviating far from the original parameter space learned during pre-training; a penalty-based sketch of how this constraint can be enforced follows this list.
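One practical way to enforce this constraint during fine-tuning is a Lagrangian-style penalty that activates only when the unconditional loss exceeds its pre-training level ξ. The sketch below reuses the illustrative diffusion_loss defined above; the fixed penalty weight lam is a hypothetical simplification chosen for illustration:

```python
import torch

def finetune_step(model, x0, text_emb, optimizer, xi, lam=1.0):
    """Stage 2: controllable fine-tuning with a constraint-style penalty.

    Minimizes the text-conditioned loss L′(θ) while penalizing any increase
    of the unconditional loss L(θ) beyond the pre-training level ξ (`xi`),
    which discourages catastrophic forgetting.
    """
    cond_loss = diffusion_loss(model, x0, text_emb)   # L′(θ) on labeled data
    uncond_loss = diffusion_loss(model, x0)           # L(θ) on the same batch
    penalty = torch.relu(uncond_loss - xi)            # nonzero only if the constraint is violated
    loss = cond_loss + lam * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```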
Key Innovations
- Unique Constraint-Based Optimization: Prevents overfitting and keeps the fine-tuned model faithful to the distribution learned during pre-training, crucially preserving prior knowledge.
- Theoretical Backing: Confidence bounds on the constrained objective establish the method's reliability and effectiveness across low-resource environments.
- Comprehensive Experimentation: Across molecules, human motion, and time series, Text2Data showcases its versatility and superiority over current baselines.
Proven Results

Figure 2: Controllability evaluation on the molecule dataset for different proportions of paired training data. The green solid line corresponds to Text2Data; the two dashed lines are baselines, blue for EDM and orange for EDM-finetune. Properties of generated molecules are predicted by a classifier ϕc, and MAE is computed between the properties of generated molecules and the intended properties. Lower MAE indicates better performance.

Figure 3: Visualization of generated molecules as the polarizability specified in the textual description increases from “very low” to “very high”.
Across diverse datasets—including molecules (QM9), human motions (HumanML3D), and financial time series—Text2Data consistently demonstrates enhanced controllability and superior data quality compared to existing methods. Notably, it outperforms baseline diffusion models significantly in scenarios with sparse textual annotations, underscoring its potential to revolutionize AI applications in specialized domains.
For example, Figure 2 illustrates how the MAE between the properties of generated molecules and the intended properties evolves as the proportion of labeled training data rises. Text2Data outperforms EDM-finetune and EDM (the baseline model) on all properties by a remarkable margin. We also depict, in Figure 3, the molecules generated as the text descriptor for polarizability shifts from “very low” to “very high”. Polarizability (α) indicates a molecule's inclination to form an electric dipole moment under an external electric field, so as α rises we expect molecules with less symmetrical forms, as evidenced in Figure 3. This trend supports the validity of the molecules generated by Text2Data and its fine-grained controllability. Additional experimental results on more modalities can be found in our paper.
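For concreteness, here is a minimal sketch of how such a controllability metric can be computed, assuming a trained property predictor phi_c with a hypothetical interface (an illustration, not the evaluation code from our paper):

```python
import numpy as np

def controllability_mae(generated_samples, intended_props, phi_c):
    """MAE between predicted properties of generated samples and the
    properties requested in the text prompts.

    `phi_c` is a hypothetical property predictor mapping one generated
    sample to a scalar property value (e.g., polarizability α).
    """
    predicted = np.array([phi_c(sample) for sample in generated_samples])
    intended = np.array(intended_props)
    return float(np.mean(np.abs(predicted - intended)))
```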
Conclusion
Text2Data represents a significant leap forward in generative AI, particularly in low-resource scenarios. By effectively leveraging both unlabeled and labeled data, it addresses the critical challenge of data scarcity and enhances the controllability and quality of generated data. This innovative framework not only opens new avenues for research and application in specialized domains but also sets a new standard for generative AI models. As we continue to refine and expand Text2Data, we are confident that it will play a pivotal role in advancing the capabilities of AI systems across a wide range of industries and applications.
Explore More
- Read our paper
- Check out our code on GitHub
- Check out more AI Research blogs
- Salesforce AI Research Website
- Follow us on X: @SFResearch, @Salesforce
Acknowledgments
Full Author List: Shiyu Wang, Yihao Feng, Tian Lan, Ning Yu, Yu Bai, Ran Xu, Huan Wang, Caiming Xiong, Silvio Savarese