To address the low semantic relevance, blurred details, and inadequate structural integrity of images generated in text-to-image tasks, an enhanced DF-GAN model incorporating a self-attention mechanism is proposed. First, a BERT model is used to extract contextual semantic features from the text, and it is combined with the deep text-image fusion block to match deep text semantics with regional image features. Second, a self-attention module is introduced at the architecture level to supplement the convolution modules, with the aim of better capturing long-range and multi-level dependencies. Experimental results show that the proposed model not only strengthens the semantic relationship between the text and the image but also preserves the fine details and overall integrity of the generated image.
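As a rough illustration of how a self-attention module can supplement convolution, the sketch below applies scaled dot-product attention over the positions of a flattened feature map and adds the result back to the input through a residual weight. It is a minimal sketch and not the model described here: the function name `self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the residual weight `gamma` are all assumptions made for illustration, and the abstract does not specify the exact attention formulation used.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v, gamma):
    """Hypothetical self-attention over a flattened feature map.

    x      : N x C, one row per spatial position (N = H * W)
    w_q/w_k: C x d projections for queries and keys (assumed)
    w_v    : C x C projection for values (assumed)
    gamma  : residual weight; 0 leaves the convolutional features unchanged
    """
    q = matmul(x, w_q)                      # N x d
    k = matmul(x, w_k)                      # N x d
    v = matmul(x, w_v)                      # N x C
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))        # N x N: every position vs. every other
    attn = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(attn, v)               # N x C
    # Residual connection: attention supplements, rather than replaces, the input.
    out = [[xi + gamma * ci for xi, ci in zip(x_row, c_row)]
           for x_row, c_row in zip(x, context)]
    return out, attn


# Three positions with two channels each, identity projections (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(features, identity, identity, identity, gamma=0.5)
```

Because each row of `weights` spans all positions, every output location can draw on distant regions of the feature map, which a small convolution kernel cannot do in a single layer. Setting `gamma` to 0 recovers the input unchanged, so in this sketch the attention branch acts purely as an additive supplement.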