• Stovetop@lemmy.world
    10 months ago

    This is only going to be adding recent Reddit data.

    A growing share of which, I would wager, is already the product of LLMs trying to simulate actual content while selling something. It’s going to corrupt itself over time unless they figure out how to filter other LLM-generated content out of the input.

    • kromem@lemmy.world
      10 months ago

      It’s not really. There is a potential issue of model collapse when training on only synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, for cost reasons that research used weaker models than those typically used today, and separate research has shown that you can enhance models significantly using synthetic data from SotA models.

      The actual impact on future models will be minimal, and given the research to date, at least a bit of a mixture is probably even a good thing for future training.