The Data Problem.

With seemingly-SOTA models like Google’s Gemini generating wildly historically inaccurate images [1], the need for diverse data representation across all domains is more urgent than ever. The road to achieving this will be a long one, but a few things I think could help facilitate the transition include:

  1. Unlocking the potential of massive, unstructured datasets that sit unused in the data lakes and warehouses of large legacy companies that serve the general public.

    1. Related to this: more work on creating or improving data exchange pipelines, such as synthetic data generation for transaction verification (a toy example follows this list), secure data transfer, economic incentives structured around data exchanges, and so on.

  2. Creating a government body that oversees and regulates AI development. This body could set the industry standard for dataset fidelity and representation, as well as for acceptable uses of AI. (The fact that AI development is too sprawling to be fully regulated is not a valid argument against any regulation; if anything, it is an argument for more of it.) Such a body could also incentivize companies to improve the representativeness of their datasets through government policies or initiatives.
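
To make point 1.1 a little more concrete, here is a minimal sketch of what synthetic transaction data generation might look like. Everything in it (the Transaction schema, the category list, the generate_transactions function, the chosen distributions) is a hypothetical illustration I made up for this post, not any real pipeline or library; an actual system would fit its distributions to production data and add formal privacy guarantees.

```python
import random
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta

# Hypothetical schema for illustration only; a real pipeline would
# mirror the institution's actual transaction record format.
@dataclass
class Transaction:
    txn_id: str
    timestamp: str
    merchant_category: str
    amount: float

CATEGORIES = ["groceries", "transport", "utilities", "dining", "retail"]

def generate_transactions(n: int, seed: int = 42) -> list[Transaction]:
    """Draw n fake transactions from simple, hand-picked distributions.

    A production generator would fit these distributions to real data
    (ideally with guarantees like differential privacy) so that
    downstream verification logic can be tested without ever exposing
    a customer's records.
    """
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    txns = []
    for _ in range(n):
        txns.append(Transaction(
            txn_id=str(uuid.UUID(int=rng.getrandbits(128))),
            # Random moment within a 30-day window
            timestamp=(start + timedelta(seconds=rng.randrange(86_400 * 30))).isoformat(),
            merchant_category=rng.choice(CATEGORIES),
            # Log-normal gives the right-skewed shape typical of spending
            amount=round(rng.lognormvariate(3.0, 1.0), 2),
        ))
    return txns

if __name__ == "__main__":
    for t in generate_transactions(3):
        print(asdict(t))
```

The point of data like this is that two parties can exercise an entire verification pipeline end to end, agree on schemas and edge cases, and structure incentives around the exchange before any real customer data ever changes hands.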

I’m really curious to hear your thoughts! Feel free to email me (you can find it on my home page).

Till the next post,

Denise

References:

[1] NPR, Feb. 28, 2024: https://www.npr.org/2024/02/28/1234532775/google-gemini-offended-users-images-race