Le Chat's Out of the Bag: Mistral AI is Whiskering Away the Competition (Updates for Week of 2.20.24)
This week in AI: significant bias concerns within Google's Gemini, Mistral AI launches a new LLM and Le Chat, Reddit is licensing user data to train AI models, and Google launches its open-source model, Gemma.
Don’t have time to read? Watch the video briefing on YouTube.
Advancement #1: Mistral AI Launches Le Chat and a New LLM
Mistral AI, a Paris-based startup founded by alumni of Google DeepMind and Meta, is emerging as a formidable alternative to OpenAI with the launch of a new flagship large language model called Mistral Large. The company recently raised over $400 million in funding.
The company initially embraced an open-source spirit with its previous models, but has gradually shifted toward OpenAI's monetization strategy, offering access through a paid API that is about 20% cheaper than GPT-4 Turbo. Mistral Large also caters to a multilingual audience with support for English, French, Spanish, German, and Italian, and accommodates a 32k-token context window (compared to 128k for GPT-4 Turbo and 1 million for Gemini 1.5). On benchmarks it comes in second behind GPT-4, but keep in mind the company is only 10 months old.
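For those curious what that paid access looks like in practice, here's a minimal sketch of a chat request over plain HTTP. The endpoint path, response shape, and the model name "mistral-large-latest" are my assumptions based on Mistral's documentation at the time, so verify against their docs before relying on it.

```python
import os
import requests

# Minimal sketch of a chat request to Mistral's hosted API.
# Endpoint path, model name, and response shape are assumptions -- check Mistral's docs.
API_URL = "https://api.mistral.ai/v1/chat/completions"

headers = {
    "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "mistral-large-latest",  # assumed identifier for Mistral Large
    "messages": [
        {"role": "user", "content": "Summarize this week's AI news in one sentence."}
    ],
}

response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```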
Mistral AI also introduced its own take on ChatGPT, “Le Chat”. While still in beta (the company warns of potential quirks) and lacking web access, the platform is free to test out at chat.mistral.ai.
My Initial Thoughts: I don’t know if it’s the fact that they seem like the coolest team ever or that their logo looks straight out of the ’90s and reminds me of my childhood… but there's something inexplicably captivating about this company that has me hooked.
I first heard of Mistral about a month back, catching bits and pieces of whispers in the open-source arena. That little spark of interest turned into a full-on fire when I stumbled onto a Discord thread from one of their developers hinting at another open-source model they were launching (Mistral Next).
Then boom, they drop the news about Mistral Large and Le Chat, and suddenly, I'm all in. Seriously, Mistral, if you’re hiring, I will work for you (I’m half kidding).
They seem to be a lean but incredibly smart team, and I think they just may be the underdog to watch in this race. After all, Microsoft announced a partnership with them a few days ago, so it seems I’m not alone in that feeling.
Advancement #2: The Embedded Bias Concerns in Google's Gemini Model
In a recent development that has sparked widespread concern among critics and observers alike, Google's latest AI model, Gemini, has come under scrutiny for its inherent biases and the impact those biases may have on the public's perception of history and ethnicity. (As this is a highly sensitive topic, I will be providing a significant number of examples and references so you can come to your own conclusions. I’ll admit I’m a bit nervous writing about this one, but I do believe it is extremely important to cover.)
Users first took notice of Gemini’s bias when the model refused to generate any images of Caucasian individuals, going so far as to alter the ethnicity of figures like George Washington (Example 1). Furthermore, when explicitly asked to generate an image of a Caucasian family, the model cited its DEI (Diversity, Equity, and Inclusion) policies as a reason for refusal, yet this constraint was not applied to requests for images of families from other ethnic backgrounds within the exact same session (Example 2).
The controversy notably intensified with Gemini's refusal to denounce pedophilia, as well as its problematic responses to prompts comparing the societal impact of Elon Musk's social media activity to Adolf Hitler's historical atrocities, stating that it is up to the individual to decide which was more harmful and that “it is difficult to say who had a greater negative impact on society” (Example 3).
The AI's hesitancy to address such comparisons, coupled with its representation of historical figures and events through an inaccurate ethnic lens in its image generation, has revealed an alarmingly deep-seated bias within the algorithms Google developed. The Gemini dilemma has also raised pressing questions about the ethical programming of AI models, as Gemini’s output not only distorts historical truths but also underscores the potential for AI technologies to perpetuate misinformation and biases at the behest of their creators.
Google has apologized for the issues with Gemini and has temporarily halted its image generation capabilities.
How does Google proceed from here? Likely, there are two main ways this can play out:
1. Many companies (including Google and OpenAI) use interpretation engines: an additional, smaller model that sits in front of the larger model and acts as a tunable interface (see the sketch after this list). Since Google has stated that Gemini's image generation should be available again within a few weeks, there is a strong chance they will be able to adjust this interpretation layer without much hassle.
2. Experts worry that Google may have trained the model itself with a significant amount of bias, to the point that they risk having to retrain from scratch. A model's usefulness is directly tied to the accuracy and truthfulness of its output; if the training process was tainted, it could be hard for them to backpedal.
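To make the "interpretation engine" idea concrete, here's a purely illustrative sketch. The function names and policy text are hypothetical (this is not Google's actual implementation): a small, tunable layer rewrites prompts and filters responses before and after they reach the large model.

```python
# Purely illustrative sketch of an "interpretation engine": a small, tunable
# layer that sits in front of a large model. Names and policy text are
# hypothetical -- not Google's actual implementation.

POLICY_PREAMBLE = "Answer factually. If the request is disallowed, refuse briefly."

def rewrite_prompt(user_prompt: str) -> str:
    # The tunable part: adjust policy text without touching the base model.
    return f"{POLICY_PREAMBLE}\n\nUser request: {user_prompt}"

def postprocess(raw_response: str) -> str:
    # Filter or reshape the base model's output before the user sees it.
    return raw_response.strip()

def answer(user_prompt: str, base_model) -> str:
    # base_model is any callable that maps a prompt string to a completion.
    return postprocess(base_model(rewrite_prompt(user_prompt)))
```

The distinction matters: scenario 1 only requires retuning this outer layer, whereas scenario 2 means the bias lives in the base model's weights and can't be patched from the outside.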
My Initial Thoughts: A few weeks ago, in a different blog post where I began to touch on the realm of Safe AI, I laid out what I believe to be two fundamental principles crucial for establishing guardrails around AI technologies:
1. The necessity for information to be factual and vetted;
2. The importance of preventing access to information from being limited as a means of societal influence.
And while I am quite concerned on multiple fronts, my biggest conclusion is that it would be better for a system to refuse to answer a question altogether than to present a deeply biased answer as truth.
We've all encountered situations where AI models withhold information, choosing silence over answers in the name of Safe AI and ethical standards dictated by corporate rulebooks. Frustrating as it may be to hit these walls of refusal, there's a comforting aspect to this silence: it safeguards us against being deceived or misled.
Yet the Gemini episode casts a revealing spotlight on a much more serious problem: AI systems dispensing responses infused with bias that has been intricately embedded into their core by the entities and people who create them. Presenting distorted information as helpful guidance is a far more dangerous outcome than any outright refusal to provide an answer.
Furthermore, the significant infiltration of bias within Gemini prompts another critical question: to what extent has Google influenced the informational landscape that users encounter through its search engine every day?
Before, Google was able to hide its bias more easily behind search result rankings (if they didn’t like a source, they could simply move it down, or move it up if it aligned with their principles). Despite concerns about this even influencing elections (by de-prioritizing conservative sources), they have always been able to hide behind the algorithm, and were never forced to show how it ranks results.
Both scenarios not only underscore the prevalence of embedded bias, but also put a finer point on the broader implications for trust in Google's ability to fairly curate and present information.
These developments hint at the possibility that bias isn't an isolated incident but a pervasive element within Google's ecosystem that is now coming into sharper focus… casting a long shadow over the reliability of one of the world's most utilized information gateways. If you need me, I’ll be using literally any other search engine. DuckDuckGo is even on the list (and they actually have a privacy-focused browser now).
Advancement #3: Reddit’s IPO Filing Reveals It’s Licensing User Data to Train AI Models
Reddit's recent IPO filing made waves in financial circles last week, shedding light on a previously undisclosed facet of its revenue stream: data licensing agreements with major AI model trainers. Among these is a lucrative exclusive arrangement with Google valued at a staggering $60 million annually. The IPO prospectus underscores the pivotal role these agreements play in Reddit's financial outlook, with a total licensing contract value (spanning two to three years) of approximately $203 million and projected licensing revenue of $66.4 million for the current fiscal year.
Reddit's decision to capitalize on its vast repository of over 1 billion posts and 16 billion comments through data licensing agreements reflects a strategic shift toward monetizing its user-generated content. Reddit is not alone here: similar deals include OpenAI's licensing agreements with Shutterstock and Axel Springer (the owner of Politico and Business Insider).
My Initial Thoughts: As someone deeply interested in data privacy, my mind immediately turns to the implications for user rights and the privacy of platform users. While Reddit's move to monetize its data assets through licensing agreements with AI vendors like Google and OpenAI may be financially lucrative, it raises important questions about consent and control over personal information. Users contribute to these platforms with the expectation that their data will be used responsibly and in accordance with privacy regulations. In reality, their content can be packaged and sold without explicit consent, and many Reddit users have voiced their frustration in threads across the platform.
It's imperative for both companies and regulators to prioritize transparency and user empowerment, ensuring that individuals retain agency over their own data and how it's utilized in the AI-driven marketplace. Unfortunately, that was not the case with Reddit.
Advancement #4: Google Launches Open-Source Gemma Model
Google has unveiled Gemma, an open-source AI model built from the same research and technology behind its Gemini models. Gemma comes in two sizes, 2B and 7B parameters, and despite its compact size it outperforms some larger models and other open alternatives. However, its performance currently falls below that of Mistral-7B, particularly in reasoning and ethical decision-making, although it excels in specific technical tasks such as coding and certain areas of mathematics and science. The Gemma announcement marks Google's return to releasing new research into the open-source ecosystem, with the models developed by Google DeepMind and other teams within the company.
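If you want to kick the tires yourself, the weights are distributed through Hugging Face. Here's a minimal sketch using the transformers library; the model ID "google/gemma-2b" is my assumption of the launch identifier, and you'll need to accept Google's terms on the model page before the download will work.

```python
# Minimal sketch of running Gemma locally with Hugging Face transformers.
# The model ID "google/gemma-2b" is an assumption -- verify it on the model page,
# and note that Google's license must be accepted there before downloading.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The most interesting AI news this week was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```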
My Initial Thoughts: Although Google refers to Gemma as an LLM, I've noticed a trend (something I first saw from Intento) toward categorizing smaller-parameter models as SLMs (Small Language Models) rather than LLMs… and I endorse this shift.
Nonetheless, Google's decision to release open models like Gemma in an era when cutting-edge LLMs are typically proprietary reflects a strategic shift toward acknowledging the diverse needs of developers. With Google previously lacking an option in the open-source space, developers would opt for alternative open models (such as Mistral-7B or Llama-2) instead of building within the Google ecosystem. This move not only fosters loyalty to Google's platform but also promotes the use of Google Cloud services for hosting models (thus increasing revenue).
One last thing. Stirling has something he’d like to say to you all.
See you next week, everyone! :)