Text Generation
Enhance your application's capabilities by leveraging powerful open-source models through our API. Select the right model tailored to your specific use case with detailed guidelines and best practices below.
Available Models
- DeepSeek-R1 ($0.0/1M tokens): SOTA Reasoning Model for the most challenging tasks
- llama-3.3-70b ($0.8/1M tokens): Meta's largest model, best for complex tasks
- llama-3.2-8b ($0.2/1M tokens): Efficient model for development and testing
- mistral-nemo ($0.3/1M tokens): Great performance/cost ratio for production
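Because the examples in this guide use the OpenAI Python SDK with a custom base URL, you may also be able to query the model list programmatically. The sketch below assumes the standard OpenAI-compatible /models endpoint is exposed at the same base URL; availability may differ for your account.
from openai import OpenAI

# Assumes the standard OpenAI-compatible /models endpoint is available.
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

for model in client.models.list():
    print(model.id)  # e.g. "llama-3.3-70b", "mistral-nemo"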
Making API Calls
Chat Completions
from openai import OpenAI

# Point the OpenAI client at the API and authenticate with your key.
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,   # randomness of sampling (0-1)
    max_tokens=500,    # cap on tokens generated in the reply
)

print(response.choices[0].message.content)
Parameters
- model: Model ID (e.g., "llama-3.3-70b", "mistral-nemo")
- messages: Array of messages in the conversation
- temperature: Controls randomness (0-1)
- max_tokens: Maximum tokens in the response
- top_p: Nucleus sampling parameter
- presence_penalty: Penalizes new tokens based on their presence in the text so far
- frequency_penalty: Penalizes new tokens based on their frequency in the text so far
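For illustration, here is a request that sets several of these parameters together, reusing the client configured above; the specific values are arbitrary starting points rather than recommendations.
response = client.chat.completions.create(
    model="mistral-nemo",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize nucleus sampling in one sentence."}
    ],
    temperature=0.4,        # lower values make output more deterministic
    max_tokens=200,         # hard cap on tokens generated in the reply
    top_p=0.9,              # nucleus sampling over the top 90% probability mass
    presence_penalty=0.2,   # discourage tokens already present in the text
    frequency_penalty=0.3,  # discourage tokens in proportion to how often they appear
)

print(response.choices[0].message.content)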
Model Selection Guide
Our cloud platform provides access to a range of open-source models optimized for different application needs. You can use different models for different parts of your application without managing multiple providers: for latency-sensitive paths, use a smaller model like llama-3.2-8b, and where accuracy is essential, switch to larger models like deepseek-r1 and llama-3.3-70b (a routing sketch follows below). This flexibility ensures you get the best performance without sacrificing efficiency.
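As an illustration of this pattern, the sketch below routes a request to a smaller or larger model based on a caller-supplied hint. The helper function and routing rule are hypothetical, not part of the API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

def complete(prompt, needs_deep_reasoning=False):
    # Hypothetical helper: pick a model sized for the task.
    model = "deepseek-r1" if needs_deep_reasoning else "llama-3.2-8b"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Fast path for simple lookups; the heavier model only when accuracy matters most.
print(complete("Give three synonyms for 'fast'."))
print(complete("Plan a step-by-step migration from REST to gRPC.", needs_deep_reasoning=True))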
Top 3 Benefits of Multiple Model Sizes in a Single API
1. Optimized Performance & Cost Efficiency - Use lightweight models for quick responses and heavy-duty models only when needed, reducing compute costs while maintaining quality.
2. Seamless Scalability & Adaptability - Adjust dynamically to real-time application needs without switching providers or modifying infrastructure.
3. Faster Development & Deployment - Streamline your workflow by integrating various model sizes under one API, avoiding complex model-switching logic and reducing development overhead.
Model Details
deepseek-r1
- Ideal for: Agentic workflows, complex planning, and accurate reasoning tasks.
- Strengths: Excels in mathematics, programming, and logical reasoning.
- Architecture: Uses a mixture-of-experts approach with efficient parameter activation for optimal performance.
llama-3.3-70b
- Ideal for: Complex reasoning tasks, enterprise applications, research, and analysis.
- Strengths: High accuracy on complex reasoning tasks, with multilingual support.
- Architecture: A 70-billion-parameter model fine-tuned for instruction-based tasks with a long context window.
- Performance: Optimized for coding and reasoning.
llama-3.2-8b
- Ideal for: Development, testing, cost-sensitive applications, and quick prototyping.
- Strengths: Optimized for environments with limited computational resources, ensuring efficiency.
- Architecture: An 8-billion-parameter model supporting an extended context length.
mistral-nemo
- Ideal for: Production chat applications, general-purpose tasks, and customer support bots.
- Strengths: Offers a balanced performance-to-cost ratio for scalable deployments.
- Architecture: Designed for efficient inference, ensuring responsive interactions.
Streaming and non-streaming responses
Non-streaming
In non-streaming mode, the model generates the entire response before transmitting it to the client. This approach is straightforward but can increase latency, especially for complex queries that take longer to process: users must wait for the full response to be generated, which can affect the overall experience. A non-streaming example appears further below; for more implementation details, refer to the Getting Started section.
Streaming
Conversely, streaming responses allow the model to send data incrementally as it's generated, token by token. This method enhances user experience by providing immediate feedback, reducing perceived latency, and keeping users engaged. Streaming is particularly beneficial in chat interfaces, where real-time interaction is crucial. To enable streaming, set the stream parameter to True when making a request to the model. This configuration returns a stream of server-sent events (SSE), allowing the client to process each chunk of data as it arrives. Here's a basic example of how to implement streaming in Python:
Streaming example
from openai import OpenAI

# Configure the client (same base URL and key as the earlier examples).
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

# stream=True returns an iterator of server-sent event chunks.
stream = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "How are you doing today?"}
    ],
    stream=True,
)

# Print each content token as it arrives.
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
When you stream a chat completion, each response chunk has a delta field rather than a message field. The delta field can hold a role token, a content token, or nothing, as sketched below.
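If you also need the complete message after streaming, one common pattern (not specific to this API) is to accumulate the content deltas while printing them; this replaces the loop in the streaming example above.
full_reply = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        print(delta.content, end="", flush=True)  # live output as tokens arrive
        full_reply.append(delta.content)          # keep the pieces for later use

print()  # newline once the stream is finished
assembled_message = "".join(full_reply)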
Non-streaming example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

# Without stream=True, the full response is returned in a single payload.
completion = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "How are you doing today?"}
    ],
)

print(completion.choices[0].message.content)
Best Practices
1. Model Selection
- deepseek-r1: Best for tasks requiring intricate reasoning, advanced coding, and mathematical problem-solving.
- llama-3.3-70b: Suitable for enterprise applications demanding robust reasoning and multilingual support.
- llama-3.2-8b: Ideal for development, testing, and cost-effective applications.
- mistral-nemo: Optimal for deploying scalable chat solutions and customer support systems.
2. Prompt Engineering
- Specificity: Provide clear, detailed instructions for more precise responses.
- System Messages: Use system-level directives to set context and guide behavior.
- Conversation Structure: Maintain logical flow and context in interactions.
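As a brief illustration of these points, the request below (reusing the client from the earlier examples) sets a specific system directive and keeps prior turns in the messages array so the model sees the conversation context; the scenario is invented for the example.
messages = [
    # System message sets context and constrains behavior.
    {"role": "system", "content": "You are a support agent for an online bookstore. Answer in two sentences or fewer."},
    # Earlier turns are kept in order so the model has the full conversation flow.
    {"role": "user", "content": "Do you ship to Canada?"},
    {"role": "assistant", "content": "Yes, we ship to Canada with standard delivery in 5-7 business days."},
    # The new question is specific and continues the same thread.
    {"role": "user", "content": "How much is express shipping to Toronto?"},
]

response = client.chat.completions.create(model="mistral-nemo", messages=messages)
print(response.choices[0].message.content)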
3. Performance Optimization
- Max Tokens: Set appropriate limits to balance response completeness with efficiency.
- Streaming: Enable response streaming for faster initial feedback.
- Retry Logic: Implement mechanisms for handling incomplete or suboptimal responses.
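One possible shape for that retry logic is a simple exponential backoff around the request; the retry count and delays below are illustrative, and in real code you would catch the SDK's specific error types (see Error Handling below).
import time

def create_with_retry(client, max_retries=3, **kwargs):
    # Illustrative sketch: retry a chat completion with exponential backoff.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as exc:  # narrow to specific error types in production
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Request failed ({exc}); retrying in {wait}s")
            time.sleep(wait)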
4. Cost Management
- Token Monitoring: Track token usage to optimize high-consumption areas (see the sketch after this list).
- Model Selection: Choose models based on application needs to minimize costs.
- Usage Limits: Set thresholds and alerts to prevent unexpected expenses.
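Responses from OpenAI-compatible chat completion endpoints usually include a usage object; assuming this API populates it, a simple way to monitor consumption per request is shown below (reusing the client from the earlier examples).
response = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[{"role": "user", "content": "Ping"}],
)

# Aggregate these counts across requests to find high-consumption areas.
print("prompt tokens:    ", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
print("total tokens:     ", response.usage.total_tokens)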
By selecting the right model and following these best practices, you can maximize the efficiency and performance of your applications while keeping costs under control.
Error Handling
import openai

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement."}
        ]
    )
except openai.RateLimitError as e:
    # Catch the more specific error first; RateLimitError is a subclass of APIError.
    print(f"Rate Limit Error: {e}")
except openai.APIError as e:
    print(f"API Error: {e}")
Next Steps
- Explore Embeddings
- Try building a chatbot
- Learn about streaming responses