Text Generation
Enhance your application's capabilities by leveraging powerful open-source models through our API. Select the right model tailored to your specific use case with detailed guidelines and best practices below.
Available Models
- DeepSeek-R1 ($0.0/1M tokens): SOTA Reasoning Model for the most challenging tasks
- llama-3.3-70b ($0.8/1M tokens): Meta's largest model, best for complex tasks
- llama-3.2-8b ($0.2/1M tokens): Efficient model for development and testing
- mistral-nemo ($0.3/1M tokens): Great performance/cost ratio for production
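Because the examples in this guide use the OpenAI Python SDK with a custom base URL, you may also be able to query the model list programmatically. The sketch below assumes the standard OpenAI-compatible /models endpoint is exposed at the same base URL; availability may differ for your account.
from openai import OpenAI

# Assumes the standard OpenAI-compatible /models endpoint is available.
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

for model in client.models.list():
    print(model.id)  # e.g. "llama-3.3-70b", "mistral-nemo"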
Making API Calls
Chat Completions
from openai import OpenAI

# Point the OpenAI client at the API and authenticate with your key.
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,   # randomness of sampling (0-1)
    max_tokens=500,    # cap on tokens generated in the reply
)

print(response.choices[0].message.content)
Parameters
- model: Model ID (e.g., "llama-3.3-70b", "mistral-nemo")
- messages: Array of messages in the conversation
- temperature: Controls randomness (0-1)
- max_tokens: Maximum tokens in the response
- top_p: Nucleus sampling parameter
- presence_penalty: Penalizes new tokens based on their presence in the text so far
- frequency_penalty: Penalizes new tokens based on their frequency in the text so far
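For illustration, here is a request that sets several of these parameters together, reusing the client configured above; the specific values are arbitrary starting points rather than recommendations.
response = client.chat.completions.create(
    model="mistral-nemo",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize nucleus sampling in one sentence."}
    ],
    temperature=0.4,        # lower values make output more deterministic
    max_tokens=200,         # hard cap on tokens generated in the reply
    top_p=0.9,              # nucleus sampling over the top 90% probability mass
    presence_penalty=0.2,   # discourage tokens already present in the text
    frequency_penalty=0.3,  # discourage tokens in proportion to how often they appear
)

print(response.choices[0].message.content)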
Model Selection Guide
Our cloud platform provides access to a range of open-source models optimized for different application needs. You can use different models for different parts of your application without managing multiple providers: for latency-sensitive paths, use a smaller model like llama-3.2-8b, and where accuracy is essential, switch to larger models like deepseek-r1 and llama-3.3-70b (a routing sketch follows below). This flexibility ensures you get the best performance without sacrificing efficiency.
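As an illustration of this pattern, the sketch below routes a request to a smaller or larger model based on a caller-supplied hint. The helper function and routing rule are hypothetical, not part of the API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

def complete(prompt, needs_deep_reasoning=False):
    # Hypothetical helper: pick a model sized for the task.
    model = "deepseek-r1" if needs_deep_reasoning else "llama-3.2-8b"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Fast path for simple lookups; the heavier model only when accuracy matters most.
print(complete("Give three synonyms for 'fast'."))
print(complete("Plan a step-by-step migration from REST to gRPC.", needs_deep_reasoning=True))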
Top 3 Benefits of Multiple Model Sizes in a Single API
1. Optimized Performance & Cost Efficiency - Use lightweight models for quick responses and heavy-duty models only when needed, reducing compute costs while maintaining quality.
2. Seamless Scalability & Adaptability - Adjust dynamically to real-time application needs without switching providers or modifying infrastructure.
3. Faster Development & Deployment - Streamline your workflow by integrating various model sizes under one API, avoiding complex model-switching logic and reducing development overhead.
Model Details
deepseek-r1
- Ideal for: Agentic workflows, complex planning, and accurate reasoning tasks.
- Strengths: Excels in mathematics, programming, and logical reasoning.
- Architecture: Uses a mixture-of-experts approach with efficient parameter activation for optimal performance.
llama-3.3-70b
- Ideal for: Complex reasoning tasks, enterprise applications, research, and analysis.
- Strengths: High accuracy on complex reasoning tasks, with multilingual support.
- Architecture: A 70-billion-parameter model fine-tuned for instruction-based tasks with a long context window.
- Performance: Optimized for coding and reasoning.
llama-3.2-8b
- Ideal for: Development, testing, cost-sensitive applications, and quick prototyping.
- Strengths: Optimized for environments with limited computational resources, ensuring efficiency.
- Architecture: An 8-billion-parameter model supporting an extended context length.
mistral-nemo
- Ideal for: Production chat applications, general-purpose tasks, and customer support bots.
- Strengths: Offers a balanced performance-to-cost ratio for scalable deployments.
- Architecture: Designed for efficient inference, ensuring responsive interactions.
Streaming and non-streaming responses
Non-streaming
In non-streaming mode, the model generates the entire response before transmitting it to the client. This approach is straightforward but can increase latency, especially for complex queries that take longer to process: users must wait for the full response to be generated, which can affect the overall experience. A non-streaming example appears further below; for more implementation details, refer to the Getting Started section.
Streaming
Conversely, streaming responses allow the model to send data incrementally as it's generated, token by token. This method enhances user experience by providing immediate feedback, reducing perceived latency, and keeping users engaged. Streaming is particularly beneficial in chat interfaces, where real-time interaction is crucial. To enable streaming, set the stream parameter to True when making a request to the model. This configuration returns a stream of server-sent events (SSE), allowing the client to process each chunk of data as it arrives. Here's a basic example of how to implement streaming in Python:
Streaming example
from openai import OpenAI

# Configure the client (same base URL and key as the earlier examples).
client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

# stream=True returns an iterator of server-sent event chunks.
stream = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "How are you doing today?"}
    ],
    stream=True,
)

# Print each content token as it arrives.
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
When you stream a chat completion, each response chunk has a delta field rather than a message field. The delta field can hold a role token, a content token, or nothing, as sketched below.
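If you also need the complete message after streaming, one common pattern (not specific to this API) is to accumulate the content deltas while printing them; this replaces the loop in the streaming example above.
full_reply = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        print(delta.content, end="", flush=True)  # live output as tokens arrive
        full_reply.append(delta.content)          # keep the pieces for later use

print()  # newline once the stream is finished
assembled_message = "".join(full_reply)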
Non-streaming example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brilliantai.co",
    api_key="your-api-key",
)

# Without stream=True, the full response is returned in a single payload.
completion = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "How are you doing today?"}
    ],
)

print(completion.choices[0].message.content)
Best Practices
1. Model Selection
- deepseek-r1: Best for tasks requiring intricate reasoning, advanced coding, and mathematical problem-solving.
- llama-3.3-70b: Suitable for enterprise applications demanding robust reasoning and multilingual support.
- llama-3.2-8b: Ideal for development, testing, and cost-effective applications.
- mistral-nemo: Optimal for deploying scalable chat solutions and customer support systems.
2. Prompt Engineering
- Specificity: Provide clear, detailed instructions for more precise responses.
- System Messages: Use system-level directives to set context and guide behavior.
- Conversation Structure: Maintain logical flow and context in interactions.
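As a brief illustration of these points, the request below (reusing the client from the earlier examples) sets a specific system directive and keeps prior turns in the messages array so the model sees the conversation context; the scenario is invented for the example.
messages = [
    # System message sets context and constrains behavior.
    {"role": "system", "content": "You are a support agent for an online bookstore. Answer in two sentences or fewer."},
    # Earlier turns are kept in order so the model has the full conversation flow.
    {"role": "user", "content": "Do you ship to Canada?"},
    {"role": "assistant", "content": "Yes, we ship to Canada with standard delivery in 5-7 business days."},
    # The new question is specific and continues the same thread.
    {"role": "user", "content": "How much is express shipping to Toronto?"},
]

response = client.chat.completions.create(model="mistral-nemo", messages=messages)
print(response.choices[0].message.content)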
3. Performance Optimization
- Max Tokens: Set appropriate limits to balance response completeness with efficiency.
- Streaming: Enable response streaming for faster initial feedback.
- Retry Logic: Implement mechanisms for handling incomplete or suboptimal responses.
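One possible shape for that retry logic is a simple exponential backoff around the request; the retry count and delays below are illustrative, and in real code you would catch the SDK's specific error types (see Error Handling below).
import time

def create_with_retry(client, max_retries=3, **kwargs):
    # Illustrative sketch: retry a chat completion with exponential backoff.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as exc:  # narrow to specific error types in production
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Request failed ({exc}); retrying in {wait}s")
            time.sleep(wait)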
4. Cost Management
- Token Monitoring: Track token usage to optimize high-consumption areas (see the sketch after this list).
- Model Selection: Choose models based on application needs to minimize costs.
- Usage Limits: Set thresholds and alerts to prevent unexpected expenses.
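Responses from OpenAI-compatible chat completion endpoints usually include a usage object; assuming this API populates it, a simple way to monitor consumption per request is shown below (reusing the client from the earlier examples).
response = client.chat.completions.create(
    model="llama-3.2-8b",
    messages=[{"role": "user", "content": "Ping"}],
)

# Aggregate these counts across requests to find high-consumption areas.
print("prompt tokens:    ", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
print("total tokens:     ", response.usage.total_tokens)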
By selecting the right model and following these best practices, you can maximize the efficiency and performance of your applications while keeping costs under control.
Error Handling
import openai

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement."}
        ]
    )
except openai.RateLimitError as e:
    # Catch the more specific error first; RateLimitError is a subclass of APIError.
    print(f"Rate Limit Error: {e}")
except openai.APIError as e:
    print(f"API Error: {e}")
Next Steps
- Explore Embeddings
- Try building a chatbot
- Learn about streaming responses