OpenAI
Call the OpenAI API from your FlareX app — chat completions, streaming, embeddings, retries, and cost control.
OpenAI's API is one HTTP call to /v1/chat/completions (or its successors). The SDK is a thin wrapper. This page covers what trips most people up: streaming, retries, token limits, and cost control.
API key
Get one at platform.openai.com → API keys. Add to Secrets as OPENAI_API_KEY.
For Claude, follow the Anthropic walkthrough instead: same shape, different env var (ANTHROPIC_API_KEY).
Pattern 1: Single completion
Smallest useful integration:
Add a /summarize endpoint. POST { text: string } → { summary: string }.
Use gpt-4o-mini. Truncate input to 8K tokens. Cap output at 200 tokens.
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Inside the /summarize handler, where `text` is the field from the request body.
const resp = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  max_tokens: 200,
  messages: [
    { role: 'system', content: 'Summarize the user\'s text in 2-3 sentences.' },
    // 32,000 characters is roughly 8K tokens at about 4 characters per token.
    { role: 'user', content: text.slice(0, 32_000) },
  ],
});
return { summary: resp.choices[0]!.message.content };
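The slice(0, 32_000) in the snippet above stands in for the 8K-token limit by assuming about 4 characters per token. Here is a minimal sketch that makes that assumption explicit. The names truncateToTokenBudget and CHARS_PER_TOKEN are made up for illustration, and the ratio is only a rule of thumb for English text, so use a real tokenizer when you need exact counts.

```typescript
// Rough heuristic: English text averages about 4 characters per token.
// This is an assumption, not an exact count.
const CHARS_PER_TOKEN = 4;

/** Truncate `text` so it stays within roughly `maxTokens` tokens. */
function truncateToTokenBudget(text: string, maxTokens: number): string {
  const maxChars = Math.max(0, Math.floor(maxTokens)) * CHARS_PER_TOKEN;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}

// 8K tokens works out to 32,000 characters, matching the slice above.
const input = truncateToTokenBudget('x'.repeat(50_000), 8_000);
console.log(input.length); // 32000
```

A character-based cut is cheap and good enough for a summarizer. It can overshoot the real token count on code or non-English text.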
Pattern 2: Streaming
For chat-like interfaces, streaming makes the response feel instant. Use SSE end-to-end:
app.get('/chat-stream', async (req, reply) => {
  reply.raw.writeHead(200, {
    'content-type': 'text/event-stream',
    'cache-control': 'no-cache',
    'connection': 'keep-alive',
  });
  const stream = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    stream: true,
    messages: [...],
  });
  // Abort the upstream request if the client disconnects, so you stop paying for it.
  reply.raw.on('close', () => stream.controller.abort());
  for await (const chunk of stream) {
    // Each chunk carries an incremental piece of the reply (possibly empty).
    const delta = chunk.choices[0]?.delta?.content ?? '';
    if (delta) reply.raw.write(`data: ${JSON.stringify({ delta })}\n\n`);
  }
  reply.raw.write('data: [DONE]\n\n');
  reply.raw.end();
});
Tell FlareX:
Add a streaming /chat endpoint. SSE response. Frame format: data: {json}\n\n.
End with data: [DONE]. On client disconnect, abort the OpenAI stream
to stop the cost meter.
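If you forget to abort on client disconnect, the OpenAI request keeps running and you keep paying until it completes. Always wire up reply.raw.on('close', () => stream.controller.abort()). Listen on the response rather than the request: on recent Node versions the request's close event fires as soon as the request has been read, not when the client goes away.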
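On the receiving side, the frame format above needs a small parser, because network chunks can split a frame anywhere. The following is a minimal sketch, not a full SSE implementation. It assumes every frame is a single data: line carrying { delta } JSON as produced by the endpoint above, and createSseParser is a hypothetical name.

```typescript
/** Result of feeding one network chunk to the parser. */
interface ParsedChunk {
  deltas: string[]; // text fragments decoded from complete frames
  done: boolean;    // true once the [DONE] sentinel has been seen
}

/**
 * Incremental parser for `data: {json}\n\n` frames terminated by
 * `data: [DONE]\n\n`. Incomplete input is buffered until the blank
 * line that closes the frame arrives.
 */
function createSseParser(): (chunk: string) => ParsedChunk {
  let buffer = '';
  let done = false;
  return (chunk: string): ParsedChunk => {
    buffer += chunk;
    const deltas: string[] = [];
    let boundary = buffer.indexOf('\n\n');
    while (boundary !== -1 && !done) {
      const frame = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      if (frame.startsWith('data: ')) {
        const payload = frame.slice('data: '.length);
        if (payload === '[DONE]') {
          done = true;
        } else {
          const parsed = JSON.parse(payload) as { delta?: string };
          if (typeof parsed.delta === 'string') deltas.push(parsed.delta);
        }
      }
      boundary = buffer.indexOf('\n\n');
    }
    return { deltas, done };
  };
}

// Usage: a frame split across two chunks is emitted only once it is complete.
const parse = createSseParser();
const first = parse('data: {"delta":"Hel');
const second = parse('lo"}\n\ndata: [DONE]\n\n');
console.log(first.deltas.length, second.deltas.join(''), second.done); // 0 Hello true
```

In a browser you would feed it decoded chunks from the fetch response body reader and append each delta to the UI.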
Pattern 3: Tool use / function calling
For structured output, use OpenAI's tool calling rather than parsing free-text:
Add a /classify endpoint. POST { text: string } → { category, confidence }.
Use gpt-4o-mini with tool calling — declare a `submit_classification`
tool with category enum and confidence number, force tool_choice to that
tool, parse the args from the response.
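This is far more reliable than asking the model to "respond with JSON": the arguments follow a declared schema instead of a free-text instruction. Set strict: true on the function definition if you need the arguments guaranteed to match the schema, and validate the parsed arguments either way.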
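The prompt above could produce something like the following. This is a sketch under assumptions: the category list is hypothetical, and ToolCallResponse is a minimal structural view of the response fields the code reads, not the SDK's own type. buildClassifyRequest returns a plain object in the shape chat.completions.create expects, and parseClassification reads the forced tool call back out.

```typescript
// Hypothetical category list; substitute your own labels.
const CATEGORIES = ['billing', 'bug', 'feature_request', 'other'] as const;
type Category = (typeof CATEGORIES)[number];

interface Classification {
  category: Category;
  confidence: number;
}

/** Request body that declares the tool and forces the model to call it. */
function buildClassifyRequest(text: string) {
  return {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system' as const, content: 'Classify the user\'s text.' },
      { role: 'user' as const, content: text },
    ],
    tools: [
      {
        type: 'function' as const,
        function: {
          name: 'submit_classification',
          description: 'Record the category and confidence for the text.',
          strict: true,
          parameters: {
            type: 'object',
            properties: {
              category: { type: 'string', enum: [...CATEGORIES] },
              confidence: { type: 'number' },
            },
            required: ['category', 'confidence'],
            additionalProperties: false,
          },
        },
      },
    ],
    // Forcing tool_choice means the reply is a tool call, not free text.
    tool_choice: {
      type: 'function' as const,
      function: { name: 'submit_classification' },
    },
  };
}

/** Minimal structural view of the response fields read below. */
interface ToolCallResponse {
  choices: Array<{
    message: { tool_calls?: Array<{ function: { name: string; arguments: string } }> };
  }>;
}

/** Extract and validate the tool arguments; throws if anything is off. */
function parseClassification(resp: ToolCallResponse): Classification {
  const call = resp.choices[0]?.message.tool_calls?.[0];
  if (!call || call.function.name !== 'submit_classification') {
    throw new Error('Model did not call submit_classification');
  }
  const args = JSON.parse(call.function.arguments) as Partial<Classification>;
  const category = CATEGORIES.find((c) => c === args.category);
  const confidence = args.confidence;
  if (!category || typeof confidence !== 'number' || confidence < 0 || confidence > 1) {
    throw new Error('Tool arguments failed validation');
  }
  return { category, confidence };
}
```

The schema leaves the 0 to 1 range check to parseClassification, so the handler still rejects out-of-range confidence values. Depending on your SDK version you may need a cast when passing the real response to parseClassification.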
Cost control
Three knobs, in order of effectiveness:
1. Pick a smaller model
gpt-4o-mini is more than an order of magnitude cheaper per token than gpt-4o (roughly 15 to 30×, depending on current pricing) and good enough for most jobs (classification, summarization, simple drafting). Don't reach for the flagship model unless the smaller one is actually failing.
2. Cap max_tokens
The output cost is per-token. If your users only need a paragraph, cap at 200 — not 4096.
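To see what the cap buys you, here is a worked sketch of the worst case, where every request uses its full max_tokens. The $0.60 per 1M output tokens figure is a placeholder for illustration, not a quoted price, and worstCaseOutputCostUsd is a hypothetical helper.

```typescript
/** Upper bound on output spend: assumes every request uses its full max_tokens. */
function worstCaseOutputCostUsd(
  maxTokens: number,
  requests: number,
  usdPerMillionOutputTokens: number,
): number {
  return (maxTokens * requests * usdPerMillionOutputTokens) / 1_000_000;
}

// Placeholder price; check the current pricing page for real numbers.
const PRICE_PER_MILLION = 0.6;
console.log(worstCaseOutputCostUsd(200, 1_000_000, PRICE_PER_MILLION).toFixed(2));  // 120.00
console.log(worstCaseOutputCostUsd(4096, 1_000_000, PRICE_PER_MILLION).toFixed(2)); // 2457.60
```

Real spend is usually lower than the bound because most replies stop early, but the cap is what limits the damage from a runaway response.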
3. Cache aggressively
An identical request can reuse an earlier response instead of paying for a new one. Cache by content hash:
import crypto from 'node:crypto';

// Stable cache key: SHA-256 of the serialized messages.
function hashKey(messages: unknown[]) {
  return crypto.createHash('sha256').update(JSON.stringify(messages)).digest('hex');
}

// Inside your handler; `redis` and `openaiCall` are defined elsewhere.
const key = `openai:${hashKey(messages)}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);
const fresh = await openaiCall(messages);
await redis.setex(key, 3600, JSON.stringify(fresh)); // TTL in seconds (1 hour)
return fresh;
Tell FlareX:
Cache /summarize responses by SHA-256 of the input text + model id.
TTL 24h. Skip cache if request has ?fresh=1.
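One way that prompt could come out, as a sketch: the key covers the model id as well as the text, so switching models never serves a stale summary, and a fresh flag bypasses the read. SummaryCache is a hypothetical minimal interface that a Redis client could be adapted to, and the summarize callback stands in for the actual OpenAI call.

```typescript
import { createHash } from 'node:crypto';

/** Minimal cache interface; a Redis client can be adapted to this shape. */
interface SummaryCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const TTL_SECONDS = 24 * 60 * 60; // 24h, as in the prompt above

/** Cache key from the SHA-256 of model id + input text. */
function summaryCacheKey(model: string, text: string): string {
  // Serializing a [model, text] pair keeps the boundary between them unambiguous.
  const digest = createHash('sha256').update(JSON.stringify([model, text])).digest('hex');
  return `openai:summary:${digest}`;
}

async function cachedSummary(
  cache: SummaryCache,
  model: string,
  text: string,
  fresh: boolean, // true when the request has ?fresh=1
  summarize: (model: string, text: string) => Promise<string>,
): Promise<string> {
  const key = summaryCacheKey(model, text);
  if (!fresh) {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  }
  const summary = await summarize(model, text);
  // A bypassed read still refreshes the stored entry for later requests.
  await cache.set(key, summary, TTL_SECONDS);
  return summary;
}
```

Writing back on a ?fresh=1 request is a design choice here: it means the next ordinary request gets the regenerated summary rather than the old one.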
Retries + rate limits
OpenAI returns 429 with a Retry-After header when you're throttled. The SDK retries 2× by default with exponential backoff — bump it for production:
const client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
maxRetries: 5,
timeout: 60_000,
});
For sustained throughput beyond your default rate limit, request a higher tier in the OpenAI dashboard.
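The SDK handles this for you, but if you call the API through plain fetch or another client, the same policy can be sketched as follows. HttpError is a hypothetical error shape for this example, the delays are illustrative defaults, and this version omits jitter, which you would want once many workers retry at the same time.

```typescript
/** Error shape assumed for this sketch: an HTTP status plus optional Retry-After seconds. */
class HttpError extends Error {
  readonly status: number;
  readonly retryAfterSeconds: number | undefined;
  constructor(status: number, retryAfterSeconds?: number) {
    super(`HTTP ${status}`);
    this.status = status;
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

const RETRYABLE_STATUSES = new Set([429, 500, 503]);

/** Delay before retry number `attempt` (0-based): Retry-After wins, else capped exponential backoff. */
function retryDelayMs(attempt: number, retryAfterSeconds?: number, baseMs = 500, capMs = 8_000): number {
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) return retryAfterSeconds * 1000;
  return Math.min(capMs, baseMs * 2 ** attempt);
}

/** Run `fn`, retrying retryable HTTP errors up to `maxRetries` times. */
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-HTTP errors, non-retryable statuses (e.g. 401), and exhausted budgets propagate.
      if (!(err instanceof HttpError) || !RETRYABLE_STATUSES.has(err.status) || attempt >= maxRetries) {
        throw err;
      }
      await sleep(retryDelayMs(attempt, err.retryAfterSeconds));
    }
  }
}
```

The sleep parameter is injectable so the policy can be tested without real waiting.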
Token counting + truncation
Models have hard input limits (gpt-4o-mini: ~128K tokens). For longer documents, split the text into chunks, summarize each chunk, then summarize the summaries. Tell FlareX:
For inputs over 100K tokens, chunk into 50K-token slices, summarize
each chunk in parallel, then summarize the chunk-summaries. Use
tiktoken for accurate token counting.
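The shape of that chunk-then-combine flow, as a sketch: it estimates tokens at about 4 characters each instead of using tiktoken, so treat the thresholds as approximate. summarizeLong, chunkByTokens, and approxTokens are hypothetical names, and the summarize callback stands in for your model call.

```typescript
type Summarize = (text: string) => Promise<string>;

// Rough estimate at ~4 characters per token; swap in a real tokenizer for accuracy.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

/** Split text into slices of roughly `chunkTokens` tokens each. */
function chunkByTokens(text: string, chunkTokens: number): string[] {
  const chunkChars = chunkTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkChars) {
    chunks.push(text.slice(i, i + chunkChars));
  }
  return chunks;
}

/** Summarize directly when the input fits; otherwise summarize chunks, then the summaries. */
async function summarizeLong(
  text: string,
  summarize: Summarize,
  limitTokens = 100_000,
  chunkTokens = 50_000,
): Promise<string> {
  if (approxTokens(text) <= limitTokens) return summarize(text);

  // Summarize every chunk in parallel.
  const partials = await Promise.all(chunkByTokens(text, chunkTokens).map((chunk) => summarize(chunk)));
  const combined = partials.join('\n\n');

  // Guard against a summarizer that does not shrink its input, which would recurse forever.
  if (combined.length >= text.length) throw new Error('Summaries did not shrink the input');

  // Recurse in case the combined summaries are still over the limit.
  return summarizeLong(combined, summarize, limitTokens, chunkTokens);
}
```

Promise.all fires every chunk at once, so for very large documents you may want a concurrency limit to stay under your rate limit.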
Errors you'll see
| Status | Meaning | What to do |
|---|---|---|
| 401 | Invalid key | Check Secrets — keys are revocable, may have been rotated |
| 429 | Rate limited or out of credits | Honor Retry-After. If you are out of credits, top up or switch to a cheaper model |
| 500 | OpenAI hiccup | Retry with backoff (SDK does this automatically) |
| 503 | Overloaded | Same — usually transient |
| context_length_exceeded | Input too long | Truncate or chunk |
What's next
- 3rd-party APIs overview — fetch + retry + cache patterns
- Webhooks — for async OpenAI workflows (file uploads, fine-tuning)
- Build an API service — wrap an LLM as your own API