Now that you have configured Nginx and SSL, this section guides you through sending a POST request to the OpenLLM API endpoint responsible for generating a response from a given prompt.
Send a curl request to the API endpoint:
curl -X POST -H "Content-Type: application/json" -d '{
  "prompt": "What is the meaning of life?",
  "stop": ["\n"],
  "llm_config": {
    "max_new_tokens": 128,
    "min_length": 0,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.75,
    "top_k": 15,
    "top_p": 0.9,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "logprobs": 0,
    "prompt_logprobs": 0,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false,
    "skip_special_tokens": true
  },
  "adapter_name": null
}' https://example.com/v1/generate
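Because the request body is long, you may find it easier to keep the JSON in a file and pass it to curl with the @ prefix. Here, payload.json is a hypothetical filename containing the exact body shown above:
# Store the JSON body from the command above in payload.json, then:
curl -X POST -H "Content-Type: application/json" \
  -d @payload.json \
  https://example.com/v1/generate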
You can adjust the style and randomness of the response by changing the values of these parameters. Here's an explanation of what the key parameters do (an example request follows the list):
top_p
: Restricts sampling to the most probable tokens whose cumulative probability stays within this threshold, making the output more focused and relevant.
epsilon_cutoff
: Ignores tokens with a probability lower than the epsilon value, discarding low-probability options.
diversity_penalty
: Influences the diversity of the output. A higher value creates a more diverse and less repetitive response.
repetition_penalty
: Penalizes tokens that have already appeared in the generated output, discouraging repetition.
length_penalty
: Controls the length of the response; a higher value encourages a longer response and vice versa.
no_repeat_ngram_size
: Blocks n-grams (sequences of n tokens) that have already appeared in the response from being generated again.
remove_invalid_values
: Automatically removes invalid values (such as NaN or infinity) from the model's outputs to keep generation from failing.
num_return_sequences
: Controls how many different sequences the model generates in response to the prompt.
frequency_penalty
: Penalizes tokens in proportion to how frequently they have already appeared, discouraging the model from repeating itself.
use_beam_search
: Uses beam search to find relevant continuations for response generation if the value is set to true.
ignore_eos
: Ignores the "end of sentence" token during response generation if the value is set to true.
n
: The number of output sequences to generate for the prompt.
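To see the effect of these parameters, you can vary just the sampling settings. The request below is a sketch that assumes your OpenLLM server falls back to its defaults for any llm_config fields you omit (verify this against your version); lowering temperature and top_k like this makes the output more deterministic, while raising them makes it more varied:
# Hypothetical example: override only a few sampling parameters.
curl -X POST -H "Content-Type: application/json" -d '{
  "prompt": "What is the meaning of life?",
  "stop": ["\n"],
  "llm_config": {
    "max_new_tokens": 128,
    "temperature": 0.2,
    "top_k": 5,
    "top_p": 0.8
  }
}' https://example.com/v1/generate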
This is a sample output of the curl request:
{
  "prompt": "What is the meaning of life?",
  "finished": true,
  "outputs": [
    {
      "index": 0,
      "text": " What is the meaning of the universe? How does the universe work?",
      "token_ids": [
        1634, 304, 248, 4113, 275, 248, 10314, 42, 1265, 960, 248, 10314, 633,
        42, 193, 1265, 960, 248, 10314, 633, 42, 193
      ],
      "cumulative_logprob": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "prompt_token_ids": [1562, 304, 248, 4113, 275, 1063, 42],
  "prompt_logprobs": null,
  "request_id": "openllm-e1b145f3e9614624975f76e7fae6050c"
}
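Since the generated text is nested inside the outputs array, you can pull it out directly with jq, assuming jq is installed on your machine. This sketch reuses the hypothetical payload.json file from earlier; the -s flag silences curl's progress output and -r prints the raw string:
# Extract only the generated text from the JSON response.
curl -s -X POST -H "Content-Type: application/json" \
  -d @payload.json \
  https://example.com/v1/generate | jq -r '.outputs[0].text'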