
Build AI-powered applications using OpenLLM and Vultr Cloud GPU

With Nginx and SSL configured, this section guides you through sending an API POST request to the OpenLLM endpoint, which generates a response from the given prompt.

Send a curl request to the API endpoint.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the meaning of life?",
    "stop": ["\n"],
    "llm_config": {
        "max_new_tokens": 128,
        "min_length": 0,
        "early_stopping": false,
        "num_beams": 1,
        "num_beam_groups": 1,
        "use_cache": true,
        "temperature": 0.75,
        "top_k": 15,
        "top_p": 0.9,
        "typical_p": 1,
        "epsilon_cutoff": 0,
        "eta_cutoff": 0,
        "diversity_penalty": 0,
        "repetition_penalty": 1,
        "encoder_repetition_penalty": 1,
        "length_penalty": 1,
        "no_repeat_ngram_size": 0,
        "renormalize_logits": false,
        "remove_invalid_values": false,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "encoder_no_repeat_ngram_size": 0,
        "logprobs": 0,
        "prompt_logprobs": 0,
        "n": 1,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "use_beam_search": false,
        "ignore_eos": false,
        "skip_special_tokens": true
    },
    "adapter_name": null
}' https://example.com/v1/generate
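
The same request can be sent from application code. The following is a minimal Python sketch using only the standard library, not an official OpenLLM client. The helper names build_payload and generate are hypothetical, the URL is the placeholder from the curl command, and the sketch assumes the server falls back to its defaults for any llm_config key that is left out.

```python
import json
import urllib.request

# Placeholder endpoint taken from the curl example; replace with your own domain.
API_URL = "https://example.com/v1/generate"


def build_payload(prompt, stop=None, **llm_config_overrides):
    """Build a request body shaped like the curl example.

    Only a few llm_config keys are set here; this assumes the server
    applies its own defaults for the keys that are omitted.
    """
    llm_config = {
        "max_new_tokens": 128,
        "temperature": 0.75,
        "top_k": 15,
        "top_p": 0.9,
    }
    llm_config.update(llm_config_overrides)
    return {
        "prompt": prompt,
        # Assumption: stop generating at the first newline.
        "stop": stop if stop is not None else ["\n"],
        "llm_config": llm_config,
        "adapter_name": None,
    }


def generate(prompt, url=API_URL, timeout=60, **llm_config_overrides):
    """POST the prompt to the endpoint and return the decoded JSON response."""
    body = json.dumps(build_payload(prompt, **llm_config_overrides)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a reachable server, so it is left commented out):
# result = generate("What is the meaning of life?", temperature=0.5)
# print(result["outputs"][0]["text"])
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the exact JSON before it is sent.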

You can tune the model's output by changing the values of the generation parameters. Here’s an explanation of what some of these parameters do:

  • top_p: Enables nucleus sampling. The model samples only from the smallest set of most probable tokens whose cumulative probability reaches top_p; lower values make the output more focused.
  • epsilon_cutoff: Excludes tokens whose probability is lower than the epsilon value from sampling. A value of 0 disables the cutoff.
  • diversity_penalty: Lowers the score of a beam that produces the same token as a beam in another group. It only has an effect with group beam search; a higher value makes the beam groups differ more from each other.
  • repetition_penalty: Penalizes tokens that have already appeared in the text. A value of 1 applies no penalty.
  • length_penalty: Applies an exponential penalty to the sequence length during beam search; a higher value favors longer responses and a lower value favors shorter ones.
  • no_repeat_ngram_size: Prevents any n-gram (sequence of n tokens) of this size from appearing more than once in the response. A value of 0 disables the check.
  • remove_invalid_values: Removes invalid values (NaN and infinity) from the model's output scores so that generation does not crash.
  • num_return_sequences: Controls how many sequences the model returns for each prompt.
  • frequency_penalty: Penalizes tokens according to how often they have already appeared in the generated text. Positive values encourage new tokens, and negative values encourage repetition.
  • use_beam_search: Uses beam search instead of sampling to generate the response if the parameter value is set to true.
  • ignore_eos: Ignores the “end of sequence” (EOS) token and keeps generating after it appears if the parameter value is set to true.
  • n: Sets the number of output sequences to return for the given prompt.
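
To make the effect of top_k and top_p concrete, the following toy Python sketch filters a hand-written probability table the way these two parameters are commonly described. It is an illustration only and not the code that OpenLLM or its backends run; the function name top_k_top_p_filter and the example probabilities are made up for this purpose.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return renormalized probabilities after top-k, then top-p filtering.

    probs maps each candidate token to its probability.
    top_k=0 disables the top-k step; top_p=1.0 keeps every remaining token.
    """
    # Rank candidates from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most probable tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution to sample from.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token probabilities for illustration.
candidates = {"life": 0.5, "love": 0.3, "42": 0.15, "cheese": 0.05}
filtered = top_k_top_p_filter(candidates, top_k=3, top_p=0.8)
print(filtered)  # only "life" and "love" remain
```

With top_k=3 the least likely token is dropped first, and top_p=0.8 then trims the list further to the two tokens that together cover at least 80% of the remaining probability.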

This is a sample output of the curl request:

{
    "prompt": "What is the meaning of life?",
    "finished": true,
    "outputs": [
        {
            "index": 0,
            "text": " What is the meaning of the universe? How does the universe work?",
            "token_ids": [1634, 304, 248, 4113, 275, 248, 10314, 42, 1265, 960, 248, 10314, 633, 42, 193, 1265, 960, 248, 10314, 633, 42, 193],
            "cumulative_logprob": 0,
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "prompt_token_ids": [1562, 304, 248, 4113, 275, 1063, 42],
    "prompt_logprobs": null,
    "request_id": "openllm-e1b145f3e9614624975f76e7fae6050c"
}
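
An application usually needs only the generated text from this response. The following Python sketch pulls it out of the "outputs" list; the helper name extract_texts is hypothetical, and the sample string is a trimmed copy of the output above that keeps only the fields the helper reads.

```python
import json


def extract_texts(response):
    """Return the generated text of every entry in the response's "outputs" list."""
    # Accept either a raw JSON string/bytes or an already decoded dict.
    if isinstance(response, (str, bytes)):
        response = json.loads(response)
    # Strip the leading space that the model emitted before the first word.
    return [output["text"].strip() for output in response.get("outputs", [])]


# Trimmed copy of the sample output (token IDs and log probabilities omitted).
sample = """
{
    "prompt": "What is the meaning of life?",
    "finished": true,
    "outputs": [
        {
            "index": 0,
            "text": " What is the meaning of the universe? How does the universe work?",
            "finish_reason": "stop"
        }
    ]
}
"""

for text in extract_texts(sample):
    print(text)
```

Returning a list rather than a single string keeps the helper usable when the request asks for more than one output sequence.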

