With Nginx and SSL configured, this section guides you through sending a POST request to the OpenLLM API endpoint that generates a response from a given prompt.

Send a curl request to the API endpoint:

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the meaning of life?",
    "stop": ["n"],
    "llm_config": {
        "max_new_tokens": 128,
        "min_length": 0,
        "early_stopping": false,
        "num_beams": 1,
        "num_beam_groups": 1,
        "use_cache": true,
        "temperature": 0.75,
        "top_k": 15,
        "top_p": 0.9,
        "typical_p": 1,
        "epsilon_cutoff": 0,
        "eta_cutoff": 0,
        "diversity_penalty": 0,
        "repetition_penalty": 1,
        "encoder_repetition_penalty": 1,
        "length_penalty": 1,
        "no_repeat_ngram_size": 0,
        "renormalize_logits": false,
        "remove_invalid_values": false,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "encoder_no_repeat_ngram_size": 0,
        "logprobs": 0,
        "prompt_logprobs": 0,
        "n": 1,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "use_beam_search": false,
        "ignore_eos": false,
        "skip_special_tokens": true
    },
    "adapter_name": null
}' https://example.com/v1/generate

You can adjust the behavior of the response by changing the values of these parameters; a scripted version of the request with adjusted values follows the list. Here’s what each parameter does:

  • top_p: Limits sampling to the smallest set of tokens whose cumulative probability exceeds this value (nucleus sampling); lower values make the output more focused and relevant.
  • epsilon_cutoff: Excludes tokens whose probability falls below this value from sampling, removing low-probability options.
  • diversity_penalty: Influences the diversity of the output when group beam search is used; a higher value produces a more diverse, less repetitive response.
  • repetition_penalty: Penalizes tokens that have already appeared in the generated output, discouraging repetition.
  • length_penalty: Controls the length of the response when beam search is used; values above 1 favor longer responses and values below 1 favor shorter ones.
  • no_repeat_ngram_size: Prevents any n-gram (sequence of n tokens) of this size from appearing more than once in the response; 0 disables the restriction.
  • remove_invalid_values: Removes invalid (NaN or infinite) values from the model output to keep generation from crashing.
  • num_return_sequences: Controls how many independently generated sequences the model returns for the prompt.
  • frequency_penalty: Penalizes tokens in proportion to how often they already appear in the output, lowering the chance that they are selected again.
  • use_beam_search: Uses beam search to explore several candidate continuations when set to true, instead of sampling.
  • ignore_eos: Continues generating tokens after the “end of sequence” token is produced when set to true.
  • n: The number of output sequences generated for the prompt.
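
If you prefer to script the request, the following is a minimal Python sketch of the same call with a few adjusted sampling values. It assumes the requests library is installed and that the server applies its own defaults for any llm_config fields you omit; replace the example URL with your domain.

# Minimal sketch: POST the same prompt with adjusted sampling parameters.
# Assumes `requests` is installed and the server fills in defaults for
# any llm_config fields that are omitted.
import requests

API_URL = "https://example.com/v1/generate"  # replace with your domain

payload = {
    "prompt": "What is the meaning of life?",
    "stop": ["\n"],
    "llm_config": {
        "max_new_tokens": 256,      # allow a longer answer
        "temperature": 0.5,         # lower value -> more deterministic output
        "top_p": 0.85,              # nucleus sampling threshold
        "repetition_penalty": 1.1   # discourage repeated tokens
    },
    "adapter_name": None
}

response = requests.post(API_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json())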

This is a sample output of the curl request:

{
  "prompt": "What is the meaning of life?",
  "finished": true,
  "outputs": [
    {
      "index": 0,
      "text": " What is the meaning of the universe? How does the universe work?",
      "token_ids": [
        1634, 304, 248, 4113, 275, 248, 10314, 42, 1265, 960, 248, 10314, 633,
        42, 193, 1265, 960, 248, 10314, 633, 42, 193
      ],
      "cumulative_logprob": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "prompt_token_ids": [1562, 304, 248, 4113, 275, 1063, 42],
  "prompt_logprobs": null,
  "request_id": "openllm-e1b145f3e9614624975f76e7fae6050c"
}
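
In a script, the generated text can be extracted from the JSON body rather than read manually. The following sketch parses a trimmed copy of the sample response above; in practice you would use response.json() from the POST request instead of the hard-coded string.

# Minimal sketch: extract the generated text from a /v1/generate response.
# The hard-coded body is a trimmed copy of the sample output above.
import json

body = '''{
  "prompt": "What is the meaning of life?",
  "finished": true,
  "outputs": [
    {"index": 0,
     "text": " What is the meaning of the universe? How does the universe work?",
     "finish_reason": "stop"}
  ],
  "request_id": "openllm-e1b145f3e9614624975f76e7fae6050c"
}'''

data = json.loads(body)
for output in data["outputs"]:
    print(f"[{output['finish_reason']}] {output['text'].strip()}")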
