婚纱网站html源码,学技巧网站制作,html如何做网站,深度网络技术一、前言 GLM-4是智谱AI团队于2024年1月16日发布的基座大模型#xff0c;旨在自动理解和规划用户的复杂指令#xff0c;并能调用网页浏览器。其功能包括数据分析、图表创建、PPT生成等#xff0c;支持128K的上下文窗口#xff0c;使其在长文本处理和精度召回方面表现优异旨在自动理解和规划用户的复杂指令并能调用网页浏览器。其功能包括数据分析、图表创建、PPT生成等支持128K的上下文窗口使其在长文本处理和精度召回方面表现优异且在中文对齐能力上超过GPT-4。与之前的GLM系列产品相比GLM-4在各项性能上提高了60%并且在指令跟随和多模态功能上有显著强化适合于多种应用场景。尽管在某些领域仍逊于国际一流模型GLM-4的中文处理能力使其在国内大模型中占据领先地位。该模型的研发历程自2020年始经过多次迭代和改进最终构建出这一高性能的AI系统。 在开源模型应用落地-glm模型小试-glm-4-9b-chat-快速体验一已经掌握了glm-4-9b-chat的基本入门。 在开源模型应用落地-glm模型小试-glm-4-9b-chat-批量推理二已经掌握了glm-4-9b-chat的批量推理。 在开源模型应用落地-glm模型小试-glm-4-9b-chat-Gradio集成三已经掌握了如何集成Gradio进行页面交互。 本篇将介绍如何集成vLLM进行推理加速。 二、术语 
2.1.GLM-4-9B 是智谱 AI 推出的一个开源预训练模型属于 GLM-4 系列。它于 2024 年 6 月 6 日发布专为满足高效能语言理解和生成任务而设计并支持最高 1M约两百万字的上下文输入。该模型拥有更强的基础能力支持26种语言并且在多模态能力上首次实现了显著进展。 
GLM-4-9B的基础能力包括 
- 中英文综合性能提升 40%在特别的中文对齐能力、指令遵从和工程代码等任务中显著增强 
- 较 Llama 3 8B 的性能提升尤其在数学问题解决和代码编写等复杂任务中表现优越 
- 增强的函数调用能力提升了 40% 的性能 
- 支持多轮对话还支持网页浏览、代码执行、自定义工具调用等高级功能能够快速处理大量信息并给出高质量的回答 
2.2.GLM-4-9B-Chat 是智谱 AI 在 GLM-4-9B 系列中推出的对话版本模型。它设计用于处理多轮对话并具有一些高级功能使其在自然语言处理任务中更加高效和灵活。 2.3.vLLM vLLM是一个开源的大模型推理加速框架通过PagedAttention高效地管理attention中缓存的张量实现了比HuggingFace Transformers高14-24倍的吞吐量。 三、前置条件 
3.1.基础环境及前置条件 1. 操作系统centos7 2. NVIDIA Tesla V100 32GB   CUDA Version: 12.2  3.最低硬件要求 3.2.下载模型 
huggingface 
https://huggingface.co/THUDM/glm-4-9b-chat/tree/main ModelScope 
魔搭社区 使用git-lfs方式下载示例 3.3.创建虚拟环境 
conda create --name glm4 python3.10
conda activate glm4 
3.4.安装依赖库 
pip install torch2.5.0
pip install torchvision0.20.0
pip install transformers4.46.0
pip install huggingface-hub0.25.1
pip install sentencepiece0.2.0
pip install jinja23.1.4
pip install pydantic2.9.2
pip install timm1.0.9
pip install tiktoken0.7.0
pip install numpy1.26.4 
pip install accelerate1.0.1
pip install sentence_transformers3.1.1
pip install openai1.51.0
pip install einops0.8.0
pip install pillow10.4.0
pip install sse-starlette2.1.3
pip install bitsandbytes0.43.3# using with VLLM Framework
pip install vllm0.6.3 四、技术实现 
4.1.vLLM服务端实现 
# -*- coding: utf-8 -*-
import time
from asyncio.log import logger
import re
import uvicorn
import gc
import json
import torch
import random
import string
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponseEventSourceResponse.DEFAULT_PING_INTERVAL  1000MAX_MODEL_LENGTH  8192asynccontextmanager
async def lifespan(app: FastAPI):yieldif torch.cuda.is_available():torch.cuda.empty_cache()torch.cuda.ipc_collect()app  FastAPI(lifespanlifespan)app.add_middleware(CORSMiddleware,allow_origins[*],allow_credentialsTrue,allow_methods[*],allow_headers[*],
)def generate_id(prefix: str, k29) - str:suffix  .join(random.choices(string.ascii_letters  string.digits, kk))return f{prefix}{suffix}class ModelCard(BaseModel):id: str  object: str  modelcreated: int  Field(default_factorylambda: int(time.time()))owned_by: str  ownerroot: Optional[str]  Noneparent: Optional[str]  Nonepermission: Optional[list]  Noneclass ModelList(BaseModel):object: str  listdata: List[ModelCard]  [glm-4]class FunctionCall(BaseModel):name: Optional[str]  Nonearguments: Optional[str]  Noneclass ChoiceDeltaToolCallFunction(BaseModel):name: Optional[str]  Nonearguments: Optional[str]  Noneclass UsageInfo(BaseModel):prompt_tokens: int  0total_tokens: int  0completion_tokens: Optional[int]  0class ChatCompletionMessageToolCall(BaseModel):index: Optional[int]  0id: Optional[str]  Nonefunction: FunctionCalltype: Optional[Literal[function]]  functionclass ChatMessage(BaseModel):# “function” 字段解释# 使用较老的OpenAI API版本需要注意在这里添加 function 字段并在 process_messages函数中添加相应角色转换逻辑为 observationrole: Literal[user, assistant, system, tool]content: Optional[str]  Nonefunction_call: Optional[ChoiceDeltaToolCallFunction]  Nonetool_calls: Optional[List[ChatCompletionMessageToolCall]]  Noneclass DeltaMessage(BaseModel):role: Optional[Literal[user, assistant, system]]  Nonecontent: Optional[str]  Nonefunction_call: Optional[ChoiceDeltaToolCallFunction]  Nonetool_calls: Optional[List[ChatCompletionMessageToolCall]]  Noneclass ChatCompletionResponseChoice(BaseModel):index: intmessage: ChatMessagefinish_reason: Literal[stop, length, tool_calls]class ChatCompletionResponseStreamChoice(BaseModel):delta: DeltaMessagefinish_reason: Optional[Literal[stop, length, tool_calls]]index: intclass ChatCompletionResponse(BaseModel):model: strid: Optional[str]  Field(default_factorylambda: generate_id(chatcmpl-, 29))object: Literal[chat.completion, chat.completion.chunk]choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]created: Optional[int]  Field(default_factorylambda: int(time.time()))system_fingerprint: Optional[str]  Field(default_factorylambda: generate_id(fp_, 9))usage: Optional[UsageInfo]  Noneclass ChatCompletionRequest(BaseModel):model: strmessages: List[ChatMessage]temperature: Optional[float]  0.8top_p: Optional[float]  0.8max_tokens: Optional[int]  Nonestream: Optional[bool]  Falsetools: Optional[Union[dict, List[dict]]]  Nonetool_choice: Optional[Union[str, dict]]  Nonerepetition_penalty: Optional[float]  1.1class InvalidScoreLogitsProcessor(LogitsProcessor):def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) - torch.FloatTensor:if torch.isnan(scores).any() or torch.isinf(scores).any():scores.zero_()scores[..., 5]  5e4return scoresdef process_response(output: str, tools: dict | List[dict]  None, use_tool: bool  False) - Union[str, dict]:lines  output.strip().split(\n)arguments_json  Nonespecial_tools  [cogview, simple_browser]tools  {tool[function][name] for tool in tools} if tools else {}if len(lines)  2 and lines[1].startswith({):function_name  lines[0].strip()arguments  \n.join(lines[1:]).strip()if function_name in tools or function_name in special_tools:try:arguments_json  json.loads(arguments)is_tool_call  Trueexcept json.JSONDecodeError:is_tool_call  function_name in special_toolsif is_tool_call and use_tool:content  {name: function_name,arguments: json.dumps(arguments_json if isinstance(arguments_json, dict) else arguments,ensure_asciiFalse)}if function_name  simple_browser:search_pattern  re.compile(rsearch\((.?)\s*,\s*recency_days\s*\s*(\d)\))match  search_pattern.match(arguments)if match:content[arguments]  json.dumps({query: match.group(1),recency_days: int(match.group(2))}, ensure_asciiFalse)elif function_name  cogview:content[arguments]  json.dumps({prompt: arguments}, ensure_asciiFalse)return contentreturn output.strip()torch.inference_mode()
async def generate_stream_glm4(params):messages  params[messages]tools  params[tools]tool_choice  params[tool_choice]temperature  float(params.get(temperature, 1.0))repetition_penalty  float(params.get(repetition_penalty, 1.0))top_p  float(params.get(top_p, 1.0))max_new_tokens  int(params.get(max_tokens, 8192))messages  process_messages(messages, toolstools, tool_choicetool_choice)inputs  tokenizer.apply_chat_template(messages, add_generation_promptTrue, tokenizeFalse)params_dict  {n: 1,best_of: 1,presence_penalty: 1.0,frequency_penalty: 0.0,temperature: temperature,top_p: top_p,top_k: -1,repetition_penalty: repetition_penalty,stop_token_ids: [151329, 151336, 151338],ignore_eos: False,max_tokens: max_new_tokens,logprobs: None,prompt_logprobs: None,skip_special_tokens: True,}sampling_params  SamplingParams(**params_dict)async for output in engine.generate(promptinputs, sampling_paramssampling_params, request_idf{time.time()}):output_len  len(output.outputs[0].token_ids)input_len  len(output.prompt_token_ids)ret  {text: output.outputs[0].text,usage: {prompt_tokens: input_len,completion_tokens: output_len,total_tokens: output_len  input_len},finish_reason: output.outputs[0].finish_reason,}yield retgc.collect()torch.cuda.empty_cache()def process_messages(messages, toolsNone, tool_choicenone):_messages  messagesprocessed_messages  []msg_has_sys  Falsedef filter_tools(tool_choice, tools):function_name  tool_choice.get(function, {}).get(name, None)if not function_name:return []filtered_tools  [tool for tool in toolsif tool.get(function, {}).get(name)  function_name]return filtered_toolsif tool_choice ! none:if isinstance(tool_choice, dict):tools  filter_tools(tool_choice, tools)if tools:processed_messages.append({role: system,content: None,tools: tools})msg_has_sys  Trueif isinstance(tool_choice, dict) and tools:processed_messages.append({role: assistant,metadata: tool_choice[function][name],content: })for m in _messages:role, content, func_call  m.role, m.content, m.function_calltool_calls  getattr(m, tool_calls, None)if role  function:processed_messages.append({role: observation,content: content})elif role  tool:processed_messages.append({role: observation,content: content,function_call: True})elif role  assistant:if tool_calls:for tool_call in tool_calls:processed_messages.append({role: assistant,metadata: tool_call.function.name,content: tool_call.function.arguments})else:for response in content.split(\n):if \n in response:metadata, sub_content  response.split(\n, maxsplit1)else:metadata, sub_content  , responseprocessed_messages.append({role: role,metadata: metadata,content: sub_content.strip()})else:if role  system and msg_has_sys:msg_has_sys  Falsecontinueprocessed_messages.append({role: role, content: content})if not tools or tool_choice  none:for m in _messages:if m.role  system:processed_messages.insert(0, {role: m.role, content: m.content})breakreturn processed_messagesapp.get(/health)
async def health() - Response:Health check.return Response(status_code200)app.get(/v1/models, response_modelModelList)
async def list_models():model_card  ModelCard(idglm-4)return ModelList(data[model_card])app.post(/v1/chat/completions, response_modelChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):if len(request.messages)  1 or request.messages[-1].role  assistant:raise HTTPException(status_code400, detailInvalid request)gen_params  dict(messagesrequest.messages,temperaturerequest.temperature,top_prequest.top_p,max_tokensrequest.max_tokens or 1024,echoFalse,streamrequest.stream,repetition_penaltyrequest.repetition_penalty,toolsrequest.tools,tool_choicerequest.tool_choice,)logger.debug(f request \n{gen_params})if request.stream:predict_stream_generator  predict_stream(request.model, gen_params)output  await anext(predict_stream_generator)if output:return EventSourceResponse(predict_stream_generator, media_typetext/event-stream)logger.debug(fFirst result output\n{output})function_call  Noneif output and request.tools:try:function_call  process_response(output, request.tools, use_toolTrue)except:logger.warning(Failed to parse tool call)if isinstance(function_call, dict):function_call  ChoiceDeltaToolCallFunction(**function_call)generate  parse_output_text(request.model, output, function_callfunction_call)return EventSourceResponse(generate, media_typetext/event-stream)else:return EventSourceResponse(predict_stream_generator, media_typetext/event-stream)response  async for response in generate_stream_glm4(gen_params):passif response[text].startswith(\n):response[text]  response[text][1:]response[text]  response[text].strip()usage  UsageInfo()function_call, finish_reason  None, stoptool_calls  Noneif request.tools:try:function_call  process_response(response[text], request.tools, use_toolTrue)except Exception as e:logger.warning(fFailed to parse tool call: {e})if isinstance(function_call, dict):finish_reason  tool_callsfunction_call_response  ChoiceDeltaToolCallFunction(**function_call)function_call_instance  FunctionCall(namefunction_call_response.name,argumentsfunction_call_response.arguments)tool_calls  [ChatCompletionMessageToolCall(idgenerate_id(call_, 24),functionfunction_call_instance,typefunction)]message  ChatMessage(roleassistant,contentNone if tool_calls else response[text],function_callNone,tool_callstool_calls,)logger.debug(f message \n{message})choice_data  ChatCompletionResponseChoice(index0,messagemessage,finish_reasonfinish_reason,)task_usage  UsageInfo.model_validate(response[usage])for usage_key, usage_value in task_usage.model_dump().items():setattr(usage, usage_key, getattr(usage, usage_key)  usage_value)return ChatCompletionResponse(modelrequest.model,choices[choice_data],objectchat.completion,usageusage)async def predict_stream(model_id, gen_params):output  is_function_call  Falsehas_send_first_chunk  Falsecreated_time  int(time.time())function_name  Noneresponse_id  generate_id(chatcmpl-, 29)system_fingerprint  generate_id(fp_, 9)tools  {tool[function][name] for tool in gen_params[tools]} if gen_params[tools] else {}delta_text  async for new_response in generate_stream_glm4(gen_params):decoded_unicode  new_response[text]delta_text  decoded_unicode[len(output):]output  decoded_unicodelines  output.strip().split(\n)# 检查是否为工具# 这是一个简单的工具比较函数不能保证拦截所有非工具输出的结果比如参数未对齐等特殊情况。##TODO 如果你希望做更多处理可以在这里进行逻辑完善。if not is_function_call and len(lines)  2:first_line  lines[0].strip()if first_line in tools:is_function_call  Truefunction_name  first_linedelta_text  lines[1]# 工具调用返回if is_function_call:if not has_send_first_chunk:function_call  {name: function_name, arguments: }tool_call  ChatCompletionMessageToolCall(index0,idgenerate_id(call_, 24),functionFunctionCall(**function_call),typefunction)message  DeltaMessage(contentNone,roleassistant,function_callNone,tool_calls[tool_call])choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonNone)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield yield chunk.model_dump_json(exclude_unsetTrue)has_send_first_chunk  Truefunction_call  {name: None, arguments: delta_text}delta_text  tool_call  ChatCompletionMessageToolCall(index0,idNone,functionFunctionCall(**function_call),typefunction)message  DeltaMessage(contentNone,roleNone,function_callNone,tool_calls[tool_call])choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonNone)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield chunk.model_dump_json(exclude_unsetTrue)# 用户请求了 Function Call 但是框架还没确定是否为Function Callelif (gen_params[tools] and gen_params[tool_choice] ! none) or is_function_call:continue# 常规返回else:finish_reason  new_response.get(finish_reason, None)if not has_send_first_chunk:message  DeltaMessage(content,roleassistant,function_callNone,)choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonfinish_reason)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield chunk.model_dump_json(exclude_unsetTrue)has_send_first_chunk  Truemessage  DeltaMessage(contentdelta_text,roleassistant,function_callNone,)delta_text  choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonfinish_reason)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield chunk.model_dump_json(exclude_unsetTrue)# 工具调用需要额外返回一个字段以对齐 OpenAI 接口if is_function_call:yield ChatCompletionResponse(modelmodel_id,idresponse_id,system_fingerprintsystem_fingerprint,choices[ChatCompletionResponseStreamChoice(index0,deltaDeltaMessage(contentNone,roleNone,function_callNone,),finish_reasontool_calls)],createdcreated_time,objectchat.completion.chunk,usageNone).model_dump_json(exclude_unsetTrue)elif delta_text ! :message  DeltaMessage(content,roleassistant,function_callNone,)choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonNone)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield chunk.model_dump_json(exclude_unsetTrue)finish_reason  stopmessage  DeltaMessage(contentdelta_text,roleassistant,function_callNone,)delta_text  choice_data  ChatCompletionResponseStreamChoice(index0,deltamessage,finish_reasonfinish_reason)chunk  ChatCompletionResponse(modelmodel_id,idresponse_id,choices[choice_data],createdcreated_time,system_fingerprintsystem_fingerprint,objectchat.completion.chunk)yield chunk.model_dump_json(exclude_unsetTrue)yield [DONE]else:yield [DONE]async def parse_output_text(model_id: str, value: str, function_call: ChoiceDeltaToolCallFunction  None):delta  DeltaMessage(roleassistant, contentvalue)if function_call is not None:delta.function_call  function_callchoice_data  ChatCompletionResponseStreamChoice(index0,deltadelta,finish_reasonNone)chunk  ChatCompletionResponse(modelmodel_id,choices[choice_data],objectchat.completion.chunk)yield {}.format(chunk.model_dump_json(exclude_unsetTrue))yield [DONE]if __name__  __main__:MODEL_PATH  /data/model/glm-4-9b-chattensor_parallel_size  1tokenizer  AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_codeTrue)engine_args  AsyncEngineArgs(modelMODEL_PATH,tokenizerMODEL_PATH,# 如果你有多张显卡可以在这里设置成你的显卡数量tensor_parallel_sizetensor_parallel_size,dtypetorch.float16,trust_remote_codeTrue,# 占用显存的比例请根据你的显卡显存大小设置合适的值例如如果你的显卡有80G您只想使用24G请按照24/800.3设置gpu_memory_utilization0.9,enforce_eagerTrue,worker_use_rayFalse,disable_log_requestsTrue,max_model_lenMAX_MODEL_LENGTH,)engine  AsyncLLMEngine.from_engine_args(engine_args)uvicorn.run(app, host0.0.0.0, port8000, workers1) 
4.2.vLLM服务端启动 
(glm4) [rootgpu test]# python -u glm_server.py 
WARNING 11-06 12:11:19 config.py:1668] Casting torch.bfloat16 to torch.float16.
WARNING 11-06 12:11:23 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-06 12:11:23 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model/data/model/glm-4-9b-chat, speculative_configNone, tokenizer/data/model/glm-4-9b-chat, skip_tokenizer_initFalse, tokenizer_modeauto, revisionNone, override_neuron_configNone, rope_scalingNone, rope_thetaNone, tokenizer_revisionNone, trust_remote_codeTrue, dtypetorch.float16, max_seq_len8192, download_dirNone, load_formatLoadFormat.AUTO, tensor_parallel_size1, pipeline_parallel_size1, disable_custom_all_reduceFalse, quantizationNone, enforce_eagerTrue, kv_cache_dtypeauto, quantization_param_pathNone, device_configcuda, decoding_configDecodingConfig(guided_decoding_backendoutlines), observability_configObservabilityConfig(otlp_traces_endpointNone, collect_model_forward_timeFalse, collect_model_execute_timeFalse), seed0, served_model_name/data/model/glm-4-9b-chat, num_scheduler_steps1, chunked_prefill_enabledFalse multi_step_stream_outputsTrue, enable_prefix_cachingFalse, use_async_output_procFalse, use_cached_outputsFalse, mm_processor_kwargsNone)
WARNING 11-06 12:11:24 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 11-06 12:11:24 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:24 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.torch.library.impl_abstract(xformers_flash::flash_fwd)
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.torch.library.impl_abstract(xformers_flash::flash_bwd)
INFO 11-06 12:11:25 model_runner.py:1056] Starting to load model /data/model/glm-4-9b-chat...
INFO 11-06 12:11:25 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:25 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00?, ?it/s]
Loading safetensors checkpoint shards:  10% Completed | 1/10 [00:0000:08,  1.01it/s]
Loading safetensors checkpoint shards:  20% Completed | 2/10 [00:0100:07,  1.13it/s]
Loading safetensors checkpoint shards:  30% Completed | 3/10 [00:0200:06,  1.14it/s]
Loading safetensors checkpoint shards:  40% Completed | 4/10 [00:0300:05,  1.15it/s]
Loading safetensors checkpoint shards:  50% Completed | 5/10 [00:0400:04,  1.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 6/10 [00:0500:03,  1.08it/s]
Loading safetensors checkpoint shards:  70% Completed | 7/10 [00:0600:02,  1.07it/s]
Loading safetensors checkpoint shards:  80% Completed | 8/10 [00:0700:01,  1.13it/s]
Loading safetensors checkpoint shards:  90% Completed | 9/10 [00:0800:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:0800:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:0800:00,  1.11it/s]INFO 11-06 12:11:35 model_runner.py:1067] Loading model weights took 17.5635 GB
INFO 11-06 12:11:37 gpu_executor.py:122] # GPU blocks: 12600, # CPU blocks: 6553
INFO 11-06 12:11:37 gpu_executor.py:126] Maximum concurrency for 8192 tokens per request: 24.61x
INFO:     Started server process [1627618]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRLC to quit) 4.3.客户端实现 
# -*- coding: utf-8 -*-
from openai import OpenAIbase_url  http://127.0.0.1:8000/v1/
client  OpenAI(api_keyEMPTY, base_urlbase_url)
MODEL_PATH  /data/model/glm-4-9b-chatdef chat(use_streamFalse):messages  [{role: system,content: 你是一名专业的导游。,},{role: user,content: 请推荐一些广州特色的景点,}]response  client.chat.completions.create(modelMODEL_PATH,messagesmessages,streamuse_stream,max_tokens8192,temperature0.4,presence_penalty1.2,top_p0.9,)if response:if use_stream:for chunk in response:msg  chunk.choices[0].delta.contentprint(msg,end,flushTrue)else:print(response)else:print(Error:, response.status_code)if __name__  __main__:chat(use_streamTrue)4.4.客户端调用 
(glm4) [rootgpu test]# python -u glm_client.py 当然可以广州是中国广东省的省会历史悠久文化底蕴深厚同时也是一座现代化的大都市。以下是一些广州的特色景点推荐1. **白云山** - 广州著名的风景区有“羊城第一秀”之称。山上空气清新景色优美是登山和观赏广州市区全景的好地方。2. **珠江夜游** - 乘坐游船在珠江上欣赏两岸的夜景可以看到广州塔、海心沙等著名地标以及璀璨的灯光秀。3. **长隆旅游度假区** - 包括长隆野生动物世界、长隆水上乐园、长隆国际大马戏等多个主题公园适合家庭游玩。4. **陈家祠** - 又称陈氏书院是一座典型的岭南传统建筑以其精美的木雕、石雕和砖雕闻名。5. **越秀公园** - 公园内有五羊雕像是广州的象征之一。还有中山纪念碑、镇海楼等历史遗迹。6. **北京路步行街** - 这里集合了购物、餐饮、娱乐于一体是一条充满活力的商业街区。7. **上下九步行街** - 这条古老的街道以骑楼建筑为特色两旁有许多老字号商店和小吃店是体验广州传统文化的好去处。8. **广州塔小蛮腰** - 作为广州的地标性建筑游客可以从这里俯瞰整个城市的壮丽景观。9. **南越王宫博物馆** - 展示了两千多年前南越国的历史文化馆内有一座复原的宫殿模型。10. **荔湾湖公园** - 一个集自然风光与人文景观于一体的公园湖水清澈环境宜人。11. **广州动物园** - 是中国最大的城市动物园之一拥有多种珍稀动物。12. **广州艺术博物院** - 收藏了大量珍贵的艺术品和历史文物是了解广东乃至华南地区文化艺术的重要场所。这些景点不仅展示了广州的自然美景也体现了其丰富的文化遗产和现代都市的风貌。希望您在广州旅行时能有一个愉快的体验 文章转载自: http://www.morning.zbpqq.cn.gov.cn.zbpqq.cn http://www.morning.rqnhf.cn.gov.cn.rqnhf.cn http://www.morning.mzcrs.cn.gov.cn.mzcrs.cn http://www.morning.gftnx.cn.gov.cn.gftnx.cn http://www.morning.bbjw.cn.gov.cn.bbjw.cn http://www.morning.yrpd.cn.gov.cn.yrpd.cn http://www.morning.wrkhf.cn.gov.cn.wrkhf.cn http://www.morning.wjlrw.cn.gov.cn.wjlrw.cn http://www.morning.nqxdg.cn.gov.cn.nqxdg.cn http://www.morning.srtw.cn.gov.cn.srtw.cn http://www.morning.rlrxh.cn.gov.cn.rlrxh.cn http://www.morning.xpgwz.cn.gov.cn.xpgwz.cn http://www.morning.nthyjf.com.gov.cn.nthyjf.com http://www.morning.nckjk.cn.gov.cn.nckjk.cn http://www.morning.bzwxr.cn.gov.cn.bzwxr.cn http://www.morning.rqfzp.cn.gov.cn.rqfzp.cn http://www.morning.lqtwb.cn.gov.cn.lqtwb.cn http://www.morning.sxhdzyw.com.gov.cn.sxhdzyw.com http://www.morning.fqyxb.cn.gov.cn.fqyxb.cn http://www.morning.darwallet.cn.gov.cn.darwallet.cn http://www.morning.fwkpp.cn.gov.cn.fwkpp.cn http://www.morning.xcdph.cn.gov.cn.xcdph.cn http://www.morning.xiaobaixinyong.cn.gov.cn.xiaobaixinyong.cn http://www.morning.thntp.cn.gov.cn.thntp.cn http://www.morning.nbybb.cn.gov.cn.nbybb.cn http://www.morning.jpgfx.cn.gov.cn.jpgfx.cn http://www.morning.rfldz.cn.gov.cn.rfldz.cn http://www.morning.nyplp.cn.gov.cn.nyplp.cn http://www.morning.xwlmr.cn.gov.cn.xwlmr.cn http://www.morning.cwyfs.cn.gov.cn.cwyfs.cn http://www.morning.pyncm.cn.gov.cn.pyncm.cn http://www.morning.xkjrq.cn.gov.cn.xkjrq.cn http://www.morning.skkln.cn.gov.cn.skkln.cn http://www.morning.xhhzn.cn.gov.cn.xhhzn.cn http://www.morning.jhwwr.cn.gov.cn.jhwwr.cn http://www.morning.txgjx.cn.gov.cn.txgjx.cn http://www.morning.kphsp.cn.gov.cn.kphsp.cn http://www.morning.sbwr.cn.gov.cn.sbwr.cn http://www.morning.rqknq.cn.gov.cn.rqknq.cn http://www.morning.tqpr.cn.gov.cn.tqpr.cn http://www.morning.gagapp.cn.gov.cn.gagapp.cn http://www.morning.lfbzg.cn.gov.cn.lfbzg.cn http://www.morning.dpjtn.cn.gov.cn.dpjtn.cn http://www.morning.zkzjm.cn.gov.cn.zkzjm.cn http://www.morning.jjhng.cn.gov.cn.jjhng.cn http://www.morning.xqbbc.cn.gov.cn.xqbbc.cn http://www.morning.rmjxp.cn.gov.cn.rmjxp.cn http://www.morning.nrftd.cn.gov.cn.nrftd.cn http://www.morning.zwfgh.cn.gov.cn.zwfgh.cn http://www.morning.tkyxl.cn.gov.cn.tkyxl.cn http://www.morning.nmpdm.cn.gov.cn.nmpdm.cn http://www.morning.sfnr.cn.gov.cn.sfnr.cn http://www.morning.rzbgn.cn.gov.cn.rzbgn.cn http://www.morning.pmftz.cn.gov.cn.pmftz.cn http://www.morning.hxljc.cn.gov.cn.hxljc.cn http://www.morning.dbnrl.cn.gov.cn.dbnrl.cn http://www.morning.psxxp.cn.gov.cn.psxxp.cn http://www.morning.aowuu.com.gov.cn.aowuu.com http://www.morning.080203.cn.gov.cn.080203.cn http://www.morning.wlqll.cn.gov.cn.wlqll.cn http://www.morning.bpmnz.cn.gov.cn.bpmnz.cn http://www.morning.rlqqy.cn.gov.cn.rlqqy.cn http://www.morning.gsksm.cn.gov.cn.gsksm.cn http://www.morning.ndfwh.cn.gov.cn.ndfwh.cn http://www.morning.nhrkc.cn.gov.cn.nhrkc.cn http://www.morning.dbrnl.cn.gov.cn.dbrnl.cn http://www.morning.jyknk.cn.gov.cn.jyknk.cn http://www.morning.rwmq.cn.gov.cn.rwmq.cn http://www.morning.xlclj.cn.gov.cn.xlclj.cn http://www.morning.kvzvoew.cn.gov.cn.kvzvoew.cn http://www.morning.pcbfl.cn.gov.cn.pcbfl.cn http://www.morning.zwndt.cn.gov.cn.zwndt.cn http://www.morning.bybhj.cn.gov.cn.bybhj.cn http://www.morning.bqts.cn.gov.cn.bqts.cn http://www.morning.jqjnl.cn.gov.cn.jqjnl.cn http://www.morning.jkzq.cn.gov.cn.jkzq.cn http://www.morning.jjwzk.cn.gov.cn.jjwzk.cn http://www.morning.nyqnk.cn.gov.cn.nyqnk.cn http://www.morning.wqcbr.cn.gov.cn.wqcbr.cn http://www.morning.hwlmy.cn.gov.cn.hwlmy.cn