Exploring LLMs for ICD Coding: Part 1

A representative workflow of automated ICD coding (Image by Author)

Clinical coding is typically performed by human coders with medical expertise. These coders navigate complex and often hierarchical coding terminologies, with specific codes for a wide range of diagnoses and procedures. As a result, coders must have deep familiarity with, and experience in, the coding terminology being used. However, manually coding documents can be slow, error-prone, and bottlenecked by the need for significant human expertise.


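In this part, I describe what ICD coding is and characterize the various challenges that an automated coding system must overcome. I also analyze how Large Language Models (LLMs) can be used effectively to overcome these problems, and illustrate this by implementing an algorithm from a recent paper that leverages LLMs for ICD coding.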

Table of Contents:

  1. What is ICD Coding?
  2. What are the challenges in automated ICD coding?
  3. How can LLMs help in automated ICD coding?
  4. Implementing the technique described in the paper
  5. Conclusion

What is ICD Coding?




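What are the challenges in automated ICD coding?

A significant challenge is the extensive output space of labels. ICD codes are numerous, and each code may differ from the next only in minute details. For example, a condition affecting the right hand and the same condition affecting the left hand have different codes. In addition, there is a long tail of rare codes that appear infrequently in medical records, which makes it difficult for deep learning models to learn and accurately predict these codes because examples are scarce.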


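Traditional datasets used for training, such as MIMIC-III [2], are comprehensive, but they typically limit the scope of ICD codes to those present in the training corpus. As a result, deep learning models that treat ICD coding as a multi-label classification problem, mapping medical records to ICD codes, struggle with new codes that are introduced into the ICD system after the model has been trained. Retraining then becomes necessary, and it can be a challenging task.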
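To make this limitation concrete, here is a minimal sketch of how a multi-label classifier typically encodes its targets. It is not taken from the article's codebase: the helper names `build_label_index` and `encode_labels` and the small training label set are hypothetical. The only point is that a code absent from the training corpus has no output slot.

```python
def build_label_index(training_codes):
    """Map each ICD code seen during training to a fixed output position."""
    return {code: i for i, code in enumerate(sorted(set(training_codes)))}


def encode_labels(note_codes, label_index):
    """Return a multi-hot target vector plus any codes the model cannot represent."""
    vector = [0] * len(label_index)
    unknown = []
    for code in note_codes:
        if code in label_index:
            vector[label_index[code]] = 1
        else:
            # No output unit exists for this code; only retraining can add one.
            unknown.append(code)
    return vector, unknown


# Hypothetical training label set (illustrative only).
label_index = build_label_index(["K21.9", "J00", "E11.9"])

# "U07.1" stands in for a code that was added to the coding system after training.
vector, unknown = encode_labels(["K21.9", "U07.1"], label_index)
print(vector, unknown)
```

A classifier built on such a fixed index can only ever emit the codes it was trained on, which is why a new code forces a change to the output layer and another round of training.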




An example of the coarse-to-fine grained nature of ICD Coding — The final code that is to be assigned to a diagnosis is a function of how contextualized and precise the final query is. (Image by Author)

A representative example of the Relation Extraction process. Relation Extraction can help associate all relevant information for the main diagnosis in the medical note. (Image by Author)

How can LLMs help in automated ICD coding?








Contextualizing information:

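LLMs have been found to be highly effective at zero-shot relation extraction in the clinical domain [3] [4]. Zero-shot relation extraction allows an LLM to identify and classify relations in text without any prior training specific to those relations. This helps contextualize a diagnosis more fully against the medical coding system, which in turn yields more precise ICD codes.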
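As a rough illustration of what a zero-shot relation extraction request could look like, the sketch below builds a chat-style prompt for a single diagnosis. The function name `build_relation_extraction_messages` and the prompt wording are assumptions made for this example; they are not the prompts used in the cited works.

```python
def build_relation_extraction_messages(medical_note, diagnosis):
    """Build chat messages asking an LLM to list details linked to one diagnosis."""
    system_prompt = "You are a clinical information extraction assistant."
    user_prompt = (
        "[Case note]:\n"
        f"{medical_note}\n\n"
        "[Task]:\n"
        f"List every detail in the case note that relates to the diagnosis '{diagnosis}' "
        "(for example laterality, severity, cause, or anatomical site). "
        "Answer with one 'relation: value' pair per line and use only information "
        "stated in the note."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_relation_extraction_messages(
    "Patient presents with a displaced fracture of the right radius after a fall.",
    "fracture",
)
print(messages[1]["content"])
```

No relation types are defined or trained in advance; the instructions alone tell the model what to extract. Details such as laterality are exactly what separates one fine-grained ICD code from its siblings.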









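In LLM-guided tree search, the search starts at the root and uses the LLM to select which branches to explore, continuing iteratively until all paths are exhausted. In practice, this is implemented by providing the descriptions of all codes at a given level of the tree, together with the medical note, as a prompt to the LLM and asking it to identify the codes that are relevant to the note. The codes selected by the LLM at each step are then traversed and explored further. This method identifies the most relevant ICD codes, which are then assigned as the predicted labels for the clinical note.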

The Tree-Search algorithm starts at the first level of the ICD tree. The descriptions of all the nodes in the first level along with the Medical Note are provided to the LLM, which is prompted to identify all relevant codes for the provided note. The output of the LLM is resolved as a set of Yes/No answers for each ICD code description. (Image by Author)



Given that the LLM predicted both ICD Code 1 and 2 as relevant to the medical note, the algorithm traverses the children of each of these nodes. Each node has 2 children codes, and the LLM is again invoked for each node’s children individually to identify if the child nodes are relevant to the medical note. (Image by Author)

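In this case, the LLM determines that both ICD Code 1 and ICD Code 2 are relevant to the medical note. The algorithm then examines the children of each code. Each parent code has two children representing more specific ICD codes. Starting with ICD Code 1, the LLM uses the descriptions of ICD Code 1.1 and ICD Code 1.2, together with the medical note, to determine the relevant codes. The LLM concludes that ICD Code 1.1 is relevant and ICD Code 1.2 is not. Since ICD Code 1.1 has no further children, the algorithm checks whether it is an assignable code and assigns it to the document. Next, the algorithm evaluates the children of ICD Code 2. The LLM is invoked again and determines that only ICD Code 2.1 is relevant. This is a simplified example; in reality, the ICD tree is extensive and deep, so the algorithm keeps traversing the children of each relevant node until it reaches the end of the tree or exhausts the valid traversals.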
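The walkthrough above can be reproduced on a toy tree. In the sketch below, the LLM is replaced by a hard-coded stub (`select_relevant`) and the hierarchy (`TOY_TREE`) is invented to mirror the example, so this is only an illustration of the traversal order, not the paper's implementation.

```python
from collections import deque

# Toy hierarchy mirroring the example: two top-level codes, two children each.
TOY_TREE = {
    "ROOT": ["1", "2"],
    "1": ["1.1", "1.2"],
    "2": ["2.1", "2.2"],
}

# Stand-in for the LLM's Yes/No judgements (hypothetical, hard-coded).
RELEVANT = {"1", "2", "1.1", "2.1"}


def select_relevant(note, candidates):
    """Stub for the LLM call: return the candidates judged relevant to the note."""
    return [code for code in candidates if code in RELEVANT]


def tree_search(note, max_calls=50):
    """Explore the tree level by level, following only the codes judged relevant."""
    assigned = []
    queue = deque(["ROOT"])
    calls = 0
    while queue and calls < max_calls:
        parent = queue.popleft()
        # One "LLM call" per expanded node, covering all of its children at once.
        selected = select_relevant(note, TOY_TREE[parent])
        calls += 1
        for code in selected:
            if code in TOY_TREE:
                queue.append(code)      # has children: explore further
            else:
                assigned.append(code)   # leaf: assign to the note
    return assigned, calls


codes, calls = tree_search("example medical note")
print(codes, calls)
```

On this toy tree the search assigns codes 1.1 and 2.1 after three stub calls: one for the top level and one for each relevant parent.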

Highlights

  1. The paper shows that, when the relevant information is provided in the prompt, LLMs can adapt effectively to a large output space, outperforming PLM-ICD [6] on macro-average metrics, particularly on rare codes.
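Macro-averaged metrics matter here because they weigh every code equally, so rare codes count as much as frequent ones. The sketch below uses invented counts (not results from the paper) to show how micro- and macro-averaged F1 diverge when a rare code is never predicted.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-code counts: (true positives, false positives, false negatives).
counts = {
    "frequent_code": (90, 10, 10),
    "rare_code": (0, 0, 2),  # the rare code is never predicted
}

# Micro: pool the counts over all codes, then compute a single F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# Macro: compute F1 per code, then average, so every code weighs the same.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

print(round(micro_f1, 3), round(macro_f1, 3))
```

Missing the rare code barely moves the micro average, but it halves the macro average. This is why macro metrics are the more telling measure of performance on the long tail.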



Drawbacks

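  1. The algorithm invokes the LLM at every level of the tree. This leads to a high number of LLM calls during traversal and, given the vastness of the ICD tree, to high latency and cost for processing a single document.
  2. The authors also note in the paper that, for a relevant code to be predicted correctly, the LLM must correctly identify its parent nodes at every level. A mistake at even one level means the LLM cannot reach the final relevant code.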
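The second drawback compounds with depth. As a back-of-the-envelope sketch, assume (purely for illustration) that the LLM picks the correct branch with the same probability at every level and that levels are independent; neither assumption comes from the paper.

```python
def probability_of_reaching_leaf(per_level_accuracy, depth):
    """Chance that every ancestor on the path is selected, assuming independent levels."""
    return per_level_accuracy ** depth


# Hypothetical 90% per-level accuracy at increasing tree depths.
for depth in (2, 4, 6):
    print(depth, round(probability_of_reaching_leaf(0.9, depth), 3))
```

Under these assumptions, even a model that is right 90% of the time at each level reaches a code four levels down only about two times in three.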



Implementing the technique described in the paper

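All code and resources related to this article are available at this link, with a mirror in the repository associated with my original blog. I want to stress that my reimplementation is not identical to the paper; it differs in subtle ways, which I have documented in the original repository. I have tried to replicate the prompts used in the original paper to invoke GPT-3.5 and Llama-70B. For translating the dataset from Spanish to English, I created my own prompt, since the paper did not provide the details.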

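Let's implement the technique to better understand how it works. As mentioned earlier, the paper uses the CodiEsp test set for evaluation. This dataset consists of Spanish medical notes along with their ICD codes. Although the dataset includes an English-translated version, the authors note that they used GPT-3.5 to translate the Spanish medical notes into English, claiming that this gave a modest performance improvement over the pre-translated version. We replicate this functionality and translate the notes into English.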

def construct_translation_prompt(medical_note):
    """Construct a prompt template for translating Spanish medical notes to English.

    Args:
        medical_note (str): The medical case note.

    Returns:
        str: A structured template ready to be used as input for a language model.
    """
    translation_prompt = """You are an expert Spanish-to-English translator. You are provided with a clinical note written in Spanish.
You must translate the note into English. You must ensure that you properly translate the medical and technical terms from Spanish to English without any mistakes.
Spanish Medical Note:
{medical_note}"""

    return translation_prompt.format(medical_note=medical_note)

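Now that we have the evaluation corpus ready, let's implement the core logic of the tree-search algorithm. We define the functionality in get_icd_codes, which accepts the medical note to process, the model name, and the temperature setting. The model name must be either "gpt-3.5-turbo-0613" for GPT-3.5 or "meta-llama/Llama-2-70b-chat-hf" for Llama-2 70B Chat. This setting determines which LLM the tree-search algorithm invokes during processing.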

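GPT-4 can be evaluated with the same codebase by providing the appropriate model name, but we chose to skip this step because it is quite time-consuming.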

def get_icd_codes(medical_note, model_name="gpt-3.5-turbo-0613", temperature=0.0):
    """Identifies relevant ICD-10 codes for a given medical note by querying a language model.

    This function implements the tree-search algorithm for ICD coding described in https://openreview.net/forum?id=mqnR8rGWkn.

    Args:
        medical_note (str): The medical note for which ICD-10 codes are to be identified.
        model_name (str): The identifier for the language model used in the API (default is 'gpt-3.5-turbo-0613').
        temperature (float): Sampling temperature passed to the model (default is 0.0).

    Returns:
        list of str: A list of confirmed ICD-10 codes that are relevant to the medical note.
    """
    assigned_codes = []
    candidate_codes = [x.name for x in CHAPTER_LIST]  # start at the first level of the ICD-10 tree
    parent_codes = []
    prompt_count = 0

    # Cap the search at 50 LLM calls per note.
    while prompt_count < 50:
        # Map each candidate's description back to its code so the LLM output can be resolved.
        code_descriptions = {}
        for x in candidate_codes:
            description, code = get_name_and_description(x, model_name)
            code_descriptions[description] = code

        prompt = build_zero_shot_prompt(medical_note, list(code_descriptions.keys()), model_name=model_name)
        lm_response = get_response(prompt, model_name, temperature=temperature, max_tokens=500)
        predicted_codes = parse_outputs(lm_response, code_descriptions, model_name=model_name)

        for code in predicted_codes:
            if cm.is_leaf(code["code"]):
                # Leaf codes have no children to explore, so they are assigned to the note.
                assigned_codes.append(code["code"])
            else:
                # Non-leaf codes are queued so that their children are explored later.
                parent_codes.append(code)

        if len(parent_codes) > 0:
            parent_code = parent_codes.pop(0)
            candidate_codes = cm.get_children(parent_code["code"])
        else:
            # Nothing left to explore.
            break

        prompt_count += 1

    return assigned_codes

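Similar to the paper, we use the simple_icd_10_cm library, which provides access to the ICD-10 tree. This allows us to traverse the tree, access the description of each code, and identify valid codes. First, we get the nodes at the first level of the tree.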

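import simple_icd_10_cm as cm

def get_name_and_description(code, model_name):
    """Retrieve the name and description of an ICD-10 code.

    Args:
        code (str): The ICD-10 code.
        model_name (str): The model identifier, used to format the description.

    Returns:
        tuple: A tuple containing the formatted description and the name of the code.
    """
    # get_full_data returns a newline-separated summary; line 1 holds the code name and line 3 its description.
    full_data = cm.get_full_data(code).split("\n")
    return format_code_descriptions(full_data[3], model_name), full_data[1]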


# Prompt templates for each supported model. The {note} and {code_descriptions}
# placeholders are filled in by construct_prompt_template.
prompt_template_dict = {"gpt-3.5-turbo-0613": """[Case note]:
{note}

<example prompt>
Gastro-esophageal reflux disease
Enteroptosis

Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.
Enteroptosis: No.

Consider each of the following ICD-10 code descriptions and evaluate if there are any related mentions in the case note.
Follow the format in the example precisely.

{code_descriptions}""",

"meta-llama/Llama-2-70b-chat-hf": """[Case note]:
{note}

<code descriptions>
* Gastro-esophageal reflux disease
* Enteroptosis
* Acute Nasopharyngitis [Common Cold]
</code descriptions>

* Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.
* Enteroptosis: No.
* Acute Nasopharyngitis [Common Cold]: No.

Follow the format in the example response exactly, including the entire description before your (Yes|No) judgement, followed by a newline.
Consider each of the following ICD-10 code descriptions and evaluate if there are any related mentions in the Case note.

{code_descriptions}"""
}

We now construct the prompt from the medical note and the code descriptions. One convenience for prompting and coding is that we can use the same OpenAI library to interact with both GPT-3.5 and Llama 2, provided that Llama-2 is deployed via deepinfra, which also supports the OpenAI format for sending requests to the LLM.

def construct_prompt_template(case_note, code_descriptions, model_name):
    """Construct a prompt template for evaluating ICD-10 code descriptions against a given case note.

    Args:
        case_note (str): The medical case note.
        code_descriptions (str): The ICD-10 code descriptions formatted as a single string.
        model_name (str): The model identifier used to select the template.

    Returns:
        str: A structured template ready to be used as input for a language model.
    """
    template = prompt_template_dict[model_name]

    return template.format(note=case_note, code_descriptions=code_descriptions)

def build_zero_shot_prompt(input_note, descriptions, model_name, system_prompt=""):
    """Build a zero-shot classification prompt with system and user roles for a language model.

    Args:
        input_note (str): The input note or query.
        descriptions (list of str): List of ICD-10 code descriptions.
        model_name (str): The model identifier used to select the prompt format.
        system_prompt (str): Optional initial system prompt or instruction.

    Returns:
        list of dict: A structured list of dictionaries defining the role and content of each message.
    """
    if model_name == "meta-llama/Llama-2-70b-chat-hf":
        # The Llama-2 template expects a bulleted list of descriptions.
        code_descriptions = "\n".join(["* " + x for x in descriptions])
    else:
        code_descriptions = "\n".join(descriptions)

    input_prompt = construct_prompt_template(input_note, code_descriptions, model_name)
    return [{"role": "system", "content": system_prompt}, {"role": "user", "content": input_prompt}]


def get_response(messages, model_name, temperature=0.0, max_tokens=500):
    """Obtain responses from a specified model via the chat-completions API.

    Args:
        messages (list of dict): List of messages structured for API input.
        model_name (str): Identifier for the model to query.
        temperature (float): Controls randomness of response, where 0 is deterministic.
        max_tokens (int): Limit on the number of tokens in the response.

    Returns:
        str: The content of the response message from the model.
    """
    # `client` is an OpenAI-compatible client that is assumed to be created elsewhere.
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content



import re

def remove_noisy_prefix(text):
    # Drop the "* " bullet markers used in the Llama-2 prompt format
    cleaned_text = text.replace("* ", "").strip()
    # Remove numbers or letters followed by a dot and optional space at the beginning of the string
    cleaned_text = re.sub(r"^\s*\w+\.\s*", "", cleaned_text)
    return cleaned_text.strip()

def parse_outputs(output, code_description_map, model_name):
    """Parse model outputs to confirm ICD-10 codes based on a given description map.

    Args:
        output (str): The model output containing confirmations.
        code_description_map (dict): Mapping of descriptions to ICD-10 codes.
        model_name (str): The model identifier, used to decide whether prefixes need cleaning.

    Returns:
        list of dict: A list of confirmed codes and their descriptions.
    """
    confirmed_codes = []
    split_outputs = [x for x in output.split("\n") if x]
    for item in split_outputs:
        try:
            code_description, confirmation = item.split(":", 1)
            if model_name == "meta-llama/Llama-2-70b-chat-hf":
                code_description = remove_noisy_prefix(code_description)

            if confirmation.lower().strip().startswith("yes"):
                code = code_description_map[code_description]
                confirmed_codes.append({"code": code, "description": code_description})
        except Exception as e:
            # Malformed lines and descriptions missing from the map are reported and skipped.
            print(str(e) + " Here")
    return confirmed_codes
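To see what the parsing step does with a concrete response, here is a compact, standalone variant. The function name `parse_yes_no_lines`, the sample response, and the small description-to-code mapping are made up for this illustration; the article's own parse_outputs additionally applies model-specific prefix cleaning.

```python
def parse_yes_no_lines(output, code_description_map):
    """Return the codes whose description line starts with a 'Yes' judgement."""
    confirmed = []
    for line in output.split("\n"):
        if ":" not in line:
            continue  # skip blank or malformed lines
        description, judgement = line.split(":", 1)
        description = description.lstrip("* ").strip()
        if judgement.strip().lower().startswith("yes"):
            code = code_description_map.get(description)
            if code is not None:  # ignore descriptions the model altered or invented
                confirmed.append({"code": code, "description": description})
    return confirmed


# Hypothetical model response in the format requested by the prompt.
sample_response = """Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.
Enteroptosis: No.
Acute Nasopharyngitis [Common Cold]: No."""

# Illustrative description-to-code mapping.
description_to_code = {
    "Gastro-esophageal reflux disease": "K21",
    "Enteroptosis": "K63.4",
    "Acute Nasopharyngitis [Common Cold]": "J00",
}

print(parse_yes_no_lines(sample_response, description_to_code))
```

Only the reflux disease entry survives, because it is the only line whose judgement begins with "Yes". Splitting on the first colon keeps any explanation after the judgement from interfering with the match.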

Now let's look at the remainder of the loop. So far, we have constructed the prompt, obtained the response from the LLM, and parsed the output to identify the codes that the LLM considers relevant.

    while prompt_count < 50:
        code_descriptions = {}
        for x in candidate_codes:
            description, code = get_name_and_description(x, model_name)
            code_descriptions[description] = code

        prompt = build_zero_shot_prompt(medical_note, list(code_descriptions.keys()), model_name=model_name)
        lm_response = get_response(prompt, model_name, temperature=temperature, max_tokens=500)
        predicted_codes = parse_outputs(lm_response, code_descriptions, model_name=model_name)

        for code in predicted_codes:
            if cm.is_leaf(code["code"]):
                # Leaf codes are assigned to the note.
                assigned_codes.append(code["code"])
            else:
                # Non-leaf codes are queued for further exploration.
                parent_codes.append(code)

        if len(parent_codes) > 0:
            # Expand the next queued parent: its children become the new candidates.
            parent_code = parent_codes.pop(0)
            candidate_codes = cm.get_children(parent_code["code"])
        else:
            # No parents left to expand, so the search is complete.
            break

        prompt_count += 1





Results of our implementation for GPT-3.5 and Llama-2 70B Chat






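  1. Deploying LLMs requires substantial computational resources. This calls for powerful GPUs; without them, applications may suffer from high latency, which can limit their adoption.

  2. In addition, processing medical data typically requires strict data security and privacy guarantees. Online LLM services may not necessarily meet the security and privacy standards required for medical data processing.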






