A team led by Zhang Ningyu and Chen Huajun at Zhejiang University has proposed OceanGPT, the first large language model for the field of oceanography. The model can answer questions according to oceanographers' instructions, demonstrates a high level of professional knowledge across a range of ocean science tasks, and has also gained preliminary embodied intelligence capabilities in ocean engineering.
Artificial Intelligence (AI) tools, including large language models (LLMs), are gradually changing the scientific paradigm and have been listed by Nature as one of the scientific events to watch in 2024. As a core tool in the field of text data mining, large language models can extract key scientific information, patterns, and trends from massive amounts of text data, thereby deepening the understanding of different disciplines and providing strong support and insights for the scientific research process, decision-making, and solving complex problems.
For example, in the field of biomedicine, Microsoft has trained a language model called BioGPT on millions of relevant scientific papers in the PubMed database. This model is proficient in understanding complex concepts such as professional terminology, gene names, and protein sequences. Compared to non-professional models, BioGPT can quickly and accurately generate answers to biomedical questions, complete text mining, experimental report writing, molecular design, and literature review writing tasks.
Similarly, in the field of ocean science, using large language models to analyze massive amounts of ocean science text data, and understanding the theories and methods related to ocean characteristics, changes, and resource development and utilization, is of great importance for global climate regulation, weather pattern shaping, biodiversity maintenance, and the future economic development of humanity.
However, multi-dimensional, multi-scale ocean data are large in volume and rich in type, and traditional data processing methods struggle to cope. At the same time, ocean science spans multiple fields and disciplines, each with its own data attributes and patterns, which requires LLMs to have a deeper reserve of professional knowledge. Current mainstream LLMs still cannot fully meet the specific needs of oceanographers.
In response, the team of Zhang Ningyu and Chen Huajun from the College of Computer Science and Technology at Zhejiang University proposed OceanGPT, the first large language model for oceanography. The model handles a variety of ocean science tasks and can answer questions according to oceanographers' instructions. Evaluated on the oceanography benchmark OCEANBENCH, OceanGPT not only shows a high level of professional knowledge on ocean science tasks but has also gained preliminary embodied intelligence capabilities in ocean engineering.

OceanGPT project address:

In addition, to alleviate the difficulty of obtaining marine data, the researchers proposed a marine science instruction generation framework based on multi-agent collaboration, in which each agent acts as an expert in a specific field (such as science and research, resources and development, or ecology and environment) and is responsible for generating data for that field.
The research, titled "OceanGPT: A Large Language Model for Ocean Science Tasks," was recently accepted as a main conference paper at ACL 2024 (a CCF-A class conference), a top conference in natural language processing.
Research highlights:

* Compared with existing open-source large language models, OceanGPT, a large language model for the marine domain, can handle more specialized marine tasks.
* The marine science instruction generation framework DoInstruct is highly flexible and can be adapted and applied to other scientific fields (such as astronomy).
Paper Address:
Follow the public account and reply with "Ocean Large Language Model" in the background to get the complete PDF
The open-source project "awesome-ai4s" has collected over a hundred AI4S papers and provides a large number of datasets and tools:
Dataset: quality-driven, built from 67,633 marine science papers
The researchers collected 67,633 marine science papers from recent years as the raw corpus, and also selected some historically significant literature to help the LLM understand the development history of the marine field. To ensure diversity, the articles come from different channels and cover a variety of research perspectives and methods.
To ensure the quality and consistency of the data, the researchers used regular expressions to filter out figures, tables, headers, footers, page numbers, URLs, and references; removed redundant spaces, line breaks, and other non-text characters; and replaced or deleted special characters, emoticons, and garbled characters. The processed documents cover many subfields of marine science, such as physical oceanography, marine chemistry, marine biology, geology, and hydrology.
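The cleaning pass described above can be sketched in Python. The exact patterns the team used are not published, so the ones below are illustrative stand-ins:

```python
import re

def clean_document(text: str) -> str:
    """Illustrative corpus-cleaning pass (a sketch; the paper's exact
    patterns are not published). Strips URLs, bare page-number lines,
    citation markers, and redundant whitespace."""
    text = re.sub(r"https?://\S+", "", text)             # URLs
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.M)  # bare page numbers
    text = re.sub(r"\[\d+(,\s*\d+)*\]", "", text)        # citation markers like [3] or [1, 2]
    text = re.sub(r"[ \t]+", " ", text)                  # redundant spaces
    text = re.sub(r"\n{3,}", "\n\n", text)               # extra blank lines
    return text.strip()

sample = "Ocean currents [12] are driven by wind.\n\n\n42\nSee https://example.org for data."
print(clean_document(sample))
```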
Subsequently, the researchers used a hashing algorithm to deduplicate the data, which helps reduce the risk of overfitting during pre-training and enhances the model's generalization ability.
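The article does not name the specific hashing scheme; exact-match deduplication over normalized text illustrates the idea (near-duplicate methods such as MinHash follow the same pattern):

```python
import hashlib

def deduplicate(docs):
    """Sketch of hash-based deduplication (the paper does not specify the
    exact algorithm; exact-match hashing of normalized text shown here)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Tidal mixing drives nutrient supply.",
          "tidal mixing drives nutrient supply.",   # duplicate after normalization
          "Coral reefs host high biodiversity."]
print(len(deduplicate(corpus)))  # 2
```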
Because the marine science corpus spans multiple domains and topics, each with its own data characteristics and patterns, the researchers proposed a domain instruction generation framework called DoInstruct to effectively model and capture these data.

* Marine themes: based on the expertise of marine scientists, the marine science data were manually divided into 5 relatively independent themes: Science and Research; Resources and Development; Ecology and Environment; Technology and Engineering; and Life, Culture, and Others.
DoInstruct generates high-quality, professional, and diverse marine instruction data.
The domain instruction generation framework DoInstruct is based on multi-Agent collaboration and can effectively achieve marine data generation.
As shown in the figure above, under the DoInstruct framework the researchers designed 3 agent roles: the Evolving Agent as generator, the Fine-tuned Agent as literature extractor, and the rule-constrained Agent as inspector. Each agent is treated as an expert in a specific domain (theme) and is responsible for generating the corresponding data.
Evolving Agent as the Generator: To construct the seed dataset, the researchers hired dozens of annotators with strong backgrounds in marine science. Each annotator was responsible for several themes and manually wrote representative examples for each marine theme. The researchers then used a large language model to mimic the existing data and generate a large number of similar samples, all of which were manually checked by the annotators. The final seed instruction dataset includes 5 main categories, over 500 subcategories, and more than 10,000 data samples.
After obtaining the seed instruction dataset, researchers selected samples from it and called the Agent (gpt-3.5-turbo) to evolve the selected samples.
As shown in the left figure, specifically, by supplementing and expanding the background knowledge of the seed samples, refining and enhancing the knowledge points contained in the seed data, and through multiple rounds of iteration, researchers can quickly expand the existing seed dataset and expand the breadth and depth of information.
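The iterative evolution loop can be sketched as follows. The real generator calls gpt-3.5-turbo; the `evolve_sample` stand-in and the prompt wording below are purely illustrative, so the control flow is runnable without an API key:

```python
import random

# Hypothetical prompt wording; the paper's actual prompt is not shown here.
EVOLVE_PROMPT = ("Expand the background knowledge of the following marine "
                 "science instruction and deepen its knowledge points:\n{sample}")

def evolve_sample(sample: str) -> str:
    # Deterministic stand-in for an LLM call (e.g., gpt-3.5-turbo with EVOLVE_PROMPT).
    return sample + " [expanded with additional background]"

def evolve_dataset(seed_samples, rounds=2, picks_per_round=2, rng=None):
    """Each round picks samples from the growing pool and evolves them,
    expanding the breadth and depth of the seed dataset."""
    rng = rng or random.Random(0)
    dataset = list(seed_samples)
    for _ in range(rounds):                         # multiple rounds of iteration
        for sample in rng.sample(dataset, min(picks_per_round, len(dataset))):
            dataset.append(evolve_sample(sample))   # grow the pool
    return dataset

seeds = ["Explain thermohaline circulation.", "Describe deep-sea mining risks."]
print(len(evolve_dataset(seeds)))  # 2 seeds + 2 rounds x 2 evolutions = 6
```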
Fine-tuned Agent as the Literature Extractor: After fine-tuning, the literature-reading agent serves as an extractor of information from the literature. The researchers collected an expert-annotated corpus and used the BM25 algorithm to retrieve high-quality sentences from a larger ocean corpus, treating both as high-quality candidates. They then fine-tuned gpt-3.5-turbo on the seed instruction dataset and used the fine-tuned agent as a literature extractor capable of pulling high-quality text from the vast ocean corpus.
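BM25 is a standard lexical retrieval function; a minimal self-contained Okapi BM25 scorer over toy marine sentences looks like this (whitespace tokenization is a simplification of real preprocessing):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal Okapi BM25 scorer, the kind used to retrieve high-quality
    candidate sentences from a larger corpus."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                       # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["radionuclide transport in seawater",
        "coral reef ecology",
        "seawater radionuclide monitoring methods"]
scores = bm25_scores("radionuclide seawater", docs)
print([round(s, 3) for s in scores])
```

Sentences matching the query terms score above zero, so ranking by score surfaces the on-topic candidates.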
Agent as the Inspector with Rule Constraints: Ensuring Data Quality with an Auditing Agent
For the large number of generated instructions, the researchers used grammar, semantics, and basic definitions from the marine domain as rule constraints. The inspector agent, constructed via prompts, filters the data against these rules to ensure higher quality in the generated marine instruction data.
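The paper applies these rules through agent prompts; purely as a rough illustration, the same idea expressed as hard-coded filters might look like the following (the thresholds and term list are invented):

```python
# Simplified, hypothetical stand-ins for the inspector's rule constraints.
MARINE_TERMS = {"ocean", "marine", "seawater", "coast", "reef", "current"}

def passes_rules(instruction: str, response: str) -> bool:
    text = (instruction + " " + response).lower()
    if len(response.split()) < 5:                       # too short to be informative
        return False
    if not any(term in text for term in MARINE_TERMS):  # must be on-topic
        return False
    if response.strip()[-1] not in ".!?":               # crude well-formedness check
        return False
    return True

print(passes_rules("What drives ocean currents?",
                   "Wind stress and density gradients drive large-scale circulation."))
```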
To further ensure data quality, researchers randomly extracted 10% of the samples from the generated instruction dataset and had trained domain expert volunteers verify whether these samples had potential errors. The final data's IAA (inter-annotator agreement) score was 0.82, which meets the research purpose.
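The article does not state which IAA metric was used; Cohen's kappa is one common choice, sketched here on toy binary labels:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators (one common way to compute
    inter-annotator agreement; the metric actually used in the paper
    is not specified)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                       # observed agreement
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy check: two volunteers labeling 5 samples as error (1) / no error (0)
print(round(cohens_kappa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]), 3))  # 0.615
```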
As shown in the figure below, the DoInstruct framework can quickly build a marine science dataset using multiple agents, expanding it to more than 150,000 instructions (Data-Evolving, Data-Extracting) while preserving the professionalism and accuracy of the data.

As shown in the next figure, the researchers measure DoInstruct's data generation from three perspectives: knowledge quality (Quality), professionalism (Expertise), and diversity (Diversity).
It can be seen that the evolving generator Agent can effectively enhance the richness of marine data. The extraction Agent can improve the professionalism of the content, and the inspector Agent can raise the quality of the generated data. In summary, multi-agent collaboration is effective for marine instruction generation.
Based on LLaMA-2, OceanGPT performs better in marine tasks
After obtaining the instruction data, researchers pre-trained OceanGPT for 7 days using 6 Nvidia A800 GPUs based on LLaMA-2.
After obtaining the pre-trained model OceanGPT, the researchers fine-tuned it using the LoRA method. To evaluate OceanGPT's capabilities on marine tasks, they selected three models for comparison: LLaMA-2 (Llama-2-7b-chat-hf), Vicuna-1.5, and ChatGLM2-6B.

Before the comparison, the researchers designed a benchmark called OCEANBENCH. As shown in the figure below, the benchmark includes 15 ocean-related tasks, such as Analysis and Judgment.
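The LoRA idea itself is easy to illustrate: the frozen base weight W is augmented with a trainable low-rank update B·A, so only a small number of parameters are updated during fine-tuning. A minimal NumPy sketch follows (toy shapes, not OceanGPT's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2           # rank r << min(d_in, d_out)
alpha = 16                          # common LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))            # B starts at zero, so training starts from W

def lora_forward(x):
    # Only A and B (r * (d_in + d_out) parameters) would be trained.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0, the adapted layer is exactly the frozen base layer:
print(np.allclose(lora_forward(x), W @ x))  # True
```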
As illustrated in the figure below, the researchers compared the performance of OceanGPT with three baseline models on 15 sub-tasks in the field of oceanography at the task-level. The results revealed that OceanGPT outperformed the other models in both automated and human evaluations.
The figure below demonstrates the evaluation results of the OceanGPT model on OCEANBENCH ocean science tasks. The findings indicate that OceanGPT excels over other baseline language models in the vast majority of tasks.
From nuclear contamination to underwater robots, OceanGPT achieves a dual victory in oceanography.

To demonstrate OceanGPT's application potential in oceanography, the researchers tested it from the perspectives of marine science and marine engineering.
A New Powerful Tool for Radionuclide Research: OceanGPT with Greater Depth of Expertise
In the field of marine science, researchers focus on the issue of nuclear pollution in the marine environment and compare the performance of OceanGPT and Vicuna-7b-1.5 in this task.
As shown in the above figure, OceanGPT demonstrates a higher level of knowledge when describing the content of radionuclide research. Its text content is not only clear in structure and well-organized but also covers various aspects of radionuclide research, such as experimental design, data analysis, risk assessment, and handling guidelines.
In contrast, although Vicuna-7b-1.5 has clear expression and strong logic, it lacks deeper and more specific content related to radionuclides.
In summary, OceanGPT has advantages in professional knowledge, quality, and richness.

Ocean Engineering Intelligence: OceanGPT Achieves Precise Control of Underwater Robots
Ocean engineering is crucial for the sustainability and safety of offshore operations. To facilitate the interaction between OceanGPT and the external world, researchers synthesized robot code data and integrated these machine code instructions into the training data, assessing the model's capabilities through code or console commands.
As shown in the above figure, OceanGPT can issue commands to underwater robots via code or console commands so that the robots can perform complex tasks (based on human instructions). This indicates that OceanGPT has acquired preliminary embodied intelligence capabilities, paving the way for advanced marine models to execute complex robot control and planning tasks.
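The actual command syntax OceanGPT emits is not published; purely as an illustration of mapping a human instruction to console commands, a toy dispatcher might look like this (all command names and flags below are invented):

```python
# Hypothetical sketch: natural-language instruction -> console commands for
# an underwater robot. OceanGPT generates such code/console output directly;
# this keyword dispatcher only illustrates the input/output shape.
def plan_to_commands(instruction: str):
    plan = []
    text = instruction.lower()
    if "dive" in text or "depth" in text:
        plan.append("dive --depth 50")
    if "survey" in text or "scan" in text:
        plan.append("scan --mode sonar --radius 100")
    plan.append("surface")            # always end by returning to the surface
    return plan

for cmd in plan_to_commands("Dive and survey the seabed area."):
    print(cmd)
```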
OceanGPT has once again "evolved," ushering in the intelligent era of ocean science.
Led by Professors Ningyu Zhang and Huajun Chen from Zhejiang University, the research team, which includes Zhen Bi, Yida Xue, Yixin Ou, Daxiong Ji, and Guozhu Zheng, has successfully constructed the first large language model in the field of ocean science, OceanGPT. This marks a key step in the intelligent process of the ocean field, making OceanGPT an important milestone in the field of ocean science.
However, the development of OceanGPT has not stopped there. With deeper research and technical refinement, OceanGPT has undergone a new round of optimization and upgrades.

According to recent reports from the Knowledge Engine Laboratory at Zhejiang University (ZJUKG), the paper's first author, Bi Zhen, announced a series of significant advancements in OceanGPT:
* Firstly, the official launch of two new versions, OceanGPT-14B and OceanGPT-2B;
* Secondly, the addition of OceanGPT based on the Qwen2 Chinese base, achieving efficient bilingual interaction in both Chinese and English;
* Concurrently, the team has also open-sourced the 20K-scale OceanInstruct dataset of oceanic model instructions, providing valuable resource support for marine scientific researchers;
Download link for the OceanInstruct dataset:
* Lastly, the multimodal version, OceanGPT-V, has made its debut. It not only supports the processing of multimodal marine information such as sonar data and scientific images, but also offers an online demonstration, opening new perspectives and possibilities for marine scientific exploration. It is reported that this model will be open-sourced soon.

To analyze the changes in capability after the model updates, taking OceanGPT-14B as an example, the researchers provided a Chinese prompt: "Please generate a construction plan for the underwater cables in the East China Sea."
The results showed that the content generated by OceanGPT is richer and covers more levels, with a stronger understanding and generation capability of ocean science knowledge.
At the same time, to verify the English generation capability of OceanGPT, researchers provided an English input: "Please describe the characteristics of the seabed topography and geomorphology in the East China Sea."
The results showed that the descriptions generated by OceanGPT were relatively better in terms of detail, comprehensiveness, professionalism, and regional division, providing more accurate and in-depth information on seabed topography and geomorphology.
In addition, Bi Zhen also shared the development roadmap for OceanGPT. Between August and December 2024, a bilingual multimodal version, OceanGPT-V+, is expected to launch. Based on a large-scale corpus, the team will continue to train larger OceanGPT models (such as 30B and 70B) and maintain them by adding new data and tasks, exploring more unknowns in ocean science.
Looking forward to OceanGPT bringing more surprises and breakthroughs, and opening a new chapter in the research of ocean science!