MineralMLLM: A large model for intelligent mineral analysis

郭贤; 王耀祖; 张建良; 刘征建

doi:10.13374/j.issn2095-9389.2025.09.10.003

MineralMLLM: Intelligent analysis model of mineral phase

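Abstract: Mineral analysis is of great importance in geological exploration and resource development. However, traditional analysis methods depend heavily on specialist geological knowledge and precision instruments, and they are time-consuming and costly. In recent years, multimodal large language models (MLLMs) have developed rapidly and offer a new approach to mineral analysis, but general-purpose large models still fall short in their coverage of domain knowledge and in generalization. To address this, this study builds a mineral analysis system based on a multimodal large language model (Mineral multimodal large language model, MineralMLLM). With Qwen2.5-VL as the base model and a mineral image-text dataset, two parameter-efficient fine-tuning strategies, Infused adapter by attention (IA3) and Low-rank adaptation (LoRA), are applied and compared. Retrieval-augmented generation (RAG) is used to integrate a domain knowledge base, and a lightweight Web architecture provides visual interaction. Experimental results show that fine-tuning with LoRA and IA3 raises the system's BERT-Score by about 10% and 1%, respectively, over the base model, and that combining LoRA with RAG raises the score by about a further 10% relative to fine-tuning alone. This markedly improves mineral identification accuracy and the generation of professional descriptions, and helps raise the level of intelligence and the engineering value of mineral phase analysis. Ablation experiments verify that the system remains stable and robust when processing blurred images, demonstrating the effectiveness and reliability of the proposed approach.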

Abstract: Mineral analysis is a fundamental task in geological exploration and resource development, providing critical technical support for ore-genesis identification, resource evaluation, and process-parameter optimization. However, conventional mineral analysis methods rely heavily on expert knowledge and specialized instrumentation, resulting in high costs, low efficiency, and limited applicability in complex field environments. Moreover, their dependence on single-modal data restricts their performance in fine-grained mineral recognition. Recent advances in multimodal large language models (MLLMs) have opened new possibilities for mineral analysis by enabling a unified understanding of visual and textual information through advanced image encoders and cross-modal alignment mechanisms. Although MLLMs have shown promising results in domains such as education, healthcare, and geology, general-purpose models still fall short in mineral analysis: they lack domain-specific knowledge, generalize weakly in fine-grained mineral image recognition, and have limited ability to generate professional, structured geological reports. This study therefore proposes MineralMLLM, a multimodal mineral-analysis system built on Qwen2.5-VL, a state-of-the-art vision-language model optimized for Chinese scenarios with native support for high-resolution dynamic image processing, precise spatial grounding, and robust multimodal document understanding. To enhance domain adaptability, the model was fine-tuned on a self-constructed mineral image-text dataset comprising approximately 10,000 samples across 20 mineral categories. Two parameter-efficient fine-tuning strategies, Low-rank adaptation (LoRA) and Infused adapter by attention (IA3), were employed and systematically compared.
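As a rough illustration of how the two fine-tuning strategies differ, the following toy sketch applies each to a single frozen linear layer in pure Python. The layer shapes, rank, scaling factor, and function names are assumptions chosen for illustration and do not reflect MineralMLLM's actual configuration. LoRA adds a trainable low-rank update to the frozen weight, whereas IA3 learns only one scaling value per output feature.

```python
# Toy comparison of LoRA and IA3 on one frozen linear layer (pure Python).
# Shapes, rank, and alpha are illustrative assumptions, not the paper's settings.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """LoRA: y = W x + (alpha / rank) * B (A x); W stays frozen.

    A has shape rank x d_in and B has shape d_out x rank.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def ia3_forward(W, scale_vector, x):
    """IA3: y = l * (W x), an elementwise rescaling of the frozen output."""
    return [l * y for l, y in zip(scale_vector, matvec(W, x))]

def lora_trainable_params(d_in, d_out, rank):
    """Trainable values in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def ia3_trainable_params(d_out):
    """One learned scale per output feature."""
    return d_out
```

For a hypothetical 4096 × 4096 projection with rank 8, `lora_trainable_params` gives 65,536 trainable values against 4,096 for `ia3_trainable_params`, which illustrates in miniature the kind of capacity gap between the two methods.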
The dataset was collected from multiple sources, including mineralogy textbooks, academic literature, and online resources; it was then manually verified, augmented, and split by stratified random sampling into training, validation, and test sets at a ratio of 7:1:2 to ensure data quality and representativeness. Furthermore, retrieval-augmented generation (RAG) was integrated to incorporate domain-specific knowledge and establish a complete training-retrieval-inference pipeline. The RAG module adopts a hybrid retrieval strategy that combines dense vector retrieval (weight 0.6) with BM25-based sparse retrieval (weight 0.4), along with semantic chunking at an optimized threshold of 0.82 to balance semantic coherence and retrieval efficiency. A lightweight web-based interactive system was implemented using Vue.js and Flask to support mineral-image upload, semantic recognition, and structured result visualization. Experimental results show that fine-tuning with LoRA and IA3 improved the BERT-Score by approximately 10% and 1%, respectively, relative to the base model. LoRA achieved superior performance owing to its larger trainable parameter capacity (approximately 190 million parameters, or 2.24% of the total model parameters) and stronger feature adaptation capability. When combined with RAG, the LoRA-enhanced model improved the BERT-Score by an additional 10% (reaching 0.806), with significant gains in bilingual evaluation understudy (BLEU) (0.4786 vs. 0.4151) and ROUGE-F1 (0.5755 vs. 0.2258). These improvements significantly enhance both mineral-identification accuracy and the generation of professional structured descriptions, including mineral classification, characteristic analysis, and reference citations.
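The hybrid retrieval weighting described above can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract gives only the two weights (0.6 dense, 0.4 sparse), so the min-max normalization of each score list before fusion, the BM25 constants `k1` and `b`, and all function names are illustrative choices rather than the authors' implementation. Dense scores are assumed to come from cosine similarity between query and chunk embeddings produced by some external embedding model.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_rank(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Fuse normalized dense and sparse scores; return (ranking, fused scores)."""
    d, s = min_max(dense), min_max(sparse)
    fused = [w_dense * a + w_sparse * b for a, b in zip(d, s)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order, fused
```

Normalizing before fusion keeps the unbounded BM25 scores from dominating the bounded cosine scores; other fusion schemes, such as reciprocal rank fusion, would be equally plausible readings of the description.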
Ablation studies and robustness evaluations confirm the effectiveness and stability of MineralMLLM under challenging conditions, including blurred images, low-light environments, and partial occlusions. The model consistently outperformed baseline models under these conditions. Additionally, semantic chunking threshold analysis (τ ∈ [0.70, 0.90]) indicates that τ = 0.82 achieved optimal performance by balancing chunk granularity and semantic integrity. In conclusion, MineralMLLM effectively bridges domain-specific geological knowledge with the general reasoning capabilities of MLLMs, thus providing a scalable and practical solution for intelligent mineral analysis. The proposed framework not only advances mineral phase identification but also offers a transferable technical paradigm for deploying large multimodal models in other professional and industrial domains.
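One common way to apply a semantic chunking threshold such as τ = 0.82 is to merge consecutive sentences while the cosine similarity of adjacent sentence embeddings stays at or above τ, and to start a new chunk otherwise. The sketch below assumes that reading; the abstract does not specify the exact rule, and the embeddings are assumed to be supplied by an external sentence-embedding model (the function and parameter names are hypothetical).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_chunks(sentences, embeddings, tau=0.82):
    """Group consecutive sentences into chunks.

    A sentence joins the current chunk when its embedding has cosine
    similarity >= tau with the previous sentence's embedding; otherwise
    it opens a new chunk. This merge rule is an assumption.
    """
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= tau:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(chunk) for chunk in chunks]
```

Under this rule a lower τ produces fewer, longer chunks and a higher τ produces more, shorter ones, which is the granularity versus integrity trade-off that the threshold analysis explores.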
