Abstract:
Mineral analysis is a fundamental task in geological exploration and resource development, providing critical technical support for ore-genesis identification, resource evaluation, and process-parameter optimization. However, conventional mineral analysis methods rely heavily on expert knowledge and specialized instrumentation, resulting in high costs, low efficiency, and limited applicability in complex field environments. Moreover, their dependence on single-modal data restricts their performance in fine-grained mineral recognition. Recent advances in multimodal large language models (MLLMs) have introduced new possibilities for mineral analysis by enabling a unified understanding of visual and textual information through advanced image encoders and cross-modal alignment mechanisms. Although MLLMs have shown promising results in domains such as education, healthcare, and geology, general-purpose models remain inadequate for mineral analysis: they lack domain-specific knowledge, generalize poorly in fine-grained mineral image recognition, and have limited ability to generate professional, structured geological reports. This study therefore proposes MineralMLLM, a multimodal mineral-analysis system built on Qwen2.5-VL, a state-of-the-art vision-language model optimized for Chinese scenarios with native support for high-resolution dynamic image processing, precise spatial grounding, and robust multimodal document understanding. To enhance domain adaptability, the model was fine-tuned on a self-constructed mineral image–text dataset comprising approximately 10,000 samples across 20 mineral categories. Two parameter-efficient fine-tuning strategies, low-rank adaptation (LoRA) and Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3), were employed and systematically compared.
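To make the contrast between the two fine-tuning strategies concrete, the following minimal sketch counts the trainable parameters each one adds to a single transformer block. The hidden size, feed-forward width, rank, and choice of adapted projections are illustrative assumptions, not the configuration used in this study. The formulas follow the standard designs: LoRA learns two low-rank factors per adapted weight matrix, whereas IA3 learns one scaling vector each for the keys, the values, and the feed-forward activation.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns the update B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def ia3_params(d_key: int, d_value: int, d_ff: int) -> int:
    """Trainable parameters IA3 adds to one block: three scaling vectors."""
    return d_key + d_value + d_ff


# Hypothetical block dimensions, chosen only for illustration.
HIDDEN = 4096
FFN = 11008
RANK = 16

# Assumption: LoRA is applied to the query and value projections only.
lora_total = 2 * lora_params(HIDDEN, HIDDEN, RANK)
ia3_total = ia3_params(HIDDEN, HIDDEN, FFN)

print(f"LoRA per block: {lora_total:,}")
print(f"IA3 per block:  {ia3_total:,}")
print(f"ratio: {lora_total / ia3_total:.1f}x")
```

Under these assumed dimensions, LoRA trains roughly an order of magnitude more parameters per block than IA3, which is consistent with the difference in trainable capacity reported for the two methods.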
The dataset was compiled from multiple sources, including mineralogy textbooks, academic literature, and online resources; it was then manually verified, augmented, and split by stratified random sampling into training, validation, and test sets at a ratio of 7:1:2 to ensure data quality and representativeness. Furthermore, retrieval-augmented generation (RAG) was integrated to incorporate domain-specific knowledge and establish a complete training–retrieval–inference pipeline. The RAG module adopts a hybrid retrieval strategy that combines dense vector retrieval (weight 0.6) with BM25-based sparse retrieval (weight 0.4), along with semantic chunking at an optimized threshold of 0.82 to balance semantic coherence and retrieval efficiency. A lightweight web-based interactive system was implemented using Vue.js and Flask to support mineral-image upload, semantic recognition, and structured result visualization. Experimental results show that fine-tuning via LoRA and IA3 improved the BERTScore by approximately 10% and 1%, respectively, relative to the base model. LoRA achieved superior performance owing to its larger trainable parameter capacity (approximately 190 million parameters, or 2.24% of the total model parameters) and stronger feature adaptation capability. When combined with RAG, the LoRA-enhanced model improved the BERTScore by an additional 10% (reaching 0.806), with marked gains in bilingual evaluation understudy (BLEU) (0.4786 vs. 0.4151) and ROUGE-F1 (0.5755 vs. 0.2258). These improvements enhance both mineral-identification accuracy and the generation of professional structured descriptions, including mineral classification, characteristic analysis, and reference citations. Ablation studies and robustness evaluations confirm the effectiveness and stability of MineralMLLM under challenging conditions, including blurred images, low-light environments, and partial occlusions; the model consistently outperformed baseline models under these conditions. Additionally, an analysis of the semantic chunking threshold (τ ∈ [0.70, 0.90]) indicates that τ = 0.82 achieved optimal performance by balancing chunk granularity and semantic integrity. In conclusion, MineralMLLM effectively bridges domain-specific geological knowledge with the general reasoning capabilities of MLLMs, providing a scalable and practical solution for intelligent mineral analysis. The proposed framework not only advances mineral phase identification but also offers a transferable technical paradigm for deploying large multimodal models in other professional and industrial domains.
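The hybrid retrieval step can be sketched as a weighted sum of dense and sparse scores. The abstract gives only the two weights (0.6 and 0.4); the min-max normalization, the function names, and the toy scores below are assumptions added for illustration, and the fusion rule actually used may differ.

```python
from typing import Dict, List, Tuple

DENSE_WEIGHT = 0.6   # weight of dense vector retrieval, from the abstract
SPARSE_WEIGHT = 0.4  # weight of BM25 sparse retrieval, from the abstract


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 scores are comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    dense: Dict[str, float], sparse: Dict[str, float], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Fuse both score sets; a chunk missing from one retriever scores 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc: DENSE_WEIGHT * d.get(doc, 0.0) + SPARSE_WEIGHT * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy scores for hypothetical knowledge-base chunks, keyed by mineral name.
dense_scores = {"quartz": 0.91, "calcite": 0.85, "pyrite": 0.40}
bm25_scores = {"quartz": 7.2, "pyrite": 9.5, "fluorite": 3.1}

ranking = hybrid_rank(dense_scores, bm25_scores)
print(ranking)
```

In this toy case the chunk that both retrievers rate highly ranks first, while a chunk favored by only one retriever is kept but ranked lower, which is the usual motivation for combining dense and sparse signals.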
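Semantic chunking with a similarity threshold can be illustrated in the same spirit. The sketch below starts a new chunk whenever the cosine similarity between adjacent sentence embeddings falls below τ. This splitting rule, the function names, and the hand-written two-dimensional vectors standing in for real sentence embeddings are all assumptions, since the abstract reports only the threshold value.

```python
import math
from typing import List, Sequence

TAU = 0.82  # threshold reported in the abstract


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str], embeddings: List[Sequence[float]], tau: float = TAU
) -> List[List[str]]:
    """Group consecutive sentences; split where adjacent similarity drops below tau."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= tau:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return chunks


sentences = [
    "Quartz has a hardness of 7 on the Mohs scale.",
    "It shows conchoidal fracture and a vitreous luster.",
    "Pyrite forms cubic crystals with a metallic luster.",
]
# Toy embeddings: the first two point in similar directions, the third does not.
embeddings = [(1.0, 0.1), (0.9, 0.3), (0.1, 1.0)]

for chunk in semantic_chunks(sentences, embeddings):
    print(chunk)
```

Raising τ yields more, smaller chunks, and lowering it merges more sentences into each chunk, which is the trade-off between chunk granularity and semantic integrity that the threshold analysis explores.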