Abstract:
To address whether vision-language models (VLMs) can provide a substantial gain over conventional deep-learning-based vision methods for welding defect detection, the relevant literature is systematically analyzed across three types of detection targets: surface defects, internal defects, and weld formation abnormalities. The results show that this research direction is still at an early, exploratory stage; VLMs are complementary to existing defect detection algorithms and have considerable development potential. The strengths of existing VLMs lie in high-level tasks such as semantic interpretation, few-shot recognition guidance, and structured report generation. Their weaknesses are threefold: feature extraction is insufficient for X-ray, ultrasonic, and other non-natural image data; in fine-grained pixel-level localization, detection accuracy lags behind CNN-based object detection algorithms; and inference latency is high in real-time edge deployment scenarios. The more reasonable engineering path at present is therefore a hybrid, complementary architecture in which conventional vision methods handle precise localization and real-time pre-detection, while the VLM handles upper-level semantic interpretation and structured report output. The two form a hierarchical collaboration rather than a replacement relationship.