Abstract:
Robotic grasping has extensive applications in fields such as logistics sorting, automated assembly, and medical surgery. Grasping detection is an important step in robotic grasping. Recently, as their cost has fallen, depth cameras have increasingly been used for grasping detection, which has promoted the application of pose estimation-based methods to robotic grasping. However, most publicly available RGB-D image-based pose estimation datasets rely on equipment such as expensive 3D laser scanners to obtain 3D models of target objects. Moreover, the annotation process relies heavily on manual operation, which is time-consuming and labor-intensive and hinders the creation of large-scale datasets. To address these issues, this study implements an automatic dataset acquisition and annotation system for developing RGB-D image-based pose estimation methods for robotic grasping. The proposed system is easy to deploy and does not require an expensive 3D laser scanner. RGB-D image sequences are captured using only an off-the-shelf depth camera, and the system automatically produces the reconstructed 3D model of the target object, the annotated pose information, and 2D image segmentation masks. As part of the automatic annotation algorithm, a novel minimum spanning tree-based normal propagation method is proposed to ensure consistent normal directions, thereby avoiding the deformation or tearing of the reconstructed 3D surface that inconsistent normal directions cause. In the experiments, the proposed system created a pose estimation dataset containing 84 objects with 8400 RGB-D images. For each object, the system annotated the 3D model, image segmentation mask, and 6D pose in every RGB-D image. To evaluate the accuracy of the annotated segmentation masks, they were compared with the corresponding manually labeled results.
Furthermore, the accuracy of the annotation results was assessed through the performance of an instance segmentation network trained on the annotated image masks. To evaluate the accuracy of the annotated poses, a point cloud registration task was performed to align the model point cloud with the scene point cloud using the annotated pose parameters. In addition, a category-level pose estimation network was trained using the annotated pose parameters, and its performance directly reflects the accuracy of the annotation results. The experimental results show that the overlap between the annotated mask and the manually labeled mask exceeds 98%. Additionally, a 100% alignment rate is achieved, meaning that the model point cloud can be aligned to every scene point cloud through the corresponding annotated pose parameters. These results demonstrate that the system designed and implemented in this paper can create a high-quality dataset suitable for developing real-world pose estimation solutions. The proposed system thus provides a solid data foundation for future research on, and application of, deep learning models for robotic grasping detection.
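The two evaluation quantities mentioned above, mask overlap and pose-based alignment, can be illustrated with a minimal sketch. This is not the evaluation code used in this study. It assumes that "overlap" is measured as intersection over union (the abstract does not name the exact metric) and that a model counts as aligned when the mean nearest-neighbor distance to the scene falls below a tolerance. The function names and the `tol` value are hypothetical and chosen only for illustration.

```python
from math import dist


def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-sized binary masks (lists of rows)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def apply_pose(points, R, t):
    """Map model points into the scene frame: p' = R p + t."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def is_aligned(model, scene, R, t, tol=0.005):
    """Check alignment via mean nearest-neighbor distance.

    `tol` is a hypothetical threshold (same unit as the point coordinates).
    Brute-force search; fine for a small illustrative example only.
    """
    moved = apply_pose(model, R, t)
    mean_nn = sum(min(dist(p, q) for q in scene) for p in moved) / len(moved)
    return mean_nn <= tol


# Toy data: a 2x3 annotated mask versus a manually labeled one.
annotated = [[0, 1, 1], [0, 1, 1]]
manual = [[0, 1, 1], [0, 0, 1]]
print(mask_iou(annotated, manual))

# Toy pose: 90-degree rotation about z followed by a shift along x.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = (1, 0, 0)
model = [(1, 0, 0), (0, 1, 0)]
scene = [(1, 1, 0), (0, 0, 0)]
print(is_aligned(model, scene, R, t))
```

Under these assumptions, an alignment rate is simply the fraction of images for which `is_aligned` returns true, and the mask score is averaged over all annotated images.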