1 Introduction
Shape segmentation aims to divide a 3D shape into meaningful parts and to reveal its internal structure. It is the basis and prerequisite for exploring the inherent characteristics of a shape. The results of shape segmentation can be applied in various fields of computer graphics, such as shape editing Yu et al. (2004), deformation Yang et al. (2013), and modelling Chen et al. (2015). Shape segmentation has therefore become a research hotspot, yet difficulties persist in the fields of digital geometric model processing and instance modelling.
Convolutional networks have shown excellent performance on various image processing problems, such as image classification Krizhevsky et al. (2012); Szegedy et al. (2015); Simonyan and Zisserman (2014) and semantic segmentation Ciresan et al. (2012); Farabet et al. (2013); Pinheiro and Collobert (2014). Encouraged by these results, many researchers have devoted their efforts to variants of CNNs, one of which is the fully convolutional network (FCN) Long et al. (2015). This method can be trained end-to-end, pixels-to-pixels, for semantic segmentation, with no constraint on the size of the input image. It has thus become one of the key research topics in CNNs.
Although FCNs generate good results in image segmentation, we cannot directly apply them to 3D shape segmentation. This is mainly because an image is a static 2D array, which has a very standard data structure and regular neighbourhood relations; convolution and pooling can therefore be easily performed in FCNs. However, the data structure of a 3D shape is irregular, so it cannot be directly represented like that of an image. As triangle meshes have no regular neighbourhood relations like image pixels, direct convolution and pooling operations on a 3D shape are difficult to fulfil.
Inspired by the FCN architecture in image segmentation, we design and implement a new FCN architecture that operates directly on 3D shapes by converting a 3D shape into a graph structure. Based on the FCN process of convolution and pooling on images and existing methods of Graph Convolutional Neural Networks
Edwards and Xie (2016); Niepert et al. (2016); Defferrard et al. (2016), we design shape convolution and pooling operations that can be applied directly to the 3D shape. Combined with the original FCN architecture, we build a new shape fully convolutional network architecture and name it the Shape Fully Convolutional Network (SFCN). Secondly, following the SFCN architecture mentioned above and the basic flow of FCN image segmentation Long et al. (2015), we devise a novel trained triangles-to-triangles model for shape segmentation. Thirdly, for higher segmentation accuracy, we use multiple features of the shape to complete the training on the SFCN. Utilising the complementarity between features, combined with the multi-label graph cuts method Boykov et al. (2001); Kolmogorov and Zabih (2004); Boykov and Kolmogorov (2004), we optimise the segmentation results obtained by the SFCN prediction, through which the final shape segmentation results are obtained. Our approach realizes triangles-to-triangles learning and prediction with no requirements on the triangle count of the input shape. Furthermore, many experiments show that our segmentation results outperform those of existing methods Kalogerakis et al. (2010); Guo et al. (2015); Xie et al. (2014), especially when dealing with a large dataset. Finally, our method permits mixed-dataset learning and prediction. Datasets of different categories are combined in the test, and the accuracy of the segmentation results of different shapes decreases very little. As shown in Figure 1, for a mixed shape dataset from COSEG Wang et al. (2012) with several categories of shapes, part of the training set is displayed on the left, and some corresponding segmentation results are shown on the right. Figure 2 shows the process of our method. The main contributions of this paper include:

We design a fully convolutional network architecture for shapes, named the Shape Fully Convolutional Network (SFCN), which is able to perform effective convolution and pooling operations on 3D shapes.

We present shape segmentation and labelling based on the SFCN. It learns triangles-to-triangles from three low-level geometric features, and outperforms the state-of-the-art shape segmentation methods.

Excellent segmentation results on training and predicting mixed datasets of different categories of shapes are achieved.
2 Related Work
Fully convolutional networks. The fully convolutional network Long et al. (2015) proposed in 2015 is pioneering research that effectively solves semantic image segmentation by pixel-level classification. Subsequently, a great deal of research based on the FCN has emerged and achieved good results in various fields such as edge detection Xie and Tu (2015), depth regression Liu et al. (2015), optical flow Dosovitskiy et al. (2015), simplifying sketches Simo-Serra et al. (2016) and so on. However, existing research on FCNs is mainly restricted to image processing, largely because images have a standard data structure that lends itself to convolution, pooling and other operations.
Graph CNNs. Standard CNNs cannot work directly on data with graph structure. However, some previous works Bruna et al. (2014); Duvenaudy et al. (2015); Niepert et al. (2016); Defferrard et al. (2016) discuss how to train CNNs on graphs. Among these, Bruna et al. (2014); Defferrard et al. (2016) use a spectral method, computing bases of the graph Laplacian matrix and performing the convolution in the spectral domain. Niepert et al. (2016) use a PATCHY-SAN architecture, which maps a graph representation onto a vector space representation by extracting locally connected regions from graphs.
Deep learning on 3D shapes. Recent works introduce various deep architectures for 3D shapes. Using a volumetric representation Wu et al. (2015); Qi et al. (2016), standard 2D CNN convolution and pooling methods can be easily extended to 3D. However, this leads to problems that cannot be completely solved, such as data sparsity and the computational cost of 3D convolution. Kalogerakis et al. (2017) proposed multi-view image-based Fully Convolutional Networks (FCNs) and surface-based CRFs for 3D mesh labelling, using RGB-D data as input. However, it may take a lot of time to compute the surface-pixel reference, and more viewpoints are needed to maximally cover the shape surface. Other methods Yi et al. (2017); Su et al. (2017) proposed deep architectures that can be trained directly on point clouds to achieve point segmentation and other 3D point analysis. However, a point cloud sampled from the original meshes may lack the intrinsic structure of the meshes. To better capture intrinsic structure, many works adapt neural networks directly to meshes. The first mesh CNN method, GCNN Masci et al. (2015), used convolution filters in local geodesic polar coordinates. ACNN Boscaini et al. (2016) used anisotropic heat kernels as filters. MoNet Monti et al. (2017) considered a more generalized CNN architecture using a mixture model. Maron et al. (2017) used a global seamless parameterization for which the convolution operator is well defined, but it can only be applied to sphere-like meshes. Boscaini et al. (2015); Litany et al. (2017) show that intrinsic CNN methods can achieve better results on shape correspondence and descriptor learning with less training data. More details of these techniques can be found in Bronstein et al. (2017).
Supervised methods for segmentation and labelling.
Supervised methods train on a collection of labelled 3D shapes using advanced machine learning approaches Kalogerakis et al. (2010); Guo et al. (2015); Xie et al. (2014); Wang et al. (2012, 2013). For example, Kalogerakis et al. (2010) used Conditional Random Fields (CRFs) to model and learn from labelled examples, in order to realize component segmentation and labelling of 3D mesh shapes. Wang et al. (2013) first projected 3D shapes to 2D space, and the labelling results in 2D space were then projected back to the 3D shapes for mesh labelling. Xie et al. (2014) used Extreme Learning Machines (ELMs) to achieve consistent segmentation of unknown shapes. Guo et al. (2015); Kalogerakis et al. (2017) applied deep architectures to produce mesh segmentation.
Unsupervised segmentation and labelling. Many research methods build a segmentation model Huang et al. (2011); Sidi et al. (2011); Hu et al. (2012); Meng et al. (2013); Xu et al. (2010)
and achieve joint segmentation without any label information. There are predominantly two unsupervised learning methods: matching and clustering. In the matching approach, the matching relation between pairs of 3D shapes is obtained based on the similarity of corresponding units given by a correlation calculation; the segmentation of the matched shape is then transferred to realize joint segmentation of 3D shapes Kreavoy et al. (2007); Huang et al. (2011). By contrast, clustering methods analyze all the 3D shapes in the model set and cluster the consistent correlation units of the 3D shapes into the same class. A segmentation model is then obtained and applied to consistent segmentation Hu et al. (2012); Meng et al. (2013); Xu et al. (2010).
3 Shape Fully Convolutional Network
In image semantic segmentation, the fully convolutional network architecture completes the segmentation mainly through operations such as convolution and pooling Long et al. (2015). As analyzed above, the regular data structure of image pixels makes these operations easy to implement. By analogy with images, the triangles of a 3D shape can be seen as pixels of an image; but unlike pixels, the triangles of a shape follow no ordered sequence rules. Figure 3(a) shows the regular pixels of an image, while Figure 3(b) shows the irregular triangles of a shape. It is difficult to perform convolution and pooling on a 3D shape as on an image. Therefore, based on the characteristics of the 3D shape data structure, and by analogy with the operations on images, we need to design new shape convolution and pooling operations.
As a 3D shape is mainly composed of triangles and the connections among them, we can use a graph structure to describe it. Each 3D shape can be represented as a graph G = (V, E), with a vertex v ∈ V for each triangle and an edge e ∈ E for each connection between adjacent triangles. The triangles shown in Figure 3(b) correspond to the graph structure shown in Figure 3(c). Based on the graph structure, we design and implement shape convolution and pooling operations, which are detailed in the following sections.
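As a concrete illustration, the face-adjacency graph described above can be built by treating every triangle as a node and connecting triangles that share an edge. The following is a minimal sketch (the function name and representation are our own, not from the paper):

```python
from collections import defaultdict

def mesh_to_graph(faces):
    """Build the face-adjacency graph of a triangle mesh.

    `faces` is a list of (v0, v1, v2) vertex-index triples; the returned
    dict maps each face index to the set of faces sharing an edge with it.
    """
    edge_to_faces = defaultdict(list)
    for f, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            # an undirected edge is identified by its unordered vertex pair
            edge_to_faces[frozenset((u, v))].append(f)
    adj = defaultdict(set)
    for shared in edge_to_faces.values():
        for f in shared:
            for g in shared:
                if f != g:
                    adj[f].add(g)
    return adj
```

Each node of the resulting graph then stands for one triangle, and graph traversal replaces the implicit pixel grid of an image.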
3.1 Convolution on Shape
Convolution is one of the two key operations in the FCN architecture, allowing locally receptive features to be highlighted in the input image. When convolution is applied to images, a receptive field (a square grid) moves over each image with a particular step size. The receptive field reads the feature values of the pixels for each channel once, and a patch of values is created for each channel. Since the pixels of an image have an implicit arrangement, a spatial order, the receptive fields always move from left to right and top to bottom. By analogy with the convolution operation on images, therefore, we need to focus on the following two key ideas when employing the convolution operation on shapes:

Determining the neighbourhood sets around each triangle for the convolution operation according to the size of the receptive field.

Determining the order of execution of the convolution operation on each neighbourhood set.
For the first point, the convolution operation on the image is mainly based on the neighbourhood relationship between pixels. Accordingly, we need to construct locally connected neighbourhoods from the input shape. These neighbourhoods are generated efficiently and serve as the receptive fields of a convolutional architecture, permitting the architecture to learn shape representations effectively.
A shape has neighbourhood relationships just like an image, but its irregularity prevents it from being directly represented and used for FCN learning. Yet when the shape is expressed as a graph structure, the locally connected neighbourhoods of each of its triangles can be easily determined with various search strategies. In this paper, each triangle of the shape is viewed as a source node. We use breadth-first search to expand its neighbourhood nodes on the constructed graph, so as to obtain the neighbourhood set of each triangle in the shape. Suppose the receptive field size is set to k; then each neighbourhood set will have the same size, comprising k − 1 neighbourhood nodes and the source node, all of which will be used in the follow-up convolution operation. Figure 4(a) shows the graph structure of the shape, while Figure 4(b) shows the neighbourhood sets of each source node (that is, each triangle on the 3D shape) obtained with the method outlined above.
As for the second point, when performing convolution on an image, it is easy to determine the order of the convolution according to the spatial arrangement of the pixels. However, it is rather difficult to determine spatial orders among the triangles of a 3D shape. A new strategy is thus needed to reasonably sort the elements in the neighbourhood sets. Sorting ensures that the elements in each neighbourhood set are convolved by the same rules, so that the convolution operation can better activate features. For each node, all nodes in its neighbourhood set are sorted by feature similarity (L2 similarity in the input feature space). Using this method, we not only determine the convolution order of each neighbourhood set, but also ensure that the nodes in different sets have the same contribution regularity to their own source nodes in the convolution operation. The final convolution order for each neighbourhood set is shown in Figure 4(c). As shown in Figure 4(b), the execution order of the convolution of the neighbourhood set obtained from a source node is determined by calculating input feature similarity.
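The two steps above, gathering a fixed-size neighbourhood by breadth-first search and ordering it by L2 feature distance to the source, can be sketched as follows (a simplified illustration; the names and signatures are ours, not the paper's):

```python
import numpy as np
from collections import deque

def receptive_field(adj, source, k, feats):
    """Collect up to k nodes around `source` by breadth-first search,
    then sort them by L2 distance to the source in the input feature
    space, which fixes the convolution order within the set."""
    visited, found, queue = {source}, [source], deque([source])
    while queue and len(found) < k:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in visited and len(found) < k:
                visited.add(nb)
                found.append(nb)
                queue.append(nb)
    # the source itself (distance 0) always comes first after sorting
    found.sort(key=lambda n: float(np.linalg.norm(feats[n] - feats[source])))
    return found
```

Because every neighbourhood set is sorted by the same rule, a single filter can be applied to all sets in a consistent order.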
3.2 Pooling on Shape
Pooling is the other key operation in the FCN architecture. It is used to compress the resolution of each feature map (the result of the convolution operation) in the spatial dimensions, leaving the number of feature maps unchanged. Applying pooling across a feature map enables the algorithm to handle a growing number of feature maps and generalises them through resolution reduction. Common pooling operations take the average or maximum of receptive fields over the input map Edwards and Xie (2016). We use the same pooling operations in the shape fully convolutional network. However, we need to address a key concern: determining the pooling operation order of the SFCN on the shape feature map.
Similar to the convolution operation, we cannot directly determine the pooling operation order on SFCN based on spatial relationships among the triangles of the 3D shape. Since the 3D shape has been expressed as a graph structure, we can determine the pooling operation order according to the operation of convolutional neural networks on the graph. In this paper, the pooling operation on the SFCN is computed by adopting the fast pooling of graphs Niepert et al. (2016); Defferrard et al. (2016).
The pooling method for graphs Niepert et al. (2016); Defferrard et al. (2016) coarsens the graph with the Graclus multilevel clustering algorithm Dhillon et al. (2007). Graph coarsening aims to determine the new structure of the graph after pooling. We first present each shape as a graph structure, then we exploit the feature information on the triangles of the shape and the Graclus multilevel clustering algorithm Dhillon et al. (2007) to complete shape coarsening; that is, to determine the new connection relationship of the shape feature map after pooling, which is shown in Figure 5(a).
In the pooling process, traversing the feature map in a certain order according to the size of the receptive field is a key step, namely, determining the operation order of pooling. Following the pooling method for graphs proposed by Defferrard et al. (2016), the vertices of the input graph and its coarsened versions are irregularly arranged after graph coarsening. To define the pooling order, therefore, a balanced binary tree is built by adding fake nodes to sort the vertices. Lastly, the pooling operation on the graph is completed based on the node order, using a 1D signal pooling method. After shape coarsening, we apply the same approach to determine the order of pooling operations in the shape fully convolutional network architecture, as shown in Figure 5(b).
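To make the coarsen-then-pool idea concrete, here is a simplified sketch: a greedy pair matching stands in for the Graclus algorithm, unmatched vertices are padded with fake nodes, and pooling then reduces each 2-node cluster to one value. This is an illustration under our own simplifications, not the paper's exact procedure:

```python
import numpy as np

def greedy_coarsen(adj, num_nodes):
    """Greedy pair-matching coarsening (a simplified stand-in for
    Graclus): each node is matched with one unmatched neighbour;
    leftovers are paired with a fake node (None) so every cluster
    has exactly two members."""
    matched, clusters = set(), []
    for u in range(num_nodes):
        if u in matched:
            continue
        partner = next((v for v in sorted(adj.get(u, ())) if v not in matched), None)
        matched.add(u)
        if partner is None:
            clusters.append((u, None))          # None marks a fake node
        else:
            matched.add(partner)
            clusters.append((u, partner))
    return clusters

def pool(feats, clusters):
    """Max-pool a per-node feature matrix over the 2-node clusters.
    Fake nodes contribute nothing, i.e. they never win the max."""
    out = np.full((len(clusters), feats.shape[1]), -np.inf)
    for i, (u, v) in enumerate(clusters):
        out[i] = feats[u] if v is None else np.maximum(feats[u], feats[v])
    return out
```

Once the nodes are laid out in cluster order, this max over pairs is exactly a 1D pooling with stride 2 on the reordered signal.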
4 Shape Segmentation via Shape Fully Convolutional Network
We design a novel shape segmentation method based on the SFCN, analogous to the basic process of image semantic segmentation with the FCN Long et al. (2015). Firstly, we extract three categories of commonly used geometric features of the shape as input for SFCN training and learning. Secondly, based on the shape convolution and pooling operations proposed in Section 3 and the basic process of image semantic segmentation with the FCN, we design a network structure suitable for 3D shape segmentation. Through training and learning of the network, we produce triangles-to-triangles labelling predictions. Finally, we introduce the optimisation process of shape segmentation.
4.1 Geometric Feature Extraction
Our approach is designed to complete the network training and learning based on common and effective low-level features. In this paper, therefore, we extract three of the commonly used ones as the main features for network training and learning: average geodesic distance (AGD) Hilaga et al. (2001), shape context (SC) Belongie et al. (2002) and spin image (SI) Johnson and Hebert (1999). These features describe the characteristics of each triangle of a shape well, from multiple perspectives. We also found in our experiments that these three features are complementary, which will be analyzed in detail in the experimental section.
4.2 Shape Segmentation Networks Structure
As the convolution and pooling operations on a shape differ from those on an image, the FCN architecture originally used in image segmentation cannot be directly applied to a 3D shape. We modify the original FCN architecture according to these convolution and pooling characteristics, so that it can be used for shape segmentation.
Our training network is made up of four parts: convolution, pooling, generating and deconvolution layers, as shown in Figure 7. The convolution layer corresponds to a feature extractor that transforms the input shape to multidimensional feature representation. The convolution operation is completed by the method proposed in section 3.1. The pooling layer is used to reduce the feature vector of the convolution layer, and expand its receptive field to integrate feature points in the small neighbourhood into the new ones as output. The pooling operation is completed by the method proposed in section 3.2.
As our convolution and pooling operations are designed for shapes, the original FCN architecture cannot be used directly. Compared with the original FCN architecture, the SFCN architecture in this paper needs to record every neighbourhood set of each shape that participates in the convolution, obtained as in Section 3, as well as the convolution and pooling order of each shape. Thus, we add a generating layer to the original FCN architecture, which is illustrated in Figure 6.
Firstly, as shown in Figure 6(a), we calculate the pooling order between the nodes of the graph by the shape pooling method proposed in Section 3.2 (these nodes correspond to the triangles of the shape). We store the nodes in this calculation order in the generating layer, as shown in Figure 6(c). Figure 6(a) gives the pooling order on a shape, where the numbers denote the node (i.e. triangle) indices.
Secondly, we need to record the neighbourhood set of each node (i.e. each triangle of the shape) involved in the convolution computation, as shown in Figure 6(b), where we store the neighbourhood sets by reading the offsets in Figure 6(a). After the nodes are sorted by column in the generating layer, we record their neighbourhood sets, in which the nodes are sorted according to the convolution order calculated in Section 3.1. As shown in Figure 6(c), each row, sequenced in the convolution order, represents the neighbourhood set of one node, where the numbers denote the node (i.e. triangle) indices. By storing the data in this way, we can perform the convolution operation by row and the pooling operation by column, as shown in Figure 6(c). Another advantage of such storage is that, after pooling, the new neighbourhood set and pooling order required by the next convolution layer for each new node can still be obtained with the method of Section 3.2 and applied to the next generating layer.
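The storage scheme of the generating layer can be sketched as a dense table: row r holds the k-neighbourhood of the r-th node in pooling order, each row already in convolution order, so convolution runs along rows and pooling along columns. An illustrative sketch under our own naming:

```python
import numpy as np

def generating_layer(pool_order, neighbourhoods, feats, k):
    """Build the generating-layer table: rows follow the pooling order,
    and row r stores the features of node r's k-neighbourhood in its
    precomputed convolution order.  Shape: (num_nodes, k, channels)."""
    return np.stack([feats[neighbourhoods[n][:k]] for n in pool_order])

def conv_rows(table, weights):
    """One shape-convolution step: a filter of shape (k, in_ch, out_ch)
    is applied along each row of the table."""
    return np.einsum('nkc,kco->no', table, weights)
```

Because the rows already follow the pooling order, the subsequent pooling is a simple reduction down the first axis of the convolution output.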
The deconvolution layer is used to perform upsampling and densify our graph representation. As mentioned above, we regard pooling as graph coarsening by clustering vertices in groups of 4, as shown in Figure 7. Conversely, the upsampling operation can be regarded as the reverse of graph coarsening. We record how vertices are matched before and after graph coarsening, so it is easy to reuse the efficient deconvolution implementation of the image-based FCN Long et al. (2015). In this paper, the width of the convolution kernel in the deconvolution layer of the original FCN architecture is changed to 1 and its height is set to the pooling size we use, thereby obtaining the deconvolution layer of the SFCN. The final output of the layer is a probability map, indicating the probability that each triangle of a shape belongs to each of the predefined classes.
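A sketch of the upsampling step: since the vertex matching is recorded during coarsening, deconvolution can scatter each coarse value back to the fine-level vertices that were merged into it. This is our own simplified 2-to-1 illustration of the idea, not the exact implementation:

```python
import numpy as np

def upsample(coarse, clusters, n_fine):
    """Reverse a 2-node pooling step: each coarse row is copied back to
    the fine-level vertices that were merged into it; `clusters` is the
    list of (u, v) pairs recorded during coarsening (v may be None for
    a fake node that has no fine-level counterpart)."""
    fine = np.zeros((n_fine, coarse.shape[1]))
    for i, (u, v) in enumerate(clusters):
        fine[u] = coarse[i]
        if v is not None:
            fine[v] = coarse[i]
    return fine
```

Applying this once per pooling layer restores the prediction to the original number of triangles.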
Based on the FCN architecture proposed by Long et al. (2015), we design an SFCN architecture suitable for shape segmentation. Our network has five convolutional layers altogether; each convolution layer is preceded by a generating layer that prepares the data for the convolution and pooling, and followed by a pooling layer. Two fully connected layers are appended at the end to impose a class-specific projection. Corresponding to the pooling layers, there are five deconvolution layers, through which we obtain the predicted probability that each triangle of the shape belongs to each class. In the prediction process, we use the same skip architecture as Long et al. (2015). As in the original FCN architecture, it combines segmentation information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. The specific process is shown in Figure 8. The prediction probability of each layer is obtained by adding the result of a deconvolution layer to the corresponding result of the pooling layer after convolution, which also serves as the input of the next deconvolution layer. The rows are upsampled 5 times to return to the initial number of triangles, where the value of each channel is the probability of the corresponding class, realizing triangle-level prediction. Another difference from the original FCN architecture is that, in order to normalise the data, we add a batch normalisation layer after the convolution operation of the first layer, using the default BN implementation in Caffe
Jia et al. (2014).
4.3 Shape Segmentation Optimisation
We train on these three features separately using the network structure of Section 4.2. Given testing shapes, the SFCN produces the corresponding segmentation result under each feature, describing the segmentation of the 3D shape from different perspectives. Owing to the different input geometric features, there may be some differences among the predicted segmentation results of the same shape. To obtain the final segmentation result, we leverage the complementarity among features and the multi-label graph cuts method Boykov et al. (2001); Kolmogorov and Zabih (2004); Boykov and Kolmogorov (2004) to optimise the segmentation results of the testing shape. The final segmentation result is obtained mainly by optimising the following formula:
E(l) = \sum_{i} E_d(l_i) + \lambda \sum_{(i,j)} E_s(l_i, l_j)    (1)
In this formula, l_i and l_j are the labels of triangles i and j; the data term E_d(l_i) describes the energy cost of marking triangle i with label l_i, and the smoothing term E_s(l_i, l_j) describes the energy cost of marking neighbouring triangles i and j with different labels.
The first term of the formula is based on the probability that triangle i is marked with label l_i. We predict the shape under each of the three features, so the same triangle has a prediction probability under each feature. Utilising the features' complementarity, we vote among the labelling results to get the final prediction probability, and take its negative logarithm, similar to Guo et al. (2015), as the first term of the multi-label graph cut. The second term smooths the segmentation results through the dihedral angle between a triangle and its neighbour: the dihedral angle multiplied by the shared edge length forms the second term of the formula. The energy E is minimized by multi-label graph cuts optimisation Boykov et al. (2001); Kolmogorov and Zabih (2004); Boykov and Kolmogorov (2004), through which we obtain more accurate shape segmentation results.
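The two energy terms can be sketched directly from the description above: the data term votes the per-feature probabilities and takes a negative logarithm, and the pairwise term charges the dihedral angle times the shared edge length when neighbouring labels differ. An illustrative sketch; the clipping constant is our own guard against log 0:

```python
import numpy as np

def data_term(probs_per_feature):
    """Average (vote) the per-feature SFCN probabilities of one triangle
    and return the negative log-likelihood of each label."""
    voted = np.mean(probs_per_feature, axis=0)       # shape: (num_labels,)
    return -np.log(np.clip(voted, 1e-10, 1.0))

def smooth_term(dihedral_angle, edge_length, li, lj):
    """Pairwise cost: zero for equal labels, otherwise the dihedral angle
    between the two triangles multiplied by their shared edge length."""
    return 0.0 if li == lj else dihedral_angle * edge_length
```

These unary and pairwise costs are exactly what a multi-label graph-cut solver consumes to minimise the total energy.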
5 Experimental Results and Discussion
Data. In this paper, deep learning is used to segment shapes. To verify the effectiveness of our method, we first tested 3 existing large datasets from COSEG Wang et al. (2012): the chair dataset of 400 shapes, the vase dataset of 300 shapes and the tele-alien dataset of 200 shapes. The mixed-dataset experiment was also carried out on the Princeton Segmentation Benchmark (PSB) Chen et al. (2009). In addition, to confirm the robustness of the method, we selected 11 categories of the PSB Chen et al. (2009) dataset for testing, each containing 20 shapes. For the COSEG datasets, we used the ground truth of Wang et al. (2012); for the PSB datasets, we used the ground truth of Kalogerakis et al. (2010).
SFCN Parameters.
We trained by SGD with momentum, using momentum 0.9 and a weight decay of 1e-3. We chose ReLU as the activation function, and kept the dropout used in the original classifier nets. The per-triangle, unnormalised softmax loss is a natural choice for segmenting shapes of any size, and we trained our nets with it.
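For reference, one SGD-with-momentum update matching the stated hyperparameters (momentum 0.9, weight decay 1e-3) looks like this; a generic textbook sketch, not the Caffe solver code:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-3):
    """One parameter update: L2 weight decay is folded into the gradient,
    the velocity accumulates a momentum-damped history, and the weights
    move along the velocity."""
    g = grad + weight_decay * w
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity
```
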
Computation Time. We used two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with 12 cores and an NVIDIA GeForce TITAN X GPU. For the large datasets, when we used 50 shapes for training and the triangle count of each shape ranged from 1000 to 5000, training took approximately 120 minutes (including the time to compute the input descriptors), and testing and optimisation of a shape took about 30 seconds.
Results. In the experiment, we used the classification accuracy proposed by Kalogerakis Kalogerakis et al. (2010) and Sidi Sidi et al. (2011) for the quantitative evaluation of our method.
Firstly, to compare with the method of Guo et al. (2015) on COSEG's large datasets, we randomly selected 50 shapes for training from the chair dataset and 20 shapes from the vase dataset. We compared our method with three shape segmentation methods Sidi et al. (2011); Kim et al. (2013); Guo et al. (2015). The results are presented in Table 1. It should be noted that the results of the other methods come from Guo et al. (2015). The table shows that our method outperforms the existing methods, which proves its effectiveness.
TOG [2011]  TOG [2013]  TOG [2015]  Ours  
Chair  80.20%  91.20%  92.52%  93.11% 
Vase  69.90%  85.60%  88.54%  88.91% 
ToG[2011] is short for Sidi et al. [TOG 2011], ToG[2013] is short for Kim et al. [TOG 2013], ToG[2015] is short for Guo et al. [2015].
Secondly, to verify the effectiveness of our method on large datasets, we randomly selected 25%, 50% and 75% of the shapes for training from each of the three large COSEG datasets. To verify whether the SFCN prediction accuracy grows as the training set increases, we used the same 25% of shapes for testing in each experiment on each large dataset. In other words, the training sets gradually increased, while the test set remained the same. The results can be seen in Table 2. Because Xie et al. (2014) carried out a similar experiment under a 75% training condition, we also compare with their results in Table 2. It can be seen from the table that the results obtained by our method are better than theirs. Furthermore, as the training set increases, the classification accuracy of our method grows steadily on the same test set. This shows that with more training data, both the learning ability and the generalisation ability of the network become stronger, which also proves the effectiveness of the designed network.
Ours 25%  Ours 50%  Ours 75%  CGF[2014] 75%  
Chair  93.43%  94.38%  95.91%  87.09% 
Vase  88.04%  90.95%  91.17%  85.86% 
Telealien  91.02%  92.76%  93.00%  83.23% 
CGF[2014] is short for Xie et al. [CGF 2014].
Thirdly, we visualize the segmentation results obtained by our method on the three large COSEG datasets, as shown in Figure 9. All results shown were optimised and obtained using the 75% training set. The segmentation results appear visually accurate, which demonstrates the effectiveness of the method.
Mixed-dataset training and testing were performed as well. We mixed airplane with bird, and human with teddy, which are pairs of similar classes. It must be noted that, unlike the method of Guo et al. (2015), which merges similar class labels, we retained the different labels; thus the body of the plane and that of the bird have different labels, and likewise their wings. 12 shapes of each dataset were selected as the training set, and the remaining as the test set. Our approach was repeated three times to compute the average performance; the segmentation results are shown in Figure 10. Figure 10(a) shows part of the training set, while Figure 10(b) shows part of the testing set. The classification accuracy on the two datasets is shown in Table 3, which suggests that we can obtain good segmentation results when mixing similar data together. Although the segmentation accuracy is not as high as on single-category training, the average is above 90%. Additionally, the visualised results show that the basic segmentation is correct, proving that the SFCN architecture is powerful in distinguishing features and learning.
Airplane & Bird  Human & Teddy  
Accuracy  90.04%  92.28% 
In addition, we mixed datasets of dissimilar categories: glasses with pliers, and glasses, pliers, fish and tables together. From the mixed datasets, we selected some data for training and used the remainder for testing; here, 12 shapes of each dataset were selected as the training set and the remaining as the test set. We also mixed two large datasets of 400 chairs and 300 vases, selecting 50% of the shapes from each for training and the remaining 50% as the test set. Our approach was repeated three times to compute the average performance; the segmentation results are shown in Figure 11 and Figure 1. Figure 11(a) shows part of the training set, while Figure 11(b) shows part of the testing set. The classification accuracy on the three datasets is shown in Table 4, which indicates that both the segmentation results and the classification accuracy achieve impressive performance. In other words, this method can be used to train and predict on any shape set, proving once again that our SFCN architecture has good feature-distinguishing and learning ability.
Mixed dataset | Glasses & Plier | Chair(400) & Vase(300) | Glasses & Plier & Fish & Table
Accuracy | 96.53% | 87.71% | 91.82%
Lastly, as in the paper of Guo et al. (2015), we separately trained on several small datasets of PSB with n = 6 and n = 12 (i.e., SB6 and SB12), where n is the number of randomly selected shapes in each training process. For each category, as in Guo et al. (2015), we repeated our approach five times and report the average performance. The comparison with related methods is presented in Table 5; the results and data for the other methods are taken from Guo et al. (2015). As shown in Table 5, our method performs much better in most categories than the existing methods Kalogerakis et al. (2010); Guo et al. (2015); Xie et al. (2014); Wang et al. (2013); Chang and Lin (2007); Torralba et al. (2007), which demonstrates its effectiveness. On a few individual datasets, such as airplane, chair and table, our results do not surpass those of other methods, yet are very close to the best ones, which also supports the effectiveness of our method. Moreover, the segmentation quality strengthens gradually as the training data increases, showing that the learning ability of the SFCN architecture is enhanced with the number of training samples. We also visualise the segmentation results of several PSB datasets, including teddy, pliers and fish, under the SB12 condition; the optimised results are shown in Figure 12. As with the large datasets, the results on the small ones are visually accurate, indicating that our method is feasible.
Category | SVM SB6 | JointBoost SB6 | ToG[2010]^1 SB6 | ToG[2013]^2 SB6 | ToG[2015]^3 SB6 | Ours SB6 | ToG[2010]^1 SB12 | ToG[2013]^2 SB12 | ToG[2015]^3 SB12 | Ours SB12
Cup | 94.11% | 93.12% | 99.1% | 97.5% | 99.49% | 99.59% | 99.60% | 99.60% | 99.73% | 99.74%
Glasses | 95.92% | 93.59% | 96.10% | – | 96.78% | 97.15% | 97.20% | – | 97.60% | 97.79%
Airplane | 80.43% | 91.16% | 95.50% | – | 95.56% | 93.90% | 96.10% | – | 96.67% | 95.30%
Chair | 81.38% | 95.67% | 97.80% | 97.90% | 97.93% | 97.05% | 98.40% | 99.60% | 98.67% | 98.26%
Octopus | 97.04% | 96.26% | 98.60% | – | 98.61% | 98.67% | 98.40% | – | 98.79% | 98.80%
Table | 90.16% | 97.37% | 99.10% | 99.60% | 99.11% | 99.25% | 99.30% | 99.60% | 99.55% | 99.41%
Teddy | 91.01% | 85.68% | 93.30% | – | 98.00% | 98.04% | 98.10% | – | 98.24% | 98.27%
Plier | 92.04% | 81.55% | 94.30% | – | 95.01% | 95.71% | 96.20% | – | 96.22% | 96.55%
Fish | 87.05% | 90.76% | 95.60% | – | 96.22% | 95.63% | 95.60% | – | 95.64% | 95.76%
Bird | 81.49% | 81.80% | 84.20% | – | 87.51% | 89.03% | 87.90% | – | 88.35% | 89.48%
Mech | 81.87% | 75.73% | 88.90% | 90.20% | 90.56% | 91.72% | 90.50% | 91.30% | 95.60% | 96.05%
Average | 88.41% | 89.34% | 94.77% | 96.30% | 95.89% | 95.98% | 96.12% | 97.53% | 96.82% | 96.85%
^1 ToG[2010] is short for Kalogerakis et al. (2010); ^2 ToG[2013] for Wang et al. (2013); ^3 ToG[2015] for Guo et al. (2015).
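The SB6/SB12 protocol described above (randomly select n training shapes per category, test on the rest, repeat five times and average) can be sketched as follows; `random_split`, `repeated_evaluation` and the dummy evaluator are hypothetical names for illustration, not the authors' code:

```python
import random

def random_split(shape_ids, n_train, seed=None):
    """Randomly pick n_train shapes for training; the rest form the test set."""
    rng = random.Random(seed)
    ids = list(shape_ids)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

def repeated_evaluation(shape_ids, n_train, run_once, repeats=5):
    """Average the accuracy returned by run_once(train, test) over several splits."""
    scores = []
    for seed in range(repeats):
        train, test = random_split(shape_ids, n_train, seed=seed)
        scores.append(run_once(train, test))
    return sum(scores) / len(scores)

# Toy usage: a 20-shape category under the SB6 protocol, with a stand-in
# evaluator in place of actually training and testing a network.
dummy = lambda train, test: 0.95
print(repeated_evaluation(range(20), 6, dummy))
```

In the real experiments, `run_once` would stand for training the networks on the selected shapes and measuring triangle-level accuracy on the held-out ones.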
Feature sensitivity. In this paper, we combine three features, average geodesic distance (AGD) Hilaga et al. (2001), shape context (SC) Belongie et al. (2002) and spin image (SI) Johnson and Hebert (1999), to produce the shape segmentation. To verify the validity of this combination, we carried out a comparative experiment: under the SB6 condition, the classification accuracy of each PSB dataset with a single feature is compared against that of our combination, as shown in Figure 13. A few individual datasets achieve slightly higher accuracy under a single feature, but not significantly so; on the contrary, most datasets perform much better under our method. This shows that the selected features are complementary and that the feature combination is effective.
In the papers of Kalogerakis et al. (2010) and Guo et al. (2015), seven features are combined to perform the segmentation experiments. In addition to the three features used in this paper, they also use four others: curvature (CUR) Kavukcuoglu et al. (2009), PCA feature (PCA) Shapira et al. (2010), shape diameter function (SDF) Liu et al. (2009) and distance from medial surface (DIS) Long et al. (2015). We randomly selected several PSB datasets and repeated the experiments with the seven-feature combination. Under the same experimental conditions, the comparison with the three-feature combination is presented in Figure 14. For most datasets, the three-feature combination yields higher classification accuracy than the seven-feature one, and for the remaining few the two are very close. This indicates that the three features used in this paper not only complement each other better, but are also more suitable for the network structure. In addition, we also tried combining the three features in a single SFCN network; this did not produce higher accuracy and nearly doubled the training time. So, as a trade-off between performance and efficiency, we empirically train the three features separately to perform the segmentation task.
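Since the three features are trained in separate SFCN networks and the pre-optimisation label of each triangle is obtained by voting on the three feature-specific predictions, the fusion step amounts to a plain per-triangle majority vote. A minimal sketch (the function name and toy labels are our own assumptions):

```python
from collections import Counter

def vote(per_feature_labels):
    """Majority vote across label lists, one per feature, one label per triangle."""
    fused = []
    for votes in zip(*per_feature_labels):
        # Counter.most_common breaks ties by first occurrence among the votes.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Toy per-triangle predictions from three feature-specific networks.
agd = [0, 1, 1, 2]
sc  = [0, 1, 2, 2]
si  = [0, 0, 1, 2]
print(vote([agd, sc, si]))  # [0, 1, 1, 2]
```

With three voters and an odd disagreement pattern, the majority label always exists except in three-way ties, where the first feature's prediction wins under this tie-breaking rule.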
Skip architecture. During SFCN training, we utilise a five-layer skip architecture and integrate cross-layer information at different strides to improve the SFCN segmentation results and gradually refine the segmentation details. To verify that the skip architecture can learn mesh labelling and improve the segmentation results, we visualise the cross-layer prediction results at different strides. The segmentation results of several shapes when crossing one to five layers are shown in Figure 15, where (f) is the ground truth of the corresponding shape. By tracking the network during computation, we find that convergence is faster as the number of crossed layers increases. Moreover, the comparison between the cross-layer results and the ground truth in Figure 15 shows that the classification quality gradually improves as more layers are crossed.
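On a graph, a skip connection amounts to copying each pooled node's class scores back to its member triangles and summing them with a finer layer's scores, in the spirit of the FCN skip architecture for pixels. A hedged numpy sketch (the shapes, names and toy numbers are our own assumptions, not the paper's implementation):

```python
import numpy as np

def upsample(coarse_scores, assignment):
    """Copy each pooled node's class scores back to its member triangles."""
    return coarse_scores[np.asarray(assignment)]

def fuse(fine_scores, coarse_scores, assignment):
    """Sum per-triangle scores of a fine layer with upsampled coarse scores."""
    return fine_scores + upsample(coarse_scores, assignment)

# Toy example: 4 triangles pooled into 2 clusters, 3 part labels.
fine = np.array([[0.2, 0.5, 0.3],
                 [0.6, 0.2, 0.2],
                 [0.1, 0.1, 0.8],
                 [0.3, 0.4, 0.3]])
coarse = np.array([[1.0, 0.0, 0.0],   # cluster 0 strongly votes label 0
                   [0.0, 0.0, 1.0]])  # cluster 1 strongly votes label 2
assignment = [0, 0, 1, 1]             # triangle -> cluster membership
fused = fuse(fine, coarse, assignment)
print(fused.argmax(axis=1))           # per-triangle labels after fusing layers
```

Crossing more layers repeats this fuse step at successively finer resolutions, which is consistent with the gradual refinement observed in Figure 15.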
Comparison before and after optimisation. We use the multi-label graph cut optimisation method Boykov et al. (2001); Kolmogorov and Zabih (2004); Boykov and Kolmogorov (2004) to optimise the segmentation results of the testing shapes, exploiting the complementarity between features. The comparison before and after optimisation for several shapes is shown in Figure 16. Figure 16(a) shows the results before optimisation, i.e., the final labels obtained by voting on the three feature-specific predictions of the SFCN architecture, and Figure 16(b) shows the results after optimisation. As the pre-optimisation results are predicted at the triangle level, boundaries may not be smooth and individual triangles may be mislabelled in some areas. These problems are well addressed by the multi-label graph cut optimisation, which shows that the optimisation step plays a significant role. In addition, the number below each figure is the classification accuracy of the corresponding shape, confirming that optimisation can also improve classification accuracy.
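The multi-label graph cut minimises an energy that balances per-triangle data costs against label smoothness across adjacent triangles. The following is an illustration of such a Potts-style energy (the exact terms and weight `lam` are our assumptions, not necessarily the paper's formulation; real solvers use alpha-expansion rather than evaluating the energy directly):

```python
def energy(labels, data_cost, adjacency, lam=1.0):
    """labels[t]: label of triangle t; data_cost[t][l]: cost of label l at t;
    adjacency: pairs (s, t) of edge-adjacent triangles (Potts smoothness)."""
    data = sum(data_cost[t][labels[t]] for t in range(len(labels)))
    smooth = sum(1 for s, t in adjacency if labels[s] != labels[t])
    return data + lam * smooth

# Toy example: 3 triangles in a strip, 2 candidate labels. A lone triangle
# whose data term prefers label 1 is smoothed over when lam is large enough.
data_cost = [[0.1, 0.9], [0.8, 0.2], [0.1, 0.9]]
adj = {(0, 1), (1, 2)}
print(energy([0, 0, 0], data_cost, adj))  # smooth labelling
print(energy([0, 1, 0], data_cost, adj))  # pays for two label boundaries
```

The smoothness term is what removes the isolated mislabelled triangles and jagged boundaries visible before optimisation in Figure 16(a).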
Limitations. Although the proposed method is effective for the 3D shape segmentation task, it has some limitations. Firstly, the shapes involved in the computation must be manifold meshes, since these make it easy to determine the connectivity between triangles. Secondly, the designed SFCN architecture has no feature-selection capability, so we train the three features separately. Thirdly, to share parameters effectively, the SFCN structure may require all training meshes to have similar triangulation granularity. In future work, we will therefore seek to improve the SFCN architecture to enable automatic feature selection and to build an end-to-end structure.
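The manifold requirement is what makes triangle connectivity easy to determine: every interior edge of a manifold mesh is shared by exactly two triangles. A small sketch of adjacency extraction under that assumption (our own illustrative code, not the authors'):

```python
from collections import defaultdict

def face_adjacency(faces):
    """faces: (i, j, k) vertex-index triples. Returns pairs of adjacent faces."""
    edge_to_faces = defaultdict(list)
    for f, (i, j, k) in enumerate(faces):
        for a, b in ((i, j), (j, k), (k, i)):
            edge_to_faces[frozenset((a, b))].append(f)
    pairs = set()
    for fs in edge_to_faces.values():
        if len(fs) == 2:          # interior edge of a manifold mesh
            pairs.add(tuple(sorted(fs)))
        elif len(fs) > 2:         # non-manifold edge: adjacency is ambiguous
            raise ValueError("non-manifold edge")
    return pairs

# Two triangles sharing the edge (1, 2):
print(face_adjacency([(0, 1, 2), (1, 2, 3)]))  # {(0, 1)}
```

On a non-manifold mesh, an edge shared by three or more triangles leaves the graph structure, and hence the convolution neighbourhood, ill-defined, which is why such meshes fall outside the method's scope.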
6 Conclusions
We design a novel shape fully convolutional network (SFCN) architecture, which can automatically carry out triangles-to-triangles learning and prediction, and complete the segmentation task with high quality. In the proposed SFCN architecture, analogous to convolution and pooling operations on images, we design novel shape convolution and pooling operations on a 3D shape represented as a graph structure. Moreover, based on the original FCN architecture for image segmentation Long et al. (2015), we first design and implement a new generating operation, which functions as a bridge that allows shape convolution and pooling to be executed directly on 3D shapes. Additionally, accurate and detailed segmentation of 3D shapes is achieved through the skip architecture. To produce more accurate segmentation results, we optimise the SFCN predictions by exploiting the complementarity between the three geometric features together with the multi-label graph cut method Boykov et al. (2001); Kolmogorov and Zabih (2004); Boykov and Kolmogorov (2004), which also improves the local smoothness of triangle labels. The experiments show that the proposed method not only obtains good segmentation results on both large and small datasets using a small number of features, but also outperforms some existing state-of-the-art shape segmentation methods. More importantly, our method can effectively learn and predict on mixed shape datasets, whether of similar or of different characters, and achieve excellent segmentation results, which demonstrates that it has strong generalisation ability.
In the future, we want to strengthen our method to overcome the limitations mentioned above and to produce more results on challenging datasets such as ShapeNet. Additionally, we hope the proposed SFCN architecture can be applied to other shape-processing tasks, such as shape synthesis and line-drawing extraction.
Acknowledgements
We would like to thank all anonymous reviewers for their constructive comments. This research has been supported by the National Science Foundation of China (61321491, 61100110, 61272219) and the Science and Technology Program of Jiangsu Province (BY2012190, BY201307204).
References
 Yu et al. (2004) Yu, Y, Zhou, K, Xu, D, Shi, X, Bao, H, Guo, B, et al. Mesh editing with Poisson-based gradient field manipulation. ACM Transactions on Graphics 2004;23(3):644–651.
 Yang et al. (2013) Yang, Y, Xu, W, Guo, X, Zhou, K, Guo, B. Boundary-aware multi-domain subspace deformation. IEEE Transactions on Visualization & Computer Graphics 2013;19(19):1633–45.
 Chen et al. (2015) Chen, X, Zhou, B, Lu, F, Tan, P, Bi, L, Tan, P. Garment modeling with a depth camera. ACM Transactions on Graphics 2015;34(6):1–12.
 Krizhevsky et al. (2012) Krizhevsky, A, Sutskever, I, Hinton, GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 2012;25(2):2012.
 Szegedy et al. (2015) Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015;:1–9.
 Simonyan and Zisserman (2014) Simonyan, K, Zisserman, A. Very deep convolutional networks for large-scale image recognition. Computer Science 2014;.
 Ciresan et al. (2012) Ciresan, D, Giusti, A, Gambardella, LM, Schmidhuber, J. Deep neural networks segment neuronal membranes in electron microscopy images. In: Advances in Neural Information Processing Systems. 2012, p. 2843–2851.
 Farabet et al. (2013) Farabet, C, Couprie, C, Najman, L, Lecun, Y. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis & Machine Intelligence 2013;35(8):1915–29.
 Pinheiro and Collobert (2014) Pinheiro, P, Collobert, R. Recurrent convolutional neural networks for scene parsing. ICML 2014;.
 Long et al. (2015) Long, J, Shelhamer, E, Darrell, T. Fully convolutional networks for semantic segmentation. CVPR 2015;.
 Edwards and Xie (2016) Edwards, M, Xie, X. Graph based convolutional neural network. BMVC 2016;.
 Niepert et al. (2016) Niepert, M, Ahmed, M, Kutzkov, K. Learning convolutional neural networks for graphs. ICML 2016;.
 Defferrard et al. (2016) Defferrard, M, Bresson, X, Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. NIPS 2016;.
 Boykov et al. (2001) Boykov, Y, Veksler, O, Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis & Machine Intelligence 2001;23(11):1222–1239.
 Kolmogorov and Zabih (2004) Kolmogorov, V, Zabih, R. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis & Machine Intelligence 2004;26(2):147–59.
 Boykov and Kolmogorov (2004) Boykov, Y, Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis & Machine Intelligence 2004;26(9):1124–37.
 Kalogerakis et al. (2010) Kalogerakis, E, Hertzmann, A, Singh, K. Learning 3d mesh segmentation and labeling. ACM Transactions on Graphics 2010;29(4):157–166.
 Guo et al. (2015) Guo, K, Zou, D, Chen, X. 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics 2015;35(1):3.
 Xie et al. (2014) Xie, Z, Xu, K, Liu, L, Xiong, Y. 3d shape segmentation and labeling via extreme learning machine. Computer Graphics Forum 2014;33(5):85–95.
 Wang et al. (2012) Wang, Y, Asafi, S, van Kaick, O, Zhang, H, Cohen-Or, D, Chen, B. Active co-analysis of a set of shapes. ACM Transactions on Graphics 2012;31(6):157:1–157:10.
 Xie and Tu (2015) Xie, S, Tu, Z. Holistically-nested edge detection. IEEE International Conference on Computer Vision 2015;:1395–1403.
 Liu et al. (2015) Liu, F, Shen, C, Lin, G, Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis & Machine Intelligence 2015;38(10):1–1.
 Dosovitskiy et al. (2015) Dosovitskiy, A, Fischer, P, Ilg, E, Häusser, P, Hazırbaş, C, Golkov, V, et al. FlowNet: Learning optical flow with convolutional networks. IEEE International Conference on Computer Vision (ICCV) 2015;:2758–2766.
 Simo-Serra et al. (2016) Simo-Serra, E, Iizuka, S, Sasaki, K, Ishikawa, H. Learning to simplify: Fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics 2016;35(4):1–11.
 Bruna et al. (2014) Bruna, J, Zaremba, W, Szlam, A, LeCun, Y. Spectral networks and locally connected networks on graphs. ICLR 2014;.
 Duvenaud et al. (2015) Duvenaud, D, Maclaurin, D, Aguilera-Iparraguirre, J, Gómez-Bombarelli, R, Hirzel, T, Aspuru-Guzik, A, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems. 2015, p. 2224–2232.
 Wu et al. (2015) Wu, Z, Song, S, Khosla, A, Yu, F, Zhang, L, Tang, X, et al. 3d shapenets: A deep representation for volumetric shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015;.
 Qi et al. (2016) Qi, CR, Su, H, Nießner, M, Dai, A, Yan, M, Guibas, L. Volumetric and multi-view cnns for object classification on 3d data. Computer Vision and Pattern Recognition (CVPR) 2016;.
 Kalogerakis et al. (2017) Kalogerakis, E, Averkiou, M, Maji, S, Chaudhuri, S. 3d shape segmentation with projective convolutional networks. Proceedings of the IEEE Computer Vision and Pattern Recognition 2017;.
 Yi et al. (2017) Yi, L, Su, H, Guo, X, Guibas, L. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. Proceedings of the IEEE Computer Vision and Pattern Recognition 2017;.
 Su et al. (2017) Su, H, Qi, C, Mo, K, Guibas, L. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proceedings of the IEEE Computer Vision and Pattern Recognition 2017;.
 Masci et al. (2015) Masci, J, Boscaini, D, Bronstein, MM, Vandergheynst, P. Geodesic convolutional neural networks on riemannian manifolds. 3DRR 2015;.
 Boscaini et al. (2016) Boscaini, D, Masci, J, Rodolà, E, Bronstein, MM. Learning shape correspondence with anisotropic convolutional neural networks. NIPS 2016;.
 Monti et al. (2017) Monti, F, Boscaini, D, Masci, J, Rodolà, E, Svoboda, J, Bronstein, MM. Geometric deep learning on graphs and manifolds using mixture model cnns. CVPR 2017;.
 Maron et al. (2017) Maron, H, Galun, M, Aigerman, N, Trope, M, Dym, N, Yumer, E, et al. Convolutional neural networks on surfaces via seamless toric covers. SIGGRAPH 2017;.
 Boscaini et al. (2015) Boscaini, D, Masci, J, Bronstein, MM, Castellani, U. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Eurographics Symposium on Geometry Processing 2015;34(5):13–23.
 Litany et al. (2017) Litany, O, Remez, T, Rodolà, E, Bronstein, AM, Bronstein, MM. Deep functional maps: Structured prediction for dense shape correspondence. ICCV 2017;.
 Bronstein et al. (2017) Bronstein, MM, Bruna, J, LeCun, Y, Szlam, A, Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Sig Proc Magazine 2017;.
 Wang et al. (2013) Wang, Y, Gong, M, Wang, T, Cohen-Or, D, Zhang, H, Chen, B. Projective analysis for 3d shape segmentation. ACM Transactions on Graphics 2013;32(6):1–12.
 Huang et al. (2011) Huang, Q, Koltun, V, Guibas, L. Joint shape segmentation with linear programming. ACM Transactions on Graphics (TOG) 2011;30(6):125:1–125:12.
 Sidi et al. (2011) Sidi, O, van Kaick, O, Kleiman, Y, Zhang, H, Cohen-Or, D. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. In: SIGGRAPH Asia Conference. 2011, p. 1.
 Hu et al. (2012) Hu, R, Fan, L, Liu, L. Co-segmentation of 3d shapes via subspace clustering. Computer Graphics Forum 2012;31(5):1703–1713.
 Meng et al. (2013) Meng, M, Xia, J, Luo, J, He, Y. Unsupervised co-segmentation for 3d shapes using iterative multi-label optimization. Computer-Aided Design 2013;45(2):312–320.
 Xu et al. (2010) Xu, K, Li, H, Zhang, H, Cohen-Or, D, Xiong, Y, Cheng, ZQ. Style-content separation by anisotropic part scales. ACM Transactions on Graphics 2010;29(6):184.
 Kreavoy et al. (2007) Kreavoy, V, Julius, D, Sheffer, A. Model composition from interchangeable components. Conference on Computer Graphics & Applications 2007;:129–138.
 Dhillon et al. (2007) Dhillon, IS, Guan, Y, Kulis, B. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis & Machine Intelligence 2007;29(11):1944–1957.
 Hilaga et al. (2001) Hilaga, M, Shinagawa, Y, Kohmura, T, Kunii, TL. Topology matching for fully automatic similarity estimation of 3d shapes. In: Conference on Computer Graphics and Interactive Techniques. 2001, p. 203–212.
 Belongie et al. (2002) Belongie, S, Malik, J, Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis & Machine Intelligence 2002;24(4):509–522.
 Johnson and Hebert (1999) Johnson, AE, Hebert, M. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis & Machine Intelligence 1999;21(5):433–449.
 Jia et al. (2014) Jia, Y, Shelhamer, E, Donahue, J, Karayev, S, Long, J, Girshick, R, et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint 2014;arXiv:1408.5093.
 Chen et al. (2009) Chen, X, Golovinskiy, A, Funkhouser, T. A benchmark for 3d mesh segmentation. ACM Transactions on Graphics 2009;28(3):341–352.
 Kim et al. (2013) Kim, VG, Li, W, Mitra, NJ, Chaudhuri, S, DiVerdi, S, Funkhouser, T. Learning part-based templates from large collections of 3d shapes. ACM Transactions on Graphics 2013;32(4):1.
 Chang and Lin (2007) Chang, CC, Lin, CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems & Technology 2007;2(3, article 27):389–396.
 Torralba et al. (2007) Torralba, A, Murphy, KP, Freeman, WT. Sharing visual features for multi-class and multi-view object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence 2007;29(5):854–869.
 Kavukcuoglu et al. (2009) Kavukcuoglu, K, Ranzato, M, Fergus, R, LeCun, Y. Learning invariant features through topographic filter maps. In: IEEE Conference on Computer Vision & Pattern Recognition. 2009, p. 1605–1612.
 Shapira et al. (2010) Shapira, L, Shalom, S, Shamir, A, Cohen-Or, D, Zhang, H. Contextual part analogies in 3d objects. International Journal of Computer Vision 2010;89(2):309–326.
 Liu et al. (2009) Liu, R, Zhang, H, Shamir, A, Cohen-Or, D. A part-aware surface metric for shape analysis. Computer Graphics Forum 2009;28(2):397–406.