[NeurIPS-2021] Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation
Hi, I am a little confused about Eq.(4). Could you explain why the attention layer is designed as in Eq.(4), and the motivation for the skip connection, i.e., '+h_{ij}', in Eq.(4)?

An additional question concerns the code implementation: at [line 137](https://github.com/canqin001/Efficient_Graph_Similarity_Computation/blob/2f7bf969bd0d1aeb9864d8f426b3b40cbb0598c0/EGSC-T/src/layers.py#L137), \varphi is set to `tanh`, but in the paper it is a sigmoid gating. Besides, in the paper the graph-level embedding `h` is computed from the original node features X, but in the code it is computed from the attention-transformed features, as at [line 151](https://github.com/canqin001/Efficient_Graph_Similarity_Computation/blob/2f7bf969bd0d1aeb9864d8f426b3b40cbb0598c0/EGSC-T/src/layers.py#L151).
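To make sure I am reading the two variants correctly, here is a minimal sketch of how I understand the attention readout. This is not the repository's code; the class name, argument names, and exact wiring are my own assumptions, made only to contrast (a) the paper's description (sigmoid gating, graph-level context from the original features X) with (b) the code (tanh gating, context from the attention-transformed features). The skip connection '+h_{ij}' from Eq.(4) is not reproduced here.

```python
import torch
import torch.nn as nn


class AttentionReadoutSketch(nn.Module):
    """Hypothetical attention-based graph readout, for discussion only."""

    def __init__(self, dim: int, use_tanh: bool = True,
                 context_from_transformed: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)
        # \varphi: tanh as in the code, sigmoid as in the paper
        self.gate = torch.tanh if use_tanh else torch.sigmoid
        # whether the graph-level context comes from transformed features (code)
        # or from the original node features X (paper)
        self.context_from_transformed = context_from_transformed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim) node embeddings of one graph
        transformed = x @ self.weight                         # linear transform
        source = transformed if self.context_from_transformed else x
        context = self.gate(source.mean(dim=0))               # graph-level context vector
        scores = torch.sigmoid(transformed @ context)         # per-node attention scores
        return (scores.unsqueeze(-1) * x).sum(dim=0)          # weighted sum -> graph embedding


# Usage example (toy sizes):
x = torch.randn(5, 16)
code_variant = AttentionReadoutSketch(16, use_tanh=True, context_from_transformed=True)
paper_variant = AttentionReadoutSketch(16, use_tanh=False, context_from_transformed=False)
print(code_variant(x).shape, paper_variant(x).shape)  # torch.Size([16]) torch.Size([16])
```

Is my reading of the two variants above correct, and if so, which one should be considered the intended design?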