[1] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 2278-2324. https://doi.org/10.1109/5.726791
[2] Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017) ImageNet Classification with Deep Convolutional Neural Networks. Communications of the ACM, 60, 84-90. https://doi.org/10.1145/3065386
[3] He, K.M., Zhang, X.Y., Ren, S.Q. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 770-778. https://doi.org/10.1109/CVPR.2016.90
[4] Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv: 1409.1556.
[5] Szegedy, C., Ioffe, S., Vanhoucke, V. and Alemi, A. (2017) Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 31, 4278-4284. https://doi.org/10.1609/aaai.v31i1.11231
[6] Xie, S., Girshick, R., Dollár, P., Tu, Z.W. and He, K.M. (2017) Aggregated Residual Transformations for Deep Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 21-26 July 2017, 5987-5995. https://doi.org/10.1109/CVPR.2017.634
[7] Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K.Q. (2017) Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 21-26 July 2017, 2261-2269. https://doi.org/10.1109/CVPR.2017.243
[8] Hu, J., Shen, L. and Sun, G. (2018) Squeeze-and-Excitation Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 7132-7141. https://doi.org/10.1109/CVPR.2018.00745
[9] Tan, M. and Le, Q. (2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. http://proceedings.mlr.press/v97/tan19a.html
[10] Liu, Z., Mao, H., Wu, C.Y., et al. (2022) A ConvNet for the 2020s. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 11966-11976. https://doi.org/10.1109/CVPR52688.2022.01167
[11] Iandola, F.N., Han, S., Moskewicz, M.W., et al. (2016) SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and < 0.5 MB Model Size. arXiv: 1602.07360.
[12] Howard, A.G., Zhu, M., Chen, B., et al. (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv: 1704.04861.
[13] Sandler, M., Howard, A., Zhu, M.L., Zhmoginov, A. and Chen, L.C. (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 4510-4520. https://doi.org/10.1109/CVPR.2018.00474
[14] Howard, A., Sandler, M., Chu, G., et al. (2019) Searching for MobileNetV3. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, 27 October-2 November 2019, 1314-1324. https://doi.org/10.1109/ICCV.2019.00140
[15] Han, K., Wang, Y., Tian, Q., et al. (2020) GhostNet: More Features from Cheap Operations. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19 June 2020, 1577-1586. https://doi.org/10.1109/CVPR42600.2020.00165
[16] Tang, Y., Han, K., Guo, J., et al. (2022) GhostNetV2: Enhance Cheap Operation with Long-Range Attention. arXiv: 2211.12905.
[17] Zhang, X.Y., Zhou, X.Y., Lin, M.X. and Sun, J. (2018) ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 6848-6856. https://doi.org/10.1109/CVPR.2018.00716
[18] Ma, N., Zhang, X., Zheng, H.T. and Sun, J. (2018) ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision—ECCV 2018, European Conference on Computer Vision, Springer, Cham, 122-138. https://doi.org/10.1007/978-3-030-01264-9_8
[19] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, 4-9 December 2017.
[20] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020) An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv: 2010.11929.
[21] Liu, Z., Lin, Y., Cao, Y., et al. (2021) Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 9992-10002. https://doi.org/10.1109/ICCV48922.2021.00986
[22] Liu, Z., Hu, H., Lin, Y., et al. (2022) Swin Transformer V2: Scaling up Capacity and Resolution. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 11999-12009. https://doi.org/10.1109/CVPR52688.2022.01170
[23] Han, K., Wang, Y., Chen, H., et al. (2022) A Survey on Vision Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 87-110. https://doi.org/10.1109/TPAMI.2022.3152247
[24] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. (2018) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv: 1810.04805.
[25] Wang, W., Xie, E., Li, X., et al. (2021) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 548-558. https://doi.org/10.1109/ICCV48922.2021.00061
[26] Valanarasu, J.M.J. and Patel, V.M. (2022) UNeXt: MLP-Based Rapid Medical Image Segmentation Network. 25th International Conference on Medical Image Computing and Computer-Assisted Intervention—MICCAI 2022, Singapore, 18-22 September 2022, 23-33. https://doi.org/10.1007/978-3-031-16443-9_3
[27] Wang, Z., Cun, X., Bao, J., et al. (2022) Uformer: A General U-Shaped Transformer for Image Restoration. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 17662-17672. https://doi.org/10.1109/CVPR52688.2022.01716
[28] Ronneberger, O., Fischer, P. and Brox, T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. 18th International Conference on Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, 5-9 October 2015, 234-241. https://doi.org/10.1007/978-3-319-24574-4_28
[29] Dong, X., Bao, J., Chen, D., et al. (2022) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 12114-12124. https://doi.org/10.1109/CVPR52688.2022.01181
[30] Yuan, Y., Fu, R., Huang, L., et al. (2021) HRFormer: High-Resolution Transformer for Dense Prediction. arXiv: 2110.09408.
[31] Sun, K., Xiao, B., Liu, D. and Wang, J.D. (2019) Deep High-Resolution Representation Learning for Human Pose Estimation. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, 15-20 June 2019, 5686-5696. https://doi.org/10.1109/CVPR.2019.00584
[32] Hinton, G., Vinyals, O. and Dean, J. (2015) Distilling the Knowledge in a Neural Network. arXiv: 1503.02531.
[33] Shao, R.R., Liu, Y.A., Zhang, W., et al. (2022) A Survey of Knowledge Distillation Research in Deep Learning. Chinese Journal of Computers, 45, 1638-1673. (In Chinese)
[34] Huang, Z.H., Yang, S.Z., Lin, W., et al. (2022) A Survey of Knowledge Distillation Research. Chinese Journal of Computers, 45, 624-653. (In Chinese)
[35] Touvron, H., Cord, M., Douze, M., et al. (2021) Training Data-Efficient Image Transformers & Distillation through Attention. Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18-24 July 2021, 10347-10357.
[36] Carion, N., Massa, F., Synnaeve, G., et al. (2020) End-to-End Object Detection with Transformers. 16th European Conference on Computer Vision—ECCV 2020, Glasgow, 23-28 August 2020, 213-229. https://doi.org/10.1007/978-3-030-58452-8_13
[37] Dai, Z., Liu, H., Le, Q.V. and Tan, M.X. (2021) CoAtNet: Marrying Convolution and Attention for All Data Sizes. 35th Conference on Neural Information Processing Systems, Virtual, 6-14 December 2021, 3965-3977.
[38] Beal, J., Kim, E., Tzen, E., et al. (2020) Toward Transformer-Based Object Detection. arXiv: 2012.09958.
[39] Ren, S., He, K., Girshick, R. and Sun, J. (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv: 1506.01497.
[40] Yan, H., Li, Z., Li, W., et al. (2021) ConTNet: Why Not Use Convolution and Transformer at the Same Time? arXiv: 2104.13497.
[41] Mehta, S. and Rastegari, M. (2021) MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv: 2110.02178.
[42] Mehta, S. and Rastegari, M. (2022) Separable Self-Attention for Mobile Vision Transformers. arXiv: 2206.02680.
[43] Peng, Z., Huang, W., Gu, S., et al. (2021) Conformer: Local Features Coupling Global Representations for Visual Recognition. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 357-366. https://doi.org/10.1109/ICCV48922.2021.00042
[44] Chen, Y., Dai, X., Chen, D., et al. (2022) Mobile-Former: Bridging MobileNet and Transformer. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 5260-5269. https://doi.org/10.1109/CVPR52688.2022.00520
[45] Yoo, J., Kim, T., Lee, S., et al. (2023) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2-7 January 2023, 4945-4954. https://doi.org/10.1109/WACV56688.2023.00493
[46] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B. and Fu, Y. (2018) Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision—ECCV 2018, Springer, Cham, 294-310. https://doi.org/10.1007/978-3-030-01234-2_18
[47] Xiao, T., Singh, M., Mintun, E., et al. (2021) Early Convolutions Help Transformers See Better. arXiv: 2106.14881.
[48] Hassani, A., Walton, S., Shah, N., et al. (2021) Escaping the Big Data Paradigm with Compact Transformers. arXiv: 2104.05704.
[49] Li, Y.W., Zhang, K., Cao, J.Z., Timofte, R. and Van Gool, L. (2021) LocalViT: Bringing Locality to Vision Transformers. arXiv: 2104.05707.
[50] D’Ascoli, S., Touvron, H., Leavitt, M.L., et al. (2021) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. arXiv: 2103.10697.
[51] Srinivas, A., Lin, T.Y., Parmar, N., et al. (2021) Bottleneck Transformers for Visual Recognition. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, 20-25 June 2021, 16514-16524. https://doi.org/10.1109/CVPR46437.2021.01625
[52] Graham, B., El-Nouby, A., Touvron, H., et al. (2021) LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 12239-12249. https://doi.org/10.1109/ICCV48922.2021.01204
[53] Wu, H., Xiao, B., Codella, N., et al. (2021) CvT: Introducing Convolutions to Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 22-31. https://doi.org/10.1109/ICCV48922.2021.00009
[54] Yuan, K., Guo, S., Liu, Z., et al. (2021) Incorporating Convolution Designs into Visual Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, 10-17 October 2021, 559-568. https://doi.org/10.1109/ICCV48922.2021.00062
[55] Jeevan, P. (2022) Convolutional Xformers for Vision. arXiv: 2201.10271.
[56] Pan, J., Bulat, A., Tan, F., et al. (2022) EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers. 17th European Conference on Computer Vision—ECCV 2022, Tel Aviv, 23-27 October 2022, 294-311. https://doi.org/10.1007/978-3-031-20083-0_18
[57] Guo, J., Han, K., Wu, H., et al. (2022) CMT: Convolutional Neural Networks Meet Vision Transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 12165-12175. https://doi.org/10.1109/CVPR52688.2022.01186
[58] Zhang, H., Hu, W. and Wang, X. (2022) ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer. 17th European Conference on Computer Vision—ECCV 2022, Tel Aviv, 23-27 October 2022, 613-630. https://doi.org/10.1007/978-3-031-19809-0_35
[59] Yu, W., Luo, M., Zhou, P., et al. (2022) MetaFormer Is Actually What You Need for Vision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 10809-10819. https://doi.org/10.1109/CVPR52688.2022.01055
[60] Li, J., Xia, X., Li, W., et al. (2022) Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios. arXiv: 2207.05501.
[61] Maaz, M., Shaker, A., Cholakkal, H., et al. (2023) EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications. Computer Vision—ECCV 2022 Workshops, Tel Aviv, 23-27 October 2022, 3-20. https://doi.org/10.1007/978-3-031-25082-8_1
[62] Pan, X., Ge, C., Lu, R., et al. (2022) On the Integration of Self-Attention and Convolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 805-815. https://doi.org/10.1109/CVPR52688.2022.00089
[63] Lin, T.Y., Goyal, P., Girshick, R., He, K.M. and Dollár, P. (2017) Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 2999-3007. https://doi.org/10.1109/ICCV.2017.324