Mengshiun Yu is a Ph.D. candidate in the Computer Science Department at National Tsing Hua University, Taiwan, expected to graduate in June 2025. He is also a visiting scholar in the Machine Learning Department at Carnegie Mellon University (CMU), Pennsylvania, USA. As a member of the Programming Language Research Lab, he is advised by Prof. Jenq-kuen Lee. His research interests include software and hardware co-design as well as compiler optimization for machine learning and computer vision algorithms.
RFC: https://discuss.tvm.apache.org/t/rfc-109-add-a-new-backend-nnapi-for-byoc/17717
PRs: https://github.com/search?q=repo%3Amlc-ai%2Fmlc-llm+mengshyu&type=pullrequests
Model quantization has become an essential optimization strategy for improving performance in machine learning. It reduces data from 32-bit and 16-bit types to lower-bit formats, such as 8-bit, 6-bit, and 4-bit, for both integer and floating-point data. However, the current RISC-V instruction set lacks support for FP8 and lower-precision floating-point formats such as FP6, FP4, and other sub-byte formats. This research proposes extending the RISC-V ISA to support sub-FP8 operations. We design custom instructions that enable RISC-V CPUs to execute sub-FP8 computations directly, improving the performance and energy efficiency of machine learning workloads, and we integrate the extension with AI compiler frameworks such as TVM and MLC LLM so that quantized models can target the new instructions.
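To make the sub-byte formats concrete, the sketch below shows round-to-nearest quantization onto a 4-bit floating-point (E2M1-style) value grid, one example of the formats discussed above. The value set and helper names are assumptions for illustration only and do not describe the proposed RISC-V encoding or instructions.

```python
# Illustrative sketch: round-to-nearest quantization to a 4-bit floating-point
# grid (E2M1-style values). The value set and helper names are assumptions for
# illustration, not the proposed ISA extension.

# Representable magnitudes of a typical FP4 E2M1 format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Map a float to the nearest representable FP4 value (sign-magnitude)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(FP4_VALUES, key=lambda v: abs(v - abs(x)))
    return sign * mag

if __name__ == "__main__":
    weights = [0.12, -0.8, 1.7, 5.3, -7.2]
    print([quantize_fp4(w) for w in weights])
    # [0.0, -1.0, 1.5, 6.0, -6.0]: large values saturate at the +/-6.0 FP4 maximum.
```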
Modern smartphones integrate various specialized hardware units, such as CPUs, GPUs, DSPs, and accelerators, each tailored to specific computational tasks. Many mobile systems provide native neural network APIs to leverage these hardware units efficiently. However, combining compiler-aided optimizations with these NN API backends can further enhance performance. This paper introduces a generic approach for hardware-aware graph partitioning to accelerate mobile inference. We profile the operators used in the model on the supported hardware and then use the profiling data to build a cost model. Based on this cost model, we apply a graph partitioning strategy that maximizes the performance of the computation graph on mobile devices.
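A minimal sketch of the cost-model idea follows: per-operator latencies profiled on each backend are combined with a fixed backend-switch penalty, and a simple dynamic program picks the cheapest backend assignment. All latencies, operator names, and the switch cost are hypothetical placeholders for real profiling data, and real models are general graphs rather than the linear chain assumed here.

```python
# Sketch of cost-model-driven partitioning over a linear chain of operators.
# Numbers and names are hypothetical; they stand in for measured profiling data.

# Profiled latency (ms) per operator on each backend (hypothetical values).
COST = {
    "conv2d":  {"cpu": 4.0, "nnapi": 1.2},
    "relu":    {"cpu": 0.3, "nnapi": 0.2},
    "softmax": {"cpu": 0.5, "nnapi": 1.8},  # slow on the accelerator
}
SWITCH_COST = 0.6  # cost of handing tensors between backends (hypothetical)

def partition(ops):
    """Return (total cost, backend per op) minimizing latency plus switch costs."""
    backends = ["cpu", "nnapi"]
    # dp[b] = (total cost, assignment) for the best plan ending on backend b.
    dp = {b: (COST[ops[0]][b], [b]) for b in backends}
    for op in ops[1:]:
        new_dp = {}
        for b in backends:
            new_dp[b] = min(
                (cost + COST[op][b] + (SWITCH_COST if prev != b else 0), plan + [b])
                for prev, (cost, plan) in dp.items()
            )
        dp = new_dp
    return min(dp.values())

if __name__ == "__main__":
    total, plan = partition(["conv2d", "relu", "softmax"])
    print(total, plan)  # keeps softmax on the CPU despite the switch penalty
```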
Large Language Models (LLMs) have been applied successfully in many fields, such as natural language processing, programming language development, and the organization and analysis of unstructured data. However, their application to compiler optimization is still limited. This research leverages LLMs to enhance and automate compiler optimization for embedded devices, targeting key performance metrics such as code size, execution time, hardware utilization, and memory usage. Our goal is to use LLMs to identify optimal combinations of LLVM compiler optimizations for a given embedded target. LLVM optimizations fall into target-independent IR optimizations and target-dependent instruction-level optimizations, and the LLVM opt tool exposes passes that can be run repeatedly or combined with one another. The challenge is to find the best optimization strategy for each performance metric among the vast number of possible pass combinations. By leveraging open LLMs, we aim to identify the most effective strategies tailored to different hardware targets and metrics, ultimately improving the performance of embedded devices.
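As a sketch of the evaluation loop, the snippet below applies a candidate pass pipeline (such as one proposed by an LLM) with opt, lowers it with llc, and scores the result by object-file size as a crude code-size proxy. The file paths, target triple, and candidate pipelines are placeholders, and the LLM prompt/response step is omitted.

```python
# Minimal sketch: score a candidate LLVM pass pipeline by the size of the object
# file it produces. Paths, triple, and candidates are placeholders; object-file
# size is only a rough proxy for code size.
import os
import subprocess

def code_size(ir_file: str, passes: str, triple: str = "riscv32-unknown-elf") -> int:
    """Apply a pass pipeline to LLVM IR and return the object-file size in bytes."""
    subprocess.run(["opt", ir_file, f"-passes={passes}", "-o", "opt.bc"], check=True)
    subprocess.run(["llc", "opt.bc", f"-mtriple={triple}", "-filetype=obj",
                    "-o", "opt.o"], check=True)
    return os.path.getsize("opt.o")

if __name__ == "__main__":
    # Candidate pipelines; in the full system these would come from the LLM.
    candidates = ["default<O2>", "default<Oz>",
                  "function(instcombine,simplifycfg,dce)"]
    sizes = {p: code_size("input.ll", p) for p in candidates}
    print(min(sizes, key=sizes.get), sizes)
```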