Current Search: Lin, Mingjie
- Title
- Exploring FPGA Implementation for Binarized Neural Network Inference.
- Creator
-
Yang, Li, Fan, Deliang, Zhang, Wei, Lin, Mingjie, University of Central Florida
- Abstract / Description
-
Deep convolutional neural networks have taken on an important role in machine learning. They are widely used in areas such as computer vision, robotics, and biology. However, deep neural network models are becoming larger and more computationally complex, which is a major obstacle to deploying them on embedded systems. Recent work has shown that binarized neural networks (BNNs), which use binarized (i.e., +1 and -1) convolution kernels and binarized activation functions, can significantly reduce parameter size and computation cost, making them hardware-friendly and energy-efficient for Field-Programmable Gate Array (FPGA) implementation. This thesis proposes a new parallel-convolution binarized neural network (PC-BNN) implemented on an FPGA with accurate inference. The embedded PC-BNN is designed for image classification on the CIFAR-10 dataset and explores the hardware architecture and optimization of a customized CNN topology. The parallel-convolution binarized neural network replaces the original single binarized convolution layer with two parallel binarized convolution layers. It achieves around 86% accuracy on CIFAR-10 with a 2.3 Mb parameter size. We implement PC-BNN inference on the Xilinx PYNQ-Z1 FPGA board, which has only 4.9 Mb of on-chip block RAM. Because the network parameters are so small, the entire model can be stored in on-chip memory, greatly reducing energy consumption and computation latency. We also design a new pipelined streaming architecture for PC-BNN hardware inference that further increases performance. Experimental results show that our PC-BNN inference on the FPGA achieves 930 frames per second and 387.5 FPS/Watt, which are among the best throughput and energy-efficiency figures compared to recent works. (A minimal sketch of this binarized arithmetic appears after this record.)
- Date Issued
- 2018
- Identifier
- CFE0007384, ucf:52067
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007384
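The abstract above describes BNN arithmetic with +1/-1 kernels and activations. The sketch below is a generic, minimal illustration of why that is FPGA-friendly: a binarized dot product collapses to an XNOR per weight followed by a population count. It is not the PC-BNN design itself, and the function names are invented for illustration.

```python
import numpy as np

def binarize(x):
    """Map real values to {+1, -1}, as a BNN does for kernels and activations."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(a_bits, w_bits):
    """Dot product of two {+1, -1} vectors encoded as {1, 0} bits.

    In hardware this is one XNOR per weight plus a population count:
    dot = 2 * popcount(XNOR(a, w)) - n.
    """
    n = a_bits.size
    agree = (~(a_bits ^ w_bits)) & 1      # 1 wherever the two signs match
    return 2 * int(agree.sum()) - n

# Toy check against the ordinary +/-1 dot product.
rng = np.random.default_rng(0)
a, w = binarize(rng.standard_normal(64)), binarize(rng.standard_normal(64))
a_bits, w_bits = (a > 0).astype(np.uint8), (w > 0).astype(np.uint8)
assert xnor_popcount_dot(a_bits, w_bits) == int(a.astype(int) @ w.astype(int))
```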
- Title
- Reducing the Overhead of Memory Space, Network Communication and Disk I/O for Analytic Frameworks in Big Data Ecosystem.
- Creator
-
Zhang, Xuhong, Wang, Jun, Fan, Deliang, Lin, Mingjie, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
To facilitate big data processing, many distributed analytic frameworks and storage systems such as Apache Hadoop, Apache Hama, Apache Spark, and the Hadoop Distributed File System (HDFS) have been developed. Many researchers are currently working to make these systems more scalable or to enable them to support more analysis applications. In my PhD study, I conducted three main works on this topic: minimizing communication delay in Apache Hama, minimizing memory space and computational overhead in HDFS, and minimizing disk I/O overhead for approximation applications in the Hadoop ecosystem. In Apache Hama, communication delay makes up a large percentage of the overall graph-processing time. While most recent research has focused on reducing the number of network messages, we add a runtime communication and computation scheduler to overlap the two as much as possible, so that communication delay can be mitigated. In HDFS, the block-location table and its maintenance can occupy more than half of the memory space and 30% of the processing capacity of the master node, which severely limits the master node's scalability and performance. We propose Deister, which uses deterministic mathematical calculations to eliminate the huge block-location table and its maintenance. My third work enables both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Existing offline-sampling-based approximation systems are not adaptive to dynamic query workloads, and online-sampling-based systems suffer from low I/O efficiency and poor estimation accuracy. We therefore develop a distribution-aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset in each logical partition of a dataset (its storage distribution) in the distributed system at very small cost, and to use this information to facilitate online sampling. (A small sketch of such distribution-aware sampling appears after this record.)
- Date Issued
- 2017
- Identifier
- CFE0007299, ucf:52149
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007299
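As a rough illustration of the distribution-aware idea described above (count sub-dataset occurrences per logical partition, then sample partitions in proportion to those counts), here is a minimal Python sketch. It is not the Sapprox implementation; the helper names and toy data are invented for illustration.

```python
import random

def build_occurrence_index(partitions, predicate):
    """Cheap offline pass: count how many records of the sub-dataset of
    interest fall into each logical partition."""
    return {pid: sum(predicate(r) for r in recs) for pid, recs in partitions.items()}

def sample_subdataset(partitions, predicate, index, n_samples, rng=random):
    """Online sampling: draw partitions with probability proportional to their
    occurrence counts, so I/O is spent where the sub-dataset actually lives."""
    pids = [p for p, c in index.items() if c > 0]
    weights = [index[p] for p in pids]
    samples = []
    while len(samples) < n_samples:
        pid = rng.choices(pids, weights=weights, k=1)[0]
        rec = rng.choice(partitions[pid])
        if predicate(rec):
            samples.append(rec)
    return samples

# Toy usage: estimate the mean value of records whose key == "a".
partitions = {0: [("a", 1), ("b", 9)], 1: [("a", 3), ("a", 5)], 2: [("b", 7)]}
pred = lambda r: r[0] == "a"
idx = build_occurrence_index(partitions, pred)
est = sum(v for _, v in sample_subdataset(partitions, pred, idx, 100)) / 100
```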
- Title
- Simulation Study of a GPRAM System: Error Control Coding and Connectionism.
- Creator
-
Schultz, Steven, Wei, Lei, Lin, Mingjie, Yuan, Jiann-Shiun, University of Central Florida
- Abstract / Description
-
A new computing platform, the General Purpose Representation and Association Machine (GPRAM), is studied and simulated. GPRAM machines use vague measurements to make a quick, rough assessment of a task, then use approximate message-passing algorithms to improve the assessment, and finally select paths closer to a solution, eventually solving it. We illustrate the concepts and structures using simple examples.
- Date Issued
- 2012
- Identifier
- CFE0004437, ucf:49361
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004437
- Title
- Performance Evaluation of TCP Multihoming for IPV6 Anycast Networks and Proxy Placement.
- Creator
-
Alsharfa, Raya, Bassiouni, Mostafa, Guha, Ratan, Lin, Mingjie, University of Central Florida
- Abstract / Description
-
In this thesis, the impact of multihomed clients and multihomed proxy servers on the performance of modern networks is investigated. The network model used in our investigation integrates three main components: the new one-to-any Anycast communication paradigm that facilitates server replication, the next-generation Internet Protocol version 6 (IPv6) that offers a larger address space for packet-switched networks, and the emerging multihoming trend of connecting devices and smartphones to more than one Internet service provider, thereby acquiring more than one IP address. The design of a previously proposed Proxy IP Anycast service is modified to integrate user-device multihoming and IPv6 routing. The impact of user-device multihoming (single-homed, dual-homed, and triple-homed) on network performance is extensively analyzed using realistic network topologies and different traffic scenarios of client-server TCP flows. Network throughput, packet latency, and packet loss rate are the three performance metrics used in our analysis. Performance comparisons between the Anycast Proxy service and the native IP Anycast protocol are presented. The number of Anycast proxy servers and their placement are studied; five placement methods have been implemented and evaluated, including random placement, highest-traffic placement, highest number of active interfaces placement, K-DS placement, and a new hybrid placement method. The work presented in this thesis provides new insight into the performance of some emerging communication paradigms and how to improve their design. Although the work is limited to Anycast proxy servers, the results can be beneficial and applicable to other types of overlay proxy services such as multicast proxies. (A small placement-heuristic sketch appears after this record.)
- Date Issued
- 2015
- Identifier
- CFE0005919, ucf:50825
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005919
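The abstract above names several proxy placement heuristics. The sketch below shows only the simplest of them, a greedy highest-traffic placement, to make the idea concrete; the thesis's K-DS and hybrid methods are not reproduced, and the function and data names are hypothetical.

```python
def highest_traffic_placement(nodes, traffic, k):
    """Pick k proxy locations at the nodes that source/sink the most traffic.

    `traffic` maps node -> observed load (e.g., bytes or flow count). This is a
    plain greedy heuristic in the spirit of the "highest traffic" policy named
    in the abstract; the thesis's other placement methods are more involved.
    """
    return sorted(nodes, key=lambda n: traffic.get(n, 0), reverse=True)[:k]

# Toy usage: place 2 proxies among 5 routers.
nodes = ["r1", "r2", "r3", "r4", "r5"]
traffic = {"r1": 120, "r2": 430, "r3": 75, "r4": 310, "r5": 90}
print(highest_traffic_placement(nodes, traffic, k=2))   # ['r2', 'r4']
```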
- Title
- High-Performance Composable Transactional Data Structures.
- Creator
-
Zhang, Deli, Dechev, Damian, Leavens, Gary, Zou, Changchun, Lin, Mingjie, University of Central Florida
- Abstract / Description
-
Exploiting the parallelism in multiprocessor systems is a major challenge in the post-"power wall" era. Programming for multicore demands a change in the way we design and use fundamental data structures. Concurrent data structures allow scalable and thread-safe access to shared data, and they provide operations that appear to take effect atomically when invoked individually. A main obstacle to the practical use of concurrent data structures is their inability to support composable operations, i.e., to execute multiple operations atomically in a transactional manner. The problem stems from the inability of a concurrent data structure to ensure atomicity of transactions composed from operations on one or more data structure instances. This greatly hinders software reuse, because users can only invoke data structure operations in a limited number of ways. Existing solutions, such as software transactional memory (STM) and transactional boosting, manage transaction synchronization in an external layer separated from the data structure's own thread-level concurrency control. Although this reduces programming effort, it leads to significant overhead from additional synchronization and the need to roll back aborted transactions. In this dissertation, I address these practicality and efficiency concerns by designing, implementing, and evaluating high-performance transactional data structures that facilitate the development of future highly concurrent software systems. First, I present two methodologies for implementing high-performance transactional data structures based on existing concurrent data structures using either lock-based or lock-free synchronization. For lock-based data structures, the idea is to treat data accessed by multiple operations as resources. The challenge is for each thread to acquire exclusive access to the desired resources while preventing deadlock and starvation. Existing locking strategies, such as two-phase locking and resource hierarchy, suffer from performance degradation under heavy contention while lacking a desirable fairness guarantee. To overcome these issues, I introduce a scalable lock algorithm for shared-memory multiprocessors that addresses the resource allocation problem; it is the first multi-resource lock algorithm that guarantees the strongest first-in, first-out (FIFO) fairness. For lock-free data structures, I present a methodology for transforming them into high-performance lock-free transactional data structures without revamping the data structures' original synchronization design. My approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. Second, I apply the proposed methodologies and present a suite of novel transactional search data structures in the form of an open-source library. This is interesting not only because of the fundamental importance of search data structures in computer science and their wide use in real-world programs, but also because it demonstrates the implementation issues that arise when using the methodologies I have developed. This library is a compilation of a large number of fundamental data structures for multiprocessor applications, a framework for enabling composable transactions, and an infrastructure for continuous integration of new data structures. By taking such a top-down approach, I am able to identify and consider the interplay of data structure interface operations as a whole, which allows for scrutinizing their commutativity rules and hence opens up possibilities for design optimizations. Lastly, I evaluate the throughput of the proposed data structures using transactions with randomly generated operations on two different hardware systems. To ensure the strongest possible competition, I chose the best-performing alternatives from state-of-the-art locking protocols and transactional memory systems in the literature. The results show that it is straightforward to build efficient transactional data structures when using my multi-resource lock as a drop-in replacement for transactionally boosted data structures. Furthermore, this work shows that it is possible to build efficient lock-free transactional data structures with all the perceived benefits of lock-freedom and with performance far better than generic transactional memory systems. (A minimal multi-resource locking sketch appears after this record.)
- Date Issued
- 2016
- Identifier
- CFE0006428, ucf:51453
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006428
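To make the resource allocation problem in the abstract concrete, the sketch below shows the conventional resource-hierarchy (ordered acquisition) approach to locking multiple resources at once. It is explicitly not the dissertation's FIFO multi-resource lock; it is the kind of baseline that lock improves upon, and the class name is invented.

```python
import threading

class OrderedMultiResourceLock:
    """Acquire a set of resources atomically by always locking them in a fixed
    global order (the classic "resource hierarchy" strategy).

    Deadlock-free but with no FIFO fairness guarantee, which is exactly the
    shortcoming the dissertation's multi-resource lock addresses.
    """

    def __init__(self, resource_ids):
        self._locks = {rid: threading.Lock() for rid in resource_ids}

    def acquire(self, rids):
        for rid in sorted(rids):          # the global order prevents deadlock
            self._locks[rid].acquire()

    def release(self, rids):
        for rid in sorted(rids, reverse=True):
            self._locks[rid].release()

# Usage: move an element between two concurrent sets as one atomic step.
mrl = OrderedMultiResourceLock(["setA", "setB"])
mrl.acquire(["setA", "setB"])
try:
    pass  # remove from setA, insert into setB
finally:
    mrl.release(["setA", "setB"])
```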
- Title
- Investigation on electrical properties of RF sputtered deposited BCN thin films.
- Creator
-
Prakash, Adithya, Sundaram, Kalpathy, Yuan, Jiann-Shiun, Lin, Mingjie, University of Central Florida
- Abstract / Description
-
The ever-increasing advancements in semiconductor technology and the continuous scaling of CMOS devices mandate new dielectric materials with low k values. Interconnect delay can be reduced not only by lowering the resistance of the conductor but also by decreasing the capacitance of the dielectric layer. Cross-talk is also a major issue faced by the semiconductor industry due to the high k value of the inter-dielectric layer (IDL) in a multilevel wiring scheme in Si ultra-large-scale integrated (ULSI) devices. In order to reduce the time delay, it is necessary to introduce a wiring metal with low resistivity and a high-quality insulating film with a low dielectric constant, which leads to a reduction of the wiring capacitance. Boron carbon nitride (BCN) films are prepared by reactive magnetron sputtering from a B4C target and deposited to make metal-insulator-metal (MIM) sandwich structures using aluminum as the top and bottom electrodes. BCN films are deposited at various N2/Ar gas flow ratios, substrate temperatures, and process pressures. The electrical characterization of the MIM devices includes capacitance vs. voltage (C-V), current vs. voltage, and breakdown voltage characteristics. These characterizations are performed as a function of the deposition parameters.
- Date Issued
- 2013
- Identifier
- CFE0004912, ucf:49625
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004912
- Title
- Assessing Approximate Arithmetic Designs in the presence of Process Variations and Voltage Scaling.
- Creator
-
Naseer, Adnan Aquib, DeMara, Ronald, Lin, Mingjie, Karwowski, Waldemar, University of Central Florida
- Abstract / Description
-
As environmental concerns and the portability of electronic devices move to the forefront of priorities, innovative approaches that reduce processor energy consumption are sought. Approximate arithmetic units are one avenue by which significant energy savings can be achieved. Approximation of fundamental arithmetic units is achieved by judiciously reducing the number of transistors in the circuit. A satisfactory tradeoff of energy vs. accuracy can be determined by trial-and-error evaluation of each functional approximation. Although the accuracy of the output is compromised, it is only decreased to an acceptable extent that can still fulfill processing requirements. A number of scenarios are evaluated with approximate arithmetic units to thoroughly cross-check them against their accurate counterparts. Some of the attributes evaluated are energy consumption, delay, and process variation. Additionally, novel methods to create such approximate units are developed. One such method uses a Genetic Algorithm (GA), which mimics biologically inspired evolutionary techniques to obtain an optimal solution. The GA employs genetic operators such as crossover and mutation to mix and match several different types of approximate adders and find the best possible combination of such units for a given input set. Because the GA usually consumes a significant amount of time as the size of the input set increases, we tackled this problem by parallelizing the fitness computation, which is the most compute-intensive task. The parallelization improved the computation time from 2,250 seconds to 1,370 seconds for up to 8 threads, using both OpenMP and Intel TBB. Apart from seeding the GA with multiple approximate units, other seeds such as basic logic gates with a limited logic space were used to develop completely new multi-bit approximate adders with good fitness levels. The effect of process variation was also calculated. As the number of transistors is reduced, the distribution of transistor widths and gate-oxide thickness may shift away from a Gaussian curve. This result was demonstrated in different types of single-bit adders, with the delay sigma increasing from 6 ps to 12 ps, and when the voltage is scaled to Near-Threshold-Voltage (NTV) levels, sigma increases by up to 5 ps. Approximate arithmetic units were not greatly affected by the change in the distribution of gate-oxide thickness. Even when considering the 3-sigma value, the delay of an approximate adder remains below that of a precise adder with additional transistors. Additionally, it is demonstrated that the GA obtains innovative solutions for the appropriate combination of approximate arithmetic units, achieving a good balance between energy savings and accuracy. (A small approximate-adder illustration appears after this record.)
- Date Issued
- 2015
- Identifier
- CFE0005675, ucf:50165
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005675
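As a purely illustrative companion to the abstract above, the sketch below shows a textbook lower-bit-truncation approximate adder and a simple accuracy metric of the kind a GA fitness function might use. The thesis's evolved adder designs and its OpenMP/TBB parallelization are not reproduced here, and all names are invented.

```python
def truncated_adder(a, b, width=8, approx_bits=3):
    """Approximate adder that skips carry propagation in the low `approx_bits`
    bits (a common textbook approximation, not one of the thesis's designs)."""
    mask_lo = (1 << approx_bits) - 1
    lo = (a | b) & mask_lo                 # cheap OR in place of a real add
    hi = ((a >> approx_bits) + (b >> approx_bits)) << approx_bits
    return (hi | lo) & ((1 << (width + 1)) - 1)

def mean_relative_error(width=8, approx_bits=3):
    """Simple fitness-style accuracy metric over all input pairs."""
    total, count = 0.0, 0
    for a in range(1 << width):
        for b in range(1 << width):
            exact = a + b
            if exact:
                total += abs(truncated_adder(a, b, width, approx_bits) - exact) / exact
                count += 1
    return total / count

print(mean_relative_error())   # error grows with approx_bits; energy would shrink
```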
- Title
- Improving the performance of data-intensive computing on Cloud platforms.
- Creator
-
Dai, Wei, Bassiouni, Mostafa, Zou, Changchun, Wang, Jun, Lin, Mingjie, Bai, Yuanli, University of Central Florida
- Abstract / Description
-
Big Data, such as terabyte and petabyte datasets, is rapidly becoming the new norm for organizations across a wide range of industries. Widespread data-intensive computing needs have inspired innovations in parallel and distributed computing, which has been the effective way to tackle massive computing workloads for decades. One significant example is MapReduce, which is both a programming model for expressing distributed computations on huge datasets and an execution framework for data-intensive computing on commodity clusters. Since it was originally proposed by Google, MapReduce has become the most popular technology for data-intensive computing. While Google owns its proprietary implementation of MapReduce, an open-source implementation called Hadoop has gained wide adoption in the rest of the world. The combination of Hadoop and Cloud platforms has made data-intensive computing much more accessible and affordable than ever before. This dissertation addresses the performance of data-intensive computing on Cloud platforms from three different aspects: task assignment, replica placement, and straggler identification. Both task assignment and replica placement are closely related to load balancing, which is one of the key issues that can significantly affect the performance of parallel and distributed applications. While task-assignment schemes strive to balance data-processing load among cluster nodes to achieve minimum job completion time, replica-placement policies aim to assign block replicas to cluster nodes according to their processing capabilities to exploit data locality to the maximum extent. Straggler identification is also a crucial issue for data-intensive computing, as the overall performance of parallel and distributed applications is often determined by the node with the lowest performance. The results of extensive evaluation tests confirm that the schemes and policies proposed in this dissertation can improve the performance of data-intensive applications running on Cloud platforms. (A small straggler-detection sketch appears after this record.)
- Date Issued
- 2017
- Identifier
- CFE0006731, ucf:51896
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006731
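The abstract above treats straggler identification as a key issue. The sketch below shows one common, Hadoop-style heuristic for flagging stragglers by comparing per-task progress rates against the average; it is an illustration under invented names and thresholds, not the dissertation's scheme.

```python
def find_stragglers(progress, elapsed, slow_factor=0.5):
    """Flag tasks whose progress rate falls well below the average rate.

    `progress` maps task_id -> fraction complete (0..1); `elapsed` maps
    task_id -> seconds running.
    """
    rates = {t: progress[t] / elapsed[t] for t in progress if elapsed[t] > 0}
    avg = sum(rates.values()) / len(rates)
    return [t for t, r in rates.items() if r < slow_factor * avg]

# Toy usage: task "m3" is running at roughly a quarter of the average rate.
progress = {"m1": 0.9, "m2": 0.8, "m3": 0.2}
elapsed = {"m1": 60, "m2": 60, "m3": 60}
print(find_stragglers(progress, elapsed))   # ['m3']
```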
- Title
- Soft-Error Resilience Framework For Reliable and Energy-Efficient CMOS Logic and Spintronic Memory Architectures.
- Creator
-
Alghareb, Faris, DeMara, Ronald, Lin, Mingjie, Zou, Changchun, Jha, Sumit Kumar, Song, Zixia, University of Central Florida
- Abstract / Description
-
The revolution in chip manufacturing processes spanning five decades has proliferated high-performance and energy-efficient nano-electronic devices across all aspects of daily life. In recent years, CMOS technology scaling has realized billions of transistors within large-scale VLSI chips to elevate performance. However, these advancements have also continually increased the impact of Single-Event Transient (SET) and Single-Event Upset (SEU) occurrences, which precipitate a range of Soft-Error (SE) dependability issues. Consequently, soft-error mitigation techniques have become essential to improving system reliability. Herein, we first propose optimized soft-error-resilient designs to improve the robustness of sub-micron computing systems. The proposed approaches deliver energy efficiency and tolerate double/multiple errors simultaneously while incurring acceptable speed degradation compared to prior work. Second, the impact of Process Variation (PV) in the Near-Threshold Voltage (NTV) region on redundancy-based SE-mitigation approaches for High-Performance Computing (HPC) systems is investigated to highlight the approach that realizes favorable attributes, such as reduced critical-datapath delay variation and low speed degradation. Finally, spin-based devices have recently been widely used to design Non-Volatile (NV) elements such as NV latches and flip-flops, which can be leveraged in normally-off computing architectures for Internet-of-Things (IoT) and energy-harvesting-powered applications. Thus, in the last portion of this dissertation, we design and evaluate soft-error-resilient NV latching circuits that achieve intriguing features, such as low energy consumption, high computing performance, and superior soft-error tolerance, i.e., the ability to tolerate Multiple Node Upsets (MNU), to potentially become a mainstream solution for aerospace and avionic nanoelectronics. Together, these objectives cooperate to increase the energy efficiency and soft-error resilience of larger-scale emerging NV latching circuits within iso-energy constraints. In summary, addressing these reliability concerns is paramount to the successful deployment of future reliable and energy-efficient CMOS logic and spintronic memory architectures with deeply scaled devices operating at low voltages.
- Date Issued
- 2019
- Identifier
- CFE0007884, ucf:52765
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007884
- Title
- Stochastic-Based Computing with Emerging Spin-Based Device Technologies.
- Creator
-
Bai, Yu, Lin, Mingjie, DeMara, Ronald, Wang, Jun, Jin, Yier, Dong, Yajie, University of Central Florida
- Abstract / Description
-
In this dissertation, analog and emerging device physics is explored to provide a technology platform for designing new bio-inspired systems and novel architectures. As CMOS approaches the nano-scale, it reaches its physical limits in feature size, and its device characteristics will pose severe challenges to constructing robust digital circuitry. Unlike transistor defects due to fabrication imperfection, quantum-related switching uncertainties will seriously increase susceptibility to noise, thus rendering traditional thinking and logic design techniques inadequate. Therefore, the trend of current research is to create a non-Boolean high-level computational model and map it directly to the unique operational properties of new, power-efficient, nanoscale devices. The focus of this research is two-fold: 1) Investigation of the physical hysteresis switching behaviors of the domain wall device. We analyze the phenomena of the domain wall device and identify hysteresis behavior over a current range. We propose a Domain-Wall-Motion-based (DWM) NCL circuit that achieves approximately 30x and 8x improvements in energy efficiency and chip layout area, respectively, over its equivalent CMOS design, while maintaining similar delay performance for a one-bit full adder. 2) Investigation of the physical stochastic switching behaviors of the Magnetic Tunnel Junction (MTJ) device. Analyzing the stochastic switching behaviors of the MTJ, we propose an innovative stochastic-based architecture for implementing an artificial neural network (S-ANN) with both magnetic tunnel junction (MTJ) and domain wall motion (DWM) devices, which enables efficient computing at an ultra-low voltage. For a well-known pattern recognition task, our mixed-model HSPICE simulation results show that a 34-neuron S-ANN implementation, when compared with its deterministic ANN counterparts implemented with digital and analog CMOS circuits, achieves more than 1.5 to 2 orders of magnitude lower energy consumption and 2 to 2.5 orders of magnitude less hidden-layer chip area. (A small stochastic-neuron sketch appears after this record.)
- Date Issued
- 2016
- Identifier
- CFE0006680, ucf:51921
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006680
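The S-ANN idea above relies on a device's stochastic switching acting as the neuron activation. The behavioral sketch below mimics that at a very high level, with the switching probability modeled as a sigmoid of the weighted input; all device physics is abstracted away, and the function names are invented for illustration.

```python
import math
import random

def stochastic_neuron(weighted_sum, beta=1.0, rng=random):
    """Fire (+1) with probability sigmoid(beta * input), else -1.

    A stand-in for a device whose switching probability depends on the drive
    strength; MTJ switching current, pulse width, etc. are not modeled.
    """
    p = 1.0 / (1.0 + math.exp(-beta * weighted_sum))
    return 1 if rng.random() < p else -1

def stochastic_layer(inputs, weights, samples=64):
    """Average many stochastic firings to recover an analog-looking output."""
    outs = []
    for w_row in weights:
        s = sum(x * w for x, w in zip(inputs, w_row))
        outs.append(sum(stochastic_neuron(s) for _ in range(samples)) / samples)
    return outs

print(stochastic_layer([0.5, -0.2, 0.8], [[1.0, 0.5, -0.3], [-0.7, 0.2, 0.9]]))
```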
- Title
- Bridging the Gap between Application and Solid-State-Drives.
- Creator
-
Zhou, Jian, Wang, Jun, Lin, Mingjie, Fan, Deliang, Ewetz, Rickard, Qi, GuoJun, University of Central Florida
- Abstract / Description
-
Data storage is one of the important and often critical parts of a computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase change memory (PCM), magnetic RAM (STT-RAM), and the Memristor, have emerged recently, and many of them have already entered production systems. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not show simple locality. To provide optimal storage we need accurate predictions of I/O behavior; however, workloads are increasingly dynamic and diverse, making long- and short-term I/O prediction challenging. Because of the evolution of storage technologies and the increasing diversity of workloads, storage software is becoming more and more complex. For example, a Flash Translation Layer (FTL) is added for NAND-flash-based Solid State Disks (NAND-SSDs), but it introduces overhead such as address-translation delay and garbage-collection costs. Many recent studies aim to address this overhead; unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapid evolution in storage technologies, the increasing heterogeneity and diversity of machines and workloads, coupled with the continued data explosion, exacerbate the gap between computing and storage speeds. In this dissertation, we improve data storage performance with both top-down and bottom-up approaches. First, we investigate exposing storage-level parallelism so that applications can avoid I/O contention and workload skew when scheduling jobs. Second, we study how architecture-aware task scheduling can improve application performance when PCM-based NVRAM is equipped. Third, we develop an I/O-correlation-aware flash translation layer for NAND-flash-based Solid State Disks. Fourth, we build a DRAM-based correlation-aware FTL emulator and study its performance on various filesystems. (A minimal page-mapping FTL sketch appears after this record.)
- Date Issued
- 2018
- Identifier
- CFE0007273, ucf:52188
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007273
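Since the abstract above centers on the Flash Translation Layer, the sketch below shows a minimal page-mapped FTL, enough to see where address-translation and garbage-collection overheads come from. It is a generic illustration with invented names, not the dissertation's correlation-aware FTL.

```python
class PageMappedFTL:
    """Minimal page-level FTL: a logical-to-physical map plus out-of-place
    updates, which is what creates the need for garbage collection."""

    def __init__(self, n_physical_pages):
        self.l2p = {}                         # logical page -> physical page
        self.free = list(range(n_physical_pages))
        self.invalid = set()                  # stale pages awaiting GC

    def write(self, lpn, flash):
        if lpn in self.l2p:                   # out-of-place update
            self.invalid.add(self.l2p[lpn])
        ppn = self.free.pop(0)
        self.l2p[lpn] = ppn
        flash[ppn] = lpn                      # stand-in for programming the page
        return ppn

    def read(self, lpn):
        return self.l2p.get(lpn)              # the address-translation step

# Toy usage: two writes to the same logical page leave one invalid physical page.
flash = {}
ftl = PageMappedFTL(n_physical_pages=8)
ftl.write(0, flash); ftl.write(0, flash)
print(ftl.read(0), ftl.invalid)               # 1 {0}
```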
- Title
- Simulation, Analysis, and Optimization of Heterogeneous CPU-GPU Systems.
- Creator
-
Giles, Christopher, Heinrich, Mark, Ewetz, Rickard, Lin, Mingjie, Pattanaik, Sumanta, Flitsiyan, Elena, University of Central Florida
- Abstract / Description
-
With the computing industry's recent adoption of the Heterogeneous System Architecture (HSA) standard, we have seen a rapid change in heterogeneous CPU-GPU processor designs. State-of-the-art heterogeneous CPU-GPU processors tightly integrate multicore CPUs and multi-compute-unit GPUs on a single die. This brings the MIMD processing capabilities of the CPU and the SIMD processing capabilities of the GPU together into a single cohesive package, with new HSA features comprising better programmability, coherency between the CPU and GPU, a shared Last Level Cache (LLC), and shared virtual memory address spaces. These advancements can potentially bring marked gains in heterogeneous processor performance and have piqued the interest of researchers who wish to unlock them. Therefore, in this dissertation I explore the heterogeneous CPU-GPU processor and application design space with the goal of answering interesting research questions, such as (1) what are the architectural design trade-offs in heterogeneous CPU-GPU processors, and (2) how do we best maximize heterogeneous CPU-GPU application performance on a given system. To enable my exploration of the heterogeneous CPU-GPU design space, I introduce a novel discrete event-driven simulation library called KnightSim and a novel computer architectural simulator called M2S-CGM. M2S-CGM includes all of the simulation elements necessary to simulate coherent execution between a CPU and GPU with a shared LLC and shared virtual memory address spaces. I then utilize M2S-CGM to conduct three architectural studies. First, I study the architectural effects of a shared LLC and CPU-GPU coherence on the overall performance of non-collaborative GPU-only applications. Second, I profile and analyze a set of collaborative CPU-GPU applications to determine how best to optimize them for maximum collaborative performance. Third, I study the impact of varying four key architectural parameters on collaborative CPU-GPU performance: GPU compute-unit coalesce size, GPU-to-memory-controller bandwidth, GPU frequency, and system-wide switching fabric latency.
- Date Issued
- 2019
- Identifier
- CFE0007807, ucf:52346
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007807
- Title
- Interactive Perception in Robotics.
- Creator
-
Baghbahari Baghdadabad, Masoud, Behal, Aman, Haralambous, Michael, Lin, Mingjie, Sukthankar, Gita, Xu, Yunjun, University of Central Florida
- Abstract / Description
-
Interactive perception is a significant and unique characteristic of embodied agents. An agent can discover plenty of knowledge through active interaction with its surrounding environment. Recently, deep learning structures have introduced new possibilities for interactive perception in robotics. The advantage of deep learning is in acquiring self-organizing features from gathered data; however, it is computationally impractical to implement in real-time interaction applications, and it can be difficult to attach a physical interpretation. An alternative framework suggested in such cases is integrated perception-action. In this dissertation, we propose two integrated interactive perception-action algorithms for real-time automated grasping of novel objects using pure tactile sensing. While visual sensing and processing are necessary for gross reaching movements, they can slow down the grasping process if used as the only sensing modality. To overcome this issue, humans primarily utilize tactile perception once the hand is in contact with the object. Inspired by this, we first propose an algorithm that gives a robot a similar ability by formulating the required grasping steps. Next, we develop the algorithm to achieve the force-closure constraint by suggesting a human-like behavior for the robot to interactively identify the object. During this process, the robot adjusts the hand through interactive exploration of the object's local surface normal vector. After the robot finds the surface normal vector, it then tries to find the object's edges to achieve a graspable final rendezvous with the object. This is particularly important for finding the edges of rectangular objects before fully grasping them. We implement the proposed approaches on an assistive robot to demonstrate the performance of interactive perception-action strategies in accomplishing the grasping task automatically.
- Date Issued
- 2019
- Identifier
- CFE0007780, ucf:52361
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007780
- Title
- Guided Autonomy for Quadcopter Photography.
- Creator
-
Alabachi, Saif, Sukthankar, Gita, Behal, Aman, Lin, Mingjie, Boloni, Ladislau, Laviola II, Joseph, University of Central Florida
- Abstract / Description
-
Photographing small objects with a quadcopter is non-trivial with many common user interfaces, especially when it requires maneuvering an Unmanned Aerial Vehicle (UAV) to difficult angles in order to shoot from high perspectives. The aim of this research is to employ machine learning to support better user interfaces for quadcopter photography. Human-Robot Interaction (HRI) is supported by visual servoing, a specialized vision system for real-time object detection, and control policies acquired through reinforcement learning (RL). Two investigations of guided autonomy were conducted. In the first, the user directed the quadcopter with a sketch-based interface, and periods of user direction were interspersed with periods of autonomous flight. In the second, the user directs the quadcopter by taking a single photo with a handheld mobile device, and the quadcopter autonomously flies to the requested vantage point. This dissertation focuses on the following problems: 1) evaluating different user interface paradigms for dynamic photography in a GPS-denied environment; 2) learning better Convolutional Neural Network (CNN) object detection models to assure higher precision in detecting human subjects than currently available state-of-the-art fast models; 3) transferring learning from the Gazebo simulation into the real world; 4) learning robust control policies using deep reinforcement learning to maneuver the quadcopter to multiple shooting positions with minimal human interaction.
- Date Issued
- 2019
- Identifier
- CFE0007774, ucf:52369
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007774
- Title
- Heterogeneous Reconfigurable Fabrics for In-circuit Training and Evaluation of Neuromorphic Architectures.
- Creator
-
Mohammadizand, Ramtin, DeMara, Ronald, Lin, Mingjie, Sundaram, Kalpathy, Fan, Deliang, Wu, Annie, University of Central Florida
- Abstract / Description
-
A heterogeneous device-technology reconfigurable logic fabric is proposed which leverages the cooperating advantages of distinct magnetic random access memory (MRAM)-based look-up tables (LUTs) to realize sequential logic circuits, along with conventional SRAM-based LUTs to realize combinational logic paths. The resulting Hybrid Spin/Charge FPGA (HSC-FPGA), using magnetic tunnel junction (MTJ) devices within this topology, demonstrates commensurate reductions in area and power consumption over fabrics having LUTs constructed with either individual technology alone. Herein, a hierarchical top-down design approach is used to develop the HSC-FPGA, starting from the configurable logic block (CLB) and slice structures down to the LUT circuits and the corresponding device fabrication paradigms. This facilitates a novel architectural approach to reduce leakage energy, minimize communication occurrence and energy cost by eliminating unnecessary data transfer, and support auto-tuning for resilience. Furthermore, HSC-FPGA enables new advantages of technology co-design, which trades off alternative mappings between emerging devices and transistors at runtime by allowing dynamic remapping to adaptively leverage the intrinsic computing features of each device technology. HSC-FPGA offers a platform for fine-grained Logic-In-Memory architectures and runtime-adaptive hardware. An orthogonal dimension of fabric heterogeneity is non-determinism, enabled by either low-voltage CMOS or probabilistic emerging devices. It can be realized using probabilistic devices within a reconfigurable network to blend deterministic and probabilistic computational models. Herein, we consider the probabilistic spin logic p-bit device as a fabric element comprising a crossbar-structured weighted array. The programmability of the resistive network interconnecting p-bit devices can be achieved by modifying the resistive states of the array's weighted connections. Thus, the programmable weighted array forms a CLB-scale macro co-processing element with bitstream programmability. This allows field programmability for a wide range of classification problems and recognition tasks, permitting fluid mappings of probabilistic and deterministic computing approaches. In particular, a Deep Belief Network (DBN) is implemented in the field using recurrent layers of co-processing elements to form an n x m1 x m2 x ... x mi weighted array as a configurable hardware circuit with an n-input layer followed by i-1 hidden layers. As neuromorphic architectures using post-CMOS devices increase in capability and network size, the utility and benefits of reconfigurable fabrics of neuromorphic modules can be anticipated to continue to accelerate.
- Date Issued
- 2019
- Identifier
- CFE0007502, ucf:52643
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007502
- Title
- Improvement of Data-Intensive Applications Running on Cloud Computing Clusters.
- Creator
-
Ibrahim, Ibrahim, Bassiouni, Mostafa, Lin, Mingjie, Zhou, Qun, Ewetz, Rickard, Garibay, Ivan, University of Central Florida
- Abstract / Description
-
MapReduce, designed by Google, is the most popular distributed programming model in cloud environments. Hadoop, an open-source implementation of MapReduce, is a data management framework for handling data-intensive applications on large clusters of commodity machines. Many well-known enterprises, including Facebook, Twitter, and Adobe, have been using Hadoop for their data-intensive processing needs. Task stragglers in MapReduce jobs dramatically impede job execution on massive datasets in cloud computing systems. This impedance is due to the uneven distribution of input data and computation load among cluster nodes, heterogeneous data nodes, data skew in the reduce phase, resource contention, and network configurations. All of these factors may cause delay failures and violations of job completion time. One of the key issues that can significantly affect the performance of cloud computing is balancing the computation load among cluster nodes. Replica placement in the Hadoop distributed file system plays a significant role in data availability and balanced cluster utilization. Under the current replica placement policy (RPP) of the Hadoop distributed file system (HDFS), the replicas of data blocks cannot be evenly distributed across the cluster's nodes, so HDFS must rely on a load-balancing utility, which results in extra time and resource overhead. This dissertation addresses the data load-balancing problem and presents an innovative replica placement policy for HDFS that can perfectly balance the data load among the cluster's nodes. The heterogeneity of cluster nodes exacerbates the issue of computational load balancing; therefore, another replica placement algorithm is proposed for heterogeneous cluster environments. The timing of identifying straggler map tasks is very important for straggler mitigation in data-intensive cloud computing. To mitigate straggler map tasks, a Present progress and Feedback based Speculative Execution (PFSE) algorithm is proposed: a new straggler identification scheme that identifies straggler map tasks based on feedback information received from completed tasks in addition to the progress of the currently running task. Straggler reduce tasks aggravate violations of MapReduce job completion time; they are typically the result of bad data partitioning during the reduce phase, since the hash partitioner employed by Hadoop may cause intermediate data skew. In this dissertation a new partitioning scheme, named Balanced Data Clusters Partitioner (BDCP), is proposed to mitigate straggler reduce tasks. BDCP is based on sampling the input data and feedback information about the current processing task; it can assist in straggler mitigation during the reduce phase and minimize job completion time in MapReduce jobs. The results of extensive experiments corroborate that the algorithms and policies proposed in this dissertation can improve the performance of data-intensive applications running on cloud platforms. (A small sampling-based partitioner sketch appears after this record.)
- Date Issued
- 2019
- Identifier
- CFE0007818, ucf:52804
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007818
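To make the partitioning idea behind BDCP concrete, the sketch below builds a frequency-aware partition map from a sample of map-output keys and assigns heavy keys greedily to the lightest reducer, instead of plain hashing. It is a minimal illustration under invented data and names, not the BDCP algorithm itself, which also folds in runtime feedback.

```python
from collections import Counter

def build_partition_map(sample_keys, n_reducers):
    """Greedy, frequency-aware partitioner built from a sample of map output.

    Heavy keys are assigned first, each to the currently lightest reducer, so
    skewed keys do not all hash onto one straggler reducer.
    """
    freq = Counter(sample_keys)
    load = [0] * n_reducers
    mapping = {}
    for key, count in freq.most_common():
        target = load.index(min(load))
        mapping[key] = target
        load[target] += count
    return mapping

def partition(key, mapping, n_reducers):
    """Fall back to hashing for keys never seen in the sample."""
    return mapping.get(key, hash(key) % n_reducers)

# Toy usage: the key "hot" dominates the sample but no reducer gets everything.
sample = ["hot"] * 80 + ["warm"] * 15 + ["cold"] * 5
pmap = build_partition_map(sample, n_reducers=3)
print(pmap, partition("unseen", pmap, 3))
```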
- Title
- Adaptive Architectural Strategies for Resilient Energy-Aware Computing.
- Creator
-
Ashraf, Rizwan, DeMara, Ronald, Lin, Mingjie, Wang, Jun, Jha, Sumit, Johnson, Mark, University of Central Florida
- Abstract / Description
-
Reconfigurable logic or Field-Programmable Gate Array (FPGA) devices have the ability to dynamically adapt the computational circuit based on user-specified or operating-condition requirements. Such hardware platforms are utilized in this dissertation to develop adaptive techniques for achieving reliable and sustainable operation while autonomously meeting these requirements. In particular, the properties of resource uniformity and in-field reconfiguration via on-chip processors are exploited to implement Evolvable Hardware (EHW). EHW uses genetic algorithms to realize logic circuits at runtime, as directed by the objective function. However, the size of problems solved using EHW has been limited to relatively compact circuits compared with traditional approaches, because the complexity of the genetic algorithm grows with circuit size. To address this scalability challenge, the Netlist-Driven Evolutionary Refurbishment (NDER) technique was designed and implemented herein to enable on-the-fly permanent fault mitigation in FPGA circuits. NDER has been shown to refurbish relatively large benchmark circuits compared to related works. Additionally, Design Diversity (DD) techniques that aid such evolutionary refurbishment are proposed, and the efficacy of various DD techniques is quantified and evaluated. Similarly, there is a growing need for adaptable logic datapaths in custom-designed nanometer-scale ICs to ensure operational reliability in the presence of Process, Voltage, and Temperature (PVT) and transistor-aging variations owing to decreased feature sizes. Without such adaptability, excessive design guardbands are required to maintain the desired integration and performance levels. To address these challenges, the circuit-level technique of Self-Recovery Enabled Logic (SREL) was designed herein. At design time, vulnerable portions of the circuit identified using conventional Electronic Design Automation tools are replicated to provide post-fabrication adaptability via intelligent techniques. In-situ timing sensors are utilized in a feedback loop to activate suitable datapaths based on current conditions that optimize performance and energy consumption. Primarily, SREL is able to mitigate the timing degradation caused by transistor-aging effects in sub-micron devices by using power gating to reduce the stress induced on active elements. As a result, fewer guardbands are needed to achieve comparable performance levels, which leads to considerable energy savings over the operational lifetime. The need for energy-efficient operation in current computing systems has given rise to Near-Threshold Computing, as opposed to the conventional approach of operating devices at nominal voltage. In particular, the goal of the exascale computing initiative in High Performance Computing (HPC) is to achieve 1 EFLOPS under a power budget of 20 MW. However, this comes at the cost of increased reliability concerns, such as increased performance variations and soft errors, giving rise to stronger resiliency requirements for HPC applications in terms of ensuring functionality within given error thresholds while operating at lower voltages. My dissertation research devised techniques and tools to quantify the effects of radiation-induced transient faults in distributed applications on large-scale systems. A combination of compiler-level code transformation and instrumentation is employed for runtime monitoring to assess the speed and depth of application state corruption as a result of fault injection. Finally, fault propagation models are derived for each HPC application that can be used to estimate the number of corrupted memory locations at runtime. Additionally, the tradeoffs between performance and vulnerability and the causal relations between compiler optimization and application vulnerability are investigated.
- Date Issued
- 2015
- Identifier
- CFE0006206, ucf:52889
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006206
- Title
- Human Action Localization and Recognition in Unconstrained Videos.
- Creator
-
Boyraz, Hakan, Tappen, Marshall, Foroosh, Hassan, Lin, Mingjie, Zhang, Shaojie, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature demonstrates the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address the action detection and recognition problems. Action detection in video is particularly difficult because actions must not only be recognized correctly but also localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments and representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison. Second, we present a probabilistic model that learns to identify discriminative regions in videos from weakly supervised data, where each video clip is only assigned a label describing what action is present in the frame or clip. While our first system requires every action to be manually outlined in every frame of the video, this second system only requires that the video be given a single high-level tag. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on both the MSR Action Dataset II and the UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to those produced by systems that require each frame to be manually annotated. This system is able to detect actions in both 1) non-temporally-segmented action videos and 2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, HMDB and UCF101. Third, we extend our weakly supervised framework by replacing the recognition stage with a two-stage neural network and applying dropout to prevent overfitting of the parameters on the training data. The dropout technique was recently introduced to prevent overfitting of the parameters in deep neural networks and has been applied successfully to the object recognition problem. To our knowledge, this is the first system using dropout for the action recognition problem. We demonstrate that using dropout improves the action recognition accuracies on the HMDB and UCF101 datasets. (A small video-projection sketch appears after this record.)
- Date Issued
- 2013
- Identifier
- CFE0004977, ucf:49562
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004977
- Title
- Towards High-Efficiency Data Management In the Next-Generation Persistent Memory System.
- Creator
-
Chen, Xunchao, Wang, Jun, Fan, Deliang, Lin, Mingjie, Ewetz, Rickard, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
For the sake of higher cell density while achieving near-zero standby power, recent research progress in Magnetic Tunneling Junction (MTJ) devices has leveraged Multi-Level Cell (MLC) configurations of Spin-Transfer Torque Random Access Memory (STT-RAM). However, in order to mitigate write disturbance in an MLC strategy, data stored in the soft bit must be restored immediately after the hard bit switching is completed. Furthermore, as a result of MTJ feature size scaling, the soft bit can be expected to become disturbed by the read sensing current, thus requiring an immediate restore operation to ensure data reliability. In this work, we design and analyze a novel Adaptive Restore Scheme for Write Disturbance (ARS-WD) and Read Disturbance (ARS-RD), respectively. ARS-WD alleviates restoration overhead by intentionally overwriting soft bit lines that are less likely to be read. ARS-RD, on the other hand, aggregates the potential writes and restores the soft bit line at the time of its eviction from the higher-level cache. Both schemes are based on a lightweight forecasting approach for the future read behavior of the cache block. Our experimental results show a substantial reduction in soft bit line restore operations. Moreover, ARS preserves the advantages of MLC, providing a preferable L2 design alternative in terms of the energy, area, and latency product compared with SLC STT-RAM alternatives.

Whereas the popular Cell Split Mapping (CSM) for MLC STT-RAM leverages the inter-block nonuniform access frequency, the intra-block data access features remain untapped in the MLC design. Aiming to minimize the energy-hungry write requests to the Hard-Bit Line (HBL) and maximize the dynamic range in the advantageous Soft-Bit Line (SBL), a hybrid mapping strategy for MLC STT-RAM cache (Double-S) is advocated. Double-S couples the contemporary Cell Split Mapping with the novel Word Split Mapping (WSM). A sparse cache block detector and a read-depth-based data allocation/migration policy are proposed to unlock the full potential of Double-S.
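A rough policy-level model of the adaptive restore idea is sketched below. It is not the thesis's hardware design: the saturating read counter, the threshold, and the event interface are hypothetical stand-ins for the "lightweight forecasting approach", intended only to show how restores could be issued eagerly for read-likely lines and deferred to eviction otherwise.

```python
from collections import defaultdict

class AdaptiveRestoreSim:
    """Toy software model of adaptive soft-bit restores in an MLC STT-RAM cache.

    The 2-bit saturating counter and threshold below are illustrative
    assumptions, not the forecaster actually proposed in the dissertation.
    """

    def __init__(self, read_threshold=2):
        self.read_counter = defaultdict(int)  # per-line predicted read likelihood (0..3)
        self.pending_restore = set()          # soft-bit lines whose restore was deferred
        self.read_threshold = read_threshold
        self.restores = 0

    def on_read(self, line):
        self.read_counter[line] = min(self.read_counter[line] + 1, 3)
        if line in self.pending_restore:      # disturbed data must be repaired before use
            self._restore(line)

    def on_hard_bit_write(self, line):
        """Hard-bit switching disturbs the co-located soft bit."""
        if self.read_counter[line] >= self.read_threshold:
            self._restore(line)               # read-likely line: restore immediately
        else:
            self.pending_restore.add(line)    # unlikely to be read: defer the restore

    def on_evict(self, line):
        if line in self.pending_restore:
            self._restore(line)               # repair at eviction, before write-back

    def _restore(self, line):
        self.pending_restore.discard(line)
        self.restores += 1

sim = AdaptiveRestoreSim()
events = [("write", 1), ("write", 1), ("read", 2), ("write", 2), ("evict", 1)]
handlers = {"read": sim.on_read, "write": sim.on_hard_bit_write, "evict": sim.on_evict}
for kind, line in events:
    handlers[kind](line)
print("restore operations issued:", sim.restores)
```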
- Date Issued
- 2017
- Identifier
- CFE0006865, ucf:51751
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006865
- Title
- Probabilistic-Based Computing Transformation with Reconfigurable Logic Fabrics.
- Creator
-
Alawad, Mohammed, Lin, Mingjie, DeMara, Ronald, Mikhael, Wasfy, Wang, Jun, Das, Tuhin, University of Central Florida
- Abstract / Description
-
Effectively tackling the upcoming "zettabytes" data explosion requires a huge quantum leap in our computing power and energy efficiency. However, with Moore's law dwindling quickly, the physical limits of CMOS technology make it almost intractable to achieve high energy efficiency if the traditional "deterministic and precise" computing model still dominates. Worse, the upcoming data explosion mostly comprises statistics gleaned from an uncertain, imperfect real-world environment. As such, the traditional computing means of first-principle modeling or explicit statistical modeling will very likely be ineffective for achieving flexibility, autonomy, and human interaction. The bottom line is clear: given where we are headed, the fundamental principle of modern computing, that deterministic logic circuits can flawlessly emulate propositional logic deduction governed by Boolean algebra, has to be reexamined, and transformative changes in the foundation of modern computing must be made.

This dissertation presents a novel stochastic-based computing methodology. It efficiently realizes algorithmic computing through the proposed concept of Probabilistic Domain Transform (PDT). The essence of the PDT approach is to encode the input signal as a probability density function, perform stochastic computing operations on the signal in the probabilistic domain, and decode the output signal by estimating the probability density function of the resulting random samples. The proposed methodology possesses many notable advantages. Specifically, it uses much simplified circuit units to conduct complex operations, which leads to highly area- and energy-efficient designs suitable for parallel processing. Moreover, it is highly fault-tolerant because the information to be processed is encoded with a large ensemble of random samples; local perturbations of computing accuracy are therefore dissipated globally and become inconsequential to the final overall results. Finally, the proposed probabilistic-based computing can facilitate building scalable-precision systems, which provides an elegant way to trade off computing accuracy against computing performance and hardware efficiency for many real-world applications.

To validate the effectiveness of the proposed PDT methodology, two important signal processing applications, discrete convolution and 2-D FIR filtering, are first implemented and benchmarked against other deterministic-based circuit implementations. Furthermore, a large-scale Convolutional Neural Network (CNN), a fundamental algorithmic building block in many computer vision and artificial intelligence applications that follow the deep learning principle, is also implemented on FPGA based on a novel stochastic-based and scalable hardware architecture and circuit design. The key idea is to implement all key components of a deep learning CNN, including multi-dimensional convolution, activation, and pooling layers, completely in the probabilistic computing domain. The proposed architecture not only achieves the advantages of stochastic-based computation, but can also address several challenges in conventional CNNs, such as complexity, parallelism, and memory storage. Overall, being highly scalable and energy efficient, the proposed PDT-based architecture is well-suited for a modular vision engine with the goal of performing real-time detection, recognition, and segmentation of mega-pixel images, especially those perception-based computing tasks that are inherently fault-tolerant.
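The encode/compute/decode flow described above follows the general pattern of stochastic computing. The sketch below shows that pattern for a single multiplication using unipolar bitstreams; it is a generic textbook-style illustration, not the dissertation's PDT circuits, and the 4096-bit stream length is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(value, n_bits=4096):
    """Encode a value in [0, 1] as a Bernoulli bitstream whose mean equals the value."""
    return rng.random(n_bits) < value

def decode(bitstream):
    """Estimate the encoded value from the empirical bit probability."""
    return bitstream.mean()

# In the probabilistic domain, multiplying two independent unipolar
# streams reduces to a bitwise AND gate, i.e. a single simple circuit unit.
a, b = 0.8, 0.55
product_stream = encode(a) & encode(b)

print("exact product:", a * b)
print("stochastic estimate:", decode(product_stream))
```

Longer streams improve the estimate, which is the accuracy-versus-performance trade-off the abstract refers to when it mentions scalable-precision systems.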
- Date Issued
- 2016
- Identifier
- CFE0006828, ucf:51768
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006828