Institutional Repository
Technical University of Crete
EN  |  EL

Search

Browse

My Space

A toolkit for scalable preprocessing and neural learning with Tensorflow and Dask

Kratimenos Michail

Full record


URI: http://purl.tuc.gr/dl/dias/017C122C-98E5-49FD-954A-200AEAB2B6F4
Year 2025
Type of Item Diploma Work
License
Details
Bibliographic Citation Michail Kratimenos, "A toolkit for scalable preprocessing and neural learning with Tensorflow and Dask", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2025 https://doi.org/10.26233/heallink.tuc.104033
Appears in Collections

Summary

In modern data-intensive applications, particularly those involving neural learning, the volume and velocity of incoming data pose significant challenges for real-time preprocessing and analysis. This thesis addresses the absence of Python-based solutions for stream summarization by introducing a parallel, scalable toolkit for data synopses, implemented using Apache Dask. Unlike existing synopses data engines, our approach is seamlessly integrated into Python ecosystems and is directly compatible with TensorFlow-based learning pipelines. We present an aggregation of probabilistic data structures, such as Bloom Filters, HyperLogLog and PrioritySampler, which maintain summaries of large data streams using sublinear memory. These structures support essential operations like add, merge, and estimate and are specifically implemented to allow efficient parallel computation via Dask. The toolkit is further embedded into SuBiTO, a Bayesian optimization framework for scalable learning, in order to optimize its performance. This optimization is shown in the experimental evaluations, where our engine significantly improves the runtime of data preprocessing tasks in distributed environments, by accelerating the synopses maintenance. Then, only concise data summaries are fed into neural learning pipelines to achieve appropriate balance between training accuracy and training speed. This work provides both afoundational toolkit and an integrated path between the neural learning pipeline and the preprocessing summarization step, offering a strong basis for future work in scalable, Python-based data summarization systems.

Available Files

Services

Statistics