A toolkit for scalable preprocessing and neural learning with Tensorflow and Dask

Kratimenos Michail

URI:

http://purl.tuc.gr/dl/dias/017C122C-98E5-49FD-954A-200AEAB2B6F4

Year

2025

Type of Item

Diploma Work

Bibliographic Citation

Michail Kratimenos, "A toolkit for scalable preprocessing and neural learning with Tensorflow and Dask", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2025. https://doi.org/10.26233/heallink.tuc.104033

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Software Technology and Network Applications Laboratory

Summary

In modern data-intensive applications, particularly those involving neural learning, the volume and velocity of incoming data pose significant challenges for real-time preprocessing and analysis. This thesis addresses the absence of Python-based solutions for stream summarization by introducing a parallel, scalable toolkit for data synopses, implemented using Dask. Unlike existing synopsis data engines, our approach integrates seamlessly into the Python ecosystem and is directly compatible with TensorFlow-based learning pipelines. We present a collection of synopsis data structures, such as Bloom Filters, HyperLogLog and PrioritySampler, which maintain summaries of large data streams using sublinear memory. These structures support essential operations like add, merge, and estimate, and are specifically implemented to allow efficient parallel computation via Dask. The toolkit is further embedded into SuBiTO, a Bayesian optimization framework for scalable learning, in order to optimize its performance. This optimization is demonstrated in the experimental evaluation, where our engine significantly improves the runtime of data preprocessing tasks in distributed environments by accelerating synopsis maintenance. Only concise data summaries are then fed into the neural learning pipelines, to achieve an appropriate balance between training accuracy and training speed. This work provides both a foundational toolkit and an integrated path between the neural learning pipeline and the preprocessing summarization step, offering a strong basis for future work in scalable, Python-based data summarization systems.
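
The add, merge, and estimate operations mentioned in the summary can be illustrated with a minimal sketch of a mergeable Bloom filter. This is not the toolkit's code: the class `BloomFilter`, the helper `summarize_partition`, the parameters `num_bits` and `num_hashes`, and the SHA-256 based hashing are all assumptions chosen for illustration, and the partitions are processed sequentially in plain Python rather than with Dask.

```python
import hashlib
from functools import reduce


class BloomFilter:
    """Hypothetical minimal Bloom filter; not the thesis's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def merge(self, other):
        # Filters with identical parameters merge by bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

    def estimate(self, item):
        # True means "possibly present" (false positives can occur);
        # False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def summarize_partition(items, num_bits=1024, num_hashes=3):
    # Build one partial synopsis per data partition.
    bf = BloomFilter(num_bits, num_hashes)
    for item in items:
        bf.add(item)
    return bf


# Each partition is summarized independently, then the partials are merged.
partitions = [["a", "b"], ["c", "d"], ["e"]]
partials = [summarize_partition(p) for p in partitions]
combined = reduce(lambda x, y: x.merge(y), partials)
```

Because merging is a bitwise OR, the combined filter is identical to one built over the whole stream in a single pass, which is what makes this kind of synopsis suitable for per-partition parallel construction. In a Dask setting, one plausible arrangement (again an assumption, not the thesis's design) would wrap `summarize_partition` in `dask.delayed` and merge the partial results afterwards.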
