Improving small molecules activity modelling capability of cell painting data through data augmentation and effective representation learning
| dc.contributor.advisor | Czodrowski, Paul | |
| dc.contributor.author | Ha, Son V. | |
| dc.date.accessioned | 2025-11-05T13:47:04Z | |
| dc.date.issued | 2024 | |
| dc.description.abstract | This thesis focuses on improving image-based activity modeling for early-stage drug discovery through data augmentation and representation learning of Cell Painting data. Firstly, a significant contribution is the introduction of the FSL-CP dataset, designed to support few-shot learning (FSL) benchmarking of small-molecule bioactivity prediction using cell microscopy images. Using this dataset, we compare several FSL paradigms in a low-data context and study the effectiveness of transfer learning. Additionally, this work proposes an application for underused ‘low concentration images’ in activity modeling. We propose combining well-performing models trained on images at higher concentrations with images at lower concentrations for inference, in order to identify more potent compounds. We show that this approach improves on the conventional method (directly training a high-potency model) in 65% of the assays investigated in terms of AUC-ROC, and in 75% of assays in terms of RIPtoP-corrected AUC-PR. The thesis further investigates cross-modality representation learning of Cell Painting (CP) and transcriptomics (TX) data, which are powerful tools in early drug discovery for understanding the biological effect of compounds on a population of cells post-treatment. In this work, we benchmark two representation learning methods: contrastive learning and a bimodal autoencoder. We use a cross-modality learning setting in which representation learning is performed with two modalities (CP and TX), but only CP is available for generating embeddings of new compounds and for the downstream task. This is because, for new compounds, we would only have CP data and not TX data, owing to the high data generation cost of the RNA-Seq screen. We show that the learned representation improves cluster quality when clustering CP replicates and different modes of action (MoA). | en |
| dc.identifier.doi | https://doi.org/10.25358/openscience-13516 | |
| dc.identifier.uri | https://openscience.ub.uni-mainz.de/handle/20.500.12030/13537 | |
| dc.identifier.urn | urn:nbn:de:hebis:77-3733c68f-0879-4d6b-9af1-72566ae6dce49 | |
| dc.language.iso | eng | |
| dc.rights | CC-BY-ND-4.0 | |
| dc.rights.uri | https://creativecommons.org/licenses/by-nd/4.0/ | |
| dc.subject.ddc | 500 Naturwissenschaften | de |
| dc.subject.ddc | 500 Natural sciences and mathematics | en |
| dc.subject.ddc | 540 Chemie | de |
| dc.subject.ddc | 540 Chemistry and allied sciences | en |
| dc.subject.ddc | 660 Technische Chemie | de |
| dc.subject.ddc | 660 Chemical engineering | en |
| dc.subject.ddc | 004 Informatik | de |
| dc.subject.ddc | 004 Data processing | en |
| dc.title | Improving small molecules activity modelling capability of cell painting data through data augmentation and effective representation learning | en |
| dc.type | Dissertation | |
| jgu.date.accepted | 2025-10-02 | |
| jgu.description.extent | xx, 113 pages ; illustrations, diagrams | |
| jgu.identifier.uuid | 3733c68f-0879-4d6b-9af1-72566ae6dce4 | |
| jgu.organisation.department | FB 09 Chemie, Pharmazie u. Geowissensch. | |
| jgu.organisation.name | Johannes Gutenberg-Universität Mainz | |
| jgu.organisation.number | 7950 | |
| jgu.organisation.place | Mainz | |
| jgu.organisation.ror | https://ror.org/023b0x485 | |
| jgu.organisation.year | 2024 | |
| jgu.rights.accessrights | openAccess | |
| jgu.subject.ddccode | 500 | |
| jgu.subject.ddccode | 540 | |
| jgu.subject.ddccode | 660 | |
| jgu.subject.ddccode | 004 | |
| jgu.type.dinitype | PhDThesis | en_GB |
| jgu.type.resource | Text | |
| jgu.type.version | Original work |