Weekly news #11

News #


Science/Bioinformatics #

R.ROSETTA: an interpretable machine learning framework #

Garbulowski et al., BMC Bioinformatics (2021)

#machine-learning #rosetta

We present the R.ROSETTA package, which is an R wrapper of ROSETTA framework. The original ROSETTA functions have been improved and adapted to the R programming environment. The package allows for building and analyzing non-linear interpretable machine learning models. R.ROSETTA gathers combinatorial statistics via rule-based modelling for accessible and transparent results, well-suited for adoption within the greater scientific community. The package also provides statistics and visualization tools that facilitate minimization of analysis bias and noise. The R.ROSETTA package is freely available at https://github.com/komorowskilab/R.ROSETTA. To illustrate the usage of the package, we applied it to a transcriptome dataset from an autism case–control study. Our tool provided hypotheses for potential co-predictive mechanisms among features that discerned phenotype classes. These co-predictors represented neurodevelopmental and autism-related genes.

Deep learning-based enhancement of epigenomics data with AtacWorks #

Lal and Chiang et al., Nat Commun (2021)

#atac-seq #machine-learning #denoise

ATAC-seq is a widely-applied assay used to measure genome-wide chromatin accessibility; however, its ability to detect active regulatory regions can depend on the depth of sequencing coverage and the signal-to-noise ratio. Here we introduce AtacWorks, a deep learning toolkit to denoise sequencing coverage and identify regulatory peaks at base-pair resolution from low cell count, low-coverage, or low-quality ATAC-seq data. Models trained by AtacWorks can detect peaks from cell types not seen in the training data, and are generalizable across diverse sample preparations and experimental platforms. We demonstrate that AtacWorks enhances the sensitivity of single-cell experiments by producing results on par with those of conventional methods using ~10 times as many cells, and further show that this framework can be adapted to enable cross-modality inference of protein-DNA interactions. Finally, we establish that AtacWorks can enable new biological discoveries by identifying active regulatory regions associated with lineage priming in rare subpopulations of hematopoietic stem cells.

Pattern discovery and disentanglement on relational datasets #

Wong et al., Sci Rep (2021)

#pattern-discovery #knowledge-discovery

Machine Learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited in the areas of relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, with outputs that typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed Pattern Discovery and Disentanglement System (PDD), which is able to discover explicit patterns from the data with various sizes, imbalanced groups, and screen out anomalies. We present herein four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of discovered knowledge in an explicit representation framework PDD Knowledge Base that links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.

Chunkflow: hybrid cloud processing of large 3D images by convolutional nets #

Wu et al., Nat Methods (2021)

#image-processing #distributed-computation

Automated microscopes with both high resolution and large field of view are generating terascale and even petascale 3D images. A local cluster might not have enough computational resources to process them in reasonable time, but public cloud platforms can provide computational resources on demand. Convolutional networks have become the state-of-the-art approach for 3D biological image analysis, and cloud processing by 3D convolutional nets has been used for processing independent small image stacks. However, cloud computing tools to perform distributed processing of terascale or petascale 3D images by convolutional nets are lacking. Here, we report chunkflow, a framework for distributing computational tasks over both cloud and local computational resources, including both GPUs and CPUs with multiple deep-learning framework back ends, to maximize efficiency, increase flexibility and reduce cost.


Programming #

R Cheat Sheets #

A collection of cheat sheets for working with R.


Tools #

omnisci/omniscidb #

OmniSci is the world’s fastest open source SQL engine, equally powerful at the heart of the OmniSci platform as it is accelerating third-party analytic apps.

rbreaves/kinto #

Linux & Windows with Mac-style shortcut keys.

70+ open-source clones of popular sites like Airbnb, Amazon, Instagram, Netflix, TikTok, Spotify, Trello, WhatsApp, YouTube, etc. The list contains source code, demo links, tech stack, and GitHub star counts. Great for learning purposes!


Guides and Tutorials #

Data Science for Psychologists #

This book provides an introduction to data science that is tailored to the needs of psychologists, but is also suitable for students of the humanities and other biological or social sciences.

OH MY GIT #

An open source game about learning Git!

Docker Security Cheat Sheet #

The aim of this cheat sheet is to provide an easy to use list of common security mistakes and good practices that will help you secure your Docker containers.

How to Backup Docker Containers? Docker Container Backup Methods #

A guide to backing up a Docker container: commit -> save -> tag -> push.
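The four steps can be sketched as a small Python helper that assembles the corresponding `docker` CLI calls. This is only an illustration, not the linked guide's own script: the container name, image name, archive path, and registry address are hypothetical placeholders, and the helper defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List


def backup_commands(container: str, image: str, archive: str, remote: str) -> List[List[str]]:
    """Build the docker CLI calls for commit -> save -> tag -> push."""
    return [
        # Snapshot the container's filesystem as a new image.
        ["docker", "commit", container, image],
        # Write that image to a local tar archive.
        ["docker", "save", "-o", archive, image],
        # Give the image a registry-qualified name.
        ["docker", "tag", image, remote],
        # Upload the tagged image to the registry.
        ["docker", "push", remote],
    ]


def run_backup(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    cmds = backup_commands(
        container="my-app",
        image="my-app-backup:latest",
        archive="my-app-backup.tar",
        remote="registry.example.com/backups/my-app:latest",
    )
    run_backup(cmds)  # dry run: prints the four commands
```

Note that `docker commit` does not capture data stored in mounted volumes, so volumes need to be backed up separately.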

Clustering similar spatial patterns #

Code Review Checklist R Code Edition Top 3 #


Others #

authelia/authelia #

The Single Sign-On Multi-Factor portal for web apps

Fitbit is doomed: Here’s why everything Google buys turns to garbage #


Honorable mentions #

The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation #

Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning #

Deep learning-based point-scanning super-resolution imaging #

Integration and transfer learning of single-cell transcriptomes via cFIT #