-
We formulate the target sample volume V(x, y, z) as a random field on the set of all discretized axial positions Z, i.e., $I_z \in {\mathbb {R}}^{m \times n}, z \in Z$, where x, y are pixel indices on the lateral plane, m, n are the lateral dimensions of the image, and z is a certain axial position in Z. The distribution of such random fields is defined by the 3D distribution of the sample of interest, the PSF of the microscopy system, the aberrations and random noise terms present in the image acquisition system. Recurrent-MZ takes in a set of M 2D axial images, i.e., $\left\{ {I_{z_1}, I_{z_2}, \cdots , I_{z_M}} \right\}, \, 1\, < \, M \ll |Z|$, where |Z| is the cardinality of Z, defining the number of unique axial planes in the target sample. The output inference of Recurrent-MZ estimates (i.e., reconstructs) the volume of the sample and will be denoted as $V_M(x, y, z;\, I_{z_1}, I_{z_2}, \cdots , I_{z_M})$. Starting with the next sub-section we summarize Recurrent-MZ inference results using different fluorescent samples.
-
A Recurrent-MZ network was trained and validated using C. elegans samples, and then blindly tested on new specimens that were not part of the training/validation dataset. This trained Recurrent-MZ was used to reconstruct C. elegans samples with high fidelity over an extended axial range of 18 μm based on three 2D input images that were captured with an axial spacing of Δz = 6 μm; these three 2D images were fed into Recurrent-MZ in groups of two, i.e., M = 2 (Fig. 2). The comparison images of the same sample volume were obtained by scanning a wide-field fluorescence microscope with a 63×/1.4NA objective lens and capturing |Z| = 91 images with an axial spacing of Δz = 0.2 μm (see the Materials and Methods section). The inference performance of Recurrent-MZ is both qualitatively and quantitatively demonstrated in Fig. 2 and Video S1. Even in the middle of two adjacent input images (see the z = 11.4 μm row of Fig. 2), Recurrent-MZ is able to output images with a very good match to the ground truth image, achieving a normalized root mean square error (NRMSE) of 6.45 and a peak signal-to-noise ratio (PSNR) of 33.96. As also highlighted in Video S1, Recurrent-MZ is able to significantly extend the axial range of the reconstructed images using only three 2D input scans, each captured with a 1.4NA objective lens that has a depth-of-field of 0.4 μm. In addition to these, Supplementary Note 1 and Fig. S1 also compare the output images of Recurrent-MZ with the results of various interpolation algorithms, further demonstrating the advantages of Recurrent-MZ framework for volumetric imaging.
Fig. 2 Volumetric imaging of C. elegans from sparse wide-field scans using Recurrent-MZ.
The DPMs in the input sequence are used to define an arbitrary axial position (z) within the sample volume. In this implementation, Recurrent-MZ takes in 2 input scans (M = 2) to infer the image of an output plane, as indicated by the color of each output box. See Video S1 to compare the reconstructed sample volume inferred by Recurrent-MZ against the ground truth, |Z| = 91 images captured with an axial step size of 0.2 μm.
It is worth noting that although the Recurrent-MZ presented in Fig. 2 was trained with 2 input images (i.e., M = 2), it can still be fed with M ≥ 3 input images thanks to its recurrent scheme. Regardless of the choice of M, all Recurrent-MZ networks have the same number of parameters; the only difference is the additional time required during the training and inference phases. For example, the inference time of Recurrent-MZ for a single output plane (1024 × 1024 pixels) is 0.18 s with M = 2 and 0.28 s with M = 3. In practice, using a larger M yields better reconstruction fidelity (see e.g., Fig. S2a), at the cost of reduced imaging throughput and longer computation time. A detailed discussion of this trade-off is provided in the Discussion section.
-
Next, we demonstrated the performance of Recurrent-MZ using 50 nm fluorescence nanobeads. These nanobead samples were imaged through the TxRed channel using a 63×/1.4NA objective lens (see the Materials and Methods section). The Recurrent-MZ model was trained on a dataset with M = 3 input images, where the axial spacing between the adjacent planes was Δz = 3 μm. The ground truth images of the sample volume were captured by mechanical scanning over an axial range of 10 μm, i.e., |Z| = 101 images with Δz = 0.1 μm were obtained. Figure 3 shows both the side views and the cross-sections of the sample volume reconstructed by Recurrent-MZ (M = 3), compared against the |Z| = 101 images captured through the mechanical scanning of the same sample. The first column of Fig. 3a presents the M = 3 input images and their corresponding axial positions, which are also indicated by the blue dashed lines. Through the quantitative histogram comparison shown in Fig. 3b, we see that the volume reconstructed by Recurrent-MZ matches the ground truth volume with high fidelity. For example, the full width at half maximum (FWHM) distribution of individual nanobeads inferred by Recurrent-MZ (mean FWHM = 0.4401 μm) matches the results of the ground truth (mean FWHM = 0.4428 μm) very well. We also quantified the similarity of the ground truth histogram with that of the Recurrent-MZ output by calculating the Kullback-Leibler (KL) divergence, a non-symmetric measure of the dissimilarity between two distributions; the resulting KL divergence of 1.3373 further validates the high fidelity of the Recurrent-MZ reconstruction when compared to the ground truth, acquired through |Z| = 101 images captured via mechanical scanning of the sample with Δz = 0.1 μm.
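As a rough illustration of the histogram comparison described above, the following minimal sketch bins two sets of FWHM values and computes the KL divergence between the normalized histograms. The bin range, the number of bins, the smoothing constant `eps`, the direction of the divergence and the function names are all assumptions made here for illustration; the text does not state how the reported histograms were binned.

```python
import math

def histogram(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two histograms; eps is additive smoothing (an assumption)."""
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    p_sum, q_sum = sum(p), sum(q)
    return sum((pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
               for pi, qi in zip(p, q))

# Toy FWHM values in micrometers (not measured data).
reference = histogram([0.40, 0.42, 0.44, 0.46], 0.3, 0.6, 6)
estimate = histogram([0.40, 0.41, 0.50, 0.55], 0.3, 0.6, 6)
print(kl_divergence(reference, estimate))
```

Because the KL divergence is not symmetric, swapping the two histograms generally gives a different value, so the chosen direction should be reported alongside the number.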
Fig. 3 The performance of Recurrent-MZ using fluorescence nanobeads.
a Volumetric imaging using Recurrent-MZ (M = 3) and Deep-Z on 50 nm fluorescence nanobeads. There are 3 input images for Recurrent-MZ (M = 3) and to provide a fair comparison, Deep-Z always takes in the nearest input image among these 3 inputs to infer another axial plane. The PSFs generated by Recurrent-MZ, Deep-Z and mechanical scanning (Δz = 0.1 μm) are shown for comparison. b FWHM histograms for 88 individual isolated fluorescence nanobeads at z = 5.1 μm, measured from mechanical scanning (101 axial images), Deep-Z reconstruction and Recurrent-MZ reconstruction (M = 3). Also see Video S2.
Figure 3 also reports the comparison of Recurrent-MZ inference results with respect to another fluorescence image propagation network termed Deep-Z38. Deep-Z is designed for taking a single 2D image as input, and therefore there is an inherent trade-off between the propagation quality and the axial refocusing range (from a given focal plane), which ultimately limits the effective volumetric space-bandwidth-product (SBP) that can be achieved using Deep-Z. In this comparison between Recurrent-MZ and Deep-Z (Fig. 3), the nearest input image is used for Deep-Z based propagation; in other words, three non-overlapping volumes are separately inferred using Deep-Z from the input scans at z = 3, 6 and 9 μm, respectively (this provides a fair comparison against Recurrent-MZ with M = 3 input images). As illustrated in Fig. 3b, Deep-Z inference resulted in a mean FWHM of 0.4185 μm and a KL divergence of 2.3334, which illustrate the inferiority of single-image-based volumetric propagation, when compared to the results of Recurrent-MZ. The same conclusion regarding the performance comparison of Recurrent-MZ and Deep-Z inference is further supported using the C. elegans imaging data reported in Fig. 2 (Recurrent-MZ) and in Fig. S3 (Deep-Z).
For example, Deep-Z inference results in an NRMSE of 8.02 and a PSNR of 32.08, while Recurrent-MZ (M = 2) improves the inference accuracy, achieving an NRMSE of 6.45 and a PSNR of 33.96.
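The image-quality metrics quoted above can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: the PSNR uses the standard definition for a given peak intensity, while the NRMSE normalization (by the ground-truth intensity range, expressed in percent) is an assumption, since the text does not spell out the convention behind the reported values.

```python
import math

def rmse(pred, truth):
    """Root mean square error between two equally sized 2D images (lists of rows)."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return math.sqrt(sum(diffs) / len(diffs))

def nrmse_percent(pred, truth):
    """RMSE divided by the ground-truth intensity range, in percent (assumed convention)."""
    flat = [t for row in truth for t in row]
    return 100.0 * rmse(pred, truth) / (max(flat) - min(flat))

def psnr(pred, truth, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given peak intensity."""
    return 20.0 * math.log10(peak / rmse(pred, truth))

# Toy 2 x 2 images (not measured data).
truth = [[0.0, 1.0], [0.5, 0.5]]
pred = [[0.1, 0.9], [0.5, 0.5]]
print(rmse(pred, truth), nrmse_percent(pred, truth), psnr(pred, truth))
```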
-
Next, we demonstrated, through a series of experiments, the generalization performance of Recurrent-MZ on non-uniformly sampled input images, in contrast to the training regimen, which only included uniformly spaced inputs. These non-uniformly spaced input image planes were randomly selected from the same testing volume as shown in Fig. 2, with the distance between two adjacent input planes made smaller than the uniform axial spacing used in the training dataset (Δz = 6 μm). Although Recurrent-MZ was solely trained with equidistant input scans, it generalized to successfully perform volumetric image propagation using non-uniformly sampled input images. For example, as shown in Fig. 4a, the input images of Recurrent-MZ were randomly selected at (z1, z2, z3) = (3, 7.8, 13.6) μm, respectively, and the output inference at z = 6.8 μm and z = 12.8 μm matches very well the output of Recurrent-MZ that used uniformly sampled inputs acquired at (z1, z2, z3) = (3, 9, 15) μm, respectively. Figure 4b further demonstrates the inference performance of Recurrent-MZ using non-uniformly sampled inputs throughout the specimen volume. The blue (uniform inputs) and the red curves (non-uniform inputs) in Fig. 4b have very similar trends, illustrating the generalization of Recurrent-MZ, despite being only trained with uniformly-sampled input images with a fixed Δz. Figure S3 further presents another successful blind inference of Recurrent-MZ on non-uniformly sampled input images. On the other hand, the gray curve in Fig. 4b (3D U-Net with the same non-uniform inputs) clearly illustrates the generalization failure of a non-recurrent convolutional neural network (CNN) on non-uniformly sampled input images.
Fig. 4 Generalization of Recurrent-MZ to non-uniformly spaced input images.
a Recurrent-MZ was trained on C. elegans samples with equidistant inputs (M = 3, Δz = 6 μm), and blindly tested on both uniformly sampled and non-uniformly sampled input images of new samples. b The PSNR values of the output images of Recurrent-MZ with uniformly spaced and non-uniformly spaced input images, as well as the output images of 3D U-Net with non-uniformly spaced input images, are all calculated with respect to the corresponding ground truth image. Blue: Outputs of Recurrent-MZ (M = 3) for uniformly spaced inputs, Red: Outputs of Recurrent-MZ (M = 3) for non-uniformly spaced inputs, Gray: Outputs of 3D U-Net for non-uniformly spaced inputs (lower PSNR values are omitted). Dashed lines indicate the axial positions of the input 2D images. c Influence of hyperparameter Δz on Recurrent-MZ inference performance. We report the PSNR values of the output images of Recurrent-MZ (M = 3) models that were trained using different Δz = 4, 6, and 8 μm, but blindly tested on new samples imaged with Δz = 6 μm. The input images are captured at z = 3, 6, and 9 μm. d The boxplot of the PSNR values of the 3 networks (trained using Δz = 4, 6 and 8 μm).
We further investigated the effect of the hyperparameter Δz on the performance of Recurrent-MZ. For this, three different Recurrent-MZ networks were trained using Δz = 4, 6, and 8 μm, respectively, and then blindly tested on a new input sequence with Δz = 6 μm. Figure 4c, d show the trade-off between the peak performance and the performance consistency over the inference axial range: by decreasing Δz, Recurrent-MZ demonstrates a better peak inference performance, indicating that more accurate propagation has been learned from smaller Δz, whereas the variance of PSNR, corresponding to the performance consistency over a larger axial range, is degraded for smaller Δz.
-
During the acquisition of the input scans, inevitable measurement errors are introduced by e.g., PSF distortions and focus drift42, which jeopardize both the precision and accuracy of the axial positioning measurements. Hence, it is necessary to take these effects into consideration and examine the stability of the Recurrent-MZ inference. For this, Recurrent-MZ was tested on the same image test set as in Fig. 2, only this time, independent and identically distributed (i.i.d.) Gaussian noise was injected into the DPM of each input image, mimicking the measurement uncertainty when acquiring the axial scans. The noise was added to the DPM as follows:
$$ Z_{i,\mathrm{noised}} = Z_i + z_{d,i} J, \quad i = 1, 2, \cdots, M $$ where $Z_i$ is the DPM (an m × n matrix) of the i-th input image, $z_{d,i} \sim N(0, \sigma^2)$ for i = 1, 2, ..., M, and J is an all-ones m × n matrix.
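The noise model above adds a single random axial offset to every pixel of a given DPM, with an independent draw for each input image. A minimal sketch, assuming DPMs are stored as nested lists in micrometers (the helper names `make_dpm` and `add_axial_noise` are hypothetical):

```python
import random

def make_dpm(dz, m, n):
    """Uniform m x n DPM whose entries all equal the axial offset dz (in um)."""
    return [[dz] * n for _ in range(m)]

def add_axial_noise(dpms, sigma, rng=random):
    """Apply Z_noised = Z + z_d * J with one draw z_d ~ N(0, sigma^2) per DPM."""
    noised = []
    for dpm in dpms:
        z_d = rng.gauss(0.0, sigma)  # one scalar shared by all pixels of this DPM
        noised.append([[value + z_d for value in row] for row in dpm])
    return noised

# Example: two input planes, sigma = 1 um, fixed seed for repeatability.
rng = random.Random(0)
dpms = [make_dpm(-1.6, 2, 3), make_dpm(4.4, 2, 3)]
noisy_dpms = add_axial_noise(dpms, 1.0, rng)
```

Since the offset is constant across each DPM, the perturbation mimics a mis-reported axial position of the whole plane rather than pixel-level noise.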
The results of this noise analysis reveal that, as illustrated in Fig. 5b, the output images of Recurrent-MZ (M = 2) at z = 4.6 μm degrade as the variance of the injected noise increases, as expected. However, even at a relatively significant noise level, where the microscope stage or sample drift is represented with a standard deviation of σ = 1 μm (i.e., 2.5 times the objective lens depth-of-field of 0.4 μm), Recurrent-MZ inference successfully matches the ground truth with an NRMSE of 5.94; for comparison, the baseline inference (with σ = 0 μm) has an NRMSE of 5.03. The same conclusion also holds for output images at z = 6.8 μm, which highlights the resilience of the Recurrent-MZ framework against axial scanning errors and/or uncontrolled drifts in the sample/stage.
Fig. 5 Stability test of Recurrent-MZ inference.
a Additive Gaussian noise with zero mean and a standard deviation of σ was injected into each DPM to test the stability of Recurrent-MZ inference. The output images and difference maps (with respect to ground truth) with no injected noise (σ = 0) and with different levels of noise injection are shown. b The NRMSE-σ boxplots for Recurrent-MZ output images at z = 4.6 μm and z = 6.8 μm are reported. NRMSE values were calculated over 50 random tests. The difference maps were normalized by the maximum difference between the input images and the ground truth.
Next, we focused on post hoc interpretation43, 44 of the Recurrent-MZ framework, without any modifications to its design or the training process. For this, we examined whether the Recurrent-MZ framework exhibits permutation invariance, i.e.,
$$ V_M\left( I_1, I_2, \cdots, I_M \right) = V_M\left( I_{i_1}, I_{i_2}, \cdots, I_{i_M} \right), \quad \forall \left( i_1, i_2, \cdots, i_M \right) \in S_M $$ where $S_M$ is the group of all permutations of the M input indices. To explore the permutation invariance of Recurrent-MZ (see Fig. 6), the test set's input images were randomly permuted and fed into the Recurrent-MZ (M = 3), which was solely trained with input images sorted by z. We then quantified Recurrent-MZ outputs over all the 6 permutations of the M = 3 input images, using the average RMSE ($\mu_{RMSE}$) and the standard deviation of the RMSE ($\sigma_{RMSE}$), calculated with respect to the ground truth image I:
$$ \begin{array}{l} \mu_{RMSE} = \frac{1}{6} \sum\limits_{(i_1, i_2, i_3) \in S_3} \mathrm{RMSE}\left( V_{\mathrm{iii}}\left( I_{i_1}, I_{i_2}, I_{i_3} \right), I \right) \\ \sigma_{RMSE} = \sqrt{ \frac{1}{6} \sum\limits_{(i_1, i_2, i_3) \in S_3} \left( \mathrm{RMSE}\left( V_{\mathrm{iii}}\left( I_{i_1}, I_{i_2}, I_{i_3} \right), I \right) - \mu_{RMSE} \right)^2 } \end{array} $$
Fig. 6 Permutation invariance of Recurrent-MZ to the input images.
Recurrent-MZ was trained with inputs (M = 3) sorted by z and tested on new samples with both inputs sorted by z as well as the 6 permutations of the same inputs to test its permutation invariance. a The input images sorted by z, and the RMSE values between the ground truth image and the corresponding nearest input image are shown. b The Recurrent-MZ outputs of the input sequence (I1, I2, I3), c the test outputs with input sequence (I2, I1, I3), the corresponding difference maps and the pixel-wise standard deviation over all the 6 permutations, d the ground truth images obtained by mechanical scanning through the same sample, acquired with an axial spacing of 0.2 μm, e red solid line: the average RMSE of the outputs of randomly permuted input images; pink shadow: the standard deviation of the RMSE of the outputs of randomly permuted input images; blue solid line: the RMSE of the output of input images sorted by z; gray solid line: the RMSE value of the nearest interpolation using the input images, calculated with respect to the ground truth images. Gray dashed lines (vertical) indicate the axial positions of input images. RMSE and RMS values were calculated based on the yellow highlighted ROIs. The range of grayscale images is 255, while that of the standard deviation images is 31.
In the expressions for μRMSE and σRMSE above, RMSE(I, J) gives the RMSE between images I and J. In Fig. 6e, the red line indicates the average RMSE over 6 permutations and the pink shaded region indicates the standard deviation of RMSE over these 6 permutations. RMSE and RMS values were calculated based on the yellow highlighted regions of interest (ROIs) in Fig. 6. Compared with the blue line in Fig. 6e, which corresponds to the output of the Recurrent-MZ with the inputs sorted by z, the input image permutation results highlight the success of Recurrent-MZ with different input image sequences, despite being trained solely with depth-sorted inputs.
In contrast, non-recurrent CNN architectures, such as 3D U-Net45, inevitably lead to input permutation instability as they require a fixed length and sorted input sequences; this failure of non-recurrent CNN architectures is illustrated in Fig. S5.
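The permutation statistics μRMSE and σRMSE defined above can be sketched as follows. The `toy_recurrent_model` below is a hypothetical, order-dependent stand-in for a trained network (a running blend of flattened images); it is not Recurrent-MZ, and only serves to make the sketch runnable.

```python
import itertools
import math

def rmse(a, b):
    """RMSE between two flattened images of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def toy_recurrent_model(images, weight=0.5):
    """Hypothetical order-dependent model: blends each new image into a running state."""
    state = images[0]
    for image in images[1:]:
        state = [(1 - weight) * s + weight * x for s, x in zip(state, image)]
    return state

def permutation_rmse_stats(model, inputs, truth):
    """Mean and population standard deviation of the RMSE over all input orderings."""
    errors = [rmse(model(list(perm)), truth)
              for perm in itertools.permutations(inputs)]
    mu = sum(errors) / len(errors)
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / len(errors))
    return mu, sigma

# Three toy "images" of two pixels each give 3! = 6 orderings.
inputs = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0]]
truth = [1.0, 3.0]
mu, sigma = permutation_rmse_stats(toy_recurrent_model, inputs, truth)
```

A perfectly permutation-invariant model would give σRMSE = 0, so a small σRMSE relative to μRMSE indicates that the input ordering has little effect on the output.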
We also explored different training schemes to further improve the permutation invariance of Recurrent-MZ, including training with input images sorted in descending order by the relative distance (dz) to the output plane as well as randomly sorted input images. As shown in Fig. S6, the Recurrent-MZ trained with input images that are sorted by depth, z, achieves the best inference performance, indicated by an NRMSE of 4.03, whereas incorporating randomly ordered inputs in the training phase results in the best generalization for different input image permutations. The analyses reported in Fig. S6 further highlight the impact of different training schemes on the inference quality and the permutation invariance feature of the resulting trained Recurrent-MZ network.
-
Next, we examined whether the Recurrent-MZ framework exhibits repetition invariance. Figure 7 demonstrates the repetition invariance of Recurrent-MZ when it was repeatedly fed with input image I1. The output images of Recurrent-MZ in Fig. 7b show its consistency for 2, 4 and 6 repetitions of I1, i.e., Vⅱ(I1, I1), Vⅱ(I1, I1, I1, I1) and Vⅱ(I1, I1, I1, I1, I1, I1), which resulted in an RMSE of 12.30, 11.26, and 11.73, respectively. Although Recurrent-MZ was never trained with repeated input images, its recurrent scheme still demonstrates the correct propagation under repeated inputs of the same 2D plane. When compared with the output of Deep-Z (i.e., Deep-Z(I1)) shown in Fig. 7c, Recurrent-MZ, with a single input image or its repetitions, exhibits comparable reconstruction quality. Figure S7 also presents a similar comparison when M = 3, further supporting the same conclusion.
Fig. 7 Repetition invariance of Recurrent-MZ.
Recurrent-MZ was trained with inputs (M = 2) sorted by their relative distances (dz) to the output plane, but tested on a new sample by repeatedly feeding the input image (I1) to test its repetition invariance. a The input images and the ground truth image obtained by mechanical scanning (with an axial spacing of 0.2 μm), b the Recurrent-MZ outputs and the corresponding difference maps of repeated I1, i.e., Vⅱ(I1, I1), Vⅱ(I1, I1, I1, I1) and Vⅱ(I1, I1, I1, I1, I1, I1) as well as Vⅱ(I1, I2) and Vⅱ(I2, I1), c the outputs and corresponding difference maps of Deep-Z with a single input image (I1 or I2), and the pixel-wise average of Deep-Z(I1) and Deep-Z(I2). All RMSE values are calculated based on the region of interest (ROI) marked by the yellow box. The range of grayscale images is 255 while that of the standard deviation images is 31.
While for a single input image (I1 or its repeats) the blind inference performance of Recurrent-MZ is on par with Deep-Z(I1), the incorporation of multiple input planes gives Recurrent-MZ a superior performance over Deep-Z. As shown in the last two columns of Fig. 7b, by adding another depth image, I2, the output of Recurrent-MZ is significantly improved, where the RMSE decreased to 8.78; this represents a better inference performance compared to Deep-Z(I1) and Deep-Z(I2) as well as the average of these two Deep-Z outputs (see Fig. 7b, c). The same conclusion is further supported in Fig. S7b, c for M = 3, demonstrating that Recurrent-MZ is able to outperform Deep-Z even if all of its M input images are individually processed by Deep-Z and averaged, showing the superiority of the presented recurrent inference framework.
-
The presented Recurrent-MZ framework can also be applied to perform cross-modality volumetric imaging, e.g., from wide-field to confocal, where the network takes in a few wide-field 2D fluorescence images (input) to infer at its output a volumetric image stack, matching the fluorescence images of the same sample obtained by a confocal microscope; we term this cross-modality image transformation framework Recurrent-MZ+. To experimentally demonstrate this unique capability, Recurrent-MZ+ was trained using wide-field (input) and confocal (ground truth) image pairs corresponding to C. elegans samples (see the Materials and Methods section for details). Figure 8 and Video S3 report blind-testing results on new images never used in the training phase. In Fig. 8, M = 3 wide-field images captured at z = 2.8, 4.8, and 6.8 μm were fed into Recurrent-MZ+ as input images and were virtually propagated onto axial planes from 0 to 9 μm with 0.2 μm spacing; the resulting Recurrent-MZ+ output images provided a very good match to the corresponding confocal 3D image stack obtained by mechanical scanning (also see Video S3). Figure 8b further illustrates the maximum intensity projection (MIP) side views (x-z and y-z), showing the high fidelity of the reconstructed image stack by Recurrent-MZ+ with respect to the mechanical confocal scans. In contrast to the wide-field image stack of the same sample (with 46 image scans), where only a few neurons can be recognized in the MIP views with deformed shapes, the reconstructed image stack by Recurrent-MZ+ shows substantially sharper MIP views using only M = 3 input images, and also mitigates the neuron deformation caused by the elongated wide-field PSF, providing a comparable image quality with respect to the confocal microscopy image stack (Fig. 8b).
Fig. 8 Wide-field to confocal: cross-modality volumetric imaging using Recurrent-MZ+.
a Recurrent-MZ+ takes in M = 3 wide-field input images along with the corresponding DPMs, and rapidly outputs an image at the designated/desired axial plane, matching the corresponding confocal scan of the same sample plane. b Maximum intensity projection (MIP) side views (x-z and y-z) of the wide-field (46 image scans), Recurrent-MZ+ (M = 3) and the confocal ground truth image stack. Each scale bar is 2 μm. Horizontal arrows in (b) mark the axial planes of I1, I2 and I3. Also see Video S3
Recurrent-MZ based volumetric imaging of C. elegans samples
Recurrent-MZ based volumetric imaging of fluorescence nanobeads
Generalization of Recurrent-MZ to non-uniformly sampled input images
Inference stability of Recurrent-MZ
Permutation invariance of Recurrent-MZ
Repetition invariance of Recurrent-MZ
Demonstration of cross-modality volumetric imaging: wide-field to confocal
-
The C. elegans samples were first cultured and stained with GFP using the strain AML18. AML18 carries the genotype wtfIs3 [rab-3p::NLS::GFP + rab-3p::NLS::tagRFP] and expresses GFP and tagRFP in the nuclei of all the neurons. C. elegans samples were cultured on nematode growth medium seeded with OP50 E. coli bacteria using standard conditions. During the imaging process, the samples were washed off the plates with M9 solution and anesthetized with 3 mM levamisole, and then mounted on slides seeded with 3% agarose.
The wide-field and confocal microscopy images of C. elegans were captured by an inverted scanning microscope (TCS SP8, Leica Microsystems), using a 63×/1.4NA objective lens (HC PL APO 63×/1.4NA oil CS2, Leica Microsystems) and a FITC filter set (excitation/emission wavelengths: 495 nm/519 nm), resulting in a DOF about 0.4 μm. A monochrome scientific CMOS camera (Leica DFC9000GTC-VSC08298) was used for wide-field imaging where each image has 1024 × 1024 pixels and 12-bit dynamic range; a photo-multiplier tube (PMT) recorded the confocal image stacks. For each FOV, 91 images with 0.2 μm axial spacing were recorded, where the starting position of the axial scan (z = 0 μm) was set on the boundary of each worm. A total of 100 FOVs were captured and exclusively divided into training, validation and testing datasets at the ratio of 41:8:1, respectively, where the testing dataset was strictly captured on distinct worms that were not used in training dataset.
The nanobead image dataset consists of wide-field microscopic images that were captured using 50 nm fluorescence beads with a Texas Red filter set (excitation/emission wavelengths: 589 nm/615 nm). The wide-field microscopy system consists of an inverted scanning microscope (TCS SP8, Leica Microsystems) and a 63×/1.4NA objective lens (HC PL APO 63×/1.4NA oil CS2, Leica Microsystems). The nanobeads were purchased from MagSphere (PSF-050NM RED), and ultrasonicated before dilution into the heated agar solution. ~1 mL diluted bead-agar solution was further mixed to break down the bead clusters and then a 2.5 µL droplet was pipetted onto a cover slip, spread and dried for imaging. Axial scanning was implemented and the system started to record images (z = 0 μm) when a sufficient number of nanobeads could be seen in the FOV. Each volume contains 101 images with 0.1 μm axial spacing. A subset of 400, 86 and 16 volumes were exclusively divided as training, validation and testing datasets.
Each captured image volume was first axially aligned using the ImageJ plugin 'StackReg'50 for correcting the lateral stage shift and stage rotation. Secondly, an image with extended depth of field (EDF) was generated for each volume, using the ImageJ plugin 'Extended Depth of Field'51. The EDF image was later used as a reference for the following image processing steps: (1) apply triangle thresholding to the EDF image to separate the background and foreground contents38, (2) take the mean intensity of the background pixels as the shift factor, and the 99th percentile of the foreground pixels as the scale factor, (3) normalize the volume by the shift and scale factors. For Recurrent-MZ+, confocal image stacks were registered to their wide-field counterparts using the same feature-based registration method reported earlier38. Thirdly, training FOVs were cropped into small regions of 256 × 256 pixels without any overlap. Finally, the data loader randomly selects M images from the volume with an axial spacing of Δz = 6 μm (C. elegans) and Δz = 3 μm (nanobeads) in both the training and testing phases.
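The shift-and-scale normalization in steps (2) and (3) above can be sketched as follows. This is a minimal sketch under stated assumptions: the foreground/background threshold is passed in rather than computed by triangle thresholding, the percentile uses the nearest-rank convention, and the normalization is taken to be (value − shift) / scale; none of these details are confirmed by the text, and `normalize_volume` is a hypothetical name.

```python
import math

def normalize_volume(volume, edf, threshold):
    """Normalize a volume using background/foreground statistics of its EDF image.

    volume: list of 2D images; edf: 2D reference image;
    threshold: foreground/background cut (assumed to be given).
    """
    pixels = [v for row in edf for v in row]
    background = [v for v in pixels if v <= threshold]
    foreground = sorted(v for v in pixels if v > threshold)
    shift = sum(background) / len(background)  # mean background intensity
    # Nearest-rank 99th percentile of the foreground (assumed convention).
    rank = max(math.ceil(0.99 * len(foreground)) - 1, 0)
    scale = foreground[rank]
    return [[[(v - shift) / scale for v in row] for row in image]
            for image in volume]

# Toy example: a 2 x 2 EDF image and a one-plane volume.
edf = [[10, 10], [110, 210]]
volume = [[[10, 220]]]
print(normalize_volume(volume, edf, threshold=50))
```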
-
Recurrent-MZ is based on a convolutional recurrent network52 design, which combines the advantages of both convolutional neural networks39 and recurrent neural networks in processing sequential inputs53, 54. A common design of the network is formed by an encoder-decoder structure55, 56, with the convolutional recurrent units applying to the latent domain40, 57-59. Furthermore, inspired by the success of exploiting multiscale features in image translation tasks60-62, a sequence of cascaded encoder-decoder pairs is utilized to exploit and incorporate image features at different scales from different axial positions.
As shown in Fig. 1b, the output of the previous encoder block, $x_{k-1}$, is pooled and then fed into the k-th block, which can be expressed as
$$ x_k = \mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{Conv}_{k,2}\left( \mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{Conv}_{k,1}\left( \mathrm{P}(x_{k-1}) \right) \right) \right) \right) \right) \right) $$ (1) where P(·) is the 2 × 2 max-pooling operation, BN(·) is batch normalization, ReLU(·) is the rectified linear unit activation function and $\mathrm{Conv}_{k,i}(\cdot)$ stands for the i-th convolution layer in the k-th encoder block. The convolution layers in all convolution blocks have a kernel size of 3 × 3, with a stride of 1, and the numbers of channels for $\mathrm{Conv}_{k,1}$ and $\mathrm{Conv}_{k,2}$ are $20 \cdot 2^{k-2}$ and $20 \cdot 2^{k-1}$, respectively. Then, $x_k$ is sent to the recurrent block, where features from the sequential input images are recurrently integrated:
$$ s_k = x_k + \mathrm{Conv}_{k,3}\left( \mathrm{RConv}_k\left( x_k \right) \right) $$ (2) where $\mathrm{RConv}_k(\cdot)$ is the convolutional recurrent layer with kernels of 3 × 3 and a stride of 1, and $\mathrm{Conv}_{k,3}(\cdot)$ is a 1 × 1 convolution layer. Finally, at the decoder part, $s_k$ is concatenated with the up-sampled output from the previous decoder convolution block, and fed into the k-th decoder block, so the output of the k-th decoder block can be expressed as
$$ y_k = \mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{Conv}_{k,5}\left( \mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{Conv}_{k,4}\left( \mathrm{I}(y_{k-1}) \oplus s_k \right) \right) \right) \right) \right) \right) $$ (3) where ⊕ is the concatenation operation, I(·) is the 2 × 2 up-sampling operation using nearest interpolation and $\mathrm{Conv}_{k,i}(\cdot)$ are the convolution layers of the k-th decoder block.
In this work, the gated recurrent unit (GRU)63 is used as the recurrent unit, i.e., the RConv(·) layer in Eq. (2) updates $h_t$, given the input $x_t$, through the following three steps:
$$ f_t = \sigma\left( W_f \ast x_t + U_f \ast h_{t-1} + b_f \right) $$ (4) $$ \widehat{h_t} = \tanh\left( W_h \ast x_t + U_h \ast \left( f_t \odot h_{t-1} \right) + b_h \right) $$ (5) $$ h_t = \left( 1 - f_t \right) \odot h_{t-1} + f_t \odot \widehat{h_t} $$ (6) where $f_t$, $h_t$ are the forget and output vectors at time step t, respectively, $W_f$, $W_h$, $U_f$, $U_h$ are the corresponding convolution kernels, $b_f$, $b_h$ are the corresponding biases, σ is the sigmoid activation function, * is the 2D convolution operation, and ⊙ is the element-wise multiplication. Compared with the long short-term memory (LSTM) network64, the GRU entails fewer parameters but is able to achieve similar performance.
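To make the update rule of Eqs. (4)-(6) concrete, the sketch below applies it to a single scalar feature. This is a deliberate simplification: the convolution kernels are replaced by scalar weights (equivalent to 1 × 1 kernels on a single channel), and the parameter dictionary `p` and the function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x_t, h_prev, p):
    """One update following Eqs. (4)-(6), with scalar weights instead of kernels."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])              # Eq. (4)
    h_hat = math.tanh(p["W_h"] * x_t + p["U_h"] * (f_t * h_prev) + p["b_h"])  # Eq. (5)
    return (1.0 - f_t) * h_prev + f_t * h_hat                                 # Eq. (6)

def run_sequence(xs, p, h0=0.0):
    """Feed a sequence through the unit; the same weights serve any sequence length."""
    h = h0
    for x in xs:
        h = gru_step(x, h, p)
    return h

# Arbitrary illustrative weights (not trained values).
params = {"W_f": 0.8, "U_f": -0.3, "b_f": 0.1, "W_h": 1.2, "U_h": 0.5, "b_h": 0.0}
print(run_sequence([0.2, 0.7], params), run_sequence([0.2, 0.7, 0.4], params))
```

Because the same parameters are reused at every step, the number of inputs can change at test time without changing the number of parameters, which is consistent with the statement in the Results that all Recurrent-MZ networks have the same parameter count regardless of M.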
The discriminator (D) is a CNN consisting of five convolutional blocks and two dense layers. The k-th convolutional block has two convolutional layers with $20 \cdot 2^k$ channels. A global average pooling layer compacts each channel before the dense layers. The first dense layer has 20 hidden units with ReLU activation function and the second dense layer uses a sigmoid activation function. The GAN structure and other details of both the generator and discriminator networks are reported in Fig. S8.
-
Recurrent-MZ was written and implemented using TensorFlow 2.0. In both training and testing phases, a DPM is automatically concatenated with the input image by the data loader, indicating the relative axial position of the input plane to the desired output plane, i.e., the input in the training phase has dimensions of M × 256 × 256 × 2. By varying the DPMs, Recurrent-MZ learns to digitally propagate inputs to any designated plane, thus forming an output volume with dimensions of |Z| × 256 × 256.
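The way the DPMs select the output plane can be sketched with nested lists. This is a minimal sketch, not the authors' data loader: the sign convention of the DPM (output position minus input position) is an assumption, and `build_input`, `infer_volume` and the `model` callable are hypothetical names.

```python
def build_input(images, input_zs, z_out):
    """Pair each image with its DPM, giving a nested list of shape M x H x W x 2."""
    sample = []
    for image, z_in in zip(images, input_zs):
        dz = z_out - z_in  # relative axial distance; the sign convention is assumed
        sample.append([[[pixel, dz] for pixel in row] for row in image])
    return sample

def infer_volume(model, images, input_zs, output_zs):
    """Query a (hypothetical) model once per output plane, varying only the DPMs."""
    return [model(build_input(images, input_zs, z)) for z in output_zs]

# Two toy 2 x 3 input images acquired at z = 3 um and z = 9 um.
images = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
sample = build_input(images, [3.0, 9.0], z_out=5.0)
print(len(sample), len(sample[0]), len(sample[0][0]), len(sample[0][0][0]))
```

The input images themselves never change between queries; only the second channel does, which is how a fixed set of M scans can be propagated to all |Z| planes.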
The training loss of Recurrent-MZ is composed of three parts: (ⅰ) pixel-wise BerHu loss65, 66, (ⅱ) multiscale structural similarity index (MSSSIM)67, and (ⅲ) the adversarial loss using the generative adversarial network (GAN)68 structure. Based on these, the total loss of Recurrent-MZ, i.e., LV, is expressed as
$$ L_V = \alpha\, \mathrm{BerHu}\left( \hat y, y \right) + \beta\, \mathrm{MSSSIM}\left( \hat y, y \right) + \gamma \left[ D\left( \hat y \right) - 1 \right]^2 $$ (7) where $\hat y$ is the output image of Recurrent-MZ, and y is the ground truth image for a given axial plane. α, β, γ are hyperparameters, which were set to 3, 1 and 0.5, respectively. The MSSSIM and BerHu losses are expressed as:
$$ \mathrm{MSSSIM}(x, y) = \left[ \frac{2\mu_{x_M}\mu_{y_M} + C_1}{\mu_{x_M}^2 + \mu_{y_M}^2 + C_1} \right]^{\alpha_M} \times \prod_{j=1}^{M} \left[ \frac{2\sigma_{x_j}\sigma_{y_j} + C_2}{\sigma_{x_j}^2 + \sigma_{y_j}^2 + C_2} \right]^{\beta_j} \left[ \frac{\sigma_{x_j y_j}^2 + C_3}{\sigma_{x_j}\sigma_{y_j} + C_3} \right]^{\gamma_j} $$ (8)
$$ \mathrm{BerHu}(x, y) = \sum_{\substack{m, n \\ |x(m, n) - y(m, n)| \le c}} \left| x(m, n) - y(m, n) \right| + \sum_{\substack{m, n \\ |x(m, n) - y(m, n)| > c}} \frac{\left[ x(m, n) - y(m, n) \right]^2 + c^2}{2c} $$ (9)
where $x_j$, $y_j$ are the $2^{j-1}\times$ down-sampled images of x, y, respectively, $\mu_x$, $\sigma_x^2$ denote the mean and variance of x, respectively, and $\sigma_{xy}^2$ denotes the covariance between x and y. x(m, n) is the intensity value at pixel (m, n) of image x. $\alpha_M$, $\beta_j$, $\gamma_j$, $C_i$ are empirical constants67 and c is a constant set as 0.1. BerHu and MSSSIM losses provide a structural loss term, in addition to the adversarial loss, focusing on the high-level image features. The combination of SSIM or MSSSIM evaluating regional or global similarity, and a pixel-wise loss term with respect to the ground truth (such as L1, L2, Huber and BerHu) has been shown to improve network performance in image translation and restoration tasks69.
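The BerHu loss of Eq. (9) switches from an absolute-error term to a scaled squared-error term once the residual exceeds c. A minimal sketch on nested-list images, with c = 0.1 as stated above (the function name `berhu_loss` is a hypothetical choice):

```python
def berhu_loss(pred, truth, c=0.1):
    """BerHu loss of Eq. (9): L1 for residuals up to c, scaled L2 for larger ones."""
    total = 0.0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            d = abs(p - t)
            total += d if d <= c else (d * d + c * c) / (2.0 * c)
    return total

# One small residual (0.05) and one large residual (0.5).
print(berhu_loss([[0.05, 0.5]], [[0.0, 0.0]]))
```

At a residual of exactly c the two branches agree, since (c² + c²)/(2c) = c, so the loss is continuous across the switch point.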
The loss for the discriminator LD is defined as:
$$ L_D = \frac{1}{2} D\left( \hat y \right)^2 + \frac{1}{2}\left[ D\left( y \right) - 1 \right]^2 $$ (10) where D is the discriminator of the GAN framework. An Adam optimizer70 with an initial learning rate of $10^{-5}$ was employed for stochastic optimization.
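The two least-squares objectives of Eqs. (7) and (10) can be written out for scalar discriminator scores as follows. This is a minimal sketch: the BerHu and MSSSIM terms are passed in as precomputed numbers exactly as they appear in Eq. (7), and the function names are hypothetical.

```python
def discriminator_loss(d_fake, d_real):
    """Eq. (10): push D(network output) toward 0 and D(ground truth) toward 1."""
    return 0.5 * d_fake ** 2 + 0.5 * (d_real - 1.0) ** 2

def generator_loss(berhu, msssim_term, d_fake, alpha=3.0, beta=1.0, gamma=0.5):
    """Eq. (7): weighted sum of the BerHu, MSSSIM and adversarial terms."""
    return alpha * berhu + beta * msssim_term + gamma * (d_fake - 1.0) ** 2

# A discriminator that is fooled (score 1 on the output) removes the adversarial penalty.
print(discriminator_loss(d_fake=0.0, d_real=1.0))                  # 0.0
print(generator_loss(berhu=0.0, msssim_term=0.0, d_fake=0.0))      # 0.5
```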
The training time on a PC with Intel Xeon W-2195 CPU, 256 GB RAM and one single NVIDIA RTX 2080 Ti graphics card is about 3 days. After optimization for mixed precision and parallel computation, the image reconstruction using Recurrent-MZ (M = 3) takes ~0.15 s for an output image of 1024 × 1024 pixels, and ~3.42 s for a volume of 101 × 1024 × 1024 pixels.
-
The Deep-Z network, used for comparison purposes, is identical to that in ref. 38, and was trained and tested on the same dataset as Recurrent-MZ using the same machine. The loss function, optimizer and hyperparameter settings were also identical to ref. 38. Due to the single-scan propagation of Deep-Z, the training range is $\frac{1}{M}$ of that of Recurrent-MZ, depending on the value of M used in the comparison. The reconstructed volumes over a large axial range, as presented in the manuscript, were axially stacked using M non-overlapping volumes, which were propagated from different input scans and each covered $\frac{1}{M}$ of the total axial range. The Deep-Z reconstruction time for a 1024 × 1024 output image on the same machine as Recurrent-MZ is ~0.12 s.
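The stacking of non-overlapping Deep-Z sub-volumes can be sketched by assigning every output plane to its nearest input scan. This is a minimal sketch: resolving ties toward the lower-index scan is an assumption, and the function names are hypothetical.

```python
def nearest_input_index(z_out, input_zs):
    """Index of the input scan closest to the requested output plane."""
    return min(range(len(input_zs)), key=lambda i: abs(input_zs[i] - z_out))

def assign_planes(output_zs, input_zs):
    """Group output planes into non-overlapping sub-volumes, one per input scan."""
    groups = {i: [] for i in range(len(input_zs))}
    for z in output_zs:
        groups[nearest_input_index(z, input_zs)].append(z)
    return groups

# Input scans at z = 3, 6 and 9 um, output planes on a coarse 1 um grid.
print(assign_planes(list(range(0, 11)), [3, 6, 9]))
```

Each sub-volume is then propagated from a single scan, and the sub-volumes are concatenated along z to form the full stack used for comparison.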
-
Each input sequence of M × 256 × 256 × 2 (the second channel is the DPM) was reshaped as a tensor of 256 × 256 × (2M) and fed into the 3D U-Net45. When permuting the M input scans, the DPMs always follow the corresponding images/scans. The number of channels at the last convolutional layer of each down-sampling block is $60 \cdot 2^k$ and the convolutional kernel is 3 × 3 × 3. The network structure is the same as reported in ref. 45. The other training settings, such as the loss function and optimizer, are similar to those of Recurrent-MZ. The reconstruction time (M = 3) for an output image of 1024 × 1024 pixels on the same machine (Intel Xeon W-2195 CPU, 256 GB RAM and one single NVIDIA RTX 2080 Ti graphics card) is ~0.2 s.
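The reshaping of an M × H × W × 2 input sequence into an H × W × (2M) tensor can be sketched with nested lists. The channel ordering used here (image 1, DPM 1, image 2, DPM 2, ...) is an assumption, as the text does not specify it, and `to_channels` is a hypothetical name.

```python
def to_channels(sample):
    """Rearrange an M x H x W x 2 nested list into H x W x (2M) along the channel axis."""
    m = len(sample)
    height, width = len(sample[0]), len(sample[0][0])
    return [[[sample[i][r][c][ch] for i in range(m) for ch in range(2)]
             for c in range(width)]
            for r in range(height)]

# M = 2 scans of size 1 x 2; each pixel holds [intensity, DPM value].
sample = [[[[1, 10], [2, 10]]], [[[3, 20], [4, 20]]]]
print(to_channels(sample))  # [[[1, 10, 3, 20], [2, 10, 4, 20]]]
```

In this layout the position of each scan is fixed in the channel axis, so permuting the M scans changes which channels the network sees them in, and the input length cannot vary without changing the first layer.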