TensorFlow model not deterministic, even though os.environ['TF_DETERMINISTIC_OPS'] = '1' is set

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Pretty much the MirroredStrategy Fashion MNIST tutorial example, adapted to MNIST here
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux, inside the tensorflow/tensorflow:2.2.0rc2-gpu-py3 Docker image
  • TensorFlow installed from (source or binary): binary (the official Docker image above)
  • TensorFlow version (use command below): 2.2.0rc2 (see the version snippet below)
  • Python version: Python 3, as shipped in the Docker image
  • CUDA/cuDNN version: as bundled in the Docker image
  • GPU model and memory: 1050M
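
Since the environment is fully determined by the Docker tag, the exact versions can be printed from inside the container. A minimal sketch (just the version-reporting command suggested by the issue template, plus a GPU check):

import sys
import tensorflow as tf

# Exact TensorFlow build and Python interpreter behind the Docker tag.
print(tf.version.GIT_VERSION, tf.version.VERSION)
print(sys.version)
print(f'Num GPUs Available: {len(tf.config.experimental.list_physical_devices("GPU"))}')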

Describe the current behavior

The model is not deterministic/reproducible. Two runs below; note that they already diverge at epoch 1, so the nondeterminism shows up during the very first epoch:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 2s 0us/step
Epoch 1, Loss: 0.17844311892986298, Accuracy: 0.9466999769210815,Test Loss: 0.057941436767578125, Test Accuracy: 0.9815000295639038
Epoch 2, Loss: 0.05286668613553047, Accuracy: 0.9836500287055969,Test Loss: 0.044471099972724915, Test Accuracy: 0.9853000044822693
Epoch 3, Loss: 0.03694676235318184, Accuracy: 0.9883000254631042,Test Loss: 0.034947194159030914, Test Accuracy: 0.9897000193595886
Epoch 4, Loss: 0.028592929244041443, Accuracy: 0.9910500049591064,Test Loss: 0.027234185487031937, Test Accuracy: 0.9907000064849854
Epoch 5, Loss: 0.022629836574196815, Accuracy: 0.9927666783332825,Test Loss: 0.029115190729498863, Test Accuracy: 0.9904000163078308
Epoch 6, Loss: 0.0172086451202631, Accuracy: 0.9944999814033508,Test Loss: 0.027797872200608253, Test Accuracy: 0.9902999997138977
Epoch 7, Loss: 0.013981950469315052, Accuracy: 0.9956499934196472,Test Loss: 0.02764272689819336, Test Accuracy: 0.9909999966621399
Epoch 8, Loss: 0.01210874691605568, Accuracy: 0.9961333274841309,Test Loss: 0.035009630024433136, Test Accuracy: 0.9896000027656555
Epoch 9, Loss: 0.008961305022239685, Accuracy: 0.9971666932106018,Test Loss: 0.034057389944791794, Test Accuracy: 0.9905999898910522
Epoch 10, Loss: 0.00800476036965847, Accuracy: 0.9972166419029236,Test Loss: 0.033878158777952194, Test Accuracy: 0.9900000095367432
GPU Run Time: 70.80781483650208 seconds
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 2s 0us/step
Epoch 1, Loss: 0.1761329025030136, Accuracy: 0.9478499889373779,Test Loss: 0.05224931612610817, Test Accuracy: 0.9835000038146973
Epoch 2, Loss: 0.05251472815871239, Accuracy: 0.9836666584014893,Test Loss: 0.04059470072388649, Test Accuracy: 0.9860000014305115
Epoch 3, Loss: 0.03771379590034485, Accuracy: 0.98785001039505,Test Loss: 0.03189479187130928, Test Accuracy: 0.9894000291824341
Epoch 4, Loss: 0.027971116825938225, Accuracy: 0.9912333488464355,Test Loss: 0.03176414594054222, Test Accuracy: 0.9890000224113464
Epoch 5, Loss: 0.022653400897979736, Accuracy: 0.9925000071525574,Test Loss: 0.03643624112010002, Test Accuracy: 0.9876999855041504
Epoch 6, Loss: 0.01727919466793537, Accuracy: 0.9942166805267334,Test Loss: 0.02887595444917679, Test Accuracy: 0.9901000261306763
Epoch 7, Loss: 0.01397143118083477, Accuracy: 0.9957500100135803,Test Loss: 0.03118096850812435, Test Accuracy: 0.9905999898910522
Epoch 8, Loss: 0.01202292088419199, Accuracy: 0.9961333274841309,Test Loss: 0.03164077177643776, Test Accuracy: 0.9909999966621399
Epoch 9, Loss: 0.008715414442121983, Accuracy: 0.9971333146095276,Test Loss: 0.04146642982959747, Test Accuracy: 0.9896000027656555
Epoch 10, Loss: 0.008586470037698746, Accuracy: 0.9969000220298767,Test Loss: 0.033046264201402664, Test Accuracy: 0.9902999997138977
GPU Run Time: 72.08828902244568 seconds

Describe the expected behavior

I expect the model to be reproducible, i.e. two runs with the same seed should produce the same loss, accuracy, etc.

Standalone code to reproduce the issue

#!/usr/bin/env python 
import tensorflow as tf
import numpy as np
import argparse
import time
import random
import os



def random_seed(seed):
    # NOTE: PYTHONHASHSEED only has an effect if it is set before the Python
    # interpreter starts; setting it here is too late to change hash randomization.
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)      # NumPy RNG
    random.seed(seed)         # Python stdlib RNG
    tf.random.set_seed(seed)  # TensorFlow global seed
    os.environ['TF_DETERMINISTIC_OPS'] = '1'  # request deterministic GPU op kernels

# Not yet using click due to Docker issues
parser = argparse.ArgumentParser(description='Tensorflow entry point')
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--seed', type=int, default=0)
args = parser.parse_args()

# Detect GPUs
print(f'Num GPUs Available: {len(tf.config.experimental.list_physical_devices("GPU"))}')

# Load MNIST
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Adding a dimension to the array -> new shape == (28, 28, 1), since the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Normalizing the images to [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

# Use MirroredStrategy for multi GPU support
# If the list of devices is not specified in the `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()

BUFFER_SIZE = len(train_images)
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Batch and distribute data
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE) 
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE) 
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)

# Fix seeds
random_seed(args.seed)

# Define model
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    return model

# Define loss and accuracy metrics
with strategy.scope():
    # Set reduction to `none` so we can reduce afterwards and divide by the global batch size.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)
    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)

        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

    test_loss = tf.keras.metrics.Mean(name='test_loss')

    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')


# Define model, optimizer, training- and test step
with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()

  def train_step(inputs):
    images, labels = inputs

    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = compute_loss(labels, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy.update_state(labels, predictions)

    return loss 

  def test_step(inputs):
    images, labels = inputs

    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss.update_state(t_loss)
    test_accuracy.update_state(labels, predictions)


with strategy.scope():
  # `run` replicates the provided computation and runs it with the distributed input.
  @tf.function
  def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
 
  @tf.function
  def distributed_test_step(dataset_inputs):
    return strategy.run(test_step, args=(dataset_inputs,))

  gpu_runtime = time.time()
  for epoch in range(args.epochs):
    # TRAIN LOOP
    total_loss = 0.0
    num_batches = 0
    for dist_dataset in train_dist_dataset:
      total_loss += distributed_train_step(dist_dataset)
      num_batches += 1
    train_loss = total_loss / num_batches

    # TEST LOOP
    for dist_dataset in test_dist_dataset:
      distributed_test_step(dist_dataset)

    print(f'Epoch {epoch + 1}, Loss: {train_loss}, Accuracy: {train_accuracy.result()}, '
          f'Test Loss: {test_loss.result()}, Test Accuracy: {test_accuracy.result()}')

    # Reset states
    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()

  print(f'GPU Run Time: {str(time.time() - gpu_runtime)} seconds')

Other info / logs

def random_seed(seed):
    # NOTE: PYTHONHASHSEED only has an effect if it is set before the Python
    # interpreter starts; setting it here is too late to change hash randomization.
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)      # NumPy RNG
    random.seed(seed)         # Python stdlib RNG
    tf.random.set_seed(seed)  # TensorFlow global seed
    os.environ['TF_DETERMINISTIC_OPS'] = '1'  # request deterministic GPU op kernels

I guess this should cover everything?
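
One thing I am not sure about: in the script above, random_seed(args.seed) only runs after the shuffled datasets have been built, and Dataset.shuffle() is not given an explicit seed. A minimal sketch of how the input pipeline could be pinned down instead (an assumption about a possible source of nondeterminism, not a confirmed fix):

# Seed everything *before* building the input pipeline, and give shuffle()
# an explicit seed so the shuffle order does not depend on an unseeded generator.
random_seed(args.seed)

train_dataset = (tf.data.Dataset
                 .from_tensor_slices((train_images, train_labels))
                 .shuffle(BUFFER_SIZE, seed=args.seed, reshuffle_each_iteration=True)
                 .batch(GLOBAL_BATCH_SIZE))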

The code currently runs on a SINGLE GPU, even though I'm planning to run it on several GPUs later.
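
For the multi-GPU case, the device list can also be passed to the strategy explicitly instead of relying on auto-detection. A minimal sketch (the device names are placeholders for whatever the machine exposes):

# Pin the strategy to specific GPUs rather than auto-detecting all of them.
strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1'])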

I'm now actively working on this issue ...