Optical Flow | RAFT: Recurrent All-Pairs Field Transforms | ECCV 2020

  • This article is reposted from the WeChat public account "Machine Learning Alchemy"
  • Author: Alchemy Brother (reposted with authorization)
  • Author contact: WeChat cyx645016617 (discussion is welcome)
  • Paper: "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow"
  • The official code is public: github.com/princeton-v...

"Introduction": This article combines the official code to give an explanation of the model structure.

Model structure

  • The model includes a Feature Encoder and a Context Encoder;
  • These are followed by the recurrent update structure:
```python
class RAFT(nn.Module):
    def __init__(self, args):
        super(RAFT, self).__init__()
        self.args = args

        if args.small:
            self.hidden_dim = hdim = 96
            self.context_dim = cdim = 64
            args.corr_levels = 4
            args.corr_radius = 3
        else:
            self.hidden_dim = hdim = 128
            self.context_dim = cdim = 128
            args.corr_levels = 4
            args.corr_radius = 4

        if 'dropout' not in self.args:
            self.args.dropout = 0

        if 'alternate_corr' not in self.args:
            self.args.alternate_corr = False

        # feature network, context network, and update block
        if args.small:
            self.fnet = SmallEncoder(output_dim=128, norm_fn='instance', dropout=args.dropout)
            self.cnet = SmallEncoder(output_dim=hdim+cdim, norm_fn='none', dropout=args.dropout)
            self.update_block = SmallUpdateBlock(self.args, hidden_dim=hdim)
        else:
            self.fnet = BasicEncoder(output_dim=256, norm_fn='instance', dropout=args.dropout)
            self.cnet = BasicEncoder(output_dim=hdim+cdim, norm_fn='batch', dropout=args.dropout)
            self.update_block = BasicUpdateBlock(self.args, hidden_dim=hdim)

    def freeze_bn(self):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()

    def initialize_flow(self, img):
        """ Flow is represented as difference between two coordinate grids: flow = coords1 - coords0 """
        N, C, H, W = img.shape
        coords0 = coords_grid(N, H//8, W//8).to(img.device)
        coords1 = coords_grid(N, H//8, W//8).to(img.device)

        # optical flow computed as difference: flow = coords1 - coords0
        return coords0, coords1

    def upsample_flow(self, flow, mask):
        """ Upsample flow field [H/8, W/8, 2] -> [H, W, 2] using convex combination """
        N, _, H, W = flow.shape
        mask = mask.view(N, 1, 9, 8, 8, H, W)
        mask = torch.softmax(mask, dim=2)

        up_flow = F.unfold(8 * flow, [3, 3], padding=1)
        up_flow = up_flow.view(N, 2, 9, 1, 1, H, W)

        up_flow = torch.sum(mask * up_flow, dim=2)
        up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)
        return up_flow.reshape(N, 2, 8*H, 8*W)

    def forward(self, image1, image2, iters=12, flow_init=None, upsample=True, test_mode=False):
        """ Estimate optical flow between pair of frames """
        image1 = 2 * (image1 / 255.0) - 1.0
        image2 = 2 * (image2 / 255.0) - 1.0

        image1 = image1.contiguous()
        image2 = image2.contiguous()

        hdim = self.hidden_dim
        cdim = self.context_dim

        # run the feature network
        with autocast(enabled=self.args.mixed_precision):
            fmap1, fmap2 = self.fnet([image1, image2])

        fmap1 = fmap1.float()
        fmap2 = fmap2.float()
        if self.args.alternate_corr:
            corr_fn = AlternateCorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
        else:
            corr_fn = CorrBlock(fmap1, fmap2, radius=self.args.corr_radius)

        # run the context network
        with autocast(enabled=self.args.mixed_precision):
            cnet = self.cnet(image1)
            net, inp = torch.split(cnet, [hdim, cdim], dim=1)
            net = torch.tanh(net)
            inp = torch.relu(inp)

        coords0, coords1 = self.initialize_flow(image1)

        if flow_init is not None:
            coords1 = coords1 + flow_init

        flow_predictions = []
        for itr in range(iters):
            coords1 = coords1.detach()
            corr = corr_fn(coords1)  # index correlation volume

            flow = coords1 - coords0
            with autocast(enabled=self.args.mixed_precision):
                net, up_mask, delta_flow = self.update_block(net, inp, corr, flow)

            # F(t+1) = F(t) + Delta(t)
            coords1 = coords1 + delta_flow

            # upsample predictions
            if up_mask is None:
                # upflow8 upsamples the flow 8x with bilinear interpolation
                flow_up = upflow8(coords1 - coords0)
            else:
                flow_up = self.upsample_flow(coords1 - coords0, up_mask)

            flow_predictions.append(flow_up)

        if test_mode:
            return coords1 - coords0, flow_up

        return flow_predictions
```
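Before walking through the forward pass step by step, here is a minimal usage sketch of my own (not from the repo): it assumes the RAFT class above and its dependencies from the official repository are importable, and the argparse fields below are exactly the ones the constructor and forward() read; the tensor sizes and iteration count are just an illustration.

```python
import argparse
import torch

# hypothetical args object; the field names mirror what RAFT.__init__/forward read
args = argparse.Namespace(small=False, mixed_precision=False,
                          dropout=0.0, alternate_corr=False)
model = RAFT(args)
model.eval()

# two dummy frames; H and W should be multiples of 8 for the 1/8-resolution grid
image1 = torch.randint(0, 255, (1, 3, 368, 496)).float()
image2 = torch.randint(0, 255, (1, 3, 368, 496)).float()

with torch.no_grad():
    flow_low, flow_up = model(image1, image2, iters=12, test_mode=True)

print(flow_low.shape)  # (1, 2, 46, 62)   -> flow at 1/8 resolution
print(flow_up.shape)   # (1, 2, 368, 496) -> convex-upsampled flow
```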

You can see that the model runs through the following steps:

  • Two images, image1 and image2, are given as input;
  • Both images are passed through fnet together to obtain their feature maps fmap1 and fmap2;
  • fmap1 and fmap2 are fed into CorrBlock, which builds the correlation volume corr between the two; the memory-efficient AlternateCorrBlock variant relies on a CUDA extension, for which the author provides the corresponding .cpp and .cu files (a simplified sketch of the all-pairs correlation follows this list);
  • image1 is passed through the context network (cnet) to obtain the two features net and inp; the structure of cnet is essentially the same as that of fnet;
  • The optical flow and the corresponding coordinate grids are then initialized, and the model enters the iterative part;
  • In each iteration, net, inp, corr, and flow are fed into the update block, which outputs updated net, up_mask, and delta_flow;
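The correlation volume is the core of RAFT, so it is worth pausing on. Below is a simplified sketch of my own (not the repo's CorrBlock; the function name is hypothetical) of the all-pairs correlation idea: every pixel of fmap1 is correlated with every pixel of fmap2 via a dot product of their feature vectors.

```python
import torch

def all_pairs_correlation(fmap1, fmap2):
    # fmap1, fmap2: (N, C, H, W) feature maps at 1/8 resolution
    N, C, H, W = fmap1.shape
    f1 = fmap1.view(N, C, H * W)
    f2 = fmap2.view(N, C, H * W)
    corr = torch.matmul(f1.transpose(1, 2), f2)   # (N, H*W, H*W): all-pairs dot products
    corr = corr.view(N, H, W, H, W) / C ** 0.5    # scale by sqrt(feature dim)
    return corr

# the real CorrBlock additionally pools this 4D volume into a pyramid
# (args.corr_levels levels) and, at every iteration, looks up a
# (2*radius+1)^2 window around the current flow estimate for each pixel
```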

Encoder

The author provides a BasicEncoder, and also thoughtfully provides a SmallEncoder for readers with limited GPU memory. Here we only explain the structure of BasicEncoder:

```python
class BasicEncoder(nn.Module):
    def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
        super(BasicEncoder, self).__init__()
        self.norm_fn = norm_fn

        if self.norm_fn == 'group':
            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
        elif self.norm_fn == 'batch':
            self.norm1 = nn.BatchNorm2d(64)
        elif self.norm_fn == 'instance':
            self.norm1 = nn.InstanceNorm2d(64)
        elif self.norm_fn == 'none':
            self.norm1 = nn.Sequential()

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.relu1 = nn.ReLU(inplace=True)

        self.in_planes = 64
        self.layer1 = self._make_layer(64, stride=1)
        self.layer2 = self._make_layer(96, stride=2)
        self.layer3 = self._make_layer(128, stride=2)

        # output convolution
        self.conv2 = nn.Conv2d(128, output_dim, kernel_size=1)

        self.dropout = None
        if dropout > 0:
            self.dropout = nn.Dropout2d(p=dropout)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
                if m.weight is not None:
                    nn.init.constant_(m.weight, 1)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def _make_layer(self, dim, stride=1):
        layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
        layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
        layers = (layer1, layer2)

        self.in_planes = dim
        return nn.Sequential(*layers)

    def forward(self, x):
        # the input here may be a pair of images (the two frames to be matched)
        is_list = isinstance(x, tuple) or isinstance(x, list)
        if is_list:
            batch_dim = x[0].shape[0]
            x = torch.cat(x, dim=0)

        x = self.conv1(x)   # a plain convolution expanding the channels from 3 to 64
        x = self.norm1(x)   # batch norm by default
        x = self.relu1(x)

        x = self.layer1(x)  # each layer consists of two residual blocks
        x = self.layer2(x)
        x = self.layer3(x)

        x = self.conv2(x)

        if self.training and self.dropout is not None:
            x = self.dropout(x)

        if is_list:
            x = torch.split(x, [batch_dim, batch_dim], dim=0)

        return x
```
  • The input image is downsampled three times in BasicEncoder (one stride-2 convolution plus two stride-2 residual layers), so the output is at one-eighth of the original resolution;
  • Except for the first and last layers, which are plain convolutions, everything else is built from residual blocks, and each residual block contains two convolutional layers (a hedged sketch of such a block follows this list);
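The ResidualBlock referenced by _make_layer is not shown above. The following is a hedged sketch of what such a block looks like, simplified to the 'batch' normalization case; the class name and exact details are my own illustration (the official block additionally takes a norm_fn argument and supports group/instance/no normalization).

```python
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        # two 3x3 convolutions; the first one may downsample with stride=2
        self.conv1 = nn.Conv2d(in_planes, planes, 3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1)
        self.norm1 = nn.BatchNorm2d(planes)
        self.norm2 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)

        # 1x1 projection on the skip path when the shape changes
        if stride == 1 and in_planes == planes:
            self.downsample = None
        else:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_planes, planes, 1, stride=stride),
                nn.BatchNorm2d(planes))

    def forward(self, x):
        y = self.relu(self.norm1(self.conv1(x)))
        y = self.relu(self.norm2(self.conv2(y)))
        if self.downsample is not None:
            x = self.downsample(x)
        return self.relu(x + y)
```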

updateblock

```python
class BasicMotionEncoder(nn.Module):
    def __init__(self, args):
        super(BasicMotionEncoder, self).__init__()
        cor_planes = args.corr_levels * (2*args.corr_radius + 1)**2
        self.convc1 = nn.Conv2d(cor_planes, 256, 1, padding=0)
        self.convc2 = nn.Conv2d(256, 192, 3, padding=1)
        self.convf1 = nn.Conv2d(2, 128, 7, padding=3)
        self.convf2 = nn.Conv2d(128, 64, 3, padding=1)
        self.conv = nn.Conv2d(64+192, 128-2, 3, padding=1)

    def forward(self, flow, corr):
        cor = F.relu(self.convc1(corr))
        cor = F.relu(self.convc2(cor))
        flo = F.relu(self.convf1(flow))
        flo = F.relu(self.convf2(flo))

        cor_flo = torch.cat([cor, flo], dim=1)
        out = F.relu(self.conv(cor_flo))
        return torch.cat([out, flow], dim=1)


class SepConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=192+128):
        super(SepConvGRU, self).__init__()
        self.convz1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convr1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convq1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))

        self.convz2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convr2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convq2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))

    def forward(self, h, x):
        # horizontal
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz1(hx))
        r = torch.sigmoid(self.convr1(hx))
        q = torch.tanh(self.convq1(torch.cat([r*h, x], dim=1)))
        h = (1-z) * h + z * q

        # vertical
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz2(hx))
        r = torch.sigmoid(self.convr2(hx))
        q = torch.tanh(self.convq2(torch.cat([r*h, x], dim=1)))
        h = (1-z) * h + z * q

        return h


class FlowHead(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256):
        super(FlowHead, self).__init__()
        self.conv1 = nn.Conv2d(input_dim, hidden_dim, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_dim, 2, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.conv2(self.relu(self.conv1(x)))


class BasicUpdateBlock(nn.Module):
    def __init__(self, args, hidden_dim=128, input_dim=128):
        super(BasicUpdateBlock, self).__init__()
        self.args = args
        self.encoder = BasicMotionEncoder(args)
        self.gru = SepConvGRU(hidden_dim=hidden_dim, input_dim=128+hidden_dim)
        self.flow_head = FlowHead(hidden_dim, hidden_dim=256)

        self.mask = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 64*9, 1, padding=0))

    def forward(self, net, inp, corr, flow, upsample=True):
        motion_features = self.encoder(flow, corr)
        inp = torch.cat([inp, motion_features], dim=1)

        net = self.gru(net, inp)
        delta_flow = self.flow_head(net)

        # scale mask to balance gradients
        mask = .25 * self.mask(net)
        return net, mask, delta_flow
```
  • BasicMotionEncoder: fuses the input correlation features and the current optical flow into motion features;
  • SepConvGRU: this is interesting; it is a GRU structure built out of convolutions.
    • It takes two inputs: the hidden state net and the input x (inp concatenated with the motion features);
    • The two inputs are concatenated and passed through convolutional layers to produce the update gate z and the reset gate r;
    • reset*hidden is then concatenated with x and passed through another convolution to produce the candidate hidden state q;
    • The hidden state is then updated as a combination weighted by the update gate: h = (1-z)*h + z*q;
    • In other words, the net variable in the model is in fact the hidden state of a GRU recurrent network (a small shape-check sketch follows this list);
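To make the hidden-state role of net concrete, here is a small shape-check sketch of my own, assuming the SepConvGRU class above is available; the 46×62 spatial size corresponds to a 368×496 input at 1/8 resolution and is just an illustration.

```python
import torch

# as in BasicUpdateBlock: input = inp (128 channels) + motion features (128 channels)
gru = SepConvGRU(hidden_dim=128, input_dim=128+128)

net = torch.tanh(torch.randn(1, 128, 46, 62))  # hidden state from the context encoder
x = torch.randn(1, 256, 46, 62)                # cat([inp, motion_features], dim=1)

net = gru(net, x)          # one recurrent update step
print(net.shape)           # torch.Size([1, 128, 46, 62]) -- shape is preserved
```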

summary

  • I learned about the GRU structure built from convolutions; I think this GRU structure could be combined with many other models.