WENYUAN LI / HOMEPAGE, by Wenyuan (Vincent) Li (PhD), wenyuanli@fb.com

A workflow for Git and GitHub (2018-09-09)

<p>This is a <a href="http://www.matiasz.com/2015/06/30/a-workflow-for-git-and-github/">re-blog</a> from my
labmate <a href="http://www.matiasz.com/2015/06/30/a-workflow-for-git-and-github/">Nicholas J. Matiasz</a>.
It aims to help people get familiar with a Git/GitHub workflow.</p>
<p>Although version control is useful for all software development,
team-based development particularly highlights the need for tools like Git and GitHub.
Team-based development can get complicated, though, when each teammate has unique
preferences for version control tools and workflows. In this post,
I present the Git/GitHub workflow that I currently use, and I share my impressions from
using it daily. In my experience, this workflow offers a sweet spot with respect to the
time it requires and the benefit it delivers. This workflow makes me more productive,
and—when done right—it blurs the line between version control and project management.
Version control tools and workflows are essential parts of developers’ work,
and they should be included in all computer science curricula.</p>
<h2 id="gitgithub-workflow">Git/GitHub workflow</h2>
<ol>
<li>On GitHub, create an <a href="https://guides.github.com/features/issues/"><em>issue</em></a> (e.g., #17) for a specific feature, bugfix, etc.</li>
<li><code class="language-plaintext highlighter-rouge">$ git checkout master</code></li>
<li><code class="language-plaintext highlighter-rouge">$ git pull</code></li>
<li><code class="language-plaintext highlighter-rouge">$ git checkout -b &lt;branch_name&gt;</code> (e.g., 17_fix_login_form)</li>
<li>Commit code with <a href="https://chris.beams.io/posts/git-commit/"><em>descriptive commit messages</em></a>.</li>
<li><code class="language-plaintext highlighter-rouge">$ git push origin &lt;branch_name&gt;</code></li>
<li>On GitHub, create a <a href="https://help.github.com/articles/about-pull-requests/"><em>pull request</em></a>.</li>
<li>In the pull request’s description, write “Resolves #&lt;issue_number&gt;.”</li>
<li>On GitHub, <a href="https://help.github.com/articles/merging-a-pull-request/"><em>merge</em></a> the pull request.</li>
</ol>
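<p>For concreteness, steps 2 through 6 above can be run as a single shell session. The issue number, branch name, commit message, and file edits below are illustrative:</p>

```shell
# Assuming issue #17 ("fix login form") was just created on GitHub:
git checkout master
git pull                               # sync local master with origin
git checkout -b 17_fix_login_form      # branch name starts with the issue number
# ...edit files to fix the bug...
git add -A
git commit -m "Fix login form validation"
git push origin 17_fix_login_form      # then open the pull request on GitHub
```

<p>After the push, GitHub’s interface will offer to open a pull request from the new branch.</p>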
<h3 id="notes">Notes</h3>
<p>This workflow improves some psychological aspects of programming.
By creating an issue on GitHub before you start to code, you are forced to
articulate every planned task—an absolute requirement for technical work.
As one benefit of this strategy, when you create an issue, you define a
clear scope for your work. Defining a clear scope helps you to avoid straying
from your goal while you code. As I navigate through a repository,
I’m sometimes guilty of thinking, “Oh, hey—that’s broken, too.” A few minutes
later, I’m changing some CSS when I originally intended to fix a database query.
Before writing any code, you also benefit from describing your plan in words.
If you can’t explain what you’re doing in plain language, you shouldn’t start
to code.</p>
<p>To help you convey your ideas, GitHub’s interface allows you to paste images
and code snippets into each description box. I often use this feature because
a picture or code snippet will sometimes remind me of a task faster than words
alone. GitHub’s text fields can also parse <a href="https://help.github.com/categories/writing-on-github/"><em>GitHub Flavored Markdown</em></a>, a simple
syntax for styling your text (e.g., with bold or italics).</p>
<p>Note that I like to start each branch name with its corresponding issue’s number.
This way, I immediately know every branch’s goal, even if I haven’t worked on
it for a while. Similarly, writing “Resolves #&lt;issue_number&gt;” in each pull
request’s description explicitly ties the pull request to its issue. This latter
method has a convenient side effect: GitHub will <a href="https://blog.github.com/2013-05-14-closing-issues-via-pull-requests/"><em>automatically close</em></a> the
issue once you merge the pull request.</p>
<p>Following this workflow yields an elaborate history of your development activity.
Such a history removes part of the burden of having to remember where everything
is in your codebase. Here’s an example: One of my projects uses JavaScript
for zooming on an SVG element in HTML. Some time ago, I created a branch and
pull request to adjust the minimum and maximum levels of zoom allowed for this
element. If, months or even years later, I want to change these zoom levels,
I don’t have to spend time figuring out how I first did it. I can just search
my repository with the term “zoom,” and GitHub will direct me to the exact
commit that recorded the change. GitHub’s intuitive visualizations of changes
between commits will even direct me to the exact line(s) of the file that
I changed. This situation happens frequently; my workflow helps me to avoid
the frustration of solving the same problem twice.</p>
<p>Now that I’ve used this workflow for a while, it feels taboo for me to switch
to a new task on an existing branch. I prefer not to muddy my commit logs.
If I want to switch tasks, I need to create a new branch. But I can’t name
my branch until I know the issue number. And I won’t know the issue number
until I create an issue. For those (hopefully fleeting) moments of laziness
that developers know well, motivation is built right into this workflow.
To follow it is to practice hygienic version control.</p>

Tensorflow Template: A proposal for good practice using tensorflow (2018-09-07)

<p>This article serves as a proposal that people can
use to quickly prototype machine learning models in
Tensorflow. You can find the code <a href="https://github.com/Wenyuan-Vincent-Li/Tensorflow_template">here</a>.
If you find it useful, don’t forget to star
the repo so that more people can find it.</p>
<p>The principle behind this design is to isolate each
stage of machine learning modeling so that modifying
one module does not affect the others. In other words, people
can easily use the same model on their own dataset,
or use the same dataset with different models.</p>
<p>This project is more a proposal than a definitive guide.
However, I feel that it covers most of the cases I encounter
in my own machine learning coding.</p>
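<p>As a minimal sketch of this isolation principle (all names here are illustrative, not the template’s actual API), a training loop can depend on just two small interfaces, so datasets and models can be swapped independently:</p>

```python
def train(dataset_fn, model_fn, num_steps):
    """Train for num_steps; knows nothing about the concrete dataset or model.

    dataset_fn: returns an iterator of (input, target) pairs.
    model_fn:   returns an object with an update(x, y) method.
    """
    data = dataset_fn()
    model = model_fn()
    for _ in range(num_steps):
        x, y = next(data)
        model.update(x, y)
    return model
```

<p>Because the loop touches only these two interfaces, changing the dataset or the model never requires touching the training code.</p>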
<h2 id="the-overall-folder-architecture">The overall folder architecture</h2>
<p><img src="/files/Overflow.png" alt="Whole process for machine learning" /></p>
<ol>
<li>
<p><strong>Dataset:</strong> used to store and explore the dataset for your own problem.
You can convert your dataset to tfrecord here.</p>
</li>
<li>
<p><strong>Inputpipeline:</strong> used to read the data from tfrecord or other sources and parse
the data to feed into the NN. The returned values usually are: iterator, input_data,
and target (if it is for supervised learning).</p>
</li>
<li>
<p><strong>Model:</strong> used to create the NN model.</p>
</li>
<li>
<p><strong>Training:</strong> used to train the NN model.</p>
</li>
<li>
<p><strong>Testing:</strong> used to test the model and post-process the results.</p>
</li>
<li>
<p><strong>Deploy:</strong> used to freeze the model and serve it through the tensorflow API.</p>
</li>
</ol>
<p>Other folders and files are:</p>
<ul>
<li><strong>Plots:</strong> folder that stores figures.</li>
<li><strong>README file:</strong> GitHub markdown describing the repo.</li>
<li><strong>.gitignore:</strong> specifies which files git should ignore during synchronization.</li>
</ul>
<h2 id="more-details">More Details</h2>
<ol>
<li><strong>Dataset:</strong>
<ul>
<li><strong>utils.py:</strong> provides a variety of functions that can be used for
general data processing, including functions that convert image data and csv files
to tfrecord.</li>
<li><strong>utils_dataset_spec.py:</strong> stores the pre-processing functions
that are specific to the dataset.</li>
</ul>
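<p>As a flavor of what the tfrecord conversion in <strong>utils.py</strong> might look like, here is a sketch that packs one image/label pair into a <code class="language-plaintext highlighter-rouge">tf.train.Example</code>. The helper name and feature keys are illustrative, not the template’s actual API:</p>

```python
import numpy as np
import tensorflow as tf

def image_to_example(image, label):
    """Pack one (image, label) pair into a tf.train.Example proto."""
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=list(image.shape))),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Writing a whole dataset is then one loop:
#   with tf.io.TFRecordWriter("train.tfrecord") as writer:
#       for image, label in pairs:
#           writer.write(image_to_example(image, label).SerializeToString())
```

<p>Storing the shape alongside the raw bytes lets the input pipeline reshape the decoded tensor later.</p>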
</li>
<li><strong>Inputpipeline:</strong>
<ul>
<li><strong>ProstateDataSet.py:</strong> creates a dataset object. It should be modified for your
own dataset. A typical data input pipeline: read in the data, parse the data,
preprocess the data, shuffle and repeat the data, batch the data up, and
make the data iterator.</li>
<li><strong>inputpipeline.py:</strong> provides some functions that can be used in the
data input pipeline.</li>
<li><strong>input_source.py:</strong> shows several examples of input sources that tensorflow can
use, such as input from numpy, input from numpy as placeholder, input from
tfrecord, etc.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">input_from_numpy</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">label</span><span class="p">):</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">convert_to_tensor</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">int32</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"image"</span><span class="p">)</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">convert_to_tensor</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">int32</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"label"</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_tensor_slices</span><span class="p">(</span>
<span class="p">{</span><span class="s">"input"</span><span class="p">:</span> <span class="n">image</span><span class="p">,</span>
<span class="s">"target"</span><span class="p">:</span> <span class="n">label</span><span class="p">})</span>
<span class="k">return</span> <span class="n">dataset</span>
<span class="k">def</span> <span class="nf">input_from_numpy_as_placeholder</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">label</span><span class="p">):</span>
<span class="n">input_placeholder</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">target_placeholder</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">label</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">label</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_tensor_slices</span><span class="p">((</span><span class="n">input_placeholder</span><span class="p">,</span> \
<span class="n">target_placeholder</span><span class="p">))</span>
<span class="k">return</span> <span class="n">dataset</span>
<span class="k">def</span> <span class="nf">input_from_tfrecord</span><span class="p">():</span>
<span class="n">filenames</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">string</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">])</span>
<span class="c1"># make filenames as placeholder for training and validating purpose
</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">TFRecordDataset</span><span class="p">(</span><span class="n">filenames</span><span class="p">)</span>
<span class="k">return</span> <span class="n">dataset</span>
</code></pre></div> </div>
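<p>A typical pipeline then chains these stages together. Here is a minimal sketch with the <code class="language-plaintext highlighter-rouge">tf.data</code> API (the toy arrays, buffer size, and batch size are illustrative):</p>

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for parsed tfrecord contents.
images = np.arange(8, dtype=np.float32).reshape(8, 1)
labels = np.arange(8, dtype=np.int64)

dataset = (tf.data.Dataset.from_tensor_slices({"input": images, "target": labels})
           .shuffle(buffer_size=8)   # shuffle within a buffer
           .repeat(2)                # repeat for 2 epochs
           .batch(4))                # batch the data up
```

<p>Each stage returns a new dataset, so stages can be added or removed without disturbing the rest of the pipeline.</p>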
</li>
<li><strong>Model:</strong>
<ul>
<li><strong>model_base.py:</strong> provides a series of building blocks that you might
use in your NN, such as relu, leakyrelu, a fully_connected layer, etc. The
model_base object is inherited by the main NN model.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">_relu</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_leakyrelu</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">leak</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"lrelu"</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="n">f1</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">leak</span><span class="p">)</span>
<span class="n">f2</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">leak</span><span class="p">)</span>
<span class="k">return</span> <span class="n">f1</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">f2</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div> </div>
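<p>The <code class="language-plaintext highlighter-rouge">_leakyrelu</code> block above computes leaky ReLU in closed form: for leak = 0.2, f1·x + f2·|x| equals x when x ≥ 0 and 0.2·x when x &lt; 0. A NumPy version makes the identity easy to check:</p>

```python
import numpy as np

def leakyrelu(x, leak=0.2):
    # Same algebraic form as the _leakyrelu building block.
    f1 = 0.5 * (1 + leak)
    f2 = 0.5 * (1 - leak)
    return f1 * x + f2 * np.abs(x)
```

<p>This form avoids a branch; it is algebraically identical to <code class="language-plaintext highlighter-rouge">max(x, leak * x)</code>.</p>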
<ul>
<li><strong>VGG_16.py:</strong> constructs the main model. A forward_pass method should be
implemented within this object.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">forward_pass</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">'Conv_Block_0'</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_conv_batch_relu</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">filters</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_filters</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> \
<span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">strides</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_max_pool</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">pool_size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">'Conv_Block_1'</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_conv_batch_relu</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">filters</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_filters</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> \
<span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">strides</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">'Fully_Connected'</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">'Tensor_Flatten'</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">_batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_fully_connected</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_num_classes</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div> </div>
</li>
<li><strong>Training:</strong>
<ul>
<li><strong>Saver.py:</strong> creates a saver object that saves and restores training weights
in tensorflow.</li>
<li><strong>Summary.py:</strong> creates a summary object that stores data from the training
process. Data including scalars, images, histograms, graphs, etc. can be visualized
in tensorboard.</li>
<li><strong>train_base.py:</strong> a base class that can be inherited by Train.py.
It includes different optimizers, metrics, etc.</li>
<li><strong>Train.py:</strong> the main script that trains the model.</li>
<li><strong>utils.py:</strong> utility functions used by other training functions.</li>
</ul>
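<p>As a flavor of the metrics that <strong>train_base.py</strong> might provide, here is a NumPy sketch of batch accuracy (the function name is illustrative, not the template’s actual API):</p>

```python
import numpy as np

def batch_accuracy(logits, labels):
    """Fraction of examples whose argmax prediction matches the label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))
```

<p>Keeping metrics like this in the base class lets Train.py and Evaler.py share one implementation.</p>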
</li>
<li><strong>Testing:</strong>
<ul>
<li><strong>eval_base.py:</strong> a base class that can be inherited by Evaler.py.
It includes different metrics, etc.</li>
<li><strong>Evaler.py:</strong> the main script that evaluates the model.</li>
<li><strong>utils.py:</strong> utility functions used by other evaluation
functions.</li>
</ul>
</li>
<li><strong>Deploy:</strong>
<ul>
<li><strong>deploy_base.py:</strong> a base class that can be inherited by Deploy.py.
It includes import_meta_graph, extend_meta_graph, freeze_mode,
etc.</li>
<li><strong>Deploy.py:</strong> the main script that deploys the model.</li>
<li><strong>construct_deploy_model.py:</strong> constructs the model for production use, with a placeholder
as the input interface.</li>
<li><strong>model_inspect.py:</strong> functions that can be used to inspect your trained ckpt file.</li>
</ul>
</li>
</ol>
<h3 id="examples-of-using-this-template">Examples of using this template:</h3>
<h3 id="useful-links">Useful links:</h3>
<p><a href="https://blog.metaflow.fr/tensorflow-a-proposal-of-good-practices-for-files-folders-and-models-architecture-f23171501ae3">TensorFlow: A proposal of good practices for files, folders and models architecture</a></p>
<p><a href="https://guides.github.com/features/mastering-markdown/">Mastering markdown in GitHub</a></p>