
Python: Find Amount of Handwriting in Video


You can identify the space taken by handwriting by masking out the pixels that belong to the template, and then doing the same for the difference between later frames and the template. You can use dilation, thresholding, and morphological closing for this.

Let's start with your template. Let's identify the parts we will mask:

import cv2
import numpy as np

template = cv2.imread('template.jpg')

[image: the template frame]

Now, let's broaden the occupied pixels to make a zone that we will mask (hide) later:

template = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
kernel = np.ones((5, 5), np.uint8)
dilation = cv2.dilate(255 - template, kernel, iterations=5)

[image: dilated template]

Then, we will threshold to turn this into a black-and-white mask:

_, thresh = cv2.threshold(dilation, 25, 255, cv2.THRESH_BINARY_INV)

[image: thresholded mask of the template]

In later frames, we will subtract this mask from the picture by turning all of these pixels white. For instance:

import numpy as np
import cv2

vidcap = cv2.VideoCapture('0_0.mp4')
success, image = vidcap.read()
count = 0
frames = []
while count < 500:
    frames.append(image)
    success, image = vidcap.read()
    count += 1

mask = np.where(thresh == 0)

example = frames[300]
example[mask] = [255, 255, 255]
cv2.imshow('', example)
cv2.waitKey(0)

[image: frame with the template pixels masked out]

Now, we will create a function that returns the difference between the template and a given picture. We will also use a morphological closing to get rid of the leftover single pixels that would make it ugly.

def difference_with_mask(image):
    grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((5, 5), np.uint8)
    dilation = cv2.dilate(255 - grayscale, kernel, iterations=5)
    _, thresh = cv2.threshold(dilation, 25, 255, cv2.THRESH_BINARY_INV)
    thresh[mask] = 255
    closing = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    return closing

cv2.imshow('', difference_with_mask(frames[400]))
cv2.waitKey(0)

[image: difference between frame 400 and the template]

To address the fact that you don't want the hand detected as handwriting, I suggest that instead of using the mask on every individual frame, you take the pixel-wise 95th percentile of the last 15 sampled frames, sampling one frame out of every 30. Look at this:

from collections import deque

history = deque(maxlen=15)
results = []
for ix, frame in enumerate(frames):
    if ix % 30 == 0:
        history.append(frame)
    results.append(np.quantile(history, 0.95, axis=0))
    print(ix)

Now, the example frame becomes this (the hand is removed because it wasn't present in most of the last 15 sampled frames):

[image: percentile-filtered frame with the hand removed]

As you can see, a little part of the handwriting is missing. It will appear later, because of the time-dependent percentile transformation we're doing: in my example with frame 18,400, the text that is missing in the image above is present. Then, you can use the function I gave you, and this will be the result:

[image: masked difference of the percentile-filtered frame]
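
Concretely (as in the full code at the end), that result comes from applying the earlier function to the percentile-filtered frame:

final = difference_with_mask(results[400].astype(np.uint8))
cv2.imshow('', final)
cv2.waitKey(0)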

And here we go! Note that this solution, which excludes the hand, will take longer to compute because there are a few extra calculations to do. Using a plain frame with no regard for the hand would compute almost instantly, to the extent that you could probably run it on your webcam feed in real time.
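
If you want an actual number for the amount of handwriting, one option (my suggestion, not part of the code above) is to count the dark pixels in the mask the function returns, since handwriting comes out black on white:

diff = difference_with_mask(results[400].astype(np.uint8))
handwriting_px = np.sum(diff == 0)  # handwriting pixels are black (0) on a white (255) background
print(handwriting_px, handwriting_px / diff.size)  # raw count and fraction of the frame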

Final Example:

Here's the frame 18,400:

[image: frame 18,400]

Final image:

[image: final masked handwriting for frame 18,400]

You can play with the function if you want the mask to wrap more tightly around the text:

[image: tighter mask around the text]
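
One way to get that tighter wrap (a sketch, not part of the original code) is to use a smaller kernel and fewer dilation iterations:

def difference_with_mask_tight(image):
    # same as difference_with_mask, but a 3x3 kernel and 2 iterations
    # make the mask hug the strokes more closely
    grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((3, 3), np.uint8)
    dilation = cv2.dilate(255 - grayscale, kernel, iterations=2)
    _, thresh = cv2.threshold(dilation, 25, 255, cv2.THRESH_BINARY_INV)
    thresh[mask] = 255
    return cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)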

Full code:

import os
from collections import deque

import numpy as np
import cv2

vidcap = cv2.VideoCapture('0_0.mp4')
success, image = vidcap.read()
count = 0
frames = deque(maxlen=700)
while count < 500:
    frames.append(image)
    success, image = vidcap.read()
    count += 1

template = cv2.imread('template.jpg')
template = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
kernel = np.ones((5, 5), np.uint8)
dilation = cv2.dilate(255 - template, kernel, iterations=5)
cv2.imwrite('dilation.jpg', dilation)
cv2.imshow('', dilation)
cv2.waitKey(0)

_, thresh = cv2.threshold(dilation, 25, 255, cv2.THRESH_BINARY_INV)
cv2.imwrite('thresh.jpg', thresh)
cv2.imshow('', thresh)
cv2.waitKey(0)

mask = np.where(thresh == 0)

example = frames[400]
cv2.imwrite('original.jpg', example)
cv2.imshow('', example)
cv2.waitKey(0)

example[mask] = 255
cv2.imwrite('example_masked.jpg', example)
cv2.imshow('', example)
cv2.waitKey(0)

def difference_with_mask(image):
    grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((5, 5), np.uint8)
    dilation = cv2.dilate(255 - grayscale, kernel, iterations=5)
    _, thresh = cv2.threshold(dilation, 25, 255, cv2.THRESH_BINARY_INV)
    thresh[mask] = 255
    closing = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    return closing

cv2.imshow('', difference_with_mask(frames[400]))
cv2.waitKey(0)

masked_example = difference_with_mask(frames[400])
cv2.imwrite('masked_example.jpg', masked_example)

history = deque(maxlen=15)
results = []
for ix, frame in enumerate(frames):
    if ix % 30 == 0:
        history.append(frame)
    results.append(np.quantile(history, 0.95, axis=0))
    print(ix)
    if ix > 500:
        break

cv2.imshow('', frames[400])
cv2.waitKey(0)

cv2.imshow('', results[400].astype(np.uint8))
cv2.imwrite('percentiled_frame.jpg', results[400].astype(np.uint8))
cv2.waitKey(0)

cv2.imshow('', difference_with_mask(results[400].astype(np.uint8)))
cv2.imwrite('final.jpg', difference_with_mask(results[400].astype(np.uint8)))
cv2.waitKey(0)


You could try to build a template before detection, which you can then subtract from the current frame of the video. One way to build such a template is to iterate through every pixel of the frame and look up whether it has a higher (whiter) value at that coordinate than the value stored in the list.

Here is an example of such a template from your video, built by iterating through the first two seconds:

[image: template built from the first two seconds of the video]

Once you have that, it is simple to detect the text. You can use the cv2.absdiff() function to take the difference between the template and a frame. Here is an example:

[image: absolute difference between the template and a frame]

Once you have this image, it is trivial to search for the writing (thresholding plus a contour search, or something similar).

Here is an example code:

import numpy as np
import cv2

cap = cv2.VideoCapture('0_0.mp4')  # read video
bgr = cap.read()[1]  # get first frame
frame = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # transform to grayscale
template = frame.copy()  # make a copy of the grayscale
h, w = frame.shape[:2]  # height, width

matrix = []  # a list for [y, x] coordinates
# fill matrix with all coordinates of the image (height x width)
for j in range(h):
    for i in range(w):
        matrix.append([j, i])

fps = cap.get(cv2.CAP_PROP_FPS)  # frames per second of the video
seconds = 2  # how many seconds of the video to search for the template
k = seconds * fps  # how many frames of the video are in that many seconds
i = 0  # iterator to count the frames
lowest = []  # list that will store the highest value of each pixel - that will build our template

# store the values of the first frame - so you can compare them in the next step
for j in matrix:
    y = j[0]
    x = j[1]
    lowest.append(template[y, x])

# loop through the number of frames calculated before
while i < k:
    bgr = cap.read()[1]  # bgr image
    frame = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # transform to grayscale
    # iterate through every pixel (pixels are located in the matrix)
    for l, j in enumerate(matrix):
        y = j[0]  # y coordinate
        x = j[1]  # x coordinate
        temp = template[y, x]  # value of pixel in template
        cur = frame[y, x]  # value of pixel in the current frame
        if cur > temp:  # if the current frame has a higher value, change the value in the "lowest" list
            lowest[l] = cur
    i += 1  # increment the iterator

    # just for visualization
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

i = 0  # new iterator to increment position in the "lowest" list
template = np.ones((h, w), dtype=np.uint8) * 255  # new empty white image
# iterate through the matrix and set each pixel of the new white image
# to the corresponding value in the "lowest" list
for j in matrix:
    template[j[0], j[1]] = lowest[i]
    i += 1

# just for visualization - template
cv2.imwrite("template.png", template)
cv2.imshow("template", template)
cv2.waitKey(0)
cv2.destroyAllWindows()

counter = 0  # number of contours: logically, if the number of contours
# rapidly decreases, that means a new template is in order
mean_compare = 0  # needed for a simple color check that the contour is
# the same color as the others

# this is the difference between the frame of the video and the created template
while cap.isOpened():
    bgr = cap.read()[1]  # bgr image
    frame = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # grayscale
    img = cv2.absdiff(template, frame)  # resulting difference
    thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]  # thresholded image
    kernel = np.ones((5, 5), dtype=np.uint8)  # simple kernel
    thresh = cv2.dilate(thresh, kernel, iterations=1)  # dilate thresholded image
    cnts, hierarchy = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # contour search
    if len(cnts) < counter * 0.5 and counter > 50:  # check if a new template is in order
        # search for a new template again
        break
    else:
        counter = len(cnts)  # update counter
        for cnt in cnts:  # iterate through contours
            size = cv2.contourArea(cnt)  # size of contour - to filter out noise
            if 20 < size < 30000:  # noise criterion
                mask = np.zeros(frame.shape, np.uint8)  # empty mask - needed for color comparison
                cv2.drawContours(mask, [cnt], -1, 255, -1)  # draw contour on mask
                mean = cv2.mean(bgr, mask=mask)  # the mean color of the contour
                if not mean_compare:  # the first contour sets the reference color
                    mean_compare = mean
                else:
                    k1 = 0.85  # coefficient: how much smaller each channel value may be
                    k2 = 1.15  # coefficient: how much bigger each channel value may be
                    # condition
                    b = bool(mean_compare[0] * k1 < mean[0] < mean_compare[0] * k2)
                    g = bool(mean_compare[1] * k1 < mean[1] < mean_compare[1] * k2)
                    r = bool(mean_compare[2] * k1 < mean[2] < mean_compare[2] * k2)
                    if b and g and r:
                        cv2.drawContours(bgr, [cnt], -1, (0, 255, 0), 2)  # draw on rgb image

    # just for visualization
    cv2.imshow('img', bgr)
    if cv2.waitKey(1) & 0xFF == ord('s'):
        cv2.imwrite(str(j) + ".png", img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# release the video object and destroy windows
cap.release()
cv2.destroyAllWindows()

One possible result with a simple size and color filter:

[image: detected handwriting outlined in green]

NOTE: This template-search algorithm is very slow because of the nested loops and can probably be optimized to make it faster - you need a little more math knowledge than I have. Also, you will need to check whether the template changes within the same video - I'm guessing that shouldn't be too difficult.
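
For what it's worth, the nested loops can be vectorized, since the per-pixel maximum over frames is exactly what np.maximum computes (a sketch, under the same two-second assumption as above):

import cv2
import numpy as np

cap = cv2.VideoCapture('0_0.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
seconds = 2
template = cv2.cvtColor(cap.read()[1], cv2.COLOR_BGR2GRAY)  # first frame
for _ in range(int(seconds * fps)):
    success, bgr = cap.read()
    if not success:
        break
    frame = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    template = np.maximum(template, frame)  # per-pixel maximum, no Python loops
cap.release()
cv2.imwrite('template.png', template)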

A simpler idea on how to make it a bit faster is to resize the frames to, say, 20% and run the same template search. After that, resize the result back to the original size and dilate the template. It will not be as nice a result, but it will make a mask of where the text and lines of the template are. Then simply draw it over the frame.
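
A rough sketch of that idea, building on the vectorized search above (note the dilation is applied to the inverted template so the dark text and lines widen into a coverage mask):

import cv2
import numpy as np

cap = cv2.VideoCapture('0_0.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
scale = 0.2  # work at 20% resolution
bgr = cap.read()[1]
h, w = bgr.shape[:2]
template = cv2.resize(cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY), None, fx=scale, fy=scale)
for _ in range(int(2 * fps)):  # first two seconds
    success, bgr = cap.read()
    if not success:
        break
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    template = np.maximum(template, cv2.resize(gray, None, fx=scale, fy=scale))
cap.release()

template = cv2.resize(template, (w, h))  # back to the original size
kernel = np.ones((5, 5), np.uint8)
text_mask = cv2.dilate(255 - template, kernel, iterations=2)  # widen dark text/lines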


I don't think you really need the code in this case, and it would be rather long if you did. But here's an algorithm to do it.

Use OpenCV's EAST (Efficient and Accurate Scene Text) detector at the beginning to establish the starting text on the slide. That gives you the bounding box(es) and the initial percentage of the slide covered with slide text, as opposed to handwritten explanatory text.
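
For reference, a minimal sketch of running OpenCV's EAST model on a frame (it assumes you have the pretrained frozen_east_text_detection.pb weights, and the input size must be a multiple of 32; 'slide.jpg' is a hypothetical input):

import cv2

net = cv2.dnn.readNet('frozen_east_text_detection.pb')
frame = cv2.imread('slide.jpg')
blob = cv2.dnn.blobFromImage(frame, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])
# scores[0, 0] is a coarse text-confidence map; geometry encodes the boxes.
# Even without decoding the boxes, the confident cells give a rough measure
# of how much of the slide is covered by text.
text_cells = (scores[0, 0] > 0.5).sum()
print(text_cells, 'grid cells above the confidence threshold')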

Every, say, 1-5 seconds (people don't write all that fast), compare that baseline image with the current image and the previous image. A sketch of the resulting decision logic follows the list below.

  • If the current image has more text than the previous image but the initial bounding boxes are NOT the same, you have a new and rather busy slide.

  • If the current image has more text than the previous image and the initial bounding boxes ARE the same, more text is being added.

  • If the current image has less text than the previous image but the initial bounding boxes are NOT the same, you again have a new slide - only not a busy one, with space to write like the last one.

  • If the current image has less text than the previous image and the initial bounding boxes ARE the same, you either have a duplicate slide with what will presumably be more text, or the teacher is erasing a section to continue or modify their explanation. Meaning, you'll need some way of addressing this.
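
A minimal sketch of that decision logic (text_amount and boxes_match are hypothetical helpers - e.g. total EAST box area, and an overlap check against the boxes found on the initial slide):

def classify_frame(prev_img, cur_img, initial_boxes):
    # text_amount() and boxes_match() are hypothetical helpers:
    # e.g. summed EAST bounding-box area, and an IoU comparison
    # against the boxes detected on the initial slide
    more_text = text_amount(cur_img) > text_amount(prev_img)
    same_boxes = boxes_match(cur_img, initial_boxes)
    if more_text and not same_boxes:
        return 'new, busy slide'
    if more_text and same_boxes:
        return 'text being added'
    if not more_text and not same_boxes:
        return 'new slide with room to write'
    return 'duplicate slide, or the teacher is erasing'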

When you have a new slide, take the previous image, and compare the bounding boxes of all text, subtracting the boxes for the initial state.

Computationally, this isn't going to be cheap (you certainly won't be doing this live, at least not for a number of years), but it's robust, and sampling the text every so many seconds will help.

Personally, I would approach this as an ensemble: start with the initial bounding boxes, then look at the color of the text. If you can get away with using the percentage of differently colored text, do. And when you can't, you'll still be good.