3 PRELIMINARIES

3.1 Notations & Parameters

The parameters defined for training the model are:

3.2 xView Dataset

With over 1 million objects across 60 classes spread over more than 1,400 km² of imagery, xView is one of the largest and most diverse object detection datasets publicly available. The imagery was collected from WorldView-3 satellites at a ground sample distance of 0.3 m, giving xView higher-resolution imagery than most comparable overhead datasets. The 60 classes include 'Fixed-wing Aircraft', 'Small Aircraft', 'Cargo Plane', 'Helicopter', 'Passenger Vehicle', 'Small Car', 'Pickup Truck', 'Utility Truck', and 'Truck', among others.

3.3 Single-Stage Object Detectors

Single-stage object detectors complete detection in a single pass through the network, as opposed to the two passes required by two-stage models, which typically results in lower inference times. A single convolutional network predicts both the bounding boxes and the class probabilities of the objects within them.
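To illustrate this single-pass behaviour, the minimal sketch below runs one forward pass of a YOLOv5 model loaded through PyTorch Hub and reads back boxes, confidences, and classes from the same output. The hub entry point, model variant, and the image path 'sample_tile.jpg' are our own illustrative choices, not part of the method described in this paper.

import torch

# Load a pretrained YOLOv5 medium model from PyTorch Hub (weights are downloaded on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)

# One forward pass returns boxes, confidences, and classes together.
results = model('sample_tile.jpg')  # 'sample_tile.jpg' is a placeholder 512x512 tile

# Each row: x1, y1, x2, y2, confidence, class index, class name.
print(results.pandas().xyxy[0])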

3.4 YOLO v5

Like any other single-stage object detector, YOLO v5 consists of three essential components: a model backbone, a model neck, and a model head [21][22][23]. The backbone extracts informative features from the input image; in YOLO v5 it is built from Cross Stage Partial (CSP) networks.
The neck is used to build feature pyramids, which help the model generalise across object scales so that the same object can be detected at different sizes, including on data it has not seen before. Different architectures use different feature-pyramid designs, such as FPN, BiFPN, and PANet; YOLO v5 uses PANet as its neck. The head performs the final detection step: it applies anchor boxes to the neck's features and produces the final output vector of bounding boxes, objectness scores, and class probabilities.
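To make the head's output format concrete, the following toy sketch (our own illustration, not the actual YOLO v5 code) shows how a detection head maps a feature map to per-anchor box offsets, an objectness score, and class probabilities with a single 1x1 convolution; the 256-channel neck output and three anchors per cell are assumptions.

import torch
import torch.nn as nn

class ToyDetectionHead(nn.Module):
    """Illustrative single-scale detection head: one 1x1 conv over the neck's feature map."""
    def __init__(self, in_channels, num_anchors=3, num_classes=60):
        super().__init__()
        self.num_anchors = num_anchors
        self.num_outputs = 5 + num_classes  # 4 box offsets + objectness + class scores
        self.conv = nn.Conv2d(in_channels, num_anchors * self.num_outputs, kernel_size=1)

    def forward(self, features):
        b, _, h, w = features.shape
        out = self.conv(features)
        # Reshape to (batch, anchors, grid_h, grid_w, 5 + num_classes).
        return out.view(b, self.num_anchors, self.num_outputs, h, w).permute(0, 1, 3, 4, 2)

head = ToyDetectionHead(in_channels=256)      # 256-channel neck output is an assumption
preds = head(torch.randn(1, 256, 16, 16))     # e.g. a 16x16 grid from a 512x512 tile at stride 32
print(preds.shape)                            # torch.Size([1, 3, 16, 16, 65])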

4 PROPOSED TECHNIQUE

4.1 Labelling of Data

The tiling script was written to handle annotations whose bounding boxes were split across tiles, so that the tiled images together still represent the complete labelled scene. The images are split in a 90:10 ratio between our training and validation sets respectively. The images from both sets were further tiled to 512x512, in line with the memory budget of both our models. The annotations were split along the same tile boundaries and rescaled to the tile dimensions to make them compatible with the YOLO framework (a minimal sketch of this step is given below). Each training iteration/experiment consisted of 30 epochs to maintain consistency.
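The fragment below is a minimal sketch of this tiling step, not the actual script: the function name tile_yolo_box, the 4-pixel minimum-overlap threshold, and the pixel-space (x_min, y_min, x_max, y_max) input format are our own illustrative assumptions. It clips each box to every 512x512 tile it overlaps and emits one normalised YOLO row per overlapping tile.

TILE = 512  # tile size in pixels, matching the --img 512 training resolution

def tile_yolo_box(cls, box, img_w, img_h, tile=TILE, min_overlap_px=4):
    """Clip a pixel-space box (x_min, y_min, x_max, y_max) to every tile it overlaps
    and return (tile_col, tile_row, yolo_row) entries in normalised YOLO format."""
    x_min, y_min, x_max, y_max = box
    entries = []
    for ty in range(0, img_h, tile):
        for tx in range(0, img_w, tile):
            # Intersection of the box with this tile, in tile-local coordinates.
            ix_min, iy_min = max(x_min, tx) - tx, max(y_min, ty) - ty
            ix_max = min(x_max, tx + tile) - tx
            iy_max = min(y_max, ty + tile) - ty
            if ix_max - ix_min < min_overlap_px or iy_max - iy_min < min_overlap_px:
                continue  # box does not meaningfully overlap this tile
            xc = (ix_min + ix_max) / 2 / tile
            yc = (iy_min + iy_max) / 2 / tile
            w = (ix_max - ix_min) / tile
            h = (iy_max - iy_min) / tile
            entries.append((tx // tile, ty // tile, f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"))
    return entries

# Example: a box spanning two horizontal tiles of a 1024x1024 image.
for col, row, line in tile_yolo_box(5, (200, 100, 800, 400), 1024, 1024):
    print(col, row, line)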

Parameter Training & Tuning

Exp 1: python train.py --img 512 --batch 12 --noval --epochs 30 --data data/custom.yaml --weights yolov5m.pt --cfg models/yolov5m_custom.yaml --hyp data/hyps/hyp.scratch-med.yaml --device 0 --name ./experiment_x
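As documented in the YOLOv5 repository's train.py, --img sets the training image size to 512 pixels (matching the tile size), --batch the batch size, --epochs the number of epochs, --data and --cfg the dataset and model configuration files, --weights the pretrained yolov5m checkpoint, --hyp the medium from-scratch hyperparameter profile shipped with the repository, --device the GPU, and --name the run directory; --noval restricts validation to the final epoch.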

Annotation of Image

Figure 1 shows a sample annotated image for the model. The steps involved in the annotation of the image are as under:
Step 1: Download the xView dataset from https://challenge.xviewdataset.org.
Step 2: Split all the images in a 90:10 ratio between our training and validation sets respectively.
Step 3: The images from both sets were further tiled as per the memory allocation of the models, i.e. 512x512 for YOLO v5.
Step 4: The annotations were converted from GeoJSON Polygon objects to YOLO format using a script (a sketch of this conversion follows the list). The annotation dimensions were also split according to the image splits.
Step 5: The tiled dataset was then used to experiment with and train our YOLO models. Each training iteration/experiment consisted of 30 epochs to maintain consistency between all models.
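The fragment below is a minimal sketch of the conversion referenced in Step 4, not the actual script: it assumes each GeoJSON feature stores a pixel-coordinate Polygon and a numeric class id under a property here called type_id (a hypothetical key), and it produces one normalised YOLO row per feature.

import json

def geojson_polygon_to_yolo(feature, img_w, img_h, class_key="type_id"):
    """Convert one GeoJSON Polygon feature (pixel coordinates assumed) to a YOLO row:
    'class x_center y_center width height', all normalised to the image size."""
    ring = feature["geometry"]["coordinates"][0]       # exterior ring of the polygon
    xs = [pt[0] for pt in ring]
    ys = [pt[1] for pt in ring]
    x_min, x_max = max(min(xs), 0), min(max(xs), img_w)
    y_min, y_max = max(min(ys), 0), min(max(ys), img_h)
    cls = feature["properties"][class_key]             # class id; property name is an assumption
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example usage on a loaded annotation file (file name and image size are placeholders).
with open("xView_train.geojson") as f:
    annotations = json.load(f)
rows = [geojson_polygon_to_yolo(feat, 3000, 3000) for feat in annotations["features"]]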