Update README.md
README.md CHANGED
@@ -150,9 +150,8 @@ model (CPTR) using an encoder-decoder transformer [[1]](#1). The source image is
 to the transformer encoder in sequence patches. Hence, one can treat the image
 captioning problem as a machine translation task.
 
-<img
-…
-width="80%" padding="100px 100px 100px 10px">
+
+
 
 Figure 1: Encoder Decoder Architecture
 
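The patch-sequence framing in the context lines above is the core idea: the image is linearized into a sequence of flattened patches so the encoder can consume it the way a translation model consumes source-language tokens. Below is a minimal sketch of that linearization, assuming a 224×224 RGB input and 16×16 patches; `to_patch_sequence` is a hypothetical helper for illustration, not code from this repository.

```python
import numpy as np

def to_patch_sequence(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Flatten an (H, W, C) image into an (N, patch*patch*C) sequence of
    patch vectors, the ViT-style linearization described above.
    Hypothetical helper for illustration only."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # group the two grid axes together
    return x.reshape(-1, patch * patch * c)  # (N patches, flattened pixels)

img = np.random.rand(224, 224, 3)  # dummy source image
seq = to_patch_sequence(img)
print(seq.shape)                   # (196, 768): 14x14 patch "tokens"
```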
@@ -183,9 +182,8 @@ The encoder side deals solely with the image part, where it is beneficial to
 exploit the relative position of the features we have. Refer to Figure 2 for
 the model architecture.
 
-<img
-…
-width="80%" padding="100px 100px 100px 10px">
+
+
 
 Figure 2: Model Architecture
 
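The hunk above motivates exploiting relative positions on the encoder side. One common way to make attention relative-position-aware is a learned bias indexed by pairwise offsets; the toy 1-D sketch below illustrates only that idea and is not the repository's actual encoder implementation.

```python
import numpy as np

def relative_position_bias(seq_len: int, seed: int = 0) -> np.ndarray:
    """Toy 1-D relative-position bias: one scalar (randomly initialised here,
    learned in practice) per pairwise offset in [-(L-1), L-1], to be added to
    the attention logits. A sketch of the idea only, not the repo's encoder."""
    table = np.random.default_rng(seed).normal(size=2 * seq_len - 1)
    idx = np.arange(seq_len)
    offsets = idx[None, :] - idx[:, None]  # (L, L) pairwise offsets
    return table[offsets + seq_len - 1]    # same bias wherever offsets match

bias = relative_position_bias(4)
print(np.allclose(np.diag(bias, 1), bias[0, 1]))  # True: depends only on offset
```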
@@ -344,9 +342,7 @@ The reason for overfitting may be due to the following reasons:
 
 4. Unsuitable hyperparameters
 
-…
-| :--: | :--: |
-| Figure 3: Loss Curve | Figure 4: Bleu-4 score curve |
+
 
 ### Inference Output
 
@@ -359,9 +355,7 @@ distribution of the lengths is positively skewed. More specifically, the
 maximum caption length generated by the model (21 tokens) accounts for 98.66%
 of the lengths in the training set. See “code/experiment.ipynb Section 1.3”.
 
-<img
-src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/lens.png"
-padding="100px 100px 100px 10px">
+
 
 Figure 5: Generated caption's lengths distribution
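The 98.66% figure quoted in this hunk is a simple coverage statistic: the fraction of training captions no longer than the longest caption the model generated (21 tokens). A sketch of that computation with made-up lengths follows; the real numbers come from code/experiment.ipynb Section 1.3.

```python
import numpy as np

# Made-up stand-in for the tokenised training-caption lengths analysed in
# code/experiment.ipynb Section 1.3; with the real data this yields 98.66%.
train_caption_lengths = np.array([8, 9, 10, 11, 12, 12, 13, 14, 21, 25])

max_generated = 21  # longest caption the model produced at inference time
coverage = (train_caption_lengths <= max_generated).mean() * 100
print(f"{coverage:.2f}% of training captions have <= {max_generated} tokens")
```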