Dataset processing and analysis

Importing and verifying the dataset

ImageHash is a perceptual hashing library that allows us to detect whether there are any very similar images in the dataset, regardless of their dimensions and minor color-related differences.
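Under the hood, ImageHash reduces each image to a tiny fingerprint and compares fingerprints by Hamming distance, so near-duplicates score low no matter their resolution. A minimal numpy-only sketch of the simplest variant, average hashing (`average_hash` and `hamming` are illustrative helpers, not part of the ImageHash API):

```python
import numpy as np

def average_hash(gray, hash_size=8):
    # Downscale by block-averaging (a crude stand-in for a proper resize),
    # then threshold each block against the mean to get a 64-bit fingerprint.
    h, w = gray.shape
    gray = gray[:h - h % hash_size, :w - w % hash_size]
    hh, ww = gray.shape
    blocks = gray.reshape(hash_size, hh // hash_size,
                          hash_size, ww // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).ravel()

def hamming(a, b):
    # Number of differing bits; near-duplicate images score close to 0.
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (200, 200)).astype(float)
noisy = img + rng.normal(0, 2, img.shape)  # a slightly perturbed copy
print(hamming(average_hash(img), average_hash(noisy)))  # small distance
```

ImageHash's `phash` uses a DCT instead of plain block averages, which makes it more robust to brightness and compression changes, but the comparison step is the same.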

Out[6]:
   Class  Duplicate Count  Total Images  Proportion
0   full                0         23086         0.0
1  Total                0         23086         0.0
Running on 16 workers
Total images: 23086
Processing images: 100%|██████████| 23086/23086 [05:18<00:00, 72.47it/s]
Total processing time: 320.58 seconds

Image color summary

There seem to be no grayscale images and all images have 3 color channels.
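The check behind this is simple: an image saved with 3 channels can still be effectively grayscale if the channels are identical, so comparing the channels directly catches both cases (a sketch; `is_grayscale` is a hypothetical helper, not the notebook's actual code):

```python
import numpy as np

def is_grayscale(img, tol=0):
    # A 2-D array is grayscale by definition; a 3-channel image is
    # effectively grayscale when all three channels agree (within tol).
    if img.ndim == 2:
        return True
    r, g, b = (img[..., i].astype(int) for i in range(3))
    return bool(np.abs(r - g).max() <= tol and np.abs(g - b).max() <= tol)

rgb = np.dstack([np.full((4, 4), v) for v in (10, 20, 30)])
gray_as_rgb = np.dstack([np.full((4, 4), 7)] * 3)
print(is_grayscale(rgb), is_grayscale(gray_as_rgb))  # False True
```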

Out[17]:
image_path variance unique_colors entropy brisque_score laplacian_variance fft_blur_score luminance luminance_bin skin_tone age gender age_group age_bin_raw Images
0 ../dataset/full/10_0_0_20170110220033115.jpg.c... 1477.253495 6267 6.857390 33.980056 256.865812 2.312982 203.8070 3 12.5248 10 0 0-18 0-10 0
1 ../dataset/full/10_0_0_20170110224406532.jpg.c... 2452.172032 8298 7.718125 33.397515 244.865678 2.826604 141.7135 2 23.6788 10 0 0-18 0-10 0
2 ../dataset/full/10_0_0_20170110220255346.jpg.c... 2980.936287 8942 7.736862 44.824772 123.788397 2.063477 158.8874 3 25.4196 10 0 0-18 0-10 0
3 ../dataset/full/10_0_0_20170110220251986.jpg.c... 3365.068846 6339 7.209920 24.517992 657.658092 3.654595 130.6373 2 20.3080 10 0 0-18 0-10 0
4 ../dataset/full/10_0_0_20170110220403810.jpg.c... 4118.893420 8065 7.896404 52.822707 74.278110 2.606479 122.2249 2 21.0340 10 0 0-18 0-10 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23081 ../dataset/full/9_1_2_20170104020210475.jpg.ch... 1676.861665 5791 7.260590 37.093527 176.641079 2.029646 138.5114 2 23.7608 9 1 0-18 0-10 0
23082 ../dataset/full/9_1_2_20161219204347420.jpg.ch... 1255.620365 7693 7.232986 42.996096 49.222689 1.349526 83.3686 1 23.2144 9 1 0-18 0-10 0
23083 ../dataset/full/9_1_4_20170103200814791.jpg.ch... 3325.250201 8696 7.875873 11.624793 914.503642 3.635523 145.6209 2 7.2624 9 1 0-18 0-10 0
23084 ../dataset/full/9_1_3_20161219225144784.jpg.ch... 1996.379638 6084 7.345491 55.754715 46.323105 1.265786 86.5876 1 42.3900 9 1 0-18 0-10 0
23085 ../dataset/full/9_1_4_20170103213057382.jpg.ch... 2170.575589 8720 7.753345 31.614764 439.525016 2.402710 157.0024 3 0.4552 9 1 0-18 0-10 0

23086 rows × 15 columns

Age and Gender Distribution

  sns.kdeplot(data=image_entropy_summary, x='age', fill=True)

[figure: KDE of the age distribution]

The distribution of ages in the dataset doesn't seem to be in line with general demographic trends in most countries:

  • Newborns, young children, and working-age adults between 20 and 40 are disproportionately overrepresented.
  • There are relatively few samples of teenagers and of people above 50-60.
[figures: age and gender distribution histograms]

The uneven distribution will likely impact the model's performance and generalization across age groups, so we'll need to pay attention to it and, if necessary, find ways to compensate.
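If it does turn out to be a problem, a common mitigation is to oversample rare age groups during training via inverse-frequency sample weights. A toy sketch (the bin edges and labels here are made up for illustration):

```python
import numpy as np

ages = np.array([1, 1, 25, 25, 25, 25, 70])   # toy age labels
bins = np.digitize(ages, [18, 50])            # 3 coarse age groups
counts = np.bincount(bins, minlength=3)       # samples per group
weights = 1.0 / counts[bins]                  # inverse-frequency weight per sample
weights /= weights.sum()                      # normalize into sampling probabilities
print(np.round(weights, 3))                   # rare groups get larger weights
```

These probabilities can then be fed to e.g. a weighted random sampler so that each batch sees a more balanced age mix.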

Gender Balance and Distribution

While the balance between male and female samples is relatively acceptable (52:48), we can see that their distributions across age groups are quite different:

Out[20]:
[figure: age distributions by gender]

(1 = Female)

Out[22]:
Male Female All
Count 12069.00 11017.00 23086.00
Prop. 0.52 0.48 1.00
Mean 35.65 30.62 33.25
Median 34.00 26.00 29.00
Mode 26.00 26.00 26.00
Std Dev 19.72 19.69 19.86
IQR 25.00 16.00 22.00
5th Percentile 1.00 2.00 2.00
25th Percentile 25.00 21.00 23.00
75th Percentile 50.00 37.00 45.00
95th Percentile 70.00 72.00 71.00
Minimum 1.00 1.00 1.00
Maximum 110.00 116.00 116.00
Skewness 0.28 1.03 0.62
Kurtosis -0.19 1.32 0.32

On average, males in the photographs seem to be significantly older, at least through the middle of the range (25th-50th percentiles). Above a certain age (~70), the proportion of females increases significantly. This, again, raises potential issues and is something we'll need to pay close attention to when evaluating our model.

Out[32]:
Male Female Total
age_bin_raw
0-10 1509 1638 3147
10-20 672 952 1624
20-30 3223 4339 7562
30-40 2408 1828 4236
40-50 1417 640 2057
50-60 1500 650 2150
60-70 754 378 1132
70-80 406 247 653
80-90 168 274 442
90-inf 12 71 83
[figure: age bin counts by gender]
[figure: Distribution of Age Groups by Gender (Fem = 1)]

Image Analysis

We'll perform an in-depth analysis of some key characteristics, like:

  • Luminance distribution
  • Color variance and distribution
  • Image entropy
  • Image quality (using BRISQUE, FFT, Laplacian variance)

We want to make sure that we have a comprehensive understanding of our dataset since that will impact our preprocessing (selection of transformation and augmentation techniques) and other decisions.

Additionally, we'll use a combination of these metrics to improve the robustness of our evaluation pipeline:

  • Luminance and color information is used to assess the model's performance over different skin tone ranges.
  • Image quality analysis will allow us to eliminate, or at least identify, invalid images (i.e. extremely blurry or badly cropped ones) and measure their impact on overall performance.

Color Variance and Entropy

  • Average variance of the color channels across all images:

    • Variance = 0: All pixels in the image have the same color.
    • High Variance: Indicates images with diverse color pixels.
  • Number of unique colors in each image

  • Entropy (shannon_entropy).

    • Scale: 0 to log2(N) bits, where N is the number of possible pixel values (log2(256) = 8 for 8-bit grayscale).
      • Min Entropy = 0: Perfectly uniform image (single color).
      • High Entropy: Indicates images with a wide variety of colors and patterns.
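For a single image, the three metrics above can be sketched as follows (assuming an H x W x 3 uint8 array; `color_stats` is an illustrative helper, not the notebook's actual code):

```python
import numpy as np

def color_stats(img):
    # img: H x W x 3 uint8. Returns (mean channel variance,
    # number of unique colors, Shannon entropy of the gray histogram in bits).
    variance = float(img.reshape(-1, 3).var(axis=0).mean())
    unique_colors = len(np.unique(img.reshape(-1, 3), axis=0))
    gray = img.mean(axis=-1).astype(np.uint8)
    p = np.bincount(gray.ravel(), minlength=256) / gray.size
    p = p[p > 0]                                   # ignore empty bins
    entropy = float(-(p * np.log2(p)).sum())
    return variance, unique_colors, entropy

flat = np.zeros((8, 8, 3), dtype=np.uint8)   # single color: zero variance/entropy
print(color_stats(flat))
```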
Out[45]: (entropy summary by gender)
Male Female All
Count 12069.00 11017.00 23086.00
Prop. 0.52 0.48 1.00
Mean 7.52 7.59 7.55
Median 7.57 7.64 7.61
Mode 4.28 5.60 4.28
Std Dev 0.27 0.25 0.26
IQR 0.33 0.30 0.32
5th Percentile 7.02 7.11 7.06
25th Percentile 7.38 7.47 7.42
75th Percentile 7.71 7.77 7.74
95th Percentile 7.86 7.89 7.87
Minimum 4.28 5.60 4.28
Maximum 7.97 7.97 7.97
Skewness -1.43 -1.37 -1.40
Kurtosis 5.10 3.11 4.31
Out[44]: (color variance summary by gender)
Male Female All
Count 12069.00 11017.00 23086.00
Prop. 0.52 0.48 1.00
Mean 2548.27 3013.95 2770.50
Median 2338.08 2781.21 2540.88
Mode 201.43 293.90 201.43
Std Dev 1195.63 1397.19 1316.42
IQR 1525.39 1851.76 1701.91
5th Percentile 1006.69 1139.91 1063.60
25th Percentile 1672.27 1973.63 1803.40
75th Percentile 3197.67 3825.39 3505.31
95th Percentile 4815.17 5662.61 5273.66
Minimum 201.43 293.90 201.43
Maximum 9816.23 10944.40 10944.40
Skewness 1.05 0.85 0.98
Kurtosis 1.55 0.82 1.20

While male and female images have comparable overall color complexity or information content (entropy), the higher variance in female images indicates that their colors are more spread out from the mean color.

e.g. a female image might contain a wide range of colors (high variance) in a balanced, evenly distributed manner (similar entropy to male images). For instance, a colorful floral dress with many different hues, well distributed throughout the image.

This raises a few questions that could influence our preprocessing pipeline and the model itself:

  • The difference in color variance between male and female images could become a strong predictive feature for gender classification. However, the model might become overly reliant on color variance, potentially misclassifying males with high color variance or females with low color variance.
  • While this effect won't be noticeable when testing on a sample of the same dataset (it's even likely to improve the model's measured performance), the model might perform worse in real-world conditions or on different datasets, because part of its decision-making would be based not on core facial attributes but on clothing, cosmetics, and other external factors (assuming our hypothesis is correct).

We'll try to handle this by including augmentation techniques that add color jitter to individual samples, or even remove all color information from the images (however, we'd need a different dataset to fully verify this).
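Both augmentations can be sketched in plain numpy (`color_jitter` and `to_grayscale` are illustrative helpers; a real pipeline would more likely use torchvision's `ColorJitter` and `Grayscale` transforms):

```python
import numpy as np

def color_jitter(img, rng, strength=0.2):
    # Rescale each channel by a random factor: perturbs the color
    # statistics while leaving spatial structure intact.
    factors = 1 + rng.uniform(-strength, strength, size=3)
    return np.clip(img.astype(float) * factors, 0, 255)

def to_grayscale(img):
    # Collapse to luma and replicate across channels: removes all
    # color information the model might otherwise latch onto.
    luma = img.astype(float) @ np.array([0.299, 0.587, 0.114])
    return np.repeat(luma[..., None], 3, axis=-1)

img = np.random.default_rng(0).integers(0, 256, (8, 8, 3), dtype=np.uint8)
gray = to_grayscale(img)
print(np.allclose(gray[..., 0], gray[..., 1]))  # True: channels identical
```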

[figure: color variance distributions]
[figure: color variance distributions by age group]

We can see similar differences when comparing different age groups as well.

Skin Color Estimation

Additionally, we'll try to determine the skin color of the subjects so that we could later measure whether that has an impact on the performance of our model.

We've attempted various heuristics (and combinations of them) for this, but we've found that using luminance directly provides the most predictable and reasonably useful results:

thresh: 60.112435, filtered_df: 231
[figure: sample images near the low-luminance threshold]
thresh: 194.07173500000016, filtered_df: 231
[figure: sample images near the high-luminance threshold]
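The heuristic itself is just a weighted channel sum (sketched here with the standard Rec. 601 luma weights; `mean_luminance` is an illustrative helper, not necessarily the notebook's exact code):

```python
import numpy as np

def mean_luminance(rgb):
    # Rec. 601 luma: a weighted sum of the channels that approximates
    # perceived brightness, used here as a crude skin-tone proxy.
    weights = np.array([0.299, 0.587, 0.114])
    return float((rgb[..., :3].astype(float) * weights).sum(axis=-1).mean())

dark = np.full((2, 2, 3), 40.0)
light = np.full((2, 2, 3), 220.0)
print(mean_luminance(dark) < mean_luminance(light))  # True
```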

Measuring Image Quality

The quality and validity of the data we're using also has a significant effect (even if it's not necessarily easy to estimate when using the same dataset for evaluation).

While the UTK dataset is of relatively high quality, it still contains some invalid images (and some probably mislabeled ones, but we'll get to that later).

BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator)

A no-reference image quality assessment method. Uses scene statistics of locally normalized luminance coefficients to quantify possible losses of "naturalness" in the image due to distortions. Operates in the spatial domain.

Basically, it allows us to detect very blurry images:

thresh: 66.22156698932034, filtered_df: 35
[figure: example images above the BRISQUE threshold]

While these images seem mostly valid (i.e. they contain human faces), we can see that BRISQUE would allow us to filter out images of very poor quality that would be too hard to classify. Depending on the production use case, it could also be used simply to flag for the user which images are worth classifying.
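The "locally normalized luminance coefficients" mentioned above (MSCN coefficients) can be sketched as follows, assuming scipy is available; this is only the first stage of BRISQUE, which then fits generalized Gaussian statistics to these coefficients:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(gray, sigma=7 / 6, c=1.0):
    # Mean-subtracted, contrast-normalized coefficients: the locally
    # normalized luminance whose scene statistics BRISQUE models.
    gray = gray.astype(float)
    mu = gaussian_filter(gray, sigma)                    # local mean
    var = gaussian_filter(gray * gray, sigma) - mu * mu  # local variance
    sd = np.sqrt(np.maximum(var, 0))
    return (gray - mu) / (sd + c)

flat = np.full((16, 16), 128.0)
print(float(np.abs(mscn(flat)).max()))  # effectively 0: no local structure
```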

Examples of High BRISQUE Images

thresh: -3.152889781340932, filtered_df: 35
[figure: example images at this BRISQUE threshold]

Laplacian Variance

A measure of image sharpness/blurriness. It applies the Laplacian operator to compute the second derivative of the image and takes the variance of the filtered result.

thresh: 17.4395013321, filtered_df: 35
[figure: example images below the Laplacian variance threshold]

Laplacian variance seems to correlate very highly with BRISQUE, essentially allowing us to filter out a very similar set of images.
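The metric is simple enough to sketch directly (illustrative; in practice this is typically computed as `cv2.Laplacian(img, cv2.CV_64F).var()`):

```python
import numpy as np

def laplacian_variance(gray):
    # Variance of the discrete Laplacian (second derivative);
    # blurry images have weak edges and therefore low variance.
    g = gray.astype(float)
    lap = (-4 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

edges = (np.indices((32, 32)).sum(axis=0) % 2) * 255.0  # checkerboard: all edges
flat = np.full((32, 32), 128.0)                         # flat: no edges
print(laplacian_variance(edges) > laplacian_variance(flat))  # True
```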

FFT-based Blur Detection

thresh: 0.8144524239654852, filtered_df: 35
[figure: example images below the FFT blur score threshold]

FFT seems to be somewhat too aggressive for our purposes: it assigns very low scores even to images with reasonably discernible faces.
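A common form of this detector can be sketched as follows: zero out a low-frequency window around the spectrum's center, then average the remaining log magnitude (`fft_blur_score` is a hypothetical helper; the exact scoring used above may differ):

```python
import numpy as np

def fft_blur_score(gray, size=8):
    # Shift the 2-D spectrum so low frequencies sit at the center,
    # suppress them, and average the remaining (high-frequency) energy.
    f = np.fft.fftshift(np.fft.fft2(gray.astype(float)))
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    f[cy - size:cy + size, cx - size:cx + size] = 0  # drop low frequencies
    return float(np.log1p(np.abs(f)).mean())

sharp = (np.indices((64, 64)).sum(axis=0) % 2) * 255.0  # high-frequency pattern
blurry = np.full((64, 64), 128.0)                       # no detail at all
print(fft_blur_score(sharp) > fft_blur_score(blurry))  # True
```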

Feature Correlation

[figure: correlation matrix of image metrics]

All three new metrics are strongly correlated with each other, confirming that they more or less measure the same thing (blurriness and amount of detail).

Color Channel Distribution by Class

These plots show the normalized intensity (0-255) distributions of each color channel by class. The Y axis shows the normalized frequency (density) relative to all color channels (scaled by the highest individual value of any channel).

The charts are made by generating a histogram for each image and normalizing it (normalization maintains the shape of the histogram, so the relative distribution of pixel intensities is preserved). All histograms in the class are then averaged.
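That procedure can be sketched as follows (illustrative; here each channel histogram is normalized to a density rather than by the single highest channel value described above, which changes only the vertical scale, not the shape):

```python
import numpy as np

def class_channel_histograms(images):
    # Per-image, per-channel 256-bin histograms normalized to densities,
    # then averaged over all images in the class. Returns a 3 x 256 array.
    hists = []
    for img in images:
        per_channel = np.array(
            [np.bincount(img[..., c].ravel(), minlength=256) for c in range(3)],
            dtype=float,
        )
        per_channel /= per_channel.sum(axis=1, keepdims=True)  # density per channel
        hists.append(per_channel)
    return np.mean(hists, axis=0)

imgs = [np.random.default_rng(i).integers(0, 256, (16, 16, 3), dtype=np.uint8)
        for i in range(3)]
avg = class_channel_histograms(imgs)
print(avg.shape)  # (3, 256)
```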

[figure: averaged color channel histograms by class]