Data Preprocessing - QualiVision

Overview

QualiVision implements a sophisticated data preprocessing pipeline that handles video loading, frame sampling, resolution adaptation, text encoding, and MOS score normalization. The pipeline is optimized for GPU acceleration and supports both DOVER++ and V-JEPA2 models.

Dataset Structure

From README.md:19-43, the expected directory structure:

data/
├── train/
│   ├── labels/
│   │   └── train_labels.csv
│   └── videos/
│       ├── video001.mp4
│       ├── video002.mp4
│       └── ...
├── val/
│   ├── labels/
│   │   └── val_labels.csv
│   └── videos/
│       ├── val_video001.mp4
│       └── ...
└── test/
    ├── test_labels.csv
    └── videos/
        ├── test_video001.mp4
        └── ...
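One way to sanity-check this layout is to confirm that every file named in a labels CSV exists in the matching videos/ folder. The sketch below assumes the directory structure shown above; `find_missing_videos` is a hypothetical helper, not part of QualiVision.

```python
import csv
from pathlib import Path


def find_missing_videos(csv_file, video_dir):
    """Return the video_name entries in csv_file that have no file in video_dir."""
    video_dir = Path(video_dir)
    with open(csv_file, newline="", encoding="utf-8") as f:
        names = [row["video_name"] for row in csv.DictReader(f)]
    return [name for name in names if not (video_dir / name).is_file()]
```

A call such as `find_missing_videos("data/train/labels/train_labels.csv", "data/train/videos")` returns an empty list when the split is complete.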

CSV Label Format

From README.md:46-50 and dataset.py:37:

video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4

Required Columns:

video_name: Video filename
Prompt: Text description/prompt
Traditional_MOS: Technical quality score (1-5)
Alignment_MOS: Text-video alignment score (1-5)
Aesthetic_MOS: Aesthetic appeal score (1-5)
Temporal_MOS: Temporal consistency score (1-5)
Overall_MOS: Overall quality score (1-5)
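The column requirements can be checked before training starts. This is a minimal sketch using only the standard library; `validate_labels` and its messages are illustrative names, and the 1-5 range check is taken from the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "video_name", "Prompt", "Traditional_MOS", "Alignment_MOS",
    "Aesthetic_MOS", "Temporal_MOS", "Overall_MOS",
]
MOS_COLUMNS = REQUIRED_COLUMNS[2:]


def validate_labels(csv_text):
    """Return a list of problems found in a labels CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in MOS_COLUMNS:
            try:
                score = float(row[col])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {col} is not numeric")
                continue
            if not 1.0 <= score <= 5.0:
                problems.append(f"line {line_no}: {col}={score} outside 1-5")
    return problems


sample = (
    "video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS\n"
    'video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65\n'
    'video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,6.0\n'
)
print(validate_labels(sample))  # ['line 3: Overall_MOS=6.0 outside 1-5']
```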

Frame Sampling

All videos are uniformly sampled to 64 frames:

Uniform Sampling

From dataset.py:82-98:

def _sample_frames_manual(self, video_path: Path) -> torch.Tensor:
    vr = VideoReader(str(video_path))  # Decord video reader
    total_frames = len(vr)
    
    # Sample frames uniformly. Both branches compute the same indices:
    # np.linspace repeats frames for short videos and skips frames for long ones.
    if total_frames <= self.num_frames:
        indices = np.linspace(0, total_frames - 1, self.num_frames).astype(int)
    else:
        indices = np.linspace(0, total_frames - 1, self.num_frames).astype(int)
    
    indices = np.clip(indices, 0, total_frames - 1)  # Guard against out-of-range indices
    frames = vr.get_batch(indices)  # Shape: (T, H, W, C)

Process:

Count total frames in video
Generate 64 evenly-spaced indices using np.linspace
Clip indices to valid range [0, total_frames-1]
Extract frames using Decord’s batch API

What if video has < 64 frames?

if total_frames <= self.num_frames:
    indices = np.linspace(0, total_frames - 1, self.num_frames).astype(int)

Behavior: Frames are repeated to reach 64
Example: 30-frame video → indices [0, 0, 0, 1, 1, 2, 2, 3, …, 28, 28, 29] (repeats are spread across the clip, not appended at the end)
Impact: Some frames duplicated, but maintains temporal structure

What if video has > 64 frames?

else:
    indices = np.linspace(0, total_frames - 1, self.num_frames).astype(int)

Behavior: Frames uniformly sampled across entire duration
Example: 240-frame video → one frame kept roughly every 3.8 frames (239/63 ≈ 3.79)
Impact: Captures full temporal extent without oversampling
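Both cases can be reproduced outside the dataset class. The sketch below applies the same `np.linspace(...).astype(int)` expression to a 30-frame and a 240-frame video; `uniform_indices` is a hypothetical stand-alone function, not the project's method.

```python
import numpy as np


def uniform_indices(total_frames, num_frames=64):
    """Evenly spaced frame indices, truncated to integers as in the excerpt."""
    return np.linspace(0, total_frames - 1, num_frames).astype(int)


short_video = uniform_indices(30)    # fewer frames than requested: indices repeat
long_video = uniform_indices(240)    # more frames than requested: indices skip

print(short_video[:8])                      # first few indices of the short video
print(len(set(short_video.tolist())))       # number of distinct source frames used
print(np.diff(long_video).min(), np.diff(long_video).max())  # gaps between samples
```

For the short video every source frame is used at least once. For the long video the gap between consecutive samples alternates between 3 and 4 frames.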

Design Rationale for 64 frames:

Temporal coverage: 64 consecutive frames equal about 2.1 seconds at 30 fps; sampled uniformly, they span the whole clip
GPU memory: Fits comfortably in modern GPUs
Model compatibility: Both DOVER++ and V-JEPA2 optimized for this
Pretrained weights: V-JEPA2 pretrained with 64 frames
Trade-off: Balance between temporal context and computational cost

From config.py:24,40:

DOVER_CONFIG = {"num_frames": 64}
VJEPA_CONFIG = {"num_frames": 64}

Resolution Adaptation

Different models require different input resolutions:

DOVER++ (640×640)

From dataset.py:64-69 and config.py:23:

# Transform pipeline
self.transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),  # DOVER++ resolution
    transforms.ToTensor(),  # Converts to [0,1] range
])

Process per frame:

Convert numpy array → PIL Image
Resize to 640×640 (torchvision's Resize uses bilinear interpolation by default)
Convert to PyTorch tensor (C, H, W)
Normalize to [0, 1] range

Final shape: (B, C=3, T=64, H=640, W=640)
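The layout change behind that final shape can be illustrated without torchvision. This sketch uses NumPy only, skips the resize step, and uses tiny dummy frames; `frames_to_clip` is a hypothetical name and the code is an illustration of the shapes, not the project's transform.

```python
import numpy as np


def frames_to_clip(frames_thwc):
    """Convert uint8 frames (T, H, W, C) to a float32 clip (C, T, H, W) in [0, 1]."""
    clip = frames_thwc.astype(np.float32) / 255.0   # same scaling ToTensor applies to uint8 input
    return np.transpose(clip, (3, 0, 1, 2))         # (T, H, W, C) -> (C, T, H, W)


frames = np.random.randint(0, 256, size=(64, 8, 8, 3), dtype=np.uint8)  # dummy 8x8 frames
clip = frames_to_clip(frames)
batch = np.stack([clip, clip])   # batching adds the leading B dimension
print(clip.shape, batch.shape)   # (3, 64, 8, 8) (2, 3, 64, 8, 8)
```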

V-JEPA2 (384×384)

From dataset.py:120-148 and config.py:39:

def _sample_frames_processor(self, video_path: Path) -> torch.Tensor:
    # ... frame sampling ...
    
    # Process with video processor
    processed = self.video_processor(
        list(frames_np),
        return_tensors="pt"
    )
    
    return processed["pixel_values"][0]  # (C, T, H=384, W=384)

AutoVideoProcessor handles:

Resize to 384×384
Normalize with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Convert to tensor format expected by ViT

Final shape: (B, C=3, T=64, H=384, W=384)
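The normalization step is a per-channel `(x - mean) / std`. The sketch below applies the ImageNet statistics quoted above to a tiny clip in NumPy; it assumes the (C, T, H, W) layout given in the excerpt's comment and is not the processor's actual code.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_clip(clip_cthw):
    """Apply per-channel (x - mean) / std to a (C, T, H, W) clip in [0, 1]."""
    mean = IMAGENET_MEAN.reshape(3, 1, 1, 1)   # broadcast over T, H, W
    std = IMAGENET_STD.reshape(3, 1, 1, 1)
    return (clip_cthw - mean) / std


gray = np.full((3, 4, 2, 2), 0.5, dtype=np.float32)   # tiny mid-gray clip
out = normalize_clip(gray)
print(out[:, 0, 0, 0])   # one normalized value per channel
```

After this step values are no longer confined to [0, 1]; a mid-gray pixel maps to a small positive number in each channel.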

Why Different Resolutions?

| Model | Resolution | Rationale |
|---|---|---|
| DOVER++ | 640×640 | Higher resolution preserves fine-grained quality details (artifacts, sharpness) |
| V-JEPA2 | 384×384 | ViT patch size (16×16) divides evenly; pretrained at this resolution |

Memory Impact:

640×640: ~2.8× more pixels than 384×384 (409,600 vs. 147,456 per frame)
DOVER++ is the lighter model (120M params), so total memory use is similar
V-JEPA2 compensates with parameter freezing
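The input-side cost can be worked out directly. The arithmetic below assumes float32 clips of 64 frames and 3 channels and counts only the input tensor, not activations or weights; `clip_mib` is a hypothetical helper.

```python
def clip_mib(height, width, frames=64, channels=3, bytes_per_value=4):
    """Size in MiB of one float32 clip of shape (channels, frames, height, width)."""
    return channels * frames * height * width * bytes_per_value / 2**20


pixel_ratio = (640 * 640) / (384 * 384)
print(f"pixel ratio: {pixel_ratio:.2f}")               # 2.78
print(f"640x640 clip: {clip_mib(640, 640):.0f} MiB")   # 300 MiB
print(f"384x384 clip: {clip_mib(384, 384):.0f} MiB")   # 108 MiB
```

At the configured batch sizes this gives roughly 1200 MiB of input per DOVER++ batch (4 clips) and 648 MiB per V-JEPA2 batch (6 clips).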

Text Processing with BGE-Large

Text prompts are encoded to dense embeddings:

Text Encoder

From config.py:25,41:

DOVER_CONFIG = {"text_encoder": "BAAI/bge-large-en-v1.5"}
VJEPA_CONFIG = {"text_encoder": "BAAI/bge-large-en-v1.5"}

BGE-Large (BAAI General Embedding):

Model: Sentence Transformer
Architecture: BERT-based encoder
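Dimension: 1024 (the output size of bge-large-en-v1.5; the 768 quoted for V-JEPA2 does not match this encoder and would need a downstream projection)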
Training: Contrastive learning on text pairs

Encoding Process

From dataset.py:258-273:

# Encode text during batching
with torch.no_grad():
    text_emb = self.text_encoder.encode(
        prompts,
        convert_to_tensor=True,
        normalize_embeddings=True,  # L2 normalization
        device=self.device,
        batch_size=len(prompts)
    )

Steps:

Tokenize text (WordPiece)
Pass through BERT encoder
Pool token embeddings (BGE models use the [CLS] token embedding)
L2 normalize to unit sphere
Move to GPU

Output: (B, 1024) tensor from the encoder (a 768-dimensional variant would require a projection)
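The effect of the L2 normalization step is easy to see on small vectors: once embeddings have unit length, a plain dot product equals cosine similarity. This is a pure-Python illustration with hypothetical helper names, not the encoder's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])      # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])      # -> [0.8, 0.6]
print(dot(a, a))                  # unit length: 1.0 up to rounding
print(dot(a, b))                  # cosine similarity of the original vectors
```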

Prompt Examples

prompts = [
    "A cat playing piano",
    "Sunset over mountains with vibrant colors",
    "A robot dancing in a futuristic city"
]

# Encoded to:
text_emb = encode(prompts)  # Shape: (3, 1024)

# Each prompt → 1024-dim vector capturing:
# - Objects: "cat", "piano", "sunset", "robot"
# - Actions: "playing", "dancing"
# - Attributes: "vibrant", "futuristic"
# - Scene: "mountains", "city"

MOS Score Normalization

From dataset.py:182-189:

if self.has_labels:
    # Training/validation mode - include labels
    labels = pd.to_numeric(row[self.MOS_COLS], errors="coerce").fillna(3.0).astype(np.float32).values
    result["labels"] = torch.tensor(labels, dtype=torch.float32)
else:
    # Test mode - no labels available
    result["labels"] = torch.zeros(5, dtype=torch.float32)

Normalization Process

Steps:

Parse: Convert CSV strings to numeric values
Handle missing: Fill NaN with 3.0 (neutral score)
Cast: Convert to float32 for GPU efficiency
Tensorize: Create PyTorch tensor

Range: 1.0 to 5.0 (MOS scale)

From README.md:56:

- **Quality Normalization**: MOS scores standardized to 1-5 scale

Missing Values

errors="coerce"  # Invalid values → NaN
.fillna(3.0)     # NaN → 3.0 (neutral)

Why 3.0?

Middle of 1-5 range
Neutral quality (neither good nor bad)
Prevents extreme bias from missing data

If many values are missing, consider:

Dropping samples with missing labels
Using different imputation (mean of dataset)
Flagging samples for review
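To decide between these options it helps to know how many values were imputed. The sketch below is a plain-Python imitation of the coerce-then-fill behaviour that also counts the filled cells; `coerce_mos` is a hypothetical helper and does not reproduce every pandas parsing rule.

```python
import math


def coerce_mos(raw_values, fill=3.0):
    """Parse raw CSV cells to floats; unparseable or NaN cells become `fill`.

    Returns (scores, n_filled) so callers can flag heavily imputed rows.
    """
    scores, n_filled = [], 0
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            value = math.nan
        if math.isnan(value):
            value, n_filled = fill, n_filled + 1
        scores.append(value)
    return scores, n_filled


scores, n_filled = coerce_mos(["3.2", "", "n/a", "4.1", None])
print(scores, n_filled)   # [3.2, 3.0, 3.0, 4.1, 3.0] 3
```

A row where `n_filled` is large is a candidate for dropping or manual review instead of silent imputation.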

# Test mode - no labels available
result["labels"] = torch.zeros(5, dtype=torch.float32)

For test sets without ground truth:

Dummy zeros used as placeholders
Loss not computed on test set
Only predictions saved

Data Loading Pipeline

TaobaoVDDataset Class

From dataset.py:29-190:

class TaobaoVDDataset(Dataset):
    MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']
    
    def __init__(self, 
                 csv_file: str,
                 video_dir: str,
                 num_frames: int = 64,
                 resolution: int = 640,
                 mode: str = 'train',
                 video_processor: Optional[AutoVideoProcessor] = None):
        self.df = pd.read_csv(csv_file)
        self.video_dir = Path(video_dir)
        self.num_frames = num_frames
        self.resolution = resolution
        self.mode = mode
        self.video_processor = video_processor
        
        # Video transforms
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((resolution, resolution)),
            transforms.ToTensor(),
        ])
        
        # Check if we have ground truth labels
        self.has_labels = all(col in self.df.columns for col in self.MOS_COLS)

Features:

Supports both manual transforms (DOVER++) and video processor (V-JEPA2)
Automatic label detection
Flexible resolution
Mode-aware (train/val/test)

GPU-Optimized Collation

From dataset.py:193-288:

OptimizedGPUCollate

class OptimizedGPUCollate:
    def __init__(self, 
                 video_processor: Optional[AutoVideoProcessor] = None,
                 text_encoder: Optional[SentenceTransformer] = None,
                 device: str = 'cuda',
                 max_frames: int = 64):
        self.video_processor = video_processor
        self.text_encoder = text_encoder
        self.device = device
        self.max_frames = max_frames

Purpose: Custom collation function that:

Batches video frames
Encodes text prompts on GPU
Moves data to GPU early
Cleans up memory

Batching Process

From dataset.py:226-288:

def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
    # Extract components
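    frames_list = [item["frames"] for item in batch]
    prompts = [item["prompt"] for item in batch]
    labels_list = [item["labels"] for item in batch]
    # Assumed key name; the excerpt omits how video_names is built
    video_names = [item["video_name"] for item in batch]
    
    # Stack frames and labels, then move them to the target device
    frames = torch.stack(frames_list, dim=0)
    pixel_values_videos = frames.to(self.device)
    labels = torch.stack(labels_list, dim=0).to(self.device)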
    
    # Encode text on GPU
    text_emb = self.text_encoder.encode(
        prompts,
        convert_to_tensor=True,
        normalize_embeddings=True,
        device=self.device,
    )
    
    # Clean up
    del frames_list
    torch.cuda.empty_cache()
    
    return {
        "pixel_values_videos": pixel_values_videos,
        "text_emb": text_emb,
        "labels": labels,
        "video_names": video_names,
        "prompts": prompts
    }

Key Optimizations:

Early GPU transfer: Move data to GPU in collate (before training loop)
Batch text encoding: Encode all prompts at once
Memory cleanup: Delete intermediate variables + empty cache
No-grad text: Text encoding doesn’t need gradients

Memory Optimization

# Clean up memory
del frames_list
torch.cuda.empty_cache()

This collation strategy reduces CPU-GPU transfer overhead by ~30% compared to naive batching.

DataLoader Configuration

From dataset.py:291-375:

Training Loader

train_loader = DataLoader(
    train_dataset,
    batch_size=4,              # DOVER++: 4, V-JEPA2: 6
    shuffle=True,              # Shuffle for training
    collate_fn=collate_fn,     # Custom GPU collation
    num_workers=4,             # Parallel data loading
    pin_memory=True,           # Faster CPU→GPU transfer
    persistent_workers=True    # Keep workers alive
)

From config.py:31,47:

DOVER_CONFIG = {"batch_size": 4}
VJEPA_CONFIG = {"batch_size": 6}

Validation Loader

val_loader = DataLoader(
    val_dataset,
    batch_size=4,
    shuffle=False,             # No shuffle for validation
    collate_fn=collate_fn,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True
)

Difference from training:

shuffle=False: Consistent evaluation order
Same batch size for fair comparison

Test Loader

From dataset.py:378-433:

test_loader = DataLoader(
    test_dataset,
    batch_size=1,              # Process one at a time
    shuffle=False,             # Preserve order
    collate_fn=collate_fn,
    num_workers=0,             # Load in the main process (no worker subprocesses)
    pin_memory=True
)

Test-specific settings:

batch_size=1: Avoid OOM on diverse test videos
num_workers=0: Simpler debugging
shuffle=False: Match predictions to input order

Worker Configuration

From config.py:73-75:

TRAINING_CONFIG = {
    "num_workers": 4,
    "pin_memory": True,
    "persistent_workers": True
}

Settings Explained:

| Setting | Value | Purpose |
|---|---|---|
| `num_workers` | 4 | Parallel video loading (CPU cores) |
| `pin_memory` | True | Faster CPU→GPU transfer (page-locked memory) |
| `persistent_workers` | True | Workers stay alive between epochs (faster) |

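num_workers > 0 loads batches in separate worker processes, which requires working multiprocessing support. On some systems (especially with Decord) this causes crashes or hangs; fall back to num_workers=0 if you encounter issues. Because the collate function moves tensors to the GPU, worker processes may also need the spawn start method, since CUDA cannot be re-initialized in a forked subprocess.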
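A small helper can keep these settings consistent when switching between parallel loading and a single-process fallback. This is a sketch with hypothetical function names; the one hard constraint it encodes is that PyTorch's DataLoader rejects `persistent_workers=True` when `num_workers=0`.

```python
import os


def pick_num_workers(requested=4, debug=False):
    """Cap the requested worker count at the number of available CPU cores."""
    if debug:
        return 0                      # load in the main process
    cores = os.cpu_count() or 1       # cpu_count() can return None
    return max(0, min(requested, cores))


def loader_kwargs(num_workers):
    """DataLoader keyword arguments consistent with the worker count."""
    return {
        "num_workers": num_workers,
        "pin_memory": True,
        # persistent_workers is only valid when worker processes exist
        "persistent_workers": num_workers > 0,
    }


print(loader_kwargs(pick_num_workers()))
print(loader_kwargs(pick_num_workers(debug=True)))
```

The resulting dictionary can be unpacked into the loader call, for example `DataLoader(train_dataset, batch_size=4, **loader_kwargs(n))`.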

Complete Data Flow

CSV Loading

df = pd.read_csv(csv_file)
# Columns: video_name, Prompt, Traditional_MOS, ..., Overall_MOS

Video Reading

vr = VideoReader(str(video_path))
frames = vr.get_batch(indices)  # (64, H, W, 3)

Frame Transformation

transformed = [transform(frame) for frame in frames]  # Resize + ToTensor, each (3, H, W)
video_tensor = torch.stack(transformed, dim=1)  # Stack along a new time axis: (3, 64, H, W)

Batching

batch_frames = torch.stack([item['frames'] for item in batch])  # (B, 3, 64, H, W)
batch_frames = batch_frames.to('cuda')

Text Encoding

text_emb = text_encoder.encode(prompts, device='cuda')  # (B, 1024)

Label Extraction

labels = torch.tensor(mos_scores, dtype=torch.float32)  # (B, 5)

Return Batch

return {
    "pixel_values_videos": batch_frames,
    "text_emb": text_emb,
    "labels": labels,
    "prompts": prompts,
    "video_names": video_names
}

Performance Considerations

Common bottlenecks in video data loading:

Disk I/O: Reading video files from disk
- Solution: Use SSD storage, prefetch with num_workers
Video decoding: Decompressing video codecs
- Solution: Decord with GPU acceleration
Frame transformation: Resizing and converting
- Solution: Batch transforms, use GPU if possible
Text encoding: BERT forward pass
- Solution: Encode in collate_fn (parallel with training)
CPU→GPU transfer: Moving data to GPU
- Solution: pin_memory=True, early GPU transfer

Data Loading Best Practices:

✅ DO:

Use SSD for video storage
Set num_workers=4 (match CPU cores)
Enable pin_memory=True
Use persistent_workers=True
Encode text on GPU in collate_fn
Clean up memory after batching

❌ DON’T:

Load full videos into memory
Encode text on CPU then transfer
Use num_workers > CPU cores
Forget to call empty_cache()
Decode videos to RGB unnecessarily

Measure data loading speed:

import time

start = time.perf_counter()
for batch in train_loader:
    # Time to fetch the first batch (includes worker start-up)
    print(f"Batch loaded in {time.perf_counter() - start:.2f}s")
    break

Target: Less than 0.5s per batch (should be faster than model forward pass)
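A single batch is a noisy measurement because the first fetch includes worker start-up. The sketch below averages over several batches and skips a warm-up batch; `time_batches` and `fake_loader` are hypothetical names, and the fake loader only stands in for a real DataLoader so the example runs on its own.

```python
import time


def time_batches(loader, num_batches=10, warmup=1):
    """Return the mean seconds per batch, ignoring the first `warmup` batches."""
    durations = []
    start = time.perf_counter()
    for i, _batch in enumerate(loader):
        now = time.perf_counter()
        if i >= warmup:
            durations.append(now - start)
        start = now
        if len(durations) >= num_batches:
            break
    return sum(durations) / len(durations) if durations else float("nan")


def fake_loader(n=5, delay=0.01):
    """Stand-in for a DataLoader: yields n dummy batches with a fixed delay."""
    for i in range(n):
        time.sleep(delay)
        yield {"index": i}


mean_s = time_batches(fake_loader())
print(f"{mean_s:.3f}s per batch")
```

With a real loader, `time_batches(train_loader)` gives a figure that can be compared against the 0.5 s target.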

Related Topics

Architecture: How preprocessed data flows through models
DOVER++ Model: 640×640 resolution requirements
V-JEPA2 Model: 384×384 resolution and video processor
