The following is one of possibly many posts I'll be using to consolidate contributions to LinkedIn's Quora-esque collaborative articles, wherein contributors can answer questions and respond to or upvote other answers. I really like the feature, especially the fact that each contribution is capped at 750 characters (so no rambling).
For some questions, I have added footnotes that go beyond my 750-character contribution to the topic itself. I appreciate that I can share my initial thoughts on LinkedIn and expand upon them here.
I will be testing out the feature over the coming weeks, consolidating the answers I write on LinkedIn into blog posts on my personal website every Sunday.
End-User Feedback for System Upgrades
Q: End-users are unhappy with system upgrades. How can you address their feedback effectively?
The quality users appreciate most in an operating system is seamless integration, i.e. letting them perform tasks without interference. Ensuring OS upgrades don't compromise this functionality is crucial.
The gap between the minimum and recommended requirements to run an operating system is often significant. Developers should ensure the minimum meets two criteria:
- It's not significantly behind the average system in use.
- It doesn't severely limit the device's capabilities.
For example, iOS 9 made the iPhone 4S excessively slow, to the point of being unusable. Conversely, Windows 11's TPM 2.0 minimum requirement excluded many relatively recent devices that otherwise had all the necessary capabilities.
Programming Language Wars
Q: Developers clash over preferred programming languages. Which one will reign supreme in your coding standards?
"When you're given a hammer, everything looks like a nail."
In software development, there is no one-size-fits-all programming language—it’s essential to understand the specific needs of a project and choose the technology that best aligns with those goals.
The languages I primarily use are C (or C++) and Rust. Rust brings advanced memory safety, a better developer experience, and stronger security, while C offers the benefit of mature tooling due to its long history. However, for tasks like data analysis, languages like Python or R are preferred by data scientists for their specialized libraries.
Ultimately, understanding the project needs and selecting the right tools is key to success.
(See Footnote 1)
Optimizing During Peak Usage Periods
Q: You're facing peak usage periods. How do you keep system performance strategies in check?
Peak usage demands proactive, layered strategies. First, robust monitoring is essential: real-time metrics on CPU, memory, network, and disk I/O provide early warnings. Load balancing distributes traffic, preventing single-point overloads. Caching frequently accessed data minimizes database hits.
Scaling strategies are critical. Horizontal scaling adds servers; vertical scaling upgrades existing ones.
Auto-scaling, based on predefined thresholds, ensures responsiveness. Code optimization and database indexing improve efficiency. Finally, regular load testing simulates peak conditions, validating strategies and identifying bottlenecks before they impact users.
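To make the caching point concrete, here is a minimal Python sketch of a time-to-live (TTL) cache placed in front of a hot query. The fetch_user_profile function and the 60-second window are hypothetical placeholders, not a production recipe.
import time
import functools

def ttl_cache(ttl_seconds=30):
    # Decorator: cache results for ttl_seconds to avoid repeated backend hits.
    def decorator(func):
        cache = {}
        @functools.wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            entry = cache.get(args)
            if entry is not None and now - entry[1] < ttl_seconds:
                return entry[0]  # still fresh: serve from cache
            result = func(*args)  # stale or missing: hit the backend
            cache[args] = (result, now)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=60)
def fetch_user_profile(user_id):
    # Hypothetical expensive database query would go here.
    return {"id": user_id}
During a traffic spike, repeated calls within the TTL window are served from memory instead of hitting the database; the trade-off is bounded staleness, which pairs naturally with the monitoring and load testing described above.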
ELI5: Software Updates
Q: You need to explain software updates to a non-technical team. How do you make it clear and concise?
Think of software updates like home maintenance. They patch leaky roofs (bugs), add new rooms (features), and install stronger locks (security). This keeps our digital home safe and running smoothly.
Updates prevent digital mishaps and reduce IT downtime. Less downtime means less frustration and more time for your work. We schedule these updates at times that minimize disruption, like doing yard work on a quiet afternoon, so your workflow stays consistent.
We believe in transparency. We'll always let you know when updates are coming and what they'll do. We're working to make our systems reliable, so you can focus on your tasks without worrying about technical hiccups.
OSes: Stability vs. Feature Support
Q: You're delaying features to keep system stability intact. How do you handle user dissatisfaction?
From a Linux perspective, various distributions provide distinct release cycles to address such scenarios. For instance, Ubuntu ships interim releases every six months and a Long Term Support (LTS) release every two years.
On the other hand, distributions such as Fedora, Arch, or Alpine are specifically known for being at the forefront of every new feature.
Tech-savvy users and developers are frequently encouraged to run the feature-forward distributions so that breaking changes surface and get incorporated early in the development process. Conversely, mission-critical equipment and servers are often advised to run LTS releases, where updates are deployed only after the bugs have been worked out.
Onboarding for a Blockchain Project
Q: You're onboarding clients to a blockchain project. How do you address their data privacy concerns?
Many mistakenly associate blockchain exclusively with cryptocurrencies, leading to data privacy concerns. To address this, we must:
- Separate blockchain from cryptocurrency, explaining its versatile uses beyond digital currency, such as supply chain tracking, secure voting systems, and immutable document verification.
- Emphasize encryption and anonymization, ensuring sensitive data is encrypted and masked while maintaining data integrity.
- Highlight permissioned blockchains’ role in controlling data visibility and enforcing privacy rules.
- Point to blockchain's audit trails for verifiable and immutable data transaction histories (a toy sketch of the idea follows this list).
- Illustrate secure data management with real-world examples.
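As a toy illustration of the audit-trail point (not a production blockchain, and not tied to any particular client project), here is the core hash-linking idea in Python: each block commits to the hash of the previous one, so any edit to history breaks the chain. The append_block and verify names are mine, for illustration only.
import hashlib
import json
import time

def append_block(chain, record):
    # Each block commits to the previous block's hash, so editing an
    # earlier record invalidates every later link.
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"record": record, "timestamp": time.time(), "prev_hash": prev_hash}
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(block)
    return block

def verify(chain):
    # Recompute every hash; any tampering breaks the chain.
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else "0" * 64
        if block["prev_hash"] != expected_prev:
            return False
        body = {k: v for k, v in block.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != block["hash"]:
            return False
    return True

chain = []
append_block(chain, {"doc": "invoice-001", "owner_hash": "ab12"})
append_block(chain, {"doc": "invoice-002", "owner_hash": "cd34"})
print(verify(chain))  # True until any earlier record is modified
Note that only hashes of sensitive data need to go on-chain; the data itself can stay encrypted off-chain, which is the privacy point above.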
Data Scraping for Real-Time Datasets
Q: You're prioritizing accuracy in real-time data processing. How do you tackle performance issues?
Striking a balance between caching and freshness is crucial for maintaining performance. Crawlers cannot be perfectly accurate while also running at full capacity, so prioritizing critical data streams is essential.
For example, on one of my team projects at UMass, ASSERT: AI-Supported Smart Electricity Restoration Tool (https://suobset.github.io/assert), we faced challenges working with Open Government Data. We created diverse crawlers and scripts, including mouse bots and JavaScript scrapers that update cached data dynamically when it changes by a specified percentage.
We also used load- and latency-balancing techniques to optimize long-term data scraping and server usage across terabytes of data, which is crucial to avoid overwhelming the source servers.
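As a minimal Python sketch of that percentage-change rule (the names and the 5% default are illustrative, not the actual ASSERT code):
def should_refresh(cached_value, new_value, threshold_pct=5.0):
    # Propagate an update only when the value has moved enough to matter.
    if cached_value == 0:
        return new_value != 0
    change_pct = abs(new_value - cached_value) / abs(cached_value) * 100
    return change_pct >= threshold_pct
Gating updates this way keeps crawlers from rewriting the cache on every tiny fluctuation, saving both our compute and the source servers' bandwidth.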
(See Footnote 2)
Footnotes
- I experimented with loading the handwriting-recognition MNIST database in C vs. Python, to showcase how awful C is for data science purposes. Do keep in mind that all of this is brute force and incredibly surface-level:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Basic analysis: Display the shape and a sample image
print("Training images shape:", train_images.shape)
print("Training labels shape:", train_labels.shape)
plt.imshow(train_images[0], cmap='gray')
plt.title(f"Label: {train_labels[0]}")
plt.show()
# Flatten and normalize the images
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255
# Basic analysis: Display the min and max pixel values after normalization.
print("Min pixel value: ", np.min(train_images))
print("Max pixel value: ", np.max(train_images))
# Basic analysis: Display the count of each label.
unique_labels, counts = np.unique(train_labels, return_counts=True)
print("Label counts:", dict(zip(unique_labels, counts)))
The Python code leverages popular libraries like numpy, matplotlib, and tensorflow to load the MNIST dataset and quickly perform basic operations, such as reshaping the data, normalizing pixel values, and visualizing a sample image. Python's rich ecosystem of libraries, such as TensorFlow and Keras, significantly simplifies these tasks.
Benefits of Python for Data Science:
- Ease of use: Python is known for its simplicity and readability, making it an ideal choice for rapid development.
- Extensive libraries: With libraries like NumPy, Pandas, Matplotlib, and TensorFlow, Python is well-suited for data manipulation, analysis, and machine learning.
- Community support: The vast number of resources and tutorials available online helps developers quickly find solutions to common problems.
- Integration with ML frameworks: Python integrates seamlessly with powerful machine learning libraries, such as TensorFlow, Keras, and PyTorch.
Cons of Python:
- Slower execution: Python's performance may not be ideal for tasks requiring high computational efficiency, especially for large datasets or complex algorithms.
- Memory consumption: Python's high-level nature can lead to higher memory usage, which might not be optimal in resource-constrained environments.
- Global Interpreter Lock (GIL): For multi-threaded programs, Python's GIL can limit performance, particularly for CPU-bound tasks.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// Function to reverse bytes (MNIST data is big-endian)
uint32_t reverseInt(uint32_t i) {
    unsigned char c1, c2, c3, c4;
    c1 = i & 0xFF;
    c2 = (i >> 8) & 0xFF;
    c3 = (i >> 16) & 0xFF;
    c4 = (i >> 24) & 0xFF;
    return ((uint32_t)c1 << 24) + ((uint32_t)c2 << 16) + ((uint32_t)c3 << 8) + c4;
}

int main() {
    // File paths for MNIST data
    const char *image_file = "train-images.idx3-ubyte";
    const char *label_file = "train-labels.idx1-ubyte";

    FILE *img_fp = fopen(image_file, "rb");
    FILE *lbl_fp = fopen(label_file, "rb");
    if (!img_fp || !lbl_fp) {
        perror("Error opening files");
        return 1;
    }

    // Read image header
    uint32_t magic_img, num_images, rows, cols;
    fread(&magic_img, sizeof(magic_img), 1, img_fp);
    fread(&num_images, sizeof(num_images), 1, img_fp);
    fread(&rows, sizeof(rows), 1, img_fp);
    fread(&cols, sizeof(cols), 1, img_fp);
    magic_img = reverseInt(magic_img);
    num_images = reverseInt(num_images);
    rows = reverseInt(rows);
    cols = reverseInt(cols);
    printf("Image Magic Number: %u\n", magic_img);
    printf("Number of Images: %u\n", num_images);
    printf("Rows: %u\n", rows);
    printf("Columns: %u\n", cols);

    // Read label header
    uint32_t magic_lbl, num_labels;
    fread(&magic_lbl, sizeof(magic_lbl), 1, lbl_fp);
    fread(&num_labels, sizeof(num_labels), 1, lbl_fp);
    magic_lbl = reverseInt(magic_lbl);
    num_labels = reverseInt(num_labels);
    printf("Label Magic Number: %u\n", magic_lbl);
    printf("Number of Labels: %u\n", num_labels);

    if (num_images != num_labels) {
        fprintf(stderr, "Image and label count mismatch!\n");
        fclose(img_fp);
        fclose(lbl_fp);
        return 1;
    }

    // Read and analyze the first image
    uint8_t *image_data = (uint8_t *)malloc(rows * cols);
    uint8_t label;
    if (!image_data) {
        perror("Memory allocation failed");
        fclose(img_fp);
        fclose(lbl_fp);
        return 1;
    }
    fread(image_data, sizeof(uint8_t), rows * cols, img_fp);
    fread(&label, sizeof(uint8_t), 1, lbl_fp);
    printf("First image label: %u\n", label);

    // Basic analysis: min/max pixel values of the first image
    int min = 255, max = 0;
    for (uint32_t i = 0; i < rows * cols; i++) {
        if (image_data[i] < min) min = image_data[i];
        if (image_data[i] > max) max = image_data[i];
    }
    printf("Min pixel: %d, Max pixel: %d\n", min, max);

    // Basic label counting
    int labelCounts[10] = {0};
    fseek(lbl_fp, 8, SEEK_SET); // reset file pointer to the beginning of label data
    for (uint32_t i = 0; i < num_labels; i++) {
        fread(&label, sizeof(uint8_t), 1, lbl_fp);
        labelCounts[label]++;
    }
    for (int i = 0; i < 10; i++) {
        printf("Count of label %d: %d\n", i, labelCounts[i]);
    }

    free(image_data);
    fclose(img_fp);
    fclose(lbl_fp);
    return 0;
}
In contrast, the C code demonstrates a more hands-on approach to loading the MNIST dataset. It involves directly reading and parsing binary files, implementing byte reversal for big-endian data, and manually allocating memory for the image data. While C offers more control over memory and performance, the complexity of the code quickly increases for even basic operations.
Benefits of C:
- Performance: C typically offers faster execution times, as it is a compiled language with lower-level access to system resources.
- Memory control: C allows for fine-grained memory management, which can be crucial for memory-intensive tasks.
- Optimization: Developers can optimize their code to run efficiently on specific hardware by controlling memory allocation, data structures, and processing logic.
Cons of C:
- Complexity: C requires more boilerplate code to perform even simple tasks. For example, loading the MNIST dataset in C involves managing memory allocation and dealing with low-level file handling.
- Lack of high-level abstractions: Unlike Python, C does not provide built-in functions or libraries for high-level data manipulation, visualization, or machine learning, making it much harder to implement data science workflows.
- Error-prone: Without the safety checks provided by higher-level languages, C programs are more prone to errors such as memory leaks, buffer overflows, and pointer issues.
- Longer development time: Due to its low-level nature, C requires more effort to achieve the same results as Python, which can hinder productivity and lead to longer development cycles.
Ultimately, while C may be more suited for performance-critical applications, Python's high-level nature and vast ecosystem make it the preferred language for data science tasks.
There is no "one perfect programming language".
- Most of our data crawlers and scrapers for the ASSERT project were written in JavaScript, using DOM manipulation to simulate clicks or scrape download URLs. The exception was a GIS-based tool written in, and served using, PHP. That tool had no properly labeled buttons, and the only way to download everything was to create a bot that would take over the mouse, click on each option (via coordinates on the screen), and then click the download button. The code for this MouseBot is linked here.
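The actual MouseBot isn't reproduced here, but the coordinate-click approach looks roughly like the following Python sketch using the pyautogui library. The coordinates and timings are purely illustrative, not taken from the project, and depend entirely on screen resolution and the page's layout.
import time
import pyautogui  # drives the real mouse cursor

# Illustrative coordinates, as if found by inspecting the GIS tool's
# layout at a fixed screen resolution; not from the actual project.
OPTION_COORDS = [(420, 310), (420, 340), (420, 370)]
DOWNLOAD_BUTTON = (900, 650)

for x, y in OPTION_COORDS:
    pyautogui.click(x, y)              # select the next dataset option
    time.sleep(1.0)                    # give the page time to update
    pyautogui.click(*DOWNLOAD_BUTTON)  # trigger the download
    time.sleep(5.0)                    # wait for the download to begin
Screen-coordinate automation like this is brittle by design, which is exactly why DOM-based scraping was preferred everywhere the page structure allowed it.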