YCombinator-Scraper provides a web scraping tool for extracting data from the Work at a Startup website. The package uses Selenium and BeautifulSoup to navigate pages and extract information.
Documentation: https://nneji123.github.io/ycombinator-scraper
Source Code: https://github.com./nneji123/ycombinator-scraper
Scrape public LinkedIn profile data at scale with Proxycurl APIs.
- Scraping public profiles is battle-tested in court (hiQ v. LinkedIn).
- GDPR, CCPA, SOC2 compliant.
- High rate limit - 300 requests/minute.
- Fast - APIs respond in ~2s.
- Fresh data - 88% of data is scraped in real time; the other 12% is no older than 29 days.
- High accuracy.
- Tons of data points returned per profile.
Built for developers, by developers.
Web Scraping Capabilities:
- Extract detailed information about companies, including name, description, tags, images, job links, and social media links.
- Scrape job-specific details such as title, salary range, tags, and description.
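For example, a job posting can be scraped from Python in much the same way as a company page. A minimal sketch follows; the scrape_job_data method name and the example job URL are assumptions based on the scrape-job CLI command, so check the documentation for the exact API:
from ycombinator_scraper import Scraper

scraper = Scraper()
# Assumed method name mirroring the scrape-job CLI command; the real API may differ.
job_data = scraper.scrape_job_data("https://www.workatastartup.com/jobs/example-job")
# Scraped data is returned as a Pydantic model, so it can be serialized directly.
print(job_data.model_dump_json(by_alias=True, indent=2))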
Founder and Company Data Extraction:
- Obtain information about company founders, including name, image, description, linkedIn profile, and optional email addresses.
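A hedged sketch of pulling founder details through the Python API follows; scrape_founders_data is an assumed method name mirroring the scrape-founders CLI command, and the return value is assumed here to be a list of founder models:
from ycombinator_scraper import Scraper

scraper = Scraper()
# Assumed method name mirroring the scrape-founders CLI command.
founders = scraper.scrape_founders_data("https://www.workatastartup.com/companies/example-inc")
for founder in founders:
    print(founder.model_dump_json(by_alias=True, indent=2))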
Headless Mode:
- Run the scraper in headless mode to perform web scraping without displaying a browser window.
Configurability:
- Easily configure scraper settings such as login credentials, the logs directory, and automatic webdriver installation (via the webdriver-manager package) using environment variables or a configuration file.
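As an illustration of environment-variable configuration, the sketch below sets a few values before constructing the scraper; the variable names are hypothetical placeholders, not the package's documented settings, so consult the documentation for the exact names, including the one that toggles headless mode:
import os

from ycombinator_scraper import Scraper

# Hypothetical setting names, shown only to illustrate configuring the
# scraper through the environment before it starts.
os.environ["YCOMBINATOR_SCRAPER_USERNAME"] = "your-workatastartup-username"
os.environ["YCOMBINATOR_SCRAPER_PASSWORD"] = "your-password"
os.environ["YCOMBINATOR_SCRAPER_HEADLESS"] = "true"  # run without a browser window
os.environ["YCOMBINATOR_SCRAPER_LOGS_DIR"] = "./logs"

scraper = Scraper()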
Command-Line Interface (CLI):
- Command-line tools to perform various scraping tasks interactively or in batch mode.
Data Output Formats:
- Save scraped data in JSON or CSV format, providing flexibility for further analysis or integration with other tools.
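Because the scraped objects are Pydantic models, you can also flatten them into CSV yourself with pandas. This is a minimal sketch using the standard model_dump() and pandas APIs; the package's built-in CSV export may expose its own options:
import pandas as pd

from ycombinator_scraper import Scraper

scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")

# model_dump() converts the Pydantic model to a plain dict,
# which pandas can write out as a single-row CSV file.
df = pd.DataFrame([company_data.model_dump(by_alias=True)])
df.to_csv("company_data.csv", index=False)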
Caching Mechanism:
- Implement a caching feature to store function results for a specified duration, reducing redundant web requests and improving performance.
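The idea is the same as a small time-to-live cache. The sketch below illustrates the general pattern only and is not the package's actual caching implementation or interface:
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Cache a function's results for `seconds` seconds (illustration only)."""
    def decorator(func):
        store = {}
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            if args in store and now - store[args][0] < seconds:
                return store[args][1]      # reuse a recent result
            result = func(*args)
            store[args] = (now, result)    # refresh the cached entry
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=3600)
def fetch_page(url: str) -> str:
    ...  # the expensive web request would go here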
Docker Support:
- Package the scraper as a Docker image for easy deployment and execution in containerized environments, or pull the prebuilt image with docker pull nneji123/ycombinator_scraper.
- Python 3.9+
- Chrome or Chromium browser installed.
$ pip install ycombinator-scraper
$ ycscraper --help
# Output
YCombinator-Scraper Version 0.7.0
Usage: python -m ycombinator_scraper [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
login
scrape-company
scrape-founders
scrape-job
version
$ git clone https://github.com./nneji123/ycombinator-scraper
$ cd ycombinator-scraper
$ docker build -t your_name/scraper_name . # e.g. docker build -t nneji123/ycombinator_scraper .
$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper --help
- click: Enables the creation of a command-line interface for interacting with the scraper tool.
- beautifulsoup4: Facilitates the parsing and extraction of data from HTML and XML in the web scraping process.
- loguru: Provides a robust logging framework to track and manage log messages generated during the scraping process.
- pandas: Utilized for the manipulation and organization of data, particularly in generating CSV files from scraped information.
- pathlib: Offers an object-oriented approach to handle file system paths, contributing to better file management within the project.
- pydantic: Used for data validation and structuring the models that represent various aspects of scraped data.
- pydantic-settings: Extends Pydantic to enhance the management of settings in the project.
- selenium: Employs browser automation for web scraping, allowing interaction with dynamic web pages and extraction of information.
ycscraper scrape-company --company-url https://www.workatastartup.com/companies/example-inc
This command will scrape data for the specified company and save it in the default output format (JSON).
from ycombinator_scraper import Scraper
scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")
print(company_data.model_dump_json(by_alias=True, indent=2))
Pydantic is used under the hood, so methods like model_dump_json are available for all the scraped data.
You can view more examples here: Examples
We welcome contributions from the community! To contribute to this project, follow the steps below.
You can use Gitpod, a free online VS Code-like environment, to quickly start contributing.
- Clone the repository:
git clone https://github.com./nneji123/ycombinator-scraper.git
cd ycombinator-scraper
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Make sure to run tests before submitting a pull request.
pip install -r requirements-test.txt
pytest tests
If you make changes to documentation, install the necessary dependencies:
pip install -r requirements-docs.txt
mkdocs serve
We use pre-commit to ensure code quality. Install it by running:
pip install pre-commit
pre-commit install
Now, pre-commit will run automatically before each commit to check for linting and other issues.
- Fork the repository and create a new branch for your contribution:
git checkout -b feature-or-fix-branch
- Make your changes and commit them:
git add .
git commit -am "Your meaningful commit message"
- Push the changes to your fork:
git push origin feature-or-fix-branch
- Open a pull request on GitHub. Provide a clear title and description of your changes.
The documentation is made with Material for MkDocs and is hosted by GitHub Pages.
YCombinator-Scraper is distributed under the terms of the MIT license.