Read shapefile from HDFS with geopandas

I have a shapefile on HDFS and I would like to load it into my Jupyter Notebook with geopandas (version 0.8.1).
I tried the standard read_file() method, but it does not recognize the HDFS path. I believe it looks in the local filesystem instead: when I tested it with a local directory, it read the shapefile correctly.

This is the code I used:

import geopandas as gpd

shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')

and the error I obtained:

---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()

fiona/_err.pyx in fiona._err.exc_wrap_pointer()

CPLE_OpenFailedError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory

During handling of the above exception, another exception occurred:

DriverError                               Traceback (most recent call last)
<ipython-input-17-3118e740e4a9> in <module>
----> 2 shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')
      3 print(shp.shape)
      4 shp.head(3)

/opt/venv/geocoding/lib/python3.6/site-packages/geopandas/io/file.py in _read_file(filename, bbox, mask, rows, **kwargs)
     94 
     95     with fiona_env():
---> 96         with reader(path_or_bytes, **kwargs) as features:
     97 
     98             # In a future Fiona release the crs attribute of features will

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/env.py in wrapper(*args, **kwargs)
    398     def wrapper(*args, **kwargs):
    399         if local._env:
--> 400             return f(*args, **kwargs)
    401         else:
    402             if isinstance(args[0], str):

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/__init__.py in open(fp, mode, driver, schema, crs, encoding, layer, vfs, enabled_drivers, crs_wkt, **kwargs)
    255         if mode in ('a', 'r'):
    256             c = Collection(path, mode, driver=driver, encoding=encoding,
--> 257                            layer=layer, enabled_drivers=enabled_drivers, **kwargs)
    258         elif mode == 'w':
    259             if schema:

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/collection.py in __init__(self, path, mode, driver, schema, crs, encoding, layer, vsi, archive, enabled_drivers, crs_wkt, ignore_fields, ignore_geometry, **kwargs)
    160             if self.mode == 'r':
    161                 self.session = Session()
--> 162                 self.session.start(self, **kwargs)
    163             elif self.mode in ('a', 'w'):
    164                 self.session = WritingSession()

fiona/ogrext.pyx in fiona.ogrext.Session.start()

fiona/_shim.pyx in fiona._shim.gdal_open_vector()

DriverError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory



So, I was wondering whether it is actually possible to read a shapefile, stored in HDFS, with geopandas. If yes, how?

Answers

If someone is still looking for an answer to this question, I managed to find a workaround.

First of all, you need a .zip file which contains all the data related to your shapefile (.shp, .shx, .dbf, ...). Then, we use pyarrow to establish a connection to HDFS and fiona to read the zipped shapefile.
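
In case the shapefile is not zipped yet, here is a minimal sketch of how the archive could be built locally with the standard library before copying it to HDFS. The function name zip_shapefile and the file names are hypothetical; the sketch assumes all component files share the same stem and sit in the same directory as the .shp file.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path, zip_path):
    """Bundle a .shp file and its sidecar files into one zip archive.

    Hypothetical helper: assumes every component (.shx, .dbf, .prj, ...)
    lives next to the .shp file and shares its stem.
    """
    shp_path = Path(shp_path)
    prefix = shp_path.stem + "."
    # Collect every file named "<stem>.<something>", skipping zip archives
    # so that an earlier output file is never packed into the new one.
    members = sorted(
        p for p in shp_path.parent.iterdir()
        if p.is_file()
        and p.name.startswith(prefix)
        and p.suffix.lower() != ".zip"
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for member in members:
            # Store files at the archive root, so the layer name inside
            # the zip is simply the file name.
            archive.write(member, arcname=member.name)
    return [m.name for m in members]
```

The resulting archive can then be copied to HDFS, for example with hdfs dfs -put.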

Package versions I'm using:

  • pyarrow==2.0.0
  • fiona==1.8.18

The code:

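# import packages
import geopandas as gpd
import fiona
import pyarrow

# establish a connection to HDFS
fs = pyarrow.hdfs.connect()

# read the zipped shapefile into memory and close the HDFS file handle afterwards
with fs.open('hdfs://my_hdfs_directory/my_zipped_shapefile.zip') as f:
    with fiona.io.ZipMemoryFile(f.read()) as z:
        # open the .shp layer stored inside the archive
        with z.open('my_shp_file_within_zip.shp') as collection:
            # from_features() does not carry over the CRS, so pass it explicitly
            gdf = gpd.GeoDataFrame.from_features(collection, crs=collection.crs_wkt)
            print(gdf.shape)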
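
The name passed to z.open() has to match the .shp entry inside the archive. If you are not sure what it is called, a small sketch like the following can list the candidates from the raw zip bytes using only the standard library. The function name find_shp_members is hypothetical.

```python
import io
import zipfile


def find_shp_members(zip_bytes):
    """Return the names of all .shp entries in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Compare case-insensitively, since some tools write ".SHP"
        return [
            name for name in archive.namelist()
            if name.lower().endswith(".shp")
        ]
```

With the HDFS connection from the code above, the bytes would presumably come from fs.open(...).read(), and each returned name can be passed to z.open().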

Posted on by Ric S