Skip to content

Loaders

lahuta.core._loaders

Classes to load and manage biological structure data.

The two main classes are GemmiLoader and TopologyLoader, each designed to work with a specific library, Gemmi and MDAnalysis respectively, to load and handle structure data.

Both loaders extend from an abstract base class BaseLoader which outlines the necessary methods and properties all loader classes should have.

The module also provides a unified way to manage atoms, residues, and chains information through the ARC class instance. It includes various methods for querying the data and converting them into other common bioinformatics and chemoinformatics formats.

Classes:

Name Description
BaseLoader

Abstract base class that provides a template for the loader classes.

GemmiLoader

Class to load and manage biological structure data using the Gemmi library.

TopologyLoader

Class to load and manage biological structure data using the MDAnalysis library.

BaseLoader

Bases: ABC

Abstract base class providing a blueprint for loading and handling biological structure data.

BaseLoader forms the basis for other loader classes in the lahuta library, dedicated to handling different formats of biological structural data. It includes the initialization of atomic, residue, and chain data as well as their transformation into different formats.

The class takes a file path as an input and reads the biological structure data. It offers various properties to access the data, as well as methods to convert the loaded data into different output formats. It also provides an iterator to iterate over atoms, residues, and chains.

Parameters:

Name Type Description Default
file_path str

The file path from which to load the biological structure data.

required

Attributes:

Name Type Description
file_path str

The file path from which the biological structure data is loaded.

_chains Optional[Chains]

An instance of Chains class containing chain data.

_residues Optional[Residues]

An instance of Residues class containing residue data.

_atoms Optional[Atoms]

An instance of Atoms class containing atom data.

structure

A structure object containing biological structure data, currently initialized to None.

arc Optional[ARC]

An instance of ARC class providing combined access to atoms, residues, and chains data.

Source code in lahuta/core/_loaders.py
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
class BaseLoader(ABC):
    """Abstract base class providing a blueprint for loading and handling biological structure data.

    `BaseLoader` forms the basis for other loader classes in the lahuta library, dedicated
    to handling different formats of biological structural data. It includes the initialization
    of atomic, residue, and chain data as well as their transformation into different formats.

    The class takes a file path as an input and reads the biological structure data. It offers
    various properties to access the data, as well as methods to convert the loaded data into
    different output formats. It also provides an iterator to iterate over atoms, residues, and chains.

    Args:
        file_path (str): The file path from which to load the biological structure data.

    Attributes:
        file_path (str): The file path from which the biological structure data is loaded.
        _chains (Optional[Chains]): An instance of Chains class containing chain data.
        _residues (Optional[Residues]): An instance of Residues class containing residue data.
        _atoms (Optional[Atoms]): An instance of Atoms class containing atom data.
        structure: A structure object containing biological structure data, currently initialized to None.
        arc (Optional[ARC]): An instance of ARC class providing combined access to atoms, residues, and chains data.
    """

    def __init__(self, file_path: str):
        self.file_path = file_path
        self._chains = None
        self._residues = None
        self._atoms = None

        self.structure = None
        self.arc: Optional[ARC] = None

    @property
    def n_atoms(self) -> int:
        """The number of atoms in the loaded biological structure data."""
        return len(self.atoms)

    @property
    def chains(self) -> Chains:
        """The chains in the loaded biological structure data."""
        if self.arc is None:
            raise ValueError("arc has not been initialized")
        return self.arc.chains

    @property
    def residues(self) -> Residues:
        """The residues in the loaded biological structure data."""
        if self.arc is None:
            raise ValueError("arc has not been initialized")
        return self.arc.residues

    @property
    def atoms(self) -> Atoms:
        """The atoms in the loaded biological structure data."""
        if self.arc is None:
            raise ValueError("arc has not been initialized")
        return self.arc.atoms

    @overload
    def to(self, fmt: Literal["mda"]) -> AtomGroupType:
        ...

    @overload
    def to(self, fmt: Literal["mol"]) -> MolType:
        ...

    def to(self, fmt: Literal["mol", "mda"]) -> MolType | AtomGroupType:
        """Convert the loaded biological structure data into a different format.

        Args:
            fmt (str): The format to which the biological structure data should be converted.

        Returns:
            MolType | AtomGroupType: The biological structure data in the specified format.

        Raises:
            ValueError: If the specified format is not supported.
        """
        method_str = f"to_{fmt}"
        if hasattr(self, method_str):
            return getattr(self, method_str)()  # type: ignore

        raise ValueError(f"Object type {fmt} is not supported")

    @abstractmethod
    def to_mda(self) -> AtomGroupType:
        """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""

    @abstractmethod
    def to_mol(self) -> MolType:
        """Convert the loaded biological structure data into an OpenBabel Mol object."""

    def __iter__(self) -> Iterator[Atoms | Residues | Chains]:
        """Iterate over atoms, residues, and chains."""
        yield self.atoms
        yield self.residues
        yield self.chains

n_atoms property

n_atoms

The number of atoms in the loaded biological structure data.

chains property

chains

The chains in the loaded biological structure data.

residues property

residues

The residues in the loaded biological structure data.

atoms property

atoms

The atoms in the loaded biological structure data.

to

to(fmt)

Convert the loaded biological structure data into a different format.

Parameters:

Name Type Description Default
fmt str

The format to which the biological structure data should be converted.

required

Returns:

Type Description
MolType | AtomGroupType

MolType | AtomGroupType: The biological structure data in the specified format.

Raises:

Type Description
ValueError

If the specified format is not supported.

Source code in lahuta/core/_loaders.py
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
def to(self, fmt: Literal["mol", "mda"]) -> MolType | AtomGroupType:
    """Convert the loaded biological structure data into a different format.

    Args:
        fmt (str): The format to which the biological structure data should be converted.

    Returns:
        MolType | AtomGroupType: The biological structure data in the specified format.

    Raises:
        ValueError: If the specified format is not supported.
    """
    method_str = f"to_{fmt}"
    if hasattr(self, method_str):
        return getattr(self, method_str)()  # type: ignore

    raise ValueError(f"Object type {fmt} is not supported")

to_mda abstractmethod

to_mda()

Convert the loaded biological structure data into an MDAnalysis AtomGroup object.

Source code in lahuta/core/_loaders.py
121
122
123
@abstractmethod
def to_mda(self) -> AtomGroupType:
    """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""

to_mol abstractmethod

to_mol()

Convert the loaded biological structure data into an OpenBabel Mol object.

Source code in lahuta/core/_loaders.py
125
126
127
@abstractmethod
def to_mol(self) -> MolType:
    """Convert the loaded biological structure data into an OpenBabel Mol object."""

GemmiLoader

Bases: BaseLoader

Class for loading biological structure data using the gemmi library.

GemmiLoader is a subclass of BaseLoader and provides the implementation to read, convert, and manage biological structure data specifically from the gemmi library. The class takes a file path as an input and reads the biological structure data either in PDB format or in the format specified in the file. It also offers methods to convert the loaded data into different output formats.

Parameters:

Name Type Description Default
file_path str

The file path from which to load the biological structure data.

required
is_pdb bool

A flag to indicate whether the file is in PDB format. Defaults to False.

False

Attributes:

Name Type Description
structure

A gemmi structure object containing biological structure data.

arc Optional[ARC]

An instance of ARC class providing combined access to atoms, residues, and chains data.

ag AtomGroupType

An MDAnalysis AtomGroup object created from the loaded data.

Source code in lahuta/core/_loaders.py
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
class GemmiLoader(BaseLoader):
    """Class for loading biological structure data using the gemmi library.

    `GemmiLoader` is a subclass of `BaseLoader` and provides the implementation to read,
    convert, and manage biological structure data specifically from the gemmi library.
    The class takes a file path as an input and reads the biological structure data either
    in PDB format or in the format specified in the file. It also offers methods to convert
    the loaded data into different output formats.

    Args:
        file_path (str): The file path from which to load the biological structure data.
        is_pdb (bool): A flag to indicate whether the file is in PDB format. Defaults to False.

    Attributes:
        structure: A gemmi structure object containing biological structure data.
        arc (Optional[ARC]): An instance of ARC class providing combined access to atoms,
                             residues, and chains data.
        ag (AtomGroupType): An MDAnalysis AtomGroup object created from the loaded data.
    """

    def __init__(self, file_path: str, is_pdb: bool = False):
        super().__init__(file_path)
        if is_pdb:
            structure: Any = gemmi.read_pdb(self.file_path)  # type: ignore
            block: Any = structure.make_mmcif_document().sole_block()
        else:
            block: Any = gemmi.cif.read(self.file_path).sole_block()  # type: ignore
            structure: Any = gemmi.make_structure_from_block(block)  # type: ignore

        self.structure = structure
        atom_site_data: dict[str, Any] = block.get_mmcif_category("_atom_site.")

        self.arc = ARC(self, atom_site_data)
        self.arc.atoms.coordinates = self.extract_positions(atom_site_data)

        self.ag: AtomGroupType = self._create_mda()

    def extract_positions(self, atom_site_data: dict[str, Any]) -> NDArray[np.float32]:
        """Extract the coordinates from the atom_site data.

        Args:
            atom_site_data (dict[str, Any]): The atom_site data from the gemmi structure object.

        Returns:
            NDArray[np.float32]: The coordinates of the atoms in the structure.

        Raises:
            ValueError: If the atom_site data does not contain the required information.
        """
        coords_array: NDArray[np.float32] = np.zeros((self.n_atoms, 3), dtype=np.float32)
        coords_array[:, 0] = atom_site_data.get("Cartn_x")
        coords_array[:, 1] = atom_site_data.get("Cartn_y")
        coords_array[:, 2] = atom_site_data.get("Cartn_z")

        return coords_array

    def _create_mda(self) -> AtomGroupType:
        """Create an MDAnalysis AtomGroup object from the loaded data.

        Returns:
            AtomGroupType: An MDAnalysis AtomGroup object.
        """
        # Create a structured array to ensure unique values for each combination of resname, resid, and chain_id
        assert self.arc is not None, "arc has not been initialized"
        struct_arr = np.rec.fromarrays(  # type: ignore
            [self.arc.residues.resnames, self.arc.residues.resids, self.arc.chains.ids],
            names=str("resnames, resids, chain_ids"),
        )

        # Use factorize to get the labels and unique values
        resindices, uniques = pd.factorize(struct_arr)

        resnames, resids, chain_ids = (uniques["resnames"], uniques["resids"], uniques["chain_ids"])

        # Create a new Universe
        uv: UniverseType = mda.Universe.empty(
            n_atoms=self.arc.atoms.ids.size,
            n_residues=uniques.size,
            n_segments=chain_ids.size,
            atom_resindex=resindices,
            residue_segindex=chain_ids,
            trajectory=True,
        )

        # Add topology attributes
        uv.add_TopologyAttr("names", self.arc.atoms.names)
        uv.add_TopologyAttr("type", self.arc.atoms.types)
        uv.add_TopologyAttr("elements", self.arc.atoms.elements)
        uv.add_TopologyAttr("resnames", resnames)
        uv.add_TopologyAttr("resids", resids)
        uv.add_TopologyAttr("segids", chain_ids)
        uv.add_TopologyAttr("ids", self.arc.atoms.ids)

        uv.atoms.positions = self.arc.atoms.coordinates

        return uv.atoms

    def to_mda(self) -> AtomGroupType:
        """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""
        return self.ag

    def to_mol(self) -> MolType:
        """Convert the loaded biological structure data into an OpenBabel Mol object."""
        assert self.arc is not None, "arc has not been initialized"
        assert self.structure is not None, "structure should not be None"
        obmol = OBMol()
        obmol.create_mol(
            self.arc,
            self.structure.connections,
        )
        assert obmol.mol is not None
        return obmol.mol

extract_positions

extract_positions(atom_site_data)

Extract the coordinates from the atom_site data.

Parameters:

Name Type Description Default
atom_site_data dict[str, Any]

The atom_site data from the gemmi structure object.

required

Returns:

Type Description
NDArray[float32]

NDArray[np.float32]: The coordinates of the atoms in the structure.

Raises:

Type Description
ValueError

If the atom_site data does not contain the required information.

Source code in lahuta/core/_loaders.py
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
def extract_positions(self, atom_site_data: dict[str, Any]) -> NDArray[np.float32]:
    """Extract the coordinates from the atom_site data.

    Args:
        atom_site_data (dict[str, Any]): The atom_site data from the gemmi structure object.

    Returns:
        NDArray[np.float32]: The coordinates of the atoms in the structure.

    Raises:
        ValueError: If the atom_site data does not contain the required information.
    """
    coords_array: NDArray[np.float32] = np.zeros((self.n_atoms, 3), dtype=np.float32)
    coords_array[:, 0] = atom_site_data.get("Cartn_x")
    coords_array[:, 1] = atom_site_data.get("Cartn_y")
    coords_array[:, 2] = atom_site_data.get("Cartn_z")

    return coords_array

to_mda

to_mda()

Convert the loaded biological structure data into an MDAnalysis AtomGroup object.

Source code in lahuta/core/_loaders.py
233
234
235
def to_mda(self) -> AtomGroupType:
    """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""
    return self.ag

to_mol

to_mol()

Convert the loaded biological structure data into an OpenBabel Mol object.

Source code in lahuta/core/_loaders.py
237
238
239
240
241
242
243
244
245
246
247
def to_mol(self) -> MolType:
    """Convert the loaded biological structure data into an OpenBabel Mol object."""
    assert self.arc is not None, "arc has not been initialized"
    assert self.structure is not None, "structure should not be None"
    obmol = OBMol()
    obmol.create_mol(
        self.arc,
        self.structure.connections,
    )
    assert obmol.mol is not None
    return obmol.mol

TopologyLoader

Bases: BaseLoader

Class for loading and managing biological structure data using the MDAnalysis library.

TopologyLoader is a subclass of BaseLoader and provides the implementation to read, convert, and manage biological structure data specifically from the MDAnalysis library. This class accepts one or more file paths and loads the structure data into an MDAnalysis Universe. It also offers methods to convert the loaded data into different output formats.

Parameters:

Name Type Description Default
*paths str

One or more file paths from which to load the biological structure data.

()

Attributes:

Name Type Description
ag AtomGroupType

An MDAnalysis AtomGroup object created from the loaded data.

arc Optional[ARC]

An instance of ARC class providing combined access to atoms, residues, and chains data.

Source code in lahuta/core/_loaders.py
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
class TopologyLoader(BaseLoader):
    """Class for loading and managing biological structure data using the MDAnalysis library.

    `TopologyLoader` is a subclass of `BaseLoader` and provides the implementation to read,
    convert, and manage biological structure data specifically from the MDAnalysis library.
    This class accepts one or more file paths and loads the structure data into an
    MDAnalysis Universe. It also offers methods to convert the loaded data into different output formats.

    Args:
        *paths (str): One or more file paths from which to load the biological structure data.

    Attributes:
        ag (AtomGroupType): An MDAnalysis AtomGroup object created from the loaded data.
        arc (Optional[ARC]): An instance of ARC class providing combined access to atoms,
                             residues, and chains data.

    """

    def __init__(self, *paths: str):
        file_path: str = paths[0]
        super().__init__(file_path)
        universe = mda.Universe(self.file_path)
        self.ag: AtomGroupType = universe.atoms
        assert self.ag is not None
        if len(paths) > 1:
            self.ag.universe.load_new(paths[1:], format=None, in_memory=False)

        self.arc = ARC(self, self.ag)  # positions are set when using mda.Universe

    def to_mda(self) -> AtomGroupType:
        """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""
        return self.ag

    def to_mol(self) -> MolType:
        """Convert the loaded biological structure data into an OpenBabel Mol object."""
        assert self.arc is not None, "arc has not been initialized"
        obmol = OBMol()
        obmol.create_mol(
            self.arc,
            self.structure,
        )
        assert obmol.mol is not None
        return obmol.mol

    @classmethod
    def from_mda(cls, ag: AtomGroupType) -> "TopologyLoader":
        """Create a new instance of the TopologyLoader class from an MDAnalysis AtomGroup object.

        Args:
            ag (AtomGroupType): An MDAnalysis AtomGroup object.

        Returns:
            TopologyLoader: A new instance of the TopologyLoader class with the AtomGroup data copied.
        """
        top_loader = cls.__new__(cls)
        top_loader.ag = ag.copy()
        top_loader.ag._u = ag.universe.copy()  # noqa: SLF001
        top_loader.structure = None

        top_loader.arc = ARC(top_loader, top_loader.ag)

        return top_loader

to_mda

to_mda()

Convert the loaded biological structure data into an MDAnalysis AtomGroup object.

Source code in lahuta/core/_loaders.py
279
280
281
def to_mda(self) -> AtomGroupType:
    """Convert the loaded biological structure data into an MDAnalysis AtomGroup object."""
    return self.ag

to_mol

to_mol()

Convert the loaded biological structure data into an OpenBabel Mol object.

Source code in lahuta/core/_loaders.py
283
284
285
286
287
288
289
290
291
292
def to_mol(self) -> MolType:
    """Convert the loaded biological structure data into an OpenBabel Mol object."""
    assert self.arc is not None, "arc has not been initialized"
    obmol = OBMol()
    obmol.create_mol(
        self.arc,
        self.structure,
    )
    assert obmol.mol is not None
    return obmol.mol

from_mda classmethod

from_mda(ag)

Create a new instance of the TopologyLoader class from an MDAnalysis AtomGroup object.

Parameters:

Name Type Description Default
ag AtomGroupType

An MDAnalysis AtomGroup object.

required

Returns:

Name Type Description
TopologyLoader TopologyLoader

A new instance of the TopologyLoader class with the AtomGroup data copied.

Source code in lahuta/core/_loaders.py
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
@classmethod
def from_mda(cls, ag: AtomGroupType) -> "TopologyLoader":
    """Create a new instance of the TopologyLoader class from an MDAnalysis AtomGroup object.

    Args:
        ag (AtomGroupType): An MDAnalysis AtomGroup object.

    Returns:
        TopologyLoader: A new instance of the TopologyLoader class with the AtomGroup data copied.
    """
    top_loader = cls.__new__(cls)
    top_loader.ag = ag.copy()
    top_loader.ag._u = ag.universe.copy()  # noqa: SLF001
    top_loader.structure = None

    top_loader.arc = ARC(top_loader, top_loader.ag)

    return top_loader